From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
To: Julien Grall <julien.grall.oss@gmail.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	George Dunlap <george.dunlap@citrix.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	Meng Xu <mengxu@cis.upenn.edu>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <iwj@xenproject.org>, Jan Beulich <jbeulich@suse.com>,
	Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wl@xen.org>
Subject: Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)
Date: Wed, 24 Feb 2021 23:58:54 +0000	[thread overview]
Message-ID: <87eeh59fwi.fsf@epam.com> (raw)
In-Reply-To: <CAJ=z9a0v37rc_B7xVdQECAYd52PJ0UajGzvX1DYP56Q2RXQ2Tw@mail.gmail.com>



Julien Grall writes:

> On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
> <Volodymyr_Babchuk@epam.com> wrote:
>>
>>
>> Hi Julien,
>>
>> Julien Grall writes:
>>
>>> On 23/02/2021 12:06, Volodymyr Babchuk wrote:
>>>> Hi Julien,
>>>
>>> Hi Volodymyr,
>>>
>>>> Julien Grall writes:
>>>>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
>>>>> ... just rescheduling the vCPU. It will also give the opportunity for
>>>>> the guest to handle interrupts.
>>>>>
>>>>> If you don't return to the guest, then you risk getting an RCU sched
>>>>> stall on that vCPU (some hypercalls can take really, really long).
>>>> Ah yes, you are right. I only wish that the hypervisor saved the context
>>>> of the hypercall on its side...
>>>> I have the example of OP-TEE before my eyes. They have a special return
>>>> code "task was interrupted" and a separate call "continue execution of
>>>> interrupted task", which takes an opaque context handle as a parameter.
>>>> With this approach, the state of an interrupted call never leaks to the
>>>> rest of the system.
>>>
>>> Feel free to suggest a new approach for the hypercalls.
>>>
>>
>> I believe I suggested it right above. There are some corner cases that
>> should be addressed, of course.
>
> If we wanted a clean break, then possibly yes.  But I meant one that doesn't
> break all the existing users and doesn't put Xen at risk.
>
> I don't believe your approach fulfills it.

Of course, we can't touch any hypercalls that are part of the stable
ABI. But if I got this right, domctls and sysctls are not stable, so one
can change their behavior quite drastically in major releases.
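
To illustrate what I mean, here is a very rough sketch of such an
"interrupted call" interface (all names and the error code are made up;
this is neither the real OP-TEE ABI nor an existing Xen hypercall):

#include <stdint.h>

/* Hypothetical status code and wrappers -- purely illustrative. */
#define HYPERCALL_ERR_INTERRUPTED  (-100)

long hypercall_long_op(unsigned long arg, uint64_t *ctx_handle);
long hypercall_continue_op(uint64_t ctx_handle);

/*
 * Guest side: keep resuming until the hypervisor reports completion.
 * All of the interrupted call's state stays behind ctx_handle, inside
 * the hypervisor, so nothing leaks into guest-controlled memory.
 */
static long do_long_op(unsigned long arg)
{
    uint64_t ctx = 0;
    long rc = hypercall_long_op(arg, &ctx);

    while ( rc == HYPERCALL_ERR_INTERRUPTED )
        rc = hypercall_continue_op(ctx);

    return rc;
}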

>>
>>>>>
>>>>>> This approach itself has obvious
>>>>>> problems: the code that executes the hypercall is responsible for
>>>>>> preemption, preemption checks are infrequent (because they are costly
>>>>>> by themselves), the hypercall execution state is stored in a
>>>>>> guest-controlled area, and we rely on the guest's good will to
>>>>>> continue the hypercall.
>>>>>
>>>>> Why is it a problem to rely on the guest's good will? The hypercalls
>>>>> should be preempted at a boundary that is safe to continue.
>>>> Yes, and it imposes restrictions on how to write a hypercall handler.
>>>> In other words, there are many more places in hypercall handler code
>>>> where it can be preempted than where a hypercall continuation can be
>>>> used. For example, you can preempt a hypercall that holds a mutex, but
>>>> of course you can't create a continuation point in such a place.
>>>
>>> I disagree, you can create a continuation point in such a place. Although
>>> it will be more complex, because you have to make sure you break the
>>> work at a restartable point.
>>
>> Maybe there is some misunderstanding. You can't create a hypercall
>> continuation point in a place where you are holding a mutex, because
>> there is absolutely no guarantee that the guest will restart the
>> hypercall.
>
> I don't think we are disagreeing here. My point is you should rarely
> need to hold a mutex for a long period, so you could break your work
> into smaller chunks. In which case, you can use hypercall continuation.

Let's put it this way: generally you can hold a mutex much longer than
you can hold a spinlock, and nothing catastrophic will happen if you are
preempted while holding a mutex. Better to avoid this, of course.
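
For comparison, the continuation pattern looks roughly like this
(simplified sketch: hypercall_preempt_check() and
hypercall_create_continuation() are the real Xen helpers, but the op
number, the lock and the loop body here are made up, and the format
string is approximate):

/*
 * Break the work into chunks; only create a continuation at a point
 * where no lock is held and progress is fully captured in the arguments.
 */
static long example_op(struct domain *d, unsigned long start,
                       unsigned long end)
{
    unsigned long i;

    for ( i = start; i < end; i++ )
    {
        spin_lock(&d->example_lock);        /* made-up lock */
        /* ... process item i ... */
        spin_unlock(&d->example_lock);

        /* Restart point: nothing held, state reduced to 'i' and 'end'. */
        if ( i + 1 < end && hypercall_preempt_check() )
            return hypercall_create_continuation(__HYPERVISOR_example_op,
                                                 "ll", i + 1, end);
    }

    return 0;
}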

>
>>
>> But you can preempt a vCPU while holding a mutex, because Xen owns the
>> scheduler and it can guarantee that the vCPU will eventually be scheduled
>> again to continue the work and release the mutex.
>
> The problem is the "eventually". If you are accounting the time spent
> in the hypervisor to vCPU A, then there is a possibility that it
> has exhausted its time slice. In which case, your vCPU A may be
> sleeping for a while with a mutex held.
>
> If another vCPU B needs the mutex, it will either have to wait
> potentially for a long time or we need to force vCPU A to run on
> borrowed time.

Yes, of course.

>>
>>> I would also like to point out that preemption also has some drawbacks.
>>> With RT in mind, you have to deal with priority inversion (e.g. a
>>> lower-priority vCPU holds a mutex that is required by a higher-priority
>>> one).
>>
>> Of course. This is not as simple as "just call the scheduler when we want
>> to".
>
> Your e-mail made it sound like it was easy to add preemption in
> Xen. ;)

I'm sorry for that :)
Actually, there is a lot of work to do. It appears to me that "current"
needs to be reworked, preempt_enable/disable needs to be reworked, and
per-cpu variables also need to be reworked. And this is just to ensure
consistency of the already existing code.

And I am not mentioning x86 support there...
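
Just to show the direction I mean for preempt_enable/disable, a rough
illustrative sketch (not the actual code from the series): a per-pCPU
counter, and a preemption point that only reschedules when the counter
is zero.

DEFINE_PER_CPU(atomic_t, preempt_count);

void preempt_disable(void)
{
    atomic_inc(&this_cpu(preempt_count));
    barrier();
}

void preempt_enable(void)
{
    barrier();
    atomic_dec(&this_cpu(preempt_count));
}

/* Checked at candidate preemption points, e.g. on the IRQ return path. */
bool preemption_allowed(void)
{
    return atomic_read(&this_cpu(preempt_count)) == 0;
}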

>>> Outside of RT, you have to be careful where mutexes are held. In your
>>> earlier answer, you suggested holding a mutex for the memory
>>> allocation. If you do that, then it means a domain A can block
>>> allocation for domain B as it holds the mutex.
>>
>> As long as we do not exit to EL1 with a mutex held, domain A can't
>> block anything. Of course, we have to deal with priority inversion, but
>> a malicious domain can't cause a DoS.
>
> It is not really a priority inversion problem outside of RT because
> all the tasks will have the same priority. It is more a time
> accounting problem because each vCPU may have a different number of
> credits.

Speaking of that, RTDS does not use the concept of priority. And neither,
of course, does ARINC653.


>>>>> I am really surprised that this is the only change necessary in
>>>>> Xen. For a first approach, we may want to be conservative about when
>>>>> the preemption happens, as I am not convinced that all the places are
>>>>> safe to preempt.
>>>>>
>>>> Well, I can't say that I ran extensive tests, but I played with this for
>>>> some time and it seemed quite stable. Of course, I had some problems
>>>> with RTDS...
>>>> As I see it, Xen already supports SMP, so all places where races are
>>>> possible should already be covered by spinlocks or taken into account by
>>>> some other means.
>>> That's correct for shared resources. I am more worried about any
>>> hypercalls that are expected to run more or less continuously (e.g. not
>>> taking interrupts into account) on the same pCPU.
>>
>> Are there many such hypercalls? They can disable preemption if they
>> really need to run on the same pCPU. As I understand it, they should be
>> relatively fast, because they can't create continuations anyway.
>
> Well, I never tried to make Xen preemptible... My comment is based on
> the fact that the use of preempt_{enable, disable}() was mostly done on a
> best-effort basis.
>
> The usual suspects are anything using this_cpu() or interacting with
> the per-CPU HW registers.
>
> From a quick look here a few things (only looked at Arm):
>   * map_domain_page() in particular on arm32 because this is using
> per-CPU page-tables
>   * guest_atomics_* as this uses this_cpu()
>   * virt_to_mfn() in particular the failure path
>   * Incorrect (or missing) use of RCU locking. (Hopefully Juergen's
> recent work on RCU mitigates the risk)
>
> I can provide guidance, but you will have to go through the code and
> check what's happening.

Thank you for the list. Of course, I need to go through all the code. I
already had a bunch of problems with per_cpu variables...
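
For example, the per_cpu pattern that breaks looks roughly like this
(hypothetical code, not a real Xen call site; "my_counter" is made up,
while this_cpu() and preempt_disable/enable are the existing helpers):

DEFINE_PER_CPU(unsigned int, my_counter);

void broken_example(void)
{
    unsigned int *cnt = &this_cpu(my_counter);  /* resolved on, say, pCPU0 */

    /* If we are preempted and migrated here, we resume on another pCPU
     * but still hold a pointer into pCPU0's per-cpu area. */
    (*cnt)++;                                   /* updates the wrong pCPU */
}

void fixed_example(void)
{
    preempt_disable();
    this_cpu(my_counter)++;     /* the pCPU cannot change in this window */
    preempt_enable();
}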

-- 
Volodymyr Babchuk at EPAM
