From: Julien Grall <julien.grall.oss@gmail.com>
To: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	 George Dunlap <george.dunlap@citrix.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	 Meng Xu <mengxu@cis.upenn.edu>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	 Ian Jackson <iwj@xenproject.org>,
	Jan Beulich <jbeulich@suse.com>,
	 Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wl@xen.org>
Subject: Re: [RFC PATCH 00/10] Preemption in hypervisor (ARM only)
Date: Wed, 24 Feb 2021 22:31:26 +0000
Message-ID: <CAJ=z9a0v37rc_B7xVdQECAYd52PJ0UajGzvX1DYP56Q2RXQ2Tw@mail.gmail.com>
In-Reply-To: <87o8g99oal.fsf@epam.com>

On Wed, 24 Feb 2021 at 20:58, Volodymyr Babchuk
<Volodymyr_Babchuk@epam.com> wrote:
>
>
> Hi Julien,
>
> Julien Grall writes:
>
> > On 23/02/2021 12:06, Volodymyr Babchuk wrote:
> >> Hi Julien,
> >
> > Hi Volodymyr,
> >
> >> Julien Grall writes:
> >>> On 23/02/2021 02:34, Volodymyr Babchuk wrote:
> >>> ... just rescheduling the vCPU. It will also give the guest the
> >>> opportunity to handle interrupts.
> >>>
> >>> If you don't return to the guest, then you risk an RCU sched stall
> >>> on that vCPU (some hypercalls can take a really long time).
> >> Ah yes, you are right. I only wish that the hypervisor saved the
> >> context of the hypercall on its side...
> >> I have the example of OP-TEE before my eyes. They have a special
> >> return code, "task was interrupted", and they use a separate call,
> >> "continue execution of interrupted task", which takes an opaque
> >> context handle as a parameter. With this approach the state of an
> >> interrupted call never leaks to the rest of the system.
> >
> > Feel free to suggest a new approach for the hypercalls.
> >
>
> I believe I suggested it right above. There are some corner cases that
> should be addressed, of course.

If we wanted a clean break, then possibly yes. But I meant an approach
that doesn't break all the existing users and doesn't put Xen at risk.

I don't believe your approach fulfills that.
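
To make the contrast concrete, here is a rough guest-side sketch of the
model you describe. The names (hyp_call, hyp_continue, HYP_INTERRUPTED,
HYPOP_do_work) are purely illustrative, not the real OP-TEE or Xen ABI.
Today's Xen continuations are transparent instead: the hypervisor rewinds
the guest PC so the same hypercall instruction is re-executed with
updated arguments.

    /* Hypothetical "opaque handle" flow -- illustrative names only. */
    long issue_work(void *arg)
    {
        uint64_t handle;
        long rc = hyp_call(HYPOP_do_work, arg, &handle);

        while ( rc == HYP_INTERRUPTED )   /* "task was interrupted" */
            rc = hyp_continue(handle);    /* state stays inside Xen */

        return rc;
    }

Existing guests know nothing about HYP_INTERRUPTED or hyp_continue(),
which is part of why I don't see how to retrofit this without breaking
them.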

>
> >>>
> >>>> This approach itself has obvious
> >>>> problems: the code that executes a hypercall is responsible for
> >>>> preemption, preemption checks are infrequent (because they are costly
> >>>> by themselves), hypercall execution state is stored in a
> >>>> guest-controlled area, and we rely on the guest's good will to
> >>>> continue the hypercall.
> >>>
> >>> Why is it a problem to rely on the guest's good will? The hypercalls
> >>> should be preempted at a boundary that is safe to continue from.
> >> Yes, and it imposes restrictions on how to write a hypercall
> >> handler.
> >> In other words, there are many more places in hypercall handler code
> >> where it can be preempted than where a hypercall continuation can be
> >> used. For example, you can preempt a hypercall that holds a mutex, but
> >> of course you can't create a continuation point in such a place.
> >
> > I disagree: you can create a continuation point in such a place,
> > although it will be more complex because you have to make sure you
> > break the work at a restartable point.
>
> Maybe there is some misunderstanding. You can't create a hypercall
> continuation point in a place where you are holding a mutex, because
> there is absolutely no guarantee that the guest will restart the
> hypercall.

I don't think we are disagreeing here. My point is that you should rarely
need to hold a mutex for a long period, so you can break your work into
smaller chunks. In that case, you can use a hypercall continuation.
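
For reference, the pattern already used in the tree looks roughly like
this (a simplified sketch: process_one_item() and do_foo_op() are
placeholders, and the argument encoding only mimics what do_memory_op()
does):

    static long do_foo_op(struct domain *d, unsigned long cmd,
                          XEN_GUEST_HANDLE_PARAM(void) arg,
                          unsigned long start, unsigned long nr_items)
    {
        unsigned long i;

        for ( i = start; i < nr_items; i++ )
        {
            /* One bounded, restartable unit of work (placeholder). */
            process_one_item(d, i);

            /* Softirqs pending or events to deliver on this pCPU? */
            if ( i + 1 < nr_items && hypercall_preempt_check() )
                /*
                 * Pack the resume index into cmd and make the guest
                 * re-issue the hypercall starting from item i + 1.
                 */
                return hypercall_create_continuation(
                    __HYPERVISOR_memory_op, "lh",
                    cmd | ((i + 1) << MEMOP_EXTENT_SHIFT), arg);
        }

        return 0;
    }

Any lock would of course have to be dropped before creating the
continuation, which is why each chunk needs to be self-contained.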

>
> But you can preempt a vCPU while it holds a mutex, because Xen owns the
> scheduler and it can guarantee that the vCPU will eventually be
> scheduled again to continue the work and release the mutex.

The problem is the "eventually". If you are accounting the time spent
in the hypervisor to vCPU A, then there is a possibility that it has
exhausted its time slice. In that case, vCPU A may be sleeping for a
while with the mutex held.

If another vCPU B needs the mutex, it will either have to wait,
potentially for a long time, or we need to force vCPU A to run on
borrowed time.
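
In code, the problematic shape would be something like this. This is
purely illustrative: Xen has no sleeping mutex today, so xen_mutex_*
stands for whatever primitive the series would introduce, and
d->big_lock is equally made up.

    /* Illustrative only: xen_mutex_* and d->big_lock are hypothetical. */
    static void slow_op(struct domain *d)
    {
        xen_mutex_lock(&d->big_lock);

        /*
         * A preemption point in here can put vCPU A to sleep because its
         * credits ran out, while it still owns d->big_lock...
         */
        do_long_running_work(d);          /* placeholder */

        /*
         * ...so any vCPU B that calls xen_mutex_lock(&d->big_lock) now
         * waits until the scheduler decides to run vCPU A again.
         */
        xen_mutex_unlock(&d->big_lock);
    }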

>
> > I would also like to point out that preemption also has some
> > drawbacks. With RT in mind, you have to deal with priority inversion
> > (e.g. a lower-priority vCPU holds a mutex that is required by a
> > higher-priority one).
>
> Of course. This is not as simple as "just call the scheduler when we
> want to".

Your e-mail made it sound like it was easy to add preemption to Xen. ;)

>
> > Outside of RT, you have to be careful where mutexes are held. In your
> > earlier answer, you suggested holding a mutex for the memory
> > allocation. If you do that, then it means domain A can block
> > allocation for domain B while it holds the mutex.
>
> As long as we do not exit to EL1 with a mutex held, domain A can't
> block anything. Of course, we have to deal with priority inversion, but
> a malicious domain can't cause a DoS.

It is not really a priority inversion problem outside of RT, because
all the tasks have the same priority. It is more of a time accounting
problem, because each vCPU may have a different number of credits.

> >>> I am really surprised that these are the only changes necessary in
> >>> Xen. For a first approach, we may want to be conservative about where
> >>> preemption happens, as I am not convinced that all the places are
> >>> safe to preempt.
> >>>
> >> Well, I can't say that I ran extensive tests, but I played with this
> >> for some time and it seemed quite stable. Of course, I had some
> >> problems with RTDS...
> >> As I see it, Xen already supports SMP, so all places where races are
> >> possible should already be covered by spinlocks or taken into account
> >> by some other means.
> > That's correct for shared resources. I am more worried about any
> > hypercalls that are expected to run more or less continuously (i.e.
> > not taking interrupts into account) on the same pCPU.
>
> Are there many such hypercalls? They can disable preemption if they
> really need to run on the same pCPU. As I understand it, they should be
> relatively fast, because they can't create continuations anyway.

Well, I have never tried to make Xen preemptible... My comment is based
on the fact that the use of preempt_{enable, disable}() was mostly done
on a best-effort basis.

The usual suspects are anything using this_cpu() or interacting with
the per-CPU HW registers (see the sketch after the list below).

From a quick look, here are a few things (I only looked at Arm):
  * map_domain_page(), in particular on arm32, because it uses
per-CPU page-tables
  * guest_atomics_*, as this uses this_cpu()
  * virt_to_mfn(), in particular the failure path
  * Incorrect (or missing) use of RCU locking (hopefully Juergen's
recent work on RCU mitigates the risk)
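
As a sketch of the this_cpu() hazard: foo_count is a made-up per-CPU
variable, and the preempt_disable()/preempt_enable() pair is what your
series would have to make meaningful (today it only feeds assertions).

    static DEFINE_PER_CPU(unsigned int, foo_count);

    /*
     * Without the preempt_disable()/preempt_enable() pair, a preemption
     * between reading the per-CPU address and writing through it could
     * reschedule the vCPU on another pCPU, so the update would land on
     * the wrong CPU's counter or race with that CPU's own update.
     */
    preempt_disable();
    this_cpu(foo_count)++;
    preempt_enable();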

I can provide guidance, but you will have to go through the code and
check what's happening.

Cheers,

