Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	mingo@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
Date: Wed, 27 Sep 2017 21:44:54 -0300	[thread overview]
Message-ID: <20170928004452.GA30040@amt.cnet> (raw)
In-Reply-To: <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>

On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote:
> On 27/09/2017 00:49, Marcelo Tosatti wrote:
> > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> >> On 25/09/2017 11:13, Peter Zijlstra wrote:
> >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >>>> I think you are missing the following point:
> >>>>
> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> >>>> otherwise it can't."
> >>
> >> Who says that?  Certainly a driver can dedicate a single VCPU to
> >> periodic polling of the device, in such a way that the polling does not
> >> require a spinlock.
> > 
> > This sequence:
> > 
> > 
> > VCPU-0					VCPU-1 (running realtime workload)
> > 
> > takes spinlock A
> > scheduled out				
> > 					spinlock(A) (busy spins until
> > 						     VCPU-0 is scheduled
> > 						     back in)
> > scheduled in
> > finishes execution of 
> > code under protected section
> > releases spinlock(A)			
> > 
> > 					takes spinlock(A)
> 
> Yes, but then you have
> 
>                                         busy waits for flag to be set
>   polls device
>   scheduled out
>   device receives data
>                                         ...
>   scheduled in
>   set flag
> 
> or
> 
>   check for work to do
>   scheduled out
>                                         submit work
>                                         busy waits for work to complete
>   scheduled in
>   do work
> 
> None of which have anything to do with spinlocks.  They're just
> different ways to get priority inversion, and this...
> 
> >>>> So this point of "everything needs to be RT and the priorities must be
> >>>> designed carefully", is this: 
> >>>>
> >>>> 	WHEN in spinlock protected section (more specifically, when 
> >>>> 	spinlock protected section _shared with realtime vcpus_),
> >>>>
> >>>> 	priority of vcpu0 > priority of emulator thread
> >>>>
> >>>> 	OTHERWISE
> >>>>
> >>>> 	priority of vcpu0 < priority of emulator thread.
> >>
> >> This is _not_ designed carefully, this is messy.
> > 
> > This is very precise to me. What is "messy" about it? (its clearly
> > defined).
> 
> ... it's not how you design RT systems to avoid priority inversion.
> It's just solving an instance of the issue.
> 
> >> The emulator thread can interrupt the VCPU thread, so it has to be at
> >> higher RT priority (+ priority inheritance of mutexes).  
> > 
> > It can only do that _when_ the VCPU thread is not running a critical
> > section which a higher priority task depends on.
> 
> All VCPUs must have the same priority.  There's no such thing as a
> housekeeping VCPU.
> 
> There could be one sacrificial VCPU that you place on the same physical
> CPU as the emulator thread, but that's it.  The whole system must run
> smoothly even if you place those on the same physical CPU, without any
> PV hacks.
> 
> > So when you say "The emulator thread can interrupt the VCPU thread", 
> > you're saying that it has to be modified to interrupt for a maximum
> > amount of time (say 15us).
> > 
> > Is that what you are suggesting?
> 
> This is correct.  But I think you are missing a fundamental point, that
> is not specific to virt.  If a thread 1 can trigger an event in thread
> 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N.
> 
> In this case, the emulator thread can signal I/O completion to the VCPU:
> it doesn't matter if the VCPU is polling or using interrupts, the
> emulator thread must be placed at higher priority than the VCPU.  This
> is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
> was me who suggested moving VCPU0 and the emulator thread to
> SCHED_NORMAL, but that was just a bandaid until the real fix was done
> (which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
> 
> In fact, this is not specific to the emulator thread.  It applies just
> as well to vhost threads, to QEMU iothreads, even to the KVM PIT
> emulation thread if the guest were using it.
> 
> > Paolo, you don't control how many interruptions of the emulator thread
> > happen per second.
> 
> Indeed these non-VCPU threads should indeed run rarely and for a small
> amount of time only to achieve bounded latencies in the VCPUs.  But if
> they don't, those are simply bugs and we fix them.  In fact, sources of
> frequent interruptions have all been fixed or moved outside the main
> thread; for example, disks can use separate iothreads.  Configuration
> problems are also bugs, just in a different place.

> 
> For a very basic VM that I've just tried (whatever QEMU does by default,
> only tweak was not using the QEMU GUI), there is exactly one
> interruption per second when the VM is idle.  That interruption in turn
> is caused by udev periodically probing the guest CDROM (!).  So that's a
> configuration problem: remove the CDROM, and it does one interruption
> every 10-30 seconds, again all of them guest triggered.
> 
> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> design, it's just that someone is doing something stupid.  It could be
> the guest (e.g. unnecessary devices or daemons as in the example above),
> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> just to increment the clock), or the management (e.g. polling "is the VM
> running" 50 times per second).  But it can and must be fixed.

No, i mean you can run anything in VCPU-0 (it is valid to do that).
And that "anything" can generate 1 interrupt per second, 1000 or 10.000
interrupts per second. Which are all valid things to be done.

"I can't run a kernel compilation on VCPU-0 because that will impact
latency on the realtime VCPU-1" is not acceptable.

> The design should be the basic one: attribute the right priority to each
> thread and things just work.
> 
> Paolo
> 
> > So if you let the emulator thread interrupt the
> > emulator thread at all times, without some kind of bounding 
> > of these interruptions per time unit, you have a similar
> > problem as (*) (where the realtime task is scheduled).
> >