Date: Wed, 27 Sep 2017 21:44:54 -0300
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Paolo Bonzini
Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo@redhat.com,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Thomas Gleixner
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
Message-ID: <20170928004452.GA30040@amt.cnet>
References: <20170922123305.GB29608@amt.cnet>
 <20170922125556.cyzybj6c7jqypbmo@hirez.programming.kicks-ass.net>
 <951aaa3f-b20d-6f67-9454-f193f4445fc7@redhat.com>
 <20170923134114.qdfdegrd6afqrkut@hirez.programming.kicks-ass.net>
 <855950672.7912001.1506258344142.JavaMail.zimbra@redhat.com>
 <20170925025751.GB30813@amt.cnet>
 <20170925091316.bnwpiscs2bvpdxk5@hirez.programming.kicks-ass.net>
 <00ff8cbf-4e41-a950-568c-3bd95e155d4b@redhat.com>
 <20170926224925.GA9119@amt.cnet>
 <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>
In-Reply-To: <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>

On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote:
> On 27/09/2017 00:49, Marcelo Tosatti wrote:
> > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> >> On 25/09/2017 11:13, Peter Zijlstra wrote:
> >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >>>> I think you are missing the following point:
> >>>>
> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section,
> >>>> otherwise it can't."
> >>
> >> Who says that?  Certainly a driver can dedicate a single VCPU to
> >> periodic polling of the device, in such a way that the polling does not
> >> require a spinlock.
> >
> > This sequence:
> >
> > VCPU-0                            VCPU-1 (running realtime workload)
> >
> > takes spinlock A
> > scheduled out
> >                                   spinlock(A) (busy spins until
> >                                   VCPU-0 is scheduled back in)
> > scheduled in
> > finishes execution of
> > code under protected section
> > releases spinlock(A)
> >
> >                                   takes spinlock(A)
>
> Yes, but then you have
>
>     busy waits for flag to be set
>     polls device
>     scheduled out
>     device receives data
>     ...
>     scheduled in
>     set flag
>
> or
>
>     check for work to do
>     scheduled out
>     submit work
>     busy waits for work to complete
>     scheduled in
>     do work
>
> None of which have anything to do with spinlocks.  They're just
> different ways to get priority inversion, and this...
>
> >>>> So this point of "everything needs to be RT and the priorities must be
> >>>> designed carefully", is this:
> >>>>
> >>>> WHEN in spinlock protected section (more specifically, when
> >>>> spinlock protected section _shared with realtime vcpus_),
> >>>>
> >>>> priority of vcpu0 > priority of emulator thread
> >>>>
> >>>> OTHERWISE
> >>>>
> >>>> priority of vcpu0 < priority of emulator thread.
> >>
> >> This is _not_ designed carefully, this is messy.
> >
> > This is very precise to me.  What is "messy" about it?  (its clearly
> > defined).
>
> ... it's not how you design RT systems to avoid priority inversion.
> It's just solving an instance of the issue.
>
> >> The emulator thread can interrupt the VCPU thread, so it has to be at
> >> higher RT priority (+ priority inheritance of mutexes).
> >
> > It can only do that _when_ the VCPU thread is not running a critical
> > section which a higher priority task depends on.
>
> All VCPUs must have the same priority.  There's no such thing as a
> housekeeping VCPU.
>
> There could be one sacrificial VCPU that you place on the same physical
> CPU as the emulator thread, but that's it.  The whole system must run
> smoothly even if you place those on the same physical CPU, without any
> PV hacks.
>
> > So when you say "The emulator thread can interrupt the VCPU thread",
> > you're saying that it has to be modified to interrupt for a maximum
> > amount of time (say 15us).
> >
> > Is that what you are suggesting?
>
> This is correct.  But I think you are missing a fundamental point, that
> is not specific to virt.  If a thread 1 can trigger an event in thread 2,
> and thread 2 runs at priority N, thread 1 must be placed at priority >N.
>
> In this case, the emulator thread can signal I/O completion to the VCPU:
> it doesn't matter if the VCPU is polling or using interrupts, the
> emulator thread must be placed at higher priority than the VCPU.  This
> is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
> was me who suggested moving VCPU0 and the emulator thread to
> SCHED_NORMAL, but that was just a bandaid until the real fix was done
> (which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
>
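
For concreteness, "set up SCHED_FIFO priorities correctly and use PI
mutexes" translates into roughly the sketch below, in plain pthreads
terms.  This is only an illustration, not code from QEMU or from this
patch series; the thread roles and the priority values 2 and 3 are made
up, and in a real setup the policy and priority would be applied to the
existing QEMU vCPU and emulator threads rather than to new threads.

/*
 * Illustrative sketch only (not from QEMU or this patch series): a PI
 * mutex shared by two SCHED_FIFO threads, with the thread standing in
 * for the emulator placed at the higher priority.  Priority values 2
 * and 3 are made up.
 * Build: gcc -O2 sketch.c -lpthread   (SCHED_FIFO needs CAP_SYS_NICE).
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t lock;            /* lock shared by both threads */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* ... critical section, possibly contended by the other thread ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Create a thread with an explicit SCHED_FIFO priority. */
static int start_fifo_thread(pthread_t *t, int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    return pthread_create(t, &attr, worker, NULL);
}

int main(void)
{
    pthread_mutexattr_t ma;
    pthread_t vcpu, emulator;
    int err;

    /*
     * PI mutex: whoever holds it is boosted to the priority of the
     * highest-priority waiter, so a SCHED_FIFO waiter cannot be stuck
     * behind a preempted lower-priority lock holder.
     */
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &ma);

    /*
     * "If thread 1 can trigger an event in thread 2, thread 1 must be
     * placed at a higher priority": the stand-in for the emulator
     * thread (prio 3) is placed above the stand-in for the vCPU it
     * would signal (prio 2).
     */
    err = start_fifo_thread(&vcpu, 2);
    if (!err)
        err = start_fifo_thread(&emulator, 3);
    if (err) {
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        return 1;
    }

    pthread_join(vcpu, NULL);
    pthread_join(emulator, NULL);
    return 0;
}

The PI part targets exactly the scheduled-out-while-holding-the-lock
sequence above: with a plain mutex, a medium-priority thread can keep a
low-priority lock holder off the CPU while a high-priority waiter is
stuck on the lock, whereas with PTHREAD_PRIO_INHERIT the holder is
temporarily boosted to the waiter's priority until it releases the lock.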
>
> In fact, this is not specific to the emulator thread.  It applies just
> as well to vhost threads, to QEMU iothreads, even to the KVM PIT
> emulation thread if the guest were using it.
>
> > Paolo, you don't control how many interruptions of the emulator thread
> > happen per second.
>
> Indeed these non-VCPU threads should indeed run rarely and for a small
> amount of time only to achieve bounded latencies in the VCPUs.  But if
> they don't, those are simply bugs and we fix them.  In fact, sources of
> frequent interruptions have all been fixed or moved outside the main
> thread; for example, disks can use separate iothreads.  Configuration
> problems are also bugs, just in a different place.
>
> For a very basic VM that I've just tried (whatever QEMU does by default,
> only tweak was not using the QEMU GUI), there is exactly one
> interruption per second when the VM is idle.  That interruption in turn
> is caused by udev periodically probing the guest CDROM (!).  So that's a
> configuration problem: remove the CDROM, and it does one interruption
> every 10-30 seconds, again all of them guest triggered.
>
> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> design, it's just that someone is doing something stupid.  It could be
> the guest (e.g. unnecessary devices or daemons as in the example above),
> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> just to increment the clock), or the management (e.g. polling "is the VM
> running" 50 times per second).  But it can and must be fixed.

No, I mean you can run anything on VCPU-0 (it is valid to do that), and
that "anything" can generate 1, 1000, or 10,000 interrupts per second,
all of which are valid things to do.

"I can't run a kernel compilation on VCPU-0 because that will impact
latency on the realtime VCPU-1" is not acceptable.

> The design should be the basic one: attribute the right priority to each
> thread and things just work.
>
> Paolo
>
> > So if you let the emulator thread interrupt the
> > emulator thread at all times, without some kind of bounding
> > of these interruptions per time unit, you have a similar
> > problem as (*) (where the realtime task is scheduled).
> >
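
For reference, the conditional scheme being argued about above ("vcpu0
outranks the emulator thread only while inside a lock section shared
with a realtime vcpu") amounts to something like the following on the
guest side.  This is a rough sketch only: the actual KVM_HC_RT_PRIO
patch is not quoted in this message, so the hypercall number and the
1/0 boost/unboost argument convention below are assumptions made purely
for illustration.

/*
 * Guest-side sketch of the idea under discussion, NOT the actual patch:
 * vcpu0 asks the host to boost its priority above the emulator thread
 * only while it holds a lock that a realtime vCPU may spin on.  The
 * hypercall number and the 1/0 argument convention are assumed here.
 */
#include <linux/spinlock.h>
#include <linux/kvm_para.h>

#define KVM_HC_RT_PRIO          12      /* assumed value, not from the patch */

static DEFINE_SPINLOCK(shared_lock);    /* lock shared with a realtime vCPU */

static void vcpu0_shared_section(void)
{
        /* ask the host not to let the emulator thread preempt us from here */
        kvm_hypercall1(KVM_HC_RT_PRIO, 1);

        spin_lock(&shared_lock);
        /* ... work that the realtime vCPU may be busy-waiting on ... */
        spin_unlock(&shared_lock);

        /* back to normal: the emulator thread may preempt vcpu0 again */
        kvm_hypercall1(KVM_HC_RT_PRIO, 0);
}

Paolo's objection in this mail is that the same inversion shows up for
flags, polling and submitted work where no spinlock is involved at all,
so a boost tied to lock sections only covers one instance of the
problem; the fixed thread-priority arrangement sketched earlier is the
alternative he argues for.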