Date: Wed, 27 Sep 2017 21:44:54 -0300
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Paolo Bonzini
Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo@redhat.com,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Thomas Gleixner
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
Message-ID: <20170928004452.GA30040@amt.cnet>
References: <20170922123305.GB29608@amt.cnet>
 <20170922125556.cyzybj6c7jqypbmo@hirez.programming.kicks-ass.net>
 <951aaa3f-b20d-6f67-9454-f193f4445fc7@redhat.com>
 <20170923134114.qdfdegrd6afqrkut@hirez.programming.kicks-ass.net>
 <855950672.7912001.1506258344142.JavaMail.zimbra@redhat.com>
 <20170925025751.GB30813@amt.cnet>
 <20170925091316.bnwpiscs2bvpdxk5@hirez.programming.kicks-ass.net>
 <00ff8cbf-4e41-a950-568c-3bd95e155d4b@redhat.com>
 <20170926224925.GA9119@amt.cnet>
 <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>
In-Reply-To: <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>

On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote:
> On 27/09/2017 00:49, Marcelo Tosatti wrote:
> > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> >> On 25/09/2017 11:13, Peter Zijlstra wrote:
> >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >>>> I think you are missing the following point:
> >>>>
> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section,
> >>>> otherwise it can't."
> >>
> >> Who says that?  Certainly a driver can dedicate a single VCPU to
> >> periodic polling of the device, in such a way that the polling does not
> >> require a spinlock.
> >
> > This sequence:
> >
> > VCPU-0                            VCPU-1 (running realtime workload)
> >
> > takes spinlock A
> > scheduled out
> >                                   spinlock(A) (busy spins until
> >                                   VCPU-0 is scheduled back in)
> > scheduled in
> > finishes execution of
> > code under protected section
> > releases spinlock(A)
> >
> >                                   takes spinlock(A)
>
> Yes, but then you have
>
>     busy waits for flag to be set
>     polls device
>     scheduled out
>     device receives data
>     ...
>     scheduled in
>     set flag
>
> or
>
>     check for work to do
>     scheduled out
>     submit work
>     busy waits for work to complete
>     scheduled in
>     do work
>
> None of which have anything to do with spinlocks.  They're just
> different ways to get priority inversion, and this...
>
> >>>> So this point of "everything needs to be RT and the priorities must be
> >>>> designed carefully", is this:
> >>>>
> >>>> WHEN in spinlock protected section (more specifically, when
> >>>> spinlock protected section _shared with realtime vcpus_),
> >>>>
> >>>> priority of vcpu0 > priority of emulator thread
> >>>>
> >>>> OTHERWISE
> >>>>
> >>>> priority of vcpu0 < priority of emulator thread.
> >>
> >> This is _not_ designed carefully, this is messy.
> >
> > This is very precise to me.  What is "messy" about it?  (its clearly
> > defined).
>
> ... it's not how you design RT systems to avoid priority inversion.
> It's just solving an instance of the issue.
>
> >> The emulator thread can interrupt the VCPU thread, so it has to be at
> >> higher RT priority (+ priority inheritance of mutexes).
> >
> > It can only do that _when_ the VCPU thread is not running a critical
> > section which a higher priority task depends on.
>
> All VCPUs must have the same priority.  There's no such thing as a
> housekeeping VCPU.
>
> There could be one sacrificial VCPU that you place on the same physical
> CPU as the emulator thread, but that's it.  The whole system must run
> smoothly even if you place those on the same physical CPU, without any
> PV hacks.
>
> > So when you say "The emulator thread can interrupt the VCPU thread",
> > you're saying that it has to be modified to interrupt for a maximum
> > amount of time (say 15us).
> >
> > Is that what you are suggesting?
>
> This is correct.  But I think you are missing a fundamental point, that
> is not specific to virt.  If a thread 1 can trigger an event in thread 2,
> and thread 2 runs at priority N, thread 1 must be placed at priority >N.
>
> In this case, the emulator thread can signal I/O completion to the VCPU:
> it doesn't matter if the VCPU is polling or using interrupts, the
> emulator thread must be placed at higher priority than the VCPU.  This
> is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
> was me who suggested moving VCPU0 and the emulator thread to
> SCHED_NORMAL, but that was just a bandaid until the real fix was done
> (which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
>
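
For concreteness, "set up SCHED_FIFO priorities correctly and use PI
mutexes" translates into roughly the sketch below, in plain pthreads
terms.  This is only an illustration, not code from QEMU or from this
patch series; the thread roles and the priority values 2 and 3 are made
up, and in a real setup the policy and priority would be applied to the
existing QEMU vCPU and emulator threads rather than to new threads.

/*
 * Illustrative sketch only (not from QEMU or this patch series): a PI
 * mutex shared by two SCHED_FIFO threads, with the thread standing in
 * for the emulator placed at the higher priority.  Priority values 2
 * and 3 are made up.
 * Build: gcc -O2 sketch.c -lpthread   (SCHED_FIFO needs CAP_SYS_NICE).
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t lock;            /* lock shared by both threads */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* ... critical section, possibly contended by the other thread ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Create a thread with an explicit SCHED_FIFO priority. */
static int start_fifo_thread(pthread_t *t, int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    return pthread_create(t, &attr, worker, NULL);
}

int main(void)
{
    pthread_mutexattr_t ma;
    pthread_t vcpu, emulator;
    int err;

    /*
     * PI mutex: whoever holds it is boosted to the priority of the
     * highest-priority waiter, so a SCHED_FIFO waiter cannot be stuck
     * behind a preempted lower-priority lock holder.
     */
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &ma);

    /*
     * "If thread 1 can trigger an event in thread 2, thread 1 must be
     * placed at a higher priority": the stand-in for the emulator
     * thread (prio 3) is placed above the stand-in for the vCPU it
     * would signal (prio 2).
     */
    err = start_fifo_thread(&vcpu, 2);
    if (!err)
        err = start_fifo_thread(&emulator, 3);
    if (err) {
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        return 1;
    }

    pthread_join(vcpu, NULL);
    pthread_join(emulator, NULL);
    return 0;
}

The PI part targets exactly the scheduled-out-while-holding-the-lock
sequence above: with a plain mutex, a medium-priority thread can keep a
low-priority lock holder off the CPU while a high-priority waiter is
stuck on the lock, whereas with PTHREAD_PRIO_INHERIT the holder is
temporarily boosted to the waiter's priority until it releases the lock.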
>
> In fact, this is not specific to the emulator thread.  It applies just
> as well to vhost threads, to QEMU iothreads, even to the KVM PIT
> emulation thread if the guest were using it.
>
> > Paolo, you don't control how many interruptions of the emulator thread
> > happen per second.
>
> Indeed these non-VCPU threads should indeed run rarely and for a small
> amount of time only to achieve bounded latencies in the VCPUs.  But if
> they don't, those are simply bugs and we fix them.  In fact, sources of
> frequent interruptions have all been fixed or moved outside the main
> thread; for example, disks can use separate iothreads.  Configuration
> problems are also bugs, just in a different place.
>
> For a very basic VM that I've just tried (whatever QEMU does by default,
> only tweak was not using the QEMU GUI), there is exactly one
> interruption per second when the VM is idle.  That interruption in turn
> is caused by udev periodically probing the guest CDROM (!).  So that's a
> configuration problem: remove the CDROM, and it does one interruption
> every 10-30 seconds, again all of them guest triggered.
>
> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> design, it's just that someone is doing something stupid.  It could be
> the guest (e.g. unnecessary devices or daemons as in the example above),
> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> just to increment the clock), or the management (e.g. polling "is the VM
> running" 50 times per second).  But it can and must be fixed.

No, I mean you can run anything on VCPU-0 (it is valid to do that), and
that "anything" can generate 1, 1000, or 10,000 interrupts per second,
all of which are valid things to do.

"I can't run a kernel compilation on VCPU-0 because that will impact
latency on the realtime VCPU-1" is not acceptable.

> The design should be the basic one: attribute the right priority to each
> thread and things just work.
>
> Paolo
>
> > So if you let the emulator thread interrupt the
> > emulator thread at all times, without some kind of bounding
> > of these interruptions per time unit, you have a similar
> > problem as (*) (where the realtime task is scheduled).
> >
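
For reference, the conditional scheme being argued about above ("vcpu0
outranks the emulator thread only while inside a lock section shared
with a realtime vcpu") amounts to something like the following on the
guest side.  This is a rough sketch only: the actual KVM_HC_RT_PRIO
patch is not quoted in this message, so the hypercall number and the
1/0 boost/unboost argument convention below are assumptions made purely
for illustration.

/*
 * Guest-side sketch of the idea under discussion, NOT the actual patch:
 * vcpu0 asks the host to boost its priority above the emulator thread
 * only while it holds a lock that a realtime vCPU may spin on.  The
 * hypercall number and the 1/0 argument convention are assumed here.
 */
#include <linux/spinlock.h>
#include <linux/kvm_para.h>

#define KVM_HC_RT_PRIO          12      /* assumed value, not from the patch */

static DEFINE_SPINLOCK(shared_lock);    /* lock shared with a realtime vCPU */

static void vcpu0_shared_section(void)
{
        /* ask the host not to let the emulator thread preempt us from here */
        kvm_hypercall1(KVM_HC_RT_PRIO, 1);

        spin_lock(&shared_lock);
        /* ... work that the realtime vCPU may be busy-waiting on ... */
        spin_unlock(&shared_lock);

        /* back to normal: the emulator thread may preempt vcpu0 again */
        kvm_hypercall1(KVM_HC_RT_PRIO, 0);
}

Paolo's objection in this mail is that the same inversion shows up for
flags, polling and submitted work where no spinlock is involved at all,
so a boost tied to lock sections only covers one instance of the
problem; the fixed thread-priority arrangement sketched earlier is the
alternative he argues for.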