From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933978AbdIYI6o (ORCPT );
	Mon, 25 Sep 2017 04:58:44 -0400
Received: from bombadil.infradead.org ([65.50.211.133]:39306 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932107AbdIYI6l (ORCPT );
	Mon, 25 Sep 2017 04:58:41 -0400
Date: Mon, 25 Sep 2017 10:58:35 +0200
From: Peter Zijlstra 
To: Marcelo Tosatti 
Cc: Konrad Rzeszutek Wilk , mingo@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Thomas Gleixner 
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO
	hypercall\
Message-ID: <20170925085835.4zxze42mmi4emyj4@hirez.programming.kicks-ass.net>
References: <20170921114039.466130276@redhat.com>
	<20170921133653.GO26248@char.us.oracle.com>
	<20170921140628.zliqlz7mrlqs5pzz@hirez.programming.kicks-ass.net>
	<20170922011039.GB20133@amt.cnet>
	<20170922100004.ydmaxvgpc2zx7j25@hirez.programming.kicks-ass.net>
	<20170922121640.GA29589@amt.cnet>
	<20170922123107.fjh2yfwnej73trim@hirez.programming.kicks-ass.net>
	<20170922124005.GA30393@amt.cnet>
	<20170922130141.tz6f4gktihmbhqli@hirez.programming.kicks-ass.net>
	<20170925022238.GB5140@amt.cnet>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170925022238.GB5140@amt.cnet>
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Sep 24, 2017 at 11:22:38PM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:
> >
> > > Are you arguing its invalid for the following application to execute on
> > > housekeeping vcpu of a realtime system:
> > >
> > > void main(void)
> > > {
> > >         submit_IO();
> > >         do {
> > >                 computation();
> > >         } while (!interrupted());
> > > }
> > >
> > > Really?
> >
> > No.
> > Nobody cares about random crap tasks.

> Nobody has control over all code that runs in userspace Peter. And not
> supporting a valid sequence of steps because its "crap" (whatever your
> definition of crap is) makes no sense.
>
> It might be that someone decides to do the above (i really can't see
> any actual reasoning i can follow and agree on your "its crap"
> argument), this truly seems valid to me.

We don't care what other tasks do. This isn't a hard thing to
understand. You're free to run whatever junk on your CPUs, and it
doesn't (much) affect the correct functioning of the RT tasks you also
run there.

> So lets follow the reasoning steps:
>
> 1) "NACK, because you didnt understand the problem".
>
> OK thats an invalid NACK, you did understand the problem
> later and now your argument is the following.

It was a NACK because you wrote a shit changelog that didn't explain the
problem. But yes.

> 2) "NACK, because all VCPUs should be SCHED_FIFO all the time".

Very much so: if you want an RT guest, all VCPUs should run at RT prio,
and the interaction between the VCPUs and all supporting threads should
be designed for RT.

> But the existence of this code path from userspace:
>
>         submit_IO();
>         do {
>                 computation();
>         } while (!interrupted());
>
> Its a supported code sequence, and works fine in a non-RT environment.

Who cares about that chunk of code? Have you forgotten to mention that
this is the form of the emulation thread?

> Therefore it should work on an -RT environment.

No, this is where you're wrong. That code works on -RT as long as you
don't expect it to be a valid RT program.

-RT kernels will run !RT stuff just fine. But the moment you run a
program as RT (FIFO/RR/DEADLINE) it had better damn well be a valid RT
program, and that excludes a lot of code.
> So please give me some logical reasoning for the NACK (people can live with
> it, but it has to be good enough to justify the decreasing packing of
> guests in pCPUs):
>
> 1) "Voodoo programming" (its hard for me to parse what you mean with
> that... do you mean you foresee this style of priority boosting causing
> problems in the future? Can you give an example?).

Your 'solution' only works if you sacrifice a goat on a full moon,
because only that ensures the guest doesn't VM_EXIT and cause the
self-same problem while you've boosted it.

Because you've _not_ fixed the actual problem!

> Is there fundamentally wrong about priority boosting in spinlock
> sections, or this particular style of priority boosting is wrong?

Yes, it's fundamentally crap, because it doesn't guarantee anything. RT
is about making guarantees. An RT program needs a provable forward
progress guarantee at the very least. Including a priority inversion
disqualifies it from being sane.

> 2) "Pollution of the kernel code path". That makes sense to me, if thats
> whats your concerned about.

Also..

> 3) "Reduction of spinlock performance". Its true, but for NFV workloads
> people don't care about.

I've no idea what an NFV is.

> 4) "All vcpus should be SCHED_FIFO all the time". OK, why is that?
> What dictates that to be true?

Solid engineering. Does the guest kernel function as a bunch of
independent CPUs, or does it assume all CPUs are equal and have strong
inter-CPU connections?

Linux is the latter, therefore if one VCPU is RT they all should be.

Dammit, you even recognise this in the spin-owner preemption issue
you're hacking around, but then go arse-about-face 'solving' it.

> What the patch does is the following:
> It reduces the window where SCHED_FIFO is applied vcpu0
> to those were a spinlock is shared between -RT vcpus and vcpu0
> (why: because otherwise, when the emulator thread is sharing a
> pCPU with vcpu0, its unable to generate interrupts vcpu0).
>
> And its being rejected because:

It's not fixing the actual problem. The real problem is the prio
inversion between the VCPU and the emulation thread. _That_ is what
needs fixing.

Rewrite that VCPU/emulator interaction to be a proper RT construct. Then
you can run the VCPU at RT prio as you should, and the guest can issue
all the VM_EXIT things it wants at any time and still function
correctly.