Date: Fri, 22 Sep 2017 09:24:52 -0300
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Paolo Bonzini
Cc: Konrad Rzeszutek Wilk, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
Message-ID: <20170922122452.GA29608@amt.cnet>
References: <20170921113835.031375194@redhat.com> <20170921114039.364395490@redhat.com> <20170921133212.GN26248@char.us.oracle.com> <20170922010811.GA20133@amt.cnet> <29aadd63-ddfe-0ddc-2d71-8c0391db0ba4@redhat.com>
In-Reply-To: <29aadd63-ddfe-0ddc-2d71-8c0391db0ba4@redhat.com>

On Fri, Sep 22, 2017 at 09:23:47AM +0200, Paolo Bonzini wrote:
> On 22/09/2017 03:08, Marcelo Tosatti wrote:
> > On Thu, Sep 21, 2017 at 03:49:33PM +0200, Paolo Bonzini wrote:
> >> On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> >>> So the guest can change the scheduling decisions at the host level?
> >>> And the host HAS to follow it?
> >>> There is no policy override for the
> >>> host to say - nah, not going to do it?
> >
> > In that case the host should not even configure the guest with this
> > option (this is QEMU's 'enable-rt-fifo-hc' option).
> >
> >>> Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> >>> of a guest admin who wants all the CPU resources he can get]
> >
> > No. Because of the following code, executed by the housekeeping vCPU
> > running at constant SCHED_FIFO priority:
> >
> > 1. Start disk I/O.
> > 2. busy spin
> >
> > With the emulator thread sharing the same pCPU with the housekeeping
> > vCPU, the emulator thread (which runs at SCHED_NORMAL) will never
> > be scheduled in place of the vcpu thread at SCHED_FIFO.
> >
> > This causes a hang.
>
> But if the emulator thread can interrupt the housekeeping thread, the
> emulator thread should also be SCHED_FIFO at higher priority; IIRC this
> was in Jan's talk from a few years ago.

The point is that we do not want the emulator thread to interrupt the
housekeeping thread at all times: we only want it to interrupt the
housekeeping thread when that thread is not in a spinlock-protected
section (because preemption there affects realtime vCPUs attempting to
grab that particular spinlock). Outside such sections, the emulator
thread is free to interrupt the housekeeping thread.

> QEMU would also have to use PI mutexes (which is the main reason why
> it's using QemuMutex instead of e.g. GMutex).
>
> >> Yeah, I do not understand why there should be a housekeeping VCPU that
> >> is running at SCHED_NORMAL. If it hurts, don't do it...
> >
> > Hope the explanation above makes sense (in fact, it was you who pointed
> > out SCHED_FIFO should not be constant on the housekeeping vCPU,
> > when sharing a pCPU with the emulator thread at SCHED_NORMAL).
>
> The two are not exclusive... As you point out, it depends on the
> workload. For DPDK you can put both of them at SCHED_NORMAL. For
> kernel-intensive uses you must use SCHED_FIFO.
>
> Perhaps we could consider running these threads at SCHED_RR instead.
> Unlike SCHED_NORMAL, I am not against a hypercall that temporarily bumps
> SCHED_RR to SCHED_FIFO, but perhaps that's not even necessary.

Sorry Paolo, I don't see how SCHED_RR is going to help here:

"SCHED_RR: Round-robin scheduling

SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
above for SCHED_FIFO also applies to SCHED_RR, except that each thread
is allowed to run only for a maximum time quantum."

What must happen is that vcpu0 runs _until it is finished with the
spinlock-protected section_ (that is, any job the emulator thread has
during that period, while vcpu0 has work to do, is of lower priority
and must not execute). Otherwise vcpu1, running a realtime workload,
will attempt to grab the spinlock vcpu0 holds, and busy-spin waiting
for the emulator thread to finish.

If you put the emulator thread at a higher priority than vcpu0, as you
suggested above, the same problem happens. So that option is not viable.

We tried keeping vcpu0 at SCHED_FIFO at all times, to avoid this
hypercall, but unfortunately that causes the hang described in the
trace.

So I fail to see how SCHED_RR would help here.

Thanks