From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751771AbbEASkf (ORCPT ); Fri, 1 May 2015 14:40:35 -0400
Received: from mail-wi0-f170.google.com ([209.85.212.170]:35703 "EHLO
        mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1750807AbbEASkc (ORCPT );
        Fri, 1 May 2015 14:40:32 -0400
Date: Fri, 1 May 2015 20:40:26 +0200
From: Ingo Molnar
To: Rik van Riel
Cc: Andy Lutomirski, "linux-kernel@vger.kernel.org", X86 ML,
        williams@redhat.com, Andrew Lutomirski, fweisbec@redhat.com,
        Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
        Paolo Bonzini, "Paul E. McKenney", Linus Torvalds
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq
        disable & enable from context tracking on syscall entry
Message-ID: <20150501184025.GA2114@gmail.com>
References: <1430429035-25563-1-git-send-email-riel@redhat.com>
        <1430429035-25563-4-git-send-email-riel@redhat.com>
        <20150501064044.GA18957@gmail.com>
        <554399D1.6010405@redhat.com>
        <20150501155912.GA451@gmail.com>
        <20150501162109.GA1091@gmail.com>
        <5543A94B.3020108@redhat.com>
        <20150501163431.GB1327@gmail.com>
        <5543C05E.9040209@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5543C05E.9040209@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Rik van Riel wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> >
> > * Rik van Riel wrote:
> >
> >>> I can understand people running hard-RT workloads not wanting to
> >>> see the overhead of a timer tick or a scheduler tick with variable
> >>> (and occasionally heavy) work done in IRQ context, but the jitter
> >>> caused by a single trivial IPI with constant work should be very,
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> >
> > I don't buy this:
> >
> >> At that point an IPI, either on the host or in the guest, involves a
> >> full VMEXIT & VMENTER cycle.
> >
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1
> > usec on recent hardware, and I bet it will get better with time.
> >
> > I'm not aware of any hard-RT workload that cannot take 1 usec
> > latencies.
>
> Now think about doing this kind of IPI from inside a guest, to
> another VCPU on the same guest.
>
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some
> other way.

Well, it's not just about callback offload, but it's about the basic
synchronization guarantee of synchronize_rcu(): that all RCU read-side
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs
to stop using resources that a synchronize_rcu() caller expects it to
stop using.

We can do that only if we know it in an SMP-coherent way that the
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a
constant flag to a location in the task struct. We have a number of
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS                    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are
two cases:

- If it's 0, then it has observed quiescent state on that CPU.
  (It does not have to be atomics anymore, as we'd only observe the
  value and MESI coherency takes care of it.)

- If it's not 0 then the remote CPU is not executing user-space code
  and we can install (remotely) a TIF_NOHZ flag in it and expect it
  to process it either on return to user-space or on a context
  switch.

This way, unless I'm missing something, reduces the overhead to a
single store to a hot cacheline on return-to-userspace - which
instruction if we place it well might as well be close to zero cost.
No syscall entry cost. Slow-return cost only in the (rare) case of
someone using synchronize_rcu().

Hm?

Thanks,

	Ingo
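
The two-case check described above (read the per-task location, treat 0
as an observed quiescent state, otherwise install a flag for the task to
process on its way out) can be sketched in plain user-space C. This is
only an illustration under stated assumptions, not kernel code: every
name here (struct sketch_task, the sketch_* functions, SKETCH_IN_KERNEL)
is hypothetical, the nohz_flag field merely stands in for TIF_NOHZ, and
C11 atomics stand in for the plain store and load the mail talks about.
It runs single-threaded and does not try to close the window between the
remote read and the task's store on return to user space, which a real
implementation would have to deal with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary nonzero stand-in for the constant written on syscall entry. */
#define SKETCH_IN_KERNEL 1u

/* Hypothetical per-task state; not the kernel's task_struct. */
struct sketch_task {
	atomic_uint in_kernel;       /* nonzero in kernel mode, 0 in user space */
	atomic_bool nohz_flag;       /* stand-in for a remotely set TIF_NOHZ */
	unsigned quiescent_reports;  /* slow-path work done by the task itself */
};

void sketch_task_init(struct sketch_task *t)
{
	assert(t != NULL);
	atomic_init(&t->in_kernel, 0u);   /* tasks start out in user space */
	atomic_init(&t->nohz_flag, false);
	t->quiescent_reports = 0;
}

/* Syscall entry: models the store that already happens on entry. */
void sketch_syscall_entry(struct sketch_task *t)
{
	atomic_store_explicit(&t->in_kernel, SKETCH_IN_KERNEL,
			      memory_order_release);
}

/* Return to user space: slow path only if flagged, then the single store. */
void sketch_return_to_user(struct sketch_task *t)
{
	/* Slow path: taken only if a remote CPU installed the flag. */
	if (atomic_exchange(&t->nohz_flag, false))
		t->quiescent_reports++;

	/* Fast path: write 0 to the location on the way out. */
	atomic_store_explicit(&t->in_kernel, 0u, memory_order_release);
}

/*
 * Remote side (the synchronize_rcu() caller in the mail).
 * Returns true if a quiescent state was observed directly; otherwise
 * installs the flag and returns false, leaving the report to the task.
 */
bool sketch_remote_check(struct sketch_task *t)
{
	if (atomic_load_explicit(&t->in_kernel, memory_order_acquire) == 0u)
		return true;

	atomic_store(&t->nohz_flag, true);
	return false;
}
```

In this sketch the only cost added to the common return path is the
final store; the exchange on nohz_flag is there to model the slow return
and would correspond to the existing flag test on exit rather than new
work.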