From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751771AbbEASkf (ORCPT <rfc822;w@1wt.eu>);
	Fri, 1 May 2015 14:40:35 -0400
Received: from mail-wi0-f170.google.com ([209.85.212.170]:35703 "EHLO
	mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750807AbbEASkc (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 1 May 2015 14:40:32 -0400
Date: Fri, 1 May 2015 20:40:26 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, williams@redhat.com,
        Andrew Lutomirski <luto@kernel.org>, fweisbec@redhat.com,
        Peter Zijlstra <peterz@infradead.org>,
        Heiko Carstens <heiko.carstens@de.ibm.com>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        "Paul E. McKenney" <paulmck@us.ibm.com>,
        Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
 & enable from context tracking on syscall entry
Message-ID: <20150501184025.GA2114@gmail.com>
References: <1430429035-25563-1-git-send-email-riel@redhat.com>
 <1430429035-25563-4-git-send-email-riel@redhat.com>
 <20150501064044.GA18957@gmail.com>
 <554399D1.6010405@redhat.com>
 <20150501155912.GA451@gmail.com>
 <CALCETrVZf11EYLhKWOfeQSyzq9eq5KB+btcY19JF+sJvs2zMXA@mail.gmail.com>
 <20150501162109.GA1091@gmail.com>
 <5543A94B.3020108@redhat.com>
 <20150501163431.GB1327@gmail.com>
 <5543C05E.9040209@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5543C05E.9040209@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


* Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, it's about the basic 
synchronization guarantee of synchronize_rcu(): that all pre-existing 
RCU read-side critical sections have finished executing by the time 
the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS                    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the CPU doing the synchronize_rcu() would have to read this 
location on the remote CPU. There are two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (The 
   read does not have to be atomic, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0, then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag on it and expect it 
   to process the flag either on return to user-space or on a context 
   switch.
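
As a rough illustration of the two cases above, here is a user-space 
C11 sketch, not kernel code. All names in it (struct cpu_state, 
in_kernel, nohz_flag, qs_reported and the three helpers) are 
hypothetical stand-ins; nohz_flag only plays the role of TIF_NOHZ. It 
also ignores the window where the target returns to user-space between 
the remote read and the flag install, which a real implementation 
would have to close:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; none of these names are kernel API. */
struct cpu_state {
	atomic_int in_kernel;   /* nonzero in kernel mode, 0 in user-space */
	atomic_int nohz_flag;   /* stand-in for a remotely set TIF_NOHZ */
	atomic_int qs_reported; /* counts slow-path quiescent-state reports */
};

/* Syscall entry: the store that already happens today (pt_regs->ss). */
static void syscall_entry(struct cpu_state *cpu)
{
	atomic_store(&cpu->in_kernel, 1);
}

/* Return to user-space: one extra store of 0, plus a slow path that
 * only runs if a remote CPU installed the flag in the meantime. */
static void return_to_user(struct cpu_state *cpu)
{
	assert(atomic_load(&cpu->in_kernel) != 0);

	if (atomic_exchange(&cpu->nohz_flag, 0))
		atomic_fetch_add(&cpu->qs_reported, 1);

	atomic_store(&cpu->in_kernel, 0);
}

/* Run by the CPU doing the grace-period wait. Returns true if a
 * quiescent state was observed directly (case 1), false if the flag
 * was installed and the target has to report later (case 2). */
static bool remote_check(struct cpu_state *cpu)
{
	if (atomic_load(&cpu->in_kernel) == 0)
		return true;

	atomic_store(&cpu->nohz_flag, 1);
	return false;
}
```

In this sketch the common path pays only the final store in 
return_to_user(); the exchange on nohz_flag is there just to keep the 
example short, and stands in for the thread-flag check the return path 
does anyway.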

Unless I'm missing something, this reduces the overhead to a single 
store to a hot cacheline on return-to-userspace, an instruction 
which, if we place it well, might be close to zero cost. No syscall 
entry cost. Slow-return cost only in the (rare) case of someone using 
synchronize_rcu().

Hm?

Thanks,

	Ingo