From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752732AbbEGRrk (ORCPT ); Thu, 7 May 2015 13:47:40 -0400
Received: from mail-lb0-f177.google.com ([209.85.217.177]:34726 "EHLO
	mail-lb0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751394AbbEGRrd (ORCPT ); Thu, 7 May 2015 13:47:33 -0400
MIME-Version: 1.0
In-Reply-To: <20150507150845.GA20608@gmail.com>
References: <20150501064044.GA18957@gmail.com> <554399D1.6010405@redhat.com>
	<1430659432.4233.3.camel@gmail.com> <55465B2D.6010300@redhat.com>
	<55466E72.8060602@redhat.com> <20150507104845.GB14924@gmail.com>
	<20150507124437.GB17443@gmail.com> <20150507150845.GA20608@gmail.com>
From: Andy Lutomirski
Date: Thu, 7 May 2015 10:47:10 -0700
Message-ID:
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
	& enable from context tracking on syscall entry
To: Ingo Molnar
Cc: fweisbec@redhat.com, Paolo Bonzini, X86 ML, Thomas Gleixner,
	Peter Zijlstra, Heiko Carstens, Ingo Molnar, Mike Galbraith,
	Rik van Riel, "linux-kernel@vger.kernel.org", williams@redhat.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On May 7, 2015 8:38 PM, "Ingo Molnar" wrote:
>
>
> * Andy Lutomirski wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So lets start again, with calling the synchronize_rcu() the 'remote
> CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >         this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever. Maybe we add:
> >
> >         this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state. If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode. Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>         this_cpu_set(rcu_state) = IN_KERNEL;
>         ...
>         this_cpu_inc(rcu_qs_ctr);
>         this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>         this_cpu_set(rcu_state) = IN_KERNEL;
>         ...
>         this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless. The counter could use the low bits to
indicate the state. The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed. Getting it right would
require lots of care.

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it. If it's just an
> > optimization, though, then it should be fine.
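(Going back to the counter idea above for a moment: the sketch below is
roughly the shape I have in mind. rcu_qs_ctr and the *_sketch helpers
are made-up names for illustration, and every barrier that a real
implementation would need is deliberately omitted.)

        /*
         * Low bit of the counter encodes the state:
         * odd = user mode (quiescent), even = kernel mode.
         */
        DEFINE_PER_CPU(unsigned long, rcu_qs_ctr);

        /* kernel -> user transition on the RT CPU */
        static inline void rcu_user_enter_sketch(void)
        {
                this_cpu_inc(rcu_qs_ctr);       /* even -> odd */
        }

        /* user -> kernel transition on the RT CPU */
        static inline void rcu_user_exit_sketch(void)
        {
                this_cpu_inc(rcu_qs_ctr);       /* odd -> even */
        }

        /*
         * On the CPU doing synchronize_rcu(): @snap is a value of
         * rcu_qs_ctr sampled earlier for @cpu.  That CPU is done
         * with the grace period if it is in user mode right now
         * (counter is odd) or if the counter moved at all since
         * the snapshot, i.e. it passed through user mode meanwhile.
         */
        static bool rcu_cpu_is_done_sketch(int cpu, unsigned long snap)
        {
                unsigned long cur = READ_ONCE(per_cpu(rcu_qs_ctr, cpu));

                return (cur & 1) || cur != snap;
        }

A waiting CPU that samples this and sees no progress can still fall
back to poking us with TIF_RCU_THINGY, as discussed above.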
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How? Suppose the exit path looked like:

        this_cpu_write(rcu_state, IN_USER);

        if (ti->flags & _TIF_RCU_NOTIFY) {
                if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                        slow_notify_rcu_that_we_are_exiting();
        }

        iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
because stores can be delayed. (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but
that would probably be slower than we'd like.

--Andy
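P.S. For concreteness, the "unconditional test_and_clear_bit" variant I
mean is the same sketch with the cheap pre-check dropped, something like
the below. slow_notify_rcu_that_we_are_exiting() is still a made-up
name, and I haven't measured any of this:

        this_cpu_write(rcu_state, IN_USER);

        /*
         * Unconditional locked RMW on ti->flags.  A locked
         * instruction is a full barrier on x86, so either we see
         * the flag the remote CPU set (and do the slow notify),
         * or our IN_USER store is already globally visible by the
         * time that CPU re-checks rcu_state and it won't wait on us.
         */
        if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                slow_notify_rcu_that_we_are_exiting();

        iret or sysret;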