Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Ingo Molnar <mingo@kernel.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: fweisbec@redhat.com, Paolo Bonzini <pbonzini@redhat.com>,
	X86 ML <x86@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Ingo Molnar <mingo@redhat.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Rik van Riel <riel@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	williams@redhat.com
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
Date: Fri, 8 May 2015 08:37:11 +0200	[thread overview]
Message-ID: <20150508063711.GA4934@gmail.com> (raw)
In-Reply-To: <CALCETrX8zEcAXBAsut8MGU7CwEMd_dFYDsRyysY0qR18qAC0rg@mail.gmail.com>

* Andy Lutomirski <luto@amacapital.net> wrote:

> > So do you mean:
> >
> >    this_cpu_set(rcu_state) = IN_KERNEL;
> >    ...
> >    this_cpu_inc(rcu_qs_ctr);
> >    this_cpu_set(rcu_state) = IN_USER;
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can make
> > it just two simple stores into a byte flag, one on entry and one on
> > exit:
> >
> >    this_cpu_set(rcu_state) = IN_KERNEL;
> >    ...
> >    this_cpu_set(rcu_state) = IN_USER;
> >
> 
> I was thinking that either a counter or a state flag could make sense.
> Doing both would be pointless.  The counter could use the low bits to
> indicate the state.  The benefit of the counter would be that the
> RCU-waiting CPU could observe that the counter has incremented and
> that therefore a grace period has elapsed.  Getting it right would
> require lots of care.

So if you mean:

       <syscall entry>
       ...
       this_cpu_inc(rcu_qs_ctr);
       <syscall exit>

I don't see how this would work reliably: how do you handle the case 
of a SCHED_FIFO task never returning from user-space (common technique 
in RT apps)? synchronize_rcu() would block indefinitely as it would 
never see rcu_qs_ctr increase.

We have to be able to observe user-mode anyway, for system-time 
statistics purposes, and that flag could IMHO also drive the RCU GC 
machinery.

> > > The problem is that I don't see how TIF_RCU_THINGY can work 
> > > reliably. If the remote CPU sets it, it'll be too late and we'll 
> > > still enter user mode without seeing it.  If it's just an 
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether the 
> > RT CPU has entered user-mode - before it goes to wait.
> 
> How?
> 
> Suppose the exit path looked like:
> 
> this_cpu_write(rcu_state, IN_USER);
> 
> if (ti->flags & _TIF_RCU_NOTIFY) {
>     if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
>         slow_notify_rcu_that_we_are_exiting();
> }
> 
> iret or sysret;

No, it would look like this:

   this_cpu_write(rcu_state, IN_USER);
   iret or sysret;

I.e. IN_USER is set well after all notifications are checked. No 
kernel execution happens afterwards. (No extra checks added - the 
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to 
  IN_KERNEL after the idling ends. )

> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> because stores can be delayed.  (It could even happen after sysret, 
> IIRC, but IRET is serializing.)

All it has to do in the synchronize_rcu() slowpath is something like:

	if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
		smp_mb__before_atomic();
		set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
		smp_rmb();
		if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
			... go wait ...
	}
	/* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait' 
cost which we will pay in 99.9% of the cases!

> If we put an mfence after this_cpu_set or did an unconditional 
> test_and_clear_bit on ti->flags then this problem goes away, but 
> that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical synchronize_rcu() 
will wait millions (sometimes billions) of cycles. There's absolutely 
zero performance concern here and it's all CPU local in any case.

In fact a user-mode/kernel-mode flag speeds up naive implementations 
of synchronize_rcu(): because it's able to observe extended quiescent 
state immediately, without having to wait for a counter to increase 
(which was always the classic way to observe grace periods).

If all CPUs are in user mode or are idle (which is rather common!) 
then synchronize_rcu() could return almost immediately - while 
previously it had to wait for scheduling or periodic timer irqs to 
trigger on all CPUs - adding many millisecs of delay even in the best 
of cases.

Thanks,

	Ingo