From: Andy Lutomirski
Date: Fri, 8 May 2015 03:59:15 -0700
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
To: Ingo Molnar
Cc: fweisbec@redhat.com, Paolo Bonzini, Thomas Gleixner, X86 ML,
 Peter Zijlstra, Ingo Molnar, Heiko Carstens, Mike Galbraith,
 "linux-kernel@vger.kernel.org", Rik van Riel, williams@redhat.com
In-Reply-To: <20150508063711.GA4934@gmail.com>
References: <1430659432.4233.3.camel@gmail.com>
 <55465B2D.6010300@redhat.com>
 <55466E72.8060602@redhat.com>
 <20150507104845.GB14924@gmail.com>
 <20150507124437.GB17443@gmail.com>
 <20150507150845.GA20608@gmail.com>
 <20150508063711.GA4934@gmail.com>

On May 8, 2015 12:07 PM, "Ingo Molnar" wrote:
>
> * Andy Lutomirski wrote:
>
> > > So do you mean:
> > >
> > >         this_cpu_write(rcu_state, IN_KERNEL);
> > >         ...
> > >         this_cpu_inc(rcu_qs_ctr);
> > >         this_cpu_write(rcu_state, IN_USER);
> > >
> > > ?
> > >
> > > So in your proposal we'd have an INC and two MOVs.  I think we can
> > > make it just two simple stores into a byte flag, one on entry and
> > > one on exit:
> > >
> > >         this_cpu_write(rcu_state, IN_KERNEL);
> > >         ...
> > >         this_cpu_write(rcu_state, IN_USER);
> > >
> >
> > I was thinking that either a counter or a state flag could make
> > sense.  Doing both would be pointless.  The counter could use its
> > low bit to indicate the state.  The benefit of the counter would be
> > that the RCU-waiting CPU could observe that the counter has
> > incremented and that therefore a grace period has elapsed.  Getting
> > it right would require lots of care.
>
> So if you mean:
>
>         ...
>         this_cpu_inc(rcu_qs_ctr);
>
> I don't see how this would work reliably: how do you handle the case
> of a SCHED_FIFO task that never returns from user-space (a common
> technique in RT apps)?  synchronize_rcu() would block indefinitely,
> as it would never see rcu_qs_ctr increase.
>
> We have to be able to observe user mode anyway, for system-time
> statistics purposes, and that flag could IMHO also drive the RCU
> grace-period machinery.

The counter would have to be fancier than that to work.  We could say
that an even value means user mode, for example.  IOW, the high bits
of the counter would count transitions to quiescent states and the
low bit would indicate the state.

Actually, I'd count backwards: an andl instruction to go to user mode
and a decl to go to kernel mode, so that the counter only ever moves
in one direction.
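As a rough mock-up of that encoding (all names invented; user-space
C11 atomics standing in for the per-cpu ops):

        #include <stdatomic.h>

        /*
         * Illustrative mock-up, not kernel code.  The low bit of the
         * counter is the state: clear = user (quiescent), set =
         * kernel.  Both transitions move the counter strictly
         * downward, so any change a waiter observes implies that a
         * quiescent state has passed.
         */
        static _Atomic unsigned long rcu_qs_ctr = -2UL; /* even: "in user" */

        static void enter_kernel(void)
        {
                /* the decl: sets the low bit, moves the counter down */
                atomic_fetch_sub_explicit(&rcu_qs_ctr, 1,
                                          memory_order_relaxed);
        }

        static void return_to_user(void)
        {
                /* the andl: clears the low bit, moves it down again */
                atomic_fetch_and_explicit(&rcu_qs_ctr, ~1UL,
                                          memory_order_release);
        }

Each user/kernel round trip moves the counter down by two, and the
low bit always reflects the current state.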
> > > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > > reliably.  If the remote CPU sets it, it'll be too late and
> > > > we'll still enter user mode without seeing it.  If it's just an
> > > > optimization, though, then it should be fine.
> > >
> > > Well, after setting it, the remote CPU has to re-check whether the
> > > RT CPU has entered user-mode - before it goes to wait.
> >
> > How?
> >
> > Suppose the exit path looked like:
> >
> >         this_cpu_write(rcu_state, IN_USER);
> >
> >         if (ti->flags & _TIF_RCU_NOTIFY) {
> >                 if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> >                         slow_notify_rcu_that_we_are_exiting();
> >         }
> >
> >         iret or sysret;
>
> No, it would look like this:
>
>         this_cpu_write(rcu_state, IN_USER);
>         iret or sysret;
>
> I.e. IN_USER is set well after all notifications are checked.  No
> kernel execution happens afterwards.  (No extra checks added - the
> regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)
>
> ( Same goes for idle: we just mark it IN_IDLE and move it back to
>   IN_KERNEL after the idling ends. )
>
> > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET,
> > because stores can be delayed.  (It could even happen after sysret,
> > IIRC, but IRET is serializing.)
>
> All it has to do in the synchronize_rcu() slowpath is something like:

I don't think this works.  Suppose the RT CPU starts in kernel mode,
and checks its TI flags before the waiter sets TIF_RCU_NOTIFY.

>         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>                 smp_mb__before_atomic();
>                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>                 smp_rmb();
>                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

Now it writes IN_USER to rcu_state and exits to user mode.  It never
notices TIF_RCU_NOTIFY.  We could fix this by sending an IPI to kick
it.

>                         ... go wait ...
>         }
>         /* Cool, we observed quiescent state: */
>
> The cost of the trivial barrier is nothing compared to the 'go wait'
> cost, which we will pay in 99.9% of the cases!
>
> > If we put an mfence after the this_cpu_write, or did an
> > unconditional test_and_clear_bit on ti->flags, then this problem
> > goes away, but that would probably be slower than we'd like.
>
> We are talking about a dozen cycles, while a typical
> synchronize_rcu() will wait millions (sometimes billions) of cycles.
> There's absolutely zero performance concern here, and it's all
> CPU-local in any case.

The barrier would affect all syscalls, though.  Admittedly that's
still much cheaper than the current thing, but given that we can
round-trip a syscall in about 110 cycles, a dozen extra cycles would
be at least 10% overhead.

My preference would be to use a counter and skip the barrier.  If the
waiting CPU polled the counter, then, even if it lost this race, it
would be guaranteed to see the counter change reasonably soon, as
long as no CPU implementation sits on its store buffer forever.
(There's a rough sketch of such a poll loop at the end of this mail.)

> In fact a user-mode/kernel-mode flag speeds up naive implementations
> of synchronize_rcu(): it can observe an extended quiescent state
> immediately, without having to wait for a counter to increase (which
> was always the classic way to observe grace periods).
>
> If all CPUs are in user mode or are idle (which is rather common!)
> then synchronize_rcu() could return almost immediately - while
> previously it had to wait for scheduling or periodic timer irqs to
> trigger on all CPUs, adding many millisecs of delay even in the best
> of cases.

I agree absolutely, except in the whole-tons-of-CPUs case, where I
think the RCU tree machinery might preclude this.  I'm not sure how
well that works with full nohz.
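For concreteness, a rough mock-up of that counter-polling scheme
(again, invented names and user-space C11 atomics; the real thing
would live in the RCU core, using one counter per CPU encoded as in
the earlier sketch):

        #include <stdatomic.h>
        #include <sched.h>

        #define NR_CPUS 4

        /* One counter per CPU, low bit = state, as sketched above. */
        static _Atomic unsigned long rcu_qs_ctr[NR_CPUS];

        static void wait_for_quiescent_states(void)
        {
                unsigned long snap[NR_CPUS];
                int cpu;

                /* Snapshot every CPU's counter. */
                for (cpu = 0; cpu < NR_CPUS; cpu++)
                        snap[cpu] = atomic_load_explicit(&rcu_qs_ctr[cpu],
                                                         memory_order_acquire);

                for (cpu = 0; cpu < NR_CPUS; cpu++) {
                        /*
                         * Low bit clear: user or idle, i.e. already
                         * quiescent - the "return almost immediately"
                         * case.
                         */
                        if (!(snap[cpu] & 1))
                                continue;

                        /*
                         * In kernel: poll until the counter moves at
                         * all.  No barrier in the syscall path; the
                         * store buffer drains soon enough.
                         */
                        while (atomic_load_explicit(&rcu_qs_ctr[cpu],
                                        memory_order_acquire) == snap[cpu])
                                sched_yield(); /* stand-in for the real wait */
                }
        }

--Andy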