From: Andy Lutomirski
Date: Fri, 8 May 2015 05:56:52 -0700
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
To: Ingo Molnar
Cc: fweisbec@redhat.com, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra, Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel@vger.kernel.org, Rik van Riel, williams@redhat.com

On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar wrote:
>
> * Andy Lutomirski wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar" wrote:
>> >
>> > * Andy Lutomirski wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >    ...
>> > > >    this_cpu_inc(rcu_qs_ctr);
>> > > >    this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can
>> > > > make it just two simple stores into a byte flag, one on entry and
>> > > > one on exit:
>> > > >
>> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >    ...
>> > > >    this_cpu_set(rcu_state) = IN_USER;
>> > >
>> > > I was thinking that either a counter or a state flag could make
>> > > sense. Doing both would be pointless. The counter could use the low
>> > > bits to indicate the state. The benefit of the counter would be that
>> > > the RCU-waiting CPU could observe that the counter has incremented
>> > > and that therefore a grace period has elapsed. Getting it right
>> > > would require lots of care.
>> >
>> > So if you mean:
>> >
>> >    ...
>> >    this_cpu_inc(rcu_qs_ctr);
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU GC
>> > machinery.
>>
>> The counter would have to be fancier than that to work. We could say
>> that an even value means user mode, for example. IOW the high bits of
>> the counter would count transitions to quiescent and the low bits
>> would indicate the state.
>>
>> Actually, I'd count backwards. We could use an andl instruction to go
>> to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter -
> which your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained that would work as well. That's not a counter
> though.
I won't quibble about the name :)

>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
>> > > because stores can be delayed. (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works. Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >    if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >            smp_mb__before_atomic();
>> >            set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >            smp_rmb();
>> >            if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode. It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...] We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep: like it would do anyway if
> synchronize_rcu() was waiting for the timer irq to end the grace
> period on the RT-CPU.

Seems reasonable to me.

> The TIF flag would cause an RCU grace period to lapse, no need to wake
> up the synchronize_rcu() side: it will notice the (regular) increased
> RCU counter.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>> >    if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >            smp_mb__before_atomic();
>> >            set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >            smp_rmb();
>> >            if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
>  - simple IN_KERNEL/IN_USER[/IN_IDLE] flag updated at context
>    entry/exit time, straight in the assembly
>
>  - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
>    nothing but: this_cpu_inc(rcu_qs_ctr).
>
>  - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
>    then does a timeout-poll-loop with jiffies granular
>    timeouts, simulating a timer IRQ in essence. It will either
>    observe IN_USER or will see the regular RCU qs counter
>    increase.
>
> this should be race free.
>
> Alternatively we can even get rid of the TIF flag by merging the
> percpu counter with the percpu state. Quiescent states are recorded
> via a counter shifted by two bits:
>
>    rcu_qs_ctr += 4;
>
> while user/kernel/idle mode is recorded in the lower 2 bits.
>
> So on entering the kernel we'd do:
>
>    rcu_qs_ctr += 4+1;   /* Register QS and set bit 0 to IN_KERNEL */
>
> on returning to user-space we'd do:
>
>    rcu_qs_ctr += 4-1;   /* We have bit 0 set already, clear it and register a QS */
>
> on going idle we'd do:
>
>    rcu_qs_ctr += 4+2;   /* Register QS, set bit 1 */
>
> on return from idle we'd do:
>
>    rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
>
> etc. On all boundary transitions we can use a constant ADD on a
> suitable percpu variable.

Sounds good to me, except that we need to be careful to distinguish
between non-syscall entries from quiescent states and non-syscall
entries from non-quiescent states.
We could save the old state (as the current exception_enter code does)
or we could allocate enough low bits for the state that the problem
goes away.

I don't think the TIF_RCU_QS variant is worthwhile -- merging the
counter and state is probably both easier and faster.

--Andy
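
For concreteness, here is a minimal sketch of the merged counter/state
scheme discussed above. All names (rcu_qs_ctr_and_state, RCU_QS_INC,
wait_for_remote_qs(), ...) are illustrative assumptions rather than
existing kernel interfaces, and memory ordering, counter wraparound and
the idle/non-syscall-entry cases are glossed over; it only shows the
constant-ADD boundary transitions and the IPI-free poll loop on the
synchronize_rcu() side:

    /*
     * Sketch only: one per-CPU word whose low two bits encode the
     * current context (kernel/user/idle) and whose remaining bits count
     * quiescent-state transitions.  Every boundary crossing is a single
     * constant add, as in the scheme quoted above.
     */
    #include <linux/percpu.h>
    #include <linux/sched.h>

    #define RCU_STATE_KERNEL   0x1UL   /* bit 0 set while in the kernel    */
    #define RCU_STATE_IDLE     0x2UL   /* bit 1 set while in the idle loop */
    #define RCU_STATE_MASK     0x3UL
    #define RCU_QS_INC         0x4UL   /* one quiescent-state "tick"       */

    static DEFINE_PER_CPU(unsigned long, rcu_qs_ctr_and_state);

    static inline void rcu_note_user_to_kernel(void)
    {
            /* register a QS and set the IN_KERNEL bit in one add */
            __this_cpu_add(rcu_qs_ctr_and_state, RCU_QS_INC + RCU_STATE_KERNEL);
    }

    static inline void rcu_note_kernel_to_user(void)
    {
            /* register a QS and clear the IN_KERNEL bit in one add */
            __this_cpu_add(rcu_qs_ctr_and_state, RCU_QS_INC - RCU_STATE_KERNEL);
    }

    /*
     * synchronize_rcu() side: jiffies-granular poll loop, no IPI.
     * Return once the remote CPU has been observed outside the kernel
     * or has crossed at least one more boundary (the counter moved,
     * which under this scheme implies a quiescent state was registered).
     */
    static void wait_for_remote_qs(int cpu)
    {
            unsigned long snap = per_cpu(rcu_qs_ctr_and_state, cpu);

            if ((snap & RCU_STATE_MASK) != RCU_STATE_KERNEL)
                    return;         /* already in user mode or idle */

            while (per_cpu(rcu_qs_ctr_and_state, cpu) == snap)
                    schedule_timeout_uninterruptible(1);
    }

Since every transition adds at least RCU_QS_INC, the word only moves
forward, so "the value changed" is enough for the waiter to conclude a
boundary was crossed since the snapshot.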