On Wed, Aug 06, 2014 at 05:09:59AM -0700, Paul E. McKenney wrote:
> > Or you could shoot all CPUs with resched_cpu() which would have them
> > cycle through schedule() even if there's nothing but the idle thread
> > to run. That guarantees they'll go to sleep again in a !trampoline.
>
> Good point, that would be an easier way to handle the idle threads than
> messing with rcu_tasks_kthread()'s affinity. Thank you!

One issue though: resched_cpu() doesn't wait for that to complete. We'd
need something that guarantees the remote CPU has actually gone through
schedule().

> > But I still very much hate the polling stuff...
> >
> > Can't we abuse the preempt notifiers? Say we make it possible to
> > install preemption notifiers cross-task, then task-rcu can install a
> > preempt-out notifier which completes the rcu-task wait.
> >
> > After all, since we tagged it, it was !running, and being scheduled
> > out means it ran (once) and therefore isn't on a trampoline anymore.
>
> Maybe I am being overly paranoid, but couldn't the task be preempted
> in a trampoline, be resumed, execute one instruction (still in the
> trampoline) and be preempted again?

Ah, what I failed to state is that we should also check the sleep
condition, so only 'voluntary' schedule() calls count.

Of course, if we made something specific to the trampoline thing rather
than 'task'-rcu, we could simply check whether the IP is inside a
trampoline or not.

> > And the tick, which checks to see if the task got to userspace, can
> > do the same: remove the notifier and then complete.
>
> My main concern with this sort of approach is that I have to deal
> with full-up concurrency (200 CPUs all completing tasks concurrently,
> for example), which would make for a much larger and more complex
> patch. Now, I do admit that it is quite possible that I will end up
> there anyway, for example, if more people start using RCU-tasks, but I
> see no need to hurry this process. ;-)

You mean cacheline contention on the struct completion? I'd first make
it simple and only fix it if/when it becomes a problem. 200 CPUs
contending on a single cacheline _once_ is annoying, but probably still
lots cheaper than polling the state of at least that many tasks.
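
To make the first point concrete, roughly what I mean by kicking all
CPUs is the below -- completely untested, rcu_tasks_kick_all_cpus() is
a made-up name, and the comment spells out the missing wait:

	#include <linux/cpu.h>
	#include <linux/cpumask.h>
	#include <linux/sched.h>

	/*
	 * Kick every CPU through schedule() so that even a CPU running
	 * only the idle task goes back to sleep somewhere that is not a
	 * trampoline.
	 */
	static void rcu_tasks_kick_all_cpus(void)
	{
		int cpu;

		get_online_cpus();
		for_each_online_cpu(cpu)
			resched_cpu(cpu);	/* TIF_NEED_RESCHED + IPI */
		put_online_cpus();

		/*
		 * The hole: resched_cpu() returns immediately, it does
		 * not wait for the remote CPU to actually pass through
		 * schedule(), so by itself this gives no completion
		 * guarantee.
		 */
	}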
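
And a very rough sketch of the notifier idea, not even compile tested.
The cross_task_preempt_notifier_*() calls don't exist today -- that is
exactly the cross-task bit we'd have to add -- and the unregister /
lifetime ordering is entirely hand-waved:

	#include <linux/completion.h>
	#include <linux/kernel.h>
	#include <linux/preempt.h>
	#include <linux/sched.h>

	/* Per-task state the grace-period waiter hangs off the watched task. */
	struct rcu_tasks_waiter {
		struct preempt_notifier notifier;
		struct completion done;
	};

	static void rcu_tasks_sched_in(struct preempt_notifier *pn, int cpu)
	{
		/* Only the sched-out side is interesting here. */
	}

	static void rcu_tasks_sched_out(struct preempt_notifier *pn,
					struct task_struct *next)
	{
		struct rcu_tasks_waiter *w =
			container_of(pn, struct rcu_tasks_waiter, notifier);

		/*
		 * The sleep condition: only a voluntary schedule() counts,
		 * a preemption can leave the IP sitting in a trampoline.
		 * The outgoing task is still 'current' at this point.
		 */
		if (current->state == TASK_RUNNING)
			return;

		/* Ignoring for now that this runs under the rq lock. */
		complete(&w->done);
	}

	static struct preempt_ops rcu_tasks_preempt_ops = {
		.sched_in	= rcu_tasks_sched_in,
		.sched_out	= rcu_tasks_sched_out,
	};

	/* Wait for @t to go through a voluntary context switch. */
	static void rcu_tasks_wait_for_task(struct task_struct *t)
	{
		struct rcu_tasks_waiter w;

		preempt_notifier_init(&w.notifier, &rcu_tasks_preempt_ops);
		init_completion(&w.done);

		/*
		 * Made up: preempt_notifier_register() today only attaches
		 * to current; this cross-task variant is the bit to add.
		 */
		cross_task_preempt_notifier_register(t, &w.notifier);

		wait_for_completion(&w.done);

		/* Also made up; see the ordering hand-waving above. */
		cross_task_preempt_notifier_unregister(t, &w.notifier);
	}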