* [RFC][PATCH] sched: Have do_idle() call __schedule() without enabling preemption
From: Steven Rostedt @ 2017-04-12  2:44 UTC
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Paul E. McKenney

Peter,

I hope Thomas didn't lend you his frozen shark thrower. ;-)

Let me explain the issue we have. A while back, I asked Paul if he
could add a special RCU case, where the quiescent states are: a task
enters user space (or is already there), is sleeping (off the run
queue), or voluntarily schedules. He implemented this for me such that
after synchronize_rcu_tasks() returns, every task has either been in a
quiescent state (user space or off a run queue) or has done a voluntary
schedule (called __schedule(false)).

The one issue is that this ignores the idle task, as the idle task is
never in user space and is always on the run queue. The only quiescent
state we could hope for is a voluntary call to __schedule(false). Thus
synchronize_rcu_tasks() ignores it.
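
To make that guarantee concrete, the usage pattern it enables looks
like this (a hedged sketch only; remove_hook() and free_trampoline()
are made-up names for illustration, not actual kernel code):

	/* Hypothetical pattern; remove_hook()/free_trampoline() are
	 * invented names, not the real ftrace functions. */
	remove_hook(func);		/* no new callers can enter the trampoline */
	synchronize_rcu_tasks();	/* every task has since been in user space,
					 * been off a run queue, or voluntarily
					 * called __schedule(false) */
	free_trampoline(tramp);		/* safe: no task can still be on it */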

The reason I need this is so that dynamically created ftrace hooks
(like what perf does when it uses function tracing) can use an
optimized dynamic trampoline. That means, if there's a single callback
for a function hook, a trampoline is created dynamically and that hook
(fentry) calls the dynamic trampoline directly. The dynamic trampoline
calls the function callback directly, without caring about any other
function hooks that may be registered to other functions.

If perf does function tracing and no other code is doing any function
tracing, then all the functions perf traces will call this dynamically
allocated trampoline, which calls the perf function tracer directly.
Otherwise they call the default trampoline, which runs a loop function
to iterate over all function callbacks registered with ftrace.
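
For reference, this is the sort of registration that gets a dynamic
trampoline (a sketch against the 4.11-era ftrace API, written from
memory, so treat the details loosely):

	static void my_tracer(unsigned long ip, unsigned long parent_ip,
			      struct ftrace_ops *ops, struct pt_regs *regs)
	{
		/* Invoked, via the trampoline, from each traced function. */
	}

	static struct ftrace_ops my_ops = {
		.func	= my_tracer,
	};

	/* With a single callback registered on its functions, ftrace can
	 * point fentry at a dynamically allocated trampoline that calls
	 * my_tracer directly, skipping the list-iterating default. */
	register_ftrace_function(&my_ops);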

The issue occurs when perf is done tracing. Now we need to free this
dynamically created trampoline, but to do that we must make sure no
task is still executing on it. One thing ftrace does after detaching
the hook from the functions and before freeing the trampoline is to
schedule on every CPU, which makes sure every CPU is out of any
preempt-disabled section. This is stronger than synchronize_sched(),
because ftrace hooks locations where RCU is not active (like going into
idle), so synchronize_sched() is not good enough. But even scheduling
on all CPUs is not good enough when the kernel itself is preemptible.
In that case we use synchronize_rcu_tasks(), which covers the cases
where a task has been preempted. Except for one!
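
The "schedule on every CPU" step is roughly this (a sketch; the real
ftrace helper is similar in spirit but may differ in name and detail):

	/* Empty work function: merely running it on a CPU forces that
	 * CPU through the scheduler, flushing any preempt-disabled
	 * section on a non-preemptible kernel. */
	static void ftrace_sync_work(struct work_struct *work)
	{
	}

	schedule_on_each_cpu(ftrace_sync_work);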

This brings us back to synchronize_rcu_tasks() ignoring the idle case.
Here is where I trigger it:

 BUG: unable to handle kernel paging request at ffffffffa0230077
 IP: 0xffffffffa0230077
 PGD 2414067 PUD 2415063 PMD c463c067 PTE 0
 
 Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
 Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000
 RIP: 0010:0xffffffffa0230077
 RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e
 RDX: dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298
 RBP: ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0
 R13: ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380
 FS:  0000000000000000(0000) GS:ffff8800d3900000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffa0230077 CR3: 0000000002413000 CR4: 00000000001406e0
 Call Trace:
  ? debug_smp_processor_id+0x17/0x20
  ? sched_ttwu_pending+0x79/0x190
  ? schedule+0x5/0xe0
  ? trace_hardirqs_on_caller+0x182/0x280
  schedule+0x5/0xe0
  schedule_preempt_disabled+0x18/0x30
  ? schedule+0x5/0xe0
  ? schedule_preempt_disabled+0x18/0x30
  do_idle+0x172/0x220

That RIP of ffffffffa0230077 is on the trampoline. The return address,
schedule+0x5, happens to be the return from the fentry call at the
start of schedule(). Notice the Comm: swapper/2.

I can trigger this pretty consistently with a few runs of my test
case. Looking at the code, I don't see any reason for idle to ever
enable preemption. It currently calls schedule_preempt_disabled(),
which enables preemption, calls schedule(), and then disables
preemption again.

schedule() calls sched_submit_work(), which returns immediately
because of the check for whether the task is running, and idle is
always running. Then it runs a loop of disabling preemption, calling
__schedule(), enabling preemption again, and checking need_resched().
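
For reference, the two functions in play, paraphrased from the
4.11-era kernel/sched/core.c (abridged from memory, so treat it as a
sketch rather than the literal source):

	void __sched schedule_preempt_disabled(void)
	{
		sched_preempt_enable_no_resched();	/* <-- the window */
		schedule();
		preempt_disable();
	}

	asmlinkage __visible void __sched schedule(void)
	{
		struct task_struct *tsk = current;

		sched_submit_work(tsk);	/* returns at once: idle is running */
		do {
			preempt_disable();
			__schedule(false);
			sched_preempt_enable_no_resched();
		} while (need_resched());
	}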

The above bug happens in schedule_preempt_disabled(): preemption is
enabled, then schedule() is called, which is traced by ftrace. An
interrupt comes in while the idle task is on the ftrace trampoline, and
when the interrupt returns, it schedules via the interrupt preemption
path and not via the call to schedule() itself, which I feel is
inefficient anyway, as idle is going to call schedule() again
regardless.

The trampoline is freed (because none of the synchronization methods
were able to detect that idle was preempted on a trampoline), and when
idle gets to run again, it crashes, as it is executing code that no
longer exists.

The solution I'm proposing here is to create a schedule_idle() call,
local to kernel/sched/, that do_idle() calls instead. It does basically
the same thing that schedule() does, except that it never enables
preemption: it calls __schedule() in a loop that checks need_resched().
Not only does this solve the bug I'm triggering when trying to free
dynamically created ftrace trampolines, it also makes the idle code a
bit more efficient by avoiding spurious schedule calls when interrupts
occur in the small window where preemption is enabled.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc0..cb9ceda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3502,6 +3502,13 @@ asmlinkage __visible void __sched schedule(void)
 }
 EXPORT_SYMBOL(schedule);
 
+void __sched schedule_idle(void)
+{
+	do {
+		__schedule(false);
+	} while (need_resched());
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING
 asmlinkage __visible void __sched schedule_user(void)
 {
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d517..229c17e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -264,7 +264,7 @@ static void do_idle(void)
 	smp_mb__after_atomic();
 
 	sched_ttwu_pending();
-	schedule_preempt_disabled();
+	schedule_idle();
 }
 
 bool cpu_in_idle(unsigned long pc)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5cbf922..c5ee02b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 }
 #endif
 
+extern void schedule_idle(void);
+
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);


* Re: [RFC][PATCH] sched: Have do_idle() call __schedule() without enabling preemption
From: Steven Rostedt @ 2017-04-12 14:41 UTC
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Paul E. McKenney

[ tl;dr; version ]

Peter, in order for synchronize_rcu_tasks() to work, the idle task
must never be preempted. There's a very small window in
schedule_preempt_disabled() that enables preemption, and when
preemption happens there, it breaks synchronize_rcu_tasks() (see the
above email for details).

Is there any reason to enable preemption, or can we simply have idle
call into schedule without ever allowing it to be preempted, as in my
patch?

Note, it would almost be good enough to just change
schedule_preempt_disabled() to do the exact same thing, but there's one
caller in kernel/locking/mutex.c that calls it in a non-running state.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc0..cb9ceda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3502,6 +3502,13 @@ asmlinkage __visible void __sched schedule(void)
 }
 EXPORT_SYMBOL(schedule);
 
+void __sched schedule_idle(void)
+{
+	do {
+		__schedule(false);
+	} while (need_resched());
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING
 asmlinkage __visible void __sched schedule_user(void)
 {
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d517..229c17e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -264,7 +264,7 @@ static void do_idle(void)
 	smp_mb__after_atomic();
 
 	sched_ttwu_pending();
-	schedule_preempt_disabled();
+	schedule_idle();
 }
 
 bool cpu_in_idle(unsigned long pc)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5cbf922..c5ee02b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 }
 #endif
 
+extern void schedule_idle(void);
+
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);


* Re: [RFC][PATCH] sched: Have do_idle() call __schedule() without enabling preemption
From: Peter Zijlstra @ 2017-04-12 16:21 UTC
  To: Steven Rostedt; +Cc: LKML, Ingo Molnar, Paul E. McKenney

On Wed, Apr 12, 2017 at 10:41:24AM -0400, Steven Rostedt wrote:
> [ tl;dr; version ]
> 
> Peter, in order for synchronize_rcu_tasks() to work, the idle task
> must never be preempted. There's a very small window in
> schedule_preempt_disabled() that enables preemption, and when
> preemption happens there, it breaks synchronize_rcu_tasks() (see the
> above email for details).
> 
> Is there any reason to enable preemption, or can we simply have idle
> call into schedule without ever allowing it to be preempted, as in my
> patch?

Dunno,.. this changelog should convince me, not make me do the work :-)

> Note, it would almost be good enough to just change
> schedule_preempt_disabled() to do the exact same thing, but there's one
> caller in kernel/locking/mutex.c that calls it in a non-running state.

The point being that that must not happen because of
sched_submit_work(), which, for idle, should not matter.
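
That is, sched_submit_work() bails out for a running task
(paraphrasing the 4.11-era code from memory, so consider it a sketch):

	static inline void sched_submit_work(struct task_struct *tsk)
	{
		if (!tsk->state || tsk_is_pi_blocked(tsk))
			return;
		/*
		 * If we are going to sleep and we have plugged IO
		 * queued, make sure to submit it to avoid deadlocks.
		 */
		if (blk_needs_flush_plug(tsk))
			blk_schedule_flush_plug(tsk);
	}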


* Re: [RFC][PATCH] sched: Have do_idle() call __schedule() without enabling preemption
From: Steven Rostedt @ 2017-04-12 16:43 UTC
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Paul E. McKenney

On Wed, 12 Apr 2017 18:21:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 12, 2017 at 10:41:24AM -0400, Steven Rostedt wrote:
> > [ tl;dr; version ]
> > 
> > Peter, in order for synchronize_rcu_tasks() to work, the idle task
> > must never be preempted. There's a very small window in
> > schedule_preempt_disabled() that enables preemption, and when
> > preemption happens there, it breaks synchronize_rcu_tasks() (see the
> > above email for details).
> > 
> > Is there any reason to enable preemption, or can we simply have idle
> > call into schedule without ever allowing it to be preempted, as in my
> > patch?  
> 
> Dunno,.. this changelog should convince me, not make me do the work :-)

I didn't want you to do any work; I was wondering if you knew of any
reason. But looking into the history of the do_idle() function and
schedule_preempt_disabled(), I don't see any reason. I'll post another
patch with a better changelog.

> 
> > Note, it would almost be good enough to just change
> > schedule_preempt_disabled() to do the exact same thing, but there's
> > one caller in kernel/locking/mutex.c that calls it in a non-running
> > state.
> 
> The point being that that must not happen because of
> sched_submit_work(), which, for idle, should not matter.

Right, and that also holds for the other callers of
schedule_preempt_disabled(), which include init/main.c and another
location in mutex.c that sets the task state to TASK_RUNNING just
before calling it.

But for the one case where the state can be something other than
TASK_RUNNING, we need to keep the current method.
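
The problematic pattern is along these lines (a sketch, not the
literal mutex.c code):

	/* The caller is in a non-running state when it calls
	 * schedule_preempt_disabled(), so it must go through schedule()
	 * and thus sched_submit_work(). */
	set_current_state(TASK_UNINTERRUPTIBLE);
	/* ... drop the wait lock ... */
	schedule_preempt_disabled();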

OK, I think it is fine for the idle task to never enable preemption,
and I'll post a patch with a better changelog.

Thanks!

-- Steve

