Date: Wed, 12 Apr 2017 16:48:59 -0700
From: "Paul E. McKenney"
To: Steven Rostedt
Cc: Peter Zijlstra, LKML, Ingo Molnar, Andrew Morton, Thomas Gleixner
Subject: Re: [PATCH v2] sched: Have do_idle() call __schedule() without enabling preemption
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20170412142744.6ff07236@gandalf.local.home>
Message-Id: <20170412234859.GQ3956@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 12, 2017 at 02:27:44PM
-0400, Steven Rostedt wrote:
> From: Steven Rostedt (VMware)
> 
> I finally got around to creating trampolines for dynamically allocated
> ftrace_ops using synchronize_rcu_tasks(). For users of the ftrace
> function hook callbacks, like perf, that allocate the ftrace_ops
> descriptor via kmalloc() and friends, ftrace was not able to optimize
> the functions being traced to use a trampoline, because the trampoline
> would also need to be allocated dynamically. The problem is that a
> dynamically allocated trampoline cannot be freed when CONFIG_PREEMPT is
> set, as there's no way to tell if a task was preempted on the
> trampoline. That was before Paul McKenney implemented
> synchronize_rcu_tasks(), which makes sure that all tasks (except idle)
> have scheduled out or have entered user space.
> 
> While testing this, I triggered this bug:
> 
>  BUG: unable to handle kernel paging request at ffffffffa0230077
>  IP: 0xffffffffa0230077
>  PGD 2414067
>  PUD 2415063
>  PMD c463c067
>  PTE 0
> 
>  Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>  Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi
>   snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>   snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel
>   i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer
>   drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops
>   drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core
>  CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
>  task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000
>  RIP: 0010:0xffffffffa0230077
>  RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
>  RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e
>  RDX: dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298
>  RBP: ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000
>  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0
>  R13: ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380
>  FS:  0000000000000000(0000) GS:ffff8800d3900000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000
>  CR0: 0000000080050033 CR2: ffffffffa0230077 CR3: 0000000002413000 CR4: 00000000001406e0
>  Call Trace:
>   ? debug_smp_processor_id+0x17/0x20
>   ? sched_ttwu_pending+0x79/0x190
>   ? schedule+0x5/0xe0
>   ? trace_hardirqs_on_caller+0x182/0x280
>   schedule+0x5/0xe0
>   schedule_preempt_disabled+0x18/0x30
>   ? schedule+0x5/0xe0
>   ? schedule_preempt_disabled+0x18/0x30
>   do_idle+0x172/0x220
> 
> What happened was that the idle task was preempted on the trampoline.
> As synchronize_rcu_tasks() ignores the idle thread, there's nothing
> that lets ftrace know that the idle task was preempted on a trampoline.
> 
> The idle task should never need to enable preemption. The idle task is
> simply a loop that calls schedule or places the CPU into idle mode.
> In fact, having preemption enabled is inefficient, because an interrupt
> can arrive just as idle is about to call schedule anyway, which causes
> schedule to be called twice: once when the interrupt returns to normal
> context, and then again in the normal path the idle loop is running in,
> which is pointless, as it has already scheduled.
> 
> Adding a new function, local to kernel/sched/, that allows idle to call
> the scheduler without enabling preemption fixes the
> synchronize_rcu_tasks() issue, and also removes the pointless spurious
> schedule caused by an interrupt arriving in the brief window where
> preemption is enabled just before idle calls schedule.
> 
> Cc: "Paul E. McKenney"
> Signed-off-by: Steven Rostedt (VMware)

I don't understand why schedule_idle() needs a loop given that
schedule_preempt_disabled() doesn't have one, but other than that, for
whatever it is worth, it looks good to me.
							Thanx, Paul

> ---
> Changes from v1:
> 
>  o Added Thomas to Cc as he wrote the original generic idle code
> 
>  o Added comment to why schedule_idle() exists
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b31fc0..0788286 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3502,6 +3502,23 @@ asmlinkage __visible void __sched schedule(void)
>  }
>  EXPORT_SYMBOL(schedule);
>  
> +/*
> + * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
> + * state (have scheduled out non-voluntarily) by making sure that all
> + * tasks have either left the run queue or have gone into user space.
> + * As idle tasks do not do either, they must not ever be preempted
> + * (schedule out non-voluntarily).
> + *
> + * schedule_idle() is similar to schedule_preempt_disabled() except
> + * that it never enables preemption.
> + */
> +void __sched schedule_idle(void)
> +{
> +	do {
> +		__schedule(false);
> +	} while (need_resched());
> +}
> +
>  #ifdef CONFIG_CONTEXT_TRACKING
>  asmlinkage __visible void __sched schedule_user(void)
>  {
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index ac6d517..229c17e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -264,7 +264,7 @@ static void do_idle(void)
>  		smp_mb__after_atomic();
>  
>  		sched_ttwu_pending();
> -		schedule_preempt_disabled();
> +		schedule_idle();
>  	}
>  
>  bool cpu_in_idle(unsigned long pc)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5cbf922..c5ee02b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>  }
>  #endif
>  
> +extern void schedule_idle(void);
> +
>  extern void sysrq_sched_debug_show(void);
>  extern void sched_init_granularity(void);
>  extern void update_max_interval(void);