linux-kernel.vger.kernel.org archive mirror
* [GIT PULL] sched/urgent for 5.17-rc4
@ 2022-02-13 12:37 Borislav Petkov
  2022-02-13 18:02 ` Linus Torvalds
  2022-02-13 18:04 ` [GIT PULL] sched/urgent for 5.17-rc4 pr-tracker-bot
  0 siblings, 2 replies; 14+ messages in thread
From: Borislav Petkov @ 2022-02-13 12:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: x86-ml, lkml

Hi Linus,

please pull a single urgent scheduling fix for 5.17.

Thx.

---

The following changes since commit 26291c54e111ff6ba87a164d85d4a4e134b7315c:

  Linux 5.17-rc2 (2022-01-30 15:37:07 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/sched_urgent_for_v5.17_rc4

for you to fetch changes up to 13765de8148f71fa795e0a6607de37c49ea5915a:

  sched/fair: Fix fault in reweight_entity (2022-02-06 22:37:26 +0100)

----------------------------------------------------------------
- Fix a NULL-ptr dereference when recalculating a sched entity's weight

----------------------------------------------------------------
Tadeusz Struk (1):
      sched/fair: Fix fault in reweight_entity

 kernel/sched/core.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
-- 
Regards/Gruss,
    Boris.

SUSE Software Solutions Germany GmbH, GF: Ivo Totev, HRB 36809, AG Nürnberg

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-13 12:37 [GIT PULL] sched/urgent for 5.17-rc4 Borislav Petkov
@ 2022-02-13 18:02 ` Linus Torvalds
  2022-02-14  8:45   ` Peter Zijlstra
  2022-02-13 18:04 ` [GIT PULL] sched/urgent for 5.17-rc4 pr-tracker-bot
  1 sibling, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2022-02-13 18:02 UTC (permalink / raw)
  To: Borislav Petkov, Tadeusz Struk, Peter Zijlstra (Intel); +Cc: x86-ml, lkml

On Sun, Feb 13, 2022 at 4:37 AM Borislav Petkov <bp@suse.de> wrote:
>
> Tadeusz Struk (1):
>       sched/fair: Fix fault in reweight_entity

I've pulled this, but this really smells bad to me.

If set_load_weight() can see a process that hasn't even had the
runqueue pointer set yet, then what keeps *others* from the same
thing?

Adding a test like this in set_load_weight() just makes me go "what
makes this function so special"? IOW, why could only that function see
this situation with a missing cfs_rq pointer?

I really get the feeling that this is papering over a serious mistake
in how commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") now causes fundamental process state to be
initialized too late - when the process is already visible to others.

The setpriority() -> dequeue_load_avg() chain just seems to be one
possible case.
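
(A heavily abbreviated sketch of that chain, as pseudo-C condensed from the
5.17-rc2 code paths; intermediate helpers and error handling are elided, so
treat the exact calls as illustrative rather than authoritative:)

	/* forking CPU */
	copy_process()
	  sched_fork(p)                /* prio/weight set; no runqueue or cgroup links yet */
	  ...
	  attach_pid(p, PIDTYPE_PID)   /* p becomes findable through the pidhash */
	  ...
	  sched_post_fork(p, args)     /* only here: sched_task_group + __set_task_cpu() */

	/* any other CPU, once p is in the pidhash */
	setpriority(PRIO_PROCESS, who, nice)
	  p = find_task_by_vpid(who)
	  set_user_nice(p, nice)
	    set_load_weight(p)
	      reweight_entity(...)
	        dequeue_load_avg(...)  /* faults: p's cfs_rq is not set up yet */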

*ANYBODY* that does find_task_by_vpid(who) would seem to be able to
find a task that hasn't actually been fully set up yet, and ...

Somebody tell me why I'm wrong, and what makes that setpriority thing
so magically special. Please.

                    Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-13 12:37 [GIT PULL] sched/urgent for 5.17-rc4 Borislav Petkov
  2022-02-13 18:02 ` Linus Torvalds
@ 2022-02-13 18:04 ` pr-tracker-bot
  1 sibling, 0 replies; 14+ messages in thread
From: pr-tracker-bot @ 2022-02-13 18:04 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linus Torvalds, x86-ml, lkml

The pull request you sent on Sun, 13 Feb 2022 13:37:17 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/sched_urgent_for_v5.17_rc4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6f3573672324b6391014680dd6e2cf7298aaea22

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-13 18:02 ` Linus Torvalds
@ 2022-02-14  8:45   ` Peter Zijlstra
  2022-02-14  9:16     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2022-02-14  8:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, zhangqiao22, tj,
	dietmar.eggemann

On Sun, Feb 13, 2022 at 10:02:22AM -0800, Linus Torvalds wrote:
> On Sun, Feb 13, 2022 at 4:37 AM Borislav Petkov <bp@suse.de> wrote:
> >
> > Tadeusz Struk (1):
> >       sched/fair: Fix fault in reweight_entity
> 
> I've pulled this, but this really smells bad to me.
> 
> If set_load_weight() can see a process that hasn't even had the
> runqueue pointer set yet, then what keeps *others* from the same
> thing?

Urgh, I think you're right, the moment we enter the pidhash and become
visible we should be complete. That means the previous commit
(4ef0c5c6b5ba) is buggered... Let me try and make sense of all that
cgroup stuff again :-(

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-14  8:45   ` Peter Zijlstra
@ 2022-02-14  9:16     ` Peter Zijlstra
  2022-02-14 19:17       ` Tejun Heo
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Peter Zijlstra @ 2022-02-14  9:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, zhangqiao22, tj,
	dietmar.eggemann

On Mon, Feb 14, 2022 at 09:45:22AM +0100, Peter Zijlstra wrote:
> On Sun, Feb 13, 2022 at 10:02:22AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 13, 2022 at 4:37 AM Borislav Petkov <bp@suse.de> wrote:
> > >
> > > Tadeusz Struk (1):
> > >       sched/fair: Fix fault in reweight_entity
> > 
> > I've pulled this, but this really smells bad to me.
> > 
> > If set_load_weight() can see a process that hasn't even had the
> > runqueue pointer set yet, then what keeps *others* from the same
> > thing?
> 
> Urgh, I think you're right, the moment we enter the pidhash and become
> visible we should be complete. That means the previous commit
> (4ef0c5c6b5ba) is buggered... Let me try and make sense of all that
> cgroup stuff again :-(

Zhang, Tadeusz, TJ, how does this look?

---
 include/linux/sched/task.h |  4 ++--
 kernel/fork.c              |  9 ++++++++-
 kernel/sched/core.c        | 34 +++++++++++++++++++++-------------
 3 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b9198a1b3a84..e84e54d1b490 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_post_fork(struct task_struct *p,
-			    struct kernel_clone_args *kargs);
+extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..05faebafe2b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2266,6 +2266,13 @@ static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_put_pidfd;
 
+	/*
+	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
+	 * the new task on the correct runqueue. All this *before* the task
+	 * becomes visible.
+	 */
+	sched_cgroup_fork(p, args);
+
 	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
@@ -2376,7 +2383,7 @@ static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	sched_post_fork(p, args);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcf0c180617c..dd97a42b1eee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1214,9 +1214,8 @@ int tg_nop(struct task_group *tg, void *data)
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
-	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
 
@@ -4407,7 +4406,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = p->static_prio;
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -4425,6 +4424,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	init_entity_runnable_average(&p->se);
 
+
 #ifdef CONFIG_SCHED_INFO
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -4440,18 +4440,23 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
-void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
 	unsigned long flags;
-#ifdef CONFIG_CGROUP_SCHED
-	struct task_group *tg;
-#endif
 
+	/*
+	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
+	 * required yet, but lockdep gets upset if rules are violated.
+	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
-			  struct task_group, css);
-	p->sched_task_group = autogroup_task_group(p, tg);
+	if (1) {
+		struct task_group *tg;
+		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
+				  struct task_group, css);
+		tg = autogroup_task_group(p, tg);
+		p->sched_task_group = autogroup_task_group(p, tg);
+	}
 #endif
 	rseq_migrate(p);
 	/*
@@ -4462,7 +4467,10 @@ void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+}
 
+void sched_post_fork(struct task_struct *p)
+{
 	uclamp_post_fork(p);
 }
 
@@ -6922,7 +6930,7 @@ void set_user_nice(struct task_struct *p, long nice)
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 
@@ -7213,7 +7221,7 @@ static void __setscheduler_params(struct task_struct *p,
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /*
@@ -9446,7 +9454,7 @@ void __init sched_init(void)
 #endif
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-14  9:16     ` Peter Zijlstra
@ 2022-02-14 19:17       ` Tejun Heo
  2022-02-14 19:35         ` Peter Zijlstra
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
  2022-02-19 10:14       ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2022-02-14 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Tadeusz Struk, x86-ml, lkml,
	zhangqiao22, dietmar.eggemann

Hello, Peter.

On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d75a528f7b21..05faebafe2b5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2266,6 +2266,13 @@ static __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_put_pidfd;
>  
> +	/*
> +	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
> +	 * the new task on the correct runqueue. All this *before* the task
> +	 * becomes visible.
> +	 */
> +	sched_cgroup_fork(p, args);

Would it be less confusing to comment that this isn't ->can_fork() because
scheduler task_group needs to be initialized for autogroup even when cgroup
is disabled and maybe name it sched_cgroup_can_fork() even if it always
succeeds?

> +void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
>  {
>  	unsigned long flags;
> -#ifdef CONFIG_CGROUP_SCHED
> -	struct task_group *tg;
> -#endif
>  
> +	/*
> +	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
> +	 * required yet, but lockdep gets upset if rules are violated.
> +	 */
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  #ifdef CONFIG_CGROUP_SCHED
> -	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> -			  struct task_group, css);
> -	p->sched_task_group = autogroup_task_group(p, tg);
> +	if (1) {
> +		struct task_group *tg;
> +		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> +				  struct task_group, css);
> +		tg = autogroup_task_group(p, tg);
> +		p->sched_task_group = autogroup_task_group(p, tg);
> +	}

I suppose the double autogroup_task_group() call is unintentional?

Otherwise, looks good to me. The only requirement from cgroup side is that
the membership should be initialized between ->can_fork() and ->fork()
inclusively, and sans autogroup this would have been done as a part of
->can_fork() so the proposed change makes sense to me.
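
(For readers less familiar with the fork path, a rough sketch of how that
window lines up inside copy_process() once the proposed patch is applied;
call names abbreviated, error paths omitted, so this is a reading aid rather
than the literal code:)

	retval = cgroup_can_fork(p, args); /* ->can_fork(): pins the css_set in args->cset */
	...
	sched_cgroup_fork(p, args);        /* task_group + runqueue links, task still invisible */
	...
	/* tasklist_lock section: the task becomes visible (pidhash, thread lists) */
	...
	sched_post_fork(p);                /* what remains of the old hook: uclamp_post_fork() */
	cgroup_post_fork(p, args);         /* cgroup ->fork() callbacks */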

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-14 19:17       ` Tejun Heo
@ 2022-02-14 19:35         ` Peter Zijlstra
  2022-02-14 19:46           ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2022-02-14 19:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Borislav Petkov, Tadeusz Struk, x86-ml, lkml,
	zhangqiao22, dietmar.eggemann

On Mon, Feb 14, 2022 at 09:17:12AM -1000, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index d75a528f7b21..05faebafe2b5 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2266,6 +2266,13 @@ static __latent_entropy struct task_struct *copy_process(
> >  	if (retval)
> >  		goto bad_fork_put_pidfd;
> >  
> > +	/*
> > +	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
> > +	 * the new task on the correct runqueue. All this *before* the task
> > +	 * becomes visible.
> > +	 */
> > +	sched_cgroup_fork(p, args);
> 
> Would it be less confusing to comment that this isn't ->can_fork() because
> scheduler task_group needs to be initialized for autogroup even when cgroup
> is disabled and maybe name it sched_cgroup_can_fork() even if it always
> succeeds?

So there's two things that need doing; the re-cloning of the task_group
thing, but also calling of __set_task_cpu() which sets up the proper
runqueue links.

The first is CGroup only, and *could* in theory be done in ->can_fork(),
but the second needs to be done unconditionally, and it doesn't make
much sense to split this up.

I actually tried, but it made the patch bigger/uglier -- but maybe I
didn't try hard enough.
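
(For reference, the unconditional part sits in the unchanged context between
the two hunks above; roughly, with the comment paraphrased:)

	rseq_migrate(p);
	/*
	 * First CPU placement for the new task; there is no migration
	 * involved, so __set_task_cpu() is enough to establish the
	 * runqueue links before the task becomes visible.
	 */
	__set_task_cpu(p, smp_processor_id());
	if (p->sched_class->task_fork)
		p->sched_class->task_fork(p);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);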

> > +void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
> >  {
> >  	unsigned long flags;
> > -#ifdef CONFIG_CGROUP_SCHED
> > -	struct task_group *tg;
> > -#endif
> >  
> > +	/*
> > +	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
> > +	 * required yet, but lockdep gets upset if rules are violated.
> > +	 */
> >  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> >  #ifdef CONFIG_CGROUP_SCHED
> > -	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> > -			  struct task_group, css);
> > -	p->sched_task_group = autogroup_task_group(p, tg);
> > +	if (1) {
> > +		struct task_group *tg;
> > +		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> > +				  struct task_group, css);
> > +		tg = autogroup_task_group(p, tg);
> > +		p->sched_task_group = autogroup_task_group(p, tg);
> > +	}
> 
> I suppose the double autogroup_task_group() call is unintentional?

Yeah, that's a silly fail. Will amend.

> Otherwise, looks good to me. The only requirement from cgroup side is that
> the membership should be initialized between ->can_fork() and ->fork()
> inclusively, and sans autogroup this would have been done as a part of
> ->can_fork() so the proposed change makes sense to me.

Thanks! I suppose I should go write me a Changelog then... assuming it
actually works :-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [GIT PULL] sched/urgent for 5.17-rc4
  2022-02-14 19:35         ` Peter Zijlstra
@ 2022-02-14 19:46           ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2022-02-14 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Tadeusz Struk, x86-ml, lkml,
	zhangqiao22, dietmar.eggemann

On Mon, Feb 14, 2022 at 08:35:54PM +0100, Peter Zijlstra wrote:
> So there's two things that need doing; the re-cloning of the task_group
> thing, but also calling of __set_task_cpu() which sets up the proper
> runqueue links.
> 
> The first is CGroup only, and *could* in theory be done in ->can_fork(),
> but the second needs to be done unconditionally, and it doesn't make
> much sense to split this up.
> 
> I actually tried, but it made the patch bigger/uglier -- but maybe I
> didn't try hard enough.

If there needs to be a separate call anyway, it prolly is better to keep
them together. It'd just be nice to mention why cpu is different in this
regard in a comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] sched: Fix yet more sched_fork() races
  2022-02-14  9:16     ` Peter Zijlstra
  2022-02-14 19:17       ` Tejun Heo
@ 2022-02-17  8:51       ` Peter Zijlstra
  2022-02-17 10:50         ` Dietmar Eggemann
                           ` (3 more replies)
  2022-02-19 10:14       ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2 siblings, 4 replies; 14+ messages in thread
From: Peter Zijlstra @ 2022-02-17  8:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, zhangqiao22, tj,
	dietmar.eggemann

On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
> Zhang, Tadeusz, TJ, how does this look?

*sigh* I was hoping for some Tested-by, since I've no idea how to
operate this cgroup stuff properly.

Anyway, full patch below. I'll go stick it in sched/urgent.

---
Subject: sched: Fix yet more sched_fork() races
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 14 Feb 2022 10:16:57 +0100

Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.

Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.

Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/task.h |    4 ++--
 kernel/fork.c              |   13 ++++++++++++-
 kernel/sched/core.c        |   34 +++++++++++++++++++++-------------
 3 files changed, 35 insertions(+), 16 deletions(-)

--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(str
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_post_fork(struct task_struct *p,
-			    struct kernel_clone_args *kargs);
+extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2266,6 +2266,17 @@ static __latent_entropy struct task_stru
 		goto bad_fork_put_pidfd;
 
 	/*
+	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
+	 * the new task on the correct runqueue. All this *before* the task
+	 * becomes visible.
+	 *
+	 * This isn't part of ->can_fork() because while the re-cloning is
+	 * cgroup specific, it unconditionally needs to place the task on a
+	 * runqueue.
+	 */
+	sched_cgroup_fork(p, args);
+
+	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
 	 * not want user-space to be able to predict the process start-time by
@@ -2375,7 +2386,7 @@ static __latent_entropy struct task_stru
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	sched_post_fork(p, args);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1215,9 +1215,8 @@ int tg_nop(struct task_group *tg, void *
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
-	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
 
@@ -4408,7 +4407,7 @@ int sched_fork(unsigned long clone_flags
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = p->static_prio;
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -4426,6 +4425,7 @@ int sched_fork(unsigned long clone_flags
 
 	init_entity_runnable_average(&p->se);
 
+
 #ifdef CONFIG_SCHED_INFO
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -4441,18 +4441,23 @@ int sched_fork(unsigned long clone_flags
 	return 0;
 }
 
-void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
 	unsigned long flags;
-#ifdef CONFIG_CGROUP_SCHED
-	struct task_group *tg;
-#endif
 
+	/*
+	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
+	 * required yet, but lockdep gets upset if rules are violated.
+	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
-			  struct task_group, css);
-	p->sched_task_group = autogroup_task_group(p, tg);
+	if (1) {
+		struct task_group *tg;
+		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
+				  struct task_group, css);
+		tg = autogroup_task_group(p, tg);
+		p->sched_task_group = tg;
+	}
 #endif
 	rseq_migrate(p);
 	/*
@@ -4463,7 +4468,10 @@ void sched_post_fork(struct task_struct
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+}
 
+void sched_post_fork(struct task_struct *p)
+{
 	uclamp_post_fork(p);
 }
 
@@ -6923,7 +6931,7 @@ void set_user_nice(struct task_struct *p
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 
@@ -7214,7 +7222,7 @@ static void __setscheduler_params(struct
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /*
@@ -9447,7 +9455,7 @@ void __init sched_init(void)
 #endif
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched: Fix yet more sched_fork() races
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
@ 2022-02-17 10:50         ` Dietmar Eggemann
  2022-02-17 12:08         ` Zhang Qiao
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2022-02-17 10:50 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, zhangqiao22, tj

On 17/02/2022 09:51, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
>> Zhang, Tadeusz, TJ, how does this look?
> 
> *sigh* I was hoping for some Tested-by, since I've no idea how to
> operate this cgroup stuff properly.

At least the reproducer from
https://syzkaller.appspot.com/bug?id=9d9c27adc674e3a7932b22b61c79a02da82cbdc1
doesn't barf with this patch but it does on 5.17-rc2 on Arm64 Juno-r0.

sched: Fix yet more sched_fork() races (2022-02-17 Peter Zijlstra)
13765de8148f - sched/fair: Fix fault in reweight_entity (2022-02-06 Tadeusz Struk)
26291c54e111 - Linux 5.17-rc2 (2022-01-30 Linus Torvalds)

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched: Fix yet more sched_fork() races
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
  2022-02-17 10:50         ` Dietmar Eggemann
@ 2022-02-17 12:08         ` Zhang Qiao
  2022-02-17 17:33         ` Tadeusz Struk
  2022-02-18  6:18         ` Zhang Qiao
  3 siblings, 0 replies; 14+ messages in thread
From: Zhang Qiao @ 2022-02-17 12:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, tj,
	dietmar.eggemann, Linus Torvalds



On 2022/2/17 16:51, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
>> Zhang, Tadeusz, TJ, how does this look?

Sorry for not noticing the emails.

> 
> *sigh* I was hoping for some Tested-by, since I've no idea how to
> operate this cgroup stuff properly.

I'll apply this patch and run the previous test suite.

--

Qiao.

> Anyway, full patch below. I'll go stick it in sched/urgent.
> 
> ---
> Subject: sched: Fix yet more sched_fork() races
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon, 14 Feb 2022 10:16:57 +0100
> 
> Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
> invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
> race vs syscalls by not placing the task on the runqueue before it
> gets exposed through the pidhash.
> 
> Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
> trying to fix a single instance of this, instead fix the whole class
> of issues, effectively reverting this commit.
> 
> Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/task.h |    4 ++--
>  kernel/fork.c              |   13 ++++++++++++-
>  kernel/sched/core.c        |   34 +++++++++++++++++++++-------------
>  3 files changed, 35 insertions(+), 16 deletions(-)
> 
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(str
>  extern void init_idle(struct task_struct *idle, int cpu);
>  
>  extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
> -extern void sched_post_fork(struct task_struct *p,
> -			    struct kernel_clone_args *kargs);
> +extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
> +extern void sched_post_fork(struct task_struct *p);
>  extern void sched_dead(struct task_struct *p);
>  
>  void __noreturn do_task_dead(void);
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2266,6 +2266,17 @@ static __latent_entropy struct task_stru
>  		goto bad_fork_put_pidfd;
>  
>  	/*
> +	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
> +	 * the new task on the correct runqueue. All this *before* the task
> +	 * becomes visible.
> +	 *
> +	 * This isn't part of ->can_fork() because while the re-cloning is
> +	 * cgroup specific, it unconditionally needs to place the task on a
> +	 * runqueue.
> +	 */
> +	sched_cgroup_fork(p, args);
> +
> +	/*
>  	 * From this point on we must avoid any synchronous user-space
>  	 * communication until we take the tasklist-lock. In particular, we do
>  	 * not want user-space to be able to predict the process start-time by
> @@ -2375,7 +2386,7 @@ static __latent_entropy struct task_stru
>  	write_unlock_irq(&tasklist_lock);
>  
>  	proc_fork_connector(p);
> -	sched_post_fork(p, args);
> +	sched_post_fork(p);
>  	cgroup_post_fork(p, args);
>  	perf_event_fork(p);
>  
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1215,9 +1215,8 @@ int tg_nop(struct task_group *tg, void *
>  }
>  #endif
>  
> -static void set_load_weight(struct task_struct *p)
> +static void set_load_weight(struct task_struct *p, bool update_load)
>  {
> -	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
>  	int prio = p->static_prio - MAX_RT_PRIO;
>  	struct load_weight *load = &p->se.load;
>  
> @@ -4408,7 +4407,7 @@ int sched_fork(unsigned long clone_flags
>  			p->static_prio = NICE_TO_PRIO(0);
>  
>  		p->prio = p->normal_prio = p->static_prio;
> -		set_load_weight(p);
> +		set_load_weight(p, false);
>  
>  		/*
>  		 * We don't need the reset flag anymore after the fork. It has
> @@ -4426,6 +4425,7 @@ int sched_fork(unsigned long clone_flags
>  
>  	init_entity_runnable_average(&p->se);
>  
> +
>  #ifdef CONFIG_SCHED_INFO
>  	if (likely(sched_info_on()))
>  		memset(&p->sched_info, 0, sizeof(p->sched_info));
> @@ -4441,18 +4441,23 @@ int sched_fork(unsigned long clone_flags
>  	return 0;
>  }
>  
> -void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
> +void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
>  {
>  	unsigned long flags;
> -#ifdef CONFIG_CGROUP_SCHED
> -	struct task_group *tg;
> -#endif
>  
> +	/*
> +	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
> +	 * required yet, but lockdep gets upset if rules are violated.
> +	 */
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  #ifdef CONFIG_CGROUP_SCHED
> -	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> -			  struct task_group, css);
> -	p->sched_task_group = autogroup_task_group(p, tg);
> +	if (1) {
> +		struct task_group *tg;
> +		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> +				  struct task_group, css);
> +		tg = autogroup_task_group(p, tg);
> +		p->sched_task_group = tg;
> +	}
>  #endif
>  	rseq_migrate(p);
>  	/*
> @@ -4463,7 +4468,10 @@ void sched_post_fork(struct task_struct
>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +}
>  
> +void sched_post_fork(struct task_struct *p)
> +{
>  	uclamp_post_fork(p);
>  }
>  
> @@ -6923,7 +6931,7 @@ void set_user_nice(struct task_struct *p
>  		put_prev_task(rq, p);
>  
>  	p->static_prio = NICE_TO_PRIO(nice);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  	old_prio = p->prio;
>  	p->prio = effective_prio(p);
>  
> @@ -7214,7 +7222,7 @@ static void __setscheduler_params(struct
>  	 */
>  	p->rt_priority = attr->sched_priority;
>  	p->normal_prio = normal_prio(p);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  }
>  
>  /*
> @@ -9447,7 +9455,7 @@ void __init sched_init(void)
>  #endif
>  	}
>  
> -	set_load_weight(&init_task);
> +	set_load_weight(&init_task, false);
>  
>  	/*
>  	 * The boot idle thread does lazy MMU switching as well:
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched: Fix yet more sched_fork() races
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
  2022-02-17 10:50         ` Dietmar Eggemann
  2022-02-17 12:08         ` Zhang Qiao
@ 2022-02-17 17:33         ` Tadeusz Struk
  2022-02-18  6:18         ` Zhang Qiao
  3 siblings, 0 replies; 14+ messages in thread
From: Tadeusz Struk @ 2022-02-17 17:33 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Borislav Petkov, x86-ml, lkml, zhangqiao22, tj, dietmar.eggemann

On 2/17/22 00:51, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
>> Zhang, Tadeusz, TJ, how does this look?
> 
> *sigh* I was hoping for some Tested-by, since I've no idea how to
> operate this cgroup stuff properly.
> 
> Anyway, full patch below. I'll go stick it in sched/urgent.

Just tested it on 5.17.0-rc4 and it looks ok.

Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>

-- 
Thanks,
Tadeusz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched: Fix yet more sched_fork() races
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
                           ` (2 preceding siblings ...)
  2022-02-17 17:33         ` Tadeusz Struk
@ 2022-02-18  6:18         ` Zhang Qiao
  3 siblings, 0 replies; 14+ messages in thread
From: Zhang Qiao @ 2022-02-18  6:18 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, tj, dietmar.eggemann



On 2022/2/17 16:51, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
>> Zhang, Tadeusz, TJ, how does this look?
> 
> *sigh* I was hoping for some Tested-by, since I've no idea how to
> operate this cgroup stuff properly.

tested this patch successfully.

Tested-by: Zhang Qiao <zhangqiao22@huawei.com>


> 
> Anyway, full patch below. I'll go stick it in sched/urgent.
> 
> ---
> Subject: sched: Fix yet more sched_fork() races
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon, 14 Feb 2022 10:16:57 +0100
> 
> Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
> invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
> race vs syscalls by not placing the task on the runqueue before it
> gets exposed through the pidhash.
> 
> Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
> trying to fix a single instance of this, instead fix the whole class
> of issues, effectively reverting this commit.
> 
> Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/task.h |    4 ++--
>  kernel/fork.c              |   13 ++++++++++++-
>  kernel/sched/core.c        |   34 +++++++++++++++++++++-------------
>  3 files changed, 35 insertions(+), 16 deletions(-)
> 
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(str
>  extern void init_idle(struct task_struct *idle, int cpu);
>  
>  extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
> -extern void sched_post_fork(struct task_struct *p,
> -			    struct kernel_clone_args *kargs);
> +extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
> +extern void sched_post_fork(struct task_struct *p);
>  extern void sched_dead(struct task_struct *p);
>  
>  void __noreturn do_task_dead(void);
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2266,6 +2266,17 @@ static __latent_entropy struct task_stru
>  		goto bad_fork_put_pidfd;
>  
>  	/*
> +	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
> +	 * the new task on the correct runqueue. All this *before* the task
> +	 * becomes visible.
> +	 *
> +	 * This isn't part of ->can_fork() because while the re-cloning is
> +	 * cgroup specific, it unconditionally needs to place the task on a
> +	 * runqueue.
> +	 */
> +	sched_cgroup_fork(p, args);
> +
> +	/*
>  	 * From this point on we must avoid any synchronous user-space
>  	 * communication until we take the tasklist-lock. In particular, we do
>  	 * not want user-space to be able to predict the process start-time by
> @@ -2375,7 +2386,7 @@ static __latent_entropy struct task_stru
>  	write_unlock_irq(&tasklist_lock);
>  
>  	proc_fork_connector(p);
> -	sched_post_fork(p, args);
> +	sched_post_fork(p);
>  	cgroup_post_fork(p, args);
>  	perf_event_fork(p);
>  
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1215,9 +1215,8 @@ int tg_nop(struct task_group *tg, void *
>  }
>  #endif
>  
> -static void set_load_weight(struct task_struct *p)
> +static void set_load_weight(struct task_struct *p, bool update_load)
>  {
> -	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
>  	int prio = p->static_prio - MAX_RT_PRIO;
>  	struct load_weight *load = &p->se.load;
>  
> @@ -4408,7 +4407,7 @@ int sched_fork(unsigned long clone_flags
>  			p->static_prio = NICE_TO_PRIO(0);
>  
>  		p->prio = p->normal_prio = p->static_prio;
> -		set_load_weight(p);
> +		set_load_weight(p, false);
>  
>  		/*
>  		 * We don't need the reset flag anymore after the fork. It has
> @@ -4426,6 +4425,7 @@ int sched_fork(unsigned long clone_flags
>  
>  	init_entity_runnable_average(&p->se);
>  
> +
>  #ifdef CONFIG_SCHED_INFO
>  	if (likely(sched_info_on()))
>  		memset(&p->sched_info, 0, sizeof(p->sched_info));
> @@ -4441,18 +4441,23 @@ int sched_fork(unsigned long clone_flags
>  	return 0;
>  }
>  
> -void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
> +void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
>  {
>  	unsigned long flags;
> -#ifdef CONFIG_CGROUP_SCHED
> -	struct task_group *tg;
> -#endif
>  
> +	/*
> +	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
> +	 * required yet, but lockdep gets upset if rules are violated.
> +	 */
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  #ifdef CONFIG_CGROUP_SCHED
> -	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> -			  struct task_group, css);
> -	p->sched_task_group = autogroup_task_group(p, tg);
> +	if (1) {
> +		struct task_group *tg;
> +		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> +				  struct task_group, css);
> +		tg = autogroup_task_group(p, tg);
> +		p->sched_task_group = tg;
> +	}
>  #endif
>  	rseq_migrate(p);
>  	/*
> @@ -4463,7 +4468,10 @@ void sched_post_fork(struct task_struct
>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +}
>  
> +void sched_post_fork(struct task_struct *p)
> +{
>  	uclamp_post_fork(p);
>  }
>  
> @@ -6923,7 +6931,7 @@ void set_user_nice(struct task_struct *p
>  		put_prev_task(rq, p);
>  
>  	p->static_prio = NICE_TO_PRIO(nice);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  	old_prio = p->prio;
>  	p->prio = effective_prio(p);
>  
> @@ -7214,7 +7222,7 @@ static void __setscheduler_params(struct
>  	 */
>  	p->rt_priority = attr->sched_priority;
>  	p->normal_prio = normal_prio(p);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  }
>  
>  /*
> @@ -9447,7 +9455,7 @@ void __init sched_init(void)
>  #endif
>  	}
>  
> -	set_load_weight(&init_task);
> +	set_load_weight(&init_task, false);
>  
>  	/*
>  	 * The boot idle thread does lazy MMU switching as well:
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip: sched/urgent] sched: Fix yet more sched_fork() races
  2022-02-14  9:16     ` Peter Zijlstra
  2022-02-14 19:17       ` Tejun Heo
  2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
@ 2022-02-19 10:14       ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-02-19 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Linus Torvalds, Peter Zijlstra (Intel),
	Tadeusz Struk, Zhang Qiao, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     b1e8206582f9d680cff7d04828708c8b6ab32957
Gitweb:        https://git.kernel.org/tip/b1e8206582f9d680cff7d04828708c8b6ab32957
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 14 Feb 2022 10:16:57 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 19 Feb 2022 11:11:05 +01:00

sched: Fix yet more sched_fork() races

Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.

Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.

Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
---
 include/linux/sched/task.h |  4 ++--
 kernel/fork.c              | 13 ++++++++++++-
 kernel/sched/core.c        | 34 +++++++++++++++++++++-------------
 3 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b9198a1..e84e54d 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_post_fork(struct task_struct *p,
-			    struct kernel_clone_args *kargs);
+extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528..c607d23 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2267,6 +2267,17 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_put_pidfd;
 
 	/*
+	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
+	 * the new task on the correct runqueue. All this *before* the task
+	 * becomes visible.
+	 *
+	 * This isn't part of ->can_fork() because while the re-cloning is
+	 * cgroup specific, it unconditionally needs to place the task on a
+	 * runqueue.
+	 */
+	sched_cgroup_fork(p, args);
+
+	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
 	 * not want user-space to be able to predict the process start-time by
@@ -2376,7 +2387,7 @@ static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	sched_post_fork(p, args);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcf0c18..9745613 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1214,9 +1214,8 @@ int tg_nop(struct task_group *tg, void *data)
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
-	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
 
@@ -4407,7 +4406,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = p->static_prio;
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -4425,6 +4424,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	init_entity_runnable_average(&p->se);
 
+
 #ifdef CONFIG_SCHED_INFO
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -4440,18 +4440,23 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
-void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
 	unsigned long flags;
-#ifdef CONFIG_CGROUP_SCHED
-	struct task_group *tg;
-#endif
 
+	/*
+	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
+	 * required yet, but lockdep gets upset if rules are violated.
+	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
-			  struct task_group, css);
-	p->sched_task_group = autogroup_task_group(p, tg);
+	if (1) {
+		struct task_group *tg;
+		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
+				  struct task_group, css);
+		tg = autogroup_task_group(p, tg);
+		p->sched_task_group = tg;
+	}
 #endif
 	rseq_migrate(p);
 	/*
@@ -4462,7 +4467,10 @@ void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+}
 
+void sched_post_fork(struct task_struct *p)
+{
 	uclamp_post_fork(p);
 }
 
@@ -6922,7 +6930,7 @@ void set_user_nice(struct task_struct *p, long nice)
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 
@@ -7213,7 +7221,7 @@ static void __setscheduler_params(struct task_struct *p,
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /*
@@ -9446,7 +9454,7 @@ void __init sched_init(void)
 #endif
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well:

^ permalink raw reply related	[flat|nested] 14+ messages in thread

Thread overview: 14+ messages
2022-02-13 12:37 [GIT PULL] sched/urgent for 5.17-rc4 Borislav Petkov
2022-02-13 18:02 ` Linus Torvalds
2022-02-14  8:45   ` Peter Zijlstra
2022-02-14  9:16     ` Peter Zijlstra
2022-02-14 19:17       ` Tejun Heo
2022-02-14 19:35         ` Peter Zijlstra
2022-02-14 19:46           ` Tejun Heo
2022-02-17  8:51       ` [PATCH] sched: Fix yet more sched_fork() races Peter Zijlstra
2022-02-17 10:50         ` Dietmar Eggemann
2022-02-17 12:08         ` Zhang Qiao
2022-02-17 17:33         ` Tadeusz Struk
2022-02-18  6:18         ` Zhang Qiao
2022-02-19 10:14       ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2022-02-13 18:04 ` [GIT PULL] sched/urgent for 5.17-rc4 pr-tracker-bot
