* [RFC] High latency with core scheduling
@ 2021-12-17  0:41 Joel Fernandes
  2021-12-18  0:01 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Joel Fernandes @ 2021-12-17  0:41 UTC (permalink / raw)
  To: Josh Don, Vineeth Pillai, Peter Zijlstra, Steven Rostedt, Chen,
	Tim C, Brown, Len, LKML, AubreyLi@google.com, aubrey.li,
	Aaron Lu, Hyser,Chris, Don Hiatt, ricardo.neri, vincent.guittot
  Cc: joelaf

Hello,
On ChromeOS, we see really high scheduling latency when there is a heavy
workload running both outside and inside a cgroup. The load inside the
cgroup is tagged for core scheduling and happens to be vCPU threads. Because
of this, various folks are complaining.

One of the issues we see is that the core rbtree is static when nothing in
the tree goes to sleep or wakes up. This can cause the same task in the core
rbtree to be repeatedly picked in pick_task().
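
For illustration, roughly why this happens -- the comparator below is a
paraphrase of the upstream core-sched rbtree ordering, and its exact names
and the prio_less() signature are an assumption on my part. The ordering
consults prio_less() only once, at insertion time, so a queued task's
position goes stale as vruntimes advance:

	static inline bool __sched_core_less(struct task_struct *a,
					     struct task_struct *b)
	{
		if (a->core_cookie < b->core_cookie)
			return true;
		if (a->core_cookie > b->core_cookie)
			return false;

		/*
		 * Flip prio, so the higher-priority (lower-vruntime) task
		 * is leftmost. Note this runs only when a node is inserted;
		 * the tree is never reordered while tasks sit in it.
		 */
		if (prio_less(b, a, false))
			return true;

		return false;
	}

With nothing sleeping or waking, no node is ever re-inserted, so the cookie
search keeps returning the same stale leftmost task.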

The below diff seems to improve the situation, could you please take a look?
If it seems sane, we can go ahead and make it a formal patch to at least fix
one of the known issues.

The patch is simple: just remove the currently running task from the core
rbtree, as its vruntime is not really static. Add it back on preemption.

note: This is against a 5.4 kernel, but the code is about the same and this is an RFC.
note: The issue does not seem to happen without CGroups involved so perhaps
      something is wonky in cfs_prio_less() still. Peter?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c023a9a0c4ae..3c588ad05ab6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -200,7 +200,7 @@ static inline void dump_scrb(struct rb_node *root, int lvl, char *buf, int size)
 	dump_scrb(root->rb_right, lvl+1, buf, size);
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
 	struct task_struct *node_task;
@@ -212,6 +212,9 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	if (!p->core_cookie)
 		return;
 
+	if (sched_core_enqueued(p))
+		return;
+
 	node = &rq->core_tree.rb_node;
 	parent = *node;
 
@@ -232,7 +235,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -4745,6 +4748,18 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		return class_pick;
 
 	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * Currently running process might not be in the runqueue if fair class.
+	 * If it is of the same cookie as cookie_pick and has more priority,
+	 * then select it.
+	 */
+	if (rq != this_rq() && !is_task_rq_idle(cookie_pick) && !is_task_rq_idle(rq->curr) &&
+		cookie_pick->core_cookie == rq->curr->core_cookie &&
+		prio_less(cookie_pick, rq->curr, in_fi)) {
+		cookie_pick = rq->curr;
+	}
+
 	/*
 	 * If class > max && class > cookie, it is the highest priority task on
 	 * the core (so far) and it must be selected, otherwise we must go with
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86cc67dd38e9..820c5cf4ecc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1936,15 +1936,33 @@ struct sched_class {
 #endif
 };
 
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(rq) && READ_ONCE(prev->state) != TASK_DEAD && prev->core_cookie && prev->on_rq) {
+		sched_core_enqueue(rq, prev);
+	}
+#endif
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
 	next->sched_class->set_next_task(rq, next, false);
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * This task is going to run next and its vruntime will change.
+	 * Remove it from core rbtree so as to not confuse the ordering
+	 * in the rbtree when its vrun changes.
+	 */
+	if (sched_core_enabled(rq) && next->core_cookie && next->on_rq) {
+		sched_core_dequeue(rq, next);
+	}
+#endif
 }
 
 #ifdef CONFIG_SMP


* Re: [RFC] High latency with core scheduling
  2021-12-17  0:41 [RFC] High latency with core scheduling Joel Fernandes
@ 2021-12-18  0:01 ` Peter Zijlstra
  2021-12-18 17:50   ` Joel Fernandes
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2021-12-18  0:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Josh Don, Vineeth Pillai, Steven Rostedt, Chen, Tim C, Brown,
	Len, LKML, AubreyLi@google.com, aubrey.li, Aaron Lu, Hyser,Chris,
	Don Hiatt, ricardo.neri, vincent.guittot, joelaf

On Thu, Dec 16, 2021 at 07:41:31PM -0500, Joel Fernandes wrote:

> One of the issues we see is that the core rbtree is static when nothing in
> the tree goes to sleep or wakes up. This can cause the same task in the core
> rbtree to be repeatedly picked in pick_task().
> 
> The below diff seems to improve the situation, could you please take a look?
> If it seems sane, we can go ahead and make it a formal patch to at least fix
> one of the known issues.
> 
> The patch is simple: just remove the currently running task from the core
> rbtree, as its vruntime is not really static. Add it back on preemption.

> note: This is against a 5.4 kernel, but the code is about the same and this is an RFC.

I think you'll find there are significant differences...

> note: The issue does not seem to happen without CGroups involved so perhaps
>       something is wonky in cfs_prio_less() still. Peter?

that's weird... but it's also 00h30 am, so who knows :-)

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c023a9a0c4ae..3c588ad05ab6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -200,7 +200,7 @@ static inline void dump_scrb(struct rb_node *root, int lvl, char *buf, int size)
>  	dump_scrb(root->rb_right, lvl+1, buf, size);
>  }
>  
> -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  {
>  	struct rb_node *parent, **node;
>  	struct task_struct *node_task;
> @@ -212,6 +212,9 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  	if (!p->core_cookie)
>  		return;
>  
> +	if (sched_core_enqueued(p))
> +		return;

Are you actually hitting that? It feels wrong.

>  	node = &rq->core_tree.rb_node;
>  	parent = *node;
>  
> @@ -232,7 +235,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>  	rb_insert_color(&p->core_node, &rq->core_tree);
>  }
>  
> -static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p)
>  {
>  	rq->core->core_task_seq++;
>  
> @@ -4745,6 +4748,18 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
>  		return class_pick;
>  
>  	cookie_pick = sched_core_find(rq, cookie);
> +
> +	/*
> +	 * Currently running process might not be in the runqueue if fair class.
> +	 * If it is of the same cookie as cookie_pick and has more priority,
> +	 * then select it.
> +	 */
> +	if (rq != this_rq() && !is_task_rq_idle(cookie_pick) && !is_task_rq_idle(rq->curr) &&
> +		cookie_pick->core_cookie == rq->curr->core_cookie &&
> +		prio_less(cookie_pick, rq->curr, in_fi)) {

guys, this indent style kills infants.

> +		cookie_pick = rq->curr;
> +	}

This is the part that doesn't apply... We completely rewrote the pick
loop. I think you're looking at a change in sched_core_find() now.
Basically it should check rq->curr against whatever it finds in the
core_tree, right?
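
Something like this is presumably the shape of it -- a rough sketch only,
against a v5.14-era tree; the helpers rb_find_first(), rb_sched_core_cmp()
and __node_2_sc(), and the prio_less() in_fi argument, are assumed from that
version and may not match a backport:

	static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
	{
		struct task_struct *p = idle_sched_class.pick_task(rq);
		struct rb_node *node;

		node = rb_find_first((void *)cookie, &rq->core_tree, rb_sched_core_cmp);
		if (node)
			p = __node_2_sc(node);

		/*
		 * rq->curr is no longer kept in the core tree while it
		 * runs, so compare it against the tree result explicitly.
		 */
		if (!is_task_rq_idle(rq->curr) &&
		    rq->curr->core_cookie == cookie && rq->curr->on_rq &&
		    prio_less(p, rq->curr, false))
			p = rq->curr;

		return p;
	}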

> +
>  	/*
>  	 * If class > max && class > cookie, it is the highest priority task on
>  	 * the core (so far) and it must be selected, otherwise we must go with
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 86cc67dd38e9..820c5cf4ecc1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1936,15 +1936,33 @@ struct sched_class {
>  #endif
>  };
>  
> +void sched_core_enqueue(struct rq *rq, struct task_struct *p);
> +void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> +
>  static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>  {
>  	WARN_ON_ONCE(rq->curr != prev);
>  	prev->sched_class->put_prev_task(rq, prev);
> +#ifdef CONFIG_SCHED_CORE
> +	if (sched_core_enabled(rq) && READ_ONCE(prev->state) != TASK_DEAD && prev->core_cookie && prev->on_rq) {

That TASK_DEAD thing is weird... do_task_dead() goes something like:

	set_special_state(TASK_DEAD)
	schedule()
	  deactivate_task(prev)
	    prev->on_rq = 0;
	    dequeue_task()
	      sched_core_dequeue() /* also wrong, see below */
	      prev->sched_class->dequeue_task()
	  ...
	  next = pick_next_task(..,prev,..);
	    put_prev_task()
	      if (... && prev->on_rq /* false */)
	        sched_core_enqueue()


Notably, the sched_core_dequeue() in dequeue_task() shouldn't happen
either, because it's current and as such shouldn't be enqueued to begin
with.
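
That is, dequeue_task() would only touch the tree for tasks that are really
queued -- sketch only, assuming the sched_core_enqueued() helper (an
RB_EMPTY_NODE test on p->core_node) from the upstream series:

	static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
	{
		/*
		 * current was taken out of the core tree when it was
		 * picked, so there is nothing to dequeue for it here.
		 */
		if (sched_core_enabled(rq) && sched_core_enqueued(p))
			sched_core_dequeue(rq, p);

		/* ... rest of dequeue_task() unchanged ... */
	}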


> +		sched_core_enqueue(rq, prev);
> +	}
> +#endif
>  }
>  
>  static inline void set_next_task(struct rq *rq, struct task_struct *next)
>  {
>  	next->sched_class->set_next_task(rq, next, false);
> +#ifdef CONFIG_SCHED_CORE
> +	/*
> +	 * This task is going to run next and its vruntime will change.
> +	 * Remove it from core rbtree so as to not confuse the ordering
> +	 * in the rbtree when its vrun changes.
> +	 */
> +	if (sched_core_enabled(rq) && next->core_cookie && next->on_rq) {
> +		sched_core_dequeue(rq, next);
> +	}
> +#endif

Anyway... *ouch* at the additional rb-tree ops, but I think you're right
about needing this :/

Just please, think through the whole enqueue/dequeue thing because even
for an rfc this seems overly sloppy.



* Re: [RFC] High latency with core scheduling
  2021-12-18  0:01 ` Peter Zijlstra
@ 2021-12-18 17:50   ` Joel Fernandes
  0 siblings, 0 replies; 3+ messages in thread
From: Joel Fernandes @ 2021-12-18 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Vineeth Pillai, Steven Rostedt, Chen, Tim C, Brown,
	Len, LKML, AubreyLi@google.com, aubrey.li, Aaron Lu, Hyser,Chris,
	Don Hiatt, ricardo.neri, vincent.guittot

Hi Peter,
Thanks for the reply.

On Sat, Dec 18, 2021 at 01:01:02AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 16, 2021 at 07:41:31PM -0500, Joel Fernandes wrote:
> 
> > One of the issues we see is that the core rbtree is static when nothing in
> > the tree goes to sleep or wakes up. This can cause the same task in the core
> > rbtree to be repeatedly picked in pick_task().
> > 
> > The below diff seems to improve the situation, could you please take a look?
> > If it seems sane, we can go ahead and make it a formal patch to at least fix
> > one of the known issues.
> > 
> > The patch is simple: just remove the currently running task from the core
> > rbtree, as its vruntime is not really static. Add it back on preemption.
> 
> > note: This is against a 5.4 kernel, but the code is about the same and this is an RFC.
> 
> I think you'll find there's significant differences..

Sure, the scheduler keeps changing and we have to be careful about what and
how we backport to products. The bar for backporting is pretty high: a change
has to fix a visible bug or bring a non-trivial improvement. That's why it's
important to take stable backports seriously, for example by CC'ing stable on
critical fixes so that products actually get them. Bug fixes are no good if
no one gets them, or gets them half a decade late :-). See below for more
replies on the issues we are discussing:

> > note: The issue does not seem to happen without CGroups involved so perhaps
> >       something is wonky in cfs_prio_less() still. Peter?
> 
> that's weird... but it's also 00h30 am, so who knows :-)

It seems cfs_prio_less() falls apart in the following usecase (thanks to
Vineeth for discussing this with me):

root group
  /  \
 T1   G2
      / \
     T2 T3

Say T1 and T2 are *not* tagged, and T3 is tagged. All are CFS tasks, and T2
and T3 are in cgroup G2.

Say T1 has the highest dynamic priority and runs for some time. While it is
running, T2 is selected to run on a different hyperthread because it is
compatible with T1.

Because T2 gets to run, G2's vruntime moves forward.

Because both G2 and T1 have their vruntime moving forward at the same time,
T1 will always end up being higher priority than G2.

End result is T3 gets starved.
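
To convince ourselves, a user-space toy model of the above (plain C, not
kernel code; it assumes equal weights and that a vruntime tie at the root
goes to T1):

	#include <stdio.h>

	int main(void)
	{
		/* Root-level vruntimes: task T1 and group G2. */
		unsigned long vr_t1 = 0, vr_g2 = 0;
		/* Vruntimes inside G2 (T2 untagged, T3 tagged). */
		unsigned long vr_t2 = 0, vr_t3 = 0;
		unsigned long t3_ran = 0;

		for (int tick = 0; tick < 1000; tick++) {
			if (vr_t1 <= vr_g2) {
				/* T1 wins the root pick and runs on HT0. */
				vr_t1++;
				/*
				 * Core scheduling fills HT1 with a task
				 * compatible with untagged T1: untagged T2.
				 * Running T2 advances G2's vruntime too.
				 */
				vr_t2++;
				vr_g2++;
			} else {
				/* G2 wins; its min-vruntime task is tagged T3. */
				vr_t3++;
				vr_g2++;
				t3_ran++;
			}
		}

		/* Prints 0: G2 keeps pace with T1, so T1 always wins. */
		printf("T3 ran %lu of 1000 ticks\n", t3_ran);
		return 0;
	}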

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index c023a9a0c4ae..3c588ad05ab6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -200,7 +200,7 @@ static inline void dump_scrb(struct rb_node *root, int lvl, char *buf, int size)
> >  	dump_scrb(root->rb_right, lvl+1, buf, size);
> >  }
> >  
> > -static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> > +void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> >  {
> >  	struct rb_node *parent, **node;
> >  	struct task_struct *node_task;
> > @@ -212,6 +212,9 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> >  	if (!p->core_cookie)
> >  		return;
> >  
> > +	if (sched_core_enqueued(p))
> > +		return;
> 
> Are you actually hitting that? It feels wrong.

Yes, in an initial version of the diff the system wasn't booting without
this. You're right, it's likely not needed any more.

> > +		cookie_pick = rq->curr;
> > +	}
> 
> This is the part that doesn't apply.. We completely rewrote the pick
> loop. I think you're looking at a change in sched_core_find() now.
> Basically it should check rq->curr against whatever it finds in the
> core_tree, right?

Yeah, but the core tree is static even with that rewrite, so it still has
the issue. Right?

Yes, we need to update the pick loop to the upstream version. That gets
hard to do on older kernels/products as it adds risk to already-tested code.
But if the pick loop rewrite fixed a bug, we should definitely prioritize
backporting that.

> > +
> >  	/*
> >  	 * If class > max && class > cookie, it is the highest priority task on
> >  	 * the core (so far) and it must be selected, otherwise we must go with
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 86cc67dd38e9..820c5cf4ecc1 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1936,15 +1936,33 @@ struct sched_class {
> >  #endif
> >  };
> >  
> > +void sched_core_enqueue(struct rq *rq, struct task_struct *p);
> > +void sched_core_dequeue(struct rq *rq, struct task_struct *p);
> > +
> >  static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> >  {
> >  	WARN_ON_ONCE(rq->curr != prev);
> >  	prev->sched_class->put_prev_task(rq, prev);
> > +#ifdef CONFIG_SCHED_CORE
> > +	if (sched_core_enabled(rq) && READ_ONCE(prev->state) != TASK_DEAD && prev->core_cookie && prev->on_rq) {
> 
> That TASK_DEAD thing is weird... do_task_dead() goes something like:
> 
> 	set_special_state(TASK_DEAD)
> 	schedule()
> 	  deactivate_task(prev)
> 	    prev->on_rq = 0;
> 	    dequeue_task()
> 	      sched_core_dequeue() /* also wrong, see below */
> 	      prev->sched_class->dequeue_task()
> 	  ...
> 	  next = pick_next_task(..,prev,..);
> 	    put_prev_task()
> 	      if (... && prev->on_rq /* false */)
> 	        sched_core_enqueue()
> 
> 
> Notably, the sched_core_dequeue() in dequeue_task() shouldn't happen
> either, because it's current and as such shouldn't be enqueued to begin
> with.

Yes, you're right. The TASK_DEAD check should likely not be needed. We'll do
a thorough investigation of where all the core tree queues/dequeues are
happening.

> > +		sched_core_enqueue(rq, prev);
> > +	}
> > +#endif
> >  }
> >  
> >  static inline void set_next_task(struct rq *rq, struct task_struct *next)
> >  {
> >  	next->sched_class->set_next_task(rq, next, false);
> > +#ifdef CONFIG_SCHED_CORE
> > +	/*
> > +	 * This task is going to run next and its vruntime will change.
> > +	 * Remove it from core rbtree so as to not confuse the ordering
> > +	 * in the rbtree when its vrun changes.
> > +	 */
> > +	if (sched_core_enabled(rq) && next->core_cookie && next->on_rq) {
> > +		sched_core_dequeue(rq, next);
> > +	}
> > +#endif
> 
> Anyway... *ouch* at the additional rb-tree ops, but I think you're right
> about needing this :/

Thanks for acknowledging the issue.

thanks!

 - Joel

