* [RFC] Splitting scheduler into two halves
@ 2014-02-28  2:13 Du, Yuyang
  2014-02-28 10:29 ` Morten Rasmussen
  0 siblings, 1 reply; 5+ messages in thread
From: Du, Yuyang @ 2014-02-28  2:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Van De Ven, Arjan, Brown, Len, Wysocki, Rafael J

Hi Peter/Ingo and all,

With the advent of more cores and heterogeneous architectures, the scheduler is required to be more complex (power efficiency) and diverse (big.LITTLE). For the scheduler to address that challenge as a whole is costly but not necessary. This proposal argues that the scheduler be split into two parts: a top half (task scheduling) and a bottom half (load balance). Let the bottom half take charge of the incoming requirements.

The two halves are rather orthogonal in functionality. Task scheduling (top half) seeks *ONE* CPU on which to run tasks fairly (priority included), while load balance (bottom half) aims across *ALL* CPUs to maximize the throughput of the computing power. The goal of task scheduling is unique and clear, and CFS and RT in that part approach it exactly. Load balance, however, is constrained to meet more goals: to name a few, performance (throughput/responsiveness), power consumption, architecture differences, etc. These are often hard to achieve because they may conflict and are difficult to estimate and plan for. So shall we declare the independence of the two and give them the freedom to pursue their own "happiness"?

We take an incremental development approach. As a starting point, we did three things (without changing a single line of functional code):
	1)	Move load balance out of fair.c into load_balance.c (~3000 lines of code). As a result, fair.c/rt.c and load_balance.c have very little intersection.
	2)	Define struct sched_lb_class, which consists of the following members, to encapsulate the load-balance entry points.
		a.	const struct sched_lb_class *next;
		b.	int (*fork_balance) (struct task_struct *p, int sd_flags, int wake_flags);
		c.	int (*exec_balance) (struct task_struct *p, int sd_flags, int wake_flags);
		d.	int (*wakeup_balance) (struct task_struct *p, int sd_flags, int wake_flags);
		e.	void (*idle_balance) (int this_cpu, struct rq *this_rq);
		f.	void (*periodic_rebalance) (int cpu, enum cpu_idle_type idle);
		g.	void (*nohz_idle_balance) (int this_cpu, enum cpu_idle_type idle);
		h.	void (*start_periodic_balance) (struct rq *rq, int cpu);
		i.	void (*check_nohz_idle_balance) (struct rq *rq, int cpu);
	3)	Insert another layer of indirection to wrap the functions implemented in sched_lb_class. Implement a default load-balance class that is simply the previous load-balance code.

The next step is to continue redesigning and refactoring to pave the way toward more powerful and diverse load balancing. More importantly, this RFC solicits discussion to get early feedback on this big proposed change.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] Splitting scheduler into two halves
  2014-02-28  2:13 [RFC] Splitting scheduler into two halves Du, Yuyang
@ 2014-02-28 10:29 ` Morten Rasmussen
  2014-02-28 11:44   ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Morten Rasmussen @ 2014-02-28 10:29 UTC (permalink / raw)
  To: Du, Yuyang
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-pm, Van De Ven,
	Arjan, Brown, Len, Wysocki, Rafael J

Hi Yuyang,

On Fri, Feb 28, 2014 at 02:13:32AM +0000, Du, Yuyang wrote:
> Hi Peter/Ingo and all,
> 
> With the advent of more cores and heterogeneous architectures, the
> scheduler is required to be more complex (power efficiency) and
> diverse (big.LITTLE). For the scheduler to address that challenge as a
> whole is costly but not necessary. This proposal argues that the
> scheduler be split into two parts: a top half (task scheduling) and a
> bottom half (load balance). Let the bottom half take charge of the
> incoming requirements.
> 
> The two halves are rather orthogonal in functionality. Task
> scheduling (top half) seeks *ONE* CPU on which to run tasks fairly
> (priority included), while load balance (bottom half) aims across
> *ALL* CPUs to maximize the throughput of the computing power. The
> goal of task scheduling is unique and clear, and CFS and RT in that
> part approach it exactly. Load balance, however, is constrained to
> meet more goals: to name a few, performance
> (throughput/responsiveness), power consumption, architecture
> differences, etc. These are often hard to achieve because they may
> conflict and are difficult to estimate and plan for. So shall we
> declare the independence of the two and give them the freedom to
> pursue their own "happiness"?

Interesting proposal. While we could declare the load-balance function
independent from the rest of CFS, I don't think they can be separated as
cleanly as your proposal suggests.

If I understand your proposal correctly, you are proposing to have a
pluggable scheduler where it is possible to have many different
load-balance (bottom half) implementations. These may require different
statistics and metrics for their load-balancing heuristics that need to
be updated by the task scheduling (top half). Having worked with
big.LITTLE systems for quite a while, I know this is indeed the case if
you want to schedule more efficiently for big.LITTLE.

For heterogeneous systems and energy awareness in general, the current
load tracking isn't very good for low utilization situations. Fixing
that would mean changes in both halves. If you go for extreme optimizations
for heterogeneous systems, you may even want the top half to keep track
of light and heavy tasks so you don't have to search through the
runqueues as part of load-balance in the bottom half to try to match
tasks to an appropriate cpu. I'm not saying that the latter is a
requirement, but just an example of things that people may try to do.

If you don't allow stuff to be added to the top half, there isn't much
room to do diverse implementations in the bottom half.

The current sched_class abstraction already has the issue of not
abstracting everything: functions in core.c manipulate data inside
CFS directly.

> We take an incremental development approach. As a starting point, we
> did three things (without changing a single line of functional code):
> 	1)	Move load balance out of fair.c into load_balance.c
> 	(~3000 lines of code). As a result, fair.c/rt.c and
> 	load_balance.c have very little intersection.
> 	2)	Define struct sched_lb_class, which consists of the
> 	following members, to encapsulate the load-balance entry points.
> 		a.	const struct sched_lb_class *next;
> 		b.	int (*fork_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		c.	int (*exec_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		d.	int (*wakeup_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		e.	void (*idle_balance) (int this_cpu, struct rq *this_rq);
> 		f.	void (*periodic_rebalance) (int cpu, enum cpu_idle_type idle);
> 		g.	void (*nohz_idle_balance) (int this_cpu, enum cpu_idle_type idle);
> 		h.	void (*start_periodic_balance) (struct rq *rq, int cpu);
> 		i.	void (*check_nohz_idle_balance) (struct rq *rq, int cpu);
> 	3)	Insert another layer of indirection to wrap the
> 	functions implemented in sched_lb_class. Implement a default
> 	load-balance class that is simply the previous load-balance code.
> 
> The next step is to continue redesigning and refactoring to pave the
> way toward more powerful and diverse load balancing. More importantly,
> this RFC solicits discussion to get early feedback on this big
> proposed change.

Is sched_lb_class supposed to implement load-balancing for all
sched_class'es (rt, deadline, and fair) or just fair?

Morten


* Re: [RFC] Splitting scheduler into two halves
  2014-02-28 10:29 ` Morten Rasmussen
@ 2014-02-28 11:44   ` Peter Zijlstra
  2014-02-28 12:01     ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2014-02-28 11:44 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Du, Yuyang, Ingo Molnar, linux-kernel, linux-pm, Van De Ven,
	Arjan, Brown, Len, Wysocki, Rafael J

On Fri, Feb 28, 2014 at 10:29:32AM +0000, Morten Rasmussen wrote:
> If I understand your proposal correctly, you are proposing to have a
> pluggable scheduler where it is possible to have many different
> load-balance (bottom half) implementations.

Yeah, that's not _ever_ going to happen. We've had that discussion many
times, use your favourite search engine.


* Re: [RFC] Splitting scheduler into two halves
  2014-02-28 11:44   ` Peter Zijlstra
@ 2014-02-28 12:01     ` Peter Zijlstra
  2014-03-03  9:56       ` Du, Yuyang
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2014-02-28 12:01 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Du, Yuyang, Ingo Molnar, linux-kernel, linux-pm, Van De Ven,
	Arjan, Brown, Len, Wysocki, Rafael J

On Fri, Feb 28, 2014 at 12:44:59PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 28, 2014 at 10:29:32AM +0000, Morten Rasmussen wrote:
> > If I understand your proposal correctly, you are proposing to have a
> > pluggable scheduler where it is possible to have many different
> > load-balance (bottom half) implementations.
> 
> Yeah, that's not _ever_ going to happen. We've had that discussion many
> times, use your favourite search engine.

*groan*, the version in my inbox to which I replied earlier seems to be
private; and I'm not CC'd on the one that went to the list.


---
Please use a sane MUA and teach it to wrap at around ~78 chars.

On Fri, Feb 28, 2014 at 02:13:32AM +0000, Du, Yuyang wrote:
> Hi Peter/Ingo and all,
> 
> With the advent of more cores and heterogeneous architectures, the
> scheduler is required to be more complex (power efficiency) and
> diverse (big.LITTLE). For the scheduler to address that challenge as a
> whole is costly but not necessary. This proposal argues that the
> scheduler be split into two parts: a top half (task scheduling) and a
> bottom half (load balance). Let the bottom half take charge of the
> incoming requirements.

This is already so.

> The two halves are rather orthogonal in functionality. Task
> scheduling (top half) seeks *ONE* CPU on which to run tasks fairly
> (priority included), while load balance (bottom half) aims across
> *ALL* CPUs to maximize the throughput of the computing power. The
> goal of task scheduling is unique and clear, and CFS and RT in that
> part approach it exactly. Load balance, however, is constrained to
> meet more goals: to name a few, performance
> (throughput/responsiveness), power consumption, architecture
> differences, etc. These are often hard to achieve because they may
> conflict and are difficult to estimate and plan for. So shall we
> declare the independence of the two and give them the freedom to
> pursue their own "happiness"?

You cannot treat them completely independent, as fairness must extend
across CPUs. And there's good reasons to integrate them further still;
our current min_vruntime is a poor substitute for the per-cpu zero-lag
point. But with some of the runtime tracking we did for SMP-cgroup we
can approximate the global zero-lag point.

Using a global zero-lag point has advantages in that task latency is
better preserved in the face of migrations.

So no; you cannot completely separate them. But even if you could;
I don't see the point in doing so.

> We take an incremental development approach. As a starting point, we did three things (without changing a single line of functional code):
> 	1)	Move load balance out of fair.c into load_balance.c
> 	(~3000 lines of code). As a result, fair.c/rt.c and
> 	load_balance.c have very little intersection.

You're very much overlooking the fact that RT and DL have their own
SMP logic. So the sched_class interface must very much include the
SMP logic.

The best you can try is creating fair_smp.c, but I'm not seeing how
that's going to be anything but pure code movement. You're not going to
suddenly make it all easier.

> 	2)	Define struct sched_lb_class, which consists of the following members, to encapsulate the load-balance entry points.
> 		a.	const struct sched_lb_class *next;
> 		b.	int (*fork_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		c.	int (*exec_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		d.	int (*wakeup_balance) (struct task_struct *p, int sd_flags, int wake_flags);
> 		e.	void (*idle_balance) (int this_cpu, struct rq *this_rq);
> 		f.	void (*periodic_rebalance) (int cpu, enum cpu_idle_type idle);
> 		g.	void (*nohz_idle_balance) (int this_cpu, enum cpu_idle_type idle);
> 		h.	void (*start_periodic_balance) (struct rq *rq, int cpu);
> 		i.	void (*check_nohz_idle_balance) (struct rq *rq, int cpu);

No point in doing that; as there will only ever be the one consumer.

> 	3)	Insert another layer of indirection to wrap the
> 	functions implemented in sched_lb_class. Implement a default
> 	load-balance class that is simply the previous load-balance code.

Every problem in CS can be solved by another layer of abstraction;
except for the problem of too many layers.

> The next step is to continue redesigning and refactoring to pave the
> way toward more powerful and diverse load balancing. More importantly,
> this RFC solicits discussion to get early feedback on this big
> proposed change.

I'm not seeing the point. Abstraction and indirection for a single user
are bloody pointless.


* RE: [RFC] Splitting scheduler into two halves
  2014-02-28 12:01     ` Peter Zijlstra
@ 2014-03-03  9:56       ` Du, Yuyang
  0 siblings, 0 replies; 5+ messages in thread
From: Du, Yuyang @ 2014-03-03  9:56 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: Ingo Molnar, linux-kernel, linux-pm, Van De Ven, Arjan, Brown,
	Len, Wysocki, Rafael J

Well, my two cents in response (sorry, I will use a Linux style mail client next time):

1) The top half and bottom half do interact with each other. They share information; the top half provides some measurements/averages to the load balancer. The good news is that their operations don't interleave. Splitting them does not mean they must be entirely self-contained.

2) What I have done is a starting point. What really matters is good design from that starting point. I myself have been working on workload consolidation (or task packing, as you like) for a while, and I did intend to do a complete bottom half. Encapsulating the load balance in a class allows it to be replaced more easily. But even before this very first email, wholesale replacement was the last thing I wanted, even if I could do it, because it would not make "a better world".

Let's explore the design space a little bit. For load balance, its function is like:

Tasks -> be balanced -> on CPUs

Be more specific:

[ which_1 ] tasks -> be balanced -> on [ which_2 ] CPUs

Which_1 can be: fair or rt, by priority. ARM people may want "small/light" or "big/heavy", by load tracking (fortunately, we don't need that). Maybe others.

Which_2 can be: all, or some.

Which_1 * which_2 gives many combinations, each of which has this in common: XX tasks -> be balanced -> on YY CPUs (XX is defined by the bottom half but obtained from the top half; the balancing itself is done by the bottom half).

So it looks like we need a class, and we need a few implementations (one for each combination) that work together in a *hierarchy*.

I don't have the final answer yet; this is why I said I am continuing to redesign and refactor it. But it really looks like modularity can and should be applied here to help realize so complex a system.

Thanks,
Yuyang

