From: Peter Zijlstra <peterz@infradead.org>
To: Michael Neuling <mikey@neuling.org>
Cc: Joel Schopp <jschopp@austin.ibm.com>, Ingo Molnar <mingo@elte.hu>,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	ego@in.ibm.com
Subject: Re: [PATCHv4 2/2] powerpc: implement arch_scale_smt_power for Power7
Date: Tue, 23 Feb 2010 17:24:41 +0100	[thread overview]
Message-ID: <1266942281.11845.521.camel@laptop> (raw)
In-Reply-To: <23662.1266905307@neuling.org>

On Tue, 2010-02-23 at 17:08 +1100, Michael Neuling wrote:

> I have some comments on the code inline but... 
> 
> So when I run this, I don't get processes pulled down to the lower
> threads.  A simple test case: running 1 CPU-intensive process at
> SCHED_OTHER on a machine with a 2-way SMT system (a POWER6, but with
> SD_ASYM_PACKING enabled).  The single process doesn't move to the lower
> threads as I'd hope.
> 
> Also, are you sure you want to put this in generic code?  It seems to
> be quite POWER7-specific functionality, so it would logically be better
> placed in arch/powerpc.  I guess some other arch *might* need it, but
> that seems unlikely.

Well, there are no arch hooks in the load-balancing (aside from the
recent cpu_power stuff, and that really is the wrong thing to poke at
for this), and I did hear some other people express interest in such a
constraint.

Also, load-balancing is complex enough as it is, so I prefer to keep
everything in the generic code where possible.  Clearly, things like
sched_domain creation need arch topology bits, and the arch_scale*
things require other arch information, like cpu frequency.


> > @@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(st
> >  		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> >  }
> >  
> > +static int update_sd_pick_busiest(struct sched_domain *sd,
> > +				  struct sd_lb_stats *sds,
> > +				  struct sched_group *sg,
> > +				  struct sg_lb_stats *sgs)
> > +{
> > +	if (sgs->sum_nr_running > sgs->group_capacity)
> > +		return 1;
> > +
> > +	if (sgs->group_imb)
> > +		return 1;
> > +
> > +	if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
> 
> If we are asymmetric packing...
> 
> 
> > +		if (!sds->busiest)
> > +			return 1;
> 
> This just seems to be a null pointer check.
> 
> From the tracing I've done, this is always true (always NULL) at this
> point, so we return here.

Right, so we need a busiest group to take a task from; if there is no
busiest yet, take this group.

And in your scenario, with only a single task around, we'd hit this at
most once, so yes, it makes sense that this is always NULL.

> > +
> > +		if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
> > +			return 1;
> 
> I'm a bit lost as to what this is for.  Any clues you could provide
> would be appreciated. :-)
> 
> Is the first cpu in this domain's busiest group before the first cpu in
> this group?  If so, pick this as the busiest?
> 
> Should this be the other way around if we want to pack the busiest to
> the first cpu?  Mark it as the busiest if it's after (not before).  
> 
> Is group_first_cpu guaranteed to give us the first physical cpu (i.e.
> thread 0 in our case) or are these virtualised at this point?
> 
> I'm not seeing this hit anyway due to the null pointer check above.

So this says: all else being equal, if we already have a busiest but
this candidate (sg) is higher-numbered than the current (busiest), take
this one.

The idea is to move the task on the highest SMT thread down.
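
To make the ordering concrete, here's a walk-through with made-up
groups (mine, not from the patch):

	/*
	 * 2-way SMT, candidate group B = {cpu1} running the lone task:
	 *
	 *  - sds->busiest == NULL       -> take B (the first test above)
	 *  - sds->busiest == A = {cpu0} -> group_first_cpu(A) == 0 is
	 *    less than group_first_cpu(B) == 1 -> take B
	 *
	 * Either way the higher-numbered group ends up as busiest, and
	 * the subsequent pull moves its task down onto cpu0.
	 */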

> > @@ -2562,6 +2585,38 @@ static inline void update_sd_lb_stats(st
> >  	} while (group != sd->groups);
> >  }
> >  
> > +int __weak sd_asym_packing_arch(void)
> > +{
> > +	return 0;
> > +}

arch_sd_asym_packing() is what you used in topology.h; the two names
need to agree.
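
For reference, the pairing I'd expect, as a sketch only (the powerpc
side and the feature test are made up here, substitute whatever fits):

	/* kernel/sched.c: weak default, no asymmetric packing */
	int __weak arch_sd_asym_packing(void)
	{
		return 0;
	}

	/* arch/powerpc/kernel/smp.c: only ask for packing where it helps */
	int arch_sd_asym_packing(void)
	{
		if (cpu_has_feature(CPU_FTR_ASYM_SMT))
			return SD_ASYM_PACKING;
		return 0;
	}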

> > +static int check_asym_packing(struct sched_domain *sd,
> > +				    struct sd_lb_stats *sds,
> > +				    unsigned long *imbalance)
> > +{
> > +	int i, cpu, busiest_cpu;
> > +
> > +	if (!(sd->flags & SD_ASYM_PACKING))
> > +		return 0;
> > +
> > +	if (!sds->busiest)
> > +		return 0;
> > +
> > +	i = 0;
> > +	busiest_cpu = group_first_cpu(sds->busiest);
> > +	for_each_cpu(cpu, sched_domain_span(sd)) {
> > +		i++;
> > +		if (cpu == busiest_cpu)
> > +			break;
> > +	}
> > +
> > +	if (sds->total_nr_running > i)
> > +		return 0;
> 
> This seems to be the core of the packing logic.
> 
> We make sure the busiest_cpu is not past total_nr_running.  If it is,
> we mark it as imbalanced.  Correct?
> 
> It seems that if a non-zero thread/group had a pile of processes
> running on it and a lower thread had much less, this wouldn't fire, but
> I'm guessing normal load balancing would kick in in that case to fix
> the imbalance.
> 
> Any corrections to my ramblings appreciated :-)

Right, so we're concerned with the scenario where there are fewer tasks
than SMT siblings; if there are more, they should all be running and the
regular load-balancer will deal with it.

If there are fewer, the group will normally look balanced (e.g. one task
on a 4-thread POWER7 core is well under capacity), so we fall out of the
regular paths and end up in check_asym_packing().

So what I tried doing with that loop is detect whether there's a hole in
the packing before busiest.  Now that I think about it, what we need to
check is whether this_cpu (the removed cpu argument) is idle and
lower-numbered than busiest.

So something like:

static int check_asym_packing(struct sched_domain *sd,
                              struct sd_lb_stats *sds,
                              int this_cpu, unsigned long *imbalance)
{
	int busiest_cpu;

	if (!(sd->flags & SD_ASYM_PACKING))
		return 0;

	if (!sds->busiest)
		return 0;

	/* Only proceed when this_cpu is idle and not above busiest. */
	busiest_cpu = group_first_cpu(sds->busiest);
	if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu)
		return 0;

	/* Undo the cpu_power scaling: pull busiest's entire load. */
	*imbalance = (sds->max_load * sds->busiest->cpu_power) /
			SCHED_LOAD_SCALE;
	return 1;
}

Does that make sense?
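
FWIW, the *imbalance expression just undoes the cpu_power scaling, so
it comes out as the busiest group's raw weighted load.  With
illustrative numbers: a single nice-0 task (load 1024) in a group with
cpu_power 512 gives max_load = 1024 * 1024 / 512 = 2048, and then
*imbalance = 2048 * 512 / 1024 = 1024, exactly the one task's load,
which is what we want move_tasks() to pull.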

I still see two problems with this though.  Regular load-balancing only
balances on the first cpu of a domain (see the *balance = 0 condition
in update_sg_lb_stats()), which means that if SMT[12] are idle we'll not
pull properly.  Also, nohz balancing might mess this up further.

We could maybe play some games with the balance decision in
update_sg_lb_stats() for SD_ASYM_PACKING domains when idle == CPU_IDLE;
no ideas yet on nohz though.
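
Very roughly, something like the below; completely untested, and the
surrounding code is from memory, so treat it as a sketch only:

	/*
	 * In update_sg_lb_stats(), where we currently give up on every
	 * cpu but the designated balance_cpu of a local group:
	 */
	if (local_group && balance_cpu != this_cpu) {
		/*
		 * For SD_ASYM_PACKING, let an idle low-numbered cpu go
		 * ahead and balance anyway, so it can pull from the
		 * higher siblings instead of bailing out here.
		 */
		if (!((sd->flags & SD_ASYM_PACKING) && idle == CPU_IDLE)) {
			*balance = 0;
			return;
		}
	}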





