* [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
@ 2020-09-07 12:00 Song Bao Hua (Barry Song)
  2020-09-07 12:37 ` Vincent Guittot
  2020-09-07 12:42 ` Mel Gorman
From: Song Bao Hua (Barry Song) @ 2020-09-07 12:00 UTC (permalink / raw)
  To: Mel Gorman, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, linux-kernel, Mel Gorman,
	Peter Zijlstra, Valentin Schneider, Phil Auld, Hillf Danton,
	Ingo Molnar
  Cc: Linuxarm, Liguozhu (Kenneth)

Hi All,
Consider a NUMA system with 4 nodes, 24 CPUs per node, and all 96 cores idle.
We then start a process with 4 threads on this completely idle system.
Any one of the four NUMA nodes has enough capacity to run the 4 threads and would still have 20 idle CPUs afterwards.
But right now the existing CFS load-balance code spreads the 4 threads across multiple nodes.
This has two negative side effects:
1. more NUMA nodes are woken up, while they could otherwise save power in their lowest-frequency and halt states
2. cache-coherency overhead between NUMA nodes

A proof-of-concept patch I made to "fix" this issue to some extent looks like this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a05..f671e15 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
                }
 
                /* Consider allowing a small imbalance between NUMA groups */
-               if (env->sd->flags & SD_NUMA)
+               if (env->sd->flags & SD_NUMA) {
+                       /* if the src group uses only 1/4 capability and dst is idle
+                        * don't move task
+                        */
+                       if (busiest->sum_nr_running <= busiest->group_weight/4 &&
+                                       local->sum_nr_running == 0) {
+                               env->imbalance = 0;
+                               return;
+                       }
                        env->imbalance = adjust_numa_imbalance(env->imbalance,
                                                busiest->sum_nr_running);
+               }
 
                return;
        }

And I wrote a simple process with 4 threads to measure the execution time:

#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>   /* gettimeofday() */
#include <sys/types.h>

struct foo {
    int x;
    int y;
} f1;

void* thread_fun1(void* param)
{
    int s = 0;
    for (int i = 0; i < 1000000000; ++i)
        s += f1.x;      /* read the shared cache line */
    return NULL;
}

void* thread_fun2(void* param)
{
    for (int i = 0; i < 1000000000; ++i)
        ++f1.y;         /* write the same cache line */
    return NULL;
}

/* Returns (start - end) in seconds; main() passes (&end, &start) so the result is positive. */
double difftimeval(const struct timeval *start, const struct timeval *end)
{
        double d;
        time_t s;
        suseconds_t u;

        s = start->tv_sec - end->tv_sec;
        u = start->tv_usec - end->tv_usec;

        d = s;
        d += u/1000000.0;

        return d;
}

int main(void)
{
        pthread_t tid1,tid2,tid3,tid4;
        struct timeval start,end;

        gettimeofday(&start, NULL);

        pthread_create(&tid1,NULL,thread_fun1,NULL);
        pthread_create(&tid2,NULL,thread_fun2,NULL);
        pthread_create(&tid3,NULL,thread_fun1,NULL);
        pthread_create(&tid4,NULL,thread_fun2,NULL);

        pthread_join(tid1,NULL);
        pthread_join(tid2,NULL);
        pthread_join(tid3,NULL);
        pthread_join(tid4,NULL);

        gettimeofday(&end, NULL);

        printf("execution time:%f\n", difftimeval(&end, &start));
        return 0;
}
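
(The program can be built with something like the following -- the exact flags do not
matter much, but optimization should stay low so the compiler does not elide the loops:)

$ gcc -O0 -pthread test.c -o a.out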

Before the PoC patch, the test result:
$ ./a.out 
execution time:10.734581

After the PoC patch, the test result:
$ ./a.out 
execution time:6.775150

The execution time drops by around 30-40% because the 4 threads are placed on a single NUMA node.
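
(One way to confirm the placement while a.out is running: the psr field of ps shows the
CPU each thread last ran on, and lscpu maps CPUs to nodes.)

$ ps -L -o tid,psr,comm -p $(pgrep a.out)
$ lscpu -e=CPU,NODE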

On the other hand, this approach doesn't have to depend on NUMA; with some changes it could also apply to SCHED_MC. If one CPU still has idle time after handling all the tasks in the system, maybe we don't need to wake up a 2nd CPU at all?

I understand this PoC patch could have negative side effects in some corner cases, for example if the four threads of one process want more memory bandwidth than a single node can provide. But generally speaking, we are making a tradeoff between cache locality and CPU utilization, as they are the main concerns. If a process depends heavily on memory bandwidth, it can change its mempolicy?
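
(As an illustration of that last point only -- not part of the patch -- a bandwidth-hungry
process could interleave its memory across nodes either from the command line or via
libnuma's set_mempolicy() wrapper; the mask below just matches the 4-node example above:)

$ numactl --interleave=all ./a.out

/* or programmatically, linked with -lnuma for the set_mempolicy() wrapper */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        unsigned long nodemask = 0xf;   /* nodes 0-3 */

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask)))
                perror("set_mempolicy");

        /* ... then create the worker threads as in the test program above ... */
        return 0;
}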

Thanks
Barry


* Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
  2020-09-07 12:00 [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load? Song Bao Hua (Barry Song)
@ 2020-09-07 12:37 ` Vincent Guittot
  2020-10-10 21:43   ` Song Bao Hua (Barry Song)
  2020-09-07 12:42 ` Mel Gorman
From: Vincent Guittot @ 2020-09-07 12:37 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Mel Gorman, mingo, peterz, juri.lelli, dietmar.eggemann, bsegall,
	linux-kernel, Mel Gorman, Peter Zijlstra, Valentin Schneider,
	Phil Auld, Hillf Danton, Ingo Molnar, Linuxarm,
	Liguozhu (Kenneth)

On Mon, 7 Sep 2020 at 14:00, Song Bao Hua (Barry Song)
<song.bao.hua@hisilicon.com> wrote:
>
> Hi All,
> In case we have a numa system with 4 nodes and in each node we have 24 cpus, and all of the 96 cores are idle.
> Then we start a process with 4 threads in this totally idle system.
> Actually any one of the four numa nodes should have enough capability to run the 4 threads while they can still have 20 idle CPUS after that.
> But right now the existing code in CFS load balance will spread the 4 threads to multiple nodes.
> This results in two negative side effects:
> 1. more numa nodes are awaken while they can save power in lowest frequency and halt status
> 2. cache coherency overhead between numa nodes
>
> A proof-of-concept patch I made to "fix" this issue to some extent is like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a05..f671e15 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 }
>
>                 /* Consider allowing a small imbalance between NUMA groups */
> -               if (env->sd->flags & SD_NUMA)
> +               if (env->sd->flags & SD_NUMA) {
> +                       /* if the src group uses only 1/4 capability and dst is idle
> +                        * don't move task
> +                        */
> +                       if (busiest->sum_nr_running <= busiest->group_weight/4 &&
> +                                       local->sum_nr_running == 0) {
> +                               env->imbalance = 0;
> +                               return;

Without considering whether that makes sense or not, such a test should be
in adjust_numa_imbalance(), which is there to decide whether it's worth
"fixing" the imbalance between NUMA nodes or not.

The default behavior of the load balancer is all about spreading tasks.
Then we have 2 NUMA hooks to prevent this from happening when it doesn't
make sense:
- this adjust_numa_imbalance()
- the fbq_type, which is used to skip some rqs

Finally, there were several discussions around adjust_numa_imbalance()
when it was introduced, and one was how to define how much imbalance can
be allowed without regressing performance. The conclusion was that it
depends on a lot of inputs about the topology, like the number of CPUs,
the number of nodes, the distance between nodes and several other things.
So as a 1st step, it was decided to use the simple current implementation.

The 1/4 threshold that you use above may work for some use cases on your
system but will most probably be wrong for others. We must find something
that is not just a heuristic and can work on other systems too.
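
To illustrate only where such a check would sit -- this is a sketch, and it assumes
adjust_numa_imbalance() were extended to also receive the busiest group's weight and
the local group's running count, which is not its current signature:

static inline long adjust_numa_imbalance(long imbalance,
                                         unsigned int busiest_nr_running,
                                         unsigned int busiest_weight,
                                         unsigned int local_nr_running)
{
        /* Hypothetical: keep a lightly loaded source node intact when dst is idle. */
        if (busiest_nr_running <= busiest_weight / 4 && local_nr_running == 0)
                return 0;

        /* Roughly today's small-imbalance allowance for a nearly idle source. */
        if (busiest_nr_running <= 2)
                return 0;

        return imbalance;
}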



> +                       }
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
>                                                 busiest->sum_nr_running);
> +               }
>
>                 return;
>         }
>
> And I wrote a simple process with 4 threads to measure the execution time:
>
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/types.h>
>
> struct foo {
>     int x;
>     int y;
> } f1;
>
> void* thread_fun1(void* param)
> {
>     int s = 0;
>     for (int i = 0; i < 1000000000; ++i)
>         s += f1.x;
>         return NULL;
> }
>
> void* thread_fun2(void* param)
> {
>     for (int i = 0; i < 1000000000; ++i)
>         ++f1.y;
>         return NULL;
> }
>
> double difftimeval(const struct timeval *start, const struct timeval *end)
> {
>         double d;
>         time_t s;
>         suseconds_t u;
>
>         s = start->tv_sec - end->tv_sec;
>         u = start->tv_usec - end->tv_usec;
>
>         d = s;
>         d += u/1000000.0;
>
>         return d;
> }
>
> int main(void)
> {
>         pthread_t tid1,tid2,tid3,tid4;
>         struct timeval start,end;
>
>         gettimeofday(&start, NULL);
>
>         pthread_create(&tid1,NULL,thread_fun1,NULL);
>         pthread_create(&tid2,NULL,thread_fun2,NULL);
>         pthread_create(&tid3,NULL,thread_fun1,NULL);
>         pthread_create(&tid4,NULL,thread_fun2,NULL);
>
>         pthread_join(tid1,NULL);
>         pthread_join(tid2,NULL);
>         pthread_join(tid3,NULL);
>         pthread_join(tid4,NULL);
>
>         gettimeofday(&end, NULL);
>
>         printf("execution time:%f\n", difftimeval(&end, &start));
> }
>
> Before the PoC patch, the test result:
> $ ./a.out
> execution time:10.734581
>
> After the PoC patch, the test result:
> $ ./a.out
> execution time:6.775150
>
> The execution time reduces around 30-40% because 4 threads are put in single one numa node.
>
> On the other hand, the patch doesn't have to depend on NUMA, it can also apply to SCHED_MC with some changes. If one CPU can be still idle after they handle all tasks in the system, we maybe not need to wake up the 2nd CPU at all?
>
> I understand this PoC patch could have negative side effect in some corner cases, for example, if the four threads running in one process want more memory bandwidth by running in multiple nodes. But generally speaking, we do a tradeoff between cache locality and better CPU utilization as they are the main concerns. If one process highly depends on memory bandwidth, they may change their mempolicy?
>
> Thanks
> Barry


* Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
  2020-09-07 12:00 [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load? Song Bao Hua (Barry Song)
  2020-09-07 12:37 ` Vincent Guittot
@ 2020-09-07 12:42 ` Mel Gorman
  2020-10-10 22:04   ` Song Bao Hua (Barry Song)
From: Mel Gorman @ 2020-09-07 12:42 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Mel Gorman, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, linux-kernel, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Hillf Danton, Ingo Molnar,
	Linuxarm, Liguozhu (Kenneth)

On Mon, Sep 07, 2020 at 12:00:10PM +0000, Song Bao Hua (Barry Song) wrote:
> Hi All,
> In case we have a numa system with 4 nodes and in each node we have 24 cpus, and all of the 96 cores are idle.
> Then we start a process with 4 threads in this totally idle system. 
> Actually any one of the four numa nodes should have enough capability to run the 4 threads while they can still have 20 idle CPUS after that.
> But right now the existing code in CFS load balance will spread the 4 threads to multiple nodes.
> This results in two negative side effects:
> 1. more numa nodes are awaken while they can save power in lowest frequency and halt status
> 2. cache coherency overhead between numa nodes
> 
> A proof-of-concept patch I made to "fix" this issue to some extent is like:
> 

This is similar in concept to a patch that did something similar except
in adjust_numa_imbalance(). It ended up being great for light loads like
simple communicating pairs but fell apart for some HPC workloads when
memory bandwidth requirements increased. Ultimately it was dropped until
the NUMA/CPU load balancing was reconciled, so it may be worth a revisit.
At the time, it was really problematic once one node was roughly 25% CPU
utilised on a 2-socket machine with hyper-threading enabled. The patch may
still work out but it would need wider testing. Within mmtests, running
the D-class NAS workloads on a 2-socket machine while varying the number
of parallel tasks/processes should be enough to determine whether the
patch is free from side effects on one machine. It gets problematic across
machine sizes because the point where memory bandwidth saturates varies.
group_weight/4 might be fine as a cut-off on one machine and a problem
on a larger machine with more cores -- I hit that particular problem
when one 2-socket machine with 48 logical CPUs was fine but a different
machine with 80 logical CPUs regressed.

I'm not saying the patch is wrong, just that patches in general for this
area (everyone, not just you) need fairly broad testing.

-- 
Mel Gorman
SUSE Labs


* RE: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
  2020-09-07 12:37 ` Vincent Guittot
@ 2020-10-10 21:43   ` Song Bao Hua (Barry Song)
From: Song Bao Hua (Barry Song) @ 2020-10-10 21:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, mingo, peterz, juri.lelli, dietmar.eggemann, bsegall,
	linux-kernel, Mel Gorman, Peter Zijlstra, Valentin Schneider,
	Phil Auld, Hillf Danton, Ingo Molnar, Linuxarm,
	Liguozhu (Kenneth)



> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> Sent: Tuesday, September 8, 2020 12:37 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Mel Gorman <mgorman@suse.de>; mingo@redhat.com;
> peterz@infradead.org; juri.lelli@redhat.com; dietmar.eggemann@arm.com;
> bsegall@google.com; linux-kernel@vger.kernel.org; Mel Gorman
> <mgorman@techsingularity.net>; Peter Zijlstra <a.p.zijlstra@chello.nl>;
> Valentin Schneider <valentin.schneider@arm.com>; Phil Auld
> <pauld@redhat.com>; Hillf Danton <hdanton@sina.com>; Ingo Molnar
> <mingo@kernel.org>; Linuxarm <linuxarm@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>
> Subject: Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src
> node has very light load?
> 
> On Mon, 7 Sep 2020 at 14:00, Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com> wrote:
> >
> > Hi All,
> > In case we have a numa system with 4 nodes and in each node we have 24
> cpus, and all of the 96 cores are idle.
> > Then we start a process with 4 threads in this totally idle system.
> > Actually any one of the four numa nodes should have enough capability to
> run the 4 threads while they can still have 20 idle CPUS after that.
> > But right now the existing code in CFS load balance will spread the 4 threads
> to multiple nodes.
> > This results in two negative side effects:
> > 1. more numa nodes are awaken while they can save power in lowest
> frequency and halt status
> > 2. cache coherency overhead between numa nodes
> >
> > A proof-of-concept patch I made to "fix" this issue to some extent is like:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1a68a05..f671e15 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> >                 }
> >
> >                 /* Consider allowing a small imbalance between NUMA
> groups */
> > -               if (env->sd->flags & SD_NUMA)
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       /* if the src group uses only 1/4 capability and
> dst is idle
> > +                        * don't move task
> > +                        */
> > +                       if (busiest->sum_nr_running <=
> busiest->group_weight/4 &&
> > +                                       local->sum_nr_running == 0) {
> > +                               env->imbalance = 0;
> > +                               return;
> 
> Without considering if that makes sense or not, such tests should be
> in adjust_numa_imbalance() which is there to decide if it's worth
> "fixing" the imbalance between numa node or not.

I was aware when sending this RFC that adjust_numa_imbalance() is the better place to
make this adjustment for NUMA. However, for the PoC, I just wanted to present the idea
and get some general agreement or disagreement.

On the other hand, this adjustment may be beneficial not only for NUMA, but also for
other scheduling levels like SCHED_MC. For example, CPUs might be organized in a
couple of groups inside a single NUMA node.
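
(Just to illustrate the kind of grouping I mean, how the CPUs sit below the node level
can be seen with something like:)

$ lscpu -e=CPU,CORE,SOCKET,NODE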

> 
> The default behavior of load balancer is all about spreading tasks.
> Then we have 2 NUMA hooks to prevent this to happen if it doesn't make
> sense:
> -This adjust_numa_imbalance()
> -The fbq_type which is used to skip some rqs
> 
> Finally, there were several discussions around adjust_numa_imbalance()
> when it was introduced and one was how to define how much imbalance is
> allowed that will not regress the performance. The conclusion was that
> it depends of a lot of inputs about the topology like the number of
> CPUs, the number of nodes, the distance between nodes and several
> others things. So as a 1st step, it was decided to use the simple and
> current implementation.

It seems you mean we could figure out a formula that derives the allowed imbalance
value from the number of CPUs, the number of nodes, the distance between nodes, and so on?

X1: the number of cpu in each node
X2: the number of nodes
X3: the distance between nodes
X4: ... //other factors
Y: imbalance value

Y = f(X1, X2, X3, X4)

Is that what you mean?

> 
> The 1/4 threshold that you use above may work for some used cases on
> your system but will most probably be wrong for others. We must find
> something that is not just a heuristic and can work of other system
> too
> 
> 
> 
> > +                       }
> >                         env->imbalance =
> adjust_numa_imbalance(env->imbalance,
> >
> busiest->sum_nr_running);
> > +               }
> >
> >                 return;
> >         }
> >
> > And I wrote a simple process with 4 threads to measure the execution time:
> >
> > #include <stdio.h>
> > #include <pthread.h>
> > #include <sys/types.h>
> >
> > struct foo {
> >     int x;
> >     int y;
> > } f1;
> >
> > void* thread_fun1(void* param)
> > {
> >     int s = 0;
> >     for (int i = 0; i < 1000000000; ++i)
> >         s += f1.x;
> >         return NULL;
> > }
> >
> > void* thread_fun2(void* param)
> > {
> >     for (int i = 0; i < 1000000000; ++i)
> >         ++f1.y;
> >         return NULL;
> > }
> >
> > double difftimeval(const struct timeval *start, const struct timeval *end)
> > {
> >         double d;
> >         time_t s;
> >         suseconds_t u;
> >
> >         s = start->tv_sec - end->tv_sec;
> >         u = start->tv_usec - end->tv_usec;
> >
> >         d = s;
> >         d += u/1000000.0;
> >
> >         return d;
> > }
> >
> > int main(void)
> > {
> >         pthread_t tid1,tid2,tid3,tid4;
> >         struct timeval start,end;
> >
> >         gettimeofday(&start, NULL);
> >
> >         pthread_create(&tid1,NULL,thread_fun1,NULL);
> >         pthread_create(&tid2,NULL,thread_fun2,NULL);
> >         pthread_create(&tid3,NULL,thread_fun1,NULL);
> >         pthread_create(&tid4,NULL,thread_fun2,NULL);
> >
> >         pthread_join(tid1,NULL);
> >         pthread_join(tid2,NULL);
> >         pthread_join(tid3,NULL);
> >         pthread_join(tid4,NULL);
> >
> >         gettimeofday(&end, NULL);
> >
> >         printf("execution time:%f\n", difftimeval(&end, &start));
> > }
> >
> > Before the PoC patch, the test result:
> > $ ./a.out
> > execution time:10.734581
> >
> > After the PoC patch, the test result:
> > $ ./a.out
> > execution time:6.775150
> >
> > The execution time reduces around 30-40% because 4 threads are put in
> single one numa node.
> >
> > On the other hand, the patch doesn't have to depend on NUMA, it can also
> apply to SCHED_MC with some changes. If one CPU can be still idle after they
> handle all tasks in the system, we maybe not need to wake up the 2nd CPU at
> all?
> >
> > I understand this PoC patch could have negative side effect in some corner
> cases, for example, if the four threads running in one process want more
> memory bandwidth by running in multiple nodes. But generally speaking, we
> do a tradeoff between cache locality and better CPU utilization as they are the
> main concerns. If one process highly depends on memory bandwidth, they
> may change their mempolicy?

Thanks
Barry


* RE: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
  2020-09-07 12:42 ` Mel Gorman
@ 2020-10-10 22:04   ` Song Bao Hua (Barry Song)
From: Song Bao Hua (Barry Song) @ 2020-10-10 22:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, linux-kernel, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Hillf Danton, Ingo Molnar,
	Linuxarm, Liguozhu (Kenneth)



> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@techsingularity.net]
> Sent: Tuesday, September 8, 2020 12:42 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Mel Gorman <mgorman@suse.de>; mingo@redhat.com;
> peterz@infradead.org; juri.lelli@redhat.com; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; bsegall@google.com;
> linux-kernel@vger.kernel.org; Peter Zijlstra <a.p.zijlstra@chello.nl>; Valentin
> Schneider <valentin.schneider@arm.com>; Phil Auld <pauld@redhat.com>;
> Hillf Danton <hdanton@sina.com>; Ingo Molnar <mingo@kernel.org>;
> Linuxarm <linuxarm@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>
> Subject: Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src
> node has very light load?
> 
> On Mon, Sep 07, 2020 at 12:00:10PM +0000, Song Bao Hua (Barry Song)
> wrote:
> > Hi All,
> > In case we have a numa system with 4 nodes and in each node we have 24
> cpus, and all of the 96 cores are idle.
> > Then we start a process with 4 threads in this totally idle system.
> > Actually any one of the four numa nodes should have enough capability to
> run the 4 threads while they can still have 20 idle CPUS after that.
> > But right now the existing code in CFS load balance will spread the 4 threads
> to multiple nodes.
> > This results in two negative side effects:
> > 1. more numa nodes are awaken while they can save power in lowest
> frequency and halt status
> > 2. cache coherency overhead between numa nodes
> >
> > A proof-of-concept patch I made to "fix" this issue to some extent is like:
> >
> 
> This is similar in concept to a patch that did something similar except
> in adjust_numa_imbalance(). It ended up being great for light loads like
> simple communicating pairs but fell apart for some HPC workloads when
> memory bandwidth requirements increased. Ultimately it was dropped until

Yes. There is a tradeoff between higher memory bandwidth and lower communication
overhead from things like bus latency and cache coherence. The kernel scheduler
doesn't actually know the requirements of applications. It doesn't know whether an
application is sensitive to memory bandwidth or to cache coherence unless the
application tells it through APIs like set_mempolicy().

It seems we could use perf profiling data as input for the scheduler. If perf finds
that the application needs lots of memory bandwidth, we spread it across more NUMA
nodes. Otherwise, if perf finds that the application gets low IPC due to cache
coherence, we try to pack it into a single NUMA node. Maybe that is too difficult
to do in the kernel, but if we had a userspace scheduler which calls taskset or
numactl based on perf profiling, it seems it could place applications more precisely
according to their characteristics?
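
Very roughly, that kind of userspace policy could be built from existing tools alone;
the application name and the decision rule below are made up purely for illustration:

$ perf stat -e instructions,cycles -p <pid> sleep 10
# low IPC, latency-bound: pack CPUs and memory onto one node
$ numactl --cpunodebind=0 --membind=0 ./myapp
# bandwidth-bound instead: spread the memory across nodes
$ numactl --interleave=all ./myapp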

> the NUMA/CPU load balancing was reconciled so may be worth a revisit. At
> the time, it was really problematic once a one node was roughly 25% CPU
> utilised on a 2-socket machine with hyper-threading enabled. The patch may
> still work out but it would need wider testing. Within mmtests, the NAS
> workloads for D-class on a 2-socket machine varying the number of parallel
> tasks/processes are used should be enough to determine if the patch is
> free from side-effects for one machine. It gets problematic for different
> machine sizes as the point where memory bandwidth is saturated varies.
> group_weight/4 might be fine on one machine as a cut-off and a problem
> on a larger machine with more cores -- I hit that particular problem
> when one 2 socket machine with 48 logical CPUs was fine but a different
> machine with 80 logical CPUs regressed.

Different machines have different memory bandwidth and different NUMA topology.
If it is too hard to figure out a single proper value that makes everyone happy,
what would you think about providing a sysctl or boot parameter for this, so that
users can adjust the cut-off based on their own testing and profiling?
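
Something along these lines is what I have in mind -- the knob name and default are
of course made up, and it would need a matching check in the fair.c path:

/* kernel/sched/fair.c: hypothetical tunable read by the new check */
int sysctl_sched_numa_imbalance_ratio = 4;

/* kernel/sysctl.c: one more entry in the existing table */
{
        .procname       = "sched_numa_imbalance_ratio",
        .data           = &sysctl_sched_numa_imbalance_ratio,
        .maxlen         = sizeof(int),
        .mode           = 0644,
        .proc_handler   = proc_dointvec_minmax,
        .extra1         = SYSCTL_ZERO,
},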

> 
> I'm not saying the patch is wrong, just that patches in general for this
> area (everyone, not just you) need fairly broad testing.
> 

Thanks
Barry


