* [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology
@ 2017-04-13 13:56 Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
  ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
To: linux-kernel
Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Lauro Ramos Venancio

Currently, the scheduler is not able to directly move tasks between some
NUMA nodes 2 hops apart on machines with mesh topology. This occurs
because some NUMA nodes belong to all sched groups. For more details,
see the commit log of patch 2.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patchset constructs the sched groups from each CPU's perspective,
so each NUMA node can have different groups in the last NUMA sched
domain level. SPECjbb2005 results show up to a 63% performance
improvement and a huge standard deviation drop on a machine with 8 NUMA
nodes and mesh topology.

Patch 1 - just prepares the code for patch 2
Patch 2 - changes the sched group construction
Patch 3 - fixes an issue with different groups starting with the same CPU

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Regards,
Lauro

Lauro Ramos Venancio (3):
  sched/topology: Refactor function build_overlap_sched_groups()
  sched/topology: fix sched groups on NUMA machines with mesh topology
  sched/topology: Different sched groups must not have the same balance cpu

 kernel/sched/topology.c | 165 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 117 insertions(+), 48 deletions(-)

--
1.8.3.1

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() 2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio @ 2017-04-13 13:56 ` Lauro Ramos Venancio 2017-04-13 14:50 ` Rik van Riel 2017-05-15 9:02 ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio 2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio 2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio 2 siblings, 2 replies; 29+ messages in thread From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw) To: linux-kernel Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Lauro Ramos Venancio Create functions build_group_from_child_sched_domain() and init_overlap_sched_group(). No functional change. Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> --- kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++--------------- 1 file changed, 43 insertions(+), 19 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 1b0b4fb..d786d45 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg) return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg)); } +static struct sched_group * +build_group_from_child_sched_domain(struct sched_domain *sd, int cpu) +{ + struct sched_group *sg; + struct cpumask *sg_span; + + sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), + GFP_KERNEL, cpu_to_node(cpu)); + + if (!sg) + return NULL; + + sg_span = sched_group_cpus(sg); + if (sd->child) + cpumask_copy(sg_span, sched_domain_span(sd->child)); + else + cpumask_copy(sg_span, sched_domain_span(sd)); + + return sg; +} + +static void init_overlap_sched_group(struct sched_domain *sd, + struct sched_group *sg, int cpu) +{ + struct sd_data *sdd = 
sd->private; + struct cpumask *sg_span; + + sg->sgc = *per_cpu_ptr(sdd->sgc, cpu); + if (atomic_inc_return(&sg->sgc->ref) == 1) + build_group_mask(sd, sg); + + /* + * Initialize sgc->capacity such that even if we mess up the + * domains and no possible iteration will get us here, we won't + * die on a /0 trap. + */ + sg_span = sched_group_cpus(sg); + sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); + sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -537,31 +578,14 @@ int group_balance_cpu(struct sched_group *sg) if (!cpumask_test_cpu(i, sched_domain_span(sibling))) continue; - sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), - GFP_KERNEL, cpu_to_node(cpu)); - + sg = build_group_from_child_sched_domain(sibling, cpu); if (!sg) goto fail; sg_span = sched_group_cpus(sg); - if (sibling->child) - cpumask_copy(sg_span, sched_domain_span(sibling->child)); - else - cpumask_set_cpu(i, sg_span); - cpumask_or(covered, covered, sg_span); - sg->sgc = *per_cpu_ptr(sdd->sgc, i); - if (atomic_inc_return(&sg->sgc->ref) == 1) - build_group_mask(sd, sg); - - /* - * Initialize sgc->capacity such that even if we mess up the - * domains and no possible iteration will get us here, we won't - * die on a /0 trap. - */ - sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; + init_overlap_sched_group(sd, sg, i); /* * Make sure the first group of this domain contains the -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups()
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
@ 2017-04-13 14:50   ` Rik van Riel
  2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
  1 sibling, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 14:50 UTC (permalink / raw)
To: Lauro Ramos Venancio, linux-kernel
Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
> Create functions build_group_from_child_sched_domain() and
> init_overlap_sched_group(). No functional change.
>
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [tip:sched/core] sched/topology: Refactor function build_overlap_sched_groups() 2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio 2017-04-13 14:50 ` Rik van Riel @ 2017-05-15 9:02 ` tip-bot for Lauro Ramos Venancio 1 sibling, 0 replies; 29+ messages in thread From: tip-bot for Lauro Ramos Venancio @ 2017-05-15 9:02 UTC (permalink / raw) To: linux-tip-commits Cc: riel, lvenanci, torvalds, mingo, efault, tglx, hpa, linux-kernel, peterz Commit-ID: 8c0334697dc37eb3d6d7632304d3a3662248daac Gitweb: http://git.kernel.org/tip/8c0334697dc37eb3d6d7632304d3a3662248daac Author: Lauro Ramos Venancio <lvenanci@redhat.com> AuthorDate: Thu, 13 Apr 2017 10:56:07 -0300 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 15 May 2017 10:15:22 +0200 sched/topology: Refactor function build_overlap_sched_groups() Create functions build_group_from_child_sched_domain() and init_overlap_sched_group(). No functional change. Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/topology.c | 62 ++++++++++++++++++++++++++++++++++--------------- 1 file changed, 43 insertions(+), 19 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 1b0b4fb..d786d45 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -513,6 +513,47 @@ int group_balance_cpu(struct sched_group *sg) return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg)); } +static struct sched_group * +build_group_from_child_sched_domain(struct sched_domain *sd, int cpu) +{ + struct sched_group *sg; + 
struct cpumask *sg_span; + + sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), + GFP_KERNEL, cpu_to_node(cpu)); + + if (!sg) + return NULL; + + sg_span = sched_group_cpus(sg); + if (sd->child) + cpumask_copy(sg_span, sched_domain_span(sd->child)); + else + cpumask_copy(sg_span, sched_domain_span(sd)); + + return sg; +} + +static void init_overlap_sched_group(struct sched_domain *sd, + struct sched_group *sg, int cpu) +{ + struct sd_data *sdd = sd->private; + struct cpumask *sg_span; + + sg->sgc = *per_cpu_ptr(sdd->sgc, cpu); + if (atomic_inc_return(&sg->sgc->ref) == 1) + build_group_mask(sd, sg); + + /* + * Initialize sgc->capacity such that even if we mess up the + * domains and no possible iteration will get us here, we won't + * die on a /0 trap. + */ + sg_span = sched_group_cpus(sg); + sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); + sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -537,31 +578,14 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) if (!cpumask_test_cpu(i, sched_domain_span(sibling))) continue; - sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), - GFP_KERNEL, cpu_to_node(cpu)); - + sg = build_group_from_child_sched_domain(sibling, cpu); if (!sg) goto fail; sg_span = sched_group_cpus(sg); - if (sibling->child) - cpumask_copy(sg_span, sched_domain_span(sibling->child)); - else - cpumask_set_cpu(i, sg_span); - cpumask_or(covered, covered, sg_span); - sg->sgc = *per_cpu_ptr(sdd->sgc, i); - if (atomic_inc_return(&sg->sgc->ref) == 1) - build_group_mask(sd, sg); - - /* - * Initialize sgc->capacity such that even if we mess up the - * domains and no possible iteration will get us here, we won't - * die on a /0 trap. 
- */ - sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; + init_overlap_sched_group(sd, sg, i); /* * Make sure the first group of this domain contains the ^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
@ 2017-04-13 13:56 ` Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
  ` (2 more replies)
  2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
  2 siblings, 3 replies; 29+ messages in thread
From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw)
To: linux-kernel
Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Lauro Ramos Venancio

Currently, on a 4-node NUMA machine with ring topology, two sched
groups are generated for the last NUMA sched domain. One group has the
CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
scheduler is unable to directly move tasks between these nodes. In the
worst scenario, when a set of tasks is bound to nodes 1 and 3, the
performance is severely impacted because just one node is used while the
other node remains idle.

This problem also affects machines with more NUMA nodes. For instance,
currently, the scheduler is unable to directly move tasks between some
node pairs 2 hops apart on an 8-node machine with mesh topology.

This bug was reported in the paper [1] as "The Scheduling Group
Construction bug".

This patch constructs the sched groups from each CPU's perspective. So,
on a 4-node machine with ring topology, while nodes 0 and 2 keep the
same groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 get new groups
[(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any nodes 2 hops
apart.
SPECjbb2005 results on an 8-node NUMA machine with mesh topology:

Threads    before            after              %
           mean    stddev    mean     stddev
      1    22801   1950      27059    1367     +19%
      8    146008  50782     209193   826      +43%
     32    351030  105111    522445   9051     +49%
     48    365835  116571    594905   3314     +63%

[1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
---
 kernel/sched/topology.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d786d45..d0302ad 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -557,14 +557,24 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 static int
 build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 {
-	struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg;
+	struct sched_group *last = NULL, *sg;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
 
-	cpumask_clear(covered);
+	sg = build_group_from_child_sched_domain(sd, cpu);
+	if (!sg)
+		return -ENOMEM;
+
+	init_overlap_sched_group(sd, sg, cpu);
+
+	sd->groups = sg;
+	last = sg;
+	sg->next = sg;
+
+	cpumask_copy(covered, sched_group_cpus(sg));
 
 	for_each_cpu(i, span) {
 		struct cpumask *sg_span;
@@ -587,28 +597,15 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 		init_overlap_sched_group(sd, sg, i);
 
-		/*
-		 * Make sure the first group of this domain contains the
-		 * canonical balance CPU. Otherwise the sched_domain iteration
-		 * breaks. See update_sg_lb_stats().
-		 */
-		if ((!groups && cpumask_test_cpu(cpu, sg_span)) ||
-		    group_balance_cpu(sg) == cpu)
-			groups = sg;
-
-		if (!first)
-			first = sg;
-		if (last)
-			last->next = sg;
+		last->next = sg;
 		last = sg;
-		last->next = first;
+		sg->next = sd->groups;
 	}
-	sd->groups = groups;
 
 	return 0;
 
 fail:
-	free_sched_groups(first, 0);
+	free_sched_groups(sd->groups, 0);
 	return -ENOMEM;
 }

--
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
@ 2017-04-13 15:16   ` Rik van Riel
  2017-04-13 15:48   ` Peter Zijlstra
  2017-04-14 11:38   ` Peter Zijlstra
  2 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 15:16 UTC (permalink / raw)
To: Lauro Ramos Venancio, linux-kernel
Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote:
>
> SPECjbb2005 results on an 8 NUMA nodes machine with mesh topology
>
> Threads    before            after              %
>            mean    stddev    mean     stddev
>       1    22801   1950      27059    1367     +19%
>       8    146008  50782     209193   826      +43%
>      32    351030  105111    522445   9051     +49%
>      48    365835  116571    594905   3314     +63%

Impressive!

> [1] http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
>
> Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
@ 2017-04-13 15:48   ` Peter Zijlstra
  2017-04-13 20:21     ` Lauro Venancio
  2017-04-14 11:38   ` Peter Zijlstra
  2 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-13 15:48 UTC (permalink / raw)
To: Lauro Ramos Venancio
Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> Currently, on a 4 nodes NUMA machine with ring topology, two sched
> groups are generated for the last NUMA sched domain. One group has the
> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
> 1, 2 and 3. As CPUs from nodes 1 and 3 belong to both groups, the
> scheduler is unable to directly move tasks between these nodes. In the
> worst scenario, when a set of tasks is bound to nodes 1 and 3, the
> performance is severely impacted because just one node is used while the
> other node remains idle.

I feel a picture would be ever so much clearer.

> This patch constructs the sched groups from each CPU's perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2 hops
> apart.

So I still have no idea what specifically goes wrong and how this fixes
it. Changelog is impenetrable.

"From each CPU's perspective" doesn't really help, there already is a
for_each_cpu() in.

Also, since I'm not sure what happened to the 4 node system, I cannot
begin to imagine what would happen on the 8 node one.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 15:48 ` Peter Zijlstra
@ 2017-04-13 20:21   ` Lauro Venancio
  2017-04-13 21:06     ` Lauro Venancio
  0 siblings, 1 reply; 29+ messages in thread
From: Lauro Venancio @ 2017-04-13 20:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belong to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks is bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU's perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2 hops
>> apart.
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.

On a 4-node machine with ring topology, the last sched domain level
contains groups with 3 NUMA nodes each. So we have four possible groups:
(0, 1, 2) (1, 2, 3) (2, 3, 0) (3, 0, 1). As we need just two groups to
fill the sched domain, currently, the groups (3, 0, 1) and (1, 2, 3) are
used for all CPUs. The problem with this is that nodes 1 and 3 belong to
both groups, making it impossible to move tasks between these two nodes.

This patch uses different groups depending on the CPU they are installed
on. So the CPUs of nodes 0 and 2 keep the same groups as before: (3, 0, 1)
and (1, 2, 3). The CPUs of nodes 1 and 3 use the new groups: (0, 1, 2)
and (2, 3, 0). So the first pair of groups allows movement between nodes
0 and 2, and the second pair allows movement between nodes 1 and 3.

I will improve the changelog.

> "From each CPU's perspective" doesn't really help, there already is a
> for_each_cpu() in.

The for_each_cpu() is used to iterate across all sched domain CPUs. It
doesn't consider the CPU where the groups are being installed (parameter
cpu in build_overlap_sched_groups()). Currently, the parameter cpu is
used just for memory allocation and for ordering the groups; it doesn't
change the groups that are chosen. This patch uses the parameter cpu to
choose the first group, also changing, as a consequence, the second group.

> Also, since I'm not sure what happened to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 20:21 ` Lauro Venancio
@ 2017-04-13 21:06   ` Lauro Venancio
  2017-04-13 23:38     ` Rik van Riel
  0 siblings, 1 reply; 29+ messages in thread
From: Lauro Venancio @ 2017-04-13 21:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On 04/13/2017 05:21 PM, Lauro Venancio wrote:
> On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
>> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>>> groups are generated for the last NUMA sched domain. One group has the
>>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>>> 1, 2 and 3. As CPUs from nodes 1 and 3 belong to both groups, the
>>> scheduler is unable to directly move tasks between these nodes. In the
>>> worst scenario, when a set of tasks is bound to nodes 1 and 3, the
>>> performance is severely impacted because just one node is used while the
>>> other node remains idle.
>> I feel a picture would be ever so much clearer.
>>
>>> This patch constructs the sched groups from each CPU's perspective. So, on
>>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2 hops
>>> apart.
>> So I still have no idea what specifically goes wrong and how this fixes
>> it. Changelog is impenetrable.
> On a 4-node machine with ring topology, the last sched domain level
> contains groups with 3 NUMA nodes each. So we have four possible groups:
> (0, 1, 2) (1, 2, 3) (2, 3, 0) (3, 0, 1). As we need just two groups to
> fill the sched domain, currently, the groups (3, 0, 1) and (1, 2, 3) are
> used for all CPUs.
> The problem with this is that nodes 1 and 3 belong to
> both groups, making it impossible to move tasks between these two nodes.
>
> This patch uses different groups depending on the CPU they are installed
> on. So the CPUs of nodes 0 and 2 keep the same groups as before: (3, 0, 1)
> and (1, 2, 3). The CPUs of nodes 1 and 3 use the new groups: (0, 1, 2)
> and (2, 3, 0). So the first pair of groups allows movement between nodes
> 0 and 2, and the second pair allows movement between nodes 1 and 3.
>
> I will improve the changelog.
>
>> "From each CPU's perspective" doesn't really help, there already is a
>> for_each_cpu() in.
> The for_each_cpu() is used to iterate across all sched domain CPUs. It
> doesn't consider the CPU where the groups are being installed (parameter
> cpu in build_overlap_sched_groups()). Currently, the parameter cpu is
> used just for memory allocation and for ordering the groups; it doesn't
> change the groups that are chosen. This patch uses the parameter cpu to
> choose the first group, also changing, as a consequence, the second group.
>> Also, since I'm not sure what happened to the 4 node system, I cannot
>> begin to imagine what would happen on the 8 node one.

Just for clarification, I am sending the node distance tables for the
two most common topologies affected by this issue.

4 nodes, ring topology
node distances:
node   0   1   2   3
  0:  10  20  30  20
  1:  20  10  20  30
  2:  30  20  10  20
  3:  20  30  20  10

8 nodes, mesh topology
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  16  22  22  16  22  16
  2:  16  16  10  16  16  16  16  22
  3:  22  22  16  10  16  16  22  16
  4:  16  22  16  16  10  16  16  16
  5:  22  16  16  16  16  10  22  22
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  16  22  16  10

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 21:06 ` Lauro Venancio
@ 2017-04-13 23:38   ` Rik van Riel
  2017-04-14 10:48     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2017-04-13 23:38 UTC (permalink / raw)
To: lvenanci, Peter Zijlstra
Cc: linux-kernel, lwang, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, 2017-04-13 at 18:06 -0300, Lauro Venancio wrote:
> Just for clarification, I am sending the node distance tables for
> the two most common topologies affected by this issue.

What do the sched groups look like for these topologies,
before and after your patch series?

> 4 nodes, ring topology
> node distances:
> node   0   1   2   3
>   0:  10  20  30  20
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  20  30  20  10
>
> 8 nodes, mesh topology
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  22  16  22  16  22
>   1:  16  10  16  22  22  16  22  16
>   2:  16  16  10  16  16  16  16  22
>   3:  22  22  16  10  16  16  22  16
>   4:  16  22  16  16  10  16  16  16
>   5:  22  16  16  16  16  10  22  22
>   6:  16  22  16  22  16  22  10  16
>   7:  22  16  22  16  16  22  16  10

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 23:38 ` Rik van Riel
@ 2017-04-14 10:48   ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 10:48 UTC (permalink / raw)
To: Rik van Riel
Cc: lvenanci, linux-kernel, lwang, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 07:38:05PM -0400, Rik van Riel wrote:
> What do the sched groups look like for these topologies,
> before and after your patch series?
>
> > 4 nodes, ring topology
> > node distances:
> > node   0   1   2   3
> >   0:  10  20  30  20
> >   1:  20  10  20  30
> >   2:  30  20  10  20
> >   3:  20  30  20  10

kvm -smp 4 -m 4G -display none -monitor null -serial stdio \
    -kernel defconfig-build/arch/x86/boot/bzImage \
    -append "sched_debug debug ignore_loglevel earlyprintk=serial,ttyS0,115200,keep numa=fake=4:10,20,30,20,20,10,20,30,30,20,10,20,20,30,20,10,0"

(FWIW, that's defconfig+kvmconfig+SCHED_DEBUG=y+NUMA_EMU=y)

Gives me:

[    0.075004] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[    0.076767] CPU0 attaching sched-domain:
[    0.077003]  domain 0: span 0-1,3 level NUMA
[    0.078002]   groups: 0 1 3
[    0.079002]   domain 1: span 0-3 level NUMA
[    0.080002]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.081005] CPU1 attaching sched-domain:
[    0.082003]  domain 0: span 0-2 level NUMA
[    0.083002]   groups: 1 2 0
[    0.084002]   domain 1: span 0-3 level NUMA
[    0.085002]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.086004] CPU2 attaching sched-domain:
[    0.087002]  domain 0: span 1-3 level NUMA
[    0.088002]   groups: 2 3 1
[    0.089002]   domain 1: span 0-3 level NUMA
[    0.090002]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.091004] CPU3 attaching sched-domain:
[    0.092002]  domain 0: span 0,2-3 level NUMA
[    0.093002]   groups: 3 0 2
[    0.094002]   domain 1: span 0-3 level NUMA
[    0.095002]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.096004] span: 0-3 (max cpu_capacity = 1024)

With patches it looks like:

[    0.080006] smpboot: Total of 4 processors activated (22345.79 BogoMIPS)
[    0.082545] CPU0 attaching sched-domain:
[    0.083007]  domain 0: span 0-1,3 level NUMA
[    0.084004]   groups: 0 1 3
[    0.085004]   domain 1: span 0-3 level NUMA
[    0.086004]    groups: 0-1,3 (cpu_capacity = 3072) 1-3 (cpu_capacity = 3072)
[    0.087007] CPU1 attaching sched-domain:
[    0.088004]  domain 0: span 0-2 level NUMA
[    0.089004]   groups: 1 0 2
[    0.090004]   domain 1: span 0-3 level NUMA
[    0.091003]    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
[    0.092008] CPU2 attaching sched-domain:
[    0.093004]  domain 0: span 1-3 level NUMA
[    0.094004]   groups: 2 1 3
[    0.095004]   domain 1: span 0-3 level NUMA
[    0.096004]    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
[    0.097007] CPU3 attaching sched-domain:
[    0.098004]  domain 0: span 0,2-3 level NUMA
[    0.099003]   groups: 3 0 2
[    0.100004]   domain 1: span 0-3 level NUMA
[    0.101003]    groups: 0,2-3 (cpu_capacity = 3072) 0-2 (cpu_capacity = 3072)
[    0.102007] span: 0-3 (max cpu_capacity = 1024)

Now let me try and reverse engineer those patches ..

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
  2017-04-13 15:16   ` Rik van Riel
  2017-04-13 15:48   ` Peter Zijlstra
@ 2017-04-14 11:38   ` Peter Zijlstra
  2017-04-14 12:20     ` Peter Zijlstra
  2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
  2 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 11:38 UTC (permalink / raw)
To: Lauro Ramos Venancio
Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
> This patch constructs the sched groups from each CPU's perspective. So, on
> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2 hops
> apart.

Ah,.. so after drawing pictures I see what went wrong; duh :-(

An equivalent patch would be (if for_each_cpu_wrap() were exposed):

@@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, wrap;
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {
+	for_each_cpu_wrap(i, span, cpu, wrap) {
 		struct cpumask *sg_span;
 
 		if (cpumask_test_cpu(i, covered))

We need to start iterating at @cpu, not start at 0 every time.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
  2017-04-14 11:38 ` Peter Zijlstra
@ 2017-04-14 12:20   ` Peter Zijlstra
  2017-05-15  9:03     ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
  2017-04-14 16:58   ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2017-04-14 12:20 UTC (permalink / raw)
To: Lauro Ramos Venancio
Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar

On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote:
> An equivalent patch would be (if for_each_cpu_wrap() were exposed):

---
Subject: sched,cpumask: Export for_each_cpu_wrap()

More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/cpumask.h | 16 ++++++++++++++++
 kernel/sched/fair.c     | 45 ++++-----------------------------------------
 lib/cpumask.c           | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 41 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 96f1e88..4b87b7b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -236,6 +236,22 @@ unsigned int cpumask_local_spread(unsigned int i, int node);
 		(cpu) = cpumask_next_zero((cpu), (mask)),	\
 		(cpu) < nr_cpu_ids;)
 
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)					\
+	for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);	\
+	     (cpu) < nr_cpumask_bits;						\
+	     (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..d89d700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5640,43 +5640,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
-/*
- * Implement a for_each_cpu() variant that starts the scan at a given cpu
- * (@start), and wraps around.
- *
- * This is used to scan for idle CPUs; such that not all CPUs looking for an
- * idle CPU find the same CPU. The down-side is that tasks tend to cycle
- * through the LLC domain.
- *
- * Especially tbench is found sensitive to this.
- */
-
-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{
-	int next;
-
-again:
-	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
-
-	if (*wrapped) {
-		if (next >= start)
-			return nr_cpumask_bits;
-	} else {
-		if (next >= nr_cpumask_bits) {
-			*wrapped = 1;
-			n = -1;
-			goto again;
-		}
-	}
-
-	return next;
-}
-
-#define for_each_cpu_wrap(cpu, mask, start, wrap)				\
-	for ((wrap) = 0, (cpu) = (start)-1;					\
-		(cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)),	\
-		(cpu) < nr_cpumask_bits; )
-
 #ifdef CONFIG_SCHED_SMT
 
 static inline void set_idle_cores(int cpu, int val)
@@ -5736,7 +5699,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu, wrap;
+	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -5746,7 +5709,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
 
-	for_each_cpu_wrap(core, cpus, target, wrap) {
+	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -5812,7 +5775,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, wrap;
+	int cpu;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -5829,7 +5792,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 81dedaa..4731a08 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 }
 EXPORT_SYMBOL(cpumask_any_but);
 
+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+	int next;
+
+again:
+	next = cpumask_next(n, mask);
+
+	if (wrap && n < start && next >= start) {
+		return nr_cpumask_bits;
+
+	} else if (next >= nr_cpumask_bits) {
+		wrap = true;
+		n = -1;
+		goto again;
+	}
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-04-14 12:20 ` Peter Zijlstra @ 2017-05-15 9:03 ` tip-bot for Peter Zijlstra 2017-05-17 10:53 ` hackbench vs select_idle_sibling; was: " Peter Zijlstra 0 siblings, 1 reply; 29+ messages in thread From: tip-bot for Peter Zijlstra @ 2017-05-15 9:03 UTC (permalink / raw) To: linux-tip-commits Cc: mingo, tglx, peterz, riel, efault, hpa, torvalds, lvenanci, linux-kernel Commit-ID: c743f0a5c50f2fcbc628526279cfa24f3dabe182 Gitweb: http://git.kernel.org/tip/c743f0a5c50f2fcbc628526279cfa24f3dabe182 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Fri, 14 Apr 2017 14:20:05 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 15 May 2017 10:15:23 +0200 sched/fair, cpumask: Export for_each_cpu_wrap() More users for for_each_cpu_wrap() have appeared. Promote the construct to generic cpumask interface. The implementation is slightly modified to reduce arguments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Lauro Ramos Venancio <lvenanci@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/linux/cpumask.h | 17 +++++++++++++++++ kernel/sched/fair.c | 45 ++++----------------------------------------- lib/cpumask.c | 32 ++++++++++++++++++++++++++++++++ 3 files changed, 53 insertions(+), 41 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 2404ad2..a21b1fb 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -236,6 +236,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node); (cpu) = cpumask_next_zero((cpu), (mask)), \ (cpu) < nr_cpu_ids;) +extern int cpumask_next_wrap(int n, const 
struct cpumask *mask, int start, bool wrap); + +/** + * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location + * @cpu: the (optionally unsigned) integer iterator + * @mask: the cpumask pointer + * @start: the start location + * + * The implementation does not assume any bit in @mask is set (including @start). + * + * After the loop, cpu is >= nr_cpu_ids. + */ +#define for_each_cpu_wrap(cpu, mask, start) \ + for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false); \ + (cpu) < nr_cpumask_bits; \ + (cpu) = cpumask_next_wrap((cpu), (mask), (start), true)) + /** * for_each_cpu_and - iterate over every cpu in both masks * @cpu: the (optionally unsigned) integer iterator diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f1825d..f80c825 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5640,43 +5640,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; } -/* - * Implement a for_each_cpu() variant that starts the scan at a given cpu - * (@start), and wraps around. - * - * This is used to scan for idle CPUs; such that not all CPUs looking for an - * idle CPU find the same CPU. The down-side is that tasks tend to cycle - * through the LLC domain. - * - * Especially tbench is found sensitive to this. 
- */ - -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) -{ - int next; - -again: - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); - - if (*wrapped) { - if (next >= start) - return nr_cpumask_bits; - } else { - if (next >= nr_cpumask_bits) { - *wrapped = 1; - n = -1; - goto again; - } - } - - return next; -} - -#define for_each_cpu_wrap(cpu, mask, start, wrap) \ - for ((wrap) = 0, (cpu) = (start)-1; \ - (cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)), \ - (cpu) < nr_cpumask_bits; ) - #ifdef CONFIG_SCHED_SMT static inline void set_idle_cores(int cpu, int val) @@ -5736,7 +5699,7 @@ unlock: static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); - int core, cpu, wrap; + int core, cpu; if (!static_branch_likely(&sched_smt_present)) return -1; @@ -5746,7 +5709,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed); - for_each_cpu_wrap(core, cpus, target, wrap) { + for_each_cpu_wrap(core, cpus, target) { bool idle = true; for_each_cpu(cpu, cpu_smt_mask(core)) { @@ -5812,7 +5775,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle = this_rq()->avg_idle; u64 time, cost; s64 delta; - int cpu, wrap; + int cpu; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -5829,7 +5792,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; if (idle_cpu(cpu)) diff --git a/lib/cpumask.c b/lib/cpumask.c index 81dedaa..4731a08 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -43,6 +43,38 @@ int cpumask_any_but(const struct cpumask *mask, 
unsigned int cpu) } EXPORT_SYMBOL(cpumask_any_but); +/** + * cpumask_next_wrap - helper to implement for_each_cpu_wrap + * @n: the cpu prior to the place to search + * @mask: the cpumask pointer + * @start: the start point of the iteration + * @wrap: assume @n crossing @start terminates the iteration + * + * Returns >= nr_cpu_ids on completion + * + * Note: the @wrap argument is required for the start condition when + * we cannot assume @start is set in @mask. + */ +int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap) +{ + int next; + +again: + next = cpumask_next(n, mask); + + if (wrap && n < start && next >= start) { + return nr_cpumask_bits; + + } else if (next >= nr_cpumask_bits) { + wrap = true; + n = -1; + goto again; + } + + return next; +} +EXPORT_SYMBOL(cpumask_next_wrap); + /* These are not inline because of header tangles. */ #ifdef CONFIG_CPUMASK_OFFSTACK /** ^ permalink raw reply related [flat|nested] 29+ messages in thread
* hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-05-15 9:03 ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra @ 2017-05-17 10:53 ` Peter Zijlstra 2017-05-17 12:46 ` Matt Fleming ` (3 more replies) 0 siblings, 4 replies; 29+ messages in thread From: Peter Zijlstra @ 2017-05-17 10:53 UTC (permalink / raw) To: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci Cc: xiaolong.ye, kitsunyan, clm, matt On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote: > sched/fair, cpumask: Export for_each_cpu_wrap() > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) > -{ > - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); > -} OK, so this patch fixed an actual bug in the for_each_cpu_wrap() implementation. The above 'n+1' should be 'n', and the effect is that it'll skip over CPUs, potentially resulting in an iteration that only sees every other CPU (for a fully contiguous mask). This in turn causes hackbench to further suffer from the regression introduced by commit: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") So it's well past time to fix this. Where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping the scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as badly as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. I'm also hoping it will fix/preserve kitsunyan's interactivity issue. Please test.. 
Cc: xiaolong.ye@intel.com Cc: kitsunyan <kitsunyan@inbox.ru> Cc: Chris Mason <clm@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> --- kernel/sched/fair.c | 21 ++++++++++++++++----- kernel/sched/features.h | 1 + 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 219fe58..8c1afa0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5793,27 +5793,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target) { struct sched_domain *this_sd; - u64 avg_cost, avg_idle = this_rq()->avg_idle; + u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu; + int cpu, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) return -1; - avg_cost = this_sd->avg_scan_cost; - /* * Due to large variance we need a large fuzz factor; hackbench in * particularly is sensitive here. */ - if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost) + avg_idle = this_rq()->avg_idle / 512; + avg_cost = this_sd->avg_scan_cost + 1; + + if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost) return -1; + if (sched_feat(SIS_PROP)) { + u64 span_avg = sd->span_weight * avg_idle; + if (span_avg > 4*avg_cost) + nr = div_u64(span_avg, avg_cost); + else + nr = 4; + } + time = local_clock(); for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + if (!--nr) + return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; if (idle_cpu(cpu)) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index dc4d148..d3fb155 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true) * When doing wakeups, attempt to limit superfluous scans of the LLC domain. 
*/ SCHED_FEAT(SIS_AVG_CPU, false) +SCHED_FEAT(SIS_PROP, true) /* * Issue a WARN when we do multiple update_rq_clock() calls ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-05-17 10:53 ` hackbench vs select_idle_sibling; was: " Peter Zijlstra @ 2017-05-17 12:46 ` Matt Fleming 2017-05-17 14:49 ` Chris Mason ` (2 subsequent siblings) 3 siblings, 0 replies; 29+ messages in thread From: Matt Fleming @ 2017-05-17 12:46 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci, xiaolong.ye, kitsunyan, clm On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote: > On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote: > > sched/fair, cpumask: Export for_each_cpu_wrap() > > > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) > > -{ > > > - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); > > > -} > > OK, so this patch fixed an actual bug in the for_each_cpu_wrap() > implementation. The above 'n+1' should be 'n', and the effect is that > it'll skip over CPUs, potentially resulting in an iteration that only > sees every other CPU (for a fully contiguous mask). > > This in turn causes hackbench to further suffer from the regression > introduced by commit: > > 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") > > So its well past time to fix this. > > Where the old scheme was a cliff-edge throttle on idle scanning, this > introduces a more gradual approach. Instead of stopping to scan > entirely, we limit how many CPUs we scan. > > Initial benchmarks show that it mostly recovers hackbench while not > hurting anything else, except Mason's schbench, but not as bad as the > old thing. > > It also appears to recover the tbench high-end, which also suffered like > hackbench. > > I'm also hoping it will fix/preserve kitsunyan's interactivity issue. > > Please test.. Tests queued up at SUSE. I'll let you know the results as soon as they're ready -- it'll be at least a couple of days. 
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-05-17 10:53 ` hackbench vs select_idle_sibling; was: " Peter Zijlstra 2017-05-17 12:46 ` Matt Fleming @ 2017-05-17 14:49 ` Chris Mason 2017-05-19 15:00 ` Matt Fleming 2017-06-08 9:22 ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra 3 siblings, 0 replies; 29+ messages in thread From: Chris Mason @ 2017-05-17 14:49 UTC (permalink / raw) To: Peter Zijlstra, mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci Cc: xiaolong.ye, kitsunyan, matt On 05/17/2017 06:53 AM, Peter Zijlstra wrote: > On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote: >> sched/fair, cpumask: Export for_each_cpu_wrap() > >> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) >> -{ > >> - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); > >> -} > > OK, so this patch fixed an actual bug in the for_each_cpu_wrap() > implementation. The above 'n+1' should be 'n', and the effect is that > it'll skip over CPUs, potentially resulting in an iteration that only > sees every other CPU (for a fully contiguous mask). > > This in turn causes hackbench to further suffer from the regression > introduced by commit: > > 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") > > So its well past time to fix this. > > Where the old scheme was a cliff-edge throttle on idle scanning, this > introduces a more gradual approach. Instead of stopping to scan > entirely, we limit how many CPUs we scan. > > Initial benchmarks show that it mostly recovers hackbench while not > hurting anything else, except Mason's schbench, but not as bad as the > old thing. > > It also appears to recover the tbench high-end, which also suffered like > hackbench. > > I'm also hoping it will fix/preserve kitsunyan's interactivity issue. > > Please test.. 
We'll get some tests going here too. -chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-05-17 10:53 ` hackbench vs select_idle_sibling; was: " Peter Zijlstra 2017-05-17 12:46 ` Matt Fleming 2017-05-17 14:49 ` Chris Mason @ 2017-05-19 15:00 ` Matt Fleming 2017-06-05 13:00 ` Matt Fleming 2017-06-08 9:22 ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra 3 siblings, 1 reply; 29+ messages in thread From: Matt Fleming @ 2017-05-19 15:00 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci, xiaolong.ye, kitsunyan, clm On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote: > > Please test.. Results are still coming in but things do look better with your patch applied. It does look like there's a regression when running hackbench in process mode and when the CPUs are not fully utilised, e.g. check this out:

hackbench-process-pipes
                  4.4.68            4.4.68                      4.4.68                 4.4.68
                  sles12-sp3        select-idle-cpu-aggressive  for-each-cpu-wrap-fix  latest-hackbench-fix
Amean    1        0.8853 (  0.00%)  1.2160 (-37.35%)            1.0350 (-16.91%)       1.1853 (-33.89%)

This machine has 80 CPUs and that's a 40 process workload. Here's the key:

select-idle-cpu-aggressive: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
for-each-cpu-wrap-fix: c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")
latest-hackbench-fix: this patch

But those results definitely look to be an exception. Here's the same machine running the same number of tasks but with pthreads,

hackbench-thread-pipes
                  4.4.68            4.4.68                      4.4.68                 4.4.68
                  sles12-sp3        select-idle-cpu-aggressive  for-each-cpu-wrap-fix  latest-hackbench-fix
Amean    1        0.7427 (  0.00%)  0.9760 (-31.42%)            1.1907 (-60.32%)       0.7643 ( -2.92%)

Nice win.
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-05-19 15:00 ` Matt Fleming @ 2017-06-05 13:00 ` Matt Fleming 2017-06-06 9:21 ` Peter Zijlstra 0 siblings, 1 reply; 29+ messages in thread From: Matt Fleming @ 2017-06-05 13:00 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci, xiaolong.ye, kitsunyan, clm On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote: > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote: > > > > Please test.. > > Results are still coming in but things do look better with your patch > applied. > > It does look like there's a regression when running hackbench in > process mode and when the CPUs are not fully utilised, e.g. check this > out: This turned out to be a false positive; your patch improves things as far as I can see. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-06-05 13:00 ` Matt Fleming @ 2017-06-06 9:21 ` Peter Zijlstra 2017-06-09 17:52 ` Chris Mason 0 siblings, 1 reply; 29+ messages in thread From: Peter Zijlstra @ 2017-06-06 9:21 UTC (permalink / raw) To: Matt Fleming Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci, xiaolong.ye, kitsunyan, clm On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote: > On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote: > > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote: > > > > > > Please test.. > > > > Results are still coming in but things do look better with your patch > > applied. > > > > It does look like there's a regression when running hackbench in > > process mode and when the CPUs are not fully utilised, e.g. check this > > out: > > This turned out to be a false positive; your patch improves things as > far as I can see. Hooray, I'll move it to a part of the queue intended for merging. Thanks! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() 2017-06-06 9:21 ` Peter Zijlstra @ 2017-06-09 17:52 ` Chris Mason 0 siblings, 0 replies; 29+ messages in thread From: Chris Mason @ 2017-06-09 17:52 UTC (permalink / raw) To: Peter Zijlstra, Matt Fleming Cc: mingo, tglx, riel, hpa, efault, linux-kernel, torvalds, lvenanci, xiaolong.ye, kitsunyan On 06/06/2017 05:21 AM, Peter Zijlstra wrote: > On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote: >> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote: >>> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote: >>>> >>>> Please test.. >>> >>> Results are still coming in but things do look better with your patch >>> applied. >>> >>> It does look like there's a regression when running hackbench in >>> process mode and when the CPUs are not fully utilised, e.g. check this >>> out: >> >> This turned out to be a false positive; your patch improves things as >> far as I can see. > > Hooray, I'll move it to a part of the queue intended for merging. It's a little late, but Roman Gushchin helped get some runs of this with our production workload. The patch is ever so slightly better. Thanks! -chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() 2017-05-17 10:53 ` hackbench vs select_idle_sibling; was: " Peter Zijlstra ` (2 preceding siblings ...) 2017-05-19 15:00 ` Matt Fleming @ 2017-06-08 9:22 ` tip-bot for Peter Zijlstra 3 siblings, 0 replies; 29+ messages in thread From: tip-bot for Peter Zijlstra @ 2017-06-08 9:22 UTC (permalink / raw) To: linux-tip-commits Cc: clm, torvalds, efault, mingo, kitsunyan, peterz, linux-kernel, tglx, matt, hpa Commit-ID: 1ad3aaf3fcd2444406628a19a9b9e0922b95e2d4 Gitweb: http://git.kernel.org/tip/1ad3aaf3fcd2444406628a19a9b9e0922b95e2d4 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Wed, 17 May 2017 12:53:50 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Thu, 8 Jun 2017 10:25:17 +0200 sched/core: Implement new approach to scale select_idle_cpu() Hackbench recently suffered a bunch of pain, first by commit: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") and then by commit: c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()") which fixed a bug in the initial for_each_cpu_wrap() implementation that made select_idle_cpu() even more expensive. The bug was that it would skip over CPUs when bits were consecutive in the bitmask. This however gave me an idea to fix select_idle_cpu(); where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. 
Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Cc: kitsunyan <kitsunyan@inbox.ru> Cc: linux-kernel@vger.kernel.org Cc: lvenanci@redhat.com Cc: riel@redhat.com Cc: xiaolong.ye@intel.com Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 21 ++++++++++++++++----- kernel/sched/features.h | 1 + 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 47a0c55..396bca9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5794,27 +5794,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target) { struct sched_domain *this_sd; - u64 avg_cost, avg_idle = this_rq()->avg_idle; + u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu; + int cpu, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) return -1; - avg_cost = this_sd->avg_scan_cost; - /* * Due to large variance we need a large fuzz factor; hackbench in * particularly is sensitive here. 
*/ - if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost) + avg_idle = this_rq()->avg_idle / 512; + avg_cost = this_sd->avg_scan_cost + 1; + + if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost) return -1; + if (sched_feat(SIS_PROP)) { + u64 span_avg = sd->span_weight * avg_idle; + if (span_avg > 4*avg_cost) + nr = div_u64(span_avg, avg_cost); + else + nr = 4; + } + time = local_clock(); for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + if (!--nr) + return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) continue; if (idle_cpu(cpu)) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index dc4d148..d3fb155 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true) * When doing wakeups, attempt to limit superfluous scans of the LLC domain. */ SCHED_FEAT(SIS_AVG_CPU, false) +SCHED_FEAT(SIS_PROP, true) /* * Issue a WARN when we do multiple update_rq_clock() calls ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology 2017-04-14 11:38 ` Peter Zijlstra 2017-04-14 12:20 ` Peter Zijlstra @ 2017-04-14 16:58 ` Peter Zijlstra 2017-04-17 14:40 ` Lauro Venancio 1 sibling, 1 reply; 29+ messages in thread From: Peter Zijlstra @ 2017-04-14 16:58 UTC (permalink / raw) To: Lauro Ramos Venancio Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote: > On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote: > > This patch constructs the sched groups from each CPU perspective. So, on > > a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same > > groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups > > [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops > > apart. > > Ah,.. so after drawing pictures I see what went wrong; duh :-( > > An equivalent patch would be (if for_each_cpu_wrap() were exposed): > > @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) > struct cpumask *covered = sched_domains_tmpmask; > struct sd_data *sdd = sd->private; > struct sched_domain *sibling; > - int i; > + int i, wrap; > > cpumask_clear(covered); > > - for_each_cpu(i, span) { > + for_each_cpu_wrap(i, span, cpu, wrap) { > struct cpumask *sg_span; > > if (cpumask_test_cpu(i, covered)) > > > We need to start iterating at @cpu, not start at 0 every time. > > OK, please have a look here: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology 2017-04-14 16:58 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra @ 2017-04-17 14:40 ` Lauro Venancio 0 siblings, 0 replies; 29+ messages in thread From: Lauro Venancio @ 2017-04-17 14:40 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar On 04/14/2017 01:58 PM, Peter Zijlstra wrote: > On Fri, Apr 14, 2017 at 01:38:13PM +0200, Peter Zijlstra wrote: >> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote: >>> This patch constructs the sched groups from each CPU perspective. So, on >>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same >>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups >>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops >>> apart. >> Ah,.. so after drawing pictures I see what went wrong; duh :-( >> >> An equivalent patch would be (if for_each_cpu_wrap() were exposed): >> >> @@ -521,11 +588,11 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) >> struct cpumask *covered = sched_domains_tmpmask; >> struct sd_data *sdd = sd->private; >> struct sched_domain *sibling; >> - int i; >> + int i, wrap; >> >> cpumask_clear(covered); >> >> - for_each_cpu(i, span) { >> + for_each_cpu_wrap(i, span, cpu, wrap) { >> struct cpumask *sg_span; >> >> if (cpumask_test_cpu(i, covered)) >> >> >> We need to start iterating at @cpu, not start at 0 every time. >> >> > OK, please have a look here: > > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core Looks good, but please hold these patches until patch 3 is applied. Without it, the sched_group_capacity (sg->sgc) instance is not selected correctly and we have a significant performance regression on all NUMA machines. I will continue this discussion in the other thread. 
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu 2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio 2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio 2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio @ 2017-04-13 13:56 ` Lauro Ramos Venancio 2017-04-13 15:27 ` Rik van Riel 2017-04-14 16:49 ` Peter Zijlstra 2 siblings, 2 replies; 29+ messages in thread From: Lauro Ramos Venancio @ 2017-04-13 13:56 UTC (permalink / raw) To: linux-kernel Cc: lwang, riel, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Lauro Ramos Venancio Currently, the group balance cpu is the group's first CPU. But with overlapping groups, two different groups can have the same first CPU. This patch uses the group mask to mark all the CPUs that have a particular group as their main sched group. The group balance cpu is the first group CPU that is also in the mask. Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com> --- kernel/sched/topology.c | 76 ++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 62 insertions(+), 14 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d0302ad..7920bbb 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -477,27 +477,31 @@ enum s_alloc { }; /* - * Build an iteration mask that can exclude certain CPUs from the upwards - * domain traversal. + * An overlap sched group may not be present in all CPUs that compose the + * group. So build the mask, marking all the group CPUs where it is present. * * Asymmetric node setups can result in situations where the domain tree is of * unequal depth, make sure to skip domains that already cover the entire * range. 
- * - * In that case build_sched_domains() will have terminated the iteration early - * and our sibling sd spans will be empty. Domains should always include the - * CPU they're built on, so check that. */ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg) { - const struct cpumask *span = sched_domain_span(sd); + const struct cpumask *sg_span = sched_group_cpus(sg); struct sd_data *sdd = sd->private; struct sched_domain *sibling; int i; - for_each_cpu(i, span) { + for_each_cpu(i, sg_span) { sibling = *per_cpu_ptr(sdd->sd, i); - if (!cpumask_test_cpu(i, sched_domain_span(sibling))) + + /* + * Asymmetric node setups: skip domains that are already + * done. + */ + if (!sibling->groups) + continue; + + if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups))) continue; cpumask_set_cpu(i, sched_group_mask(sg)); @@ -513,6 +517,28 @@ int group_balance_cpu(struct sched_group *sg) return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg)); } +/* + * Find the group balance cpu when the group mask is not available yet. 
+ */ +static int find_group_balance_cpu(struct sched_domain *sd, + struct sched_group *sg) +{ + const struct cpumask *sg_span = sched_group_cpus(sg); + struct sd_data *sdd = sd->private; + struct sched_domain *sibling; + int i; + + for_each_cpu(i, sg_span) { + sibling = *per_cpu_ptr(sdd->sd, i); + if (cpumask_equal(sg_span, sched_group_cpus(sibling->groups))) + return i; + } + + WARN(1, "group balance cpu not found."); + return 0; +} + + static struct sched_group * build_group_from_child_sched_domain(struct sched_domain *sd, int cpu) { @@ -554,6 +580,19 @@ static void init_overlap_sched_group(struct sched_domain *sd, sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; } +static void init_overlap_sched_groups(struct sched_domain *sd) +{ + struct sched_group *sg = sd->groups; + int cpu; + + do { + cpu = find_group_balance_cpu(sd, sg); + init_overlap_sched_group(sd, sg, cpu); + + sg = sg->next; + } while (sg != sd->groups); +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -568,8 +607,6 @@ static void init_overlap_sched_group(struct sched_domain *sd, if (!sg) return -ENOMEM; - init_overlap_sched_group(sd, sg, cpu); - sd->groups = sg; last = sg; sg->next = sg; @@ -584,7 +621,12 @@ static void init_overlap_sched_group(struct sched_domain *sd, sibling = *per_cpu_ptr(sdd->sd, i); - /* See the comment near build_group_mask(). */ + /* + * In Asymmetric node setups, build_sched_domains() will have + * terminated the iteration early and our sibling sd spans will + * be empty. Domains should always include the CPU they're + * built on, so check that. 
+ */ if (!cpumask_test_cpu(i, sched_domain_span(sibling))) continue; @@ -595,8 +637,6 @@ static void init_overlap_sched_group(struct sched_domain *sd, sg_span = sched_group_cpus(sg); cpumask_or(covered, covered, sg_span); - init_overlap_sched_group(sd, sg, i); - last->next = sg; last = sg; sg->next = sd->groups; @@ -1449,6 +1489,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, } } + /* Init overlap groups */ + for_each_cpu(i, cpu_map) { + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + if (sd->flags & SD_OVERLAP) + init_overlap_sched_groups(sd); + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu 2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio @ 2017-04-13 15:27 ` Rik van Riel 2017-04-14 16:49 ` Peter Zijlstra 1 sibling, 0 replies; 29+ messages in thread From: Rik van Riel @ 2017-04-13 15:27 UTC (permalink / raw) To: Lauro Ramos Venancio, linux-kernel Cc: lwang, Mike Galbraith, Peter Zijlstra, Thomas Gleixner, Ingo Molnar On Thu, 2017-04-13 at 10:56 -0300, Lauro Ramos Venancio wrote: > Currently, the group balance cpu is the groups's first CPU. But with > overlapping groups, two different groups can have the same first CPU. > > This patch uses the group mask to mark all the CPUs that have a > particular group as its main sched group. The group balance cpu is > the > first group CPU that is also in the mask. > This is not your fault, but this code is really hard to understand. Your comments tell me what the code does, but not really why. > +++ b/kernel/sched/topology.c > @@ -477,27 +477,31 @@ enum s_alloc { > }; > > /* > - * Build an iteration mask that can exclude certain CPUs from the > upwards > - * domain traversal. > + * An overlap sched group may not be present in all CPUs that > compose the > + * group. So build the mask, marking all the group CPUs where it is > present. > * > * Asymmetric node setups can result in situations where the domain > tree is of > * unequal depth, make sure to skip domains that already cover the > entire > * range. > - * > - * In that case build_sched_domains() will have terminated the > iteration early > - * and our sibling sd spans will be empty. Domains should always > include the > - * CPU they're built on, so check that. > */ Why are we doing this? Could the comment explain why things need to be this way? 
> - for_each_cpu(i, span) { > + for_each_cpu(i, sg_span) { > sibling = *per_cpu_ptr(sdd->sd, i); > - if (!cpumask_test_cpu(i, > sched_domain_span(sibling))) > + > + /* > + * Asymmetric node setups: skip domains that are > already > + * done. > + */ > + if (!sibling->groups) > + continue; > + What does this mean? I would really like it if this code was a little better documented. This is not something I can put on you for most of the code, but I can at least ask it for the code you are adding :) The same goes for the rest of this patch. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu 2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio 2017-04-13 15:27 ` Rik van Riel @ 2017-04-14 16:49 ` Peter Zijlstra 2017-04-17 15:34 ` Lauro Venancio 1 sibling, 1 reply; 29+ messages in thread From: Peter Zijlstra @ 2017-04-14 16:49 UTC (permalink / raw) To: Lauro Ramos Venancio Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote: > Currently, the group balance cpu is the groups's first CPU. But with > overlapping groups, two different groups can have the same first CPU. > > This patch uses the group mask to mark all the CPUs that have a > particular group as its main sched group. The group balance cpu is the > first group CPU that is also in the mask. Please give a NUMA configuration and CPU number where this goes wrong. Because only the first group of a domain matters, and with the other thing fixed, I'm not immediately seeing where we go wobbly. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu 2017-04-14 16:49 ` Peter Zijlstra @ 2017-04-17 15:34 ` Lauro Venancio 2017-04-18 12:32 ` Peter Zijlstra 0 siblings, 1 reply; 29+ messages in thread From: Lauro Venancio @ 2017-04-17 15:34 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar On 04/14/2017 01:49 PM, Peter Zijlstra wrote: > On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote: >> Currently, the group balance cpu is the group's first CPU. But with >> overlapping groups, two different groups can have the same first CPU. >> >> This patch uses the group mask to mark all the CPUs that have a >> particular group as its main sched group. The group balance cpu is the >> first group CPU that is also in the mask. > Please give a NUMA configuration and CPU number where this goes wrong. On a 4-node machine with ring topology, the groups (0-1,3 [cpu 0]), (0-2 [cpu 1]) and (0,2-3 [cpu 3]) share the same sched_group_capacity instance when the first group CPU is used to select the sgc. > > Because only the first group of a domain matters, and with the other > thing fixed, I'm not immediately seeing where we go wobbly. Before patch 2, the group balance cpu was implicitly used to select the sched_group_capacity instance. When two different groups had the same balance cpu, they shared the same sched_group_capacity instance. After patch 2, a distinct sched_group_capacity instance is assigned to each group instance. This patch ensures three things: 1) different instances of the same group share the same sched_group_capacity instance. 2) instances of different groups don't share the same sched_group_capacity instance. 3) the group balance cpu must be one of the cpus where the group is installed. I am rebasing this patch on top of your patches. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu 2017-04-17 15:34 ` Lauro Venancio @ 2017-04-18 12:32 ` Peter Zijlstra 0 siblings, 0 replies; 29+ messages in thread From: Peter Zijlstra @ 2017-04-18 12:32 UTC (permalink / raw) To: Lauro Venancio Cc: linux-kernel, lwang, riel, Mike Galbraith, Thomas Gleixner, Ingo Molnar On Mon, Apr 17, 2017 at 12:34:05PM -0300, Lauro Venancio wrote: > This patch ensures three things: > > 1) different instances of the same group share the same > sched_group_capacity instance. > > 2) instances of different groups don't share the same > sched_group_capacity instance. > > 3) the group balance cpu must be one of the cpus where the group is > installed. > > > I am rebasing this patch on top of your patches. Well, I would rather have 3 patches, each with a comprehensible changelog. I had already rebased the patch (trivial) so that's not the problem. Explaining what and why it does things is. And as per the usual rules, if it does 3 things it should be 3 patches. ^ permalink raw reply [flat|nested] 29+ messages in thread