From: Lauro Venancio <lvenanci@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, lwang@redhat.com, riel@redhat.com,
	Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 17:21:00 -0300
Message-ID: <5166d6ba-c8e6-c60e-61af-d32124234bb9@redhat.com>
In-Reply-To: <20170413154812.vrtkdyzgkrywj2no@hirez.programming.kicks-ass.net>

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4-node NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As the CPUs from nodes 1 and 3 belong to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks is bound to nodes 1 and 3,
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU's perspective. So,
>> on a 4-node machine with ring topology, while nodes 0 and 2 keep the
>> same groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new
>> groups [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any two
>> nodes that are 2 hops apart.
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.
On a 4-node machine with ring topology, the last sched domain level
contains groups with 3 NUMA nodes each, so there are four possible
groups: (0, 1, 2), (1, 2, 3), (2, 3, 0) and (3, 0, 1). As just two
groups are needed to fill the sched domain, currently the groups
(3, 0, 1) and (1, 2, 3) are used for all CPUs. The problem with this is
that nodes 1 and 3 belong to both groups, making it impossible to move
tasks between these two nodes.
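
To make that concrete, here is a minimal standalone sketch (not kernel
code; it just models a 4-node ring where the hop distance between nodes
a and b is min(|a-b|, N-|a-b|)) that enumerates the 3-node span around
each node:

#include <stdio.h>

#define N 4				/* nodes on the ring */

/* Hop distance between two nodes on a ring. */
static int ring_hops(int a, int b)
{
        int d = a > b ? a - b : b - a;
        return d < N - d ? d : N - d;
}

int main(void)
{
        for (int node = 0; node < N; node++) {
                printf("span around node %d: {", node);
                for (int j = 0; j < N; j++)
                        if (ring_hops(node, j) <= 1)
                                printf(" %d", j);
                printf(" }\n");
        }
        return 0;
}

It prints, as sets, exactly the four candidate groups above. Covering
the domain with the fixed pair (3, 0, 1) and (1, 2, 3) works, but it
leaves nodes 1 and 3 inside both groups.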

This patch uses different groups depending on the CPU on which they are
installed. CPUs on nodes 0 and 2 keep the same groups as before:
(3, 0, 1) and (1, 2, 3). CPUs on nodes 1 and 3 use the new groups:
(0, 1, 2) and (2, 3, 0). The first pair of groups allows movement
between nodes 0 and 2; the second pair allows movement between nodes 1
and 3.
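
The selection rule itself is simple enough to sketch (illustrative
only; it assumes the first group is the span centered on the installing
node and the second one the span centered on the node 2 hops away):

#include <stdio.h>

#define N 4

int main(void)
{
        for (int node = 0; node < N; node++) {
                int a = node, b = (node + 2) % N;

                printf("node %d: (%d, %d, %d) and (%d, %d, %d)\n", node,
                       (a + N - 1) % N, a, (a + 1) % N,
                       (b + N - 1) % N, b, (b + 1) % N);
        }
        return 0;
}

Running it reproduces the pairs above: nodes 0 and 2 get (3, 0, 1) and
(1, 2, 3), nodes 1 and 3 get (0, 1, 2) and (2, 3, 0).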

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.
The for_each_cpu() is used to iterate over all the sched domain CPUs.
It doesn't consider the CPU where the groups are being installed (the
cpu parameter of build_overlap_sched_groups()). Currently, the cpu
parameter is used just for memory allocation and for ordering the
groups; it doesn't change which groups are chosen. This patch uses the
cpu parameter to choose the first group, which, as a consequence, also
changes the second group.
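
Here is a heavily simplified model of that walk, with one CPU per node
(so cpu == node), toy bitmask cpumasks and a child_span() stand-in for
the real sibling-domain lookup; none of this is the real
build_overlap_sched_groups(), it only shows where the cpu parameter can
steer the choice:

#include <stdint.h>
#include <stdio.h>

#define N 4				/* one CPU per node, ring of 4 */

typedef uint64_t cpumask_t;		/* toy mask: one bit per CPU */

/* Stand-in for the sibling-domain span: {i-1, i, i+1} on the ring. */
static cpumask_t child_span(int i)
{
        return 1ULL << ((i + N - 1) % N) | 1ULL << i | 1ULL << ((i + 1) % N);
}

static void build_groups(cpumask_t span, int cpu)
{
        cpumask_t covered = 0;

        printf("cpu %d:", cpu);
        /* Walk the span starting at the installing CPU, wrapping, so
         * the first group is built around that CPU. */
        for (int k = 0; k < N; k++) {
                int i = (cpu + k) % N;

                if (!(span & (1ULL << i)) || (covered & (1ULL << i)))
                        continue;

                covered |= child_span(i);
                printf(" group(%d)", i);
        }
        printf("\n");
}

int main(void)
{
        cpumask_t all = (1ULL << N) - 1;	/* top domain: all nodes */

        for (int cpu = 0; cpu < N; cpu++)
                build_groups(all, cpu);
        return 0;
}

Starting the walk at the installing CPU, which is essentially what
for_each_cpu_wrap() later provided (see the tip-bot entries in the
thread overview below), is enough: cpus 0 and 2 build the groups around
nodes 0 and 2, i.e. (3, 0, 1) and (1, 2, 3), while cpus 1 and 3 build
(0, 1, 2) and (2, 3, 0).
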
>
> Also, since I'm not sure what happened to the 4-node system, I cannot
> begin to imagine what would happen on the 8-node one.

Thread overview: 29+ messages
2017-04-13 13:56 [RFC 0/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
2017-04-13 13:56 ` [RFC 1/3] sched/topology: Refactor function build_overlap_sched_groups() Lauro Ramos Venancio
2017-04-13 14:50   ` Rik van Riel
2017-05-15  9:02   ` [tip:sched/core] " tip-bot for Lauro Ramos Venancio
2017-04-13 13:56 ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Lauro Ramos Venancio
2017-04-13 15:16   ` Rik van Riel
2017-04-13 15:48   ` Peter Zijlstra
2017-04-13 20:21     ` Lauro Venancio [this message]
2017-04-13 21:06       ` Lauro Venancio
2017-04-13 23:38         ` Rik van Riel
2017-04-14 10:48           ` Peter Zijlstra
2017-04-14 11:38   ` Peter Zijlstra
2017-04-14 12:20     ` Peter Zijlstra
2017-05-15  9:03       ` [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap() tip-bot for Peter Zijlstra
2017-05-17 10:53         ` hackbench vs select_idle_sibling; was: " Peter Zijlstra
2017-05-17 12:46           ` Matt Fleming
2017-05-17 14:49           ` Chris Mason
2017-05-19 15:00           ` Matt Fleming
2017-06-05 13:00             ` Matt Fleming
2017-06-06  9:21               ` Peter Zijlstra
2017-06-09 17:52                 ` Chris Mason
2017-06-08  9:22           ` [tip:sched/core] sched/core: Implement new approach to scale select_idle_cpu() tip-bot for Peter Zijlstra
2017-04-14 16:58     ` [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology Peter Zijlstra
2017-04-17 14:40       ` Lauro Venancio
2017-04-13 13:56 ` [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu Lauro Ramos Venancio
2017-04-13 15:27   ` Rik van Riel
2017-04-14 16:49   ` Peter Zijlstra
2017-04-17 15:34     ` Lauro Venancio
2017-04-18 12:32       ` Peter Zijlstra
