From: Lauro Venancio <lvenanci@redhat.com>
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, lwang@redhat.com, riel@redhat.com, Mike Galbraith, Thomas Gleixner, Ingo Molnar
Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 17:21:00 -0300
Organization: Red Hat
Message-ID: <5166d6ba-c8e6-c60e-61af-d32124234bb9@redhat.com>
In-Reply-To: <20170413154812.vrtkdyzgkrywj2no@hirez.programming.kicks-ass.net>
References: <1492091769-19879-1-git-send-email-lvenanci@redhat.com> <1492091769-19879-3-git-send-email-lvenanci@redhat.com> <20170413154812.vrtkdyzgkrywj2no@hirez.programming.kicks-ass.net>
List-ID: linux-kernel@vger.kernel.org

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
>
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>> apart.
>
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.

On a 4-node machine with ring topology, the last sched domain level
contains groups spanning 3 NUMA nodes each, so there are four possible
groups: (0, 1, 2), (1, 2, 3), (2, 3, 0) and (3, 0, 1). As only two
groups are needed to fill the sched domain, currently the groups
(3, 0, 1) and (1, 2, 3) are used for all CPUs. The problem is that
nodes 1 and 3 belong to both groups, which makes it impossible to move
tasks directly between these two nodes.
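Maybe a small stand-alone example helps as a picture. This is plain
user-space C, not the kernel code: the hops[][] table just hard-codes
the 4-node ring, and a "group" is modelled as the set of nodes at most
one hop away from a starting node.

/*
 * Stand-alone illustration, not kernel code: model the last NUMA level
 * groups on a 4-node ring as "all nodes at most one hop away from a
 * starting node" and show that the two groups installed today overlap
 * on nodes 1 and 3.
 */
#include <stdio.h>

#define NODES 4

/* hop count between nodes on the ring 0-1-2-3-0 */
static const int hops[NODES][NODES] = {
        { 0, 1, 2, 1 },
        { 1, 0, 1, 2 },
        { 2, 1, 0, 1 },
        { 1, 2, 1, 0 },
};

/* bitmask of nodes at most one hop away from 'start' */
static unsigned int span(int start)
{
        unsigned int mask = 0;
        int n;

        for (n = 0; n < NODES; n++)
                if (hops[start][n] <= 1)
                        mask |= 1u << n;
        return mask;
}

static void print_mask(const char *name, unsigned int mask)
{
        int n;

        printf("%s:", name);
        for (n = 0; n < NODES; n++)
                if (mask & (1u << n))
                        printf(" %d", n);
        printf("\n");
}

int main(void)
{
        /* the two groups currently installed for every CPU */
        unsigned int g1 = span(0);      /* {3, 0, 1} */
        unsigned int g2 = span(2);      /* {1, 2, 3} */

        print_mask("group 1", g1);
        print_mask("group 2", g2);
        print_mask("in both", g1 & g2); /* nodes 1 and 3 */
        return 0;
}

Load balancing only moves tasks between different groups, so with nodes
1 and 3 sitting in both groups there is no group pair that separates
them.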
This patch uses different groups depending on the CPU for which they
are being installed. CPUs on nodes 0 and 2 keep the same groups as
before, (3, 0, 1) and (1, 2, 3), while CPUs on nodes 1 and 3 use the
new groups (0, 1, 2) and (2, 3, 0). The first pair of groups allows
movement between nodes 0 and 2; the second pair allows movement between
nodes 1 and 3.

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.

The for_each_cpu() is used to iterate over all sched domain CPUs. It
does not consider the CPU for which the groups are being installed (the
cpu parameter of build_overlap_sched_groups()). Currently, the cpu
parameter is used only for memory allocation and for ordering the
groups; it does not change which groups are chosen. This patch uses the
cpu parameter to choose the first group, which, as a consequence, also
changes the second group (see the sketch at the end of this mail).

>
> Also, since I'm not sure what happend to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.
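For the 4-node system, this is what the patched construction selects
from each node's perspective. Again a stand-alone user-space sketch of
the idea at node granularity, not the real build_overlap_sched_groups()
(which walks CPUs and builds struct sched_group from cpumasks): start
the walk at the node the groups are being installed for and add the
1-hop span of every node that is not covered yet.

/*
 * Stand-alone sketch of the patched group selection on the 4-node ring,
 * not kernel code.
 */
#include <stdio.h>

#define NODES 4

/* hop count between nodes on the ring 0-1-2-3-0 */
static const int hops[NODES][NODES] = {
        { 0, 1, 2, 1 },
        { 1, 0, 1, 2 },
        { 2, 1, 0, 1 },
        { 1, 2, 1, 0 },
};

static void build_groups_for(int node)
{
        unsigned int covered = 0;       /* nodes already placed in a group */
        int i, n, start;

        printf("node %d:", node);
        for (i = 0; i < NODES; i++) {
                start = (node + i) % NODES;     /* walk begins at 'node' */
                if (covered & (1u << start))
                        continue;
                printf(" (");
                for (n = 0; n < NODES; n++) {
                        if (hops[start][n] <= 1) {
                                printf(" %d", n);
                                covered |= 1u << n;
                        }
                }
                printf(" )");
        }
        printf("\n");
}

int main(void)
{
        int node;

        /*
         * Nodes 0 and 2 keep (3, 0, 1)(1, 2, 3) as before; nodes 1 and 3
         * now get (0, 1, 2)(2, 3, 0), so every pair of nodes 2 hops apart
         * ends up in different groups for some CPU.
         */
        for (node = 0; node < NODES; node++)
                build_groups_for(node);
        return 0;
}

Its output lists, for each node, the two groups installed from that
node's perspective (members printed in ascending node order).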