From: Lauro Venancio <lvenanci@redhat.com>
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, lwang@redhat.com, riel@redhat.com, Mike Galbraith, Thomas Gleixner, Ingo Molnar
Subject: Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology
Date: Thu, 13 Apr 2017 17:21:00 -0300
Organization: Red Hat
Message-ID: <5166d6ba-c8e6-c60e-61af-d32124234bb9@redhat.com>
In-Reply-To: <20170413154812.vrtkdyzgkrywj2no@hirez.programming.kicks-ass.net>
References: <1492091769-19879-1-git-send-email-lvenanci@redhat.com> <1492091769-19879-3-git-send-email-lvenanci@redhat.com> <20170413154812.vrtkdyzgkrywj2no@hirez.programming.kicks-ass.net>
List-ID: linux-kernel@vger.kernel.org

On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belongs to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
>
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>> apart.
>
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.

On a 4-node machine with ring topology, the last sched domain level
contains groups spanning 3 NUMA nodes each, so there are four possible
groups: (0, 1, 2), (1, 2, 3), (2, 3, 0) and (3, 0, 1). As only two
groups are needed to fill the sched domain, currently the groups
(3, 0, 1) and (1, 2, 3) are used for all CPUs. The problem is that
nodes 1 and 3 belong to both groups, which makes it impossible to move
tasks directly between these two nodes.
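Maybe a small stand-alone example helps as a picture. This is plain
user-space C, not the kernel code: the hops[][] table just hard-codes
the 4-node ring, and a "group" is modelled as the set of nodes at most
one hop away from a starting node.

/*
 * Stand-alone illustration, not kernel code: model the last NUMA level
 * groups on a 4-node ring as "all nodes at most one hop away from a
 * starting node" and show that the two groups installed today overlap
 * on nodes 1 and 3.
 */
#include <stdio.h>

#define NODES 4

/* hop count between nodes on the ring 0-1-2-3-0 */
static const int hops[NODES][NODES] = {
        { 0, 1, 2, 1 },
        { 1, 0, 1, 2 },
        { 2, 1, 0, 1 },
        { 1, 2, 1, 0 },
};

/* bitmask of nodes at most one hop away from 'start' */
static unsigned int span(int start)
{
        unsigned int mask = 0;
        int n;

        for (n = 0; n < NODES; n++)
                if (hops[start][n] <= 1)
                        mask |= 1u << n;
        return mask;
}

static void print_mask(const char *name, unsigned int mask)
{
        int n;

        printf("%s:", name);
        for (n = 0; n < NODES; n++)
                if (mask & (1u << n))
                        printf(" %d", n);
        printf("\n");
}

int main(void)
{
        /* the two groups currently installed for every CPU */
        unsigned int g1 = span(0);      /* {3, 0, 1} */
        unsigned int g2 = span(2);      /* {1, 2, 3} */

        print_mask("group 1", g1);
        print_mask("group 2", g2);
        print_mask("in both", g1 & g2); /* nodes 1 and 3 */
        return 0;
}

Load balancing only moves tasks between different groups, so with nodes
1 and 3 sitting in both groups there is no group pair that separates
them.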
This patch uses different groups depending on the CPU for which they
are being installed. CPUs on nodes 0 and 2 keep the same groups as
before, (3, 0, 1) and (1, 2, 3), while CPUs on nodes 1 and 3 use the
new groups (0, 1, 2) and (2, 3, 0). The first pair of groups allows
movement between nodes 0 and 2; the second pair allows movement between
nodes 1 and 3.

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.

The for_each_cpu() is used to iterate over all sched domain CPUs. It
does not consider the CPU for which the groups are being installed (the
cpu parameter of build_overlap_sched_groups()). Currently, the cpu
parameter is used only for memory allocation and for ordering the
groups; it does not change which groups are chosen. This patch uses the
cpu parameter to choose the first group, which, as a consequence, also
changes the second group (see the sketch at the end of this mail).

>
> Also, since I'm not sure what happend to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.
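For the 4-node system, this is what the patched construction selects
from each node's perspective. Again a stand-alone user-space sketch of
the idea at node granularity, not the real build_overlap_sched_groups()
(which walks CPUs and builds struct sched_group from cpumasks): start
the walk at the node the groups are being installed for and add the
1-hop span of every node that is not covered yet.

/*
 * Stand-alone sketch of the patched group selection on the 4-node ring,
 * not kernel code.
 */
#include <stdio.h>

#define NODES 4

/* hop count between nodes on the ring 0-1-2-3-0 */
static const int hops[NODES][NODES] = {
        { 0, 1, 2, 1 },
        { 1, 0, 1, 2 },
        { 2, 1, 0, 1 },
        { 1, 2, 1, 0 },
};

static void build_groups_for(int node)
{
        unsigned int covered = 0;       /* nodes already placed in a group */
        int i, n, start;

        printf("node %d:", node);
        for (i = 0; i < NODES; i++) {
                start = (node + i) % NODES;     /* walk begins at 'node' */
                if (covered & (1u << start))
                        continue;
                printf(" (");
                for (n = 0; n < NODES; n++) {
                        if (hops[start][n] <= 1) {
                                printf(" %d", n);
                                covered |= 1u << n;
                        }
                }
                printf(" )");
        }
        printf("\n");
}

int main(void)
{
        int node;

        /*
         * Nodes 0 and 2 keep (3, 0, 1)(1, 2, 3) as before; nodes 1 and 3
         * now get (0, 1, 2)(2, 3, 0), so every pair of nodes 2 hops apart
         * ends up in different groups for some CPU.
         */
        for (node = 0; node < NODES; node++)
                build_groups_for(node);
        return 0;
}

Its output lists, for each node, the two groups installed from that
node's perspective (members printed in ascending node order).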