From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755046AbaEHRXq (ORCPT );
	Thu, 8 May 2014 13:23:46 -0400
Received: from shelob.surriel.com ([74.92.59.67]:53760 "EHLO shelob.surriel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752434AbaEHRXl (ORCPT );
	Thu, 8 May 2014 13:23:41 -0400
From: riel@redhat.com
To: linux-kernel@vger.kernel.org
Cc: mingo@kernel.org, peterz@infradead.org, mgorman@suse.de,
	chegu_vinod@hp.com
Subject: [PATCH 4/4] sched,numa: pull workloads towards their preferred nodes
Date: Thu, 8 May 2014 13:23:31 -0400
Message-Id: <1399569811-14362-5-git-send-email-riel@redhat.com>
X-Mailer: git-send-email 1.8.5.3
In-Reply-To: <1399569811-14362-1-git-send-email-riel@redhat.com>
References: <1399569811-14362-1-git-send-email-riel@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Rik van Riel <riel@redhat.com>

Give a bonus to nodes near a workload's preferred node. This will pull
workloads towards their preferred node.

For workloads that span multiple NUMA nodes, pseudo-interleaving will
even out the memory use between nodes over time, causing the preferred
node to move around. Over time, this movement places the preferred
nodes of different workloads on opposite sides of the system,
untangling workloads that were spread all over the system and moving
them onto adjacent nodes.

The perturbation introduced by this patch enables the kernel to
reliably untangle two 4-node wide SPECjbb2005 instances on an 8-node
system, improving average performance from 857814 to 931792 bops.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99cc829..cffa829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -932,7 +932,7 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
  * the proximity of those nodes.
  */
 static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
-						bool task)
+						bool task, bool *preferred_nid)
 {
 	int max_distance = max_node_distance();
 	unsigned long score = 0;
@@ -949,6 +949,15 @@ static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
 		int distance;
 		unsigned long faults;
 
+		/*
+		 * Pseudo-interleaving balances out the memory use between the
+		 * nodes where a workload runs, so the preferred node should
+		 * change over time. This helps separate two workloads onto
+		 * separate sides of the system.
+		 */
+		if (p->numa_group && node == p->numa_group->preferred_nid)
+			*preferred_nid = true;
+
 		/* Already scored by the calling function. */
 		if (node == nid)
 			continue;
@@ -989,6 +998,7 @@ static inline unsigned long nearby_nodes_score(struct task_struct *p, int nid,
 static inline unsigned long task_weight(struct task_struct *p, int nid)
 {
 	unsigned long total_faults, score;
+	bool near_preferred_nid = false;
 
 	if (!p->numa_faults_memory)
 		return 0;
@@ -999,7 +1009,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 		return 0;
 
 	score = 1000 * task_faults(p, nid);
-	score += nearby_nodes_score(p, nid, true);
+	score += nearby_nodes_score(p, nid, true, &near_preferred_nid);
 
 	score /= total_faults;
 
@@ -1009,6 +1019,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
 	unsigned long total_faults, score;
+	bool near_preferred_nid = false;
 
 	if (!p->numa_group)
 		return 0;
@@ -1019,7 +1030,15 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 		return 0;
 
 	score = 1000 * group_faults(p, nid);
-	score += nearby_nodes_score(p, nid, false);
+	score += nearby_nodes_score(p, nid, false, &near_preferred_nid);
+
+	/*
+	 * Pull workloads towards their preferred node, with the minimum
+	 * multiplier required to be a tie-breaker when two groups of nodes
+	 * have the same amount of memory.
+	 */
+	if (near_preferred_nid)
+		score *= (max_node_distance() - LOCAL_DISTANCE);
 
 	score /= total_faults;
 
-- 
1.8.5.3
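
To illustrate why multiplying by (max_node_distance() - LOCAL_DISTANCE)
is enough to act as a tie-breaker, here is a minimal userspace sketch.
This is not kernel code: the SLIT distances (10 local, 22 maximum), the
fault counts, and the helper group_score() are made-up stand-ins for
the group_weight() computation in the patch above.

#include <stdio.h>
#include <stdbool.h>

/* Stand-in SLIT distances: 10 to the local node, 22 across the system. */
#define LOCAL_DISTANCE	10
static int max_node_distance(void) { return 22; }

/*
 * Hypothetical stand-in for group_weight(): 1000 * the group's faults
 * on the candidate node, plus the nearby-nodes bonus, normalized by
 * the group's total faults.
 */
static unsigned long group_score(unsigned long faults, unsigned long nearby,
				 unsigned long total_faults,
				 bool near_preferred_nid)
{
	unsigned long score = 1000 * faults + nearby;

	/* The perturbation added by this patch. */
	if (near_preferred_nid)
		score *= (max_node_distance() - LOCAL_DISTANCE);

	return score / total_faults;
}

int main(void)
{
	/* Two candidate nodes with identical faults and nearby bonuses... */
	unsigned long away = group_score(1000, 500, 2000, false);
	/* ...but only one of them is near the group's preferred node. */
	unsigned long near = group_score(1000, 500, 2000, true);

	printf("away from preferred node: %lu\n", away);
	printf("near preferred node:      %lu\n", near);
	return 0;
}

Without the multiplier both candidate nodes would score 500 and the
placement would be arbitrary; with it the node near the preferred node
scores 6003, so the load balancer consistently pulls the workload in
that direction instead of oscillating between tied node groups.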