From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754851Ab2LBSpb (ORCPT ); Sun, 2 Dec 2012 13:45:31 -0500
Received: from mail-ea0-f174.google.com ([209.85.215.174]:34653 "EHLO
	mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754741Ab2LBSpW (ORCPT );
	Sun, 2 Dec 2012 13:45:22 -0500
From: Ingo Molnar
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins
Subject: [PATCH 39/52] sched: Track shared task's node groups and interleave their memory allocations
Date: Sun, 2 Dec 2012 19:43:31 +0100
Message-Id: <1354473824-19229-40-git-send-email-mingo@kernel.org>
X-Mailer: git-send-email 1.7.11.7
In-Reply-To: <1354473824-19229-1-git-send-email-mingo@kernel.org>
References: <1354473824-19229-1-git-send-email-mingo@kernel.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This patch shows the power of the shared/private distinction: in
the shared tasks' active balancing function
(sched_update_ideal_cpu_shared()) we are able to build, for each
shared task, a mask of the nodes that it and its buddies occupy at
the moment, and to interleave its memory allocations across that
mask.

Private tasks, on the other hand, are not affected and continue to
do efficient node-local allocations.

There are two important special cases:

 - If a group of shared tasks fits on a single node: in this case
   the interleaving happens on a single bit, i.e. a single node,
   and thus turns into nice node-local allocations.

 - If a large group spans the whole system: in this case the node
   masks will cover the whole system, all memory gets evenly
   interleaved and the available RAM bandwidth gets utilized.
   This is preferable to allocating memory asymmetrically,
   overloading certain CPU links and running into their bandwidth
   limitations.

This patch, in combination with the private/shared buddies patch,
makes the "4x JVM", "single JVM" and "2x JVM" SPECjbb workloads on
a 4-node system produce almost perfect memory placement.

For example, a 4-JVM workload on a 4-node, 32-CPU system has this
performance (8 SPECjbb warehouses per JVM):

   spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
   spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
   spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
   spec4.txt:           throughput =     171383.52 SPECjbb2005 bops

This is close to the hard-binding performance figures, whereas
previously the workload would regress compared to mainline.

Mainline has the following 4x JVM performance:

   spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
   spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
   spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
   spec4.txt:           throughput =     157873.86 SPECjbb2005 bops

So the patch brings a 12% speedup.

This placement idea came up while discussing interleaving
strategies with Christoph Lameter.

Suggested-by: Christoph Lameter
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Hugh Dickins
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3fb508..79f306c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -922,6 +922,10 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
 			buddies++;
 		}
 		WARN_ON_ONCE(buddies > full_buddies);
+		if (buddies)
+			node_set(node, p->numa_policy.v.nodes);
+		else
+			node_clear(node, p->numa_policy.v.nodes);
 
 		/* Don't go to a node that is already at full capacity: */
 		if (buddies == full_buddies)
-- 
1.7.11.7
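
For readers who want to see the two special cases from the changelog
in isolation, here is a minimal user-space sketch of the idea. It is
not kernel code: nodemask_sketch_t, the sketch_*() helpers, MAX_NODES
and the round-robin node picker are all hypothetical stand-ins
introduced for illustration. Only the "set the bit if the node has
buddies, clear it otherwise" step mirrors the hunk above; how the
kernel actually walks the mask when interleaving is assumed here to
be a simple round-robin.

```c
/* Illustrative only: a 4-node system, as in the benchmark above. */
#define MAX_NODES 4

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per node. */
typedef unsigned long nodemask_sketch_t;

static void sketch_node_set(int node, nodemask_sketch_t *mask)
{
	*mask |= 1UL << node;
}

static void sketch_node_clear(int node, nodemask_sketch_t *mask)
{
	*mask &= ~(1UL << node);
}

/*
 * Mirror of the patch hunk: for every node, set its bit if at least one
 * buddy of the task currently runs there, otherwise clear it.
 * buddies_on_node[n] is the (assumed, precomputed) buddy count on node n.
 */
static nodemask_sketch_t
sketch_update_mask(nodemask_sketch_t mask, const int buddies_on_node[MAX_NODES])
{
	for (int node = 0; node < MAX_NODES; node++) {
		if (buddies_on_node[node])
			sketch_node_set(node, &mask);
		else
			sketch_node_clear(node, &mask);
	}
	return mask;
}

/*
 * Assumed round-robin interleaving: return the next node after 'prev'
 * whose bit is set, wrapping around. Pass prev == -1 for the first
 * allocation. Returns -1 if the mask is empty.
 */
static int sketch_next_interleave_node(nodemask_sketch_t mask, int prev)
{
	for (int i = 1; i <= MAX_NODES; i++) {
		int node = (prev + i) % MAX_NODES;

		if (mask & (1UL << node))
			return node;
	}
	return -1;
}
```

With buddies only on node 2, the mask collapses to a single bit and
every allocation lands on node 2, which is the node-local case. With
buddies on all four nodes, successive allocations cycle through nodes
0, 1, 2, 3, which is the fully interleaved case.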