From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756099Ab3JGKnI (ORCPT ); Mon, 7 Oct 2013 06:43:08 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39700 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751818Ab3JGK35 (ORCPT ); Mon, 7 Oct 2013 06:29:57 -0400 From: Mel Gorman To: Peter Zijlstra , Rik van Riel Cc: Srikar Dronamraju , Ingo Molnar , Andrea Arcangeli , Johannes Weiner , Linux-MM , LKML , Mel Gorman Subject: [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" Date: Mon, 7 Oct 2013 11:28:53 +0100 Message-Id: <1381141781-10992-16-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.8.4 In-Reply-To: <1381141781-10992-1-git-send-email-mgorman@suse.de> References: <1381141781-10992-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman --- include/linux/mm_types.h | 10 ---------- kernel/fork.c | 3 --- kernel/sched/fair.c | 18 ------------------ kernel/sched/features.h | 4 +--- 4 files changed, 1 insertion(+), 34 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d9851ee..b7adf1d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -428,20 +428,10 @@ struct mm_struct { /* numa_scan_seq prevents two threads setting pte_numa */ int numa_scan_seq; - - /* - * The first node a task was scheduled on. If a task runs on - * a different node than Make PTE Scan Go Now. - */ - int first_nid; #endif struct uprobes_state uprobes_state; }; -/* first nid will either be a valid NID or one of these values */ -#define NUMA_PTE_SCAN_INIT -1 -#define NUMA_PTE_SCAN_ACTIVE -2 - static inline void mm_init_cpumask(struct mm_struct *mm) { #ifdef CONFIG_CPUMASK_OFFSTACK diff --git a/kernel/fork.c b/kernel/fork.c index 086fe73..7192d91 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk) #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; #endif -#ifdef CONFIG_NUMA_BALANCING - mm->first_nid = NUMA_PTE_SCAN_INIT; -#endif if (!mm_init(mm, tsk)) goto fail_nomem; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 39be6af..148838c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work) return; /* - * We do not care about task placement until a task runs on a node - * other than the first one used by the address space. This is - * largely because migrations are driven by what CPU the task - * is running on. If it's never scheduled on another node, it'll - * not migrate so why bother trapping the fault. - */ - if (mm->first_nid == NUMA_PTE_SCAN_INIT) - mm->first_nid = numa_node_id(); - if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) { - /* Are we running on a new node yet? */ - if (numa_node_id() == mm->first_nid && - !sched_feat_numa(NUMA_FORCE)) - return; - - mm->first_nid = NUMA_PTE_SCAN_ACTIVE; - } - - /* * Reset the scan period if enough time has gone by. Objective is that * scanning will be reduced if pages are properly placed. As tasks * can enter different phases this needs to be re-examined. Lacking diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 99399f8..cba5c61 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false) /* * Apply the automatic NUMA scheduling policy. Enabled automatically * at runtime if running on a NUMA machine. Can be controlled via - * numa_balancing=. Allow PTE scanning to be forced on UMA machines - * for debugging the core machinery. + * numa_balancing= */ #ifdef CONFIG_NUMA_BALANCING SCHED_FEAT(NUMA, false) -SCHED_FEAT(NUMA_FORCE, false) #endif -- 1.8.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f48.google.com (mail-pb0-f48.google.com [209.85.160.48]) by kanga.kvack.org (Postfix) with ESMTP id 7B7D29C0005 for ; Mon, 7 Oct 2013 06:29:59 -0400 (EDT) Received: by mail-pb0-f48.google.com with SMTP id ma3so6917347pbc.21 for ; Mon, 07 Oct 2013 03:29:59 -0700 (PDT) From: Mel Gorman Subject: [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" Date: Mon, 7 Oct 2013 11:28:53 +0100 Message-Id: <1381141781-10992-16-git-send-email-mgorman@suse.de> In-Reply-To: <1381141781-10992-1-git-send-email-mgorman@suse.de> References: <1381141781-10992-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra , Rik van Riel Cc: Srikar Dronamraju , Ingo Molnar , Andrea Arcangeli , Johannes Weiner , Linux-MM , LKML , Mel Gorman PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman --- include/linux/mm_types.h | 10 ---------- kernel/fork.c | 3 --- kernel/sched/fair.c | 18 ------------------ kernel/sched/features.h | 4 +--- 4 files changed, 1 insertion(+), 34 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d9851ee..b7adf1d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -428,20 +428,10 @@ struct mm_struct { /* numa_scan_seq prevents two threads setting pte_numa */ int numa_scan_seq; - - /* - * The first node a task was scheduled on. If a task runs on - * a different node than Make PTE Scan Go Now. - */ - int first_nid; #endif struct uprobes_state uprobes_state; }; -/* first nid will either be a valid NID or one of these values */ -#define NUMA_PTE_SCAN_INIT -1 -#define NUMA_PTE_SCAN_ACTIVE -2 - static inline void mm_init_cpumask(struct mm_struct *mm) { #ifdef CONFIG_CPUMASK_OFFSTACK diff --git a/kernel/fork.c b/kernel/fork.c index 086fe73..7192d91 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk) #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; #endif -#ifdef CONFIG_NUMA_BALANCING - mm->first_nid = NUMA_PTE_SCAN_INIT; -#endif if (!mm_init(mm, tsk)) goto fail_nomem; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 39be6af..148838c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work) return; /* - * We do not care about task placement until a task runs on a node - * other than the first one used by the address space. This is - * largely because migrations are driven by what CPU the task - * is running on. If it's never scheduled on another node, it'll - * not migrate so why bother trapping the fault. - */ - if (mm->first_nid == NUMA_PTE_SCAN_INIT) - mm->first_nid = numa_node_id(); - if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) { - /* Are we running on a new node yet? */ - if (numa_node_id() == mm->first_nid && - !sched_feat_numa(NUMA_FORCE)) - return; - - mm->first_nid = NUMA_PTE_SCAN_ACTIVE; - } - - /* * Reset the scan period if enough time has gone by. Objective is that * scanning will be reduced if pages are properly placed. As tasks * can enter different phases this needs to be re-examined. Lacking diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 99399f8..cba5c61 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false) /* * Apply the automatic NUMA scheduling policy. Enabled automatically * at runtime if running on a NUMA machine. Can be controlled via - * numa_balancing=. Allow PTE scanning to be forced on UMA machines - * for debugging the core machinery. + * numa_balancing= */ #ifdef CONFIG_NUMA_BALANCING SCHED_FEAT(NUMA, false) -SCHED_FEAT(NUMA_FORCE, false) #endif -- 1.8.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org