[RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode

From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Feng Tang <feng.tang@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode
Date: Tue, 18 Feb 2020 16:26:27 +0800	[thread overview]
Message-ID: <20200218082634.1596727-2-ying.huang@intel.com> (raw)
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying <ying.huang@intel.com>

With the advent of various new memory types, some machines will have
multiple memory types, e.g. DRAM and PMEM (persistent memory).
Because the performance of the different types of memory may be
different, the memory subsystem could be called memory tiering system.

In a typical memory tiering system, there are CPUs, fast memory and
slow memory in each physical NUMA node.  The CPUs and the fast memory
will be put in one logical node (called fast memory node), while the
slow memory will be put in another (faked) logical node (called slow
memory node).  And in autonuma, there are a set of mechanisms to
identify the pages recently accessed by the CPUs in a node and migrate
the pages to the node.  So the performance optimization to promote the
hot pages in slow memory node to the fast memory node in the memory
tiering system could be implemented based on the autonuma framework.

But the requirement of the hot page promotion in the memory tiering
system is different from that of the normal NUMA balancing in some
aspects.  E.g. for the hot page promotion, we can skip to scan fastest
memory node because we have nowhere to promote the hot pages to.  To
make autonuma works for both the normal NUMA balancing and the memory
tiering hot page promotion, we have defined a set of flags and made
the value of sysctl_numa_balancing_mode to be "OR" of these flags.
The flags are as follows,

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_MEMORY_TIERING

NUMA_BALANCING_NORMAL enables the normal NUMA balancing across
sockets, while NUMA_BALANCING_MEMORY_TIERING enables the hot page
promotion across memory tiers.  They can be enabled individually or
together.  If all flags are cleared, the autonuma is disabled
completely.

The sysctl interface is extended accordingly in a backward compatible
way.

TODO:

- Update ABI document: Documentation/sysctl/kernel.txt

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/sched/sysctl.h | 5 +++++
 kernel/sched/core.c          | 9 +++------
 kernel/sysctl.c              | 7 ++++---
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..80dc5030c797 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -33,6 +33,11 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#define NUMA_BALANCING_DISABLED		0x0
+#define NUMA_BALANCING_NORMAL		0x1
+#define NUMA_BALANCING_MEMORY_TIERING	0x2
+
+extern int sysctl_numa_balancing_mode;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..2d3f456d0ef6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2723,6 +2723,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
+int sysctl_numa_balancing_mode;
 
 #ifdef CONFIG_NUMA_BALANCING
 
@@ -2738,20 +2739,16 @@ void set_numabalancing_state(bool enabled)
 int sysctl_numa_balancing(struct ctl_table *table, int write,
 			 void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ctl_table t;
 	int err;
-	int state = static_branch_likely(&sched_numa_balancing);
 
 	if (write && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	t = *table;
-	t.data = &state;
-	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (err < 0)
 		return err;
 	if (write)
-		set_numabalancing_state(state);
+		set_numabalancing_state(*(int *)table->data);
 	return err;
 }
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 70665934d53e..3756108bb658 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -126,6 +126,7 @@ static int sixty = 60;
 
 static int __maybe_unused neg_one = -1;
 static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
@@ -420,12 +421,12 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "numa_balancing",
-		.data		= NULL, /* filled in by handler */
-		.maxlen		= sizeof(unsigned int),
+		.data		= &sysctl_numa_balancing_mode,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= sysctl_numa_balancing,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= &three,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
-- 
2.24.1