linux-kernel.vger.kernel.org archive mirror
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Feng Tang <feng.tang@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
Date: Tue, 18 Feb 2020 16:26:29 +0800
Message-ID: <20200218082634.1596727-4-ying.huang@intel.com>
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, if the memory size of the workloads is
smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
the workloads should be put in the faster memory nodes.  But this
makes it unnecessary to use slower memory (e.g. PMEM) at all.

So in common cases, the memory size of the workload should be larger
than that of the faster memory nodes.  And to optimize the
performance, the hot pages should be promoted to the faster memory
nodes while the cold pages should be demoted to the slower memory
nodes.  To achieve that, we have two choices:

a. Promote the hot pages from the slower memory node to the faster
   memory node.  This will create some memory pressure in the faster
   memory node and thus trigger memory reclaim, in which the cold
   pages will be demoted to the slower memory node.

b. Demote the cold pages from the faster memory node to the slower
   memory node.  This will create some free memory space in the faster
   memory node, so that the hot pages in the slower memory node can be
   promoted to the faster memory node.

Choice "a" creates memory pressure in the faster memory node.  If the
memory pressure of the workload is high too, the combined pressure may
become so high that the memory allocation latency of the workload is
affected, e.g. direct reclaim may be triggered.

Choice "b" works much better in this respect.  If the memory pressure
of the workload is high, the workload will consume the free memory,
and hot page promotion will stop earlier if its allocation watermark
is higher than that of normal memory allocation.

In this patch, choice "b" is implemented.  If memory tiering NUMA
balancing mode is enabled, the node isn't the slowest node, and the
free memory size of the node is below the high watermark, the kswapd
of the node will be woken up to free some memory until the free memory
size is above the high watermark + autonuma promotion rate limit.  If
the free memory size is below the high watermark, autonuma promotion
will stop working.  This avoids creating too much memory pressure on
the system.
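
As a rough illustration of the two thresholds described above, the
following user-space sketch models a single node.  It is not kernel
code: struct node_state and the three helpers are hypothetical names
made up for this example, and it ignores zones, allocation order and
lowmem reserves, all of which the real watermark checks take into
account.  Treating "free == high watermark" as below the threshold is
also an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of one NUMA node (not a kernel struct). */
struct node_state {
	unsigned long free_pages;        /* current free memory, in pages */
	unsigned long high_wmark;        /* high watermark, in pages */
	unsigned long promote_ratelimit; /* promotion rate limit, in pages */
	bool tiering_enabled;            /* memory tiering mode is on */
	bool has_slower_node;            /* false for the slowest node */
};

/* Promotion into the node stops once free memory falls to the high watermark. */
static bool promotion_allowed(const struct node_state *n)
{
	return n->free_pages > n->high_wmark;
}

/* Free-memory level kswapd reclaims up to before it stops. */
static unsigned long kswapd_target(const struct node_state *n)
{
	unsigned long mark = n->high_wmark;

	/* Only nodes that can demote to a slower node get the extra headroom. */
	if (n->tiering_enabled && n->has_slower_node)
		mark += n->promote_ratelimit;
	return mark;
}

/* kswapd is woken when a promotion is refused on a node that can demote. */
static bool kswapd_should_wake(const struct node_state *n)
{
	return n->tiering_enabled && n->has_slower_node &&
	       !promotion_allowed(n);
}
```

The gap between the two thresholds is the point of the design: kswapd
stops only after freeing roughly one rate-limit worth of pages beyond
the high watermark, so promotion has room to proceed before the node
dips below the watermark again.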

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/migrate.c | 26 +++++++++++++++++---------
 mm/vmscan.c  |  7 +++++++
 2 files changed, 24 insertions(+), 9 deletions(-)
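
Note on units: pgdat_balanced() in the mm/vmscan.c hunk computes
promote_ratelimit as sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT).
The shift suggests that the sysctl value is in megabytes and is being
converted to pages, although this patch does not state the unit, so
take that as an assumption.  A stand-alone sketch of the same
arithmetic, with mb_to_pages() as a hypothetical helper name and
page_shift passed in explicitly instead of using PAGE_SHIFT:

```c
/*
 * Convert a size in megabytes to a number of pages.
 * page_shift is log2 of the page size (12 for 4 KiB pages) and is
 * assumed to be at most 20, i.e. pages no larger than 1 MiB.
 */
static unsigned long mb_to_pages(unsigned long mb, unsigned int page_shift)
{
	/* 1 MiB = 2^20 bytes, one page = 2^page_shift bytes. */
	return mb << (20 - page_shift);
}
```

With 4 KiB pages, 1 MB corresponds to 256 pages, so a rate limit of
64 would raise the kswapd target by 16384 pages above the high
watermark.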

diff --git a/mm/migrate.c b/mm/migrate.c
index 0b046759f99a..bbf16764d105 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/sched/sysctl.h>
 
 #include <asm/tlbflush.h>
 
@@ -1946,8 +1947,7 @@ COMPAT_SYSCALL_DEFINE6(move_pages, pid_t, pid, compat_ulong_t, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which crude
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
-				   unsigned long nr_migrate_pages)
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order)
 {
 	int z;
 
@@ -1958,12 +1958,9 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 			continue;
 
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
-		if (!zone_watermark_ok(zone, 0,
-				       high_wmark_pages(zone) +
-				       nr_migrate_pages,
-				       ZONE_MOVABLE, 0))
-			continue;
-		return true;
+		if (zone_watermark_ok(zone, order, high_wmark_pages(zone),
+				      ZONE_MOVABLE, 0))
+			return true;
 	}
 	return false;
 }
@@ -1990,8 +1987,19 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
 	/* Avoid migrating to a node that is nearly full */
-	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
+	if (!migrate_balanced_pgdat(pgdat, compound_order(page))) {
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+			int z;
+
+			for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+				if (populated_zone(pgdat->node_zones + z))
+					break;
+			}
+			wakeup_kswapd(pgdat->node_zones + z,
+				      0, compound_order(page), ZONE_MOVABLE);
+		}
 		return 0;
+	}
 
 	if (isolate_lru_page(page))
 		return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe90236045d5..b265868d62ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 
 #include <linux/swapops.h>
 #include <linux/balloon_compaction.h>
+#include <linux/sched/sysctl.h>
 
 #include "internal.h"
 
@@ -3462,8 +3463,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	unsigned long mark = -1;
+	unsigned long promote_ratelimit;
 	struct zone *zone;
 
+	promote_ratelimit = sysctl_numa_balancing_rate_limit <<
+		(20 - PAGE_SHIFT);
 	/*
 	 * Check watermarks bottom-up as lower zones are more likely to
 	 * meet watermarks.
@@ -3475,6 +3479,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 
 		mark = high_wmark_pages(zone);
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+		    next_migration_node(pgdat->node_id) != -1)
+			mark += promote_ratelimit;
 		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 			return true;
 	}
-- 
2.24.1

