From: Fengguang Wu <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
	Liu Jingqi <jingqi.liu@intel.com>,
	Fengguang Wu <fengguang.wu@intel.com>,
	kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Fan Du <fan.du@intel.com>, Yao Yuan <yuan.yao@intel.com>,
	Peng Dong <dongx.peng@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Zhang Yi <yi.z.zhang@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag
Date: Wed, 26 Dec 2018 21:15:05 +0800
Message-ID: <20181226133352.189896494@intel.com>
In-Reply-To: <20181226131446.330864849@intel.com>

[-- Attachment #1: 0010-migrate-check-if-the-page-is-software-young-when-mov.patch --]
[-- Type: text/plain, Size: 3635 bytes --]

From: Liu Jingqi <jingqi.liu@intel.com>

Introduce an MPOL_MF_SW_YOUNG flag for move_pages(). When set, pages
that are already in DRAM get PG_referenced set on them.

Background:
The user space migration daemon frequently scans page tables and
read-clears the accessed bits to detect hot/cold pages, then migrates
hot pages from the PMEM node to the DRAM node. While doing so, it also
tells the kernel which pages form the hot set. This maintains a
consistent view of hot/cold pages between the kernel and the user
space daemon.
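
For illustration, here is a minimal sketch of one read-clear scan step.
The series adds a /proc/PID/idle_pages interface for this purpose
(patch 17); since its format is specific to the series, the sketch
below uses the mainline page idle tracking files (/proc/<pid>/pagemap
plus /sys/kernel/mm/page_idle/bitmap) as a stand-in. The helper names
are hypothetical; 4KB pages and the required privileges are assumed.

  #include <stdint.h>
  #include <unistd.h>

  #define PAGE_SHIFT 12

  /* Translate one virtual page of the target process to a PFN via
   * /proc/<pid>/pagemap: bit 63 = present, bits 0..54 = PFN. */
  static uint64_t va_to_pfn(int pagemap_fd, uintptr_t va)
  {
      uint64_t ent;
      off_t off = (off_t)(va >> PAGE_SHIFT) * sizeof(ent);

      if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
          return 0;
      return (ent & (1ULL << 63)) ? (ent & ((1ULL << 55) - 1)) : 0;
  }

  /* Read-clear the accessed state of one page.  In the page_idle
   * bitmap a set bit means "still idle", so a cleared bit means the
   * page was referenced since the previous scan.  Re-mark the page
   * idle so the next scan sees fresh state. */
  static int test_and_clear_young(int bitmap_fd, uint64_t pfn)
  {
      uint64_t word = 0, mask = 1ULL << (pfn % 64);
      off_t off = (off_t)(pfn / 64) * sizeof(word);
      int young;

      if (pread(bitmap_fd, &word, sizeof(word), off) != sizeof(word))
          return 0;
      young = !(word & mask);    /* bit cleared => accessed */
      word = mask;               /* mark idle for the next scan */
      pwrite(bitmap_fd, &word, sizeof(word), off);
      return young;
  }

The first scan only primes the bitmap; from the second scan on, each
cleared bit counts as one "accessed" hit in the hot/cold statistics.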

More concretely, the steps are:

1) scan the page tables multiple times, counting the accessed bits
2) take the pages with the highest accessed counts as the hot set
3) call move_pages(hot pages, DRAM nodes, MPOL_MF_SW_YOUNG)

(1) regularly clears the PTE young bits, which makes the kernel lose
    access to the PTE young information

(2) lets the user space daemon decide, for anonymous pages, which are
    hot and which are cold

(3) conveys the user space view of hot/cold pages to the kernel
    through PG_referenced
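
As a sketch of the move_pages() call in step 3: this patch defines
MPOL_MF_SW_YOUNG only inside mm/migrate.c and does not export it to
the uapi headers, so a user space caller currently has to duplicate
the value itself. promote_hot_pages() is a hypothetical helper name;
link with -lnuma.

  #include <errno.h>
  #include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */
  #include <stdio.h>

  #ifndef MPOL_MF_SW_YOUNG
  #define MPOL_MF_SW_YOUNG (1 << 7)  /* must match mm/migrate.c */
  #endif

  /* Move @count hot pages of process @pid to the DRAM nodes given in
   * @nodes[], telling the kernel via MPOL_MF_SW_YOUNG that user space
   * considers them hot. */
  static long promote_hot_pages(int pid, unsigned long count,
                                void **pages, const int *nodes,
                                int *status)
  {
      long ret = move_pages(pid, count, pages, nodes, status,
                            MPOL_MF_MOVE | MPOL_MF_SW_YOUNG);
      if (ret < 0)
          fprintf(stderr, "move_pages: errno %d\n", errno);
      return ret;
  }

Per-page results land in status[]: the target node on success, or a
negative errno for pages that could not be moved.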

In the long run, most hot pages should already be in DRAM.
move_pages(MPOL_MF_SW_YOUNG) sets PG_referenced on those
already-in-DRAM hot pages, but not on newly migrated hot pages: the
latter are expected to be placed at the tail of the LRU, which gives
them enough time on the LRU lists to gather accessed/PG_referenced
bits and prove to the kernel that they really are hot.

The daemon may select only DRAM/2 pages as hot, for two reasons (a
selection sketch follows the list):
- avoid thrashing, e.g. warm pages being promoted and then demoted
  soon after
- make sure enough DRAM LRU pages look "cold" to the kernel, so that
  vmscan does not end up busily rescanning the LRU lists
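
A hypothetical sketch of that selection heuristic; the structure and
the per-page hit counter are assumptions of this sketch, not code from
the series:

  #include <stdlib.h>

  struct candidate {
      void *addr;         /* page address in the target process */
      unsigned int hits;  /* scans that saw the accessed bit set */
  };

  static int cmp_hits_desc(const void *a, const void *b)
  {
      const struct candidate *ca = a, *cb = b;

      return (int)cb->hits - (int)ca->hits;  /* hits are small counts */
  }

  /* Sort candidates by hit count and cap the hot set at half of the
   * DRAM node's pages, so the rest still looks "cold" to the kernel. */
  static size_t select_hot_pages(struct candidate *c, size_t n,
                                 size_t dram_pages)
  {
      size_t limit = dram_pages / 2;

      qsort(c, n, sizeof(*c), cmp_hits_desc);
      return n < limit ? n : limit;
  }

The returned number of leading c[] entries would then be fed to a
move_pages() call like the promote_hot_pages() sketch above.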

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/migrate.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- linux.orig/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
+++ linux/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
@@ -55,6 +55,8 @@
 
 #include "internal.h"
 
+#define MPOL_MF_SW_YOUNG (1<<7)
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -1484,12 +1486,13 @@ static int do_move_pages_to_node(struct
  * the target node
  */
 static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
-		int node, struct list_head *pagelist, bool migrate_all)
+		int node, struct list_head *pagelist, int flags)
 {
 	struct vm_area_struct *vma;
 	struct page *page;
 	unsigned int follflags;
 	int err;
+	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
 
 	down_read(&mm->mmap_sem);
 	err = -EFAULT;
@@ -1519,6 +1522,8 @@ static int add_page_for_migration(struct
 
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
+			if (flags & MPOL_MF_SW_YOUNG)
+				SetPageReferenced(page);
 			isolate_huge_page(page, pagelist);
 			err = 0;
 		}
@@ -1531,6 +1536,8 @@ static int add_page_for_migration(struct
 			goto out_putpage;
 
 		err = 0;
+		if (flags & MPOL_MF_SW_YOUNG)
+			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
@@ -1606,7 +1613,7 @@ static int do_pages_move(struct mm_struc
 		 * report them via status
 		 */
 		err = add_page_for_migration(mm, addr, current_node,
-				&pagelist, flags & MPOL_MF_MOVE_ALL);
+				&pagelist, flags);
 		if (!err)
 			continue;
 
@@ -1725,7 +1732,7 @@ static int kernel_move_pages(pid_t pid,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_SW_YOUNG))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
