From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Ingo Molnar <mingo@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Fengguang Wu <fengguang.wu@intel.com>
Subject: [RFC 07/10] autonuma, memory tiering: Only promote page if accessed twice
Date: Fri,  1 Nov 2019 15:57:24 +0800
Message-ID: <20191101075727.26683-8-ying.huang@intel.com>
In-Reply-To: <20191101075727.26683-1-ying.huang@intel.com>

From: Huang Ying <ying.huang@intel.com>

The original assumption of auto NUMA balancing is that the memory
privately or mainly accessed by the CPUs of a NUMA node (called
private memory) should fit in the memory of that NUMA node.  So as
soon as a page is identified as private memory, it is migrated to the
target node.  Eventually all private memory will be migrated.

But this assumption doesn't hold in a memory tiering system.  In a
typical memory tiering system, each physical NUMA node contains CPUs,
fast memory, and slow memory.  The CPUs and the fast memory are put
in one logical node (called the fast memory node), while the slow
memory is put in another (fake) logical node (called the slow memory
node).  To take full advantage of the system resources, it is common
for the size of the workload's private memory to be larger than the
memory size of the fast memory node.
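
For example, on a 2-socket machine with both DRAM and PMEM, the
logical node layout typically looks as follows (the node numbers are
illustrative):

	socket 0: CPUs + DRAM  ->  node 0 (fast memory node)
	          PMEM         ->  node 2 (slow memory node)
	socket 1: CPUs + DRAM  ->  node 1 (fast memory node)
	          PMEM         ->  node 3 (slow memory node)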

To resolve the issue, we try to migrate only the hot pages in the
private memory to the fast memory node.  A private page that was
accessed at least twice within the current and the previous scanning
periods is identified as hot and migrated.  Otherwise, the page isn't
considered hot enough to be migrated.

To record whether a page was accessed in the last scanning period,
the Accessed bit of the PTE/PMD is used.  When the page tables are
scanned for autonuma, if pte_protnone(pte) is true, no NUMA hinting
fault has fired since the last scan, so the page wasn't accessed in
the last scan period and the Accessed bit is cleared; otherwise the
Accessed bit is kept.  When a NUMA page fault occurs, if the Accessed
bit is set, the page has been accessed at least twice within the
current and the previous scanning periods and will be migrated.
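
Condensed, the protocol looks like the sketch below.  It paraphrases
the mm/mprotect.c and mm/memory.c hunks of this patch, with locking
and the PMD variant omitted; see the diff for the real code.

	/* Scan side, change_pte_range(): age the access record. */
	if (pte_protnone(oldpte)) {
		/*
		 * Still PROT_NONE: no hinting fault fired, so the
		 * page wasn't accessed in the last scan period;
		 * forget the stale access record.
		 */
		if (pte_young(oldpte))
			set_pte_at(vma->vm_mm, addr, pte,
				   pte_mkold(oldpte));
		continue;	/* and avoid the TLB flush */
	}

	/* Fault side, do_numa_page(): migrate only on 2nd access. */
	if (!pte_young(old_pte))
		goto unmap_out;	/* one access in two periods: not hot */
	/* Accessed in the previous period too: hot, try to migrate. */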

The Accessed bit of the PTE/PMD is used by page reclaiming too, so a
conflict is possible.  Consider the following situation:

a) the page is moved from the active list to the inactive list with
   the Accessed bit cleared

b) the page is accessed, so the Accessed bit is set

c) the page table is scanned by autonuma, the PTE is set to
   PROTNONE+Accessed

d) the page isn't accessed

e) the page table is scanned by autonuma again, the Accessed bit is
   cleared

f) the inactive list is scanned for reclaiming, the page is reclaimed
   wrongly because the Accessed bit was cleared by autonuma

Although the page is reclaimed wrongly, it hasn't been accessed for
at least one NUMA balancing scanning period, so the page isn't that
hot anyway.  That is, this shouldn't be a severe issue.

The patch improves the score of the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a normal access address
distribution by 3.1% on a 2-socket Intel server with Optane DC
Persistent Memory.  In the test, the number of pages promoted by
autonuma is reduced by 7.2%, because those pages fail the
twice-access check.

Problems:

 - how to adjust the scanning period to match the hot page
   identification requirement.  E.g. if the number of pages being
   promoted is much larger than the free memory of the fast memory
   node, we need to scan faster to identify the really hot pages.
   But this will trigger more page faults too.
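
One hypothetical direction for this problem (not part of this patch;
the function, counters, and SCAN_PERIOD_MIN/MAX bounds below are
made-up names for illustration) is to halve the scan period when the
recently promoted pages exceed the free memory of the fast memory
node, and to back off otherwise:

	/* Hypothetical sketch only, not existing kernel code. */
	static unsigned int adjust_scan_period(unsigned int period_ms,
					       unsigned long promoted,
					       unsigned long free_pages)
	{
		if (promoted > free_pages)
			/* Too many candidates: scan faster to find
			 * the really hot pages. */
			return max(period_ms / 2, SCAN_PERIOD_MIN);
		/* Promotions fit: back off to cut fault overhead. */
		return min(period_ms * 2, SCAN_PERIOD_MAX);
	}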

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/huge_memory.c | 13 ++++++++++++-
 mm/memory.c      | 28 +++++++++++++++-------------
 mm/mprotect.c    | 11 ++++++++++-
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 61e241ce20fa..7634fb22931b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1528,6 +1528,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
 		goto out_unlock;
 
+	/* Only migrate if accessed twice */
+	if (!pmd_young(*vmf->pmd))
+		goto out_unlock;
+
 	/*
 	 * If there are potential migrations, wait for completion and retry
 	 * without disrupting NUMA hinting information. Do not relock and
@@ -1948,8 +1952,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (is_huge_zero_pmd(*pmd))
 			goto unlock;
 
-		if (pmd_protnone(*pmd))
+		if (pmd_protnone(*pmd)) {
+			/*
+			 * PMD young bit is used to record whether the
+			 * page is accessed in last scan period
+			 */
+			if (pmd_young(*pmd))
+				set_pmd_at(mm, addr, pmd, pmd_mkold(*pmd));
 			goto unlock;
+		}
 
 		page = pmd_page(*pmd);
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..e5da50eca36f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,10 +3711,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 */
 	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		goto out;
-	}
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+		goto unmap_out;
 
 	/*
 	 * Make it present again, Depending on how arch implementes non
@@ -3728,17 +3726,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
+	/* Only migrate if accessed twice */
+	if (!pte_young(old_pte))
+		goto unmap_out;
+
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto unmap_out;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto unmap_out;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -3776,10 +3774,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	} else
 		flags |= TNF_MIGRATE_FAIL;
 
-out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+
+unmap_out:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
+	return 0;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0636f2e5e05b..fd8d8e813717 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,8 +83,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				int nid;
 
 				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
+				if (pte_protnone(oldpte)) {
+					/*
+					 * PTE young bit is used to record
+					 * whether the page is accessed in
+					 * last scan period
+					 */
+					if (pte_young(oldpte))
+						set_pte_at(vma->vm_mm, addr, pte,
+							   pte_mkold(oldpte));
 					continue;
+				}
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
-- 
2.23.0

