linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] ksm: responses to NUMA review
@ 2013-02-21  8:17 Hugh Dickins
  2013-02-21  8:19 ` [PATCH 1/7] ksm: add some comments Hugh Dickins
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
response to Mel's review feedback, partly fixes to issues that I found
myself in doing more review and testing.  None of the issues fixed are
truly show-stoppers, though I would prefer them fixed sooner than later.

1 ksm: add some comments
2 ksm: treat unstable nid like in stable tree
3 ksm: shrink 32-bit rmap_item back to 32 bytes
4 mm,ksm: FOLL_MIGRATION do migration_entry_wait
5 mm,ksm: swapoff might need to copy
6 mm: cleanup "swapcache" in do_swap_page
7 ksm: allocate roots when needed

 Documentation/vm/ksm.txt |   16 +++-
 include/linux/mm.h       |    1 
 mm/ksm.c                 |  137 +++++++++++++++++++++++--------------
 mm/memory.c              |   38 +++++++---
 mm/swapfile.c            |   15 +++-
 5 files changed, 140 insertions(+), 67 deletions(-)

Thanks,
Hugh


* [PATCH 1/7] ksm: add some comments
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
@ 2013-02-21  8:19 ` Hugh Dickins
  2013-02-22  4:26   ` Ric Mason
  2013-02-21  8:20 ` [PATCH 2/7] ksm: treat unstable nid like in stable tree Hugh Dickins
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

Added slightly more detail to the Documentation of merge_across_nodes,
a few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it".  No functional change.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/vm/ksm.txt |   16 ++++++++++++----
 mm/ksm.c                 |   18 ++++++++++++++----
 2 files changed, 26 insertions(+), 8 deletions(-)

--- mmotm.orig/Documentation/vm/ksm.txt	2013-02-20 22:28:09.456001057 -0800
+++ mmotm/Documentation/vm/ksm.txt	2013-02-20 22:28:23.580001392 -0800
@@ -60,10 +60,18 @@ sleep_millisecs  - how many milliseconds
 
 merge_across_nodes - specifies if pages from different numa nodes can be merged.
                    When set to 0, ksm merges only pages which physically
-                   reside in the memory area of same NUMA node. It brings
-                   lower latency to access to shared page. Value can be
-                   changed only when there is no ksm shared pages in system.
-                   Default: 1
+                   reside in the memory area of same NUMA node. That brings
+                   lower latency to access of shared pages. Systems with more
+                   nodes, at significant NUMA distances, are likely to benefit
+                   from the lower latency of setting 0. Smaller systems, which
+                   need to minimize memory usage, are likely to benefit from
+                   the greater sharing of setting 1 (default). You may wish to
+                   compare how your system performs under each setting, before
+                   deciding on which to use. merge_across_nodes setting can be
+                   changed only when there are no ksm shared pages in system:
+                   set run 2 to unmerge pages first, then to 1 after changing
+                   merge_across_nodes, to remerge according to the new setting.
+                   Default: 1 (merging across nodes as in earlier releases)
 
 run              - set 0 to stop ksmd from running but keep merged pages,
                    set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
--- mmotm.orig/mm/ksm.c	2013-02-20 22:28:09.456001057 -0800
+++ mmotm/mm/ksm.c	2013-02-20 22:28:23.584001392 -0800
@@ -87,6 +87,9 @@
  *    take 10 attempts to find a page in the unstable tree, once it is found,
  *    it is secured in the stable tree.  (When we scan a new page, we first
  *    compare it against the stable tree, and then against the unstable tree.)
+ *
+ * If the merge_across_nodes tunable is unset, then KSM maintains multiple
+ * stable trees and multiple unstable trees: one of each for each NUMA node.
  */
 
 /**
@@ -524,7 +527,7 @@ static void remove_node_from_stable_tree
  * a page to put something that might look like our key in page->mapping.
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
-static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
+static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 {
 	struct page *page;
 	void *expected_mapping;
@@ -573,7 +576,7 @@ again:
 		goto stale;
 	}
 
-	if (locked) {
+	if (lock_it) {
 		lock_page(page);
 		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 			unlock_page(page);
@@ -703,10 +706,17 @@ static int remove_stable_node(struct sta
 		return 0;
 	}
 
-	if (WARN_ON_ONCE(page_mapped(page)))
+	if (WARN_ON_ONCE(page_mapped(page))) {
+		/*
+		 * This should not happen: but if it does, just refuse to let
+		 * merge_across_nodes be switched - there is no need to panic.
+		 */
 		err = -EBUSY;
-	else {
+	} else {
 		/*
+		 * The stable node did not yet appear stale to get_ksm_page(),
+		 * since that allows for an unmapped ksm page to be recognized
+		 * right up until it is freed; but the node is safe to remove.
 		 * This page might be in a pagevec waiting to be freed,
 		 * or it might be PageSwapCache (perhaps under writeback),
 		 * or it might have been removed from swapcache a moment ago.


* [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
  2013-02-21  8:19 ` [PATCH 1/7] ksm: add some comments Hugh Dickins
@ 2013-02-21  8:20 ` Hugh Dickins
  2013-02-22  7:13   ` Ric Mason
  2013-02-21  8:22 ` [PATCH 3/7] ksm: shrink 32-bit rmap_item back to 32 bytes Hugh Dickins
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

An inconsistency emerged in reviewing the NUMA node changes to KSM:
when meeting a page from the wrong NUMA node in a stable tree, we say
that it's okay for comparisons, but not as a leaf for merging; whereas
when meeting a page from the wrong NUMA node in an unstable tree, we
bail out immediately.

Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instability (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instability in the unstable tree.

Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-02-20 22:28:23.584001392 -0800
+++ mmotm/mm/ksm.c	2013-02-20 22:28:27.288001480 -0800
@@ -1340,16 +1340,6 @@ struct rmap_item *unstable_tree_search_i
 			return NULL;
 		}
 
-		/*
-		 * If tree_page has been migrated to another NUMA node, it
-		 * will be flushed out and put into the right unstable tree
-		 * next time: only merge with it if merge_across_nodes.
-		 */
-		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
-			put_page(tree_page);
-			return NULL;
-		}
-
 		ret = memcmp_pages(page, tree_page);
 
 		parent = *new;
@@ -1359,6 +1349,15 @@ struct rmap_item *unstable_tree_search_i
 		} else if (ret > 0) {
 			put_page(tree_page);
 			new = &parent->rb_right;
+		} else if (!ksm_merge_across_nodes &&
+			   page_to_nid(tree_page) != nid) {
+			/*
+			 * If tree_page has been migrated to another NUMA node,
+			 * it will be flushed out and put in the right unstable
+			 * tree next time: only merge with it when across_nodes.
+			 */
+			put_page(tree_page);
+			return NULL;
 		} else {
 			*tree_pagep = tree_page;
 			return tree_rmap_item;


* [PATCH 3/7] ksm: shrink 32-bit rmap_item back to 32 bytes
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
  2013-02-21  8:19 ` [PATCH 1/7] ksm: add some comments Hugh Dickins
  2013-02-21  8:20 ` [PATCH 2/7] ksm: treat unstable nid like in stable tree Hugh Dickins
@ 2013-02-21  8:22 ` Hugh Dickins
  2013-02-21  8:23 ` [PATCH 4/7] mm,ksm: FOLL_MIGRATION do migration_entry_wait Hugh Dickins
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

Think of struct rmap_item as an extension of struct page (restricted
to MADV_MERGEABLE areas): there may be a lot of them, we need to keep
them small, especially on 32-bit architectures of limited lowmem.

Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab.  We'd better avoid that.

Hey, I only just remembered that the anon_vma pointer in struct rmap_item
has no purpose until the rmap_item is hung from a stable tree node (which
has its own nid field); and rmap_item's nid field has no purpose other than
to say which tree root to tell rb_erase() when unlinking from an unstable tree.

Double them up in a union.  There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid.  No need to spatter
BUG()s around: we'd be seeing oopses if this were wrong.
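
For anyone who wants to check the size arithmetic above, here is a rough
userspace model, printing sizeof before and after the union: the _stub,
_before and _after names are my own, with typical 32-bit and 64-bit field
widths standing in for the real kernel definitions.

	#include <stdio.h>

	struct rb_node_stub { unsigned long parent_color; void *left, *right; };
	struct hlist_node_stub { void *next, **pprev; };

	struct rmap_item_before {		/* "int nid" as a separate field */
		void *rmap_list;
		void *anon_vma;
		void *mm;
		unsigned long address;
		unsigned int oldchecksum;
		int nid;
		union {
			struct rb_node_stub node;
			struct {
				void *head;
				struct hlist_node_stub hlist;
			};
		};
	};

	struct rmap_item_after {		/* nid folded into the union */
		void *rmap_list;
		union {
			void *anon_vma;		/* when stable */
			int nid;		/* when node of unstable tree */
		};
		void *mm;
		unsigned long address;
		unsigned int oldchecksum;
		union {
			struct rb_node_stub node;
			struct {
				void *head;
				struct hlist_node_stub hlist;
			};
		};
	};

	int main(void)
	{
		/* gcc -m32: 36 then 32 bytes; gcc -m64: 64 then 64 bytes */
		printf("before %zu, after %zu\n",
		       sizeof(struct rmap_item_before),
		       sizeof(struct rmap_item_after));
		return 0;
	}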

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-02-20 22:28:27.288001480 -0800
+++ mmotm/mm/ksm.c	2013-02-20 22:28:29.688001537 -0800
@@ -150,23 +150,25 @@ struct stable_node {
  * struct rmap_item - reverse mapping item for virtual addresses
  * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list
  * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree
+ * @nid: NUMA node id of unstable tree in which linked (may not match page)
  * @mm: the memory structure this rmap_item is pointing into
  * @address: the virtual address this rmap_item tracks (+ flags in low bits)
  * @oldchecksum: previous checksum of the page at that virtual address
- * @nid: NUMA node id of unstable tree in which linked (may not match page)
  * @node: rb node of this rmap_item in the unstable tree
  * @head: pointer to stable_node heading this list in the stable tree
  * @hlist: link into hlist of rmap_items hanging off that stable_node
  */
 struct rmap_item {
 	struct rmap_item *rmap_list;
-	struct anon_vma *anon_vma;	/* when stable */
+	union {
+		struct anon_vma *anon_vma;	/* when stable */
+#ifdef CONFIG_NUMA
+		int nid;		/* when node of unstable tree */
+#endif
+	};
 	struct mm_struct *mm;
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
-#ifdef CONFIG_NUMA
-	int nid;
-#endif
 	union {
 		struct rb_node node;	/* when node of unstable tree */
 		struct {		/* when listed from stable tree */
@@ -1092,6 +1094,9 @@ static int try_to_merge_with_ksm_page(st
 	if (err)
 		goto out;
 
+	/* Unstable nid is in union with stable anon_vma: remove first */
+	remove_rmap_item_from_tree(rmap_item);
+
 	/* Must get reference to anon_vma while still holding mmap_sem */
 	rmap_item->anon_vma = vma->anon_vma;
 	get_anon_vma(vma->anon_vma);
@@ -1466,14 +1471,11 @@ static void cmp_and_merge_page(struct pa
 		kpage = try_to_merge_two_pages(rmap_item, page,
 						tree_rmap_item, tree_page);
 		put_page(tree_page);
-		/*
-		 * As soon as we merge this page, we want to remove the
-		 * rmap_item of the page we have merged with from the unstable
-		 * tree, and insert it instead as new node in the stable tree.
-		 */
 		if (kpage) {
-			remove_rmap_item_from_tree(tree_rmap_item);
-
+			/*
+			 * The pages were successfully merged: insert new
+			 * node in the stable tree and add both rmap_items.
+			 */
 			lock_page(kpage);
 			stable_node = stable_tree_insert(kpage);
 			if (stable_node) {


* [PATCH 4/7] mm,ksm: FOLL_MIGRATION do migration_entry_wait
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
                   ` (2 preceding siblings ...)
  2013-02-21  8:22 ` [PATCH 3/7] ksm: shrink 32-bit rmap_item back to 32 bytes Hugh Dickins
@ 2013-02-21  8:23 ` Hugh Dickins
  2013-02-21  8:25 ` [PATCH 5/7] mm,ksm: swapoff might need to copy Hugh Dickins
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.

It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte.  follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a fault.

Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow().  I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()?  Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |    1 +
 mm/ksm.c           |    2 +-
 mm/memory.c        |   20 ++++++++++++++++++--
 3 files changed, 20 insertions(+), 3 deletions(-)

--- mmotm.orig/include/linux/mm.h	2013-02-19 18:51:24.572031860 -0800
+++ mmotm/include/linux/mm.h	2013-02-20 22:42:54.728022096 -0800
@@ -1652,6 +1652,7 @@ static inline struct page *follow_page(s
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
+#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
--- mmotm.orig/mm/ksm.c	2013-02-20 22:28:29.688001537 -0800
+++ mmotm/mm/ksm.c	2013-02-20 22:50:10.540032454 -0800
@@ -363,7 +363,7 @@ static int break_ksm(struct vm_area_stru
 
 	do {
 		cond_resched();
-		page = follow_page(vma, addr, FOLL_GET);
+		page = follow_page(vma, addr, FOLL_GET | FOLL_MIGRATION);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
--- mmotm.orig/mm/memory.c	2013-02-20 22:28:09.168001050 -0800
+++ mmotm/mm/memory.c	2013-02-20 22:43:47.228023344 -0800
@@ -1548,8 +1548,24 @@ split_fallthrough:
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	pte = *ptep;
-	if (!pte_present(pte))
-		goto no_page;
+	if (!pte_present(pte)) {
+		swp_entry_t entry;
+		/*
+		 * KSM's break_ksm() relies upon recognizing a ksm page
+		 * even while it is being migrated, so for that case we
+		 * need migration_entry_wait().
+		 */
+		if (likely(!(flags & FOLL_MIGRATION)))
+			goto no_page;
+		if (pte_none(pte) || pte_file(pte))
+			goto no_page;
+		entry = pte_to_swp_entry(pte);
+		if (!is_migration_entry(entry))
+			goto no_page;
+		pte_unmap_unlock(ptep, ptl);
+		migration_entry_wait(mm, pmd, address);
+		goto split_fallthrough;
+	}
 	if ((flags & FOLL_NUMA) && pte_numa(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))


* [PATCH 5/7] mm,ksm: swapoff might need to copy
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
                   ` (3 preceding siblings ...)
  2013-02-21  8:23 ` [PATCH 4/7] mm,ksm: FOLL_MIGRATION do migration_entry_wait Hugh Dickins
@ 2013-02-21  8:25 ` Hugh Dickins
  2013-02-21 14:53   ` Johannes Weiner
  2013-02-21  8:27 ` [PATCH 6/7] mm: cleanup "swapcache" in do_swap_page Hugh Dickins
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

Before establishing that KSM page migration was the cause of my
WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which
in many respects is equivalent to faulting in a page.

In fact I've never caught that as the cause: but in theory it does
at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(),
to avoid bringing a KSM page back in when it's not supposed to be.

I intended to copy how it's done in do_swap_page(), but have a strong
aversion to how "swapcache" ends up being used there: rework it with
"page != swapcache".

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/swapfile.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- mmotm.orig/mm/swapfile.c	2013-02-20 22:28:09.076001048 -0800
+++ mmotm/mm/swapfile.c	2013-02-20 23:20:50.872076192 -0800
@@ -874,11 +874,17 @@ unsigned int count_swap_pages(int type,
 static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
+	struct page *swapcache;
 	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret = 1;
 
+	swapcache = page;
+	page = ksm_might_need_to_copy(page, vma, addr);
+	if (unlikely(!page))
+		return -ENOMEM;
+
 	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
 					 GFP_KERNEL, &memcg)) {
 		ret = -ENOMEM;
@@ -897,7 +903,10 @@ static int unuse_pte(struct vm_area_stru
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	page_add_anon_rmap(page, vma, addr);
+	if (page == swapcache)
+		page_add_anon_rmap(page, vma, addr);
+	else /* ksm created a completely new copy */
+		page_add_new_anon_rmap(page, vma, addr);
 	mem_cgroup_commit_charge_swapin(page, memcg);
 	swap_free(entry);
 	/*
@@ -908,6 +917,10 @@ static int unuse_pte(struct vm_area_stru
 out:
 	pte_unmap_unlock(pte, ptl);
 out_nolock:
+	if (page != swapcache) {
+		unlock_page(page);
+		put_page(page);
+	}
 	return ret;
 }
 


* [PATCH 6/7] mm: cleanup "swapcache" in do_swap_page
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
                   ` (4 preceding siblings ...)
  2013-02-21  8:25 ` [PATCH 5/7] mm,ksm: swapoff might need to copy Hugh Dickins
@ 2013-02-21  8:27 ` Hugh Dickins
  2013-02-21  8:29 ` [PATCH 7/7] ksm: allocate roots when needed Hugh Dickins
  2013-02-22  3:44 ` [PATCH 0/7] ksm: responses to NUMA review Ric Mason
  7 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

I dislike the way in which "swapcache" gets used in do_swap_page():
there is always a page from swapcache there (even if maybe uncached
by the time we lock it), but tests are made according to "swapcache".
Rework that with "page != swapcache", as has been done in unuse_pte().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/memory.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/memory.c	2013-02-20 22:43:47.228023344 -0800
+++ mmotm/mm/memory.c	2013-02-20 23:21:00.304076416 -0800
@@ -2954,7 +2954,7 @@ static int do_swap_page(struct mm_struct
 		unsigned int flags, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page, *swapcache = NULL;
+	struct page *page, *swapcache;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
@@ -3005,9 +3005,11 @@ static int do_swap_page(struct mm_struct
 		 */
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+		swapcache = page;
 		goto out_release;
 	}
 
+	swapcache = page;
 	locked = lock_page_or_retry(page, mm, flags);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -3025,16 +3027,12 @@ static int do_swap_page(struct mm_struct
 	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
 		goto out_page;
 
-	swapcache = page;
 	page = ksm_might_need_to_copy(page, vma, address);
 	if (unlikely(!page)) {
 		ret = VM_FAULT_OOM;
 		page = swapcache;
-		swapcache = NULL;
 		goto out_page;
 	}
-	if (page == swapcache)
-		swapcache = NULL;
 
 	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
 		ret = VM_FAULT_OOM;
@@ -3078,10 +3076,10 @@ static int do_swap_page(struct mm_struct
 	}
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
-	if (swapcache) /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
-	else
+	if (page == swapcache)
 		do_page_add_anon_rmap(page, vma, address, exclusive);
+	else /* ksm created a completely new copy */
+		page_add_new_anon_rmap(page, vma, address);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -3089,7 +3087,7 @@ static int do_swap_page(struct mm_struct
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
 	unlock_page(page);
-	if (swapcache) {
+	if (page != swapcache) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
@@ -3122,7 +3120,7 @@ out_page:
 	unlock_page(page);
 out_release:
 	page_cache_release(page);
-	if (swapcache) {
+	if (page != swapcache) {
 		unlock_page(swapcache);
 		page_cache_release(swapcache);
 	}


* [PATCH 7/7] ksm: allocate roots when needed
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
                   ` (5 preceding siblings ...)
  2013-02-21  8:27 ` [PATCH 6/7] mm: cleanup "swapcache" in do_swap_page Hugh Dickins
@ 2013-02-21  8:29 ` Hugh Dickins
  2013-02-22  3:44 ` [PATCH 0/7] ksm: responses to NUMA review Ric Mason
  7 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-21  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
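
(To spell out that 16kB: CONFIG_MAXSMP implies NODES_SHIFT 10, so
MAX_NUMNODES is 1024; two arrays of 1024 struct rb_roots, each root just
one 8-byte pointer on 64-bit, come to 2 * 1024 * 8 bytes = 16kB.)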

Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   72 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 23 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-02-20 22:50:10.540032454 -0800
+++ mmotm/mm/ksm.c	2013-02-20 23:21:28.908077096 -0800
@@ -183,8 +183,10 @@ struct rmap_item {
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_unstable_tree[MAX_NUMNODES];
-static struct rb_root root_stable_tree[MAX_NUMNODES];
+static struct rb_root one_stable_tree[1] = { RB_ROOT };
+static struct rb_root one_unstable_tree[1] = { RB_ROOT };
+static struct rb_root *root_stable_tree = one_stable_tree;
+static struct rb_root *root_unstable_tree = one_unstable_tree;
 
 /* Recently migrated nodes of stable tree, pending proper placement */
 static LIST_HEAD(migrate_nodes);
@@ -224,8 +226,10 @@ static unsigned int ksm_thread_sleep_mil
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
+static int ksm_nr_node_ids = 1;
 #else
 #define ksm_merge_across_nodes	1U
+#define ksm_nr_node_ids		1
 #endif
 
 #define KSM_RUN_STOP	0
@@ -506,7 +510,7 @@ static void remove_node_from_stable_tree
 		list_del(&stable_node->list);
 	else
 		rb_erase(&stable_node->node,
-			 &root_stable_tree[NUMA(stable_node->nid)]);
+			 root_stable_tree + NUMA(stable_node->nid));
 	free_stable_node(stable_node);
 }
 
@@ -642,7 +646,7 @@ static void remove_rmap_item_from_tree(s
 		BUG_ON(age > 1);
 		if (!age)
 			rb_erase(&rmap_item->node,
-				 &root_unstable_tree[NUMA(rmap_item->nid)]);
+				 root_unstable_tree + NUMA(rmap_item->nid));
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
 	}
@@ -740,7 +744,7 @@ static int remove_all_stable_nodes(void)
 	int nid;
 	int err = 0;
 
-	for (nid = 0; nid < nr_node_ids; nid++) {
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
 		while (root_stable_tree[nid].rb_node) {
 			stable_node = rb_entry(root_stable_tree[nid].rb_node,
 						struct stable_node, node);
@@ -1148,6 +1152,7 @@ static struct page *try_to_merge_two_pag
 static struct page *stable_tree_search(struct page *page)
 {
 	int nid;
+	struct rb_root *root;
 	struct rb_node **new;
 	struct rb_node *parent;
 	struct stable_node *stable_node;
@@ -1161,8 +1166,9 @@ static struct page *stable_tree_search(s
 	}
 
 	nid = get_kpfn_nid(page_to_pfn(page));
+	root = root_stable_tree + nid;
 again:
-	new = &root_stable_tree[nid].rb_node;
+	new = &root->rb_node;
 	parent = NULL;
 
 	while (*new) {
@@ -1217,7 +1223,7 @@ again:
 	list_del(&page_node->list);
 	DO_NUMA(page_node->nid = nid);
 	rb_link_node(&page_node->node, parent, new);
-	rb_insert_color(&page_node->node, &root_stable_tree[nid]);
+	rb_insert_color(&page_node->node, root);
 	get_page(page);
 	return page;
 
@@ -1225,11 +1231,10 @@ replace:
 	if (page_node) {
 		list_del(&page_node->list);
 		DO_NUMA(page_node->nid = nid);
-		rb_replace_node(&stable_node->node,
-				&page_node->node, &root_stable_tree[nid]);
+		rb_replace_node(&stable_node->node, &page_node->node, root);
 		get_page(page);
 	} else {
-		rb_erase(&stable_node->node, &root_stable_tree[nid]);
+		rb_erase(&stable_node->node, root);
 		page = NULL;
 	}
 	stable_node->head = &migrate_nodes;
@@ -1248,13 +1253,15 @@ static struct stable_node *stable_tree_i
 {
 	int nid;
 	unsigned long kpfn;
+	struct rb_root *root;
 	struct rb_node **new;
 	struct rb_node *parent = NULL;
 	struct stable_node *stable_node;
 
 	kpfn = page_to_pfn(kpage);
 	nid = get_kpfn_nid(kpfn);
-	new = &root_stable_tree[nid].rb_node;
+	root = root_stable_tree + nid;
+	new = &root->rb_node;
 
 	while (*new) {
 		struct page *tree_page;
@@ -1293,7 +1300,7 @@ static struct stable_node *stable_tree_i
 	set_page_stable_node(kpage, stable_node);
 	DO_NUMA(stable_node->nid = nid);
 	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
+	rb_insert_color(&stable_node->node, root);
 
 	return stable_node;
 }
@@ -1323,7 +1330,7 @@ struct rmap_item *unstable_tree_search_i
 	int nid;
 
 	nid = get_kpfn_nid(page_to_pfn(page));
-	root = &root_unstable_tree[nid];
+	root = root_unstable_tree + nid;
 	new = &root->rb_node;
 
 	while (*new) {
@@ -1420,7 +1427,7 @@ static void cmp_and_merge_page(struct pa
 		if (stable_node->head != &migrate_nodes &&
 		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
 			rb_erase(&stable_node->node,
-				 &root_stable_tree[NUMA(stable_node->nid)]);
+				 root_stable_tree + NUMA(stable_node->nid));
 			stable_node->head = &migrate_nodes;
 			list_add(&stable_node->list, stable_node->head);
 		}
@@ -1572,7 +1579,7 @@ static struct rmap_item *scan_get_next_r
 			}
 		}
 
-		for (nid = 0; nid < nr_node_ids; nid++)
+		for (nid = 0; nid < ksm_nr_node_ids; nid++)
 			root_unstable_tree[nid] = RB_ROOT;
 
 		spin_lock(&ksm_mmlist_lock);
@@ -2089,8 +2096,8 @@ static void ksm_check_stable_tree(unsign
 	struct rb_node *node;
 	int nid;
 
-	for (nid = 0; nid < nr_node_ids; nid++) {
-		node = rb_first(&root_stable_tree[nid]);
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = rb_first(root_stable_tree + nid);
 		while (node) {
 			stable_node = rb_entry(node, struct stable_node, node);
 			if (stable_node->kpfn >= start_pfn &&
@@ -2100,7 +2107,7 @@ static void ksm_check_stable_tree(unsign
 				 * which is why we keep kpfn instead of page*
 				 */
 				remove_node_from_stable_tree(stable_node);
-				node = rb_first(&root_stable_tree[nid]);
+				node = rb_first(root_stable_tree + nid);
 			} else
 				node = rb_next(node);
 			cond_resched();
@@ -2293,8 +2300,31 @@ static ssize_t merge_across_nodes_store(
 	if (ksm_merge_across_nodes != knob) {
 		if (ksm_pages_shared || remove_all_stable_nodes())
 			err = -EBUSY;
-		else
+		else if (root_stable_tree == one_stable_tree) {
+			struct rb_root *buf;
+			/*
+			 * This is the first time that we switch away from the
+			 * default of merging across nodes: must now allocate
+			 * a buffer to hold as many roots as may be needed.
+			 * Allocate stable and unstable together:
+			 * MAXSMP NODES_SHIFT 10 will use 16kB.
+			 */
+			buf = kcalloc(nr_node_ids + nr_node_ids,
+				sizeof(*buf), GFP_KERNEL | __GFP_ZERO);
+			/* Let us assume that RB_ROOT is NULL is zero */
+			if (!buf)
+				err = -ENOMEM;
+			else {
+				root_stable_tree = buf;
+				root_unstable_tree = buf + nr_node_ids;
+				/* Stable tree is empty but not the unstable */
+				root_unstable_tree[0] = one_unstable_tree[0];
+			}
+		}
+		if (!err) {
 			ksm_merge_across_nodes = knob;
+			ksm_nr_node_ids = knob ? 1 : nr_node_ids;
+		}
 	}
 	mutex_unlock(&ksm_thread_mutex);
 
@@ -2373,15 +2403,11 @@ static int __init ksm_init(void)
 {
 	struct task_struct *ksm_thread;
 	int err;
-	int nid;
 
 	err = ksm_slab_init();
 	if (err)
 		goto out;
 
-	for (nid = 0; nid < nr_node_ids; nid++)
-		root_stable_tree[nid] = RB_ROOT;
-
 	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
 	if (IS_ERR(ksm_thread)) {
 		printk(KERN_ERR "ksm: creating kthread failed\n");


* Re: [PATCH 5/7] mm,ksm: swapoff might need to copy
  2013-02-21  8:25 ` [PATCH 5/7] mm,ksm: swapoff might need to copy Hugh Dickins
@ 2013-02-21 14:53   ` Johannes Weiner
  2013-02-22 17:16     ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Johannes Weiner @ 2013-02-21 14:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Thu, Feb 21, 2013 at 12:25:40AM -0800, Hugh Dickins wrote:
> Before establishing that KSM page migration was the cause of my
> WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
> lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which
> in many respects is equivalent to faulting in a page.
> 
> In fact I've never caught that as the cause: but in theory it does
> at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(),
> to avoid bringing a KSM page back in when it's not supposed to be.

Maybe I am mistaken, maybe it was just too obvious to you to mention,
but the main reason for me would be that this can break eviction,
migration, etc. of that page when there is no rmap_item representing
the vma->anon_vma (the cross-anon_vma merge case), no?

> I intended to copy how it's done in do_swap_page(), but have a strong
> aversion to how "swapcache" ends up being used there: rework it with
> "page != swapcache".
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 0/7] ksm: responses to NUMA review
  2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
                   ` (6 preceding siblings ...)
  2013-02-21  8:29 ` [PATCH 7/7] ksm: allocate roots when needed Hugh Dickins
@ 2013-02-22  3:44 ` Ric Mason
  2013-02-22 20:38   ` Hugh Dickins
  7 siblings, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-02-22  3:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On 02/21/2013 04:17 PM, Hugh Dickins wrote:
> Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
> response to Mel's review feedback, partly fixes to issues that I found
> myself in doing more review and testing.  None of the issues fixed are
> truly show-stoppers, though I would prefer them fixed sooner than later.

Do you have any ideas about ksm supporting page cache and tmpfs?

>
> 1 ksm: add some comments
> 2 ksm: treat unstable nid like in stable tree
> 3 ksm: shrink 32-bit rmap_item back to 32 bytes
> 4 mm,ksm: FOLL_MIGRATION do migration_entry_wait
> 5 mm,ksm: swapoff might need to copy
> 6 mm: cleanup "swapcache" in do_swap_page
> 7 ksm: allocate roots when needed
>
>   Documentation/vm/ksm.txt |   16 +++-
>   include/linux/mm.h       |    1
>   mm/ksm.c                 |  137 +++++++++++++++++++++++--------------
>   mm/memory.c              |   38 +++++++---
>   mm/swapfile.c            |   15 +++-
>   5 files changed, 140 insertions(+), 67 deletions(-)
>
> Thanks,
> Hugh
>



* Re: [PATCH 1/7] ksm: add some comments
  2013-02-21  8:19 ` [PATCH 1/7] ksm: add some comments Hugh Dickins
@ 2013-02-22  4:26   ` Ric Mason
  2013-02-22 20:50     ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-02-22  4:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On 02/21/2013 04:19 PM, Hugh Dickins wrote:
> Added slightly more detail to the Documentation of merge_across_nodes,
> a few comments in areas indicated by review, and renamed get_ksm_page()'s
> argument from "locked" to "lock_it".  No functional change.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   Documentation/vm/ksm.txt |   16 ++++++++++++----
>   mm/ksm.c                 |   18 ++++++++++++++----
>   2 files changed, 26 insertions(+), 8 deletions(-)
>
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-02-20 22:28:09.456001057 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-02-20 22:28:23.580001392 -0800
> @@ -60,10 +60,18 @@ sleep_millisecs  - how many milliseconds
>   
>   merge_across_nodes - specifies if pages from different numa nodes can be merged.
>                      When set to 0, ksm merges only pages which physically
> -                   reside in the memory area of same NUMA node. It brings
> -                   lower latency to access to shared page. Value can be
> -                   changed only when there is no ksm shared pages in system.
> -                   Default: 1
> +                   reside in the memory area of same NUMA node. That brings
> +                   lower latency to access of shared pages. Systems with more
> +                   nodes, at significant NUMA distances, are likely to benefit
> +                   from the lower latency of setting 0. Smaller systems, which
> +                   need to minimize memory usage, are likely to benefit from
> +                   the greater sharing of setting 1 (default). You may wish to
> +                   compare how your system performs under each setting, before
> +                   deciding on which to use. merge_across_nodes setting can be
> +                   changed only when there are no ksm shared pages in system:
> +                   set run 2 to unmerge pages first, then to 1 after changing
> +                   merge_across_nodes, to remerge according to the new setting.

What's the root reason the merge_across_nodes setting can be changed
only when there are no ksm shared pages in the system? Can they be unmerged
and merged again during ksmd scan?

> +                   Default: 1 (merging across nodes as in earlier releases)
>   
>   run              - set 0 to stop ksmd from running but keep merged pages,
>                      set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> --- mmotm.orig/mm/ksm.c	2013-02-20 22:28:09.456001057 -0800
> +++ mmotm/mm/ksm.c	2013-02-20 22:28:23.584001392 -0800
> @@ -87,6 +87,9 @@
>    *    take 10 attempts to find a page in the unstable tree, once it is found,
>    *    it is secured in the stable tree.  (When we scan a new page, we first
>    *    compare it against the stable tree, and then against the unstable tree.)
> + *
> + * If the merge_across_nodes tunable is unset, then KSM maintains multiple
> + * stable trees and multiple unstable trees: one of each for each NUMA node.
>    */
>   
>   /**
> @@ -524,7 +527,7 @@ static void remove_node_from_stable_tree
>    * a page to put something that might look like our key in page->mapping.
>    * is on its way to being freed; but it is an anomaly to bear in mind.
>    */
> -static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> +static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
>   {
>   	struct page *page;
>   	void *expected_mapping;
> @@ -573,7 +576,7 @@ again:
>   		goto stale;
>   	}
>   
> -	if (locked) {
> +	if (lock_it) {
>   		lock_page(page);
>   		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
>   			unlock_page(page);
> @@ -703,10 +706,17 @@ static int remove_stable_node(struct sta
>   		return 0;
>   	}
>   
> -	if (WARN_ON_ONCE(page_mapped(page)))
> +	if (WARN_ON_ONCE(page_mapped(page))) {
> +		/*
> +		 * This should not happen: but if it does, just refuse to let
> +		 * merge_across_nodes be switched - there is no need to panic.
> +		 */
>   		err = -EBUSY;
> -	else {
> +	} else {
>   		/*
> +		 * The stable node did not yet appear stale to get_ksm_page(),
> +		 * since that allows for an unmapped ksm page to be recognized
> +		 * right up until it is freed; but the node is safe to remove.
>   		 * This page might be in a pagevec waiting to be freed,
>   		 * or it might be PageSwapCache (perhaps under writeback),
>   		 * or it might have been removed from swapcache a moment ago.
>



* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-02-21  8:20 ` [PATCH 2/7] ksm: treat unstable nid like in stable tree Hugh Dickins
@ 2013-02-22  7:13   ` Ric Mason
  2013-02-22 21:03     ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-02-22  7:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On 02/21/2013 04:20 PM, Hugh Dickins wrote:
> An inconsistency emerged in reviewing the NUMA node changes to KSM:
> when meeting a page from the wrong NUMA node in a stable tree, we say
> that it's okay for comparisons, but not as a leaf for merging; whereas
> when meeting a page from the wrong NUMA node in an unstable tree, we
> bail out immediately.

IIUC
- ksm page from the wrong NUMA node will be added to current node's stable
tree
- normal page from the wrong NUMA node will be merged to current node's
stable tree  <- what am I missing here? I didn't see any special handling in
function stable_tree_search for this case.
- normal page from the wrong NUMA node will be compared but not as a leaf
for merging after the patch

>
> Now, it might be that a wrong NUMA node in an unstable tree is more
> likely to correlate with instability (different content, with rbnode
> now misplaced) than page migration; but even so, we are accustomed to
> instability in the unstable tree.
>
> Without strong evidence for which strategy is generally better, I'd
> rather be consistent with what's done in the stable tree: accept a page
> from the wrong NUMA node for comparison, but not as a leaf for merging.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   mm/ksm.c |   19 +++++++++----------
>   1 file changed, 9 insertions(+), 10 deletions(-)
>
> --- mmotm.orig/mm/ksm.c	2013-02-20 22:28:23.584001392 -0800
> +++ mmotm/mm/ksm.c	2013-02-20 22:28:27.288001480 -0800
> @@ -1340,16 +1340,6 @@ struct rmap_item *unstable_tree_search_i
>   			return NULL;
>   		}
>   
> -		/*
> -		 * If tree_page has been migrated to another NUMA node, it
> -		 * will be flushed out and put into the right unstable tree
> -		 * next time: only merge with it if merge_across_nodes.
> -		 */
> -		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> -			put_page(tree_page);
> -			return NULL;
> -		}
> -
>   		ret = memcmp_pages(page, tree_page);
>   
>   		parent = *new;
> @@ -1359,6 +1349,15 @@ struct rmap_item *unstable_tree_search_i
>   		} else if (ret > 0) {
>   			put_page(tree_page);
>   			new = &parent->rb_right;
> +		} else if (!ksm_merge_across_nodes &&
> +			   page_to_nid(tree_page) != nid) {
> +			/*
> +			 * If tree_page has been migrated to another NUMA node,
> +			 * it will be flushed out and put in the right unstable
> +			 * tree next time: only merge with it when across_nodes.
> +			 */
> +			put_page(tree_page);
> +			return NULL;
>   		} else {
>   			*tree_pagep = tree_page;
>   			return tree_rmap_item;
>



* Re: [PATCH 5/7] mm,ksm: swapoff might need to copy
  2013-02-21 14:53   ` Johannes Weiner
@ 2013-02-22 17:16     ` Hugh Dickins
  0 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-22 17:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Thu, 21 Feb 2013, Johannes Weiner wrote:
> On Thu, Feb 21, 2013 at 12:25:40AM -0800, Hugh Dickins wrote:
> > Before establishing that KSM page migration was the cause of my
> > WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
> > lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which
> > in many respects is equivalent to faulting in a page.
> > 
> > In fact I've never caught that as the cause: but in theory it does
> > at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(),
> > to avoid bringing a KSM page back in when it's not supposed to be.
> 
> Maybe I am mistaken, maybe it was just too obvious to you to mention,
> but the main reason for me would be that this can break eviction,
> migration, etc. of that page when there is no rmap_item representing
> the vma->anon_vma (the cross-anon_vma merge case), no?

Hah, thank you for asking that question, sir.
(People will think that I planted you in the Cc just to ask it.)

I did write a longer paragraph there, but it got to rambling off the
point, with some vagueness: so I wasn't sure that I wanted it all in
the commit message, and deleted the trailing sentences.

For most of the time that I had this patch waiting in my tree, before
coming to write the description, I believed exactly as you do above.
But felt that I couldn't submit the patch, without understanding how
I had never actually come across such a serious problem in testing.
I only got around to looking into it a few days ago.

At first I suspected a weakness in the CONFIG_DEBUG_VM checking in
__page_check_anon_rmap() (no, it looks good), or the early PageAnon
return in __page_set_anon_rmap() (shouldn't that do some anon_vma
checking first? maybe, but I didn't think it through before moving
on, and don't want to be too quick to add spurious warnings).

But then I remembered how this actually was working,
the key lines are a few levels up in unuse_vma():
	if (page_anon_vma(page)) {
		addr = page_address_in_vma(page, vma);

Those lines started out, in 2004, as an optimization to swapoff:
that once brute force (searching every swapped mm) has found one pte
for the page (page->mapping and page->index then set to the right
place in the right anon_vma), we can locate any remaining references
to the page much more quickly, but just looking at the right offset
of other vmas sharing the same anon_vma.

When KSM swapping came along, it turned out that the only change needed
there was not to assume that page->mapping is set to an anon_vma pointer:
page_anon_vma(page) returns NULL on a KSM page.  What happens (on a
page freshly read in from swap) is that once one pte has been found,
page->mapping is pointed to an anon_vma, and the quicker lookup only
finds ptes which do belong to that anon_vma.  Then try_to_unuse() gives
up on that swap entry for the moment, deletes the page from swap cache,
moves on to the next used swap entry, comes back to the unresolved
swap entry next cycle, reads it from swap into a new page, and locates
the next anon_vma for it.  Or, if the KSM page was still lingering in
swapcache when swapoff first reaches it (with page->mapping pointing
to a stable_node), it would be correctly fixed up to all the various
ptes found by the brute force search.
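
If it helps to see that page_anon_vma() point concretely, here is a toy
standalone model: the bit values mirror the kernel's PAGE_MAPPING_ANON and
PAGE_MAPPING_KSM tags, but the structs and model_page_anon_vma() are stubs
of my own, not kernel code.

	#include <stdio.h>

	#define PAGE_MAPPING_ANON	1
	#define PAGE_MAPPING_KSM	2
	#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

	struct anon_vma { int stub; };
	struct stable_node { int stub; };
	struct page { void *mapping; };

	static struct anon_vma *model_page_anon_vma(struct page *page)
	{
		unsigned long mapping = (unsigned long)page->mapping;

		if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
			return NULL;	/* file page or ksm page: no anon_vma */
		return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS);
	}

	int main(void)
	{
		struct anon_vma av;
		struct stable_node sn;
		struct page anon_page = { .mapping =
			(void *)((unsigned long)&av | PAGE_MAPPING_ANON) };
		struct page ksm_page = { .mapping =
			(void *)((unsigned long)&sn |
				 PAGE_MAPPING_ANON | PAGE_MAPPING_KSM) };

		/* prints the anon_vma address, then (nil) for the ksm page */
		printf("%p %p\n", (void *)model_page_anon_vma(&anon_page),
				  (void *)model_page_anon_vma(&ksm_page));
		return 0;
	}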

Re-reading the page from swap is obviously not optimal, when it would
be much faster to memcpy.  But swapoff has never been remotely optimal
anyway; and the complete lack of code for the KSM case was delightful -
the only problem being that there was no code to place a comment in!

I did have a nasty moment (well, it lasted a bit longer than that)
in replying to you yesterday: but what about when the anon_vma_chains
came in?  Has swapoff been doing lots of unnecessary re-reading from
swap to match ptes belonging to different anon_vmas in the chain?  But
actually it's okay, Andrea fixed up page_address_in_vma() in 2.6.36,
and I added a comment on KSM there (I see my 4829b906 also speaks of
a bug with respect to "half-KSM" pages, that I promise to fix in the
next release: I have yet to reconstruct the full story of what that was
about, but for now I'm assuming that I failed to keep my promise there).

Anyway, long story short (oh, people are supposed to say that at the
beginning not the end, aren't they?), the KSM swapoff code was already
working almost right before adding in the ksm_might_need_to_copy():
that's only needed to fix up the case of swapoff racing with unmerge
(a check that didn't even exist for swapin before the NUMA series).

> 
> > I intended to copy how it's done in do_swap_page(), but have a strong
> > aversion to how "swapcache" ends up being used there: rework it with
> > "page != swapcache".
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks,
Hugh


* Re: [PATCH 0/7] ksm: responses to NUMA review
  2013-02-22  3:44 ` [PATCH 0/7] ksm: responses to NUMA review Ric Mason
@ 2013-02-22 20:38   ` Hugh Dickins
  2013-02-24  1:39     ` Ric Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-02-22 20:38 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Fri, 22 Feb 2013, Ric Mason wrote:
> On 02/21/2013 04:17 PM, Hugh Dickins wrote:
> > Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
> > response to Mel's review feedback, partly fixes to issues that I found
> > myself in doing more review and testing.  None of the issues fixed are
> > truly show-stoppers, though I would prefer them fixed sooner than later.
> 
> Do you have any ideas about ksm supporting page cache and tmpfs?

No.  It's only been asked as a hypothetical question: I don't know of
anyone actually needing it, and I wouldn't have time to do it myself.

It would be significantly more invasive than just dealing with anonymous
memory: with anon, we already have the infrastructure for read-only pages,
but we don't at present have any notion of read-only pagecache.

Just doing it in tmpfs?  Well, yes, that might be easier: since v3.1's
radix_tree rework, shmem/tmpfs mostly goes through its own interfaces
to pagecache, so read-only pagecache, and hence KSM, might be easier
to implement there than more generally.

Hugh


* Re: [PATCH 1/7] ksm: add some comments
  2013-02-22  4:26   ` Ric Mason
@ 2013-02-22 20:50     ` Hugh Dickins
  0 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-02-22 20:50 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Fri, 22 Feb 2013, Ric Mason wrote:
> 
> What's the root reason the merge_across_nodes setting can be changed only
> when there are no ksm shared pages in the system?

Simplicity.  Why add code (moving nodes from tree to tree, handling
the collisions) for a rare case that doesn't need to be fast?

> Can they be unmerged and merged again during ksmd scan?

That's more or less what happens, isn't it?  Perhaps you're
asking why the admin has to echo 2 >run; echo 0 >merge; echo 1 >run
instead of that all happening automatically inside the echo 0 > merge?

If I'd implemented it myself, I might have chosen to do it that way;
but neither I nor other reviewers felt strongly enough to change that,
though we could do so.
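
(If it ever helps to script that sequence, it amounts to no more than the
sketch below: ksm_write() is just my own helper, and I assume the usual
sysfs location of the knobs under /sys/kernel/mm/ksm.)

	#include <stdio.h>
	#include <stdlib.h>

	static void ksm_write(const char *knob, const char *val)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", knob);
		f = fopen(path, "w");
		if (!f || fputs(val, f) < 0 || fclose(f) != 0) {
			perror(path);
			exit(1);
		}
	}

	int main(void)
	{
		ksm_write("run", "2");			/* unmerge all ksm pages */
		ksm_write("merge_across_nodes", "0");	/* safe now none are shared */
		ksm_write("run", "1");			/* remerge under new setting */
		return 0;
	}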

Hugh


* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-02-22  7:13   ` Ric Mason
@ 2013-02-22 21:03     ` Hugh Dickins
  2013-03-01  5:29       ` Ric Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-02-22 21:03 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Fri, 22 Feb 2013, Ric Mason wrote:
> On 02/21/2013 04:20 PM, Hugh Dickins wrote:
> > An inconsistency emerged in reviewing the NUMA node changes to KSM:
> > when meeting a page from the wrong NUMA node in a stable tree, we say
> > that it's okay for comparisons, but not as a leaf for merging; whereas
> > when meeting a page from the wrong NUMA node in an unstable tree, we
> > bail out immediately.
> 
> IIUC
> - ksm page from the wrong NUMA node will be added to current node's stable tree

That should never happen (and when I was checking with a WARN_ON it did
not happen).  What can happen is that a node already in a stable tree
has its page migrated away to another NUMA node.

> - normal page from the wrong NUMA node will be merged to current node's
> stable tree  <- what am I missing here? I didn't see any special handling in
> function stable_tree_search for this case.

	nid = get_kpfn_nid(page_to_pfn(page));
	root = root_stable_tree + nid;

to choose the right tree for the page, and

				if (get_kpfn_nid(stable_node->kpfn) !=
						NUMA(stable_node->nid)) {
					put_page(tree_page);
					goto replace;
				}

to make sure that we don't latch on to a node whose page got migrated away.

> - normal page from the wrong NUMA node will be compared but not as a leaf for
> merging after the patch

I don't understand you there, but hope my remarks above resolve it.

Hugh


* Re: [PATCH 0/7] ksm: responses to NUMA review
  2013-02-22 20:38   ` Hugh Dickins
@ 2013-02-24  1:39     ` Ric Mason
  0 siblings, 0 replies; 28+ messages in thread
From: Ric Mason @ 2013-02-24  1:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On 02/23/2013 04:38 AM, Hugh Dickins wrote:
> On Fri, 22 Feb 2013, Ric Mason wrote:
>> On 02/21/2013 04:17 PM, Hugh Dickins wrote:
>>> Here's a second KSM series, based on mmotm 2013-02-19-17-20: partly in
>>> response to Mel's review feedback, partly fixes to issues that I found
>>> myself in doing more review and testing.  None of the issues fixed are
>>> truly show-stoppers, though I would prefer them fixed sooner than later.
>> Do you have any ideas about ksm supporting page cache and tmpfs?
> No.  It's only been asked as a hypothetical question: I don't know of
> anyone actually needing it, and I wouldn't have time to do it myself.
>
> It would be significantly more invasive than just dealing with anonymous
> memory: with anon, we already have the infrastructure for read-only pages,
> but we don't at present have any notion of read-only pagecache.
>
> Just doing it in tmpfs?  Well, yes, that might be easier: since v3.1's
> radix_tree rework, shmem/tmpfs mostly goes through its own interfaces
> to pagecache, so read-only pagecache, and hence KSM, might be easier
> to implement there than more generally.

Ok, are there potential users who would take advantage of it?

>
> Hugh



* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-02-22 21:03     ` Hugh Dickins
@ 2013-03-01  5:29       ` Ric Mason
  2013-03-01 20:03         ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-03-01  5:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm


Hi Hugh,
On 02/23/2013 05:03 AM, Hugh Dickins wrote:
> On Fri, 22 Feb 2013, Ric Mason wrote:
>> On 02/21/2013 04:20 PM, Hugh Dickins wrote:
>>> An inconsistency emerged in reviewing the NUMA node changes to KSM:
>>> when meeting a page from the wrong NUMA node in a stable tree, we say
>>> that it's okay for comparisons, but not as a leaf for merging; whereas
>>> when meeting a page from the wrong NUMA node in an unstable tree, we
>>> bail out immediately.
>> IIUC
>> - ksm page from the wrong NUMA node will be added to current node's stable tree

Please forgive my late response.

> That should never happen (and when I was checking with a WARN_ON it did
> not happen).  What can happen is that a node already in a stable tree
> has its page migrated away to another NUMA node.
>
>> - normal page from the wrong NUMA node will be merged to current node's
>> stable tree  <- what am I missing here? I didn't see any special handling in
>> function stable_tree_search for this case.
> 	nid = get_kpfn_nid(page_to_pfn(page));
> 	root = root_stable_tree + nid;
>
> to choose the right tree for the page, and
>
> 				if (get_kpfn_nid(stable_node->kpfn) !=
> 						NUMA(stable_node->nid)) {
> 					put_page(tree_page);
> 					goto replace;
> 				}
>
> to make sure that we don't latch on to a node whose page got migrated away.

I think the ksm implementation for numa awareness is buggy.

For page migration stuff, new page is allocated from node *which page
is migrated to*.
- when meeting a page from the wrong NUMA node in an unstable tree
     get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
     How can you say it's okay for comparisons, but not as a leaf for merging?
- when meeting a page from the wrong NUMA node in a stable tree
    - meeting a normal page
    - meeting a page which is ksm page before migration
      get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't 
capture them since stable_node is for tree page in current stable tree. 
They are always equal.
>
>> - normal page from the wrong NUMA node will compare but not as a leaf for
>> merging after the patch
> I don't understand you there, but hope my remarks above resolve it.
>
> Hugh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-01  5:29       ` Ric Mason
@ 2013-03-01 20:03         ` Hugh Dickins
  2013-03-02  1:10           ` Ric Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2013-03-01 20:03 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Fri, 1 Mar 2013, Ric Mason wrote:
> 
> I think the ksm implementation for numa awareness is buggy.

Sorry, I just don't understand your comments below,
but will try to answer or question them as best I can.

> 
> For page migration stuff, new page is allocated from node *which page is
> migrated to*.

Yes, by definition.

> - when meeting a page from the wrong NUMA node in an unstable tree
>     get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)

I thought you were writing of the wrong NUMA node case,
but now you emphasize "*==*", which means the right NUMA node.

>     How can say it's okay for comparisons, but not as a leaf for merging?

Pages in the unstable tree are unstable (and it's not even accurate to
say "pages in the unstable tree"), they and their content can change at
any moment, so I cannot assert anything of them for sure.

But if we suppose, as an approximation, that they are somewhat likely
to remain stable (and the unstable tree would be useless without that
assumption: it tends to work out), but subject to migration, then it makes
sense to compare content, no matter what NUMA node it is on, in order to
locate a page of the same content; but wrong to merge with that page if
it's on the wrong NUMA node, if !merge_across_nodes tells us not to.
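
To make that concrete, the check in the unstable tree walk has roughly
this shape (a simplified sketch of unstable_tree_search_insert(), not
the exact mm/ksm.c code):

	ret = memcmp_pages(page, tree_page);
	parent = *new;
	if (ret < 0) {
		put_page(tree_page);
		new = &parent->rb_left;		/* content differs: keep walking */
	} else if (ret > 0) {
		put_page(tree_page);
		new = &parent->rb_right;
	} else if (!ksm_merge_across_nodes &&
		   page_to_nid(tree_page) != nid) {
		/*
		 * Equal content, but on the wrong NUMA node: it was fine
		 * to compare against it on the way down, but refuse to
		 * use it as the merge target.
		 */
		put_page(tree_page);
		return NULL;
	} else {
		*tree_pagep = tree_page;	/* right node: merge candidate */
		return tree_rmap_item;
	}

With merge_across_nodes set, that extra test drops out and the
equal-content case is always a merge candidate.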


> - when meeting a page from the wrong NUMA node in an stable tree
>    - meeting a normal page

What does that line mean, and where does it fit in your argument?

>    - meeting a page which is ksm page before migration
>      get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture
> them since stable_node is for tree page in current stable tree. They are
> always equal.

When we meet a ksm page in the stable tree before it's migrated to another
NUMA node, yes, it will be on the right NUMA node (because we were careful
only to merge pages from the right NUMA node there), and that test will not
capture them.  It's for capturing a ksm page in the stable tree after it has
been migrated to another NUMA node.

Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-01 20:03         ` Hugh Dickins
@ 2013-03-02  1:10           ` Ric Mason
  2013-03-02  2:57             ` Hugh Dickins
  0 siblings, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-03-02  1:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm


Hi Hugh,
On 03/02/2013 04:03 AM, Hugh Dickins wrote:
> On Fri, 1 Mar 2013, Ric Mason wrote:
>> I think the ksm implementation for numa awareness is buggy.
> Sorry, I just don't understand your comments below,
> but will try to answer or question them as best I can.
>
>> For page migration stuff, new page is allocated from node *which page is
>> migrated to*.
> Yes, by definition.
>
>> - when meeting a page from the wrong NUMA node in an unstable tree
>>      get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
> I thought you were writing of the wrong NUMA node case,
> but now you emphasize "*==*", which means the right NUMA node.

Yes, I mean the wrong NUMA node.  During page migration, the new page has
already been allocated in the new node and the old page may be freed.  So
tree_page is the page in the new node's unstable tree, and page is also a
new node page, so get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).

>
>>      How can say it's okay for comparisons, but not as a leaf for merging?
> Pages in the unstable tree are unstable (and it's not even accurate to
> say "pages in the unstable tree"), they and their content can change at
> any moment, so I cannot assert anything of them for sure.
>
> But if we suppose, as an approximation, that they are somewhat likely
> to remain stable (and the unstable tree would be useless without that
> assumption: it tends to work out), but subject to migration, then it makes
> sense to compare content, no matter what NUMA node it is on, in order to
> locate a page of the same content; but wrong to merge with that page if
> it's on the wrong NUMA node, if !merge_across_nodes tells us not to.
>
>
>> - when meeting a page from the wrong NUMA node in an stable tree
>>     - meeting a normal page
> What does that line mean, and where does it fit in your argument?

I distinguish three kinds of pages:
- a ksm page which is already in the stable tree in the old node
- a page in the unstable tree in the old node
- a page not in any tree in the old node

So by "normal page" here I mean a page not in any tree in the old node.

>
>>     - meeting a page which is ksm page before migration
>>       get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't capture
>> them since stable_node is for tree page in current stable tree. They are
>> always equal.
> When we meet a ksm page in the stable tree before it's migrated to another
> NUMA node, yes, it will be on the right NUMA node (because we were careful
> only to merge pages from the right NUMA node there), and that test will not
> capture them.  It's for capturing a ksm page in the stable tree after it has
> been migrated to another NUMA node.

Why is the ksm page migrated to another NUMA node still not freed? Who
takes a page count on it? If it's not freed, then since the new page
allocated in the new node is a copy of the current ksm page, the current
ksm page doesn't change, so
get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).

>
> Hugh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-02  1:10           ` Ric Mason
@ 2013-03-02  2:57             ` Hugh Dickins
  2013-03-06  1:28               ` Will Huck
  2013-03-06  2:37               ` Ric Mason
  0 siblings, 2 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-03-02  2:57 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Sat, 2 Mar 2013, Ric Mason wrote:
> On 03/02/2013 04:03 AM, Hugh Dickins wrote:
> > On Fri, 1 Mar 2013, Ric Mason wrote:
> > > I think the ksm implementation for numa awareness is buggy.
> > Sorry, I just don't understand your comments below,
> > but will try to answer or question them as best I can.
> > 
> > > For page migration stuff, new page is allocated from node *which page is
> > > migrated to*.
> > Yes, by definition.
> > 
> > > - when meeting a page from the wrong NUMA node in an unstable tree
> > >      get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
> > I thought you were writing of the wrong NUMA node case,
> > but now you emphasize "*==*", which means the right NUMA node.
> 
> Yes, I mean the wrong NUMA node. During page migration, new page has already
> been allocated in new node and old page maybe freed.  So tree_page is the
> page in new node's unstable tree, page is also new node page, so
> get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).

I don't understand; but here you seem to be describing a case where two
pages from the same NUMA node get merged (after both have been migrated
from another NUMA node?), and there's nothing wrong with that,
so I won't worry about it further.

> > >     - meeting a page which is ksm page before migration
> > >       get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't
> > > capture
> > > them since stable_node is for tree page in current stable tree. They are
> > > always equal.
> > When we meet a ksm page in the stable tree before it's migrated to another
> > NUMA node, yes, it will be on the right NUMA node (because we were careful
> > only to merge pages from the right NUMA node there), and that test will not
> > capture them.  It's for capturing a ksm page in the stable tree after it has
> > been migrated to another NUMA node.
> 
> ksm page migrated to another NUMA node still not freed, why? Who take page
> count of it?

The old page, the one which used to be a ksm page on the old NUMA node,
should be freed very soon: since it was isolated from lru, and its page
count checked, I cannot think of anything to hold a reference to it,
apart from migration itself - so it just needs to reach putback_lru_page(),
and then may rest awhile on __lru_cache_add()'s pagevec before being freed.

But I don't see where I said the old page was still not freed.

> If not  freed, since new page is allocated in new node, it is
> the copy of current ksm page, so current ksm doesn't change,
> get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).

But ksm_migrate_page() did
		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
		stable_node->kpfn = page_to_pfn(newpage);
without changing stable_node->nid.
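
Condensed, that hook looks something like this (a simplified sketch of
ksm_migrate_page(), with the memory-barrier details left out):

	void ksm_migrate_page(struct page *newpage, struct page *oldpage)
	{
		struct stable_node *stable_node;

		stable_node = page_stable_node(newpage);
		if (stable_node) {
			VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
			stable_node->kpfn = page_to_pfn(newpage);
			/*
			 * stable_node->nid is left alone: migration cannot
			 * take the locks needed to move the stable_node to
			 * the new node's tree, so kpfn and nid disagree
			 * until ksmd notices and sorts it out.
			 */
		}
	}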

Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-02  2:57             ` Hugh Dickins
@ 2013-03-06  1:28               ` Will Huck
  2013-03-06  4:31                 ` Hugh Dickins
  2013-03-06  2:37               ` Ric Mason
  1 sibling, 1 reply; 28+ messages in thread
From: Will Huck @ 2013-03-06  1:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ric Mason, Andrew Morton, Mel Gorman, Petr Holasek,
	Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

Hi Hugh,
On 03/02/2013 10:57 AM, Hugh Dickins wrote:

How does ksm treat a ksm forked page? IIUC, it's not merged into the ksm
stable tree. Will it just be ignored?

> On Sat, 2 Mar 2013, Ric Mason wrote:
>> On 03/02/2013 04:03 AM, Hugh Dickins wrote:
>>> On Fri, 1 Mar 2013, Ric Mason wrote:
>>>> I think the ksm implementation for numa awareness is buggy.
>>> Sorry, I just don't understand your comments below,
>>> but will try to answer or question them as best I can.
>>>
>>>> For page migration stuff, new page is allocated from node *which page is
>>>> migrated to*.
>>> Yes, by definition.
>>>
>>>> - when meeting a page from the wrong NUMA node in an unstable tree
>>>>       get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
>>> I thought you were writing of the wrong NUMA node case,
>>> but now you emphasize "*==*", which means the right NUMA node.
>> Yes, I mean the wrong NUMA node. During page migration, new page has already
>> been allocated in new node and old page maybe freed.  So tree_page is the
>> page in new node's unstable tree, page is also new node page, so
>> get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).
> I don't understand; but here you seem to be describing a case where two
> pages from the same NUMA node get merged (after both have been migrated
> from another NUMA node?), and there's nothing wrong with that,
> so I won't worry about it further.
>
>>>>      - meeting a page which is ksm page before migration
>>>>        get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't
>>>> capture
>>>> them since stable_node is for tree page in current stable tree. They are
>>>> always equal.
>>> When we meet a ksm page in the stable tree before it's migrated to another
>>> NUMA node, yes, it will be on the right NUMA node (because we were careful
>>> only to merge pages from the right NUMA node there), and that test will not
>>> capture them.  It's for capturing a ksm page in the stable tree after it has
>>> been migrated to another NUMA node.
>> ksm page migrated to another NUMA node still not freed, why? Who take page
>> count of it?
> The old page, the one which used to be a ksm page on the old NUMA node,
> should be freed very soon: since it was isolated from lru, and its page
> count checked, I cannot think of anything to hold a reference to it,
> apart from migration itself - so it just needs to reach putback_lru_page(),
> and then may rest awhile on __lru_cache_add()'s pagevec before being freed.
>
> But I don't see where I said the old page was still not freed.
>
>> If not  freed, since new page is allocated in new node, it is
>> the copy of current ksm page, so current ksm doesn't change,
>> get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).
> But ksm_migrate_page() did
> 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> 		stable_node->kpfn = page_to_pfn(newpage);
> without changing stable_node->nid.
>
> Hugh
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-02  2:57             ` Hugh Dickins
  2013-03-06  1:28               ` Will Huck
@ 2013-03-06  2:37               ` Ric Mason
  2013-03-06  5:05                 ` Hugh Dickins
  1 sibling, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-03-06  2:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On 03/02/2013 10:57 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Ric Mason wrote:
>> On 03/02/2013 04:03 AM, Hugh Dickins wrote:
>>> On Fri, 1 Mar 2013, Ric Mason wrote:
>>>> I think the ksm implementation for numa awareness is buggy.
>>> Sorry, I just don't understand your comments below,
>>> but will try to answer or question them as best I can.
>>>
>>>> For page migration stuff, new page is allocated from node *which page is
>>>> migrated to*.
>>> Yes, by definition.
>>>
>>>> - when meeting a page from the wrong NUMA node in an unstable tree
>>>>       get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page)
>>> I thought you were writing of the wrong NUMA node case,
>>> but now you emphasize "*==*", which means the right NUMA node.
>> Yes, I mean the wrong NUMA node. During page migration, new page has already
>> been allocated in new node and old page maybe freed.  So tree_page is the
>> page in new node's unstable tree, page is also new node page, so
>> get_kpfn_nid(page_to_pfn(page)) *==* page_to_nid(tree_page).
> I don't understand; but here you seem to be describing a case where two
> pages from the same NUMA node get merged (after both have been migrated
> from another NUMA node?), and there's nothing wrong with that,
> so I won't worry about it further.

Take the case of a ksm page which is migrated to a different NUMA node, and
its stable node is then migrated to the right tree, where it collides with
an existing stable node.  get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)
can capture nothing, since stable_node is the node in the right stable
tree: nothing has happened to it before this check.  Did you intend to check
get_kpfn_nid(page_node->kpfn) != NUMA(page_node->nid)?

>
>>>>      - meeting a page which is ksm page before migration
>>>>        get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can't
>>>> capture
>>>> them since stable_node is for tree page in current stable tree. They are
>>>> always equal.
>>> When we meet a ksm page in the stable tree before it's migrated to another
>>> NUMA node, yes, it will be on the right NUMA node (because we were careful
>>> only to merge pages from the right NUMA node there), and that test will not
>>> capture them.  It's for capturing a ksm page in the stable tree after it has
>>> been migrated to another NUMA node.
>> ksm page migrated to another NUMA node still not freed, why? Who take page
>> count of it?
> The old page, the one which used to be a ksm page on the old NUMA node,
> should be freed very soon: since it was isolated from lru, and its page
> count checked, I cannot think of anything to hold a reference to it,
> apart from migration itself - so it just needs to reach putback_lru_page(),
> and then may rest awhile on __lru_cache_add()'s pagevec before being freed.
>
> But I don't see where I said the old page was still not freed.
>
>> If not  freed, since new page is allocated in new node, it is
>> the copy of current ksm page, so current ksm doesn't change,
>> get_kpfn_nid(stable_node->kpfn) *==* NUMA(stable_node->nid).
> But ksm_migrate_page() did
> 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> 		stable_node->kpfn = page_to_pfn(newpage);
> without changing stable_node->nid.
>
> Hugh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-06  1:28               ` Will Huck
@ 2013-03-06  4:31                 ` Hugh Dickins
  0 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-03-06  4:31 UTC (permalink / raw)
  To: Will Huck
  Cc: Ric Mason, Andrew Morton, Mel Gorman, Petr Holasek,
	Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

On Wed, 6 Mar 2013, Will Huck wrote:
> 
> How ksm treat a ksm forked page? IIUC, it's not merged in ksm stable tree. It
> will just be ignore?

No, it's there in the stable tree, as it was before it got forked.  And
when ksmd comes around to find the new mm, it will allocate an rmap_item
for that page in the new mm, and append it to the existing stable_node.
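
The relevant path in cmp_and_merge_page() is roughly like this (a
simplified sketch, with error handling and the unstable-tree fallthrough
left out):

	stable_node = page_stable_node(page);
	kpage = stable_tree_search(page);
	if (kpage == page && rmap_item->head == stable_node) {
		/* already appended on an earlier pass: nothing to do */
		put_page(kpage);
		return;
	}
	if (kpage) {
		err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
		if (!err) {
			/* hang the new mm's rmap_item off the existing node */
			lock_page(kpage);
			stable_tree_append(rmap_item, page_stable_node(kpage));
			unlock_page(kpage);
		}
		put_page(kpage);
		return;
	}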

Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-06  2:37               ` Ric Mason
@ 2013-03-06  5:05                 ` Hugh Dickins
  2013-03-06  6:58                   ` Ric Mason
  2013-03-06 10:18                   ` Ric Mason
  0 siblings, 2 replies; 28+ messages in thread
From: Hugh Dickins @ 2013-03-06  5:05 UTC (permalink / raw)
  To: Ric Mason
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

On Wed, 6 Mar 2013, Ric Mason wrote:
[ I've deleted the context because that was about the unstable tree,
  and here you have moved to asking about a case in the stable tree. ]
> 
> For the case of a ksm page is migrated to a different NUMA node and migrate
> its stable node to  the right tree and collide with an existing stable node.
> get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing

That's not so: as I've pointed out before, ksm_migrate_page() updates
stable_node->kpfn for the new page on the new NUMA node; but it cannot
(get the right locking to) move the stable_node to its new tree at that time.

It's moved out once ksmd notices that it's in the wrong NUMA node tree -
perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(),
or perhaps here in stable_tree_search() when it matches another page
coming in to cmp_and_merge_page().

You may be concentrating on the case when that "another page" is a ksm
page migrated from a different NUMA node; and overlooking the case of
when the matching ksm page in this stable tree has itself been migrated.

> since stable_node is the node in the right stable tree, nothing happen to it
> before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) !=
> NUMA(page_node->nid) ?

Certainly not: page_node is usually NULL.  But I could have checked
get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
from cmp_and_merge_page(), but here we do have local variable nid.
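
Put together, the lookup side handles the migrated-away case roughly
like this (a condensed sketch of stable_tree_search(), not the exact
code):

	/* equal content found at this stable_node in the tree */
	if (get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
		/*
		 * The ksm page behind this tree node has itself been
		 * migrated to another NUMA node since it was linked in:
		 * take the stale stable_node out of this tree.
		 */
		put_page(tree_page);
		goto replace;
	}
	return tree_page;
	...
replace:
	if (page_node) {
		/* an incoming migrated ksm page takes over this slot */
		list_del(&page_node->list);
		DO_NUMA(page_node->nid = nid);
		rb_replace_node(&stable_node->node, &page_node->node, root);
		get_page(page);
	} else {
		rb_erase(&stable_node->node, root);
		page = NULL;
	}
	/* the stale stable_node waits on migrate_nodes for ksmd to revisit */
	stable_node->head = &migrate_nodes;
	list_add(&stable_node->list, stable_node->head);
	return page;

Note that the list_del() and DO_NUMA() there are only reached when
page_node is not NULL.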

Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-06  5:05                 ` Hugh Dickins
@ 2013-03-06  6:58                   ` Ric Mason
  2013-03-06 10:18                   ` Ric Mason
  1 sibling, 0 replies; 28+ messages in thread
From: Ric Mason @ 2013-03-06  6:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:
> On Wed, 6 Mar 2013, Ric Mason wrote:
> [ I've deleted the context because that was about the unstable tree,
>    and here you have moved to asking about a case in the stable tree. ]
>> For the case of a ksm page is migrated to a different NUMA node and migrate
>> its stable node to  the right tree and collide with an existing stable node.
>> get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing
> That's not so: as I've pointed out before, ksm_migrate_page() updates
> stable_node->kpfn for the new page on the new NUMA node; but it cannot
> (get the right locking to) move the stable_node to its new tree at that time.
>
> It's moved out once ksmd notices that it's in the wrong NUMA node tree -
> perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(),
> or perhaps here in stable_tree_search() when it matches another page
> coming in to cmp_and_merge_page().
>
> You may be concentrating on the case when that "another page" is a ksm
> page migrated from a different NUMA node; and overlooking the case of
> when the matching ksm page in this stable tree has itself been migrated.
>
>> since stable_node is the node in the right stable tree, nothing happen to it
>> before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) !=
>> NUMA(page_node->nid) ?
> Certainly not: page_node is usually NULL.  But I could have checked

Are you sure? If so, list_del(&page_node->list) and DO_NUMA(page_node->nid =
nid) would trigger a panic now.
> get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
> from cmp_and_merge_page(), but here we do have local variable nid.
>
> Hugh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-06  5:05                 ` Hugh Dickins
  2013-03-06  6:58                   ` Ric Mason
@ 2013-03-06 10:18                   ` Ric Mason
  2013-03-07 23:26                     ` Ric Mason
  1 sibling, 1 reply; 28+ messages in thread
From: Ric Mason @ 2013-03-06 10:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

Hi Hugh,
On 03/06/2013 01:05 PM, Hugh Dickins wrote:
> On Wed, 6 Mar 2013, Ric Mason wrote:
> [ I've deleted the context because that was about the unstable tree,
>    and here you have moved to asking about a case in the stable tree. ]

I think I can basically understand you; please correct me if something
is wrong.

For a ksm page:
If a ksm page (in the old node) is migrated to another (new) node (the
ksm page is treated as the old page, and a new page is allocated in the
other node now), then since we can't get the right lock at that time,
we can't move the stable node to its new tree at that time: the stable
node is still in the old node's tree and stable_node->nid still stores
the old node value.  When ksmd scans and compares another page in the
old node and searches that stable tree, it will figure out that the ksm
page relevant to the stable node has been migrated to the new node, and
the stable node will be erased from the old node's stable tree and
linked onto the migrate_nodes list.  What's the life of the new page in
the new node?  The new page will be scanned by ksmd, which will search
the stable tree in the new node: if it doesn't find a matching stable
node, the page's stable node is deleted from the migrate_nodes list and
added to the new node's stable tree as a leaf; if it does find a stable
node in the stable tree, they will be merged.  But in a special case,
the ksm page relevant to that stable node can itself have been
migrated, and the new stable node will replace the stable node whose
relevant page was migrated this time.
For an unstable tree page:
If the search in the unstable tree finds that the tree page with equal
content has been migrated, we just stop the search and return, with
nothing merged.  The new page in the new node for this migrated
unstable tree page will be inserted into the unstable tree in the new
node.

>> For the case of a ksm page is migrated to a different NUMA node and migrate
>> its stable node to  the right tree and collide with an existing stable node.
>> get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can capture nothing
> That's not so: as I've pointed out before, ksm_migrate_page() updates
> stable_node->kpfn for the new page on the new NUMA node; but it cannot
> (get the right locking to) move the stable_node to its new tree at that time.
>
> It's moved out once ksmd notices that it's in the wrong NUMA node tree -
> perhaps when one of its rmap_items reaches the head of cmp_and_merge_page(),
> or perhaps here in stable_tree_search() when it matches another page
> coming in to cmp_and_merge_page().
>
> You may be concentrating on the case when that "another page" is a ksm
> page migrated from a different NUMA node; and overlooking the case of
> when the matching ksm page in this stable tree has itself been migrated.
>
>> since stable_node is the node in the right stable tree, nothing happen to it
>> before this check. Did you intend to check get_kpfn_nid(page_node->kpfn) !=
>> NUMA(page_node->nid) ?
> Certainly not: page_node is usually NULL.  But I could have checked
> get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
> from cmp_and_merge_page(), but here we do have local variable nid.
>
> Hugh


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] ksm: treat unstable nid like in stable tree
  2013-03-06 10:18                   ` Ric Mason
@ 2013-03-07 23:26                     ` Ric Mason
  0 siblings, 0 replies; 28+ messages in thread
From: Ric Mason @ 2013-03-07 23:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

Ping Hugh, :-)
On 03/06/2013 06:18 PM, Ric Mason wrote:
> Hi Hugh,
> On 03/06/2013 01:05 PM, Hugh Dickins wrote:
>> On Wed, 6 Mar 2013, Ric Mason wrote:
>> [ I've deleted the context because that was about the unstable tree,
>>    and here you have moved to asking about a case in the stable tree. ]
>
> I think I can basically understand you; please correct me if something
> is wrong.
>
> For a ksm page:
> If a ksm page (in the old node) is migrated to another (new) node (the
> ksm page is treated as the old page, and a new page is allocated in the
> other node now), then since we can't get the right lock at that time,
> we can't move the stable node to its new tree at that time: the stable
> node is still in the old node's tree and stable_node->nid still stores
> the old node value.  When ksmd scans and compares another page in the
> old node and searches that stable tree, it will figure out that the ksm
> page relevant to the stable node has been migrated to the new node, and
> the stable node will be erased from the old node's stable tree and
> linked onto the migrate_nodes list.  What's the life of the new page in
> the new node?  The new page will be scanned by ksmd, which will search
> the stable tree in the new node: if it doesn't find a matching stable
> node, the page's stable node is deleted from the migrate_nodes list and
> added to the new node's stable tree as a leaf; if it does find a stable
> node in the stable tree, they will be merged.  But in a special case,
> the ksm page relevant to that stable node can itself have been
> migrated, and the new stable node will replace the stable node whose
> relevant page was migrated this time.
> For an unstable tree page:
> If the search in the unstable tree finds that the tree page with equal
> content has been migrated, we just stop the search and return, with
> nothing merged.  The new page in the new node for this migrated
> unstable tree page will be inserted into the unstable tree in the new
> node.
>
>>> For the case of a ksm page is migrated to a different NUMA node and 
>>> migrate
>>> its stable node to  the right tree and collide with an existing 
>>> stable node.
>>> get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid) can 
>>> capture nothing
>> That's not so: as I've pointed out before, ksm_migrate_page() updates
>> stable_node->kpfn for the new page on the new NUMA node; but it cannot
>> (get the right locking to) move the stable_node to its new tree at 
>> that time.
>>
>> It's moved out once ksmd notices that it's in the wrong NUMA node tree -
>> perhaps when one of its rmap_items reaches the head of 
>> cmp_and_merge_page(),
>> or perhaps here in stable_tree_search() when it matches another page
>> coming in to cmp_and_merge_page().
>>
>> You may be concentrating on the case when that "another page" is a ksm
>> page migrated from a different NUMA node; and overlooking the case of
>> when the matching ksm page in this stable tree has itself been migrated.
>>
>>> since stable_node is the node in the right stable tree, nothing 
>>> happen to it
>>> before this check. Did you intend to check 
>>> get_kpfn_nid(page_node->kpfn) !=
>>> NUMA(page_node->nid) ?
>> Certainly not: page_node is usually NULL.  But I could have checked
>> get_kpfn_nid(stable_node->kpfn) != nid: I was duplicating the test
>> from cmp_and_merge_page(), but here we do have local variable nid.
>>
>> Hugh
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2013-03-07 23:26 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-21  8:17 [PATCH 0/7] ksm: responses to NUMA review Hugh Dickins
2013-02-21  8:19 ` [PATCH 1/7] ksm: add some comments Hugh Dickins
2013-02-22  4:26   ` Ric Mason
2013-02-22 20:50     ` Hugh Dickins
2013-02-21  8:20 ` [PATCH 2/7] ksm: treat unstable nid like in stable tree Hugh Dickins
2013-02-22  7:13   ` Ric Mason
2013-02-22 21:03     ` Hugh Dickins
2013-03-01  5:29       ` Ric Mason
2013-03-01 20:03         ` Hugh Dickins
2013-03-02  1:10           ` Ric Mason
2013-03-02  2:57             ` Hugh Dickins
2013-03-06  1:28               ` Will Huck
2013-03-06  4:31                 ` Hugh Dickins
2013-03-06  2:37               ` Ric Mason
2013-03-06  5:05                 ` Hugh Dickins
2013-03-06  6:58                   ` Ric Mason
2013-03-06 10:18                   ` Ric Mason
2013-03-07 23:26                     ` Ric Mason
2013-02-21  8:22 ` [PATCH 3/7] ksm: shrink 32-bit rmap_item back to 32 bytes Hugh Dickins
2013-02-21  8:23 ` [PATCH 4/7] mm,ksm: FOLL_MIGRATION do migration_entry_wait Hugh Dickins
2013-02-21  8:25 ` [PATCH 5/7] mm,ksm: swapoff might need to copy Hugh Dickins
2013-02-21 14:53   ` Johannes Weiner
2013-02-22 17:16     ` Hugh Dickins
2013-02-21  8:27 ` [PATCH 6/7] mm: cleanup "swapcache" in do_swap_page Hugh Dickins
2013-02-21  8:29 ` [PATCH 7/7] ksm: allocate roots when needed Hugh Dickins
2013-02-22  3:44 ` [PATCH 0/7] ksm: responses to NUMA review Ric Mason
2013-02-22 20:38   ` Hugh Dickins
2013-02-24  1:39     ` Ric Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).