* [PATCH 0/11] ksm: NUMA trees and page migration
@ 2013-01-26  1:53 Hugh Dickins
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
                   ` (11 more replies)
  0 siblings, 12 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  1:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, Mel Gorman, linux-kernel, linux-mm

Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense
in preferring to allocate the new pages local to the caller's node.)

Petr, I have intentionally changed the titles of yours: partly because
your "sysfs knob" understated it, but mainly because I think gmail is
liable to assign 1/11 and 2/11 to your earlier December thread, making
them vanish from this series.  I hope a change of title prevents that.

 1 ksm: allow trees per NUMA node
 2 ksm: add sysfs ABI Documentation
 3 ksm: trivial tidyups
 4 ksm: reorganize ksm_check_stable_tree
 5 ksm: get_ksm_page locked
 6 ksm: remove old stable nodes more thoroughly
 7 ksm: make KSM page migration possible
 8 ksm: make !merge_across_nodes migration safe
 9 mm: enable KSM page migration
10 mm: remove offlining arg to migrate_pages
11 ksm: stop hotremove lockdep warning

 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   52 +
 Documentation/vm/ksm.txt                      |    7 
 include/linux/ksm.h                           |   18 
 include/linux/migrate.h                       |   14 
 mm/compaction.c                               |    2 
 mm/ksm.c                                      |  566 +++++++++++++---
 mm/memory-failure.c                           |    7 
 mm/memory.c                                   |   19 
 mm/memory_hotplug.c                           |    3 
 mm/mempolicy.c                                |   11 
 mm/migrate.c                                  |   61 -
 mm/page_alloc.c                               |    6 
 12 files changed, 580 insertions(+), 186 deletions(-)

Hugh


* [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
@ 2013-01-26  1:54 ` Hugh Dickins
  2013-01-27  1:14   ` Simon Jeons
                     ` (3 more replies)
  2013-01-26  1:56 ` [PATCH 2/11] ksm: add sysfs ABI Documentation Hugh Dickins
                   ` (10 subsequent siblings)
  11 siblings, 4 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  1:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, linux-kernel, linux-mm

From: Petr Holasek <pholasek@redhat.com>

Introduce a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls the merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (the default behavior).

A typical use case is a lot of KVM guests on a NUMA machine, where CPUs
on more distant nodes would see a significant increase in access latency
to the merged KSM page.  The sysfs knob was chosen for flexibility, since
some users still prefer a higher amount of saved physical memory
regardless of access latency.

Every NUMA node has its own stable and unstable trees, for faster
searching and inserting.  The merge_across_nodes value can be changed
only when there are no KSM shared pages in the system.
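
For illustration only (not part of the patch): a minimal userspace sketch of
how a management tool might switch the knob under that restriction -- stop
ksmd and unmerge first, since the write fails with EBUSY while any KSM
shared pages remain.  Sysfs paths are those documented in
Documentation/vm/ksm.txt; error handling is trimmed.

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);       /* EOF if the write failed, e.g. EBUSY */
}

static long read_knob(const char *path)
{
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        /* Stop ksmd and unmerge, so that pages_shared can drop to 0 */
        write_knob("/sys/kernel/mm/ksm/run", "2");
        while (read_knob("/sys/kernel/mm/ksm/pages_shared") > 0)
                ;       /* poll; a real tool would sleep between reads */

        /* Now the switch is permitted; it fails with EBUSY otherwise */
        if (write_knob("/sys/kernel/mm/ksm/merge_across_nodes", "0"))
                perror("merge_across_nodes");

        write_knob("/sys/kernel/mm/ksm/run", "1");      /* resume merging */
        return 0;
}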

I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
measured the speed of memory access inside KVM guests with memory
pinned to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times, as a percentage of the
average, were as follows:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes	1.7%

merge_across_nodes=0
2 nodes	1%
4 nodes	0.32%
8 nodes	0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 Documentation/vm/ksm.txt |    7 +
 mm/ksm.c                 |  151 ++++++++++++++++++++++++++++++++-----
 2 files changed, 139 insertions(+), 19 deletions(-)

--- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
+++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
@@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
                    e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
                    Default: 20 (chosen for demonstration purposes)
 
+merge_across_nodes - specifies if pages from different NUMA nodes can be merged.
+                   When set to 0, ksm merges only pages which physically
+                   reside in the memory area of the same NUMA node, giving
+                   lower latency of access to the shared page.  The value can be
+                   changed only when there are no ksm shared pages in the system.
+                   Default: 1
+
 run              - set 0 to stop ksmd from running but keep merged pages,
                    set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
                    set 2 to stop ksmd and unmerge all pages currently merged,
--- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
@@ -36,6 +36,7 @@
 #include <linux/hashtable.h>
 #include <linux/freezer.h>
 #include <linux/oom.h>
+#include <linux/numa.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -139,6 +140,9 @@ struct rmap_item {
 	struct mm_struct *mm;
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
+#ifdef CONFIG_NUMA
+	unsigned int nid;
+#endif
 	union {
 		struct rb_node node;	/* when node of unstable tree */
 		struct {		/* when listed from stable tree */
@@ -153,8 +157,8 @@ struct rmap_item {
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];
 
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
@@ -441,10 +448,25 @@ out:		page = NULL;
 	return page;
 }
 
+/*
+ * This helper is used for getting right index into array of tree roots.
+ * When merge_across_nodes knob is set to 1, there are only two rb-trees for
+ * stable and unstable pages from all nodes with roots in index 0. Otherwise,
+ * every node has its own stable and unstable tree.
+ */
+static inline int get_kpfn_nid(unsigned long kpfn)
+{
+	if (ksm_merge_across_nodes)
+		return 0;
+	else
+		return pfn_to_nid(kpfn);
+}
+
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
 {
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
+	int nid;
 
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		if (rmap_item->hlist.next)
@@ -456,7 +478,9 @@ static void remove_node_from_stable_tree
 		cond_resched();
 	}
 
-	rb_erase(&stable_node->node, &root_stable_tree);
+	nid = get_kpfn_nid(stable_node->kpfn);
+
+	rb_erase(&stable_node->node, &root_stable_tree[nid]);
 	free_stable_node(stable_node);
 }
 
@@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
 		if (!age)
-			rb_erase(&rmap_item->node, &root_unstable_tree);
+#ifdef CONFIG_NUMA
+			rb_erase(&rmap_item->node,
+					&root_unstable_tree[rmap_item->nid]);
+#else
+			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
+#endif
 
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
@@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag
  */
 static struct page *stable_tree_search(struct page *page)
 {
-	struct rb_node *node = root_stable_tree.rb_node;
+	struct rb_node *node;
 	struct stable_node *stable_node;
+	int nid;
 
 	stable_node = page_stable_node(page);
 	if (stable_node) {			/* ksm page forked */
@@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s
 		return page;
 	}
 
+	nid = get_kpfn_nid(page_to_pfn(page));
+	node = root_stable_tree[nid].rb_node;
+
 	while (node) {
 		struct page *tree_page;
 		int ret;
@@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s
  */
 static struct stable_node *stable_tree_insert(struct page *kpage)
 {
-	struct rb_node **new = &root_stable_tree.rb_node;
+	int nid;
+	unsigned long kpfn;
+	struct rb_node **new;
 	struct rb_node *parent = NULL;
 	struct stable_node *stable_node;
 
+	kpfn = page_to_pfn(kpage);
+	nid = get_kpfn_nid(kpfn);
+	new = &root_stable_tree[nid].rb_node;
+
 	while (*new) {
 		struct page *tree_page;
 		int ret;
@@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i
 		return NULL;
 
 	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
 	INIT_HLIST_HEAD(&stable_node->hlist);
 
-	stable_node->kpfn = page_to_pfn(kpage);
+	stable_node->kpfn = kpfn;
 	set_page_stable_node(kpage, stable_node);
 
 	return stable_node;
@@ -1098,10 +1137,15 @@ static
 struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 					      struct page *page,
 					      struct page **tree_pagep)
-
 {
-	struct rb_node **new = &root_unstable_tree.rb_node;
+	struct rb_node **new;
+	struct rb_root *root;
 	struct rb_node *parent = NULL;
+	int nid;
+
+	nid = get_kpfn_nid(page_to_pfn(page));
+	root = &root_unstable_tree[nid];
+	new = &root->rb_node;
 
 	while (*new) {
 		struct rmap_item *tree_rmap_item;
@@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
 			return NULL;
 		}
 
+		/*
+		 * If tree_page has been migrated to another NUMA node, it
+		 * will be flushed out and put into the right unstable tree
+		 * next time: only merge with it if merge_across_nodes.
+		 * Just notice, we don't have similar problem for PageKsm
+		 * because their migration is disabled now. (62b61f611e)
+		 */
+		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
+			put_page(tree_page);
+			return NULL;
+		}
+
 		ret = memcmp_pages(page, tree_page);
 
 		parent = *new;
@@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i
 
 	rmap_item->address |= UNSTABLE_FLAG;
 	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+#ifdef CONFIG_NUMA
+	rmap_item->nid = nid;
+#endif
 	rb_link_node(&rmap_item->node, parent, new);
-	rb_insert_color(&rmap_item->node, &root_unstable_tree);
+	rb_insert_color(&rmap_item->node, root);
 
 	ksm_pages_unshared++;
 	return NULL;
@@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i
 static void stable_tree_append(struct rmap_item *rmap_item,
 			       struct stable_node *stable_node)
 {
+#ifdef CONFIG_NUMA
+	/*
+	 * Usually rmap_item->nid is already set correctly,
+	 * but it may be wrong after switching merge_across_nodes.
+	 */
+	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
+#endif
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
@@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	int nid;
 
 	if (list_empty(&ksm_mm_head.mm_list))
 		return NULL;
@@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r
 		 */
 		lru_add_drain_all();
 
-		root_unstable_tree = RB_ROOT;
+		for (nid = 0; nid < nr_node_ids; nid++)
+			root_unstable_tree[nid] = RB_ROOT;
 
 		spin_lock(&ksm_mmlist_lock);
 		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
@@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta
 						 unsigned long end_pfn)
 {
 	struct rb_node *node;
+	int nid;
 
-	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
-		struct stable_node *stable_node;
+	for (nid = 0; nid < nr_node_ids; nid++)
+		for (node = rb_first(&root_stable_tree[nid]); node;
+				node = rb_next(node)) {
+			struct stable_node *stable_node;
+
+			stable_node = rb_entry(node, struct stable_node, node);
+			if (stable_node->kpfn >= start_pfn &&
+			    stable_node->kpfn < end_pfn)
+				return stable_node;
+		}
 
-		stable_node = rb_entry(node, struct stable_node, node);
-		if (stable_node->kpfn >= start_pfn &&
-		    stable_node->kpfn < end_pfn)
-			return stable_node;
-	}
 	return NULL;
 }
 
@@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject
 }
 KSM_ATTR(run);
 
+#ifdef CONFIG_NUMA
+static ssize_t merge_across_nodes_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
+}
+
+static ssize_t merge_across_nodes_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long knob;
+
+	err = kstrtoul(buf, 10, &knob);
+	if (err)
+		return err;
+	if (knob > 1)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+	if (ksm_merge_across_nodes != knob) {
+		if (ksm_pages_shared)
+			err = -EBUSY;
+		else
+			ksm_merge_across_nodes = knob;
+	}
+	mutex_unlock(&ksm_thread_mutex);
+
+	return err ? err : count;
+}
+KSM_ATTR(merge_across_nodes);
+#endif
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
@@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+	&merge_across_nodes_attr.attr,
+#endif
 	NULL,
 };
 
@@ -1992,11 +2101,15 @@ static int __init ksm_init(void)
 {
 	struct task_struct *ksm_thread;
 	int err;
+	int nid;
 
 	err = ksm_slab_init();
 	if (err)
 		goto out;
 
+	for (nid = 0; nid < nr_node_ids; nid++)
+		root_stable_tree[nid] = RB_ROOT;
+
 	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
 	if (IS_ERR(ksm_thread)) {
 		printk(KERN_ERR "ksm: creating kthread failed\n");


* [PATCH 2/11] ksm: add sysfs ABI Documentation
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
@ 2013-01-26  1:56 ` Hugh Dickins
  2013-01-26  1:58 ` [PATCH 3/11] ksm: trivial tidyups Hugh Dickins
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  1:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Greg KH,
	linux-kernel, linux-mm

From: Petr Holasek <pholasek@redhat.com>

This patch adds sysfs documentation for Kernel Samepage Merging (KSM),
including the new merge_across_nodes knob.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   52 ++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-ksm

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/Documentation/ABI/testing/sysfs-kernel-mm-ksm	2013-01-25 14:36:50.660205905 -0800
@@ -0,0 +1,52 @@
+What:		/sys/kernel/mm/ksm
+Date:		September 2009
+KernelVersion:	2.6.32
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Kernel Samepage Merging (KSM)
+
+What:		/sys/kernel/mm/ksm/full_scans
+What:		/sys/kernel/mm/ksm/pages_shared
+What:		/sys/kernel/mm/ksm/pages_sharing
+What:		/sys/kernel/mm/ksm/pages_to_scan
+What:		/sys/kernel/mm/ksm/pages_unshared
+What:		/sys/kernel/mm/ksm/pages_volatile
+What:		/sys/kernel/mm/ksm/run
+What:		/sys/kernel/mm/ksm/sleep_millisecs
+Date:		September 2009
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Kernel Samepage Merging daemon sysfs interface
+
+		full_scans: how many times all mergeable areas have been
+		scanned.
+
+		pages_shared: how many shared pages are being used.
+
+		pages_sharing: how many more sites are sharing them i.e. how
+		much saved.
+
+		pages_to_scan: how many present pages to scan before ksmd goes
+		to sleep.
+
+		pages_unshared: how many pages unique but repeatedly checked
+		for merging.
+
+		pages_volatile: how many pages changing too fast to be placed
+		in a tree.
+
+		run: write 0 to disable ksm, read 0 while ksm is disabled.
+			write 1 to run ksm, read 1 while ksm is running.
+			write 2 to disable ksm and unmerge all its pages.
+
+		sleep_millisecs: how many milliseconds ksm should sleep between
+		scans.
+
+		See Documentation/vm/ksm.txt for more information.
+
+What:		/sys/kernel/mm/ksm/merge_across_nodes
+Date:		January 2013
+KernelVersion:	3.9
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Control merging pages across different NUMA nodes.
+
+		When it is set to 0, only pages from the same node are merged;
+		otherwise pages from all nodes can be merged together (default).


* [PATCH 3/11] ksm: trivial tidyups
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
  2013-01-26  1:56 ` [PATCH 2/11] ksm: add sysfs ABI Documentation Hugh Dickins
@ 2013-01-26  1:58 ` Hugh Dickins
  2013-01-28 23:11   ` Andrew Morton
  2013-01-26  1:59 ` [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Hugh Dickins
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  1:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

Add NUMA() and DO_NUMA() macros to minimize the blight of #ifdef CONFIG_NUMAs
(but indeed we don't want to expand struct rmap_item by nid when not NUMA).
Add a comment, and change rmap_item->nid from "unsigned int" to plain "int",
matching the "int nid" used elsewhere.  Define ksm_merge_across_nodes as 1U
when #ifndef CONFIG_NUMA, to help the compiler optimize it out.
Use ?: in get_kpfn_nid().  Adjust a few comments noticed in ongoing work.
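
For illustration only, a tiny standalone program (not kernel code) showing
how those macros collapse: toggle the CONFIG_NUMA define and note that the
call sites need no #ifdef of their own, because both macros simply drop
their argument in the !NUMA build.

#include <stdio.h>

/* #define CONFIG_NUMA */               /* toggle to compare the two builds */

#ifdef CONFIG_NUMA
#define NUMA(x)         (x)
#define DO_NUMA(x)      (x)
#else
#define NUMA(x)         (0)
#define DO_NUMA(x)      do { } while (0)
#endif

struct rmap_item {
        unsigned long address;
#ifdef CONFIG_NUMA
        int nid;                        /* field only exists when NUMA */
#endif
};

int main(void)
{
        struct rmap_item item = { .address = 0x1000 };

        /* Vanishes entirely when !CONFIG_NUMA: ->nid is never referenced */
        DO_NUMA(item.nid = 3);

        /* Collapses to constant 0 when !CONFIG_NUMA */
        printf("item at %#lx goes in tree %d\n", item.address, NUMA(item.nid));
        return 0;
}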

Leave stable_tree_insert()'s rb_linkage until after the node has been set
up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock
make either way safe, but we're going to copy and I prefer this precedent.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   48 ++++++++++++++++++++++--------------------------
 1 file changed, 22 insertions(+), 26 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
@@ -41,6 +41,14 @@
 #include <asm/tlbflush.h>
 #include "internal.h"
 
+#ifdef CONFIG_NUMA
+#define NUMA(x)		(x)
+#define DO_NUMA(x)	(x)
+#else
+#define NUMA(x)		(0)
+#define DO_NUMA(x)	do { } while (0)
+#endif
+
 /*
  * A few notes about the KSM scanning process,
  * to make it easier to understand the data structures below:
@@ -130,6 +138,7 @@ struct stable_node {
  * @mm: the memory structure this rmap_item is pointing into
  * @address: the virtual address this rmap_item tracks (+ flags in low bits)
  * @oldchecksum: previous checksum of the page at that virtual address
+ * @nid: NUMA node id of unstable tree in which linked (may not match page)
  * @node: rb node of this rmap_item in the unstable tree
  * @head: pointer to stable_node heading this list in the stable tree
  * @hlist: link into hlist of rmap_items hanging off that stable_node
@@ -141,7 +150,7 @@ struct rmap_item {
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
 #ifdef CONFIG_NUMA
-	unsigned int nid;
+	int nid;
 #endif
 	union {
 		struct rb_node node;	/* when node of unstable tree */
@@ -192,8 +201,12 @@ static unsigned int ksm_thread_pages_to_
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+#ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
+#else
+#define ksm_merge_across_nodes	1U
+#endif
 
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
@@ -456,10 +469,7 @@ out:		page = NULL;
  */
 static inline int get_kpfn_nid(unsigned long kpfn)
 {
-	if (ksm_merge_across_nodes)
-		return 0;
-	else
-		return pfn_to_nid(kpfn);
+	return ksm_merge_across_nodes ? 0 : pfn_to_nid(kpfn);
 }
 
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
@@ -479,7 +489,6 @@ static void remove_node_from_stable_tree
 	}
 
 	nid = get_kpfn_nid(stable_node->kpfn);
-
 	rb_erase(&stable_node->node, &root_stable_tree[nid]);
 	free_stable_node(stable_node);
 }
@@ -578,13 +587,8 @@ static void remove_rmap_item_from_tree(s
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
 		if (!age)
-#ifdef CONFIG_NUMA
 			rb_erase(&rmap_item->node,
-					&root_unstable_tree[rmap_item->nid]);
-#else
-			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
-#endif
-
+				 &root_unstable_tree[NUMA(rmap_item->nid)]);
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
 	}
@@ -604,7 +608,7 @@ static void remove_trailing_rmap_items(s
 }
 
 /*
- * Though it's very tempting to unmerge in_stable_tree(rmap_item)s rather
+ * Though it's very tempting to unmerge rmap_items from stable tree rather
  * than check every pte of a given vma, the locking doesn't quite work for
  * that - an rmap_item is assigned to the stable tree after inserting ksm
  * page and upping mmap_sem.  Nor does it fit with the way we skip dup'ing
@@ -1058,7 +1062,7 @@ static struct page *stable_tree_search(s
 }
 
 /*
- * stable_tree_insert - insert rmap_item pointing to new ksm page
+ * stable_tree_insert - insert stable tree node pointing to new ksm page
  * into the stable tree.
  *
  * This function returns the stable tree node just allocated on success,
@@ -1108,13 +1112,11 @@ static struct stable_node *stable_tree_i
 	if (!stable_node)
 		return NULL;
 
-	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
-
 	INIT_HLIST_HEAD(&stable_node->hlist);
-
 	stable_node->kpfn = kpfn;
 	set_page_stable_node(kpage, stable_node);
+	rb_link_node(&stable_node->node, parent, new);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
 	return stable_node;
 }
@@ -1170,8 +1172,6 @@ struct rmap_item *unstable_tree_search_i
 		 * If tree_page has been migrated to another NUMA node, it
 		 * will be flushed out and put into the right unstable tree
 		 * next time: only merge with it if merge_across_nodes.
-		 * Just notice, we don't have similar problem for PageKsm
-		 * because their migration is disabled now. (62b61f611e)
 		 */
 		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
 			put_page(tree_page);
@@ -1195,9 +1195,7 @@ struct rmap_item *unstable_tree_search_i
 
 	rmap_item->address |= UNSTABLE_FLAG;
 	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
-#ifdef CONFIG_NUMA
-	rmap_item->nid = nid;
-#endif
+	DO_NUMA(rmap_item->nid = nid);
 	rb_link_node(&rmap_item->node, parent, new);
 	rb_insert_color(&rmap_item->node, root);
 
@@ -1213,13 +1211,11 @@ struct rmap_item *unstable_tree_search_i
 static void stable_tree_append(struct rmap_item *rmap_item,
 			       struct stable_node *stable_node)
 {
-#ifdef CONFIG_NUMA
 	/*
 	 * Usually rmap_item->nid is already set correctly,
 	 * but it may be wrong after switching merge_across_nodes.
 	 */
-	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
-#endif
+	DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);


* [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (2 preceding siblings ...)
  2013-01-26  1:58 ` [PATCH 3/11] ksm: trivial tidyups Hugh Dickins
@ 2013-01-26  1:59 ` Hugh Dickins
  2013-02-05 16:48   ` Mel Gorman
  2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  1:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange
so that at least it does not needlessly restart from nid 0 each time.
And add a couple of comments: here is why we keep pfn instead of page.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
@@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
-						 unsigned long end_pfn)
+static void ksm_check_stable_tree(unsigned long start_pfn,
+				  unsigned long end_pfn)
 {
+	struct stable_node *stable_node;
 	struct rb_node *node;
 	int nid;
 
-	for (nid = 0; nid < nr_node_ids; nid++)
-		for (node = rb_first(&root_stable_tree[nid]); node;
-				node = rb_next(node)) {
-			struct stable_node *stable_node;
-
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		node = rb_first(&root_stable_tree[nid]);
+		while (node) {
 			stable_node = rb_entry(node, struct stable_node, node);
 			if (stable_node->kpfn >= start_pfn &&
-			    stable_node->kpfn < end_pfn)
-				return stable_node;
+			    stable_node->kpfn < end_pfn) {
+				/*
+				 * Don't get_ksm_page, page has already gone:
+				 * which is why we keep kpfn instead of page*
+				 */
+				remove_node_from_stable_tree(stable_node);
+				node = rb_first(&root_stable_tree[nid]);
+			} else
+				node = rb_next(node);
+			cond_resched();
 		}
-
-	return NULL;
+	}
 }
 
 static int ksm_memory_callback(struct notifier_block *self,
 			       unsigned long action, void *arg)
 {
 	struct memory_notify *mn = arg;
-	struct stable_node *stable_node;
 
 	switch (action) {
 	case MEM_GOING_OFFLINE:
@@ -1874,11 +1879,12 @@ static int ksm_memory_callback(struct no
 		/*
 		 * Most of the work is done by page migration; but there might
 		 * be a few stable_nodes left over, still pointing to struct
-		 * pages which have been offlined: prune those from the tree.
+		 * pages which have been offlined: prune those from the tree,
+		 * otherwise get_ksm_page() might later try to access a
+		 * non-existent struct page.
 		 */
-		while ((stable_node = ksm_check_stable_tree(mn->start_pfn,
-					mn->start_pfn + mn->nr_pages)) != NULL)
-			remove_node_from_stable_tree(stable_node);
+		ksm_check_stable_tree(mn->start_pfn,
+				      mn->start_pfn + mn->nr_pages);
 		/* fallthrough */
 
 	case MEM_CANCEL_OFFLINE:


* [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (3 preceding siblings ...)
  2013-01-26  1:59 ` [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Hugh Dickins
@ 2013-01-26  2:00 ` Hugh Dickins
  2013-01-27  2:36   ` Simon Jeons
                     ` (2 more replies)
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

In some places where get_ksm_page() is used, we need the page to be locked.

When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration notices).
Whereas when navigating through the stable tree, we certainly do not want
to lock each node (raised page count is enough to guarantee the memcmps,
even if page is migrated to another node).

Since we're about to add another use case, add the locked argument to
get_ksm_page() now.
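
As a rough userspace analogy (a toy single-threaded sketch, not the kernel
code: the struct and names here are invented for illustration), the locked
case amounts to: take a reference, take the lock, then re-validate the
identity, backing out if it changed while we slept on the lock.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
        pthread_mutex_t lock;
        void *mapping;                  /* identity we expect to still hold */
        int refcount;
};

static struct fake_page *get_validated(struct fake_page *page, void *expected,
                                       bool locked)
{
        if (page->mapping != expected)
                return NULL;                    /* already stale */
        page->refcount++;                       /* cf. get_page_unless_zero() */
        if (locked) {
                pthread_mutex_lock(&page->lock);
                if (page->mapping != expected) {        /* re-check under lock */
                        pthread_mutex_unlock(&page->lock);
                        page->refcount--;
                        return NULL;
                }
        }
        return page;            /* caller unlocks (if locked) and drops the ref */
}

int main(void)
{
        static int stable_node;                 /* stands in for expected_mapping */
        struct fake_page page = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .mapping = &stable_node,
        };

        if (get_validated(&page, &stable_node, true)) {
                puts("page locked and still ours");
                pthread_mutex_unlock(&page.lock);
                page.refcount--;
        }
        return 0;
}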

Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
really got the wrong end of the stick on that!  There's a configuration
in which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it.  There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here.  Cut out that
silliness before making this any harder to understand.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
@@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
  * but this is different - made simpler by ksm_thread_mutex being held, but
  * interesting for assuming that no other use of the struct page could ever
  * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).  The RCU calls are not for KSM at all, but
- * to keep the page_count protocol described with page_cache_get_speculative.
+ * coincides with page->mapping).
  *
  * Note: it is possible that get_ksm_page() will return NULL one moment,
  * then page the next, if the page is in between page_freeze_refs() and
  * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
-static struct page *get_ksm_page(struct stable_node *stable_node)
+static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
 	struct page *page;
 	void *expected_mapping;
@@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct
 	page = pfn_to_page(stable_node->kpfn);
 	expected_mapping = (void *)stable_node +
 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-	rcu_read_lock();
 	if (page->mapping != expected_mapping)
 		goto stale;
 	if (!get_page_unless_zero(page))
@@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct
 		put_page(page);
 		goto stale;
 	}
-	rcu_read_unlock();
+	if (locked) {
+		lock_page(page);
+		if (page->mapping != expected_mapping) {
+			unlock_page(page);
+			put_page(page);
+			goto stale;
+		}
+	}
 	return page;
 stale:
-	rcu_read_unlock();
 	remove_node_from_stable_tree(stable_node);
 	return NULL;
 }
@@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s
 		struct page *page;
 
 		stable_node = rmap_item->head;
-		page = get_ksm_page(stable_node);
+		page = get_ksm_page(stable_node, true);
 		if (!page)
 			goto out;
 
-		lock_page(page);
 		hlist_del(&rmap_item->hlist);
 		unlock_page(page);
 		put_page(page);
@@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s
 
 		cond_resched();
 		stable_node = rb_entry(node, struct stable_node, node);
-		tree_page = get_ksm_page(stable_node);
+		tree_page = get_ksm_page(stable_node, false);
 		if (!tree_page)
 			return NULL;
 
@@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i
 
 		cond_resched();
 		stable_node = rb_entry(*new, struct stable_node, node);
-		tree_page = get_ksm_page(stable_node);
+		tree_page = get_ksm_page(stable_node, false);
 		if (!tree_page)
 			return NULL;
 


* [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (4 preceding siblings ...)
  2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
@ 2013-01-26  2:01 ` Hugh Dickins
  2013-01-27  4:55   ` Simon Jeons
                     ` (4 more replies)
  2013-01-26  2:03 ` [PATCH 7/11] ksm: make KSM page migration possible Hugh Dickins
                   ` (5 subsequent siblings)
  11 siblings, 5 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree.  It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.

How can this happen?  We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.

Three causes:

1. The old stable tree (built according to the inverse merge_across_nodes)
has not been fully torn down.  A stable node lingers until get_ksm_page()
notices that the page it references no longer references it: but the page
is not necessarily freed as soon as expected, particularly when swapcache.

Fix this with a pass through the old stable tree, applying get_ksm_page()
to each of the remaining nodes (most found stale and removed immediately),
with forced removal of any left over.  Unless the page is still mapped:
I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
and EBUSY than BUG.

2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it.  Nice when ksmd is running,
but not so nice when we're trying to unmerge all mms: we were missing
those mms forked and inserted behind the unmerge cursor.  Easily fixed
by inserting at the end when KSM_RUN_UNMERGE.

3. It is possible for a KSM page to be faulted back from swapcache into
an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
I/O error when read in from swap) to a page which it then marks Uptodate.
Fix this case by not copying, letting do_swap_page() discover the error.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/ksm.h |   18 ++-------
 mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
 mm/memory.c         |   19 ++++-----
 3 files changed, 92 insertions(+), 28 deletions(-)

--- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
+++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
@@ -16,9 +16,6 @@
 struct stable_node;
 struct mem_cgroup;
 
-struct page *ksm_does_need_to_copy(struct page *page,
-			struct vm_area_struct *vma, unsigned long address);
-
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
@@ -73,15 +70,8 @@ static inline void set_page_stable_node(
  * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
  * but what if the vma was unmerged while the page was swapped out?
  */
-static inline int ksm_might_need_to_copy(struct page *page,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	struct anon_vma *anon_vma = page_anon_vma(page);
-
-	return anon_vma &&
-		(anon_vma->root != vma->anon_vma->root ||
-		 page->index != linear_page_index(vma, address));
-}
+struct page *ksm_might_need_to_copy(struct page *page,
+			struct vm_area_struct *vma, unsigned long address);
 
 int page_referenced_ksm(struct page *page,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
@@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
 	return 0;
 }
 
-static inline int ksm_might_need_to_copy(struct page *page,
+static inline struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
-	return 0;
+	return page;
 }
 
 static inline int page_referenced_ksm(struct page *page,
--- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
@@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
 /*
  * Only called through the sysfs control interface:
  */
+static int remove_stable_node(struct stable_node *stable_node)
+{
+	struct page *page;
+	int err;
+
+	page = get_ksm_page(stable_node, true);
+	if (!page) {
+		/*
+		 * get_ksm_page did remove_node_from_stable_tree itself.
+		 */
+		return 0;
+	}
+
+	if (WARN_ON_ONCE(page_mapped(page)))
+		err = -EBUSY;
+	else {
+		/*
+		 * This page might be in a pagevec waiting to be freed,
+		 * or it might be PageSwapCache (perhaps under writeback),
+		 * or it might have been removed from swapcache a moment ago.
+		 */
+		set_page_stable_node(page, NULL);
+		remove_node_from_stable_tree(stable_node);
+		err = 0;
+	}
+
+	unlock_page(page);
+	put_page(page);
+	return err;
+}
+
+static int remove_all_stable_nodes(void)
+{
+	struct stable_node *stable_node;
+	int nid;
+	int err = 0;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		while (root_stable_tree[nid].rb_node) {
+			stable_node = rb_entry(root_stable_tree[nid].rb_node,
+						struct stable_node, node);
+			if (remove_stable_node(stable_node)) {
+				err = -EBUSY;
+				break;	/* proceed to next nid */
+			}
+			cond_resched();
+		}
+	}
+	return err;
+}
+
 static int unmerge_and_remove_all_rmap_items(void)
 {
 	struct mm_slot *mm_slot;
@@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i
 		}
 	}
 
+	/* Clean up stable nodes, but don't worry if some are still busy */
+	remove_all_stable_nodes();
 	ksm_scan.seqnr = 0;
 	return 0;
 
@@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm)
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
-	 * Insert just behind the scanning cursor, to let the area settle
+	 * When KSM_RUN_MERGE (or KSM_RUN_STOP),
+	 * insert just behind the scanning cursor, to let the area settle
 	 * down a little; when fork is followed by immediate exec, we don't
 	 * want ksmd to waste time setting up and tearing down an rmap_list.
+	 *
+	 * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
+	 * scanning cursor, otherwise KSM pages in newly forked mms will be
+	 * missed: then we might as well insert at the end of the list.
 	 */
-	list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
+	if (ksm_run & KSM_RUN_UNMERGE)
+		list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
+	else
+		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
 	set_bit(MMF_VM_MERGEABLE, &mm->flags);
@@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
-struct page *ksm_does_need_to_copy(struct page *page,
+struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
+	struct anon_vma *anon_vma = page_anon_vma(page);
 	struct page *new_page;
 
+	if (PageKsm(page)) {
+		if (page_stable_node(page) &&
+		    !(ksm_run & KSM_RUN_UNMERGE))
+			return page;	/* no need to copy it */
+	} else if (!anon_vma) {
+		return page;		/* no need to copy it */
+	} else if (anon_vma->root == vma->anon_vma->root &&
+		 page->index == linear_page_index(vma, address)) {
+		return page;		/* still no need to copy it */
+	}
+	if (!PageUptodate(page))
+		return page;		/* let do_swap_page report the error */
+
 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
 	if (new_page) {
 		copy_user_highpage(new_page, page, address, vma);
@@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store(
 
 	mutex_lock(&ksm_thread_mutex);
 	if (ksm_merge_across_nodes != knob) {
-		if (ksm_pages_shared)
+		if (ksm_pages_shared || remove_all_stable_nodes())
 			err = -EBUSY;
 		else
 			ksm_merge_across_nodes = knob;
--- mmotm.orig/mm/memory.c	2013-01-25 14:27:58.220193250 -0800
+++ mmotm/mm/memory.c	2013-01-25 14:37:00.768206145 -0800
@@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct
 	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
 		goto out_page;
 
-	if (ksm_might_need_to_copy(page, vma, address)) {
-		swapcache = page;
-		page = ksm_does_need_to_copy(page, vma, address);
-
-		if (unlikely(!page)) {
-			ret = VM_FAULT_OOM;
-			page = swapcache;
-			swapcache = NULL;
-			goto out_page;
-		}
+	swapcache = page;
+	page = ksm_might_need_to_copy(page, vma, address);
+	if (unlikely(!page)) {
+		ret = VM_FAULT_OOM;
+		page = swapcache;
+		swapcache = NULL;
+		goto out_page;
 	}
+	if (page == swapcache)
+		swapcache = NULL;
 
 	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
 		ret = VM_FAULT_OOM;


* [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (5 preceding siblings ...)
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
@ 2013-01-26  2:03 ` Hugh Dickins
  2013-01-27  5:47   ` Simon Jeons
  2013-02-05 19:11   ` Mel Gorman
  2013-01-26  2:05 ` [PATCH 8/11] ksm: make !merge_across_nodes migration safe Hugh Dickins
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree?  And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt
to deal with it either.  But how will I test a solution, when I don't
know how to hotremove memory?  The best answer is to enable KSM page
migration in all cases now, and test more common cases.  With THP and
compaction added since KSM came in, page migration is now mainstream,
and it's a shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets
KSM page migration working reliably for default merge_across_nodes 1
(but leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require
an additional tier of locking: page migration relies on the page lock,
KSM page reclaim relies on the page lock, the page lock is enough for
KSM page migration too.

Almost all the care has to be in get_ksm_page(): that's the function
which worries about when a stable node is stale and should be freed,
now it also has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here.  That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if
not swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk
to remove migration entries will find all the KSM locations where it
inserted earlier: that should already be handled, by the satisfyingly
heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
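
The ordering that the patch below relies on -- ksm_migrate_page() makes the
new kpfn visible (smp_wmb) before the old page's mapping can look stale, and
get_ksm_page()'s stale path re-checks kpfn after smp_rmb() so it retries
rather than freeing a live node -- can be sketched in userspace with C11
atomics.  This is only an analogy, not the kernel code.

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long kpfn = 100;        /* cf. stable_node->kpfn */
static _Atomic int mapping_ok = 1;              /* cf. old page->mapping intact */

static void migrate_side(unsigned long new_pfn)
{
        atomic_store_explicit(&kpfn, new_pfn, memory_order_relaxed);
        /* release pairs with the acquire below, as smp_wmb() with smp_rmb():
         * the new kpfn is visible before the old mapping can look stale */
        atomic_store_explicit(&mapping_ok, 0, memory_order_release);
}

/* 1 = page still good, -1 = node migrated so retry lookup, 0 = truly stale */
static int lookup_side(unsigned long seen_pfn)
{
        if (atomic_load_explicit(&mapping_ok, memory_order_acquire))
                return 1;
        if (atomic_load_explicit(&kpfn, memory_order_relaxed) != seen_pfn)
                return -1;
        return 0;
}

int main(void)
{
        unsigned long seen = atomic_load(&kpfn);

        migrate_side(200);
        printf("%d\n", lookup_side(seen));      /* -1: kpfn moved, retry */
        return 0;
}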

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c     |   94 ++++++++++++++++++++++++++++++++++++++-----------
 mm/migrate.c |    5 ++
 2 files changed, 77 insertions(+), 22 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
@@ -499,6 +499,7 @@ static void remove_node_from_stable_tree
  * In which case we can trust the content of the page, and it
  * returns the gotten page; but if the page has now been zapped,
  * remove the stale node from the stable tree and return NULL.
+ * But beware, the stable node's page might be being migrated.
  *
  * You would expect the stable_node to hold a reference to the ksm page.
  * But if it increments the page's count, swapping out has to wait for
@@ -509,44 +510,77 @@ static void remove_node_from_stable_tree
  * pointing back to this stable node.  This relies on freeing a PageAnon
  * page to reset its page->mapping to NULL, and relies on no other use of
  * a page to put something that might look like our key in page->mapping.
- *
- * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
- * but this is different - made simpler by ksm_thread_mutex being held, but
- * interesting for assuming that no other use of the struct page could ever
- * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).
- *
- * Note: it is possible that get_ksm_page() will return NULL one moment,
- * then page the next, if the page is in between page_freeze_refs() and
- * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
 static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
 	struct page *page;
 	void *expected_mapping;
+	unsigned long kpfn;
 
-	page = pfn_to_page(stable_node->kpfn);
 	expected_mapping = (void *)stable_node +
 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-	if (page->mapping != expected_mapping)
-		goto stale;
-	if (!get_page_unless_zero(page))
+again:
+	kpfn = ACCESS_ONCE(stable_node->kpfn);
+	page = pfn_to_page(kpfn);
+
+	/*
+	 * page is computed from kpfn, so on most architectures reading
+	 * page->mapping is naturally ordered after reading node->kpfn,
+	 * but on Alpha we need to be more careful.
+	 */
+	smp_read_barrier_depends();
+	if (ACCESS_ONCE(page->mapping) != expected_mapping)
 		goto stale;
-	if (page->mapping != expected_mapping) {
+
+	/*
+	 * We cannot do anything with the page while its refcount is 0.
+	 * Usually 0 means free, or tail of a higher-order page: in which
+	 * case this node is no longer referenced, and should be freed;
+	 * however, it might mean that the page is under page_freeze_refs().
+	 * The __remove_mapping() case is easy, again the node is now stale;
+	 * but if page is swapcache in migrate_page_move_mapping(), it might
+	 * still be our page, in which case it's essential to keep the node.
+	 */
+	while (!get_page_unless_zero(page)) {
+		/*
+		 * Another check for page->mapping != expected_mapping would
+		 * work here too.  We have chosen the !PageSwapCache test to
+		 * optimize the common case, when the page is or is about to
+		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
+		 * in the freeze_refs section of __remove_mapping(); but Anon
+		 * page->mapping reset to NULL later, in free_pages_prepare().
+		 */
+		if (!PageSwapCache(page))
+			goto stale;
+		cpu_relax();
+	}
+
+	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 		put_page(page);
 		goto stale;
 	}
+
 	if (locked) {
 		lock_page(page);
-		if (page->mapping != expected_mapping) {
+		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 			unlock_page(page);
 			put_page(page);
 			goto stale;
 		}
 	}
 	return page;
+
 stale:
+	/*
+	 * We come here from above when page->mapping or !PageSwapCache
+	 * suggests that the node is stale; but it might be under migration.
+	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
+	 * before checking whether node->kpfn has been changed.
+	 */
+	smp_rmb();
+	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
+		goto again;
 	remove_node_from_stable_tree(stable_node);
 	return NULL;
 }
@@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s
 			return NULL;
 
 		ret = memcmp_pages(page, tree_page);
+		put_page(tree_page);
 
-		if (ret < 0) {
-			put_page(tree_page);
+		if (ret < 0)
 			node = node->rb_left;
-		} else if (ret > 0) {
-			put_page(tree_page);
+		else if (ret > 0)
 			node = node->rb_right;
-		} else
+		else {
+			/*
+			 * Lock and unlock the stable_node's page (which
+			 * might already have been migrated) so that page
+			 * migration is sure to notice its raised count.
+			 * It would be more elegant to return stable_node
+			 * than kpage, but that involves more changes.
+			 */
+			tree_page = get_ksm_page(stable_node, true);
+			if (tree_page)
+				unlock_page(tree_page);
 			return tree_page;
+		}
 	}
 
 	return NULL;
@@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
 	if (stable_node) {
 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
 		stable_node->kpfn = page_to_pfn(newpage);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+		set_page_stable_node(oldpage, NULL);
 	}
 }
 #endif /* CONFIG_MIGRATION */
--- mmotm.orig/mm/migrate.c	2013-01-25 14:27:58.140193249 -0800
+++ mmotm/mm/migrate.c	2013-01-25 14:37:03.832206218 -0800
@@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp
 
 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
-
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
 	ClearPageSwapCache(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);


* [PATCH 8/11] ksm: make !merge_across_nodes migration safe
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (6 preceding siblings ...)
  2013-01-26  2:03 ` [PATCH 7/11] ksm: make KSM page migration possible Hugh Dickins
@ 2013-01-26  2:05 ` Hugh Dickins
  2013-01-27  8:49   ` Simon Jeons
  2013-01-28  3:44   ` Simon Jeons
  2013-01-26  2:06 ` [PATCH 9/11] ksm: enable KSM page migration Hugh Dickins
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree?  And what if that
collides with an existing stable node?

ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex.  So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until
the next pass of ksmd, just be careful not to merge other pages on to a
misplaced page.  Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at
a matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree.  If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(),
in order to handle these insertions of migrated nodes.
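
The overlay described above -- rb linkage reused as list linkage, with the
address of migrate_nodes as the magic discriminator -- looks roughly like
this standalone sketch (simplified stand-ins for rb_node and list_head, not
the kernel definitions).

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
struct rb_node { void *parent, *left, *right; };        /* simplified stand-in */

static struct list_head migrate_nodes = { &migrate_nodes, &migrate_nodes };

struct stable_node {
        union {
                struct rb_node node;            /* when node of a stable tree */
                struct {                        /* when listed for migration */
                        struct list_head *head; /* == &migrate_nodes */
                        struct list_head list;
                };
        };
};

static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

static int on_migrate_list(struct stable_node *s)
{
        return s->head == &migrate_nodes;       /* the "magic" address test */
}

int main(void)
{
        struct stable_node s = { .node = { 0 } };

        printf("%d\n", on_migrate_list(&s));    /* 0: (zeroed) tree linkage */
        s.head = &migrate_nodes;                /* exile it from the tree... */
        list_add(&s.list, &migrate_nodes);      /* ...onto migrate_nodes */
        printf("%d\n", on_migrate_list(&s));    /* 1 */
        return 0;
}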

remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself.  Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |  164 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 134 insertions(+), 30 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
@@ -122,13 +122,25 @@ struct ksm_scan {
 /**
  * struct stable_node - node of the stable rbtree
  * @node: rb node of this ksm page in the stable tree
+ * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
+ * @list: linked into migrate_nodes, pending placement in the proper node tree
  * @hlist: hlist head of rmap_items using this ksm page
- * @kpfn: page frame number of this ksm page
+ * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
+ * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
  */
 struct stable_node {
-	struct rb_node node;
+	union {
+		struct rb_node node;	/* when node of stable tree */
+		struct {		/* when listed for migration */
+			struct list_head *head;
+			struct list_head list;
+		};
+	};
 	struct hlist_head hlist;
 	unsigned long kpfn;
+#ifdef CONFIG_NUMA
+	int nid;
+#endif
 };
 
 /**
@@ -169,6 +181,9 @@ struct rmap_item {
 static struct rb_root root_unstable_tree[MAX_NUMNODES];
 static struct rb_root root_stable_tree[MAX_NUMNODES];
 
+/* Recently migrated nodes of stable tree, pending proper placement */
+static LIST_HEAD(migrate_nodes);
+
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
@@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru
 	hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm);
 }
 
-static inline int in_stable_tree(struct rmap_item *rmap_item)
-{
-	return rmap_item->address & STABLE_FLAG;
-}
-
 /*
  * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
  * page tables after it has passed through ksm_exit() - which, if necessary,
@@ -476,7 +486,6 @@ static void remove_node_from_stable_tree
 {
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
-	int nid;
 
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		if (rmap_item->hlist.next)
@@ -488,8 +497,11 @@ static void remove_node_from_stable_tree
 		cond_resched();
 	}
 
-	nid = get_kpfn_nid(stable_node->kpfn);
-	rb_erase(&stable_node->node, &root_stable_tree[nid]);
+	if (stable_node->head == &migrate_nodes)
+		list_del(&stable_node->list);
+	else
+		rb_erase(&stable_node->node,
+			 &root_stable_tree[NUMA(stable_node->nid)]);
 	free_stable_node(stable_node);
 }
 
@@ -712,6 +724,7 @@ static int remove_stable_node(struct sta
 static int remove_all_stable_nodes(void)
 {
 	struct stable_node *stable_node;
+	struct list_head *this, *next;
 	int nid;
 	int err = 0;
 
@@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void)
 			cond_resched();
 		}
 	}
+	list_for_each_safe(this, next, &migrate_nodes) {
+		stable_node = list_entry(this, struct stable_node, list);
+		if (remove_stable_node(stable_node))
+			err = -EBUSY;
+		cond_resched();
+	}
 	return err;
 }
 
@@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag
  */
 static struct page *stable_tree_search(struct page *page)
 {
-	struct rb_node *node;
-	struct stable_node *stable_node;
 	int nid;
+	struct rb_node **new;
+	struct rb_node *parent;
+	struct stable_node *stable_node;
+	struct stable_node *page_node;
 
-	stable_node = page_stable_node(page);
-	if (stable_node) {			/* ksm page forked */
+	page_node = page_stable_node(page);
+	if (page_node && page_node->head != &migrate_nodes) {
+		/* ksm page forked */
 		get_page(page);
 		return page;
 	}
 
 	nid = get_kpfn_nid(page_to_pfn(page));
-	node = root_stable_tree[nid].rb_node;
+again:
+	new = &root_stable_tree[nid].rb_node;
+	parent = NULL;
 
-	while (node) {
+	while (*new) {
 		struct page *tree_page;
 		int ret;
 
 		cond_resched();
-		stable_node = rb_entry(node, struct stable_node, node);
+		stable_node = rb_entry(*new, struct stable_node, node);
 		tree_page = get_ksm_page(stable_node, false);
 		if (!tree_page)
 			return NULL;
@@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s
 		ret = memcmp_pages(page, tree_page);
 		put_page(tree_page);
 
+		parent = *new;
 		if (ret < 0)
-			node = node->rb_left;
+			new = &parent->rb_left;
 		else if (ret > 0)
-			node = node->rb_right;
+			new = &parent->rb_right;
 		else {
 			/*
 			 * Lock and unlock the stable_node's page (which
@@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s
 			 * than kpage, but that involves more changes.
 			 */
 			tree_page = get_ksm_page(stable_node, true);
-			if (tree_page)
+			if (tree_page) {
 				unlock_page(tree_page);
-			return tree_page;
+				if (get_kpfn_nid(stable_node->kpfn) !=
+						NUMA(stable_node->nid)) {
+					put_page(tree_page);
+					goto replace;
+				}
+				return tree_page;
+			}
+			/*
+			 * There is now a place for page_node, but the tree may
+			 * have been rebalanced, so re-evaluate parent and new.
+			 */
+			if (page_node)
+				goto again;
+			return NULL;
 		}
 	}
 
-	return NULL;
+	if (!page_node)
+		return NULL;
+
+	list_del(&page_node->list);
+	DO_NUMA(page_node->nid = nid);
+	rb_link_node(&page_node->node, parent, new);
+	rb_insert_color(&page_node->node, &root_stable_tree[nid]);
+	get_page(page);
+	return page;
+
+replace:
+	if (page_node) {
+		list_del(&page_node->list);
+		DO_NUMA(page_node->nid = nid);
+		rb_replace_node(&stable_node->node,
+				&page_node->node, &root_stable_tree[nid]);
+		get_page(page);
+	} else {
+		rb_erase(&stable_node->node, &root_stable_tree[nid]);
+		page = NULL;
+	}
+	stable_node->head = &migrate_nodes;
+	list_add(&stable_node->list, stable_node->head);
+	return page;
 }
 
 /*
@@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i
 	INIT_HLIST_HEAD(&stable_node->hlist);
 	stable_node->kpfn = kpfn;
 	set_page_stable_node(kpage, stable_node);
+	DO_NUMA(stable_node->nid = nid);
 	rb_link_node(&stable_node->node, parent, new);
 	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
@@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i
 static void stable_tree_append(struct rmap_item *rmap_item,
 			       struct stable_node *stable_node)
 {
-	/*
-	 * Usually rmap_item->nid is already set correctly,
-	 * but it may be wrong after switching merge_across_nodes.
-	 */
-	DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
@@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa
 	unsigned int checksum;
 	int err;
 
-	remove_rmap_item_from_tree(rmap_item);
+	stable_node = page_stable_node(page);
+	if (stable_node) {
+		if (stable_node->head != &migrate_nodes &&
+		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
+			rb_erase(&stable_node->node,
+				 &root_stable_tree[NUMA(stable_node->nid)]);
+			stable_node->head = &migrate_nodes;
+			list_add(&stable_node->list, stable_node->head);
+		}
+		if (stable_node->head != &migrate_nodes &&
+		    rmap_item->head == stable_node)
+			return;
+	}
 
 	/* We first start with searching the page inside the stable tree */
 	kpage = stable_tree_search(page);
+	if (kpage == page && rmap_item->head == stable_node) {
+		put_page(kpage);
+		return;
+	}
+
+	remove_rmap_item_from_tree(rmap_item);
+
 	if (kpage) {
 		err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
 		if (!err) {
@@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r
 		 */
 		lru_add_drain_all();
 
+		/*
+		 * Whereas stale stable_nodes on the stable_tree itself
+		 * get pruned in the regular course of stable_tree_search(),
+		 * those moved out to the migrate_nodes list can accumulate:
+		 * so prune them once before each full scan.
+		 */
+		if (!ksm_merge_across_nodes) {
+			struct stable_node *stable_node;
+			struct list_head *this, *next;
+			struct page *page;
+
+			list_for_each_safe(this, next, &migrate_nodes) {
+				stable_node = list_entry(this,
+						struct stable_node, list);
+				page = get_ksm_page(stable_node, false);
+				if (page)
+					put_page(page);
+				cond_resched();
+			}
+		}
+
 		for (nid = 0; nid < nr_node_ids; nid++)
 			root_unstable_tree[nid] = RB_ROOT;
 
@@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
 			return;
-		if (!PageKsm(page) || !in_stable_tree(rmap_item))
-			cmp_and_merge_page(page, rmap_item);
+		cmp_and_merge_page(page, rmap_item);
 		put_page(page);
 	}
 }
@@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign
 				  unsigned long end_pfn)
 {
 	struct stable_node *stable_node;
+	struct list_head *this, *next;
 	struct rb_node *node;
 	int nid;
 
@@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign
 			cond_resched();
 		}
 	}
+	list_for_each_safe(this, next, &migrate_nodes) {
+		stable_node = list_entry(this, struct stable_node, list);
+		if (stable_node->kpfn >= start_pfn &&
+		    stable_node->kpfn < end_pfn)
+			remove_node_from_stable_tree(stable_node);
+		cond_resched();
+	}
 }
 
 static int ksm_memory_callback(struct notifier_block *self,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 9/11] ksm: enable KSM page migration
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (7 preceding siblings ...)
  2013-01-26  2:05 ` [PATCH 8/11] ksm: make !merge_across_nodes migration safe Hugh Dickins
@ 2013-01-26  2:06 ` Hugh Dickins
  2013-01-26  2:07 ` [PATCH 10/11] mm: remove offlining arg to migrate_pages Hugh Dickins
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Mel Gorman,
	linux-kernel, linux-mm

Migration of KSM pages is now safe: remove the PageKsm restrictions from
mempolicy.c and migrate.c.

But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
are irrelevant to KSM: it looks as if that code was preventing hotremove
migration of KSM pages, unless they happened to be in swapcache.

There is some question as to whether enforcing a NUMA mempolicy ought to
migrate KSM pages, which may be mapped into entirely unrelated processes;
but moving a page with page_mapcount > 1 is only permitted with
MPOL_MF_MOVE_ALL anyway, and it seems reasonable to assume that you
wouldn't set MADV_MERGEABLE on any area where this is a worry.
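
(To spell out the mapcount point: under a mempolicy, a page mapped by more
than one process is only a migration candidate when MPOL_MF_MOVE_ALL was
requested.  A toy predicate sketching that rule -- the helper name and the
flag definitions here are illustrative, not lifted from mempolicy.c:

#include <stdbool.h>

/* Illustrative only: models the "page_mapcount > 1 needs MPOL_MF_MOVE_ALL"
 * rule referred to above; these are not the kernel's definitions.
 */
#define MPOL_MF_MOVE		(1 << 1)
#define MPOL_MF_MOVE_ALL	(1 << 2)

static bool mempolicy_may_migrate(int page_mapcount, unsigned int flags)
{
	if (page_mapcount > 1)			/* shared, e.g. a KSM page */
		return flags & MPOL_MF_MOVE_ALL;
	return flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);
}

)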

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/mempolicy.c |    3 +--
 mm/migrate.c   |   21 +++------------------
 2 files changed, 4 insertions(+), 20 deletions(-)

--- mmotm.orig/mm/mempolicy.c	2013-01-24 12:28:38.848127553 -0800
+++ mmotm/mm/mempolicy.c	2013-01-25 14:38:49.596208731 -0800
@@ -496,9 +496,8 @@ static int check_pte_range(struct vm_are
 		/*
 		 * vm_normal_page() filters out zero pages, but there might
 		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 * And we cannot move PageKsm pages sensibly or safely yet.
 		 */
-		if (PageReserved(page) || PageKsm(page))
+		if (PageReserved(page))
 			continue;
 		nid = page_to_nid(page);
 		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
--- mmotm.orig/mm/migrate.c	2013-01-25 14:37:03.832206218 -0800
+++ mmotm/mm/migrate.c	2013-01-25 14:38:49.596208731 -0800
@@ -731,20 +731,6 @@ static int __unmap_and_move(struct page
 		lock_page(page);
 	}
 
-	/*
-	 * Only memory hotplug's offline_pages() caller has locked out KSM,
-	 * and can safely migrate a KSM page.  The other cases have skipped
-	 * PageKsm along with PageReserved - but it is only now when we have
-	 * the page lock that we can be certain it will not go KSM beneath us
-	 * (KSM will not upgrade a page from PageAnon to PageKsm when it sees
-	 * its pagecount raised, but only here do we take the page lock which
-	 * serializes that).
-	 */
-	if (PageKsm(page) && !offlining) {
-		rc = -EBUSY;
-		goto unlock;
-	}
-
 	/* charge against new page */
 	mem_cgroup_prepare_migration(page, newpage, &mem);
 
@@ -771,7 +757,7 @@ static int __unmap_and_move(struct page
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page) && !PageKsm(page)) {
 		/*
 		 * Only page_lock_anon_vma_read() understands the subtleties of
 		 * getting a hold on an anon_vma from outside one of its mms.
@@ -851,7 +837,6 @@ uncharge:
 	mem_cgroup_end_migration(mem, page, newpage,
 				 (rc == MIGRATEPAGE_SUCCESS ||
 				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
-unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1156,7 +1141,7 @@ static int do_move_page_to_node_array(st
 			goto set_status;
 
 		/* Use PageReserved to check for zero page */
-		if (PageReserved(page) || PageKsm(page))
+		if (PageReserved(page))
 			goto put_and_set;
 
 		pp->page = page;
@@ -1318,7 +1303,7 @@ static void do_pages_stat_array(struct m
 
 		err = -ENOENT;
 		/* Use PageReserved to check for zero page */
-		if (!page || PageReserved(page) || PageKsm(page))
+		if (!page || PageReserved(page))
 			goto set_status;
 
 		err = page_to_nid(page);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 10/11] mm: remove offlining arg to migrate_pages
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (8 preceding siblings ...)
  2013-01-26  2:06 ` [PATCH 9/11] ksm: enable KSM page migration Hugh Dickins
@ 2013-01-26  2:07 ` Hugh Dickins
  2013-01-26  2:10 ` [PATCH 11/11] ksm: stop hotremove lockdep warning Hugh Dickins
  2013-01-28 23:54 ` [PATCH 0/11] ksm: NUMA trees and page migration Andrew Morton
  11 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Mel Gorman,
	linux-kernel, linux-mm

No functional change, but the only purpose of the offlining argument
to migrate_pages() etc. was to ensure that __unmap_and_move() could
migrate a KSM page for memory hotremove (which took ksm_thread_mutex)
but not for other callers.  Now that all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/migrate.h |   14 ++++++--------
 mm/compaction.c         |    2 +-
 mm/memory-failure.c     |    7 +++----
 mm/memory_hotplug.c     |    3 +--
 mm/mempolicy.c          |    8 +++-----
 mm/migrate.c            |   35 +++++++++++++----------------------
 mm/page_alloc.c         |    6 ++----
 7 files changed, 29 insertions(+), 46 deletions(-)

--- mmotm.orig/include/linux/migrate.h	2013-01-24 12:28:38.740127550 -0800
+++ mmotm/include/linux/migrate.h	2013-01-25 14:38:51.468208776 -0800
@@ -40,11 +40,9 @@ extern void putback_movable_pages(struct
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
-			unsigned long private, bool offlining,
-			enum migrate_mode mode, int reason);
+		unsigned long private, enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
-			unsigned long private, bool offlining,
-			enum migrate_mode mode);
+		unsigned long private, enum migrate_mode mode);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -62,11 +60,11 @@ extern int migrate_huge_page_move_mappin
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
-		unsigned long private, bool offlining,
-		enum migrate_mode mode, int reason) { return -ENOSYS; }
+		unsigned long private, enum migrate_mode mode, int reason)
+	{ return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
-		unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		unsigned long private, enum migrate_mode mode)
+	{ return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
--- mmotm.orig/mm/compaction.c	2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/compaction.c	2013-01-25 14:38:51.472208776 -0800
@@ -980,7 +980,7 @@ static int compact_zone(struct zone *zon
 
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
-				(unsigned long)cc, false,
+				(unsigned long)cc,
 				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
 				MR_COMPACTION);
 		update_nr_listpages(cc);
--- mmotm.orig/mm/memory-failure.c	2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/memory-failure.c	2013-01-25 14:38:51.472208776 -0800
@@ -1432,7 +1432,7 @@ static int soft_offline_huge_page(struct
 		goto done;
 
 	/* Keep page count to indicate a given hugepage is isolated. */
-	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false,
+	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
 				MIGRATE_SYNC);
 	put_page(hpage);
 	if (ret) {
@@ -1564,11 +1564,10 @@ int soft_offline_page(struct page *page,
 	if (!ret) {
 		LIST_HEAD(pagelist);
 		inc_zone_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+					page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-							false, MIGRATE_SYNC,
-							MR_MEMORY_FAILURE);
+					MIGRATE_SYNC, MR_MEMORY_FAILURE);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
--- mmotm.orig/mm/memory_hotplug.c	2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/memory_hotplug.c	2013-01-25 14:38:51.472208776 -0800
@@ -1283,8 +1283,7 @@ do_migrate_range(unsigned long start_pfn
 		 * migrate_pages returns # of failed pages.
 		 */
 		ret = migrate_pages(&source, alloc_migrate_target, 0,
-							true, MIGRATE_SYNC,
-							MR_MEMORY_HOTPLUG);
+					MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
 		if (ret)
 			putback_lru_pages(&source);
 	}
--- mmotm.orig/mm/mempolicy.c	2013-01-25 14:38:49.596208731 -0800
+++ mmotm/mm/mempolicy.c	2013-01-25 14:38:51.472208776 -0800
@@ -1014,8 +1014,7 @@ static int migrate_to_node(struct mm_str
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC,
-							MR_SYSCALL);
+					MIGRATE_SYNC, MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1259,9 +1258,8 @@ static long do_mbind(unsigned long start
 		if (!list_empty(&pagelist)) {
 			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma,
-						false, MIGRATE_SYNC,
-						MR_MEMPOLICY_MBIND);
+					(unsigned long)vma,
+					MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
--- mmotm.orig/mm/migrate.c	2013-01-25 14:38:49.596208731 -0800
+++ mmotm/mm/migrate.c	2013-01-25 14:38:51.476208776 -0800
@@ -701,7 +701,7 @@ static int move_to_new_page(struct page
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-			int force, bool offlining, enum migrate_mode mode)
+				int force, enum migrate_mode mode)
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
@@ -847,8 +847,7 @@ out:
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining,
-			enum migrate_mode mode)
+			struct page *page, int force, enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -866,7 +865,7 @@ static int unmap_and_move(new_page_t get
 		if (unlikely(split_huge_page(page)))
 			goto out;
 
-	rc = __unmap_and_move(page, newpage, force, offlining, mode);
+	rc = __unmap_and_move(page, newpage, force, mode);
 
 	if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
 		/*
@@ -927,8 +926,7 @@ out:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, bool offlining,
-				enum migrate_mode mode)
+				int force, enum migrate_mode mode)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -990,9 +988,8 @@ out:
  *
  * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode, int reason)
+int migrate_pages(struct list_head *from, new_page_t get_new_page,
+		unsigned long private, enum migrate_mode mode, int reason)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1013,8 +1010,7 @@ int migrate_pages(struct list_head *from
 			cond_resched();
 
 			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2, offlining,
-						mode);
+						page, pass > 2, mode);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1047,15 +1043,13 @@ out:
 }
 
 int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
-		      unsigned long private, bool offlining,
-		      enum migrate_mode mode)
+		      unsigned long private, enum migrate_mode mode)
 {
 	int pass, rc;
 
 	for (pass = 0; pass < 10; pass++) {
-		rc = unmap_and_move_huge_page(get_new_page,
-					      private, hpage, pass > 2, offlining,
-					      mode);
+		rc = unmap_and_move_huge_page(get_new_page, private,
+						hpage, pass > 2, mode);
 		switch (rc) {
 		case -ENOMEM:
 			goto out;
@@ -1178,8 +1172,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, MIGRATE_SYNC,
-				MR_SYSCALL);
+				(unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1614,10 +1607,8 @@ int migrate_misplaced_page(struct page *
 		goto out;
 
 	list_add(&page->lru, &migratepages);
-	nr_remaining = migrate_pages(&migratepages,
-			alloc_misplaced_dst_page,
-			node, false, MIGRATE_ASYNC,
-			MR_NUMA_MISPLACED);
+	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
+				     node, MIGRATE_ASYNC, MR_NUMA_MISPLACED);
 	if (nr_remaining) {
 		putback_lru_pages(&migratepages);
 		isolated = 0;
--- mmotm.orig/mm/page_alloc.c	2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/page_alloc.c	2013-01-25 14:38:51.476208776 -0800
@@ -6064,10 +6064,8 @@ static int __alloc_contig_migrate_range(
 							&cc->migratepages);
 		cc->nr_migratepages -= nr_reclaimed;
 
-		ret = migrate_pages(&cc->migratepages,
-				    alloc_migrate_target,
-				    0, false, MIGRATE_SYNC,
-				    MR_CMA);
+		ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
+				    0, MIGRATE_SYNC, MR_CMA);
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 11/11] ksm: stop hotremove lockdep warning
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (9 preceding siblings ...)
  2013-01-26  2:07 ` [PATCH 10/11] mm: remove offlining arg to migrate_pages Hugh Dickins
@ 2013-01-26  2:10 ` Hugh Dickins
  2013-01-27  6:23   ` Simon Jeons
  2013-02-08 18:45   ` Gerald Schaefer
  2013-01-28 23:54 ` [PATCH 0/11] ksm: NUMA trees and page migration Andrew Morton
  11 siblings, 2 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-26  2:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Gerald Schaefer,
	KOSAKI Motohiro, linux-kernel, linux-mm

Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
to be a problem because notifier callbacks are made under down_read
of blocking_notifier_head->rwsem (so first the mutex is taken while
holding the rwsem, then later the rwsem is taken while still holding
the mutex); but it is not in fact a problem because mem_hotplug_mutex
is held throughout the dance.

There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.

I had hoped to eradicate this issue in extending KSM page migration not
to need the ksm_thread_mutex.  But then I realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
vanish, and get_ksm_page()'s accesses to them become a violation.

So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep.
This is less elegant for KSM, but it's more important to keep lockdep
useful to other users - and I apologize for how long it took to fix.
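
(A minimal userspace model of the lockout protocol, for readers who want
to see its shape without the kernel plumbing: the boolean below stands in
for the KSM_RUN_OFFLINE bit in ksm_run, and a pthread condition variable
stands in for wait_on_bit()/wake_up_bit().  Illustrative only, not the
kernel code:

/* Build with: cc -c -pthread.  Models wait_while_offlining() and the
 * MEM_GOING_OFFLINE / MEM_OFFLINE notifier transitions described above.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t ksm_thread_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t offline_cleared = PTHREAD_COND_INITIALIZER;
static bool offline;		/* stands in for ksm_run & KSM_RUN_OFFLINE */

/* Call with ksm_thread_mutex held; returns with it held and offline false.
 * pthread_cond_wait() drops and re-takes the mutex, mirroring the explicit
 * unlock / wait_on_bit / lock dance in the kernel version.
 */
static void wait_while_offlining(void)
{
	while (offline)
		pthread_cond_wait(&offline_cleared, &ksm_thread_mutex);
}

static void mem_going_offline(void)	/* MEM_GOING_OFFLINE */
{
	pthread_mutex_lock(&ksm_thread_mutex);
	offline = true;			/* lock out stable-tree users */
	pthread_mutex_unlock(&ksm_thread_mutex);
}

static void mem_offline_or_cancel(void)	/* MEM_OFFLINE / MEM_CANCEL_OFFLINE */
{
	pthread_mutex_lock(&ksm_thread_mutex);
	offline = false;
	pthread_mutex_unlock(&ksm_thread_mutex);
	pthread_cond_broadcast(&offline_cleared);	/* like wake_up_bit() */
}

)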

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/ksm.c |   55 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 41 insertions(+), 14 deletions(-)

--- mmotm.orig/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
+++ mmotm/mm/ksm.c	2013-01-25 14:38:53.984208836 -0800
@@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
-static unsigned int ksm_run = KSM_RUN_STOP;
+#define KSM_RUN_OFFLINE	4
+static unsigned long ksm_run = KSM_RUN_STOP;
+static void wait_while_offlining(void);
 
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
@@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing
 
 	while (!kthread_should_stop()) {
 		mutex_lock(&ksm_thread_mutex);
+		wait_while_offlining();
 		if (ksmd_should_run())
 			ksm_do_scan(ksm_thread_pages_to_scan);
 		mutex_unlock(&ksm_thread_mutex);
@@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
+static int just_wait(void *word)
+{
+	schedule();
+	return 0;
+}
+
+static void wait_while_offlining(void)
+{
+	while (ksm_run & KSM_RUN_OFFLINE) {
+		mutex_unlock(&ksm_thread_mutex);
+		wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE),
+				just_wait, TASK_UNINTERRUPTIBLE);
+		mutex_lock(&ksm_thread_mutex);
+	}
+}
+
 static void ksm_check_stable_tree(unsigned long start_pfn,
 				  unsigned long end_pfn)
 {
@@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
 	switch (action) {
 	case MEM_GOING_OFFLINE:
 		/*
-		 * Keep it very simple for now: just lock out ksmd and
-		 * MADV_UNMERGEABLE while any memory is going offline.
-		 * mutex_lock_nested() is necessary because lockdep was alarmed
-		 * that here we take ksm_thread_mutex inside notifier chain
-		 * mutex, and later take notifier chain mutex inside
-		 * ksm_thread_mutex to unlock it.   But that's safe because both
-		 * are inside mem_hotplug_mutex.
+		 * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
+		 * and remove_all_stable_nodes() while memory is going offline:
+		 * it is unsafe for them to touch the stable tree at this time.
+		 * But unmerge_ksm_pages(), rmap lookups and other entry points
+		 * which do not need the ksm_thread_mutex are all safe.
 		 */
-		mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
+		mutex_lock(&ksm_thread_mutex);
+		ksm_run |= KSM_RUN_OFFLINE;
+		mutex_unlock(&ksm_thread_mutex);
 		break;
 
 	case MEM_OFFLINE:
@@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no
 		/* fallthrough */
 
 	case MEM_CANCEL_OFFLINE:
+		mutex_lock(&ksm_thread_mutex);
+		ksm_run &= ~KSM_RUN_OFFLINE;
 		mutex_unlock(&ksm_thread_mutex);
+
+		smp_mb();	/* wake_up_bit advises this */
+		wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
 		break;
 	}
 	return NOTIFY_OK;
 }
+#else
+static void wait_while_offlining(void)
+{
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 #ifdef CONFIG_SYSFS
@@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan);
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
-	return sprintf(buf, "%u\n", ksm_run);
+	return sprintf(buf, "%lu\n", ksm_run);
 }
 
 static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
@@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject
 	 */
 
 	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
 	if (ksm_run != flags) {
 		ksm_run = flags;
 		if (flags & KSM_RUN_UNMERGE) {
@@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store(
 		return -EINVAL;
 
 	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
 	if (ksm_merge_across_nodes != knob) {
 		if (ksm_pages_shared || remove_all_stable_nodes())
 			err = -EBUSY;
@@ -2366,10 +2396,7 @@ static int __init ksm_init(void)
 #endif /* CONFIG_SYSFS */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-	/*
-	 * Choose a high priority since the callback takes ksm_thread_mutex:
-	 * later callbacks could only be taking locks which nest within that.
-	 */
+	/* There is no significance to this priority 100 */
 	hotplug_memory_notifier(ksm_memory_callback, 100);
 #endif
 	return 0;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
@ 2013-01-27  1:14   ` Simon Jeons
  2013-01-27  2:54     ` Hugh Dickins
  2013-01-28 23:03   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  1:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

Hi Hugh,
On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> From: Petr Holasek <pholasek@redhat.com>
> 
> Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which control merging pages across different numa nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
> 
> Typical use-case could be a lot of KVM guests on NUMA machine
> and cpus from more distant nodes would have significant increase
> of access latency to the merged ksm page. Sysfs knob was choosen
> for higher variability when some users still prefers higher amount
> of saved physical memory regardless of access latency.
> 
> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_across_nodes value is possible
> only when there are not any ksm shared pages in system.
> 
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times in percentage of average
> were following:
> 
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes	1.7%
> 
> merge_across_nodes=0
> 2 nodes	1%
> 4 nodes	0.32%
> 8 nodes	0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
> 
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>    on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>    here for !merge_across_nodes to migrate nodes to the proper tree.
> 
> Signed-off-by: Petr Holasek <pholasek@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  Documentation/vm/ksm.txt |    7 +
>  mm/ksm.c                 |  151 ++++++++++++++++++++++++++++++++-----
>  2 files changed, 139 insertions(+), 19 deletions(-)
> 
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +
>  run              - set 0 to stop ksmd from running but keep merged pages,
>                     set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>                     set 2 to stop ksmd and unmerge all pages currently merged,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>  	struct mm_struct *mm;
>  	unsigned long address;		/* + low bits used for flags below */
>  	unsigned int oldchecksum;	/* when unstable */
> +#ifdef CONFIG_NUMA
> +	unsigned int nid;
> +#endif
>  	union {
>  		struct rb_node node;	/* when node of unstable tree */
>  		struct {		/* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG	0x200	/* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>  
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2
> @@ -441,10 +448,25 @@ out:		page = NULL;
>  	return page;
>  }
>  
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> +	if (ksm_merge_across_nodes)
> +		return 0;
> +	else
> +		return pfn_to_nid(kpfn);
> +}
> +
>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>  	struct rmap_item *rmap_item;
>  	struct hlist_node *hlist;
> +	int nid;
>  
>  	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
> @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree
>  		cond_resched();
>  	}
>  
> -	rb_erase(&stable_node->node, &root_stable_tree);
> +	nid = get_kpfn_nid(stable_node->kpfn);
> +
> +	rb_erase(&stable_node->node, &root_stable_tree[nid]);
>  	free_stable_node(stable_node);
>  }
>  
> @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s
>  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
>  		BUG_ON(age > 1);
>  		if (!age)
> -			rb_erase(&rmap_item->node, &root_unstable_tree);
> +#ifdef CONFIG_NUMA
> +			rb_erase(&rmap_item->node,
> +					&root_unstable_tree[rmap_item->nid]);
> +#else
> +			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> +#endif
>  
>  		ksm_pages_unshared--;
>  		rmap_item->address &= PAGE_MASK;
> @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -	struct rb_node *node = root_stable_tree.rb_node;
> +	struct rb_node *node;
>  	struct stable_node *stable_node;
> +	int nid;
>  
>  	stable_node = page_stable_node(page);
>  	if (stable_node) {			/* ksm page forked */
> @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s
>  		return page;
>  	}
>  
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	node = root_stable_tree[nid].rb_node;
> +
>  	while (node) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s
>   */
>  static struct stable_node *stable_tree_insert(struct page *kpage)
>  {
> -	struct rb_node **new = &root_stable_tree.rb_node;
> +	int nid;
> +	unsigned long kpfn;
> +	struct rb_node **new;
>  	struct rb_node *parent = NULL;
>  	struct stable_node *stable_node;
>  
> +	kpfn = page_to_pfn(kpage);
> +	nid = get_kpfn_nid(kpfn);
> +	new = &root_stable_tree[nid].rb_node;
> +
>  	while (*new) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i
>  		return NULL;
>  
>  	rb_link_node(&stable_node->node, parent, new);
> -	rb_insert_color(&stable_node->node, &root_stable_tree);
> +	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
>  	INIT_HLIST_HEAD(&stable_node->hlist);
>  
> -	stable_node->kpfn = page_to_pfn(kpage);
> +	stable_node->kpfn = kpfn;
>  	set_page_stable_node(kpage, stable_node);
>  
>  	return stable_node;
> @@ -1098,10 +1137,15 @@ static
>  struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>  					      struct page *page,
>  					      struct page **tree_pagep)
> -
>  {
> -	struct rb_node **new = &root_unstable_tree.rb_node;
> +	struct rb_node **new;
> +	struct rb_root *root;
>  	struct rb_node *parent = NULL;
> +	int nid;
> +
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	root = &root_unstable_tree[nid];
> +	new = &root->rb_node;
>  
>  	while (*new) {
>  		struct rmap_item *tree_rmap_item;
> @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
>  			return NULL;
>  		}
>  
> +		/*
> +		 * If tree_page has been migrated to another NUMA node, it
> +		 * will be flushed out and put into the right unstable tree

Then why not insert the new page to unstable tree during page migration
against current upstream? Because default behavior is merge across
nodes.

> +		 * next time: only merge with it if merge_across_nodes.
> +		 * Just notice, we don't have similar problem for PageKsm
> +		 * because their migration is disabled now. (62b61f611e)
> +		 */
> +		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> +			put_page(tree_page);
> +			return NULL;
> +		}
> +
>  		ret = memcmp_pages(page, tree_page);
>  
>  		parent = *new;
> @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i
>  
>  	rmap_item->address |= UNSTABLE_FLAG;
>  	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +#ifdef CONFIG_NUMA
> +	rmap_item->nid = nid;
> +#endif
>  	rb_link_node(&rmap_item->node, parent, new);
> -	rb_insert_color(&rmap_item->node, &root_unstable_tree);
> +	rb_insert_color(&rmap_item->node, root);
>  
>  	ksm_pages_unshared++;
>  	return NULL;
> @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i
>  static void stable_tree_append(struct rmap_item *rmap_item,
>  			       struct stable_node *stable_node)
>  {
> +#ifdef CONFIG_NUMA
> +	/*
> +	 * Usually rmap_item->nid is already set correctly,
> +	 * but it may be wrong after switching merge_across_nodes.
> +	 */
> +	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
> +#endif
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r
>  	struct mm_slot *slot;
>  	struct vm_area_struct *vma;
>  	struct rmap_item *rmap_item;
> +	int nid;
>  
>  	if (list_empty(&ksm_mm_head.mm_list))
>  		return NULL;
> @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r
>  		 */
>  		lru_add_drain_all();
>  
> -		root_unstable_tree = RB_ROOT;
> +		for (nid = 0; nid < nr_node_ids; nid++)
> +			root_unstable_tree[nid] = RB_ROOT;
>  
>  		spin_lock(&ksm_mmlist_lock);
>  		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta
>  						 unsigned long end_pfn)
>  {
>  	struct rb_node *node;
> +	int nid;
>  
> -	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> -		struct stable_node *stable_node;
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		for (node = rb_first(&root_stable_tree[nid]); node;
> +				node = rb_next(node)) {
> +			struct stable_node *stable_node;
> +
> +			stable_node = rb_entry(node, struct stable_node, node);
> +			if (stable_node->kpfn >= start_pfn &&
> +			    stable_node->kpfn < end_pfn)
> +				return stable_node;
> +		}
>  
> -		stable_node = rb_entry(node, struct stable_node, node);
> -		if (stable_node->kpfn >= start_pfn &&
> -		    stable_node->kpfn < end_pfn)
> -			return stable_node;
> -	}
>  	return NULL;
>  }
>  
> @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject
>  }
>  KSM_ATTR(run);
>  
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> +				   struct kobj_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long knob;
> +
> +	err = kstrtoul(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	if (knob > 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	if (ksm_merge_across_nodes != knob) {
> +		if (ksm_pages_shared)
> +			err = -EBUSY;
> +		else
> +			ksm_merge_across_nodes = knob;
> +	}
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return err ? err : count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>  				 struct kobj_attribute *attr, char *buf)
>  {
> @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = {
>  	&pages_unshared_attr.attr,
>  	&pages_volatile_attr.attr,
>  	&full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&merge_across_nodes_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> @@ -1992,11 +2101,15 @@ static int __init ksm_init(void)
>  {
>  	struct task_struct *ksm_thread;
>  	int err;
> +	int nid;
>  
>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
>  
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		root_stable_tree[nid] = RB_ROOT;
> +
>  	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
>  	if (IS_ERR(ksm_thread)) {
>  		printk(KERN_ERR "ksm: creating kthread failed\n");
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
@ 2013-01-27  2:36   ` Simon Jeons
  2013-01-27 22:08     ` Hugh Dickins
  2013-01-27  2:48   ` Simon Jeons
  2013-02-05 17:18   ` Mel Gorman
  2 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  2:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

Hi Hugh, 
On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote:
> In some places where get_ksm_page() is used, we need the page to be locked.
> 

In get_ksm_page(), why the sequence check page->mapping =>
get_page_unless_zero => re-check page->mapping, rather than just
get_page_unless_zero => check page->mapping?  Is it because
get_page_unless_zero is expensive?

> When KSM migration is fully enabled, we shall want that to make sure that
> the page just acquired cannot be migrated beneath us (raised page count is
> only effective when there is serialization to make sure migration notices).
> Whereas when navigating through the stable tree, we certainly do not want

What's the meaning of "navigating through the stable tree"?

> to lock each node (raised page count is enough to guarantee the memcmps,
> even if page is migrated to another node).
> 
> Since we're about to add another use case, add the locked argument to
> get_ksm_page() now.

Why the parameter lock passed from stable_tree_search/insert is true,
but remove_rmap_item_from_tree is false?

> 
> Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
> really got the wrong end of the stick on that!  There's a configuration
> in which page_cache_get_speculative() can do something cheaper than
> get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
> disabled preemption for it.  There's no need for rcu_read_lock() around
> get_page_unless_zero() (and mapping checks) here.  Cut out that
> silliness before making this any harder to understand.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
>   * but this is different - made simpler by ksm_thread_mutex being held, but
>   * interesting for assuming that no other use of the struct page could ever
>   * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).  The RCU calls are not for KSM at all, but
> - * to keep the page_count protocol described with page_cache_get_speculative.
> + * coincides with page->mapping).
>   *
>   * Note: it is possible that get_ksm_page() will return NULL one moment,
>   * then page the next, if the page is in between page_freeze_refs() and
>   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
>   * is on its way to being freed; but it is an anomaly to bear in mind.
>   */
> -static struct page *get_ksm_page(struct stable_node *stable_node)
> +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
>  {
>  	struct page *page;
>  	void *expected_mapping;
> @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct
>  	page = pfn_to_page(stable_node->kpfn);
>  	expected_mapping = (void *)stable_node +
>  				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> -	rcu_read_lock();
>  	if (page->mapping != expected_mapping)
>  		goto stale;
>  	if (!get_page_unless_zero(page))
> @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct
>  		put_page(page);
>  		goto stale;
>  	}
> -	rcu_read_unlock();
> +	if (locked) {
> +		lock_page(page);
> +		if (page->mapping != expected_mapping) {
> +			unlock_page(page);
> +			put_page(page);
> +			goto stale;
> +		}
> +	}
>  	return page;
>  stale:
> -	rcu_read_unlock();
>  	remove_node_from_stable_tree(stable_node);
>  	return NULL;
>  }
> @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s
>  		struct page *page;
>  
>  		stable_node = rmap_item->head;
> -		page = get_ksm_page(stable_node);
> +		page = get_ksm_page(stable_node, true);
>  		if (!page)
>  			goto out;
>  
> -		lock_page(page);
>  		hlist_del(&rmap_item->hlist);
>  		unlock_page(page);
>  		put_page(page);
> @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s
>  
>  		cond_resched();
>  		stable_node = rb_entry(node, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node);
> +		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
>  
> @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i
>  
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node);
> +		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
>  
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
  2013-01-27  2:36   ` Simon Jeons
@ 2013-01-27  2:48   ` Simon Jeons
  2013-01-27 22:10     ` Hugh Dickins
  2013-02-05 17:18   ` Mel Gorman
  2 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  2:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote:
> In some places where get_ksm_page() is used, we need the page to be locked.
> 
> When KSM migration is fully enabled, we shall want that to make sure that
> the page just acquired cannot be migrated beneath us (raised page count is
> only effective when there is serialization to make sure migration notices).
> Whereas when navigating through the stable tree, we certainly do not want
> to lock each node (raised page count is enough to guarantee the memcmps,
> even if page is migrated to another node).
> 
> Since we're about to add another use case, add the locked argument to
> get_ksm_page() now.
> 
> Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
> really got the wrong end of the stick on that!  There's a configuration
> in which page_cache_get_speculative() can do something cheaper than
> get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
> disabled preemption for it.  There's no need for rcu_read_lock() around
> get_page_unless_zero() (and mapping checks) here.  Cut out that
> silliness before making this any harder to understand.

BTW, what's the meaning of "ksm page forked"?

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
>   * but this is different - made simpler by ksm_thread_mutex being held, but
>   * interesting for assuming that no other use of the struct page could ever
>   * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).  The RCU calls are not for KSM at all, but
> - * to keep the page_count protocol described with page_cache_get_speculative.
> + * coincides with page->mapping).
>   *
>   * Note: it is possible that get_ksm_page() will return NULL one moment,
>   * then page the next, if the page is in between page_freeze_refs() and
>   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
>   * is on its way to being freed; but it is an anomaly to bear in mind.
>   */
> -static struct page *get_ksm_page(struct stable_node *stable_node)
> +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
>  {
>  	struct page *page;
>  	void *expected_mapping;
> @@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct
>  	page = pfn_to_page(stable_node->kpfn);
>  	expected_mapping = (void *)stable_node +
>  				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> -	rcu_read_lock();
>  	if (page->mapping != expected_mapping)
>  		goto stale;
>  	if (!get_page_unless_zero(page))
> @@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct
>  		put_page(page);
>  		goto stale;
>  	}
> -	rcu_read_unlock();
> +	if (locked) {
> +		lock_page(page);
> +		if (page->mapping != expected_mapping) {
> +			unlock_page(page);
> +			put_page(page);
> +			goto stale;
> +		}
> +	}
>  	return page;
>  stale:
> -	rcu_read_unlock();
>  	remove_node_from_stable_tree(stable_node);
>  	return NULL;
>  }
> @@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s
>  		struct page *page;
>  
>  		stable_node = rmap_item->head;
> -		page = get_ksm_page(stable_node);
> +		page = get_ksm_page(stable_node, true);
>  		if (!page)
>  			goto out;
>  
> -		lock_page(page);
>  		hlist_del(&rmap_item->hlist);
>  		unlock_page(page);
>  		put_page(page);
> @@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s
>  
>  		cond_resched();
>  		stable_node = rb_entry(node, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node);
> +		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
>  
> @@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i
>  
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node);
> +		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
>  
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-27  1:14   ` Simon Jeons
@ 2013-01-27  2:54     ` Hugh Dickins
  2013-01-27  3:16       ` Simon Jeons
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27  2:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> > From: Petr Holasek <pholasek@redhat.com>
> > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> >  			return NULL;
> >  		}
> >  
> > +		/*
> > +		 * If tree_page has been migrated to another NUMA node, it
> > +		 * will be flushed out and put into the right unstable tree
> 
> Then why not insert the new page to unstable tree during page migration
> against current upstream? Because default behavior is merge across
> nodes.

I don't understand the words "against current upstream" in your question.

We cannot move a page (strictly, a node) from one tree to another during
page migration itself, because the necessary ksm_thread_mutex is not held.
Nor would we even want to while "merge across nodes".

Ah, perhaps you are pointing out that in current upstream, the only user
of ksm page migration is memory hotremove, which in current upstream does
hold ksm_thread_mutex.

So you'd like us to add code for moving a node from one tree to another
in ksm_migrate_page() (and what would it do when it collides with an
existing node?), code which will then be removed a few patches later
when ksm page migration is fully enabled?

No, I'm not going to put any more thought into that.  When Andrea pointed
out the problem with Petr's original change to ksm_migrate_page(), I did
indeed think that we could do something cleverer at that point; but once
I got down to trying it, found that a dead end.  I wasn't going to be
able to test the hotremove case properly anyway, so no good pursuing
solutions that couldn't be generalized.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-27  2:54     ` Hugh Dickins
@ 2013-01-27  3:16       ` Simon Jeons
  2013-01-27 21:55         ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  3:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> > > From: Petr Holasek <pholasek@redhat.com>
> > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> > >  			return NULL;
> > >  		}
> > >  
> > > +		/*
> > > +		 * If tree_page has been migrated to another NUMA node, it
> > > +		 * will be flushed out and put into the right unstable tree
> > 
> > Then why not insert the new page to unstable tree during page migration
> > against current upstream? Because default behavior is merge across
> > nodes.
> 
> I don't understand the words "against current upstream" in your question.

I mean the current upstream code, without numa awareness. :)

> 
> We cannot move a page (strictly, a node) from one tree to another during
> page migration itself, because the necessary ksm_thread_mutex is not held.
> Nor would we even want to while "merge across nodes".
> 
> Ah, perhaps you are pointing out that in current upstream, the only user
> of ksm page migration is memory hotremove, which in current upstream does
> hold ksm_thread_mutex.
> 
> So you'd like us to add code for moving a node from one tree to another
> in ksm_migrate_page() (and what would it do when it collides with an

Without numa awareness, I still can't follow your explanation: why can't
the node be inserted into the tree just after page migration, instead of
at the next scan?

> existing node?), code which will then be removed a few patches later
> when ksm page migration is fully enabled?
> 
> No, I'm not going to put any more thought into that.  When Andrea pointed
> out the problem with Petr's original change to ksm_migrate_page(), I did
> indeed think that we could do something cleverer at that point; but once
> I got down to trying it, found that a dead end.  I wasn't going to be
> able to test the hotremove case properly anyway, so no good pursuing
> solutions that couldn't be generalized.
> 
> Hugh



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
@ 2013-01-27  4:55   ` Simon Jeons
  2013-01-27 23:05     ` Hugh Dickins
  2013-01-28  2:12   ` Simon Jeons
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  4:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

Hi Hugh,
On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree.  It's not something
> that people will often want to do, but it would be lame to demand a reboot
> when they're trying to determine which merge_across_nodes setting is best.
> 
> How can this happen?  We only permit switching merge_across_nodes when
> pages_shared is 0, and usually set run 2 to force that beforehand, which
> ought to unmerge everything: yet oopses still occur when you then run 1.
> 
> Three causes:
> 
> 1. The old stable tree (built according to the inverse merge_across_nodes)
> has not been fully torn down.  A stable node lingers until get_ksm_page()
> notices that the page it references no longer references it: but the page
> is not necessarily freed as soon as expected, particularly when swapcache.
> 

When can this happen?  

> Fix this with a pass through the old stable tree, applying get_ksm_page()
> to each of the remaining nodes (most found stale and removed immediately),
> with forced removal of any left over.  Unless the page is still mapped:
> I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> and EBUSY than BUG.
> 
> 2. __ksm_enter() has a nice little optimization, to insert the new mm
> just behind ksmd's cursor, so there's a full pass for it to stabilize
> (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> but not so nice when we're trying to unmerge all mms: we were missing
> those mms forked and inserted behind the unmerge cursor.  Easily fixed
> by inserting at the end when KSM_RUN_UNMERGE.

mms forked will be unmerged just after ksmd's cursor, since they're
inserted behind it; so why will they be missed?

> 
> 3. It is possible for a KSM page to be faulted back from swapcache into
> an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
> 

Makes sense. :)

> A long outstanding, unrelated bugfix sneaks in with that third fix:
> ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
> I/O error when read in from swap) to a page which it then marks Uptodate.
> Fix this case by not copying, letting do_swap_page() discover the error.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/ksm.h |   18 ++-------
>  mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c         |   19 ++++-----
>  3 files changed, 92 insertions(+), 28 deletions(-)
> 
> --- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
> @@ -16,9 +16,6 @@
>  struct stable_node;
>  struct mem_cgroup;
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address);
> -
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
> @@ -73,15 +70,8 @@ static inline void set_page_stable_node(
>   * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
>   * but what if the vma was unmerged while the page was swapped out?
>   */
> -static inline int ksm_might_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address)
> -{
> -	struct anon_vma *anon_vma = page_anon_vma(page);
> -
> -	return anon_vma &&
> -		(anon_vma->root != vma->anon_vma->root ||
> -		 page->index != linear_page_index(vma, address));
> -}
> +struct page *ksm_might_need_to_copy(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address);
>  
>  int page_referenced_ksm(struct page *page,
>  			struct mem_cgroup *memcg, unsigned long *vm_flags);
> @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
>  	return 0;
>  }
>  
> -static inline int ksm_might_need_to_copy(struct page *page,
> +static inline struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> -	return 0;
> +	return page;
>  }
>  
>  static inline int page_referenced_ksm(struct page *page,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
>  /*
>   * Only called through the sysfs control interface:
>   */
> +static int remove_stable_node(struct stable_node *stable_node)
> +{
> +	struct page *page;
> +	int err;
> +
> +	page = get_ksm_page(stable_node, true);
> +	if (!page) {
> +		/*
> +		 * get_ksm_page did remove_node_from_stable_tree itself.
> +		 */
> +		return 0;
> +	}
> +
> +	if (WARN_ON_ONCE(page_mapped(page)))
> +		err = -EBUSY;
> +	else {
> +		/*
> +		 * This page might be in a pagevec waiting to be freed,
> +		 * or it might be PageSwapCache (perhaps under writeback),
> +		 * or it might have been removed from swapcache a moment ago.
> +		 */
> +		set_page_stable_node(page, NULL);
> +		remove_node_from_stable_tree(stable_node);
> +		err = 0;
> +	}
> +
> +	unlock_page(page);
> +	put_page(page);
> +	return err;
> +}
> +
> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;
> +				break;	/* proceed to next nid */
> +			}
> +			cond_resched();
> +		}
> +	}
> +	return err;
> +}
> +
>  static int unmerge_and_remove_all_rmap_items(void)
>  {
>  	struct mm_slot *mm_slot;
> @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i
>  		}
>  	}
>  
> +	/* Clean up stable nodes, but don't worry if some are still busy */
> +	remove_all_stable_nodes();
>  	ksm_scan.seqnr = 0;
>  	return 0;
>  
> @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm)
>  	spin_lock(&ksm_mmlist_lock);
>  	insert_to_mm_slots_hash(mm, mm_slot);
>  	/*
> -	 * Insert just behind the scanning cursor, to let the area settle
> +	 * When KSM_RUN_MERGE (or KSM_RUN_STOP),
> +	 * insert just behind the scanning cursor, to let the area settle
>  	 * down a little; when fork is followed by immediate exec, we don't
>  	 * want ksmd to waste time setting up and tearing down an rmap_list.
> +	 *
> +	 * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
> +	 * scanning cursor, otherwise KSM pages in newly forked mms will be
> +	 * missed: then we might as well insert at the end of the list.
>  	 */
> -	list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> +	if (ksm_run & KSM_RUN_UNMERGE)
> +		list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
> +	else
> +		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
>  	spin_unlock(&ksm_mmlist_lock);
>  
>  	set_bit(MMF_VM_MERGEABLE, &mm->flags);
> @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm)
>  	}
>  }
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> +struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> +	struct anon_vma *anon_vma = page_anon_vma(page);
>  	struct page *new_page;
>  
> +	if (PageKsm(page)) {
> +		if (page_stable_node(page) &&
> +		    !(ksm_run & KSM_RUN_UNMERGE))
> +			return page;	/* no need to copy it */
> +	} else if (!anon_vma) {
> +		return page;		/* no need to copy it */
> +	} else if (anon_vma->root == vma->anon_vma->root &&
> +		 page->index == linear_page_index(vma, address)) {
> +		return page;		/* still no need to copy it */
> +	}
> +	if (!PageUptodate(page))
> +		return page;		/* let do_swap_page report the error */
> +
>  	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
>  	if (new_page) {
>  		copy_user_highpage(new_page, page, address, vma);
> @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store(
>  
>  	mutex_lock(&ksm_thread_mutex);
>  	if (ksm_merge_across_nodes != knob) {
> -		if (ksm_pages_shared)
> +		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
>  		else
>  			ksm_merge_across_nodes = knob;
> --- mmotm.orig/mm/memory.c	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/mm/memory.c	2013-01-25 14:37:00.768206145 -0800
> @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct
>  	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
>  		goto out_page;
>  
> -	if (ksm_might_need_to_copy(page, vma, address)) {
> -		swapcache = page;
> -		page = ksm_does_need_to_copy(page, vma, address);
> -
> -		if (unlikely(!page)) {
> -			ret = VM_FAULT_OOM;
> -			page = swapcache;
> -			swapcache = NULL;
> -			goto out_page;
> -		}
> +	swapcache = page;
> +	page = ksm_might_need_to_copy(page, vma, address);
> +	if (unlikely(!page)) {
> +		ret = VM_FAULT_OOM;
> +		page = swapcache;
> +		swapcache = NULL;
> +		goto out_page;
>  	}
> +	if (page == swapcache)
> +		swapcache = NULL;
>  
>  	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
>  		ret = VM_FAULT_OOM;
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-26  2:03 ` [PATCH 7/11] ksm: make KSM page migration possible Hugh Dickins
@ 2013-01-27  5:47   ` Simon Jeons
  2013-01-27 23:12     ` Hugh Dickins
  2013-02-05 19:11   ` Mel Gorman
  1 sibling, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  5:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote:
> KSM page migration is already supported in the case of memory hotremove,
> which takes the ksm_thread_mutex across all its migrations to keep life
> simple.
> 
> But the new KSM NUMA merge_across_nodes knob introduces a problem, when
> it's set to non-default 0: if a KSM page is migrated to a different NUMA
> node, how do we migrate its stable node to the right tree?  And what if
> that collides with an existing stable node?
> 
> So far there's no provision for that, and this patch does not attempt
> to deal with it either.  But how will I test a solution, when I don't
> know how to hotremove memory?  The best answer is to enable KSM page
> migration in all cases now, and test more common cases.  With THP and
> compaction added since KSM came in, page migration is now mainstream,
> and it's a shame that a KSM page can frustrate freeing a page block.
> 
> Without worrying about merge_across_nodes 0 for now, this patch gets
> KSM page migration working reliably for default merge_across_nodes 1
> (but leave the patch enabling it until near the end of the series).
> 
> It's much simpler than I'd originally imagined, and does not require
> an additional tier of locking: page migration relies on the page lock,
> KSM page reclaim relies on the page lock, the page lock is enough for
> KSM page migration too.
> 
> Almost all the care has to be in get_ksm_page(): that's the function
> which worries about when a stable node is stale and should be freed,
> now it also has to worry about the KSM page being migrated.
> 
> The only new overhead is an additional put/get/lock/unlock_page when
> stable_tree_search() arrives at a matching node: to make sure migration
> respects the raised page count, and so does not migrate the page while
> we're busy with it here.  That's probably avoidable, either by changing
> internal interfaces from using kpage to stable_node, or by moving the
> ksm_migrate_page() callsite into a page_freeze_refs() section (even if
> not swapcache); but this works well, I've no urge to pull it apart now.
> 
> (Descents of the stable tree may pass through nodes whose KSM pages are
> under migration: being unlocked, the raised page count does not prevent
> that, nor need it: it's safe to memcmp against either old or new page.)
> 
> You might worry about mremap, and whether page migration's rmap_walk
> to remove migration entries will find all the KSM locations where it
> inserted earlier: that should already be handled, by the satisfyingly
> heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c     |   94 ++++++++++++++++++++++++++++++++++++++-----------
>  mm/migrate.c |    5 ++
>  2 files changed, 77 insertions(+), 22 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
> @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree
>   * In which case we can trust the content of the page, and it
>   * returns the gotten page; but if the page has now been zapped,
>   * remove the stale node from the stable tree and return NULL.
> + * But beware, the stable node's page might be being migrated.
>   *
>   * You would expect the stable_node to hold a reference to the ksm page.
>   * But if it increments the page's count, swapping out has to wait for
> @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree
>   * pointing back to this stable node.  This relies on freeing a PageAnon
>   * page to reset its page->mapping to NULL, and relies on no other use of
>   * a page to put something that might look like our key in page->mapping.
> - *
> - * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
> - * but this is different - made simpler by ksm_thread_mutex being held, but
> - * interesting for assuming that no other use of the struct page could ever
> - * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).
> - *
> - * Note: it is possible that get_ksm_page() will return NULL one moment,
> - * then page the next, if the page is in between page_freeze_refs() and
> - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
>   * is on its way to being freed; but it is an anomaly to bear in mind.
>   */
>  static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
>  {
>  	struct page *page;
>  	void *expected_mapping;
> +	unsigned long kpfn;
>  
> -	page = pfn_to_page(stable_node->kpfn);
>  	expected_mapping = (void *)stable_node +
>  				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> -	if (page->mapping != expected_mapping)
> -		goto stale;
> -	if (!get_page_unless_zero(page))
> +again:
> +	kpfn = ACCESS_ONCE(stable_node->kpfn);
> +	page = pfn_to_page(kpfn);
> +
> +	/*
> +	 * page is computed from kpfn, so on most architectures reading
> +	 * page->mapping is naturally ordered after reading node->kpfn,
> +	 * but on Alpha we need to be more careful.
> +	 */
> +	smp_read_barrier_depends();
> +	if (ACCESS_ONCE(page->mapping) != expected_mapping)
>  		goto stale;
> -	if (page->mapping != expected_mapping) {
> +
> +	/*
> +	 * We cannot do anything with the page while its refcount is 0.
> +	 * Usually 0 means free, or tail of a higher-order page: in which
> +	 * case this node is no longer referenced, and should be freed;
> +	 * however, it might mean that the page is under page_freeze_refs().
> +	 * The __remove_mapping() case is easy, again the node is now stale;
> +	 * but if page is swapcache in migrate_page_move_mapping(), it might
> +	 * still be our page, in which case it's essential to keep the node.
> +	 */
> +	while (!get_page_unless_zero(page)) {
> +		/*
> +		 * Another check for page->mapping != expected_mapping would
> +		 * work here too.  We have chosen the !PageSwapCache test to
> +		 * optimize the common case, when the page is or is about to
> +		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
> +		 * in the freeze_refs section of __remove_mapping(); but Anon
> +		 * page->mapping reset to NULL later, in free_pages_prepare().
> +		 */
> +		if (!PageSwapCache(page))
> +			goto stale;
> +		cpu_relax();
> +	}
> +
> +	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
>  		put_page(page);
>  		goto stale;
>  	}
> +
>  	if (locked) {
>  		lock_page(page);
> -		if (page->mapping != expected_mapping) {
> +		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
>  			unlock_page(page);
>  			put_page(page);
>  			goto stale;
>  		}
>  	}

Could you explain why page->mapping needs to be checked twice after getting the page?

>  	return page;
> +
>  stale:
> +	/*
> +	 * We come here from above when page->mapping or !PageSwapCache
> +	 * suggests that the node is stale; but it might be under migration.
> +	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
> +	 * before checking whether node->kpfn has been changed.
> +	 */
> +	smp_rmb();
> +	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
> +		goto again;
>  	remove_node_from_stable_tree(stable_node);
>  	return NULL;
>  }
> @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s
>  			return NULL;
>  
>  		ret = memcmp_pages(page, tree_page);
> +		put_page(tree_page);
>  
> -		if (ret < 0) {
> -			put_page(tree_page);
> +		if (ret < 0)
>  			node = node->rb_left;
> -		} else if (ret > 0) {
> -			put_page(tree_page);
> +		else if (ret > 0)
>  			node = node->rb_right;
> -		} else
> +		else {
> +			/*
> +			 * Lock and unlock the stable_node's page (which
> +			 * might already have been migrated) so that page
> +			 * migration is sure to notice its raised count.
> +			 * It would be more elegant to return stable_node
> +			 * than kpage, but that involves more changes.
> +			 */
> +			tree_page = get_ksm_page(stable_node, true);
> +			if (tree_page)
> +				unlock_page(tree_page);
>  			return tree_page;
> +		}
>  	}
>  
>  	return NULL;
> @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
>  	if (stable_node) {
>  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
>  		stable_node->kpfn = page_to_pfn(newpage);
> +		/*
> +		 * newpage->mapping was set in advance; now we need smp_wmb()
> +		 * to make sure that the new stable_node->kpfn is visible
> +		 * to get_ksm_page() before it can see that oldpage->mapping
> +		 * has gone stale (or that PageSwapCache has been cleared).
> +		 */
> +		smp_wmb();
> +		set_page_stable_node(oldpage, NULL);
>  	}
>  }
>  #endif /* CONFIG_MIGRATION */
> --- mmotm.orig/mm/migrate.c	2013-01-25 14:27:58.140193249 -0800
> +++ mmotm/mm/migrate.c	2013-01-25 14:37:03.832206218 -0800
> @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp
>  
>  	mlock_migrate_page(newpage, page);
>  	ksm_migrate_page(newpage, page);
> -
> +	/*
> +	 * Please do not reorder this without considering how mm/ksm.c's
> +	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
> +	 */
>  	ClearPageSwapCache(page);
>  	ClearPagePrivate(page);
>  	set_page_private(page, 0);
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/11] ksm: stop hotremove lockdep warning
  2013-01-26  2:10 ` [PATCH 11/11] ksm: stop hotremove lockdep warning Hugh Dickins
@ 2013-01-27  6:23   ` Simon Jeons
  2013-01-27 23:35     ` Hugh Dickins
  2013-02-08 18:45   ` Gerald Schaefer
  1 sibling, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  6:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Gerald Schaefer, KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote:
> Complaints are rare, but lockdep still does not understand the way
> ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
> holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
> to be a problem because notifier callbacks are made under down_read
> of blocking_notifier_head->rwsem (so first the mutex is taken while
> holding the rwsem, then later the rwsem is taken while still holding
> the mutex); but is not in fact a problem because mem_hotplug_mutex
> is held throughout the dance.
> 
> There was an attempt to fix this with mutex_lock_nested(); but if that
> happened to fool lockdep two years ago, apparently it does so no longer.
> 
> I had hoped to eradicate this issue in extending KSM page migration not
> to need the ksm_thread_mutex.  But then realized that although the page
> migration itself is safe, we do still need to lock out ksmd and other
> users of get_ksm_page() while offlining memory - at some point between
> MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
> vanish, and get_ksm_page()'s accesses to them become a violation.
> 
> So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
> MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
> checks, to achieve the same lockout without being caught by lockdep.
> This is less elegant for KSM, but it's more important to keep lockdep
> useful to other users - and I apologize for how long it took to fix.
> 
> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   55 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 41 insertions(+), 14 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:38:53.984208836 -0800
> @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2
> -static unsigned int ksm_run = KSM_RUN_STOP;
> +#define KSM_RUN_OFFLINE	4
> +static unsigned long ksm_run = KSM_RUN_STOP;
> +static void wait_while_offlining(void);
>  
>  static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
>  static DEFINE_MUTEX(ksm_thread_mutex);
> @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing
>  
>  	while (!kthread_should_stop()) {
>  		mutex_lock(&ksm_thread_mutex);
> +		wait_while_offlining();
>  		if (ksmd_should_run())
>  			ksm_do_scan(ksm_thread_pages_to_scan);
>  		mutex_unlock(&ksm_thread_mutex);
> @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> +static int just_wait(void *word)
> +{
> +	schedule();
> +	return 0;
> +}
> +
> +static void wait_while_offlining(void)
> +{
> +	while (ksm_run & KSM_RUN_OFFLINE) {
> +		mutex_unlock(&ksm_thread_mutex);
> +		wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE),
> +				just_wait, TASK_UNINTERRUPTIBLE);
> +		mutex_lock(&ksm_thread_mutex);
> +	}
> +}
> +
>  static void ksm_check_stable_tree(unsigned long start_pfn,
>  				  unsigned long end_pfn)
>  {
> @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
>  	switch (action) {
>  	case MEM_GOING_OFFLINE:
>  		/*
> -		 * Keep it very simple for now: just lock out ksmd and
> -		 * MADV_UNMERGEABLE while any memory is going offline.
> -		 * mutex_lock_nested() is necessary because lockdep was alarmed
> -		 * that here we take ksm_thread_mutex inside notifier chain
> -		 * mutex, and later take notifier chain mutex inside
> -		 * ksm_thread_mutex to unlock it.   But that's safe because both
> -		 * are inside mem_hotplug_mutex.
> +		 * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
> +		 * and remove_all_stable_nodes() while memory is going offline:
> +		 * it is unsafe for them to touch the stable tree at this time.
> +		 * But unmerge_ksm_pages(), rmap lookups and other entry points

Why is unmerge_ksm_pages() beneath us safe for ksm memory hotremove?

> +		 * which do not need the ksm_thread_mutex are all safe.
>  		 */
> -		mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run |= KSM_RUN_OFFLINE;
> +		mutex_unlock(&ksm_thread_mutex);
>  		break;
>  
>  	case MEM_OFFLINE:
> @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no
>  		/* fallthrough */
>  
>  	case MEM_CANCEL_OFFLINE:
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run &= ~KSM_RUN_OFFLINE;
>  		mutex_unlock(&ksm_thread_mutex);
> +
> +		smp_mb();	/* wake_up_bit advises this */
> +		wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
>  		break;
>  	}
>  	return NOTIFY_OK;
>  }
> +#else
> +static void wait_while_offlining(void)
> +{
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  #ifdef CONFIG_SYSFS
> @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan);
>  static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			char *buf)
>  {
> -	return sprintf(buf, "%u\n", ksm_run);
> +	return sprintf(buf, "%lu\n", ksm_run);
>  }
>  
>  static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject
>  	 */
>  
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_run != flags) {
>  		ksm_run = flags;
>  		if (flags & KSM_RUN_UNMERGE) {
> @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store(
>  		return -EINVAL;
>  
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_merge_across_nodes != knob) {
>  		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
> @@ -2366,10 +2396,7 @@ static int __init ksm_init(void)
>  #endif /* CONFIG_SYSFS */
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -	/*
> -	 * Choose a high priority since the callback takes ksm_thread_mutex:
> -	 * later callbacks could only be taking locks which nest within that.
> -	 */
> +	/* There is no significance to this priority 100 */
>  	hotplug_memory_notifier(ksm_memory_callback, 100);
>  #endif
>  	return 0;
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
  2013-01-26  2:05 ` [PATCH 8/11] ksm: make !merge_across_nodes migration safe Hugh Dickins
@ 2013-01-27  8:49   ` Simon Jeons
  2013-01-27 23:25     ` Hugh Dickins
  2013-01-28  3:44   ` Simon Jeons
  1 sibling, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-27  8:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote:
> The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
> set to non-default 0: if a KSM page is migrated to a different NUMA node,
> how do we migrate its stable node to the right tree?  And what if that
> collides with an existing stable node?
> 
> ksm_migrate_page() can do no more than it's already doing, updating
> stable_node->kpfn: the stable tree itself cannot be manipulated without
> holding ksm_thread_mutex.  So accept that a stable tree may temporarily
> indicate a page belonging to the wrong NUMA node, leave updating until
> the next pass of ksmd, just be careful not to merge other pages on to a
> misplaced page.  Note nid of holding tree in stable_node, and recognize
> that it will not always match nid of kpfn.
> 
> A misplaced KSM page is discovered, either when ksm_do_scan() next comes
> around to one of its rmap_items (we now have to go to cmp_and_merge_page
> even on pages in a stable tree), or when stable_tree_search() arrives at
> a matching node for another page, and this node page is found misplaced.
> 
> In each case, move the misplaced stable_node to a list of migrate_nodes
> (and use the address of migrate_nodes as magic by which to identify them):
> we don't need them in a tree.  If stable_tree_search() finds no match for
> a page, but it's currently exiled to this list, then slot its stable_node
> right there into the tree, bringing all of its mappings with it; otherwise
> they get migrated one by one to the original page of the colliding node.
> stable_tree_search() is now modelled more like stable_tree_insert(),
> in order to handle these insertions of migrated nodes.
> 
> remove_node_from_stable_tree(), remove_all_stable_nodes() and
> ksm_check_stable_tree() have to handle the migrate_nodes list as well as
> the stable tree itself.  Less obviously, we do need to prune the list of
> stale entries from time to time (scan_get_next_rmap_item() does it once
> each full scan): whereas stale nodes in the stable tree get naturally
> pruned as searches try to brush past them, these migrate_nodes may get
> forgotten and accumulate.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |  164 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 134 insertions(+), 30 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
> @@ -122,13 +122,25 @@ struct ksm_scan {
>  /**
>   * struct stable_node - node of the stable rbtree
>   * @node: rb node of this ksm page in the stable tree
> + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
> + * @list: linked into migrate_nodes, pending placement in the proper node tree
>   * @hlist: hlist head of rmap_items using this ksm page
> - * @kpfn: page frame number of this ksm page
> + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
> + * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
>   */
>  struct stable_node {
> -	struct rb_node node;
> +	union {
> +		struct rb_node node;	/* when node of stable tree */
> +		struct {		/* when listed for migration */
> +			struct list_head *head;
> +			struct list_head list;
> +		};
> +	};
>  	struct hlist_head hlist;
>  	unsigned long kpfn;
> +#ifdef CONFIG_NUMA
> +	int nid;
> +#endif
>  };
>  
>  /**
> @@ -169,6 +181,9 @@ struct rmap_item {
>  static struct rb_root root_unstable_tree[MAX_NUMNODES];
>  static struct rb_root root_stable_tree[MAX_NUMNODES];
>  
> +/* Recently migrated nodes of stable tree, pending proper placement */
> +static LIST_HEAD(migrate_nodes);
> +
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
> @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru
>  	hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm);
>  }
>  
> -static inline int in_stable_tree(struct rmap_item *rmap_item)
> -{
> -	return rmap_item->address & STABLE_FLAG;
> -}
> -
>  /*
>   * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
>   * page tables after it has passed through ksm_exit() - which, if necessary,
> @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree
>  {
>  	struct rmap_item *rmap_item;
>  	struct hlist_node *hlist;
> -	int nid;
>  
>  	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
> @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree
>  		cond_resched();
>  	}
>  
> -	nid = get_kpfn_nid(stable_node->kpfn);
> -	rb_erase(&stable_node->node, &root_stable_tree[nid]);
> +	if (stable_node->head == &migrate_nodes)
> +		list_del(&stable_node->list);
> +	else
> +		rb_erase(&stable_node->node,
> +			 &root_stable_tree[NUMA(stable_node->nid)]);
>  	free_stable_node(stable_node);
>  }
>  
> @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta
>  static int remove_all_stable_nodes(void)
>  {
>  	struct stable_node *stable_node;
> +	struct list_head *this, *next;
>  	int nid;
>  	int err = 0;
>  
> @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void)
>  			cond_resched();
>  		}
>  	}
> +	list_for_each_safe(this, next, &migrate_nodes) {
> +		stable_node = list_entry(this, struct stable_node, list);
> +		if (remove_stable_node(stable_node))
> +			err = -EBUSY;
> +		cond_resched();
> +	}
>  	return err;
>  }
>  
> @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -	struct rb_node *node;
> -	struct stable_node *stable_node;
>  	int nid;
> +	struct rb_node **new;
> +	struct rb_node *parent;
> +	struct stable_node *stable_node;
> +	struct stable_node *page_node;
>  
> -	stable_node = page_stable_node(page);
> -	if (stable_node) {			/* ksm page forked */
> +	page_node = page_stable_node(page);
> +	if (page_node && page_node->head != &migrate_nodes) {
> +		/* ksm page forked */
>  		get_page(page);
>  		return page;
>  	}
>  
>  	nid = get_kpfn_nid(page_to_pfn(page));
> -	node = root_stable_tree[nid].rb_node;
> +again:
> +	new = &root_stable_tree[nid].rb_node;
> +	parent = NULL;
>  
> -	while (node) {
> +	while (*new) {
>  		struct page *tree_page;
>  		int ret;
>  
>  		cond_resched();
> -		stable_node = rb_entry(node, struct stable_node, node);
> +		stable_node = rb_entry(*new, struct stable_node, node);
>  		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
> @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s
>  		ret = memcmp_pages(page, tree_page);
>  		put_page(tree_page);
>  
> +		parent = *new;
>  		if (ret < 0)
> -			node = node->rb_left;
> +			new = &parent->rb_left;
>  		else if (ret > 0)
> -			node = node->rb_right;
> +			new = &parent->rb_right;
>  		else {
>  			/*
>  			 * Lock and unlock the stable_node's page (which
> @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s
>  			 * than kpage, but that involves more changes.
>  			 */
>  			tree_page = get_ksm_page(stable_node, true);
> -			if (tree_page)
> +			if (tree_page) {
>  				unlock_page(tree_page);
> -			return tree_page;
> +				if (get_kpfn_nid(stable_node->kpfn) !=
> +						NUMA(stable_node->nid)) {
> +					put_page(tree_page);
> +					goto replace;
> +				}
> +				return tree_page;
> +			}
> +			/*
> +			 * There is now a place for page_node, but the tree may
> +			 * have been rebalanced, so re-evaluate parent and new.
> +			 */
> +			if (page_node)
> +				goto again;
> +			return NULL;
>  		}
>  	}
>  
> -	return NULL;
> +	if (!page_node)
> +		return NULL;
> +
> +	list_del(&page_node->list);
> +	DO_NUMA(page_node->nid = nid);
> +	rb_link_node(&page_node->node, parent, new);
> +	rb_insert_color(&page_node->node, &root_stable_tree[nid]);
> +	get_page(page);
> +	return page;
> +
> +replace:
> +	if (page_node) {
> +		list_del(&page_node->list);
> +		DO_NUMA(page_node->nid = nid);
> +		rb_replace_node(&stable_node->node,
> +				&page_node->node, &root_stable_tree[nid]);
> +		get_page(page);
> +	} else {
> +		rb_erase(&stable_node->node, &root_stable_tree[nid]);
> +		page = NULL;
> +	}
> +	stable_node->head = &migrate_nodes;
> +	list_add(&stable_node->list, stable_node->head);
> +	return page;
>  }
>  
>  /*
> @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i
>  	INIT_HLIST_HEAD(&stable_node->hlist);
>  	stable_node->kpfn = kpfn;
>  	set_page_stable_node(kpage, stable_node);
> +	DO_NUMA(stable_node->nid = nid);
>  	rb_link_node(&stable_node->node, parent, new);
>  	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
> @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i
>  static void stable_tree_append(struct rmap_item *rmap_item,
>  			       struct stable_node *stable_node)
>  {
> -	/*
> -	 * Usually rmap_item->nid is already set correctly,
> -	 * but it may be wrong after switching merge_across_nodes.
> -	 */
> -	DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa
>  	unsigned int checksum;
>  	int err;
>  
> -	remove_rmap_item_from_tree(rmap_item);
> +	stable_node = page_stable_node(page);
> +	if (stable_node) {
> +		if (stable_node->head != &migrate_nodes &&
> +		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> +			rb_erase(&stable_node->node,
> +				 &root_stable_tree[NUMA(stable_node->nid)]);
> +			stable_node->head = &migrate_nodes;
> +			list_add(&stable_node->list, stable_node->head);

Why list_add &stable_node->list to stable_node->head? What queue is
stable_node->head used for?

> +		}
> +		if (stable_node->head != &migrate_nodes &&
> +		    rmap_item->head == stable_node)
> +			return;
> +	}
>  
>  	/* We first start with searching the page inside the stable tree */
>  	kpage = stable_tree_search(page);
> +	if (kpage == page && rmap_item->head == stable_node) {
> +		put_page(kpage);
> +		return;
> +	}
> +
> +	remove_rmap_item_from_tree(rmap_item);
> +
>  	if (kpage) {
>  		err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
>  		if (!err) {
> @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r
>  		 */
>  		lru_add_drain_all();
>  
> +		/*
> +		 * Whereas stale stable_nodes on the stable_tree itself
> +		 * get pruned in the regular course of stable_tree_search(),

Which kinds of stable_nodes can be treated as stale? I only see rmap_item
removal in stable_tree_search() and scan_get_next_rmap_item().

> +		 * those moved out to the migrate_nodes list can accumulate:
> +		 * so prune them once before each full scan.
> +		 */
> +		if (!ksm_merge_across_nodes) {
> +			struct stable_node *stable_node;
> +			struct list_head *this, *next;
> +			struct page *page;
> +
> +			list_for_each_safe(this, next, &migrate_nodes) {
> +				stable_node = list_entry(this,
> +						struct stable_node, list);
> +				page = get_ksm_page(stable_node, false);
> +				if (page)
> +					put_page(page);
> +				cond_resched();
> +			}
> +		}
> +

Why get the page of these misplaced nodes here?

>  		for (nid = 0; nid < nr_node_ids; nid++)
>  			root_unstable_tree[nid] = RB_ROOT;
>  
> @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca
>  		rmap_item = scan_get_next_rmap_item(&page);
>  		if (!rmap_item)
>  			return;
> -		if (!PageKsm(page) || !in_stable_tree(rmap_item))
> -			cmp_and_merge_page(page, rmap_item);
> +		cmp_and_merge_page(page, rmap_item);
>  		put_page(page);
>  	}
>  }
> @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign
>  				  unsigned long end_pfn)
>  {
>  	struct stable_node *stable_node;
> +	struct list_head *this, *next;
>  	struct rb_node *node;
>  	int nid;
>  
> @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign
>  			cond_resched();
>  		}
>  	}
> +	list_for_each_safe(this, next, &migrate_nodes) {
> +		stable_node = list_entry(this, struct stable_node, list);
> +		if (stable_node->kpfn >= start_pfn &&
> +		    stable_node->kpfn < end_pfn)
> +			remove_node_from_stable_tree(stable_node);
> +		cond_resched();
> +	}
>  }
>  
>  static int ksm_memory_callback(struct notifier_block *self,
> 



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-27  3:16       ` Simon Jeons
@ 2013-01-27 21:55         ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 21:55 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> > 
> > So you'd like us to add code for moving a node from one tree to another
> > in ksm_migrate_page() (and what would it do when it collides with an
> 
> Without NUMA awareness, I still can't understand your explanation of why
> we can't insert the node into the tree just after page migration, instead of
> inserting it at the next scan.

The node is already there in the right (only) tree in that case.

> 
> > existing node?), code which will then be removed a few patches later
> > when ksm page migration is fully enabled?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-27  2:36   ` Simon Jeons
@ 2013-01-27 22:08     ` Hugh Dickins
  2013-01-28  0:36       ` Simon Jeons
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 22:08 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote:
> > In some places where get_ksm_page() is used, we need the page to be locked.
> > 
> 
> > In function get_ksm_page(), why check page->mapping =>
> > get_page_unless_zero => check page->mapping, instead of
> > get_page_unless_zero => check page->mapping? Is it because
> > get_page_unless_zero is expensive?

Yes, it's more expensive.
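
A tiny userspace sketch of the pattern in question (hypothetical names, and
a stand-in for get_page_unless_zero(), not the real thing): the cheap plain
read of mapping filters out most stale cases before paying for the atomic,
and the re-check afterwards catches a mapping that changed in between.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
    void *mapping;
    atomic_int refcount;
};

/* stand-in for get_page_unless_zero(): an atomic RMW, relatively costly */
static bool get_unless_zero(struct fake_page *p)
{
    int c = atomic_load(&p->refcount);
    while (c > 0) {
        if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1))
            return true;
    }
    return false;
}

static bool try_get(struct fake_page *p, void *expected_mapping)
{
    if (p->mapping != expected_mapping)     /* cheap read rejects most stale cases */
        return false;
    if (!get_unless_zero(p))                /* only then pay for the atomic */
        return false;
    if (p->mapping != expected_mapping) {   /* re-check after raising the count */
        atomic_fetch_sub(&p->refcount, 1);
        return false;
    }
    return true;
}

int main(void)
{
    struct fake_page page;

    page.mapping = (void *)0x1;
    atomic_init(&page.refcount, 1);
    printf("%d\n", try_get(&page, (void *)0x1));    /* 1 */
    printf("%d\n", try_get(&page, (void *)0x2));    /* 0: cheap check says stale */
    return 0;
}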

> 
> > When KSM migration is fully enabled, we shall want that to make sure that
> > the page just acquired cannot be migrated beneath us (raised page count is
> > only effective when there is serialization to make sure migration notices).
> > Whereas when navigating through the stable tree, we certainly do not want
> 
> What's the meaning of "navigating through the stable tree"?

Finding the right place in the stable tree,
as stable_tree_search() and stable_tree_insert() do.

> 
> > to lock each node (raised page count is enough to guarantee the memcmps,
> > even if page is migrated to another node).
> > 
> > Since we're about to add another use case, add the locked argument to
> > get_ksm_page() now.
> 
> > Why is the locked parameter passed from stable_tree_search/insert true,
> > but from remove_rmap_item_from_tree false?

The other way round?  remove_rmap_item_from_tree needs the page locked,
because it's about to modify the list: that's secured (e.g. against
concurrent KSM page reclaim) by the page lock.

stable_tree_search and stable_tree_insert do not need intermediate nodes
to be locked: get_page is enough to secure the page contents for memcmp,
and we don't want a pointless wait for exclusive page lock on every
intermediate node.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-27  2:48   ` Simon Jeons
@ 2013-01-27 22:10     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 22:10 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> 
> BTW, what's the meaning of "ksm page forked"?

A ksm page is mapped into a process's mm, then that process calls fork():
the ksm page then appears in the child's mm, before ksmd has tracked it.
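
For concreteness, a minimal userspace sketch of that sequence (assumes a
kernel with CONFIG_KSM and ksmd running, i.e. /sys/kernel/mm/ksm/run set to
1, plus a 4k page size; the sleep() is only a crude way of giving ksmd time
to merge before the fork):

#define _GNU_SOURCE             /* for MADV_MERGEABLE */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    size_t len = 2 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 1;
    memset(p, 0x5a, len);                   /* two identical pages */
    if (madvise(p, len, MADV_MERGEABLE))    /* hand them to ksmd */
        return 1;
    sleep(10);                              /* crude: let ksmd merge them */

    if (fork() == 0) {
        /* the merged ksm page is now mapped here, in an mm that ksmd
           has not yet tracked: that is a "forked" ksm page */
        printf("child sees %x\n", p[0]);
        _exit(0);
    }
    wait(NULL);
    munmap(p, len);
    return 0;
}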

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-27  4:55   ` Simon Jeons
@ 2013-01-27 23:05     ` Hugh Dickins
  2013-01-28  1:42       ` Simon Jeons
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 23:05 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > Switching merge_across_nodes after running KSM is liable to oops on stale
> > nodes still left over from the previous stable tree.  It's not something
> > that people will often want to do, but it would be lame to demand a reboot
> > when they're trying to determine which merge_across_nodes setting is best.
> > 
> > How can this happen?  We only permit switching merge_across_nodes when
> > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > ought to unmerge everything: yet oopses still occur when you then run 1.
> > 
> > Three causes:
> > 
> > 1. The old stable tree (built according to the inverse merge_across_nodes)
> > has not been fully torn down.  A stable node lingers until get_ksm_page()
> > notices that the page it references no longer references it: but the page
> > is not necessarily freed as soon as expected, particularly when swapcache.
> > 
> 
> When can this happen?  

Whenever there's an additional reference to the page, beyond those for
its ptes in userspace - swapcache for example, or pinned by get_user_pages.
That delays its being freed (arriving at the "page->mapping = NULL;"
in free_pages_prepare()).  Or it might simply be sitting in a pagevec,
waiting for that to be filled up, to be freed as part of a batch.

> 
> > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > to each of the remaining nodes (most found stale and removed immediately),
> > with forced removal of any left over.  Unless the page is still mapped:
> > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > and EBUSY than BUG.
> > 
> > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> > but not so nice when we're trying to unmerge all mms: we were missing
> > those mms forked and inserted behind the unmerge cursor.  Easily fixed
> > by inserting at the end when KSM_RUN_UNMERGE.
> 
> mms forked will be unmerged just after ksmd's cursor, since they're
> inserted behind it; so why will they be missed?

unmerge_and_remove_all_rmap_items() makes one pass through the list
from start to finish: insert behind the cursor and it will be missed.
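
A throwaway userspace sketch of just that (hypothetical names, a plain
singly-linked list, nothing to do with the kernel's real mm_list handling):

#include <stdio.h>

struct slot { struct slot *next; const char *name; };

int main(void)
{
    struct slot c = { NULL, "C" };
    struct slot b = { &c, "B" };        /* the unmerge pass has reached here */
    struct slot a = { &b, "A" };        /* already scanned */
    struct slot *cursor = &b;

    /* insert "just behind the cursor", as __ksm_enter() used to do: */
    struct slot forked = { &b, "forked" };
    a.next = &forked;                   /* list is now A -> forked -> B -> C */

    /* the single pass continues from the cursor: "forked" is never visited */
    for (struct slot *s = cursor; s; s = s->next)
        printf("unmerge %s\n", s->name);    /* prints B then C only */

    /* inserting at the end instead (the KSM_RUN_UNMERGE case) would have
       put it after C, where this same pass would still reach it */
    return 0;
}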

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-27  5:47   ` Simon Jeons
@ 2013-01-27 23:12     ` Hugh Dickins
  2013-01-28  0:41       ` Simon Jeons
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 23:12 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote:
> > +	while (!get_page_unless_zero(page)) {
> > +		/*
> > +		 * Another check for page->mapping != expected_mapping would
> > +		 * work here too.  We have chosen the !PageSwapCache test to
> > +		 * optimize the common case, when the page is or is about to
> > +		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
> > +		 * in the freeze_refs section of __remove_mapping(); but Anon
> > +		 * page->mapping reset to NULL later, in free_pages_prepare().
> > +		 */
> > +		if (!PageSwapCache(page))
> > +			goto stale;
> > +		cpu_relax();
> > +	}
> > +
> > +	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> >  		put_page(page);
> >  		goto stale;
> >  	}
> > +
> >  	if (locked) {
> >  		lock_page(page);
> > -		if (page->mapping != expected_mapping) {
> > +		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> >  			unlock_page(page);
> >  			put_page(page);
> >  			goto stale;
> >  		}
> >  	}
> 
> Could you explain why page->mapping needs to be checked twice after getting the page?

Once for the !locked case, which should not return page if mapping changed.
Once for the locked case, which should not return page if mapping changed.
We could use "else", but that wouldn't be an improvement.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
  2013-01-27  8:49   ` Simon Jeons
@ 2013-01-27 23:25     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 23:25 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote:
> > @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa
> >  	unsigned int checksum;
> >  	int err;
> >  
> > -	remove_rmap_item_from_tree(rmap_item);
> > +	stable_node = page_stable_node(page);
> > +	if (stable_node) {
> > +		if (stable_node->head != &migrate_nodes &&
> > +		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> > +			rb_erase(&stable_node->node,
> > +				 &root_stable_tree[NUMA(stable_node->nid)]);
> > +			stable_node->head = &migrate_nodes;
> > +			list_add(&stable_node->list, stable_node->head);
> 
> Why list_add &stable_node->list to stable_node->head? What queue is
> stable_node->head used for?

Read that as list_add(&stable_node->list, &migrate_nodes) if you prefer.
In stable_node->head (overlaying stable_node->node.__rb_parent_color, which
would never point to migrate_nodes as an rb_node), &migrate_nodes is used
as "magic" to show that that rb_node is currently saved on this list,
rather than linked into the stable tree itself.  We could do some
#define MIGRATE_NODES_MAGIC 0xwhatever and put that in head instead.
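
A stripped-down, self-contained sketch of that overlay (stand-in structures
only, echoing the union from patch 8 above, not the kernel's rbtree or list
code):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
struct rb_node {
    unsigned long rb_parent_color;
    struct rb_node *rb_right, *rb_left;
};

static struct list_head migrate_nodes = { &migrate_nodes, &migrate_nodes };

struct stable_node {
    union {
        struct rb_node node;            /* when node of stable tree */
        struct {                        /* when listed for migration */
            struct list_head *head;     /* overlays the rb parent slot */
            struct list_head list;
        };
    };
};

/* the "magic": no rb parent pointer could ever equal &migrate_nodes */
static int on_migrate_list(struct stable_node *sn)
{
    return sn->head == &migrate_nodes;
}

int main(void)
{
    struct stable_node sn;

    sn.head = &migrate_nodes;       /* as when moved out to the list */
    printf("on list? %d\n", on_migrate_list(&sn));  /* 1 */

    sn.node.rb_parent_color = 0;    /* back in a stable tree */
    printf("on list? %d\n", on_migrate_list(&sn));  /* 0 */
    return 0;
}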

> > @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r
> >  		 */
> >  		lru_add_drain_all();
> >  
> > +		/*
> > +		 * Whereas stale stable_nodes on the stable_tree itself
> > +		 * get pruned in the regular course of stable_tree_search(),
> 
> Which kinds of stable_nodes can be treated as stale? I only see rmap_item
> removal in stable_tree_search() and scan_get_next_rmap_item().

See get_ksm_page().

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/11] ksm: stop hotremove lockdep warning
  2013-01-27  6:23   ` Simon Jeons
@ 2013-01-27 23:35     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-27 23:35 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Gerald Schaefer, KOSAKI Motohiro, linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:10 -0800, Hugh Dickins wrote:
> > @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
> >  	switch (action) {
> >  	case MEM_GOING_OFFLINE:
> >  		/*
> > -		 * Keep it very simple for now: just lock out ksmd and
> > -		 * MADV_UNMERGEABLE while any memory is going offline.
> > -		 * mutex_lock_nested() is necessary because lockdep was alarmed
> > -		 * that here we take ksm_thread_mutex inside notifier chain
> > -		 * mutex, and later take notifier chain mutex inside
> > -		 * ksm_thread_mutex to unlock it.   But that's safe because both
> > -		 * are inside mem_hotplug_mutex.
> > +		 * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
> > +		 * and remove_all_stable_nodes() while memory is going offline:
> > +		 * it is unsafe for them to touch the stable tree at this time.
> > +		 * But unmerge_ksm_pages(), rmap lookups and other entry points
> 
> Why is unmerge_ksm_pages() beneath us safe for ksm memory hotremove?
> 
> > +		 * which do not need the ksm_thread_mutex are all safe.

It's just like userspace doing a write-fault on every KSM page in the vma.
If that were unsafe for memory hotremove, then it would not be KSM's
problem, memory hotremove would already be unsafe.  (But memory hotremove
is safe because it migrates away from all the pages to be removed before
it can reach MEM_OFFLINE.)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-27 22:08     ` Hugh Dickins
@ 2013-01-28  0:36       ` Simon Jeons
  2013-01-28  3:35         ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  0:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:00 -0800, Hugh Dickins wrote:
> > > In some places where get_ksm_page() is used, we need the page to be locked.
> > > 
> > 
> > In function get_ksm_page(), why check page->mapping =>
> > get_page_unless_zero => check page->mapping, instead of
> > get_page_unless_zero => check page->mapping? Is it because
> > get_page_unless_zero is expensive?
> 
> Yes, it's more expensive.
> 
> > 
> > > When KSM migration is fully enabled, we shall want that to make sure that
> > > the page just acquired cannot be migrated beneath us (raised page count is
> > > only effective when there is serialization to make sure migration notices).
> > > Whereas when navigating through the stable tree, we certainly do not want
> > 
> > What's the meaning of "navigating through the stable tree"?
> 
> Finding the right place in the stable tree,
> as stable_tree_search() and stable_tree_insert() do.
> 
> > 
> > > to lock each node (raised page count is enough to guarantee the memcmps,
> > > even if page is migrated to another node).
> > > 
> > > Since we're about to add another use case, add the locked argument to
> > > get_ksm_page() now.
> > 
> > Why is the locked parameter passed from stable_tree_search/insert true,
> > but from remove_rmap_item_from_tree false?
> 
> The other way round?  remove_rmap_item_from_tree needs the page locked,
> because it's about to modify the list: that's secured (e.g. against
> concurrent KSM page reclaim) by the page lock.

How can the KSM page reclaim path call remove_rmap_item_from_tree? I have
already tracked every call site but can't find it. BTW, I'm curious about
KSM page reclaim: it seems there's no special handling in vmscan.c for KSM
page reclaim, so will it be reclaimed similarly to a normal page?

> 
> stable_tree_search and stable_tree_insert do not need intermediate nodes
> to be locked: get_page is enough to secure the page contents for memcmp,
> and we don't want a pointless wait for exclusive page lock on every
> intermediate node.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-27 23:12     ` Hugh Dickins
@ 2013-01-28  0:41       ` Simon Jeons
  2013-01-28  3:44         ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  0:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:03 -0800, Hugh Dickins wrote:
> > > +	while (!get_page_unless_zero(page)) {
> > > +		/*
> > > +		 * Another check for page->mapping != expected_mapping would
> > > +		 * work here too.  We have chosen the !PageSwapCache test to
> > > +		 * optimize the common case, when the page is or is about to
> > > +		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
> > > +		 * in the freeze_refs section of __remove_mapping(); but Anon
> > > +		 * page->mapping reset to NULL later, in free_pages_prepare().
> > > +		 */
> > > +		if (!PageSwapCache(page))
> > > +			goto stale;
> > > +		cpu_relax();
> > > +	}
> > > +
> > > +	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > >  		put_page(page);
> > >  		goto stale;
> > >  	}
> > > +
> > >  	if (locked) {
> > >  		lock_page(page);
> > > -		if (page->mapping != expected_mapping) {
> > > +		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> > >  			unlock_page(page);
> > >  			put_page(page);
> > >  			goto stale;
> > >  		}
> > >  	}
> > 
> > Could you explain why page->mapping needs to be checked twice after getting the page?
> 
> Once for the !locked case, which should not return page if mapping changed.
> Once for the locked case, which should not return page if mapping changed.
> We could use "else", but that wouldn't be an improvement.

But for the locked case, page->mapping will be checked twice.




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-27 23:05     ` Hugh Dickins
@ 2013-01-28  1:42       ` Simon Jeons
  2013-01-28  4:14         ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  1:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > > Switching merge_across_nodes after running KSM is liable to oops on stale
> > > nodes still left over from the previous stable tree.  It's not something
> > > that people will often want to do, but it would be lame to demand a reboot
> > > when they're trying to determine which merge_across_nodes setting is best.
> > > 
> > > How can this happen?  We only permit switching merge_across_nodes when
> > > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > > ought to unmerge everything: yet oopses still occur when you then run 1.
> > > 
> > > Three causes:
> > > 
> > > 1. The old stable tree (built according to the inverse merge_across_nodes)
                                                   ^^^^^^^^^^^^^^^^^^^^^
How should I understand "inverse merge_across_nodes" here?

> > > has not been fully torn down.  A stable node lingers until get_ksm_page()
> > > notices that the page it references no longer references it: but the page

Do you mean page->mapping is NULL when get_ksm_page() is called?  Who
clears it to NULL?

> > > is not necessarily freed as soon as expected, particularly when swapcache.

Why is it not necessarily freed as soon as expected?

> > > 
> > 
> > When can this happen?  
> 
> Whenever there's an additional reference to the page, beyond those for
> its ptes in userspace - swapcache for example, or pinned by get_user_pages.
> That delays its being freed (arriving at the "page->mapping = NULL;"
> in free_pages_prepare()).  Or it might simply be sitting in a pagevec,
> waiting for that to be filled up, to be freed as part of a batch.
> 
> > 
> > > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > > to each of the remaining nodes (most found stale and removed immediately),
> > > with forced removal of any left over.  Unless the page is still mapped:
> > > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > > and EBUSY than BUG.
> > > 
> > > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > > (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> > > but not so nice when we're trying to unmerge all mms: we were missing
> > > those mms forked and inserted behind the unmerge cursor.  Easily fixed
> > > by inserting at the end when KSM_RUN_UNMERGE.
> > 
> > mms forked will be unmerged just after ksmd's cursor since they're
> > inserted behind it, why will be missing?
> 
> unmerge_and_remove_all_rmap_items() makes one pass through the list
> from start to finish: insert behind the cursor and it will be missed.

Since forked mms are inserted just after ksmd's cursor, they should be
the next to be scanned and unmerged; what am I missing?




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
  2013-01-27  4:55   ` Simon Jeons
@ 2013-01-28  2:12   ` Simon Jeons
  2013-01-28  4:19     ` Hugh Dickins
  2013-01-28  6:36   ` Simon Jeons
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  2:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree.  It's not something

Since this patch solves the problem, the description of
merge_across_nodes ("Value can be changed only when there is no ksm shared
pages in system") should be updated in this patch.

> that people will often want to do, but it would be lame to demand a reboot
> when they're trying to determine which merge_across_nodes setting is best.
> 
> How can this happen?  We only permit switching merge_across_nodes when
> pages_shared is 0, and usually set run 2 to force that beforehand, which
> ought to unmerge everything: yet oopses still occur when you then run 1.
> 
> Three causes:
> 
> 1. The old stable tree (built according to the inverse merge_across_nodes)
> has not been fully torn down.  A stable node lingers until get_ksm_page()
> notices that the page it references no longer references it: but the page
> is not necessarily freed as soon as expected, particularly when swapcache.
> 
> Fix this with a pass through the old stable tree, applying get_ksm_page()
> to each of the remaining nodes (most found stale and removed immediately),
> with forced removal of any left over.  Unless the page is still mapped:
> I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> and EBUSY than BUG.
> 
> 2. __ksm_enter() has a nice little optimization, to insert the new mm
> just behind ksmd's cursor, so there's a full pass for it to stabilize
> (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> but not so nice when we're trying to unmerge all mms: we were missing
> those mms forked and inserted behind the unmerge cursor.  Easily fixed
> by inserting at the end when KSM_RUN_UNMERGE.
> 
> 3. It is possible for a KSM page to be faulted back from swapcache into
> an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
> 
> A long outstanding, unrelated bugfix sneaks in with that third fix:
> ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
> I/O error when read in from swap) to a page which it then marks Uptodate.
> Fix this case by not copying, letting do_swap_page() discover the error.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/ksm.h |   18 ++-------
>  mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c         |   19 ++++-----
>  3 files changed, 92 insertions(+), 28 deletions(-)
> 
> --- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
> @@ -16,9 +16,6 @@
>  struct stable_node;
>  struct mem_cgroup;
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address);
> -
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
> @@ -73,15 +70,8 @@ static inline void set_page_stable_node(
>   * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
>   * but what if the vma was unmerged while the page was swapped out?
>   */
> -static inline int ksm_might_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address)
> -{
> -	struct anon_vma *anon_vma = page_anon_vma(page);
> -
> -	return anon_vma &&
> -		(anon_vma->root != vma->anon_vma->root ||
> -		 page->index != linear_page_index(vma, address));
> -}
> +struct page *ksm_might_need_to_copy(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address);
>  
>  int page_referenced_ksm(struct page *page,
>  			struct mem_cgroup *memcg, unsigned long *vm_flags);
> @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
>  	return 0;
>  }
>  
> -static inline int ksm_might_need_to_copy(struct page *page,
> +static inline struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> -	return 0;
> +	return page;
>  }
>  
>  static inline int page_referenced_ksm(struct page *page,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
>  /*
>   * Only called through the sysfs control interface:
>   */
> +static int remove_stable_node(struct stable_node *stable_node)
> +{
> +	struct page *page;
> +	int err;
> +
> +	page = get_ksm_page(stable_node, true);
> +	if (!page) {
> +		/*
> +		 * get_ksm_page did remove_node_from_stable_tree itself.
> +		 */
> +		return 0;
> +	}
> +
> +	if (WARN_ON_ONCE(page_mapped(page)))
> +		err = -EBUSY;
> +	else {
> +		/*
> +		 * This page might be in a pagevec waiting to be freed,
> +		 * or it might be PageSwapCache (perhaps under writeback),
> +		 * or it might have been removed from swapcache a moment ago.
> +		 */
> +		set_page_stable_node(page, NULL);
> +		remove_node_from_stable_tree(stable_node);
> +		err = 0;
> +	}
> +
> +	unlock_page(page);
> +	put_page(page);
> +	return err;
> +}
> +
> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;
> +				break;	/* proceed to next nid */
> +			}
> +			cond_resched();
> +		}
> +	}
> +	return err;
> +}
> +
>  static int unmerge_and_remove_all_rmap_items(void)
>  {
>  	struct mm_slot *mm_slot;
> @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i
>  		}
>  	}
>  
> +	/* Clean up stable nodes, but don't worry if some are still busy */
> +	remove_all_stable_nodes();
>  	ksm_scan.seqnr = 0;
>  	return 0;
>  
> @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm)
>  	spin_lock(&ksm_mmlist_lock);
>  	insert_to_mm_slots_hash(mm, mm_slot);
>  	/*
> -	 * Insert just behind the scanning cursor, to let the area settle
> +	 * When KSM_RUN_MERGE (or KSM_RUN_STOP),
> +	 * insert just behind the scanning cursor, to let the area settle
>  	 * down a little; when fork is followed by immediate exec, we don't
>  	 * want ksmd to waste time setting up and tearing down an rmap_list.
> +	 *
> +	 * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
> +	 * scanning cursor, otherwise KSM pages in newly forked mms will be
> +	 * missed: then we might as well insert at the end of the list.
>  	 */
> -	list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> +	if (ksm_run & KSM_RUN_UNMERGE)
> +		list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
> +	else
> +		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
>  	spin_unlock(&ksm_mmlist_lock);
>  
>  	set_bit(MMF_VM_MERGEABLE, &mm->flags);
> @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm)
>  	}
>  }
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> +struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> +	struct anon_vma *anon_vma = page_anon_vma(page);
>  	struct page *new_page;
>  
> +	if (PageKsm(page)) {
> +		if (page_stable_node(page) &&
> +		    !(ksm_run & KSM_RUN_UNMERGE))
> +			return page;	/* no need to copy it */
> +	} else if (!anon_vma) {
> +		return page;		/* no need to copy it */
> +	} else if (anon_vma->root == vma->anon_vma->root &&
> +		 page->index == linear_page_index(vma, address)) {
> +		return page;		/* still no need to copy it */
> +	}
> +	if (!PageUptodate(page))
> +		return page;		/* let do_swap_page report the error */
> +
>  	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
>  	if (new_page) {
>  		copy_user_highpage(new_page, page, address, vma);
> @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store(
>  
>  	mutex_lock(&ksm_thread_mutex);
>  	if (ksm_merge_across_nodes != knob) {
> -		if (ksm_pages_shared)
> +		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
>  		else
>  			ksm_merge_across_nodes = knob;
> --- mmotm.orig/mm/memory.c	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/mm/memory.c	2013-01-25 14:37:00.768206145 -0800
> @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct
>  	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
>  		goto out_page;
>  
> -	if (ksm_might_need_to_copy(page, vma, address)) {
> -		swapcache = page;
> -		page = ksm_does_need_to_copy(page, vma, address);
> -
> -		if (unlikely(!page)) {
> -			ret = VM_FAULT_OOM;
> -			page = swapcache;
> -			swapcache = NULL;
> -			goto out_page;
> -		}
> +	swapcache = page;
> +	page = ksm_might_need_to_copy(page, vma, address);
> +	if (unlikely(!page)) {
> +		ret = VM_FAULT_OOM;
> +		page = swapcache;
> +		swapcache = NULL;
> +		goto out_page;
>  	}
> +	if (page == swapcache)
> +		swapcache = NULL;
>  
>  	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
>  		ret = VM_FAULT_OOM;
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-28  0:36       ` Simon Jeons
@ 2013-01-28  3:35         ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-28  3:35 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Sun, 2013-01-27 at 14:08 -0800, Hugh Dickins wrote:
> > On Sat, 26 Jan 2013, Simon Jeons wrote:
> > > 
> > > Why the parameter lock passed from stable_tree_search/insert is true,
> > > but remove_rmap_item_from_tree is false?
> > 
> > The other way round?  remove_rmap_item_from_tree needs the page locked,
> > because it's about to modify the list: that's secured (e.g. against
> > concurrent KSM page reclaim) by the page lock.
> 
> How can KSM page reclaim path call remove_rmap_item_from_tree? I have
> already track every callsites but can't find it.

It doesn't.  Please read what I said above again.

> BTW, I'm curious about
> KSM page reclaim, it seems that there're no special handle in vmscan.c
> for KSM page reclaim, is it will be reclaimed similiar with normal
> page? 

Look for PageKsm in mm/rmap.c.
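
If it helps, a tiny userspace model of that dispatch (the real helpers
are page_referenced_ksm() and try_to_unmap_ksm() in ksm.c, called from
mm/rmap.c when PageKsm(page); the names and bodies below are invented
for the demo): vmscan.c itself needs nothing KSM-specific, because the
per-page-type work is hidden behind the rmap walk, and a KSM page's
->mapping points at its stable_node rather than an anon_vma.

#include <stdio.h>
#include <stdbool.h>

struct toy_page {
	bool ksm;			/* stands in for PageKsm(page) */
};

static int toy_referenced_ksm(struct toy_page *p)
{
	/* the real helper walks the stable_node's list of rmap_items */
	return 1;
}

static int toy_referenced_anon(struct toy_page *p)
{
	/* the real helper walks the anon_vma's list of vmas */
	return 1;
}

static int toy_page_referenced(struct toy_page *p)
{
	return p->ksm ? toy_referenced_ksm(p) : toy_referenced_anon(p);
}

int main(void)
{
	struct toy_page p = { .ksm = true };

	printf("referenced = %d\n", toy_page_referenced(&p));
	return 0;
}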

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 8/11] ksm: make !merge_across_nodes migration safe
  2013-01-26  2:05 ` [PATCH 8/11] ksm: make !merge_across_nodes migration safe Hugh Dickins
  2013-01-27  8:49   ` Simon Jeons
@ 2013-01-28  3:44   ` Simon Jeons
  1 sibling, 0 replies; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  3:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:05 -0800, Hugh Dickins wrote:
> The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
> set to non-default 0: if a KSM page is migrated to a different NUMA node,
> how do we migrate its stable node to the right tree?  And what if that
> collides with an existing stable node?
> 
> ksm_migrate_page() can do no more than it's already doing, updating
> stable_node->kpfn: the stable tree itself cannot be manipulated without
> holding ksm_thread_mutex.  So accept that a stable tree may temporarily
> indicate a page belonging to the wrong NUMA node, leave updating until
> the next pass of ksmd, just be careful not to merge other pages on to a

How do you avoid merging other pages onto a misplaced page? I don't see it.

> misplaced page.  Note nid of holding tree in stable_node, and recognize
> that it will not always match nid of kpfn.
> 
> A misplaced KSM page is discovered, either when ksm_do_scan() next comes
> around to one of its rmap_items (we now have to go to cmp_and_merge_page
> even on pages in a stable tree), or when stable_tree_search() arrives at
> a matching node for another page, and this node page is found misplaced.
> 
> In each case, move the misplaced stable_node to a list of migrate_nodes
> (and use the address of migrate_nodes as magic by which to identify them):
> we don't need them in a tree.  If stable_tree_search() finds no match for
> a page, but it's currently exiled to this list, then slot its stable_node
> right there into the tree, bringing all of its mappings with it; otherwise
> they get migrated one by one to the original page of the colliding node.
> stable_tree_search() is now modelled more like stable_tree_insert(),
> in order to handle these insertions of migrated nodes.

When will a node be removed from the migrate_nodes list and inserted
into the stable tree?

> 
> remove_node_from_stable_tree(), remove_all_stable_nodes() and
> ksm_check_stable_tree() have to handle the migrate_nodes list as well as
> the stable tree itself.  Less obviously, we do need to prune the list of
> stale entries from time to time (scan_get_next_rmap_item() does it once
> each full scan):

>  whereas stale nodes in the stable tree get naturally
> pruned as searches try to brush past them, these migrate_nodes may get
> forgotten and accumulate.

This description is hard for me to understand. Could you explain it? :)

> Signed-off-by: Hugh Dickins <hughd@google.com>

What will happen if a page in the unstable tree migrates to a new NUMA
node?  Does collision also need to be handled there?

> ---
>  mm/ksm.c |  164 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 134 insertions(+), 30 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
> @@ -122,13 +122,25 @@ struct ksm_scan {
>  /**
>   * struct stable_node - node of the stable rbtree
>   * @node: rb node of this ksm page in the stable tree
> + * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
> + * @list: linked into migrate_nodes, pending placement in the proper node tree
>   * @hlist: hlist head of rmap_items using this ksm page
> - * @kpfn: page frame number of this ksm page
> + * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
> + * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
>   */
>  struct stable_node {
> -	struct rb_node node;
> +	union {
> +		struct rb_node node;	/* when node of stable tree */
> +		struct {		/* when listed for migration */
> +			struct list_head *head;
> +			struct list_head list;
> +		};
> +	};
>  	struct hlist_head hlist;
>  	unsigned long kpfn;
> +#ifdef CONFIG_NUMA
> +	int nid;
> +#endif
>  };
>  
>  /**
> @@ -169,6 +181,9 @@ struct rmap_item {
>  static struct rb_root root_unstable_tree[MAX_NUMNODES];
>  static struct rb_root root_stable_tree[MAX_NUMNODES];
>  
> +/* Recently migrated nodes of stable tree, pending proper placement */
> +static LIST_HEAD(migrate_nodes);
> +
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>  
> @@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru
>  	hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm);
>  }
>  
> -static inline int in_stable_tree(struct rmap_item *rmap_item)
> -{
> -	return rmap_item->address & STABLE_FLAG;
> -}
> -
>  /*
>   * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
>   * page tables after it has passed through ksm_exit() - which, if necessary,
> @@ -476,7 +486,6 @@ static void remove_node_from_stable_tree
>  {
>  	struct rmap_item *rmap_item;
>  	struct hlist_node *hlist;
> -	int nid;
>  
>  	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
> @@ -488,8 +497,11 @@ static void remove_node_from_stable_tree
>  		cond_resched();
>  	}
>  
> -	nid = get_kpfn_nid(stable_node->kpfn);
> -	rb_erase(&stable_node->node, &root_stable_tree[nid]);
> +	if (stable_node->head == &migrate_nodes)
> +		list_del(&stable_node->list);
> +	else
> +		rb_erase(&stable_node->node,
> +			 &root_stable_tree[NUMA(stable_node->nid)]);
>  	free_stable_node(stable_node);
>  }
>  
> @@ -712,6 +724,7 @@ static int remove_stable_node(struct sta
>  static int remove_all_stable_nodes(void)
>  {
>  	struct stable_node *stable_node;
> +	struct list_head *this, *next;
>  	int nid;
>  	int err = 0;
>  
> @@ -726,6 +739,12 @@ static int remove_all_stable_nodes(void)
>  			cond_resched();
>  		}
>  	}
> +	list_for_each_safe(this, next, &migrate_nodes) {
> +		stable_node = list_entry(this, struct stable_node, list);
> +		if (remove_stable_node(stable_node))
> +			err = -EBUSY;
> +		cond_resched();
> +	}
>  	return err;
>  }
>  
> @@ -1113,25 +1132,30 @@ static struct page *try_to_merge_two_pag
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -	struct rb_node *node;
> -	struct stable_node *stable_node;
>  	int nid;
> +	struct rb_node **new;
> +	struct rb_node *parent;
> +	struct stable_node *stable_node;
> +	struct stable_node *page_node;
>  
> -	stable_node = page_stable_node(page);
> -	if (stable_node) {			/* ksm page forked */
> +	page_node = page_stable_node(page);
> +	if (page_node && page_node->head != &migrate_nodes) {
> +		/* ksm page forked */
>  		get_page(page);
>  		return page;
>  	}
>  
>  	nid = get_kpfn_nid(page_to_pfn(page));
> -	node = root_stable_tree[nid].rb_node;
> +again:
> +	new = &root_stable_tree[nid].rb_node;
> +	parent = NULL;
>  
> -	while (node) {
> +	while (*new) {
>  		struct page *tree_page;
>  		int ret;
>  
>  		cond_resched();
> -		stable_node = rb_entry(node, struct stable_node, node);
> +		stable_node = rb_entry(*new, struct stable_node, node);
>  		tree_page = get_ksm_page(stable_node, false);
>  		if (!tree_page)
>  			return NULL;
> @@ -1139,10 +1163,11 @@ static struct page *stable_tree_search(s
>  		ret = memcmp_pages(page, tree_page);
>  		put_page(tree_page);
>  
> +		parent = *new;
>  		if (ret < 0)
> -			node = node->rb_left;
> +			new = &parent->rb_left;
>  		else if (ret > 0)
> -			node = node->rb_right;
> +			new = &parent->rb_right;
>  		else {
>  			/*
>  			 * Lock and unlock the stable_node's page (which
> @@ -1152,13 +1177,49 @@ static struct page *stable_tree_search(s
>  			 * than kpage, but that involves more changes.
>  			 */
>  			tree_page = get_ksm_page(stable_node, true);
> -			if (tree_page)
> +			if (tree_page) {
>  				unlock_page(tree_page);
> -			return tree_page;
> +				if (get_kpfn_nid(stable_node->kpfn) !=
> +						NUMA(stable_node->nid)) {
> +					put_page(tree_page);
> +					goto replace;
> +				}
> +				return tree_page;
> +			}
> +			/*
> +			 * There is now a place for page_node, but the tree may
> +			 * have been rebalanced, so re-evaluate parent and new.
> +			 */
> +			if (page_node)
> +				goto again;
> +			return NULL;
>  		}
>  	}
>  
> -	return NULL;
> +	if (!page_node)
> +		return NULL;
> +
> +	list_del(&page_node->list);
> +	DO_NUMA(page_node->nid = nid);
> +	rb_link_node(&page_node->node, parent, new);
> +	rb_insert_color(&page_node->node, &root_stable_tree[nid]);
> +	get_page(page);
> +	return page;
> +
> +replace:
> +	if (page_node) {
> +		list_del(&page_node->list);
> +		DO_NUMA(page_node->nid = nid);
> +		rb_replace_node(&stable_node->node,
> +				&page_node->node, &root_stable_tree[nid]);
> +		get_page(page);
> +	} else {
> +		rb_erase(&stable_node->node, &root_stable_tree[nid]);
> +		page = NULL;
> +	}
> +	stable_node->head = &migrate_nodes;

Why still set this magic value, since the node has already been
inserted into the tree?

> +	list_add(&stable_node->list, stable_node->head);
> +	return page;
>  }
>  
>  /*
> @@ -1215,6 +1276,7 @@ static struct stable_node *stable_tree_i
>  	INIT_HLIST_HEAD(&stable_node->hlist);
>  	stable_node->kpfn = kpfn;
>  	set_page_stable_node(kpage, stable_node);
> +	DO_NUMA(stable_node->nid = nid);
>  	rb_link_node(&stable_node->node, parent, new);
>  	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
> @@ -1311,11 +1373,6 @@ struct rmap_item *unstable_tree_search_i
>  static void stable_tree_append(struct rmap_item *rmap_item,
>  			       struct stable_node *stable_node)
>  {
> -	/*
> -	 * Usually rmap_item->nid is already set correctly,
> -	 * but it may be wrong after switching merge_across_nodes.
> -	 */
> -	DO_NUMA(rmap_item->nid = get_kpfn_nid(stable_node->kpfn));
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1344,10 +1401,29 @@ static void cmp_and_merge_page(struct pa
>  	unsigned int checksum;
>  	int err;
>  
> -	remove_rmap_item_from_tree(rmap_item);
> +	stable_node = page_stable_node(page);
> +	if (stable_node) {
> +		if (stable_node->head != &migrate_nodes &&
> +		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> +			rb_erase(&stable_node->node,
> +				 &root_stable_tree[NUMA(stable_node->nid)]);
> +			stable_node->head = &migrate_nodes;
> +			list_add(&stable_node->list, stable_node->head);
> +		}
> +		if (stable_node->head != &migrate_nodes &&
> +		    rmap_item->head == stable_node)
> +			return;
> +	}
>  
>  	/* We first start with searching the page inside the stable tree */
>  	kpage = stable_tree_search(page);
> +	if (kpage == page && rmap_item->head == stable_node) {
> +		put_page(kpage);
> +		return;
> +	}
> +
> +	remove_rmap_item_from_tree(rmap_item);
> +
>  	if (kpage) {
>  		err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
>  		if (!err) {
> @@ -1464,6 +1540,27 @@ static struct rmap_item *scan_get_next_r
>  		 */
>  		lru_add_drain_all();
>  
> +		/*
> +		 * Whereas stale stable_nodes on the stable_tree itself
> +		 * get pruned in the regular course of stable_tree_search(),
> +		 * those moved out to the migrate_nodes list can accumulate:
> +		 * so prune them once before each full scan.
> +		 */
> +		if (!ksm_merge_across_nodes) {
> +			struct stable_node *stable_node;
> +			struct list_head *this, *next;
> +			struct page *page;
> +
> +			list_for_each_safe(this, next, &migrate_nodes) {
> +				stable_node = list_entry(this,
> +						struct stable_node, list);
> +				page = get_ksm_page(stable_node, false);
> +				if (page)
> +					put_page(page);
> +				cond_resched();
> +			}
> +		}
> +
>  		for (nid = 0; nid < nr_node_ids; nid++)
>  			root_unstable_tree[nid] = RB_ROOT;
>  
> @@ -1586,8 +1683,7 @@ static void ksm_do_scan(unsigned int sca
>  		rmap_item = scan_get_next_rmap_item(&page);
>  		if (!rmap_item)
>  			return;
> -		if (!PageKsm(page) || !in_stable_tree(rmap_item))
> -			cmp_and_merge_page(page, rmap_item);
> +		cmp_and_merge_page(page, rmap_item);
>  		put_page(page);
>  	}
>  }
> @@ -1964,6 +2060,7 @@ static void ksm_check_stable_tree(unsign
>  				  unsigned long end_pfn)
>  {
>  	struct stable_node *stable_node;
> +	struct list_head *this, *next;
>  	struct rb_node *node;
>  	int nid;
>  
> @@ -1984,6 +2081,13 @@ static void ksm_check_stable_tree(unsign
>  			cond_resched();
>  		}
>  	}
> +	list_for_each_safe(this, next, &migrate_nodes) {
> +		stable_node = list_entry(this, struct stable_node, list);
> +		if (stable_node->kpfn >= start_pfn &&
> +		    stable_node->kpfn < end_pfn)
> +			remove_node_from_stable_tree(stable_node);
> +		cond_resched();
> +	}
>  }
>  
>  static int ksm_memory_callback(struct notifier_block *self,
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-28  0:41       ` Simon Jeons
@ 2013-01-28  3:44         ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-28  3:44 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Sun, 2013-01-27 at 15:12 -0800, Hugh Dickins wrote:
> > On Sat, 26 Jan 2013, Simon Jeons wrote:
> > > 
> > > Could you explain why need check page->mapping twice after get page?
> > 
> > Once for the !locked case, which should not return page if mapping changed.
> > Once for the locked case, which should not return page if mapping changed.
> > We could use "else", but that wouldn't be an improvement.
> 
> But for locked case, page->mapping will be check twice.

Thrice.

I'm beginning to wonder: you do realize that page->mapping is volatile,
from the point of view of get_ksm_page()?  That is the whole point of
why get_ksm_page() exists.

I can see that the word "volatile" is not obviously used here - it's
tucked away inside the ACCESS_ONCE() - but I thought the descriptions
of races and barriers made that obvious.

If the comments here haven't helped enough, please take a look at
git commit 4035c07a8959 "ksm: take keyhole reference to page".
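
To make that concrete, here is a minimal userspace sketch (the macro body
below is the usual include/linux/compiler.h definition of that era,
reproduced from memory, and struct fake_page is invented for the demo):
the volatile cast forces a fresh load of ->mapping at each check, which
is what get_ksm_page() depends on while another CPU may be freeing or
migrating the page underneath it.

#include <stdio.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

struct fake_page {
	void *mapping;			/* may change at any moment */
};

static int mapping_changed(struct fake_page *page, void *expected_mapping)
{
	/* re-read ->mapping on every call, as get_ksm_page() does */
	return ACCESS_ONCE(page->mapping) != expected_mapping;
}

int main(void)
{
	struct fake_page page = { .mapping = (void *)0x1234 };

	printf("changed? %d\n", mapping_changed(&page, (void *)0x1234));
	page.mapping = NULL;		/* simulate the page being freed */
	printf("changed? %d\n", mapping_changed(&page, (void *)0x1234));
	return 0;
}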

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-28  1:42       ` Simon Jeons
@ 2013-01-28  4:14         ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-28  4:14 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Sun, 2013-01-27 at 15:05 -0800, Hugh Dickins wrote:
> > On Sat, 26 Jan 2013, Simon Jeons wrote:
> > > > How can this happen?  We only permit switching merge_across_nodes when
> > > > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > > > ought to unmerge everything: yet oopses still occur when you then run 1.
> > > > 
> > > > Three causes:
> > > > 
> > > > 1. The old stable tree (built according to the inverse merge_across_nodes)
>                                                    ^^^^^^^^^^^^^^^^^^^^^
> How to understand inverse merge_across_nodes here?

How not to understand it?  Either it was 0 before (in which case there
were as many stable trees as NUMA nodes) and is being changed to 1 (in
which case there is to be only one stable tree), or it was 1 before
(for one) and is being changed to 0 (for many).
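
If an illustration helps, here is a tiny userspace model of the tree
selection (get_kpfn_nid() is paraphrased from memory and pfn_to_nid_demo()
is a stand-in for the real pfn_to_nid(), so treat this as a sketch only):
with merge_across_nodes set, every page indexes stable tree 0; with it
clear, each page indexes its own node's tree.  Whichever way you flip the
knob, nodes built under the other setting can be left behind in the wrong
tree, which is what "the inverse merge_across_nodes" refers to.

#include <stdio.h>

static int ksm_merge_across_nodes = 1;

static int pfn_to_nid_demo(unsigned long kpfn)
{
	return (int)(kpfn >> 20);	/* stand-in for the real pfn_to_nid() */
}

static int get_kpfn_nid_demo(unsigned long kpfn)
{
	return ksm_merge_across_nodes ? 0 : pfn_to_nid_demo(kpfn);
}

int main(void)
{
	unsigned long kpfn = 3UL << 20;		/* a pfn on "node 3" */

	printf("merge_across_nodes=1 -> tree %d\n", get_kpfn_nid_demo(kpfn));
	ksm_merge_across_nodes = 0;
	printf("merge_across_nodes=0 -> tree %d\n", get_kpfn_nid_demo(kpfn));
	return 0;
}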

> 
> > > > has not been fully torn down.  A stable node lingers until get_ksm_page()
> > > > notices that the page it references no longer references it: but the page
> 
> Do you mean page->mapping is NULL when call get_ksm_page()? Who clear it
> NULL?

I think I already pointed you to free_pages_prepare().

> 
> > > > is not necessarily freed as soon as expected, particularly when swapcache.
> 
> Why is not necessarily freed as soon as expected?

As I answered below.

> > > > 
> > > 
> > > When can this happen?  
> > 
> > Whenever there's an additional reference to the page, beyond those for
> > its ptes in userspace - swapcache for example, or pinned by get_user_pages.
> > That delays its being freed (arriving at the "page->mapping = NULL;"
> > in free_pages_prepare()).  Or it might simply be sitting in a pagevec,
> > waiting for that to be filled up, to be freed as part of a batch.

> > > mms forked will be unmerged just after ksmd's cursor since they're
> > > inserted behind it, why will be missing?
> > 
> > unmerge_and_remove_all_rmap_items() makes one pass through the list
> > from start to finish: insert behind the cursor and it will be missed.
> 
> Since mms forked will be insert just after ksmd's cursor, so it is the
> next which will be scan and unmerge, where I miss?

mms forked are normally inserted just behind (== before) ksmd's cursor,
as I've said in comments and explanations several times.

Simon, I've had enough: you clearly have much more time to spare for
asking questions than I have for answering them repeatedly: I would
rather spend my time attending to 100 higher priorities.

Please try much harder to work these things out for yourself from the
source (perhaps with help from kernelnewbies.org), before interrogating
linux-kernel and linux-mm developers.  Sometimes your questions may
help everybody to understand better, but often they just waste our time.

I'll happily admit that mm, and mm/ksm.c in particular, is not the easiest
place to start in understanding the kernel, nor I the best expositor.

Best wishes,
Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-28  2:12   ` Simon Jeons
@ 2013-01-28  4:19     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-28  4:19 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Sun, 27 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> > Switching merge_across_nodes after running KSM is liable to oops on stale
> > nodes still left over from the previous stable tree.  It's not something
> 
> Since this patch solve the problem, so the description of
> merge_across_nodes(Value can be changed only when there is no ksm shared
> pages in system) should be changed in this patch.

No.

The code could be changed to unmerge_and_remove_all_rmap_items()
automatically whenever merge_across_nodes is changed; but that's
not what Petr chose to do, and I didn't feel strongly to change it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
  2013-01-27  4:55   ` Simon Jeons
  2013-01-28  2:12   ` Simon Jeons
@ 2013-01-28  6:36   ` Simon Jeons
  2013-01-28 23:44   ` Andrew Morton
  2013-02-05 17:55   ` Mel Gorman
  4 siblings, 0 replies; 69+ messages in thread
From: Simon Jeons @ 2013-01-28  6:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, 2013-01-25 at 18:01 -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree.  It's not something
> that people will often want to do, but it would be lame to demand a reboot
> when they're trying to determine which merge_across_nodes setting is best.
> 
> How can this happen?  We only permit switching merge_across_nodes when
> pages_shared is 0, and usually set run 2 to force that beforehand, which
> ought to unmerge everything: yet oopses still occur when you then run 1.
> 
> Three causes:
> 
> 1. The old stable tree (built according to the inverse merge_across_nodes)
> has not been fully torn down.  A stable node lingers until get_ksm_page()
> notices that the page it references no longer references it: but the page
> is not necessarily freed as soon as expected, particularly when swapcache.
> 
> Fix this with a pass through the old stable tree, applying get_ksm_page()
> to each of the remaining nodes (most found stale and removed immediately),
> with forced removal of any left over.  Unless the page is still mapped:
> I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> and EBUSY than BUG.
> 
> 2. __ksm_enter() has a nice little optimization, to insert the new mm
> just behind ksmd's cursor, so there's a full pass for it to stabilize
> (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> but not so nice when we're trying to unmerge all mms: we were missing
> those mms forked and inserted behind the unmerge cursor.  Easily fixed
> by inserting at the end when KSM_RUN_UNMERGE.
> 
> 3. It is possible for a KSM page to be faulted back from swapcache into
> an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
> 
> A long outstanding, unrelated bugfix sneaks in with that third fix:
> ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
> I/O error when read in from swap) to a page which it then marks Uptodate.
> Fix this case by not copying, letting do_swap_page() discover the error.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/ksm.h |   18 ++-------
>  mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c         |   19 ++++-----
>  3 files changed, 92 insertions(+), 28 deletions(-)
> 
> --- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
> @@ -16,9 +16,6 @@
>  struct stable_node;
>  struct mem_cgroup;
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address);
> -
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
> @@ -73,15 +70,8 @@ static inline void set_page_stable_node(
>   * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
>   * but what if the vma was unmerged while the page was swapped out?
>   */
> -static inline int ksm_might_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address)
> -{
> -	struct anon_vma *anon_vma = page_anon_vma(page);
> -
> -	return anon_vma &&
> -		(anon_vma->root != vma->anon_vma->root ||
> -		 page->index != linear_page_index(vma, address));
> -}
> +struct page *ksm_might_need_to_copy(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address);
>  
>  int page_referenced_ksm(struct page *page,
>  			struct mem_cgroup *memcg, unsigned long *vm_flags);
> @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
>  	return 0;
>  }
>  
> -static inline int ksm_might_need_to_copy(struct page *page,
> +static inline struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> -	return 0;
> +	return page;
>  }
>  
>  static inline int page_referenced_ksm(struct page *page,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
>  /*
>   * Only called through the sysfs control interface:
>   */
> +static int remove_stable_node(struct stable_node *stable_node)
> +{
> +	struct page *page;
> +	int err;
> +
> +	page = get_ksm_page(stable_node, true);
> +	if (!page) {
> +		/*
> +		 * get_ksm_page did remove_node_from_stable_tree itself.
> +		 */
> +		return 0;
> +	}
> +
> +	if (WARN_ON_ONCE(page_mapped(page)))
> +		err = -EBUSY;
> +	else {
> +		/*
> +		 * This page might be in a pagevec waiting to be freed,
> +		 * or it might be PageSwapCache (perhaps under writeback),
> +		 * or it might have been removed from swapcache a moment ago.
> +		 */
> +		set_page_stable_node(page, NULL);
> +		remove_node_from_stable_tree(stable_node);
> +		err = 0;
> +	}
> +
> +	unlock_page(page);
> +	put_page(page);
> +	return err;
> +}
> +
> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;
> +				break;	/* proceed to next nid */

Why proceed to the next nid if we meet an unstale stable node in the
stable tree?  Then we still can't fully clean up the stale stable nodes.

> +			}
> +			cond_resched();
> +		}
> +	}
> +	return err;
> +}
> +
>  static int unmerge_and_remove_all_rmap_items(void)
>  {
>  	struct mm_slot *mm_slot;
> @@ -691,6 +742,8 @@ static int unmerge_and_remove_all_rmap_i
>  		}
>  	}
>  
> +	/* Clean up stable nodes, but don't worry if some are still busy */
> +	remove_all_stable_nodes();
>  	ksm_scan.seqnr = 0;
>  	return 0;
>  
> @@ -1586,11 +1639,19 @@ int __ksm_enter(struct mm_struct *mm)
>  	spin_lock(&ksm_mmlist_lock);
>  	insert_to_mm_slots_hash(mm, mm_slot);
>  	/*
> -	 * Insert just behind the scanning cursor, to let the area settle
> +	 * When KSM_RUN_MERGE (or KSM_RUN_STOP),
> +	 * insert just behind the scanning cursor, to let the area settle
>  	 * down a little; when fork is followed by immediate exec, we don't
>  	 * want ksmd to waste time setting up and tearing down an rmap_list.
> +	 *
> +	 * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
> +	 * scanning cursor, otherwise KSM pages in newly forked mms will be
> +	 * missed: then we might as well insert at the end of the list.
>  	 */
> -	list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> +	if (ksm_run & KSM_RUN_UNMERGE)
> +		list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
> +	else
> +		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
>  	spin_unlock(&ksm_mmlist_lock);
>  
>  	set_bit(MMF_VM_MERGEABLE, &mm->flags);
> @@ -1640,11 +1701,25 @@ void __ksm_exit(struct mm_struct *mm)
>  	}
>  }
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> +struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> +	struct anon_vma *anon_vma = page_anon_vma(page);
>  	struct page *new_page;
>  
> +	if (PageKsm(page)) {
> +		if (page_stable_node(page) &&
> +		    !(ksm_run & KSM_RUN_UNMERGE))
> +			return page;	/* no need to copy it */
> +	} else if (!anon_vma) {
> +		return page;		/* no need to copy it */
> +	} else if (anon_vma->root == vma->anon_vma->root &&
> +		 page->index == linear_page_index(vma, address)) {
> +		return page;		/* still no need to copy it */
> +	}
> +	if (!PageUptodate(page))
> +		return page;		/* let do_swap_page report the error */
> +
>  	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
>  	if (new_page) {
>  		copy_user_highpage(new_page, page, address, vma);
> @@ -2024,7 +2099,7 @@ static ssize_t merge_across_nodes_store(
>  
>  	mutex_lock(&ksm_thread_mutex);
>  	if (ksm_merge_across_nodes != knob) {
> -		if (ksm_pages_shared)
> +		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
>  		else
>  			ksm_merge_across_nodes = knob;
> --- mmotm.orig/mm/memory.c	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/mm/memory.c	2013-01-25 14:37:00.768206145 -0800
> @@ -2994,17 +2994,16 @@ static int do_swap_page(struct mm_struct
>  	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
>  		goto out_page;
>  
> -	if (ksm_might_need_to_copy(page, vma, address)) {
> -		swapcache = page;
> -		page = ksm_does_need_to_copy(page, vma, address);
> -
> -		if (unlikely(!page)) {
> -			ret = VM_FAULT_OOM;
> -			page = swapcache;
> -			swapcache = NULL;
> -			goto out_page;
> -		}
> +	swapcache = page;
> +	page = ksm_might_need_to_copy(page, vma, address);
> +	if (unlikely(!page)) {
> +		ret = VM_FAULT_OOM;
> +		page = swapcache;
> +		swapcache = NULL;
> +		goto out_page;
>  	}
> +	if (page == swapcache)
> +		swapcache = NULL;
>  
>  	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
>  		ret = VM_FAULT_OOM;
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
  2013-01-27  1:14   ` Simon Jeons
@ 2013-01-28 23:03   ` Andrew Morton
  2013-01-29  1:17     ` Hugh Dickins
  2013-01-28 23:08   ` Andrew Morton
  2013-02-05 16:41   ` Mel Gorman
  3 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2013-01-28 23:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, linux-kernel, linux-mm

On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +

The explanation doesn't really tell the operator whether or not to set
merge_across_nodes for a particular machine/workload.

I guess most people will just shrug, turn the thing on and see if it
improved things, but that's rather random.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
  2013-01-27  1:14   ` Simon Jeons
  2013-01-28 23:03   ` Andrew Morton
@ 2013-01-28 23:08   ` Andrew Morton
  2013-01-29  1:38     ` Hugh Dickins
  2013-02-05 16:41   ` Mel Gorman
  3 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2013-01-28 23:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, linux-kernel, linux-mm

On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;

I spose this should be __read_mostly.  If __read_mostly is not really a
synonym for __make_write_often_storage_slower.  I continue to harbor
fear, uncertainty and doubt about this...



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/11] ksm: trivial tidyups
  2013-01-26  1:58 ` [PATCH 3/11] ksm: trivial tidyups Hugh Dickins
@ 2013-01-28 23:11   ` Andrew Morton
  2013-01-29  1:44     ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2013-01-28 23:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

On Fri, 25 Jan 2013 17:58:11 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> +#ifdef CONFIG_NUMA
> +#define NUMA(x)		(x)
> +#define DO_NUMA(x)	(x)

Did we consider

	#define DO_NUMA(x) do { (x) } while (0)

?

That could avoid some nasty config-dependent compilation issues.

> +#else
> +#define NUMA(x)		(0)
> +#define DO_NUMA(x)	do { } while (0)
> +#endif

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
                     ` (2 preceding siblings ...)
  2013-01-28  6:36   ` Simon Jeons
@ 2013-01-28 23:44   ` Andrew Morton
  2013-01-29  2:03     ` Hugh Dickins
  2013-02-05 17:55   ` Mel Gorman
  4 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2013-01-28 23:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

On Fri, 25 Jan 2013 18:01:59 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;

It's a bit rude to overwrite remove_stable_node()'s return value.

> +				break;	/* proceed to next nid */
> +			}
> +			cond_resched();

Why is this here?

> +		}
> +	}
> +	return err;
> +}


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
                   ` (10 preceding siblings ...)
  2013-01-26  2:10 ` [PATCH 11/11] ksm: stop hotremove lockdep warning Hugh Dickins
@ 2013-01-28 23:54 ` Andrew Morton
  2013-01-29  0:49   ` Izik Eidus
  2013-01-29  1:07   ` Hugh Dickins
  11 siblings, 2 replies; 69+ messages in thread
From: Andrew Morton @ 2013-01-28 23:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, Mel Gorman, linux-kernel, linux-mm

On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Here's a KSM series

Sanity check: do you have a feeling for how useful KSM is? 
Performance/space improvements for typical (or atypical) workloads? 
Are people using it?  Successfully?

IOW, is it justifying itself?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-28 23:54 ` [PATCH 0/11] ksm: NUMA trees and page migration Andrew Morton
@ 2013-01-29  0:49   ` Izik Eidus
  2013-01-29  2:26     ` Izik Eidus
  2013-01-29  1:07   ` Hugh Dickins
  1 sibling, 1 reply; 69+ messages in thread
From: Izik Eidus @ 2013-01-29  0:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Petr Holasek, Andrea Arcangeli, Rik van Riel,
	David Rientjes, Anton Arapov, Mel Gorman, linux-kernel, linux-mm

On 01/29/2013 01:54 AM, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
>
>> Here's a KSM series
> Sanity check: do you have a feeling for how useful KSM is?
> Performance/space improvements for typical (or atypical) workloads?
> Are people using it?  Successfully?

Hi,
I think it is mostly used for virtualization; I know of at least two
products that use it -
RHEV (Red Hat Enterprise Virtualization), and my current place (Ravello
Systems), which uses it to do VM consolidation on top of cloud environments
(running multiple unmodified VMs on top of one VM you get from ec2 /
rackspace / whatever).  For Ravello it is highly critical in achieving
a high consolidation ratio...

>
> IOW, is it justifying itself?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-28 23:54 ` [PATCH 0/11] ksm: NUMA trees and page migration Andrew Morton
  2013-01-29  0:49   ` Izik Eidus
@ 2013-01-29  1:07   ` Hugh Dickins
  2013-01-29 10:45     ` Gleb Natapov
  1 sibling, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-01-29  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Marcelo Tosatti, Gleb Natapov, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, Rik van Riel, David Rientjes, Anton Arapov,
	Mel Gorman, linux-kernel, linux-mm, kvm

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > Here's a KSM series
> 
> Sanity check: do you have a feeling for how useful KSM is? 
> Performance/space improvements for typical (or atypical) workloads? 
> Are people using it?  Successfully?
> 
> IOW, is it justifying itself?

I have no idea!  To me it's simply a technical challenge - and I agree
with your implication that that's not a good enough justification.

I've added Marcelo and Gleb and the KVM list to the Cc:
my understanding is that it's the KVM guys who really appreciate KSM.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-28 23:03   ` Andrew Morton
@ 2013-01-29  1:17     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-29  1:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, linux-kernel, linux-mm

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> > @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
> >                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> >                     Default: 20 (chosen for demonstration purposes)
> >  
> > +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> > +                   When set to 0, ksm merges only pages which physically
> > +                   reside in the memory area of same NUMA node. It brings
> > +                   lower latency to access to shared page. Value can be
> > +                   changed only when there is no ksm shared pages in system.
> > +                   Default: 1
> > +
> 
> The explanation doesn't really tell the operator whether or not to set
> merge_across_nodes for a particular machine/workload.
> 
> I guess most people will just shrug, turn the thing on and see if it
> improved things, but that's rather random.

Right.  I don't think we can tell them which is going to be better,
but surely we could do a better job of hinting at the tradeoffs.

I think we expect large NUMA machines with lots of memory to want the
better NUMA behavior of !merge_across_nodes, but machines with more
limited memory across short-distance NUMA nodes to prefer the greater
deduplication of merge_across_nodes.

Petr, do you have a more informative text for this?

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-28 23:08   ` Andrew Morton
@ 2013-01-29  1:38     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-29  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, Rik van Riel,
	David Rientjes, Anton Arapov, linux-kernel, linux-mm

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> 
> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...

Could do.  No strong feeling, but I think I'd rather it share its
cacheline with other KSM-related stuff, than be off mixed up with
unrelateds.  I think there's a much stronger case for __read_mostly
when it's a library thing accessed by different subsystems.

You're right that this variable is accessed significantly more often
than the other KSM tunables, so deserves a __read_mostly more than
they do.  But where to stop?  Similar reluctance led me to avoid
using "unlikely" throughout ksm.c, unlikely as some conditions are
(I'm aghast to see that Andrea sneaked in a "likely" :).
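
For anyone following along, the mechanism itself is tiny (the definition
below is the usual x86 form from <asm/cache.h>, reproduced from memory;
other architectures may differ): __read_mostly only moves the variable
into a separate .data..read_mostly section, so the whole argument is
about which other data ends up sharing its cacheline.

#include <stdio.h>

#define __read_mostly __attribute__((__section__(".data..read_mostly")))

static unsigned int ksm_merge_across_nodes __read_mostly = 1;

int main(void)
{
	printf("merge_across_nodes = %u\n", ksm_merge_across_nodes);
	return 0;
}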

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/11] ksm: trivial tidyups
  2013-01-28 23:11   ` Andrew Morton
@ 2013-01-29  1:44     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-29  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:58:11 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > +#ifdef CONFIG_NUMA
> > +#define NUMA(x)		(x)
> > +#define DO_NUMA(x)	(x)
> 
> Did we consider
> 
> 	#define DO_NUMA(x) do { (x) } while (0)
> 
> ?

It didn't occur to me at all.  I like that it makes more sense of
the DO_NUMA variant.  Is it okay that, to work with the way I was
using it, we need "(x);" in there rather than just "(x)"?

> 
> That could avoid some nasty config-dependent compilation issues.
> 
> > +#else
> > +#define NUMA(x)		(0)

[PATCH] ksm: trivial tidyups fix

Suggested by akpm: make DO_NUMA(x) do { (x); } while (0) more like the #else.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/ksm.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- mmotm.org/mm/ksm.c	2013-01-27 09:55:45.000000000 -0800
+++ mmotm/mm/ksm.c	2013-01-28 16:50:25.772026446 -0800
@@ -43,7 +43,7 @@
 
 #ifdef CONFIG_NUMA
 #define NUMA(x)		(x)
-#define DO_NUMA(x)	(x)
+#define DO_NUMA(x)	do { (x); } while (0)
 #else
 #define NUMA(x)		(0)
 #define DO_NUMA(x)	do { } while (0)
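
As a small aside, a userspace toy of why the do { ... } while (0) form
is the safer shape for a statement-like macro (CONFIG_NUMA, demo_node and
the values are invented for the demo): DO_NUMA() is then a single
statement in both configurations, so it sits happily in an unbraced
if/else, and any attempt to use it as an expression fails to compile
whether or not CONFIG_NUMA is set, rather than only in the !NUMA build.

#include <stdio.h>

#define CONFIG_NUMA 1

#ifdef CONFIG_NUMA
#define NUMA(x)		(x)
#define DO_NUMA(x)	do { (x); } while (0)
#else
#define NUMA(x)		(0)
#define DO_NUMA(x)	do { } while (0)
#endif

struct demo_node {
	int nid;
};

int main(void)
{
	struct demo_node node = { .nid = -1 };
	int page_nid = 1;

	if (page_nid >= 0)
		DO_NUMA(node.nid = page_nid);	/* one statement, no stray ';' */
	else
		puts("no node");

	printf("nid used for tree lookup: %d\n", NUMA(node.nid));

	/* int bad = DO_NUMA(node.nid); -- would not compile in either config */
	return 0;
}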

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-28 23:44   ` Andrew Morton
@ 2013-01-29  2:03     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-01-29  2:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Andrea Arcangeli, Izik Eidus, linux-kernel, linux-mm

On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 18:01:59 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > +static int remove_all_stable_nodes(void)
> > +{
> > +	struct stable_node *stable_node;
> > +	int nid;
> > +	int err = 0;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		while (root_stable_tree[nid].rb_node) {
> > +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> > +						struct stable_node, node);
> > +			if (remove_stable_node(stable_node)) {
> > +				err = -EBUSY;
> 
> It's a bit rude to overwrite remove_stable_node()'s return value.

Well.... yes, but only the tiniest bit rude :)

> 
> > +				break;	/* proceed to next nid */
> > +			}
> > +			cond_resched();
> 
> Why is this here?

Because we don't have a limit on the length of this loop, and if
every node which remove_stable_node() finds is already stale, and
has no rmap_item still attached, then there would be no rescheduling
point in the unbounded loop without this one.  I was taught to worry
about bad latencies even in unpreemptible kernels.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-29  0:49   ` Izik Eidus
@ 2013-01-29  2:26     ` Izik Eidus
  2013-01-29 16:51       ` Andrea Arcangeli
  0 siblings, 1 reply; 69+ messages in thread
From: Izik Eidus @ 2013-01-29  2:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Petr Holasek, Andrea Arcangeli, Rik van Riel,
	David Rientjes, Anton Arapov, Mel Gorman, linux-kernel, linux-mm

On 01/29/2013 02:49 AM, Izik Eidus wrote:
> On 01/29/2013 01:54 AM, Andrew Morton wrote:
>> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
>> Hugh Dickins <hughd@google.com> wrote:
>>
>>> Here's a KSM series
>> Sanity check: do you have a feeling for how useful KSM is?
>> Performance/space improvements for typical (or atypical) workloads?
>> Are people using it?  Successfully?


BTW, after thinking a bit about the word "people", I wanted to see if
normal users of Linux, who just download and install it (without using
a special virtualization product), are able to use KSM.
So I googled a little for it, and found some nice results from users:
http://serverascode.com/2012/11/11/ksm-kvm.html

But I do agree that it provides justifying value only for virtualization
users...

>
> Hi,
> I think it mostly used for virtualization, I know at least two 
> products that it use -
> RHEV - RedHat enterprise virtualization, and my current place (Ravello 
> Systems) that use it to do vm consolidation on top of cloud enviorments
> (Run multiple unmodified VMs on top of one vm you get from ec2 / 
> rackspace / what so ever), for Ravello it is highly critical in 
> achieving high rate
> of consolidation ratio...
>
>>
>> IOW, is it justifying itself?
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-29  1:07   ` Hugh Dickins
@ 2013-01-29 10:45     ` Gleb Natapov
  0 siblings, 0 replies; 69+ messages in thread
From: Gleb Natapov @ 2013-01-29 10:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Marcelo Tosatti, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, Rik van Riel, David Rientjes, Anton Arapov,
	Mel Gorman, linux-kernel, linux-mm, kvm

On Mon, Jan 28, 2013 at 05:07:15PM -0800, Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Andrew Morton wrote:
> > On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > Here's a KSM series
> > 
> > Sanity check: do you have a feeling for how useful KSM is? 
> > Performance/space improvements for typical (or atypical) workloads? 
> > Are people using it?  Successfully?
> > 
> > IOW, is it justifying itself?
> 
> I have no idea!  To me it's simply a technical challenge - and I agree
> with your implication that that's not a good enough justification.
> 
> I've added Marcelo and Gleb and the KVM list to the Cc:
> my understanding is that it's the KVM guys who really appreciate KSM.
> 
KSM is used on all RH KVM deployments for memory overcommit. I asked
around for numbers and got the answer that it allows squeezing anywhere
between 10% and 100% more VMs onto the same machine, depending on the
type of guest OS and how similar the workloads of the VMs are. And
management tries to keep VMs with similar OSes/workloads on the same
host to gain more from KSM.

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-29  2:26     ` Izik Eidus
@ 2013-01-29 16:51       ` Andrea Arcangeli
  2013-01-31  0:05         ` Ric Mason
  0 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2013-01-29 16:51 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrew Morton, Hugh Dickins, Petr Holasek, Rik van Riel,
	David Rientjes, Anton Arapov, Mel Gorman, linux-kernel, linux-mm

Hi everyone,

On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote:
> On 01/29/2013 02:49 AM, Izik Eidus wrote:
> > On 01/29/2013 01:54 AM, Andrew Morton wrote:
> >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> >> Hugh Dickins <hughd@google.com> wrote:
> >>
> >>> Here's a KSM series
> >> Sanity check: do you have a feeling for how useful KSM is?
> >> Performance/space improvements for typical (or atypical) workloads?
> >> Are people using it?  Successfully?
> 
> 
> BTW, after thinking a bit about the word "people", I wanted to see if
> normal users of Linux, who just download and install it (without using
> a special virtualization product), are able to use KSM.
> So I googled a little for it, and found some nice results from users:
> http://serverascode.com/2012/11/11/ksm-kvm.html
> 
> But I do agree that it provides justifying value only for virtualization
> users...

Mostly for virtualization users indeed, but I'm aware of a few non
virtualization users too:

1) CERN has been one of the early adopters of KSM and initially they
were using KSM standalone (probably because not all hypervisors they
had to deal with were KVM/linux based, while all guests were linux and
in turn KSM capable). More info in the KSM paper page 2:

http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf

However lately they're running KSM in combination with KVM too, and I'm
not sure if they're still using it standalone. See the "KSM shared"
blue area in slide 12 and the comparison with KSM on and off in slide
14.

https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986

2) all recent CyanogenMod builds support KSM out of the box, via the
performance menu in Settings.  You can run it for a while and then shut
it off.

Not sure how good an idea it is to leave it always on, but the only
efficient cellphone/tablet powersaving design (i.e. the wakelocks +
suspend to RAM) still won't waste energy while the screen is off and
the phone is suspended to RAM, regardless of whether KSM is on or off.

KSM NUMA awareness however is not needed on the cellphone :).

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/11] ksm: NUMA trees and page migration
  2013-01-29 16:51       ` Andrea Arcangeli
@ 2013-01-31  0:05         ` Ric Mason
  0 siblings, 0 replies; 69+ messages in thread
From: Ric Mason @ 2013-01-31  0:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Andrew Morton, Hugh Dickins, Petr Holasek,
	Rik van Riel, David Rientjes, Anton Arapov, Mel Gorman,
	linux-kernel, linux-mm

On Tue, 2013-01-29 at 17:51 +0100, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Tue, Jan 29, 2013 at 04:26:13AM +0200, Izik Eidus wrote:
> > On 01/29/2013 02:49 AM, Izik Eidus wrote:
> > > On 01/29/2013 01:54 AM, Andrew Morton wrote:
> > >> On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> > >> Hugh Dickins <hughd@google.com> wrote:
> > >>
> > >>> Here's a KSM series
> > >> Sanity check: do you have a feeling for how useful KSM is?
> > >> Performance/space improvements for typical (or atypical) workloads?
> > >> Are people using it?  Successfully?
> > 
> > 
> > BTW, after thinking a bit about the word "people", I wanted to see if
> > normal users of Linux, who just download and install it (without using
> > a special virtualization product), are able to use KSM.
> > So I googled a little for it, and found some nice results from users:
> > http://serverascode.com/2012/11/11/ksm-kvm.html
> > 
> > But I do agree that it provides justifying value only for virtualization
> > users...
> 
> Mostly for virtualization users indeed, but I'm aware of a few non
> virtualization users too:
> 
> 1) CERN has been one of the early adopters of KSM and initially they
> were using KSM standalone (probably because not all hypervisors they
> had to deal with were KVM/linux based, while all guests were linux and
> in turn KSM capable). More info in the KSM paper page 2:
> 
> http://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf
> 
> However lately they're running KSM in combination with KVM too, and I'm
> not sure if they're still using it standalone. See the "KSM shared"
> blue area in slide 12 and the comparison with KSM on and off in slide
> 14.
> 
> https://indico.fnal.gov/getFile.py/access?contribId=18&sessionId=4&resId=0&materialId=slides&confId=4986
> 
> 2) all recent CyanogenMod builds support KSM out of the box, via the
> performance menu in Settings.  You can run it for a while and then shut
> it off.
> 
> Not sure how good an idea it is to leave it always on, but the only
> efficient cellphone/tablet powersaving design (i.e. the wakelocks +
> suspend to RAM) still won't waste energy while the screen is off and
> the phone is suspended to RAM, regardless of whether KSM is on or off.
> 
> KSM NUMA awareness however is not needed on the cellphone :).

Thanks for sharing. Is there a KSM benchmark? How can I get it?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
                     ` (2 preceding siblings ...)
  2013-01-28 23:08   ` Andrew Morton
@ 2013-02-05 16:41   ` Mel Gorman
  2013-02-07 23:57     ` Hugh Dickins
  3 siblings, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-05 16:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> From: Petr Holasek <pholasek@redhat.com>
> 
> Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which control merging pages across different numa nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
> 
> Typical use-case could be a lot of KVM guests on NUMA machine
> and cpus from more distant nodes would have significant increase
> of access latency to the merged ksm page. Sysfs knob was choosen
> for higher variability when some users still prefers higher amount
> of saved physical memory regardless of access latency.
> 

This is understandable but it's going to be a fairly obscure option.
I do not think it can be known in advance if the option should be set.
The user must either run benchmarks before and after or use perf to
record the "node-load-misses" event and see if setting the parameter
reduces the number of remote misses.

I don't know the internals of ksm.c at all and this is my first time reading
this series. Everything in this review is subject to being completely
wrong or due to a major misunderstanding on my part. Delete all feedback
if desired.

> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_across_nodes value is possible
> only when there are not any ksm shared pages in system.
> 
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times in percentage of average
> were following:
> 
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes	1.7%
> 
> merge_across_nodes=0
> 2 nodes	1%
> 4 nodes	0.32%
> 8 nodes	0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
> 
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>    on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>    here for !merge_across_nodes to migrate nodes to the proper tree.
> 
> Signed-off-by: Petr Holasek <pholasek@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  Documentation/vm/ksm.txt |    7 +
>  mm/ksm.c                 |  151 ++++++++++++++++++++++++++++++++-----
>  2 files changed, 139 insertions(+), 19 deletions(-)
> 
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +
>  run              - set 0 to stop ksmd from running but keep merged pages,
>                     set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>                     set 2 to stop ksmd and unmerge all pages currently merged,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>  	struct mm_struct *mm;
>  	unsigned long address;		/* + low bits used for flags below */
>  	unsigned int oldchecksum;	/* when unstable */
> +#ifdef CONFIG_NUMA
> +	unsigned int nid;
> +#endif
>  	union {
>  		struct rb_node node;	/* when node of unstable tree */
>  		struct {		/* when listed from stable tree */

> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG	0x200	/* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>  

With multiple stable node trees does the comment that begins with

 * A few notes about the KSM scanning process,
 * to make it easier to understand the data structures below:

need an update?

It's uninitialised so kernel data size in vmlinux should be unaffected but
it's an additional runtime cost of around 4K for a standardish enterprise
distro kernel config.  Small beans on a NUMA machine and maybe not worth
the hassle of kmalloc for nr_online_nodes and dealing with node memory
hotplug but it's a pity.

>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +

Nit but initialised data does increase the size of vmlinux so maybe this
should be the "opposite". i.e. rename it to ksm_merge_within_nodes and
default it to 0?

__read_mostly?

>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2
> @@ -441,10 +448,25 @@ out:		page = NULL;
>  	return page;
>  }
>  
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> +	if (ksm_merge_across_nodes)
> +		return 0;
> +	else
> +		return pfn_to_nid(kpfn);
> +}
> +

If we start with ksm_merge_across_nodes set, KSM runs for a while and
populates the stable node tree for node 0, and then ksm_merge_across_nodes
gets set to 0, badness happens because this can go anywhere

     nid = get_kpfn_nid(stable_node->kpfn);
     rb_erase(&stable_node->node, &root_stable_tree[nid]);

Very late in the review I noticed that you comment on this already in the
changelog and that it is addressed later in the series. I haven't seen
this patch yet so the following suggestion is very stale but might still
be relevant.

	We could increase the size of root_stable_tree[] by 1, have
	get_kpfn_nid() return MAX_NR_NODES if ksm_merge_across_nodes, and
	if ksm_merge_across_nodes gets set to 0 then walk the stable
	tree at root_stable_tree[MAX_NR_NODES] and delete the entire
	tree? It'd be disruptive as hell unfortunately and might break
	entirely if there is not enough memory to unshare the pages.

	Ideally we could take our time walking root_stable_tree[MAX_NR_NODES]
	without worrying about collisions and fix it up somehow. Dunno

>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>  	struct rmap_item *rmap_item;
>  	struct hlist_node *hlist;
> +	int nid;
>  
>  	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
> @@ -456,7 +478,9 @@ static void remove_node_from_stable_tree
>  		cond_resched();
>  	}
>  
> -	rb_erase(&stable_node->node, &root_stable_tree);
> +	nid = get_kpfn_nid(stable_node->kpfn);
> +
> +	rb_erase(&stable_node->node, &root_stable_tree[nid]);
>  	free_stable_node(stable_node);
>  }
>  
> @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s
>  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
>  		BUG_ON(age > 1);
>  		if (!age)
> -			rb_erase(&rmap_item->node, &root_unstable_tree);
> +#ifdef CONFIG_NUMA
> +			rb_erase(&rmap_item->node,
> +					&root_unstable_tree[rmap_item->nid]);
> +#else
> +			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> +#endif
>  

nit, does rmap_item->nid deserve a getter and setter helper instead?

#ifdef CONFIG_NUMA
static inline int rmap_item_nid(struct rmap_item *item)
{
	return item->nid;
}

static inline void set_rmap_item_nid(struct rmap_item *item, int nid)
{
	item->nid = nid;
}
#else
static inline int rmap_item_nid(struct rmap_item *item)
{
	return 0;
}

static inline void set_rmap_item_nid(struct rmap_item *item, int nid)
{
}
#endif
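
Purely illustrative: if helpers along those lines were adopted, the
#ifdef in the hunk quoted above would collapse to something like

		if (!age)
			rb_erase(&rmap_item->node,
				 &root_unstable_tree[rmap_item_nid(rmap_item)]);

which is the only point of the exercise.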
	

>  		ksm_pages_unshared--;
>  		rmap_item->address &= PAGE_MASK;
> @@ -990,8 +1019,9 @@ static struct page *try_to_merge_two_pag
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -	struct rb_node *node = root_stable_tree.rb_node;
> +	struct rb_node *node;
>  	struct stable_node *stable_node;
> +	int nid;
>  
>  	stable_node = page_stable_node(page);
>  	if (stable_node) {			/* ksm page forked */
> @@ -999,6 +1029,9 @@ static struct page *stable_tree_search(s
>  		return page;
>  	}
>  
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	node = root_stable_tree[nid].rb_node;
> +
>  	while (node) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1033,10 +1066,16 @@ static struct page *stable_tree_search(s
>   */
>  static struct stable_node *stable_tree_insert(struct page *kpage)
>  {
> -	struct rb_node **new = &root_stable_tree.rb_node;
> +	int nid;
> +	unsigned long kpfn;
> +	struct rb_node **new;
>  	struct rb_node *parent = NULL;
>  	struct stable_node *stable_node;
>  
> +	kpfn = page_to_pfn(kpage);
> +	nid = get_kpfn_nid(kpfn);
> +	new = &root_stable_tree[nid].rb_node;
> +
>  	while (*new) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1070,11 +1109,11 @@ static struct stable_node *stable_tree_i
>  		return NULL;
>  
>  	rb_link_node(&stable_node->node, parent, new);
> -	rb_insert_color(&stable_node->node, &root_stable_tree);
> +	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
>  	INIT_HLIST_HEAD(&stable_node->hlist);
>  
> -	stable_node->kpfn = page_to_pfn(kpage);
> +	stable_node->kpfn = kpfn;
>  	set_page_stable_node(kpage, stable_node);
>  
>  	return stable_node;
> @@ -1098,10 +1137,15 @@ static
>  struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>  					      struct page *page,
>  					      struct page **tree_pagep)
> -
>  {
> -	struct rb_node **new = &root_unstable_tree.rb_node;
> +	struct rb_node **new;
> +	struct rb_root *root;
>  	struct rb_node *parent = NULL;
> +	int nid;
> +
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	root = &root_unstable_tree[nid];
> +	new = &root->rb_node;
>  
>  	while (*new) {
>  		struct rmap_item *tree_rmap_item;
> @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
>  			return NULL;
>  		}
>  
> +		/*
> +		 * If tree_page has been migrated to another NUMA node, it
> +		 * will be flushed out and put into the right unstable tree
> +		 * next time: only merge with it if merge_across_nodes.
> +		 * Just notice, we don't have similar problem for PageKsm
> +		 * because their migration is disabled now. (62b61f611e)
> +		 */
> +		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> +			put_page(tree_page);
> +			return NULL;
> +		}
> +

What about this case?

1. ksm_merge_across_nodes==0
2. pages get placed on different unstable trees
3. ksm_merge_across_nodes==1

At that point we should be removing pages from the different unstable
tree and moving them to root_unstable_tree[0] but this put_page() doesn't
happen. Does it matter?

>  		ret = memcmp_pages(page, tree_page);
>  
>  		parent = *new;
> @@ -1139,8 +1195,11 @@ struct rmap_item *unstable_tree_search_i
>  
>  	rmap_item->address |= UNSTABLE_FLAG;
>  	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +#ifdef CONFIG_NUMA
> +	rmap_item->nid = nid;
> +#endif
>  	rb_link_node(&rmap_item->node, parent, new);
> -	rb_insert_color(&rmap_item->node, &root_unstable_tree);
> +	rb_insert_color(&rmap_item->node, root);
>  
>  	ksm_pages_unshared++;
>  	return NULL;
> @@ -1154,6 +1213,13 @@ struct rmap_item *unstable_tree_search_i
>  static void stable_tree_append(struct rmap_item *rmap_item,
>  			       struct stable_node *stable_node)
>  {
> +#ifdef CONFIG_NUMA
> +	/*
> +	 * Usually rmap_item->nid is already set correctly,
> +	 * but it may be wrong after switching merge_across_nodes.
> +	 */
> +	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
> +#endif
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1283,6 +1349,7 @@ static struct rmap_item *scan_get_next_r
>  	struct mm_slot *slot;
>  	struct vm_area_struct *vma;
>  	struct rmap_item *rmap_item;
> +	int nid;
>  
>  	if (list_empty(&ksm_mm_head.mm_list))
>  		return NULL;
> @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r
>  		 */
>  		lru_add_drain_all();
>  
> -		root_unstable_tree = RB_ROOT;
> +		for (nid = 0; nid < nr_node_ids; nid++)
> +			root_unstable_tree[nid] = RB_ROOT;
>  

Minor but you shouldn't need to reset them all if
ksm_merge_across_nodes==1

Initially this triggered an alarm because it's not immediately obvious
why you can just discard an rbtree like this. It looks like because the
unstable tree is also part of a linked list so the rb representation can
be reset quickly without leaking memory.
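
To make that concrete, a rough and untested sketch of the reset step (not
code from the series):

		/* only tree 0 is ever used while merging across nodes */
		if (ksm_merge_across_nodes)
			root_unstable_tree[0] = RB_ROOT;
		else
			for (nid = 0; nid < nr_node_ids; nid++)
				root_unstable_tree[nid] = RB_ROOT;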

>  		spin_lock(&ksm_mmlist_lock);
>  		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1770,15 +1838,19 @@ static struct stable_node *ksm_check_sta
>  						 unsigned long end_pfn)
>  {
>  	struct rb_node *node;
> +	int nid;
>  
> -	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> -		struct stable_node *stable_node;
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		for (node = rb_first(&root_stable_tree[nid]); node;
> +				node = rb_next(node)) {
> +			struct stable_node *stable_node;
> +
> +			stable_node = rb_entry(node, struct stable_node, node);
> +			if (stable_node->kpfn >= start_pfn &&
> +			    stable_node->kpfn < end_pfn)
> +				return stable_node;
> +		}
>  
> -		stable_node = rb_entry(node, struct stable_node, node);
> -		if (stable_node->kpfn >= start_pfn &&
> -		    stable_node->kpfn < end_pfn)
> -			return stable_node;
> -	}
>  	return NULL;
>  }
>  
> @@ -1925,6 +1997,40 @@ static ssize_t run_store(struct kobject
>  }
>  KSM_ATTR(run);
>  
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> +				   struct kobj_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long knob;
> +
> +	err = kstrtoul(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	if (knob > 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	if (ksm_merge_across_nodes != knob) {
> +		if (ksm_pages_shared)
> +			err = -EBUSY;
> +		else
> +			ksm_merge_across_nodes = knob;
> +	}
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return err ? err : count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>  				 struct kobj_attribute *attr, char *buf)
>  {
> @@ -1979,6 +2085,9 @@ static struct attribute *ksm_attrs[] = {
>  	&pages_unshared_attr.attr,
>  	&pages_volatile_attr.attr,
>  	&full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&merge_across_nodes_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> @@ -1992,11 +2101,15 @@ static int __init ksm_init(void)
>  {
>  	struct task_struct *ksm_thread;
>  	int err;
> +	int nid;
>  
>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
>  
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		root_stable_tree[nid] = RB_ROOT;
> +
>  	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
>  	if (IS_ERR(ksm_thread)) {
>  		printk(KERN_ERR "ksm: creating kthread failed\n");
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
  2013-01-26  1:59 ` [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Hugh Dickins
@ 2013-02-05 16:48   ` Mel Gorman
  2013-02-08  0:07     ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-05 16:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote:
> Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
> (restarting whenever it finds a stale node to remove), but rearrange
> so that at least it does not needlessly restart from nid 0 each time.
> And add a couple of comments: here is why we keep pfn instead of page.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   38 ++++++++++++++++++++++----------------
>  1 file changed, 22 insertions(+), 16 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> -						 unsigned long end_pfn)
> +static void ksm_check_stable_tree(unsigned long start_pfn,
> +				  unsigned long end_pfn)
>  {
> +	struct stable_node *stable_node;
>  	struct rb_node *node;
>  	int nid;
>  
> -	for (nid = 0; nid < nr_node_ids; nid++)
> -		for (node = rb_first(&root_stable_tree[nid]); node;
> -				node = rb_next(node)) {
> -			struct stable_node *stable_node;
> -
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		node = rb_first(&root_stable_tree[nid]);
> +		while (node) {

This is not your fault, the old code is wrong too. It is assuming that all
nodes are populated in numeric order with no holes. It won't work if just
two nodes 0 and 4 are online. It should be using for_each_online_node().
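
For concreteness, the loop shape that suggestion implies - a sketch only,
reusing the names from the hunk above and eliding the unchanged body:

	for_each_online_node(nid) {
		node = rb_first(&root_stable_tree[nid]);
		while (node) {
			/* ... same per-node body as in the hunk above ... */
		}
	}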

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
  2013-01-27  2:36   ` Simon Jeons
  2013-01-27  2:48   ` Simon Jeons
@ 2013-02-05 17:18   ` Mel Gorman
  2013-02-08  0:33     ` Hugh Dickins
  2 siblings, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-05 17:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote:
> In some places where get_ksm_page() is used, we need the page to be locked.
> 
> When KSM migration is fully enabled, we shall want that to make sure that
> the page just acquired cannot be migrated beneath us (raised page count is
> only effective when there is serialization to make sure migration notices).
> Whereas when navigating through the stable tree, we certainly do not want
> to lock each node (raised page count is enough to guarantee the memcmps,
> even if page is migrated to another node).
> 
> Since we're about to add another use case, add the locked argument to
> get_ksm_page() now.
> 
> Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
> really got the wrong end of the stick on that!  There's a configuration
> in which page_cache_get_speculative() can do something cheaper than
> get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
> disabled preemption for it.  There's no need for rcu_read_lock() around
> get_page_unless_zero() (and mapping checks) here.  Cut out that
> silliness before making this any harder to understand.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
>   * but this is different - made simpler by ksm_thread_mutex being held, but
>   * interesting for assuming that no other use of the struct page could ever
>   * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).  The RCU calls are not for KSM at all, but
> - * to keep the page_count protocol described with page_cache_get_speculative.
> + * coincides with page->mapping).
>   *
>   * Note: it is possible that get_ksm_page() will return NULL one moment,
>   * then page the next, if the page is in between page_freeze_refs() and
>   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
>   * is on its way to being freed; but it is an anomaly to bear in mind.
>   */
> -static struct page *get_ksm_page(struct stable_node *stable_node)
> +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
>  {

The naming is unhelpful :(

Because the second parameter is called "locked", it implies that the
caller of this function holds the page lock (which is obviously very
silly). ret_locked maybe?

As the function is akin to find_lock_page I would  prefer if there was
a new get_lock_ksm_page() instead of locking depending on the value of a
parameter. We can do this because expected_mapping is recorded by the
stable_node and we only need to recalculate it if the page has been
successfully pinned. We calculate the expected value twice but that's
not earth shattering. It'd look something like;

/*
 * get_lock_ksm_page: Similar to get_ksm_page except returns with page
 * locked and pinned
 */
static struct page *get_lock_ksm_page(struct stable_node *stable_node)
{
	struct page *page = get_ksm_page(stable_node);

	if (page) {
		void *expected_mapping = (void *)stable_node +
				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
		lock_page(page);
		if (page->mapping != expected_mapping) {
			unlock_page(page);

			/* release pin taken by get_ksm_page() */
			put_page(page);
			page = NULL;
		}
	}

	return page;
}

Up to you, I'm not going to make a big deal of it.

FWIW, I agree that removing rcu_read_lock() is fine.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
                     ` (3 preceding siblings ...)
  2013-01-28 23:44   ` Andrew Morton
@ 2013-02-05 17:55   ` Mel Gorman
  2013-02-08 19:33     ` Hugh Dickins
  4 siblings, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-05 17:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote:
> Switching merge_across_nodes after running KSM is liable to oops on stale
> nodes still left over from the previous stable tree.  It's not something
> that people will often want to do, but it would be lame to demand a reboot
> when they're trying to determine which merge_across_nodes setting is best.
> 
> How can this happen?  We only permit switching merge_across_nodes when
> pages_shared is 0, and usually set run 2 to force that beforehand, which
> ought to unmerge everything: yet oopses still occur when you then run 1.
> 

When reviewing patch 1, I missed that the pages_shared check would prevent
most of the problems I was envisioning with leftover entries in the
stable tree. Sorry about that.

> Three causes:
> 
> 1. The old stable tree (built according to the inverse merge_across_nodes)
> has not been fully torn down.  A stable node lingers until get_ksm_page()
> notices that the page it references no longer references it: but the page
> is not necessarily freed as soon as expected, particularly when swapcache.
> 
> Fix this with a pass through the old stable tree, applying get_ksm_page()
> to each of the remaining nodes (most found stale and removed immediately),
> with forced removal of any left over.  Unless the page is still mapped:
> I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> and EBUSY than BUG.
> 
> 2. __ksm_enter() has a nice little optimization, to insert the new mm
> just behind ksmd's cursor, so there's a full pass for it to stabilize
> (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> but not so nice when we're trying to unmerge all mms: we were missing
> those mms forked and inserted behind the unmerge cursor.  Easily fixed
> by inserting at the end when KSM_RUN_UNMERGE.
> 
> 3. It is possible for a KSM page to be faulted back from swapcache into
> an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
> 
> A long outstanding, unrelated bugfix sneaks in with that third fix:
> ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
> I/O error when read in from swap) to a page which it then marks Uptodate.
> Fix this case by not copying, letting do_swap_page() discover the error.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/ksm.h |   18 ++-------
>  mm/ksm.c            |   83 +++++++++++++++++++++++++++++++++++++++---
>  mm/memory.c         |   19 ++++-----
>  3 files changed, 92 insertions(+), 28 deletions(-)
> 
> --- mmotm.orig/include/linux/ksm.h	2013-01-25 14:27:58.220193250 -0800
> +++ mmotm/include/linux/ksm.h	2013-01-25 14:37:00.764206145 -0800
> @@ -16,9 +16,6 @@
>  struct stable_node;
>  struct mem_cgroup;
>  
> -struct page *ksm_does_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address);
> -
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
> @@ -73,15 +70,8 @@ static inline void set_page_stable_node(
>   * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
>   * but what if the vma was unmerged while the page was swapped out?
>   */
> -static inline int ksm_might_need_to_copy(struct page *page,
> -			struct vm_area_struct *vma, unsigned long address)
> -{
> -	struct anon_vma *anon_vma = page_anon_vma(page);
> -
> -	return anon_vma &&
> -		(anon_vma->root != vma->anon_vma->root ||
> -		 page->index != linear_page_index(vma, address));
> -}
> +struct page *ksm_might_need_to_copy(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address);
>  
>  int page_referenced_ksm(struct page *page,
>  			struct mem_cgroup *memcg, unsigned long *vm_flags);
> @@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
>  	return 0;
>  }
>  
> -static inline int ksm_might_need_to_copy(struct page *page,
> +static inline struct page *ksm_might_need_to_copy(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> -	return 0;
> +	return page;
>  }
>  
>  static inline int page_referenced_ksm(struct page *page,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
>  /*
>   * Only called through the sysfs control interface:
>   */
> +static int remove_stable_node(struct stable_node *stable_node)
> +{
> +	struct page *page;
> +	int err;
> +
> +	page = get_ksm_page(stable_node, true);
> +	if (!page) {
> +		/*
> +		 * get_ksm_page did remove_node_from_stable_tree itself.
> +		 */
> +		return 0;
> +	}
> +
> +	if (WARN_ON_ONCE(page_mapped(page)))
> +		err = -EBUSY;
> +	else {
> +		/*

It will probably be very obvious to people familiar with ksm.c but even
so maybe remind the reader that the pages must already have been unmerged

* This page must already have been unmerged and should be stale.
* It might be in a pagevec waiting to be freed or it might be
......



> +		 * This page might be in a pagevec waiting to be freed,
> +		 * or it might be PageSwapCache (perhaps under writeback),
> +		 * or it might have been removed from swapcache a moment ago.
> +		 */
> +		set_page_stable_node(page, NULL);
> +		remove_node_from_stable_tree(stable_node);
> +		err = 0;
> +	}
> +
> +	unlock_page(page);
> +	put_page(page);
> +	return err;
> +}
> +
> +static int remove_all_stable_nodes(void)
> +{
> +	struct stable_node *stable_node;
> +	int nid;
> +	int err = 0;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		while (root_stable_tree[nid].rb_node) {
> +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> +						struct stable_node, node);
> +			if (remove_stable_node(stable_node)) {
> +				err = -EBUSY;
> +				break;	/* proceed to next nid */
> +			}

If remove_stable_node() returns an error then it's quite possible that it'll
go boom when that page is encountered later, but it's not guaranteed. It'd
be better to carry on, best-effort, and remove as many of the stable nodes
as possible anyway.
We're in trouble either way of course.
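
For what it's worth, a rough and untested sketch of that best-effort
variant (not what the patch does), reusing the stable_node and err
declarations from the function above; taking rb_next() before the removal
attempt keeps the saved pointer valid whether or not remove_stable_node()
erases the current node:

	for (nid = 0; nid < nr_node_ids; nid++) {
		struct rb_node *node = rb_first(&root_stable_tree[nid]);

		while (node) {
			struct rb_node *next = rb_next(node);

			stable_node = rb_entry(node, struct stable_node, node);
			if (remove_stable_node(stable_node))
				err = -EBUSY;	/* note it, but keep going */
			node = next;
			cond_resched();
		}
	}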

Otherwise I didn't spot a problem, so, weak as it is given my limited
familiarity with KSM:

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-01-26  2:03 ` [PATCH 7/11] ksm: make KSM page migration possible Hugh Dickins
  2013-01-27  5:47   ` Simon Jeons
@ 2013-02-05 19:11   ` Mel Gorman
  2013-02-08 20:52     ` Hugh Dickins
  1 sibling, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-05 19:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote:
> KSM page migration is already supported in the case of memory hotremove,
> which takes the ksm_thread_mutex across all its migrations to keep life
> simple.
> 
> But the new KSM NUMA merge_across_nodes knob introduces a problem, when
> it's set to non-default 0: if a KSM page is migrated to a different NUMA
> node, how do we migrate its stable node to the right tree?  And what if
> that collides with an existing stable node?
> 
> So far there's no provision for that, and this patch does not attempt
> to deal with it either.  But how will I test a solution, when I don't
> know how to hotremove memory? 

Just reach in and yank it straight out with a chisel.

> The best answer is to enable KSM page
> migration in all cases now, and test more common cases.  With THP and
> compaction added since KSM came in, page migration is now mainstream,
> and it's a shame that a KSM page can frustrate freeing a page block.
> 

THP will at least check if migration within a node works. It won't
necessarily check we can migrate across nodes properly but it's a lot
better than nothing.

> Without worrying about merge_across_nodes 0 for now, this patch gets
> KSM page migration working reliably for default merge_across_nodes 1
> (but leave the patch enabling it until near the end of the series).
> 
> It's much simpler than I'd originally imagined, and does not require
> an additional tier of locking: page migration relies on the page lock,
> KSM page reclaim relies on the page lock, the page lock is enough for
> KSM page migration too.
> 
> Almost all the care has to be in get_ksm_page(): that's the function
> which worries about when a stable node is stale and should be freed,
> now it also has to worry about the KSM page being migrated.
> 
> The only new overhead is an additional put/get/lock/unlock_page when
> stable_tree_search() arrives at a matching node: to make sure migration
> respects the raised page count, and so does not migrate the page while
> we're busy with it here.  That's probably avoidable, either by changing
> internal interfaces from using kpage to stable_node, or by moving the
> ksm_migrate_page() callsite into a page_freeze_refs() section (even if
> not swapcache); but this works well, I've no urge to pull it apart now.
> 
> (Descents of the stable tree may pass through nodes whose KSM pages are
> under migration: being unlocked, the raised page count does not prevent
> that, nor need it: it's safe to memcmp against either old or new page.)
> 
> You might worry about mremap, and whether page migration's rmap_walk
> to remove migration entries will find all the KSM locations where it
> inserted earlier: that should already be handled, by the satisfyingly
> heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c     |   94 ++++++++++++++++++++++++++++++++++++++-----------
>  mm/migrate.c |    5 ++
>  2 files changed, 77 insertions(+), 22 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:37:03.832206218 -0800
> @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree
>   * In which case we can trust the content of the page, and it
>   * returns the gotten page; but if the page has now been zapped,
>   * remove the stale node from the stable tree and return NULL.
> + * But beware, the stable node's page might be being migrated.
>   *
>   * You would expect the stable_node to hold a reference to the ksm page.
>   * But if it increments the page's count, swapping out has to wait for
> @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree
>   * pointing back to this stable node.  This relies on freeing a PageAnon
>   * page to reset its page->mapping to NULL, and relies on no other use of
>   * a page to put something that might look like our key in page->mapping.
> - *
> - * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
> - * but this is different - made simpler by ksm_thread_mutex being held, but
> - * interesting for assuming that no other use of the struct page could ever
> - * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).
> - *
> - * Note: it is possible that get_ksm_page() will return NULL one moment,
> - * then page the next, if the page is in between page_freeze_refs() and
> - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
>   * is on its way to being freed; but it is an anomaly to bear in mind.
>   */
>  static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
>  {
>  	struct page *page;
>  	void *expected_mapping;
> +	unsigned long kpfn;
>  
> -	page = pfn_to_page(stable_node->kpfn);
>  	expected_mapping = (void *)stable_node +
>  				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> -	if (page->mapping != expected_mapping)
> -		goto stale;
> -	if (!get_page_unless_zero(page))
> +again:
> +	kpfn = ACCESS_ONCE(stable_node->kpfn);
> +	page = pfn_to_page(kpfn);
> +

Ok.

There should be no concern that hot-remove made the kpfn invalid because
those stable tree entries should have been discarded.

> +	/*
> +	 * page is computed from kpfn, so on most architectures reading
> +	 * page->mapping is naturally ordered after reading node->kpfn,
> +	 * but on Alpha we need to be more careful.
> +	 */
> +	smp_read_barrier_depends();

The value of page is data dependent on pfn_to_page(). Is it really possible
for that to be re-ordered even on Alpha?

> +	if (ACCESS_ONCE(page->mapping) != expected_mapping)
>  		goto stale;
> -	if (page->mapping != expected_mapping) {
> +
> +	/*
> +	 * We cannot do anything with the page while its refcount is 0.
> +	 * Usually 0 means free, or tail of a higher-order page: in which
> +	 * case this node is no longer referenced, and should be freed;
> +	 * however, it might mean that the page is under page_freeze_refs().
> +	 * The __remove_mapping() case is easy, again the node is now stale;
> +	 * but if page is swapcache in migrate_page_move_mapping(), it might
> +	 * still be our page, in which case it's essential to keep the node.
> +	 */
> +	while (!get_page_unless_zero(page)) {
> +		/*
> +		 * Another check for page->mapping != expected_mapping would
> +		 * work here too.  We have chosen the !PageSwapCache test to
> +		 * optimize the common case, when the page is or is about to
> +		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
> +		 * in the freeze_refs section of __remove_mapping(); but Anon
> +		 * page->mapping reset to NULL later, in free_pages_prepare().
> +		 */
> +		if (!PageSwapCache(page))
> +			goto stale;
> +		cpu_relax();
> +	}

The recheck of stable_node->kpfn after a barrier distinguishes between
a free and a completed migration, that's fine. I hesitate to ask because
it must be obvious, but where is the guarantee that a KSM page is in the
swap cache?

> +
> +	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
>  		put_page(page);
>  		goto stale;
>  	}
> +
>  	if (locked) {
>  		lock_page(page);
> -		if (page->mapping != expected_mapping) {
> +		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
>  			unlock_page(page);
>  			put_page(page);
>  			goto stale;
>  		}
>  	}
>  	return page;
> +
>  stale:
> +	/*
> +	 * We come here from above when page->mapping or !PageSwapCache
> +	 * suggests that the node is stale; but it might be under migration.
> +	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
> +	 * before checking whether node->kpfn has been changed.
> +	 */
> +	smp_rmb();
> +	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
> +		goto again;
>  	remove_node_from_stable_tree(stable_node);
>  	return NULL;
>  }
> @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s
>  			return NULL;
>  
>  		ret = memcmp_pages(page, tree_page);
> +		put_page(tree_page);
>  
> -		if (ret < 0) {
> -			put_page(tree_page);
> +		if (ret < 0)
>  			node = node->rb_left;
> -		} else if (ret > 0) {
> -			put_page(tree_page);
> +		else if (ret > 0)
>  			node = node->rb_right;
> -		} else
> +		else {
> +			/*
> +			 * Lock and unlock the stable_node's page (which
> +			 * might already have been migrated) so that page
> +			 * migration is sure to notice its raised count.
> +			 * It would be more elegant to return stable_node
> +			 * than kpage, but that involves more changes.
> +			 */
> +			tree_page = get_ksm_page(stable_node, true);
> +			if (tree_page)
> +				unlock_page(tree_page);
>  			return tree_page;
> +		}
>  	}
>  
>  	return NULL;
> @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
>  	if (stable_node) {
>  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
>  		stable_node->kpfn = page_to_pfn(newpage);
> +		/*
> +		 * newpage->mapping was set in advance; now we need smp_wmb()
> +		 * to make sure that the new stable_node->kpfn is visible
> +		 * to get_ksm_page() before it can see that oldpage->mapping
> +		 * has gone stale (or that PageSwapCache has been cleared).
> +		 */
> +		smp_wmb();
> +		set_page_stable_node(oldpage, NULL);
>  	}
>  }
>  #endif /* CONFIG_MIGRATION */
> --- mmotm.orig/mm/migrate.c	2013-01-25 14:27:58.140193249 -0800
> +++ mmotm/mm/migrate.c	2013-01-25 14:37:03.832206218 -0800
> @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp
>  
>  	mlock_migrate_page(newpage, page);
>  	ksm_migrate_page(newpage, page);
> -
> +	/*
> +	 * Please do not reorder this without considering how mm/ksm.c's
> +	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
> +	 */
>  	ClearPageSwapCache(page);
>  	ClearPagePrivate(page);
>  	set_page_private(page, 0);
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/11] ksm: allow trees per NUMA node
  2013-02-05 16:41   ` Mel Gorman
@ 2013-02-07 23:57     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-02-07 23:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	Rik van Riel, David Rientjes, Anton Arapov, linux-kernel,
	linux-mm

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> > From: Petr Holasek <pholasek@redhat.com>
> > 
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which control merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> > 
> > Typical use-case could be a lot of KVM guests on NUMA machine
> > and cpus from more distant nodes would have significant increase
> > of access latency to the merged ksm page. Sysfs knob was choosen
> > for higher variability when some users still prefers higher amount
> > of saved physical memory regardless of access latency.
> > 
> 
> This is understandable but it's going to be a fairly obscure option.
> I do not think it can be known in advance if the option should be set.
> The user must either run benchmarks before and after or use perf to
> record the "node-load-misses" event and see if setting the parameter
> reduces the number of remote misses.

Andrew made a similar point on the description of merge_across_nodes
in ksm.txt.  Petr's quiet at the moment, so I'll add a few more lines
to that description (in an incremental patch): but be assured what I say
will remain inadequate and unspecific - I don't have much idea of how to
decide the setting, but assume that the people who are interested in
using the knob will have a firmer idea of how to test for it.

> 
> I don't know the internals of ksm.c at all and this is my first time reading
> this series. Everything in this review is subject to being completely
> wrong or due to a major misunderstanding on my part. Delete all feedback
> if desired.

Thank you for spending your time on it.

[...snippings, but let's leave this paragraph in]

> > Hugh notes that this patch brings two problems, whose solution needs
> > further support in mm/ksm.c, which follows in subsequent patches:
> > 1) switching merge_across_nodes after running KSM is liable to oops
> >    on stale nodes still left over from the previous stable tree;
> > 2) memory hotremove may migrate KSM pages, but there is no provision
> >    here for !merge_across_nodes to migrate nodes to the proper tree.
...
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
...
> 
> With multiple stable node trees does the comment that begins with
> 
>  * A few notes about the KSM scanning process,
>  * to make it easier to understand the data structures below:
> 
> need an update?

Okay: I won't go through it pluralizing everything, but a couple of lines
on the !merge_across_nodes multiplicity of trees would be helpful.

> 
> It's uninitialised so kernel data size in vmlinux should be unaffected but
> it's an additional runtime cost of around 4K for a standardish enterprise
> distro kernel config.  Small beans on a NUMA machine and maybe not worth
> the hassle of kmalloc for nr_online_nodes and dealing with node memory
> hotplug but it's a pity.

It's a pity, I agree; as is the addition of int nid into rmap_item
on 32-bit (on 64-bit it just occupies a hole) - there can be a lot of
those.  We were kind of hoping that the #ifdef CONFIG_NUMA would cover
it, but some distros now enable NUMA by default even on 32-bit.  And
it's a pity because 99% of users will leave merge_across_nodes at its
default of 1 and only ever need a single tree of each kind.

I'll look into starting off with just root_stable_tree[1] and
root_unstable_tree[1], then kmalloc'ing nr_node_ids of them when and if
merge_across_nodes is switched off.  Then I don't think we need bother
about hotplug.  If it ends up looking clean enough, I'll add that patch.
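
For illustration only, a sketch of the shape that might take (not the
actual follow-up patch): the MAX_NUMNODES arrays become pointers which
default to single-entry static trees, with made-up names like
one_stable_tree:

static struct rb_root one_stable_tree[1] = { RB_ROOT };
static struct rb_root one_unstable_tree[1] = { RB_ROOT };
static struct rb_root *root_stable_tree = one_stable_tree;
static struct rb_root *root_unstable_tree = one_unstable_tree;

	/* in merge_across_nodes_store(), once pages_shared is seen to be 0 */
	if (!knob && root_stable_tree == one_stable_tree) {
		struct rb_root *buf;

		buf = kcalloc(nr_node_ids + nr_node_ids,
			      sizeof(*buf), GFP_KERNEL);
		if (!buf)
			err = -ENOMEM;
		else {
			/* RB_ROOT is all zeroes, so kcalloc has already
			 * initialized both halves of the buffer */
			root_stable_tree = buf;
			root_unstable_tree = buf + nr_node_ids;
		}
	}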

> 
> >  #define MM_SLOTS_HASH_BITS 10
> >  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
> >  /* Milliseconds ksmd should sleep between batches */
> >  static unsigned int ksm_thread_sleep_millisecs = 20;
> >  
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> > +
> 
> Nit but initialised data does increase the size of vmlinux so maybe this
> should be the "opposite". i.e. rename it to ksm_merge_within_nodes and
> default it to 0?

I don't find that particular increase in size very compelling!  Though
I would have preferred the tunable to be the opposite way around: it
annoys me that the new code comes into play when !ksm_merge_across_nodes.

However, I do find "merge across nodes" (thanks to Andrew for "across")
a much more vivid description than the opposite "merge within nodes",
and can't think of a better alternative for that; and wouldn't want to
change it anyway at this late (v7) stage, not without Petr's consent.

> 
> __read_mostly?

I feel the same way as I did when Andrew suggested it:
> 
> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...

Could do.  No strong feeling, but I think I'd rather it share its
cacheline with other KSM-related stuff, than be off mixed up with
unrelateds.  I think there's a much stronger case for __read_mostly
when it's a library thing accessed by different subsystems.

You're right that this variable is accessed significantly more often
than the other KSM tunables, so deserves a __read_mostly more than
they do.  But where to stop?  Similar reluctance led me to avoid
using "unlikely" throughout ksm.c, unlikely as some conditions are
(I'm aghast to see that Andrea sneaked in a "likely" :).

> 
> >  #define KSM_RUN_STOP	0
> >  #define KSM_RUN_MERGE	1
> >  #define KSM_RUN_UNMERGE	2
> > @@ -441,10 +448,25 @@ out:		page = NULL;
> >  	return page;
> >  }
> >  
> > +/*
> > + * This helper is used for getting right index into array of tree roots.
> > + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> > + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> > + * every node has its own stable and unstable tree.
> > + */
> > +static inline int get_kpfn_nid(unsigned long kpfn)
> > +{
> > +	if (ksm_merge_across_nodes)
> > +		return 0;
> > +	else
> > +		return pfn_to_nid(kpfn);
> > +}
> > +
> 
> If we start with ksm_merge_across_nodes, KSM runs for a while and populates
> the stable node tree for node 0 and then ksm_merge_across_nodes gets set
> then badness happens because this can go anywhere
> 
>      nid = get_kpfn_nid(stable_node->kpfn);
>      rb_erase(&stable_node->node, &root_stable_tree[nid]);
> 
> Very late in the review I noticed that you comment on this already in the
> changelog and that it is addressed later in the series. I haven't seen

Yes.  Nobody's git bisection will be thwarted by this defect, so I'm
happy for Petr's patch to go in as is first, then fix applied after.
And even in this patch, there's already a pages_shared 0 test: which
is inadequate, but covers the common case.

> this patch yet so the following suggestion is very stale but might still
> be relevant.
> 
> 	We could increase size of root_stable_node[] by 1, have
> 	get_kpfn_nid return MAX_NR_NODES if ksm_merge_across_nodes and
> 	if ksm_merge_across_nodes gets set to 0 then we walk the stable
> 	tree at root_stable_tree[MAX_NR_NODES] and delete the entire
> 	tree? It's be disruptive as hell unfortunately and might break
> 	entirely if there is not enough memory to unshare the pages.
> 
> 	Ideally we could take our time walking root_stable_tree[MAX_NR_NODES]
> 	without worrying about collisions and fix it up somehow. Dunno

Petr's intention was that we just be disruptive, and insist on the old
tree being torn down first: it was merely a defect that this patch does
not quite ensure that.

You're right that we could be cleverer: in the light of the changes I
ended up making for collisions in migration, maybe that approach could
be extended to switching merge_across_nodes.

But I think you'll agree that switching merge_across_nodes is a path
that needs to be handled correctly, but no way does it need optimization:
people will do it when they're trying to work out the right tuning for
their loads, and thereafter probably never again.
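
For reference, here is the guard as it stands by the end of the series,
condensed from the merge_across_nodes_store() hunk quoted later in this
thread (the else branch is filled in here for readability):

	mutex_lock(&ksm_thread_mutex);
	wait_while_offlining();
	if (ksm_merge_across_nodes != knob) {
		if (ksm_pages_shared || remove_all_stable_nodes())
			err = -EBUSY;	/* old trees must be emptied first */
		else
			ksm_merge_across_nodes = knob;
	}
	mutex_unlock(&ksm_thread_mutex);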

> > @@ -554,7 +578,12 @@ static void remove_rmap_item_from_tree(s
> >  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
> >  		BUG_ON(age > 1);
> >  		if (!age)
> > -			rb_erase(&rmap_item->node, &root_unstable_tree);
> > +#ifdef CONFIG_NUMA
> > +			rb_erase(&rmap_item->node,
> > +					&root_unstable_tree[rmap_item->nid]);
> > +#else
> > +			rb_erase(&rmap_item->node, &root_unstable_tree[0]);
> > +#endif
> >  
> 
> nit, does rmap_item->nid deserve a getter and setter helper instead?

I found that part ugly too: it gets macro helpers in trivial tidyups 3/11,
though not quite the getter/setter helpers you had in mind.

> > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> >  			return NULL;
> >  		}
> >  
> > +		/*
> > +		 * If tree_page has been migrated to another NUMA node, it
> > +		 * will be flushed out and put into the right unstable tree
> > +		 * next time: only merge with it if merge_across_nodes.
> > +		 * Just notice, we don't have similar problem for PageKsm
> > +		 * because their migration is disabled now. (62b61f611e)
> > +		 */
> > +		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
> > +			put_page(tree_page);
> > +			return NULL;
> > +		}
> > +
> 
> What about this case?
> 
> 1. ksm_merge_across_nodes==0
> 2. pages gets placed on different unstable trees
> 3. ksm_merge_across_nodes==1
> 
> At that point we should be removing pages from the different unstable
> tree and moving them to root_unstable_tree[0] but this put_page() doesn't
> happen. Does it matter?

It doesn't matter.  The general philosophy in ksm.c is to be very lazy
about the unstable tree: all kinds of things can go "wrong" with it
temporarily, that's okay so long as we don't fall for errors that would
persist round after round.  The check above is required (somewhere) to
make sure that we don't merge pages from different nodes into the same
stable tree when the switch says not to do that.  But the case that
you're thinking of, it'll just sort itself out in a later round
(I think you later realized how the unstable tree is rebuilt
from scratch each round).

Or have I misunderstood: are you worrying that a put_page()
is missing?  I don't see that.

But now you point me to this block, I do wonder if we could place it
better.  When I came to worry about such an issue in the stable tree,
I decided that it's perfectly okay to use a page from the wrong node
for an intermediate test, and suboptimal to give up at that point,
just wrong to return it as a final match.  But here we give up even
when it's an intermediate: seems inconsistent, I'll give it some more
thought later, and probably want to move it: it's not wrong as is,
but I think it could be more efficient and more consistent.

> > @@ -1301,7 +1368,8 @@ static struct rmap_item *scan_get_next_r
> >  		 */
> >  		lru_add_drain_all();
> >  
> > -		root_unstable_tree = RB_ROOT;
> > +		for (nid = 0; nid < nr_node_ids; nid++)
> > +			root_unstable_tree[nid] = RB_ROOT;
> >  
> 
> Minor but you shouldn't need to reset tham all if
> ksm_merge_across_nodes==1

True; and I'll need to attend to this if we do move away from
the static allocation of root_unstable_tree[MAX_NUMNODES].
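
Meanwhile, with the static array, skipping the reset would just be
(sketch only):

	/* Only root_unstable_tree[0] is in use when merging across nodes */
	if (ksm_merge_across_nodes)
		root_unstable_tree[0] = RB_ROOT;
	else
		for (nid = 0; nid < nr_node_ids; nid++)
			root_unstable_tree[nid] = RB_ROOT;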

> 
> Initially this triggered an alarm because it's not immediately obvious
> why you can just discard an rbtree like this. It looks like because the
> unstable tree is also part of a linked list so the rb representation can
> be reset quickly without leaking memory.

Right, it takes a while to get your head around the way we just forget
the old tree and start again each time.  There's a funny place in
remove_rmap_item_from_tree() (visible in an earlier extract) where it
has to consider the "age" of the rmap_item, to decide whether it's
linked into the current tree or not.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
  2013-02-05 16:48   ` Mel Gorman
@ 2013-02-08  0:07     ` Hugh Dickins
  2013-02-14 11:30       ` Mel Gorman
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-02-08  0:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote:
> > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
> > (restarting whenever it finds a stale node to remove), but rearrange
> > so that at least it does not needlessly restart from nid 0 each time.
> > And add a couple of comments: here is why we keep pfn instead of page.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/ksm.c |   38 ++++++++++++++++++++++----------------
> >  1 file changed, 22 insertions(+), 16 deletions(-)
> > 
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
> >  #endif /* CONFIG_MIGRATION */
> >  
> >  #ifdef CONFIG_MEMORY_HOTREMOVE
> > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> > -						 unsigned long end_pfn)
> > +static void ksm_check_stable_tree(unsigned long start_pfn,
> > +				  unsigned long end_pfn)
> >  {
> > +	struct stable_node *stable_node;
> >  	struct rb_node *node;
> >  	int nid;
> >  
> > -	for (nid = 0; nid < nr_node_ids; nid++)
> > -		for (node = rb_first(&root_stable_tree[nid]); node;
> > -				node = rb_next(node)) {
> > -			struct stable_node *stable_node;
> > -
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		node = rb_first(&root_stable_tree[nid]);
> > +		while (node) {
> 
> This is not your fault, the old code is wrong too. It is assuming that all
> nodes are populated in numeric orders with no holes. It won't work if just
> two nodes 0 and 4 are online. It should be using for_each_online_node().

If the old code is wrong, it probably would be my fault!  But I believe
this is okay: these rb_roots we're looking at, they are in memory which
is not being offlined, and the trees for offline nodes will simply be
empty, won't they?  Something's badly wrong if otherwise.

I certainly prefer to avoid for_each_online_node() etc: maybe I'm
confusing with for_each_online_something_else(), but experience tells
that you can get into nasty hotplug mutex ordering issues with those
things - not worth the pain if you can easily and safely avoid them.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-02-05 17:18   ` Mel Gorman
@ 2013-02-08  0:33     ` Hugh Dickins
  2013-02-14 11:34       ` Mel Gorman
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-02-08  0:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:00:50PM -0800, Hugh Dickins wrote:
> > In some places where get_ksm_page() is used, we need the page to be locked.
> > 
> > When KSM migration is fully enabled, we shall want that to make sure that
> > the page just acquired cannot be migrated beneath us (raised page count is
> > only effective when there is serialization to make sure migration notices).
> > Whereas when navigating through the stable tree, we certainly do not want
> > to lock each node (raised page count is enough to guarantee the memcmps,
> > even if page is migrated to another node).
> > 
> > Since we're about to add another use case, add the locked argument to
> > get_ksm_page() now.
> > 
> > Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
> > really got the wrong end of the stick on that!  There's a configuration
> > in which page_cache_get_speculative() can do something cheaper than
> > get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
> > disabled preemption for it.  There's no need for rcu_read_lock() around
> > get_page_unless_zero() (and mapping checks) here.  Cut out that
> > silliness before making this any harder to understand.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/ksm.c |   23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
> >   * but this is different - made simpler by ksm_thread_mutex being held, but
> >   * interesting for assuming that no other use of the struct page could ever
> >   * put our expected_mapping into page->mapping (or a field of the union which
> > - * coincides with page->mapping).  The RCU calls are not for KSM at all, but
> > - * to keep the page_count protocol described with page_cache_get_speculative.
> > + * coincides with page->mapping).
> >   *
> >   * Note: it is possible that get_ksm_page() will return NULL one moment,
> >   * then page the next, if the page is in between page_freeze_refs() and
> >   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
> >   * is on its way to being freed; but it is an anomaly to bear in mind.
> >   */
> > -static struct page *get_ksm_page(struct stable_node *stable_node)
> > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> >  {
> 
> The naming is unhelpful :(
> 
> Because the second parameter is called "locked", it implies that the
> caller of this function holds the page lock (which is obviously very
> silly). ret_locked maybe?

I'd prefer "lock_it": I'll make that change unless you've a better.

> 
> As the function is akin to find_lock_page I would  prefer if there was
> a new get_lock_ksm_page() instead of locking depending on the value of a
> parameter.

I demur.  If it were a global interface rather than a function static
to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd
be providing a pair of wrappers to get_ksm_page() to hide the bool arg.

But this is a private function (you're invited :) which doesn't need
that level of hand-holding.

And I'm a firm believer in having one, difficult, function where all
the heavy thought is focussed, which does the nasty work and spares
everywhere else from having to worry about the difficulties.

You'll shiver with horror as I recite shmem_getpage(_gfp),
page_lock_anon_vma(_read), page_relock_lruvec (well, that one did
not yet get beyond its posting): get_ksm_page is one of those.

> We can do this because expected_mapping is recorded by the
> stable_node and we only need to recalculate it if the page has been
> successfully pinned. We calculate the expected value twice but that's
> not earth shattering. It'd look something like;
> 
> /*
>  * get_lock_ksm_page: Similar to get_ksm_page except returns with page
>  * locked and pinned
>  */
> static struct page *get_lock_ksm_page(struct stable_node *stable_node)
> {
> 	struct page *page = get_ksm_page(stable_node);
> 
> 	if (page) {
> 		void *expected_mapping = (void *)stable_node +
> 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> 		lock_page(page);
> 		if (page->mapping != expected_mapping) {
> 			unlock_page(page);
> 
> 			/* release pin taken by get_ksm_page() */
> 			put_page(page);
> 			page = NULL;
> 		}
> 	}
> 
> 	return page;
> }

Something like that; but it would also need the remove_node_from_stable_tree.

> 
> Up to you, I'm not going to make a big deal of it.

Phew!  Probably my insistence springs from knowing what this function
develops into a few patches later, rather than the simpler version
that appears at this stage of the series.

> 
> FWIW, I agree that removing rcu_read_lock() is fine.

Good, thanks, I was rather embarrassed by my misunderstanding there.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/11] ksm: stop hotremove lockdep warning
  2013-01-26  2:10 ` [PATCH 11/11] ksm: stop hotremove lockdep warning Hugh Dickins
  2013-01-27  6:23   ` Simon Jeons
@ 2013-02-08 18:45   ` Gerald Schaefer
  2013-02-11 22:13     ` Hugh Dickins
  1 sibling, 1 reply; 69+ messages in thread
From: Gerald Schaefer @ 2013-02-08 18:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 25 Jan 2013 18:10:18 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Complaints are rare, but lockdep still does not understand the way
> ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
> holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
> to be a problem because notifier callbacks are made under down_read
> of blocking_notifier_head->rwsem (so first the mutex is taken while
> holding the rwsem, then later the rwsem is taken while still holding
> the mutex); but is not in fact a problem because mem_hotplug_mutex
> is held throughout the dance.
> 
> There was an attempt to fix this with mutex_lock_nested(); but if that
> happened to fool lockdep two years ago, apparently it does so no
> longer.
> 
> I had hoped to eradicate this issue in extending KSM page migration
> not to need the ksm_thread_mutex.  But then realized that although
> the page migration itself is safe, we do still need to lock out ksmd
> and other users of get_ksm_page() while offlining memory - at some
> point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages
> themselves may vanish, and get_ksm_page()'s accesses to them become a
> violation.
> 
> So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE
> to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and
> wait_while_offlining() checks, to achieve the same lockout without
> being caught by lockdep. This is less elegant for KSM, but it's more
> important to keep lockdep useful to other users - and I apologize for
> how long it took to fix.

Thanks a lot for the patch! I verified that it fixes the lockdep warning
that we got on memory hotremove.

> 
> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/ksm.c |   55 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 41 insertions(+), 14 deletions(-)
> 
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:37:06.880206290 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:38:53.984208836 -0800
> @@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2
> -static unsigned int ksm_run = KSM_RUN_STOP;
> +#define KSM_RUN_OFFLINE	4
> +static unsigned long ksm_run = KSM_RUN_STOP;
> +static void wait_while_offlining(void);
> 
>  static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
>  static DEFINE_MUTEX(ksm_thread_mutex);
> @@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing
> 
>  	while (!kthread_should_stop()) {
>  		mutex_lock(&ksm_thread_mutex);
> +		wait_while_offlining();
>  		if (ksmd_should_run())
>  			ksm_do_scan(ksm_thread_pages_to_scan);
>  		mutex_unlock(&ksm_thread_mutex);
> @@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa
>  #endif /* CONFIG_MIGRATION */
> 
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> +static int just_wait(void *word)
> +{
> +	schedule();
> +	return 0;
> +}
> +
> +static void wait_while_offlining(void)
> +{
> +	while (ksm_run & KSM_RUN_OFFLINE) {
> +		mutex_unlock(&ksm_thread_mutex);
> +		wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE),
> +				just_wait, TASK_UNINTERRUPTIBLE);
> +		mutex_lock(&ksm_thread_mutex);
> +	}
> +}
> +
>  static void ksm_check_stable_tree(unsigned long start_pfn,
>  				  unsigned long end_pfn)
>  {
> @@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
>  	switch (action) {
>  	case MEM_GOING_OFFLINE:
>  		/*
> -		 * Keep it very simple for now: just lock out ksmd and
> -		 * MADV_UNMERGEABLE while any memory is going offline.
> -		 * mutex_lock_nested() is necessary because lockdep was alarmed
> -		 * that here we take ksm_thread_mutex inside notifier chain
> -		 * mutex, and later take notifier chain mutex inside
> -		 * ksm_thread_mutex to unlock it.   But that's safe because both
> -		 * are inside mem_hotplug_mutex.
> +		 * Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
> +		 * and remove_all_stable_nodes() while memory is going offline:
> +		 * it is unsafe for them to touch the stable tree at this time.
> +		 * But unmerge_ksm_pages(), rmap lookups and other entry points
> +		 * which do not need the ksm_thread_mutex are all safe.
>  		 */
> -		mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run |= KSM_RUN_OFFLINE;
> +		mutex_unlock(&ksm_thread_mutex);
>  		break;
> 
>  	case MEM_OFFLINE:
> @@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no
>  		/* fallthrough */
> 
>  	case MEM_CANCEL_OFFLINE:
> +		mutex_lock(&ksm_thread_mutex);
> +		ksm_run &= ~KSM_RUN_OFFLINE;
>  		mutex_unlock(&ksm_thread_mutex);
> +
> +		smp_mb();	/* wake_up_bit advises this */
> +		wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
>  		break;
>  	}
>  	return NOTIFY_OK;
>  }
> +#else
> +static void wait_while_offlining(void)
> +{
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> 
>  #ifdef CONFIG_SYSFS
> @@ -2189,7 +2217,7 @@ KSM_ATTR(pages_to_scan);
>  static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			char *buf)
>  {
> -	return sprintf(buf, "%u\n", ksm_run);
> +	return sprintf(buf, "%lu\n", ksm_run);
>  }
> 
>  static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
> @@ -2212,6 +2240,7 @@ static ssize_t run_store(struct kobject
>  	 */
> 
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_run != flags) {
>  		ksm_run = flags;
>  		if (flags & KSM_RUN_UNMERGE) {
> @@ -2254,6 +2283,7 @@ static ssize_t merge_across_nodes_store(
>  		return -EINVAL;
> 
>  	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
>  	if (ksm_merge_across_nodes != knob) {
>  		if (ksm_pages_shared || remove_all_stable_nodes())
>  			err = -EBUSY;
> @@ -2366,10 +2396,7 @@ static int __init ksm_init(void)
>  #endif /* CONFIG_SYSFS */
> 
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -	/*
> -	 * Choose a high priority since the callback takes ksm_thread_mutex:
> -	 * later callbacks could only be taking locks which nest within that.
> -	 */
> +	/* There is no significance to this priority 100 */
>  	hotplug_memory_notifier(ksm_memory_callback, 100);
>  #endif
>  	return 0;
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-02-05 17:55   ` Mel Gorman
@ 2013-02-08 19:33     ` Hugh Dickins
  2013-02-14 11:58       ` Mel Gorman
  0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2013-02-08 19:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:01:59PM -0800, Hugh Dickins wrote:
> > Switching merge_across_nodes after running KSM is liable to oops on stale
> > nodes still left over from the previous stable tree.  It's not something
> > that people will often want to do, but it would be lame to demand a reboot
> > when they're trying to determine which merge_across_nodes setting is best.
> > 
> > How can this happen?  We only permit switching merge_across_nodes when
> > pages_shared is 0, and usually set run 2 to force that beforehand, which
> > ought to unmerge everything: yet oopses still occur when you then run 1.
> > 
> 
> When reviewing patch 1, I missed that the pages_shared check would prevent
> most of the problems I was envisioning with leftover entries in the
> stable tree. Sorry about that.

No apology necessary!

> 
> > Three causes:
> > 
> > 1. The old stable tree (built according to the inverse merge_across_nodes)
> > has not been fully torn down.  A stable node lingers until get_ksm_page()
> > notices that the page it references no longer references it: but the page
> > is not necessarily freed as soon as expected, particularly when swapcache.
> > 
> > Fix this with a pass through the old stable tree, applying get_ksm_page()
> > to each of the remaining nodes (most found stale and removed immediately),
> > with forced removal of any left over.  Unless the page is still mapped:
> > I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
> > and EBUSY than BUG.

But once I applied the testing for this to the completed patch series,
I did start seeing that WARN_ON_ONCE: it's made safe by the EBUSY,
but not working as intended.  Cause outlined below.

> > 
> > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> > but not so nice when we're trying to unmerge all mms: we were missing
> > those mms forked and inserted behind the unmerge cursor.  Easily fixed
> > by inserting at the end when KSM_RUN_UNMERGE.
> > 
> > 3. It is possible for a KSM page to be faulted back from swapcache into
> > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

What I found is that a 4th cause emerges once KSM migration
is properly working: that interval during page migration when the old
page has been fully unmapped but the new not yet mapped in its place.

The KSM COW breaking cannot see a page there then, so it ends up with
a (newly migrated) KSM page left behind.  Almost certainly has to be
fixed in follow_page(), but I've not yet settled on its final form -
the fix I have works well, but a different approach might be better.

I'm also puzzled that I've never in practice been hit by a 5th cause:
swapoff's try_to_unuse() is much like faulting, and ought to have the
same ksm_might_need_to_copy() safeguards as faulting (or at least,
I cannot see why not).

> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:37:00.768206145 -0800
> > @@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
> >  /*
> >   * Only called through the sysfs control interface:
> >   */
> > +static int remove_stable_node(struct stable_node *stable_node)
> > +{
> > +	struct page *page;
> > +	int err;
> > +
> > +	page = get_ksm_page(stable_node, true);
> > +	if (!page) {
> > +		/*
> > +		 * get_ksm_page did remove_node_from_stable_tree itself.
> > +		 */
> > +		return 0;
> > +	}
> > +
> > +	if (WARN_ON_ONCE(page_mapped(page)))
> > +		err = -EBUSY;
> > +	else {
> > +		/*
> 
> It will probably be very obvious to people familiar with ksm.c but even
> so maybe remind the reader that the pages must already have been unmerged
> 
> * This page must already have been unmerged and should be stale.
> * It might be in a pagevec waiting to be freed or it might be

Okay, I'll add a little more comment there;
but I need to think longer for exactly how to express it.

> ......
> 
> 
> 
> > +		 * This page might be in a pagevec waiting to be freed,
> > +		 * or it might be PageSwapCache (perhaps under writeback),
> > +		 * or it might have been removed from swapcache a moment ago.
> > +		 */
> > +		set_page_stable_node(page, NULL);
> > +		remove_node_from_stable_tree(stable_node);
> > +		err = 0;
> > +	}
> > +
> > +	unlock_page(page);
> > +	put_page(page);
> > +	return err;
> > +}
> > +
> > +static int remove_all_stable_nodes(void)
> > +{
> > +	struct stable_node *stable_node;
> > +	int nid;
> > +	int err = 0;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		while (root_stable_tree[nid].rb_node) {
> > +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> > +						struct stable_node, node);
> > +			if (remove_stable_node(stable_node)) {
> > +				err = -EBUSY;
> > +				break;	/* proceed to next nid */
> > +			}
> 
> If remove_stable_node() returns an error then it's quite possible that it'll
> go boom when that page is encountered later but it's not guaranteed. It'd
> be best effort to continue removing as many of the stable nodes anyway.
> We're in trouble either way of course.

If it returns an error, then indeed something we don't yet understand
has occurred, and we shall want to debug it.  But unless it's due to
corruption somewhere, we shouldn't be in much trouble, shouldn't go boom:
remove_all_stable_nodes() error is ignored at the end of unmerging, it
will be tried again when changing merge_across_nodes, and an error
then will just prevent changing merge_across_nodes at that time.  So
the mysteriously unremovable stable nodes remain the same kind of tree.

> 
> Otherwise I didn't spot a problem so as weak as it is due my familiarity
> with KSM;
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 7/11] ksm: make KSM page migration possible
  2013-02-05 19:11   ` Mel Gorman
@ 2013-02-08 20:52     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-02-08 20:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Paul E. McKenney, Andrew Morton, Petr Holasek, Andrea Arcangeli,
	Izik Eidus, linux-kernel, linux-mm

Paul, I've added you to the Cc in the hope that you can shed your light
on an smp_read_barrier_depends() question with which Mel taxes me below.
You may ask for more context: linux-next currently has an mm/ksm.c after
this patch is applied, but you may have questions beyond that - thanks!

On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote:
> > KSM page migration is already supported in the case of memory hotremove,
> > which takes the ksm_thread_mutex across all its migrations to keep life
> > simple.
> > 
> > But the new KSM NUMA merge_across_nodes knob introduces a problem, when
> > it's set to non-default 0: if a KSM page is migrated to a different NUMA
> > node, how do we migrate its stable node to the right tree?  And what if
> > that collides with an existing stable node?
> > 
> > So far there's no provision for that, and this patch does not attempt
> > to deal with it either.  But how will I test a solution, when I don't
> > know how to hotremove memory? 
> 
> Just reach in and yank it straight out with a chisel.

:)

> 
> > The best answer is to enable KSM page
> > migration in all cases now, and test more common cases.  With THP and
> > compaction added since KSM came in, page migration is now mainstream,
> > and it's a shame that a KSM page can frustrate freeing a page block.
> > 
> 
> THP will at least check if migration within a node works. It won't
> necessarily check we can migrate across nodes properly but it's a lot
> better than nothing.

No, I went back and dug out a hack-patch I was using three or four years
ago: occasionally on fault, just migrate every possible page in that mm
for no reason other than to test page migration.

> >  static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> >  {
> >  	struct page *page;
> >  	void *expected_mapping;
> > +	unsigned long kpfn;
> >  
> > -	page = pfn_to_page(stable_node->kpfn);
> >  	expected_mapping = (void *)stable_node +
> >  				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> > -	if (page->mapping != expected_mapping)
> > -		goto stale;
> > -	if (!get_page_unless_zero(page))
> > +again:
> > +	kpfn = ACCESS_ONCE(stable_node->kpfn);
> > +	page = pfn_to_page(kpfn);
> > +
> 
> Ok.
> 
> There should be no concern that hot-remove made the kpfn invalid because
> those stable tree entries should have been discarded.

Yes.

> 
> > +	/*
> > +	 * page is computed from kpfn, so on most architectures reading
> > +	 * page->mapping is naturally ordered after reading node->kpfn,
> > +	 * but on Alpha we need to be more careful.
> > +	 */
> > +	smp_read_barrier_depends();
> 
> The value of page is data dependant on pfn_to_page(). Is it really possible
> for that to be re-ordered even on Alpha?

My intuition (to say "understanding" would be an exaggeration) is that
on Alpha a very old value of page->mapping (in the line below) might be
lying around and read from one cache, which has not necessarily been
invalidated by ksm_migrate_page() pointing stable_node->kpfn to this
new page.

And if that happens, we could easily and mistakenly conclude that this
stable node is stale: although there's an smp_rmb() after goto stale,
stable_node->kpfn would still match kpfn, and we wrongly remove the node.

My confidence that I've expressed that clearly in words, is lower than
my confidence that I've coded it right; and if I'm wrong, yes, surely
it's better to remove any cargo-cult smp_read_barrier_depends().
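
For readers following along, here is the pairing at issue, condensed from
the hunks quoted in this mail into two toy helpers (illustrative only; the
real code lives in ksm_migrate_page() and get_ksm_page()):

	/* Writer side: newpage->mapping has already been set up. */
	static void sketch_migrate_stable_node(struct stable_node *stable_node,
					       struct page *oldpage,
					       struct page *newpage)
	{
		stable_node->kpfn = page_to_pfn(newpage);
		smp_wmb();	/* publish new kpfn before staling oldpage */
		set_page_stable_node(oldpage, NULL);
	}

	/* Reader side: distinguish "node is stale" from "page was migrated". */
	static bool sketch_stable_node_moved(struct stable_node *stable_node,
					     unsigned long kpfn)
	{
		smp_rmb();	/* pairs with the smp_wmb() above */
		return ACCESS_ONCE(stable_node->kpfn) != kpfn;
	}

If the reader sees a suspect page->mapping but the kpfn has moved, it
retries with the new kpfn instead of removing the node.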

> 
> > +	if (ACCESS_ONCE(page->mapping) != expected_mapping)
> >  		goto stale;
> > -	if (page->mapping != expected_mapping) {
> > +
> > +	/*
> > +	 * We cannot do anything with the page while its refcount is 0.
> > +	 * Usually 0 means free, or tail of a higher-order page: in which
> > +	 * case this node is no longer referenced, and should be freed;
> > +	 * however, it might mean that the page is under page_freeze_refs().
> > +	 * The __remove_mapping() case is easy, again the node is now stale;
> > +	 * but if page is swapcache in migrate_page_move_mapping(), it might
> > +	 * still be our page, in which case it's essential to keep the node.
> > +	 */
> > +	while (!get_page_unless_zero(page)) {
> > +		/*
> > +		 * Another check for page->mapping != expected_mapping would
> > +		 * work here too.  We have chosen the !PageSwapCache test to
> > +		 * optimize the common case, when the page is or is about to
> > +		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
> > +		 * in the freeze_refs section of __remove_mapping(); but Anon
> > +		 * page->mapping reset to NULL later, in free_pages_prepare().
> > +		 */
> > +		if (!PageSwapCache(page))
> > +			goto stale;
> > +		cpu_relax();
> > +	}
> 
> The recheck of stable_node->kpfn after a barrier distinguishes between
> a free and a completed migration; that's fine. I hesitate to ask because
> it must be obvious, but where is the guarantee that a KSM page is in the
> swap cache?

Certainly none at all: it's the less common case that a KSM page is in
swap cache.  But if it is not in swap cache, how could its page count be
0 (causing get_page_unless_zero to fail)?  By being free, or well on its
way to being freed (hence stale); or reused as part of a compound page
(hence stale also); or reused for another purpose which arrives at a
page_freeze_refs() (hence stale also); other cases?

It's hard to see from the diff, but in the original version of
get_ksm_page(), !get_page_unless_zero goes straight to stale.

Don't for a moment imagine that this function sprang fully formed
from my mind: it was hard to get it working right (the swap cache
get_page_unless_zero failure during migration really caught me out),
and then to pare it down to its fairly simple final form.

Hugh

> 
> > +
> > +	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> >  		put_page(page);
> >  		goto stale;
> >  	}
> > +
> >  	if (locked) {
> >  		lock_page(page);
> > -		if (page->mapping != expected_mapping) {
> > +		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> >  			unlock_page(page);
> >  			put_page(page);
> >  			goto stale;
> >  		}
> >  	}
> >  	return page;
> > +
> >  stale:
> > +	/*
> > +	 * We come here from above when page->mapping or !PageSwapCache
> > +	 * suggests that the node is stale; but it might be under migration.
> > +	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
> > +	 * before checking whether node->kpfn has been changed.
> > +	 */
> > +	smp_rmb();
> > +	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
> > +		goto again;
> >  	remove_node_from_stable_tree(stable_node);
> >  	return NULL;
> >  }
> > @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
> >  	if (stable_node) {
> >  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> >  		stable_node->kpfn = page_to_pfn(newpage);
> > +		/*
> > +		 * newpage->mapping was set in advance; now we need smp_wmb()
> > +		 * to make sure that the new stable_node->kpfn is visible
> > +		 * to get_ksm_page() before it can see that oldpage->mapping
> > +		 * has gone stale (or that PageSwapCache has been cleared).
> > +		 */
> > +		smp_wmb();
> > +		set_page_stable_node(oldpage, NULL);
> >  	}
> >  }
> >  #endif /* CONFIG_MIGRATION */

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/11] ksm: stop hotremove lockdep warning
  2013-02-08 18:45   ` Gerald Schaefer
@ 2013-02-11 22:13     ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-02-11 22:13 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	KOSAKI Motohiro, linux-kernel, linux-mm

On Fri, 8 Feb 2013, Gerald Schaefer wrote:
> On Fri, 25 Jan 2013 18:10:18 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > Complaints are rare, but lockdep still does not understand the way
> > ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
> > holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
> > to be a problem because notifier callbacks are made under down_read
> > of blocking_notifier_head->rwsem (so first the mutex is taken while
> > holding the rwsem, then later the rwsem is taken while still holding
> > the mutex); but is not in fact a problem because mem_hotplug_mutex
> > is held throughout the dance.
> > 
> > There was an attempt to fix this with mutex_lock_nested(); but if that
> > happened to fool lockdep two years ago, apparently it does so no
> > longer.
> > 
> > I had hoped to eradicate this issue in extending KSM page migration
> > not to need the ksm_thread_mutex.  But then realized that although
> > the page migration itself is safe, we do still need to lock out ksmd
> > and other users of get_ksm_page() while offlining memory - at some
> > point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages
> > themselves may vanish, and get_ksm_page()'s accesses to them become a
> > violation.
> > 
> > So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE
> > to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and
> > wait_while_offlining() checks, to achieve the same lockout without
> > being caught by lockdep. This is less elegant for KSM, but it's more
> > important to keep lockdep useful to other users - and I apologize for
> > how long it took to fix.
> 
> Thanks a lot for the patch! I verified that it fixes the lockdep warning
> that we got on memory hotremove.
> 
> > 
> > Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>

Thank you for reporting and testing and reporting back:
sorry again for taking so long to fix it.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/11] ksm: reorganize ksm_check_stable_tree
  2013-02-08  0:07     ` Hugh Dickins
@ 2013-02-14 11:30       ` Mel Gorman
  0 siblings, 0 replies; 69+ messages in thread
From: Mel Gorman @ 2013-02-14 11:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Thu, Feb 07, 2013 at 04:07:17PM -0800, Hugh Dickins wrote:
> On Tue, 5 Feb 2013, Mel Gorman wrote:
> > On Fri, Jan 25, 2013 at 05:59:35PM -0800, Hugh Dickins wrote:
> > > Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
> > > (restarting whenever it finds a stale node to remove), but rearrange
> > > so that at least it does not needlessly restart from nid 0 each time.
> > > And add a couple of comments: here is why we keep pfn instead of page.
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > ---
> > >  mm/ksm.c |   38 ++++++++++++++++++++++----------------
> > >  1 file changed, 22 insertions(+), 16 deletions(-)
> > > 
> > > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:52.152205940 -0800
> > > +++ mmotm/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > > @@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
> > >  #endif /* CONFIG_MIGRATION */
> > >  
> > >  #ifdef CONFIG_MEMORY_HOTREMOVE
> > > -static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
> > > -						 unsigned long end_pfn)
> > > +static void ksm_check_stable_tree(unsigned long start_pfn,
> > > +				  unsigned long end_pfn)
> > >  {
> > > +	struct stable_node *stable_node;
> > >  	struct rb_node *node;
> > >  	int nid;
> > >  
> > > -	for (nid = 0; nid < nr_node_ids; nid++)
> > > -		for (node = rb_first(&root_stable_tree[nid]); node;
> > > -				node = rb_next(node)) {
> > > -			struct stable_node *stable_node;
> > > -
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > +		node = rb_first(&root_stable_tree[nid]);
> > > +		while (node) {
> > 
> > This is not your fault, the old code is wrong too. It is assuming that all
> > nodes are populated in numeric orders with no holes. It won't work if just
> > two nodes 0 and 4 are online. It should be using for_each_online_node().
> 
> If the old code is wrong, it probably would be my fault!  But I believe
> this is okay: these rb_roots we're looking at, they are in memory which
> is not being offlined, and the trees for offline nodes will simply be
> empty, won't they?  Something's badly wrong if otherwise.
> 

I would expect them to be empty but that was not the problem I had in
mind. Unfortunately I mixed up nr_online_nodes and nr_node_ids and read
the loop incorrectly. What you have is fine.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/11] ksm: get_ksm_page locked
  2013-02-08  0:33     ` Hugh Dickins
@ 2013-02-14 11:34       ` Mel Gorman
  0 siblings, 0 replies; 69+ messages in thread
From: Mel Gorman @ 2013-02-14 11:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Thu, Feb 07, 2013 at 04:33:58PM -0800, Hugh Dickins wrote:
> > > <SNIP>
> > > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:53.244205966 -0800
> > > +++ mmotm/mm/ksm.c	2013-01-25 14:36:58.856206099 -0800
> > > @@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
> > >   * but this is different - made simpler by ksm_thread_mutex being held, but
> > >   * interesting for assuming that no other use of the struct page could ever
> > >   * put our expected_mapping into page->mapping (or a field of the union which
> > > - * coincides with page->mapping).  The RCU calls are not for KSM at all, but
> > > - * to keep the page_count protocol described with page_cache_get_speculative.
> > > + * coincides with page->mapping).
> > >   *
> > >   * Note: it is possible that get_ksm_page() will return NULL one moment,
> > >   * then page the next, if the page is in between page_freeze_refs() and
> > >   * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
> > >   * is on its way to being freed; but it is an anomaly to bear in mind.
> > >   */
> > > -static struct page *get_ksm_page(struct stable_node *stable_node)
> > > +static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> > >  {
> > 
> > The naming is unhelpful :(
> > 
> > Because the second parameter is called "locked", it implies that the
> > caller of this function holds the page lock (which is obviously very
> > silly). ret_locked maybe?
> 
> I'd prefer "lock_it": I'll make that change unless you've a better.
> 

I don't.

> > 
> > As the function is akin to find_lock_page I would  prefer if there was
> > a new get_lock_ksm_page() instead of locking depending on the value of a
> > parameter.
> 
> I demur.  If it were a global interface rather than a function static
> to ksm.c, yes, I'm sure Linus would side very strongly with you, and I'd
> be providing a pair of wrappers to get_ksm_page() to hide the bool arg.
> 
> But this is a private function (you're invited :) which doesn't need
> that level of hand-holding.
> 
> And I'm a firm believer in having one, difficult, function where all
> the heavy thought is focussed, which does the nasty work and spares
> everywhere else from having to worry about the difficulties.
> 

Ok, I'm convinced. As you say, the case for having one function is a lot
stronger later in the series when this function becomes quite complex. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-02-08 19:33     ` Hugh Dickins
@ 2013-02-14 11:58       ` Mel Gorman
  2013-02-14 22:19         ` Hugh Dickins
  0 siblings, 1 reply; 69+ messages in thread
From: Mel Gorman @ 2013-02-14 11:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote:
> > > <SNIP>
> > > 
> > > 2. __ksm_enter() has a nice little optimization, to insert the new mm
> > > just behind ksmd's cursor, so there's a full pass for it to stabilize
> > > (or be removed) before ksmd addresses it.  Nice when ksmd is running,
> > > but not so nice when we're trying to unmerge all mms: we were missing
> > > those mms forked and inserted behind the unmerge cursor.  Easily fixed
> > > by inserting at the end when KSM_RUN_UNMERGE.
> > > 
> > > 3. It is possible for a KSM page to be faulted back from swapcache into
> > > an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
> > > Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
> > > to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
> > > and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.
> 
> What I found is that a 4th cause emerges once KSM migration
> is properly working: that interval during page migration when the old
> page has been fully unmapped but the new not yet mapped in its place.
> 

For anyone else watching -- normal page migration expects to be protected
during that particular window with migration ptes. Any reference to the
PTE mapping a page being migrated faults on a swap-like PTE and waits
in migration_entry_wait().

> The KSM COW breaking cannot see a page there then, so it ends up with
> a (newly migrated) KSM page left behind.  Almost certainly has to be
> fixed in follow_page(), but I've not yet settled on its final form -
> the fix I have works well, but a different approach might be better.
> 

follow_page() is one option. My guess is that you're thinking of adding
a FOLL_ flag that will cause follow_page() to check is_migration_entry()
and migration_entry_wait() if the flag is present.

Otherwise you would need to check for migration ptes in a number of places
under page lock and then hold the lock for long periods of time to prevent
migration starting. I did not check this option in depth because it quickly
looked like it would be a mess, with long page lock hold times and might
not even be workable.

> > > +static int remove_all_stable_nodes(void)
> > > +{
> > > +	struct stable_node *stable_node;
> > > +	int nid;
> > > +	int err = 0;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > +		while (root_stable_tree[nid].rb_node) {
> > > +			stable_node = rb_entry(root_stable_tree[nid].rb_node,
> > > +						struct stable_node, node);
> > > +			if (remove_stable_node(stable_node)) {
> > > +				err = -EBUSY;
> > > +				break;	/* proceed to next nid */
> > > +			}
> > 
> > If remove_stable_node() returns an error then it's quite possible that it'll
> > go boom when that page is encountered later but it's not guaranteed. It'd
> > be best effort to continue removing as many of the stable nodes anyway.
> > We're in trouble either way of course.
> 
> If it returns an error, then indeed something we don't yet understand
> has occurred, and we shall want to debug it.  But unless it's due to
> corruption somewhere, we shouldn't be in much trouble, shouldn't go boom:
> remove_all_stable_nodes() error is ignored at the end of unmerging, it
> will be tried again when changing merge_across_nodes, and an error
> then will just prevent changing merge_across_nodes at that time.  So
> the mysteriously unremovable stable nodes remain the same kind of tree.
> 

Ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 6/11] ksm: remove old stable nodes more thoroughly
  2013-02-14 11:58       ` Mel Gorman
@ 2013-02-14 22:19         ` Hugh Dickins
  0 siblings, 0 replies; 69+ messages in thread
From: Hugh Dickins @ 2013-02-14 22:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Petr Holasek, Andrea Arcangeli, Izik Eidus,
	linux-kernel, linux-mm

On Thu, 14 Feb 2013, Mel Gorman wrote:
> On Fri, Feb 08, 2013 at 11:33:40AM -0800, Hugh Dickins wrote:
> > 
> > What I found is that a 4th cause emerges once KSM migration
> > is properly working: that interval during page migration when the old
> > page has been fully unmapped but the new not yet mapped in its place.
> > 
> 
> For anyone else watching -- normal page migration expects to be protected
> during that particular window with migration ptes. Any references to the
> PTE mapping a page being migrated faults on a swap-like PTE and waits
> in migration_entry_wait().
> 
> > The KSM COW breaking cannot see a page there then, so it ends up with
> > a (newly migrated) KSM page left behind.  Almost certainly has to be
> > fixed in follow_page(), but I've not yet settled on its final form -
> > the fix I have works well, but a different approach might be better.
> > 

The fix I had (following migration entry to old page) was a bit too
PageKsm specific, and probably wrong for when get_user_pages() needs
to get a hold on the _new_ page.

> 
> follow_page() is one option. My guess is that you're thinking of adding
> a FOLL_ flag that will cause follow_page() to check is_migration_entry()
> and migration_entry_wait() if the flag is present.

Maybe a FOLL_flag, but I was thinking of doing it always.  The usual
get_user_pages() case will already wait in handle_mm_fault() and works
okay, and I didn't identify a problem case for follow_page() apart from
this ksm.c usage; but I did wonder if someone might have or add code
which gets similarly caught out by the migration case.

It's not a change I'd dare to make (without a FOLL_flag) if Andrea
hadn't already added a wait_split_huge_page() into follow_page();
and I need to convince myself that adding another cause for waiting
is necessarily safe (perhaps adding a might_sleep would be good).
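
To make the idea concrete, roughly the kind of helper follow_page() could
call when it finds a non-present pte (a sketch: FOLL_MIGRATION is a flag
name assumed here, and the "do it always" variant would simply drop that
test):

	/* Returns true if we waited for migration and the caller should retry. */
	static bool follow_page_wait_migration(struct mm_struct *mm, pmd_t *pmd,
					       unsigned long address, pte_t pte,
					       pte_t *ptep, spinlock_t *ptl,
					       unsigned int flags)
	{
		swp_entry_t entry;

		if (!(flags & FOLL_MIGRATION))
			return false;
		if (pte_none(pte) || pte_file(pte))
			return false;
		entry = pte_to_swp_entry(pte);
		if (!is_migration_entry(entry))
			return false;

		pte_unmap_unlock(ptep, ptl);
		migration_entry_wait(mm, pmd, address);
		return true;
	}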

Sorry, I expected to have posted follow-up patches days and days ago,
but in fact my time has vanished elsewhere and I've not even started.

> 
> Otherwise you would need to check for migration ptes in a number of places
> under page lock and then hold the lock for long periods of time to prevent
> migration starting. I did not check this option in depth because it quickly
> looked like it would be a mess, with long page lock hold times and might
> not even be workable.

Yes, I think that's more or less why I quickly decided on doing it in
follow_page().

Another option would be to move the ksm_migrate_page() callsite, and
allow it to reject the migration attempt when "inconvenient" (I haven't
stopped to think of the definition of inconvenient).  Though it wouldn't
fail often enough for anyone out there to care, that option just feels
like a shameful cop-out to me: I'm trying to improve migration, not add
strange cases when it fails.

Hugh

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2013-02-14 22:19 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-26  1:53 [PATCH 0/11] ksm: NUMA trees and page migration Hugh Dickins
2013-01-26  1:54 ` [PATCH 1/11] ksm: allow trees per NUMA node Hugh Dickins
2013-01-27  1:14   ` Simon Jeons
2013-01-27  2:54     ` Hugh Dickins
2013-01-27  3:16       ` Simon Jeons
2013-01-27 21:55         ` Hugh Dickins
2013-01-28 23:03   ` Andrew Morton
2013-01-29  1:17     ` Hugh Dickins
2013-01-28 23:08   ` Andrew Morton
2013-01-29  1:38     ` Hugh Dickins
2013-02-05 16:41   ` Mel Gorman
2013-02-07 23:57     ` Hugh Dickins
2013-01-26  1:56 ` [PATCH 2/11] ksm: add sysfs ABI Documentation Hugh Dickins
2013-01-26  1:58 ` [PATCH 3/11] ksm: trivial tidyups Hugh Dickins
2013-01-28 23:11   ` Andrew Morton
2013-01-29  1:44     ` Hugh Dickins
2013-01-26  1:59 ` [PATCH 4/11] ksm: reorganize ksm_check_stable_tree Hugh Dickins
2013-02-05 16:48   ` Mel Gorman
2013-02-08  0:07     ` Hugh Dickins
2013-02-14 11:30       ` Mel Gorman
2013-01-26  2:00 ` [PATCH 5/11] ksm: get_ksm_page locked Hugh Dickins
2013-01-27  2:36   ` Simon Jeons
2013-01-27 22:08     ` Hugh Dickins
2013-01-28  0:36       ` Simon Jeons
2013-01-28  3:35         ` Hugh Dickins
2013-01-27  2:48   ` Simon Jeons
2013-01-27 22:10     ` Hugh Dickins
2013-02-05 17:18   ` Mel Gorman
2013-02-08  0:33     ` Hugh Dickins
2013-02-14 11:34       ` Mel Gorman
2013-01-26  2:01 ` [PATCH 6/11] ksm: remove old stable nodes more thoroughly Hugh Dickins
2013-01-27  4:55   ` Simon Jeons
2013-01-27 23:05     ` Hugh Dickins
2013-01-28  1:42       ` Simon Jeons
2013-01-28  4:14         ` Hugh Dickins
2013-01-28  2:12   ` Simon Jeons
2013-01-28  4:19     ` Hugh Dickins
2013-01-28  6:36   ` Simon Jeons
2013-01-28 23:44   ` Andrew Morton
2013-01-29  2:03     ` Hugh Dickins
2013-02-05 17:55   ` Mel Gorman
2013-02-08 19:33     ` Hugh Dickins
2013-02-14 11:58       ` Mel Gorman
2013-02-14 22:19         ` Hugh Dickins
2013-01-26  2:03 ` [PATCH 7/11] ksm: make KSM page migration possible Hugh Dickins
2013-01-27  5:47   ` Simon Jeons
2013-01-27 23:12     ` Hugh Dickins
2013-01-28  0:41       ` Simon Jeons
2013-01-28  3:44         ` Hugh Dickins
2013-02-05 19:11   ` Mel Gorman
2013-02-08 20:52     ` Hugh Dickins
2013-01-26  2:05 ` [PATCH 8/11] ksm: make !merge_across_nodes migration safe Hugh Dickins
2013-01-27  8:49   ` Simon Jeons
2013-01-27 23:25     ` Hugh Dickins
2013-01-28  3:44   ` Simon Jeons
2013-01-26  2:06 ` [PATCH 9/11] ksm: enable KSM page migration Hugh Dickins
2013-01-26  2:07 ` [PATCH 10/11] mm: remove offlining arg to migrate_pages Hugh Dickins
2013-01-26  2:10 ` [PATCH 11/11] ksm: stop hotremove lockdep warning Hugh Dickins
2013-01-27  6:23   ` Simon Jeons
2013-01-27 23:35     ` Hugh Dickins
2013-02-08 18:45   ` Gerald Schaefer
2013-02-11 22:13     ` Hugh Dickins
2013-01-28 23:54 ` [PATCH 0/11] ksm: NUMA trees and page migration Andrew Morton
2013-01-29  0:49   ` Izik Eidus
2013-01-29  2:26     ` Izik Eidus
2013-01-29 16:51       ` Andrea Arcangeli
2013-01-31  0:05         ` Ric Mason
2013-01-29  1:07   ` Hugh Dickins
2013-01-29 10:45     ` Gleb Natapov
