* [PATCH v4] KSM: numa awareness sysfs knob
@ 2012-09-24  0:56 Petr Holasek
  2012-09-27 15:45 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Petr Holasek @ 2012-09-24  0:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov, Petr Holasek

Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).

A typical use case could be many KVM guests on a NUMA machine, where
CPUs from more distant nodes would see a significant increase in
access latency to the merged ksm page. A sysfs knob was chosen for
flexibility, since some users still prefer a higher amount of saved
physical memory regardless of access latency.

Every NUMA node has its own stable & unstable trees, for faster
searching and inserting. Changing the merge_nodes value is possible only
when there are no ksm shared pages in the system.
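
For illustration, flipping the knob from userspace could look like the
sketch below (example only, not part of the patch; run is set to 2 first
because the store handler refuses the change while ksmd is merging or
while any pages are still shared):

#include <stdio.h>

static void write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	write_knob("/sys/kernel/mm/ksm/run", "2");	/* stop ksmd, unmerge all */
	write_knob("/sys/kernel/mm/ksm/merge_across_nodes", "0");
	write_knob("/sys/kernel/mm/ksm/run", "1");	/* start ksmd again */
	return 0;
}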

I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
measured the speed of memory access inside KVM guests, with memory
pinned to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times in percentage of average
were following:

merge_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes	1.7%

merge_nodes=0
2 nodes	1%
4 nodes	0.32%
8 nodes	0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550

Changelog:

v2: Andrew's objections were reflected:
	- value of merge_nodes can't be changed while there are some ksm
	pages in system
	- merge_nodes sysfs entry appearance depends on CONFIG_NUMA
	- more verbose documentation
	- added some performance testing results

v3:	- more verbose documentation
	- fixed race in merge_nodes store function
	- introduced share_all debugging knob proposed by Andrew
	- minor cleanups

v4:	- merge_nodes was renamed to merge_across_nodes
	- share_all debug knob was dropped
	- get_kpfn_nid helper
	- fixed page migration behaviour

Signed-off-by: Petr Holasek <pholasek@redhat.com>
---
 Documentation/vm/ksm.txt |   7 +++
 mm/ksm.c                 | 135 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e49..100d58d 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
                    e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
                    Default: 20 (chosen for demonstration purposes)
 
+merge_nodes      - specifies if pages from different numa nodes can be merged.
+                   When set to 0, ksm merges only pages which physically
+                   reside in the memory area of same NUMA node. It brings
+                   lower latency to access to shared page. Value can be
+                   changed only when there is no ksm shared pages in system.
+                   Default: 1
+
 run              - set 0 to stop ksmd from running but keep merged pages,
                    set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
                    set 2 to stop ksmd and unmerge all pages currently merged,
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..7c82032 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -36,6 +36,7 @@
 #include <linux/hash.h>
 #include <linux/freezer.h>
 #include <linux/oom.h>
+#include <linux/numa.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -140,7 +141,10 @@ struct rmap_item {
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
 	union {
-		struct rb_node node;	/* when node of unstable tree */
+		struct {
+			struct rb_node node;	/* when node of unstable tree */
+			struct rb_root *root;
+		};
 		struct {		/* when listed from stable tree */
 			struct stable_node *head;
 			struct hlist_node hlist;
@@ -153,8 +157,8 @@ struct rmap_item {
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };
+static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };
 
 #define MM_SLOTS_HASH_SHIFT 10
 #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
@@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
@@ -447,10 +454,25 @@ out:		page = NULL;
 	return page;
 }
 
+/*
+ * This helper is used for getting right index into array of tree roots.
+ * When merge_across_nodes knob is set to 1, there are only two rb-trees for
+ * stable and unstable pages from all nodes with roots in index 0. Otherwise,
+ * every node has its own stable and unstable tree.
+ */
+static inline int get_kpfn_nid(unsigned long kpfn)
+{
+	if (ksm_merge_across_nodes)
+		return 0;
+	else
+		return pfn_to_nid(kpfn);
+}
+
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
 {
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
+	int nid;
 
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		if (rmap_item->hlist.next)
@@ -462,7 +484,10 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
 		cond_resched();
 	}
 
-	rb_erase(&stable_node->node, &root_stable_tree);
+	nid = get_kpfn_nid(stable_node->kpfn);
+
+	rb_erase(&stable_node->node,
+			&root_stable_tree[nid]);
 	free_stable_node(stable_node);
 }
 
@@ -560,7 +585,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
 		if (!age)
-			rb_erase(&rmap_item->node, &root_unstable_tree);
+			rb_erase(&rmap_item->node, rmap_item->root);
 
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
@@ -989,8 +1014,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
  */
 static struct page *stable_tree_search(struct page *page)
 {
-	struct rb_node *node = root_stable_tree.rb_node;
+	struct rb_node *node;
 	struct stable_node *stable_node;
+	int nid;
 
 	stable_node = page_stable_node(page);
 	if (stable_node) {			/* ksm page forked */
@@ -998,6 +1024,9 @@ static struct page *stable_tree_search(struct page *page)
 		return page;
 	}
 
+	nid = get_kpfn_nid(page_to_pfn(page));
+	node = root_stable_tree[nid].rb_node;
+
 	while (node) {
 		struct page *tree_page;
 		int ret;
@@ -1032,10 +1061,14 @@ static struct page *stable_tree_search(struct page *page)
  */
 static struct stable_node *stable_tree_insert(struct page *kpage)
 {
-	struct rb_node **new = &root_stable_tree.rb_node;
+	int nid;
+	struct rb_node **new = NULL;
 	struct rb_node *parent = NULL;
 	struct stable_node *stable_node;
 
+	nid = get_kpfn_nid(page_to_pfn(kpage));
+	new = &root_stable_tree[nid].rb_node;
+
 	while (*new) {
 		struct page *tree_page;
 		int ret;
@@ -1069,7 +1102,7 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
 		return NULL;
 
 	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
 	INIT_HLIST_HEAD(&stable_node->hlist);
 
@@ -1097,10 +1130,16 @@ static
 struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 					      struct page *page,
 					      struct page **tree_pagep)
-
 {
-	struct rb_node **new = &root_unstable_tree.rb_node;
+	struct rb_node **new = NULL;
+	struct rb_root *root;
 	struct rb_node *parent = NULL;
+	int nid;
+
+	nid = get_kpfn_nid(page_to_pfn(page));
+	root = &root_unstable_tree[nid];
+
+	new = &root->rb_node;
 
 	while (*new) {
 		struct rmap_item *tree_rmap_item;
@@ -1138,8 +1177,9 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 
 	rmap_item->address |= UNSTABLE_FLAG;
 	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+	rmap_item->root = root;
 	rb_link_node(&rmap_item->node, parent, new);
-	rb_insert_color(&rmap_item->node, &root_unstable_tree);
+	rb_insert_color(&rmap_item->node, root);
 
 	ksm_pages_unshared++;
 	return NULL;
@@ -1282,6 +1322,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	int i;
 
 	if (list_empty(&ksm_mm_head.mm_list))
 		return NULL;
@@ -1300,7 +1341,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 		 */
 		lru_add_drain_all();
 
-		root_unstable_tree = RB_ROOT;
+		for (i = 0; i < MAX_NUMNODES; i++)
+			root_unstable_tree[i] = RB_ROOT;
 
 		spin_lock(&ksm_mmlist_lock);
 		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
@@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 	stable_node = page_stable_node(newpage);
 	if (stable_node) {
 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
-		stable_node->kpfn = page_to_pfn(newpage);
+
+		if (ksm_merge_across_nodes ||
+				page_to_nid(oldpage) == page_to_nid(newpage))
+			stable_node->kpfn = page_to_pfn(newpage);
+		else
+			remove_node_from_stable_tree(stable_node);
 	}
 }
 #endif /* CONFIG_MIGRATION */
@@ -1768,15 +1815,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
 						 unsigned long end_pfn)
 {
 	struct rb_node *node;
+	int i;
 
-	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
-		struct stable_node *stable_node;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		for (node = rb_first(&root_stable_tree[i]); node;
+				node = rb_next(node)) {
+			struct stable_node *stable_node;
+
+			stable_node = rb_entry(node, struct stable_node, node);
+			if (stable_node->kpfn >= start_pfn &&
+			    stable_node->kpfn < end_pfn)
+				return stable_node;
+		}
 
-		stable_node = rb_entry(node, struct stable_node, node);
-		if (stable_node->kpfn >= start_pfn &&
-		    stable_node->kpfn < end_pfn)
-			return stable_node;
-	}
 	return NULL;
 }
 
@@ -1926,6 +1977,47 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 KSM_ATTR(run);
 
+#ifdef CONFIG_NUMA
+static ssize_t merge_across_nodes_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
+}
+
+static ssize_t merge_across_nodes_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long knob;
+
+	err = kstrtoul(buf, 10, &knob);
+	if (err)
+		return err;
+	if (knob > 1)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+	if (ksm_run & KSM_RUN_MERGE) {
+		err = -EBUSY;
+	} else {
+		if (ksm_merge_across_nodes != knob) {
+			if (ksm_pages_shared > 0)
+				err = -EBUSY;
+			else
+				ksm_merge_across_nodes = knob;
+		}
+	}
+
+	if (err)
+		count = err;
+	mutex_unlock(&ksm_thread_mutex);
+
+	return count;
+}
+KSM_ATTR(merge_across_nodes);
+#endif
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
@@ -1980,6 +2072,9 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+	&merge_across_nodes_attr.attr,
+#endif
 	NULL,
 };
 
-- 
1.7.11.4



* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-09-24  0:56 [PATCH v4] KSM: numa awareness sysfs knob Petr Holasek
@ 2012-09-27 15:45 ` Rik van Riel
  2012-09-28 11:26 ` Andrea Arcangeli
  2012-10-01  1:37 ` Hugh Dickins
  2 siblings, 0 replies; 9+ messages in thread
From: Rik van Riel @ 2012-09-27 15:45 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Chris Wright,
	Izik Eidus, David Rientjes, linux-kernel, linux-mm, Anton Arapov

On 09/23/2012 08:56 PM, Petr Holasek wrote:
> Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which control merging pages across different numa nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
>
> Typical use-case could be a lot of KVM guests on NUMA machine
> and cpus from more distant nodes would have significant increase
> of access latency to the merged ksm page. Sysfs knob was choosen
> for higher variability when some users still prefers higher amount
> of saved physical memory regardless of access latency.
>
> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_nodes value is possible only
> when there are not any ksm shared pages in system.
>
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:

Acked-by: Rik van Riel <riel@redhat.com>



* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-09-24  0:56 [PATCH v4] KSM: numa awareness sysfs knob Petr Holasek
  2012-09-27 15:45 ` Rik van Riel
@ 2012-09-28 11:26 ` Andrea Arcangeli
  2012-10-01  0:36   ` Hugh Dickins
  2012-10-01  1:37 ` Hugh Dickins
  2 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2012-09-28 11:26 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

Hi everyone,

On Mon, Sep 24, 2012 at 02:56:06AM +0200, Petr Holasek wrote:
> +static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };

Not initializing is better: we don't waste .data and it goes into the
.bss. Initializing only the first entry is useless anyway, since the
whole array gets initialized later (maybe it's safer to initialize it
once more in some init routine too, along with the one below, for peace
of mind, rather than only at the first scan).

> +static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };

same here: only the first entry gets the initializer, root_stable_tree[1..]
is left uninitialized

> @@ -1300,7 +1341,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		 */
>  		lru_add_drain_all();
>  
> -		root_unstable_tree = RB_ROOT;
> +		for (i = 0; i < MAX_NUMNODES; i++)
> +			root_unstable_tree[i] = RB_ROOT;

s/MAX_NUMNODES/nr_node_ids/
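
Put together, just as an illustration (whether ksm_init() is the right
place for the explicit init is left open):

static struct rb_root root_stable_tree[MAX_NUMNODES];	/* plain .bss */
static struct rb_root root_unstable_tree[MAX_NUMNODES];

static int __init ksm_init(void)
{
	int nid;

	for (nid = 0; nid < nr_node_ids; nid++) {
		root_stable_tree[nid] = RB_ROOT;
		root_unstable_tree[nid] = RB_ROOT;
	}
	/* ... rest of ksm_init() as before ... */
}

/* and the per-scan reset in scan_get_next_rmap_item(): */
	for (nid = 0; nid < nr_node_ids; nid++)
		root_unstable_tree[nid] = RB_ROOT;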

> @@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
>  	stable_node = page_stable_node(newpage);
>  	if (stable_node) {
>  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> -		stable_node->kpfn = page_to_pfn(newpage);
> +
> +		if (ksm_merge_across_nodes ||
> +				page_to_nid(oldpage) == page_to_nid(newpage))
> +			stable_node->kpfn = page_to_pfn(newpage);
> +		else
> +			remove_node_from_stable_tree(stable_node);
>  	}
>  }

This will result in memory corruption because the ksm page still
points to the stable_node that has been freed (that is copied by the
migrate code when the newpage->mapping = oldpage->mapping).
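
For reference, roughly what is involved (abridged from my reading of the
current sources, not from this patch):

/* mm/migrate.c, move_to_new_page(): the KSM marker travels with ->mapping */
newpage->mapping = page->mapping;

/* mm/ksm.c: for a PageKsm page, ->mapping is where the stable_node lives */
static inline struct stable_node *page_stable_node(struct page *page)
{
	return PageKsm(page) ? page_rmapping(page) : NULL;
}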

What should happen is that the ksm page of src_node is merged with
the pre-existing ksm page in the dst_node of the migration. That's the
complex case, the easy case is if there's no pre-existing page and
that just requires an insert of the stable node in a different rbtree
I think (without actual pagetable mangling).

It may be simpler to break cow across migrate and require the ksm
scanner to re-merge it however.

Basically the above would remove the ability to rmap the ksm page
(i.e. rmap crashes on a dangling pointer), but we need rmap to be
functional at all times on all ksm pages.

Hugh what's your views on this ksm_migrate_page NUMA aware that is
giving trouble? What would you prefer? Merge two ksm pages together
(something that has never happened before), break_cow (so we don't
have to merge two ksm pages together in the first place and we
fallback in the regular paths) etc...

All the rest looks very good, great work Petr!

Thanks,
Andrea


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-09-28 11:26 ` Andrea Arcangeli
@ 2012-10-01  0:36   ` Hugh Dickins
  2012-10-01 11:53     ` Andrea Arcangeli
  0 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2012-10-01  0:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Petr Holasek, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

Sorry for taking so long to look at Petr's nice work: it's more than
what I can think through in odd stolen moments; but yesterday at last
I managed to get down to remembering KSM and giving this some time.

On Fri, 28 Sep 2012, Andrea Arcangeli wrote:
> On Mon, Sep 24, 2012 at 02:56:06AM +0200, Petr Holasek wrote:
> > @@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
> >  	stable_node = page_stable_node(newpage);
> >  	if (stable_node) {
> >  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> > -		stable_node->kpfn = page_to_pfn(newpage);
> > +
> > +		if (ksm_merge_across_nodes ||
> > +				page_to_nid(oldpage) == page_to_nid(newpage))
> > +			stable_node->kpfn = page_to_pfn(newpage);
> > +		else
> > +			remove_node_from_stable_tree(stable_node);
> >  	}
> >  }
> 
> This will result in memory corruption because the ksm page still
> points to the stable_node that has been freed (that is copied by the
> migrate code when the newpage->mapping = oldpage->mapping).

That is a very acute observation!  (And there was I searching for
where we copy over the PageKsm ;) which of course is in ->mapping.)

> 
> What should happen is that the ksm page of src_node is merged with
> the pre-existing ksm page in the dst_node of the migration. That's the
> complex case, the easy case is if there's no pre-existing page and
> that just requires an insert of the stable node in a different rbtree
> I think (without actual pagetable mangling).

I believe it's not as bad as "pagetable mangling" suggests.

> 
> It may be simpler to break cow across migrate and require the ksm
> scanner to re-merge it however.

I'm all for the simplest solution, but here in ksm_migrate_page()
is not a good place for COW breaking - we don't want to get into
an indefinite number of page allocations, and the risk of failure.

I was toying with the idea of leaving the new page in the old NUMAnode's
stable tree temporarily, until ksmd comes around again, and let that
clean it up.  Which would imply less reliance on get_kpfn_nid(),
and not skipping PageKsm in ksm_do_scan(), and...

But it's not all that simple, and I think we can do better.

> 
> Basically the above would remove the ability to rmap the ksm page
> (i.e. rmap crashes on a dangling pointer), but we need rmap to be
> functional at all times on all ksm pages.

Yes.

> 
> Hugh what's your views on this ksm_migrate_page NUMA aware that is
> giving trouble? What would you prefer? Merge two ksm pages together
> (something that has never happened before), break_cow (so we don't
> have to merge two ksm pages together in the first place and we
> fallback in the regular paths) etc...

It's only just fully dawned on me that ksm_migrate_page() is actually
a very convenient place: no pagetable mangling required, because we
know that neither old nor new page is at this instant mapped into
userspace at all - don't we?  Instead there are swap-like migration
entries plugging all ptes until we're ready to put in the new page.

So I think what we really want to do is change the ksm_migrate_page()
interface a little, and probably the precise position it's called from,
to allow it to update mm/migrate.c's newpage - in the collision case
when the new NUMAnode already has a stable copy of this page.  But when
it doesn't, just move KSMnode from old NUMAnode's stable tree to new.

How well the existing ksm.c primitives are suited to this, I've not
checked.  Probably not too well, but shouldn't be hard to add what's
needed.

What do you think?  Does that sound reasonable, Petr?

By the way, this is probably a good occasion to remind ourselves,
that page migration is still usually disabled on PageKsm pages:
ksm_migrate_page() is only being called for memory hotremove.  I had
been about to complain that calling remove_node_from_stable_tree()
from ksm_migrate_page() is also unsafe from a locking point of view;
until I remembered that MEM_GOING_OFFLINE has previously acquired
ksm_thread_mutex.

But page migration is much more important now than three years ago,
with compaction relying upon it, CMA and THP relying upon compaction,
and lumpy reclaim gone.

Whilst it should not be mixed up in the NUMA patch itself, I think we
need now to relax that restriction.  I found re-reading my 62b61f611e
"ksm: memory hotremove migration only" was helpful.  Petr, is that
something you could take on also?  I _think_ it's just a matter of
protecting the stable tree(s) with an additional mutex (which ought
not to be contended, since ksm_thread_mutex is normally held above
it, except in migration); then removing a number of PageKsm refusals
(and the offlining arg to unmap_and_move() etc).  But perhaps there's
more to it, I haven't gone over it properly.

> 
> All the rest looks very good, great work Petr!

Yes, I agree; but a few more comments I'll make against the v4 post.

Hugh


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-09-24  0:56 [PATCH v4] KSM: numa awareness sysfs knob Petr Holasek
  2012-09-27 15:45 ` Rik van Riel
  2012-09-28 11:26 ` Andrea Arcangeli
@ 2012-10-01  1:37 ` Hugh Dickins
  2012-10-01  8:14   ` Hugh Dickins
  2012-10-08 23:00   ` Petr Holasek
  2 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2012-10-01  1:37 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Andrea Arcangeli, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

Andrea's point about ksm_migrate_page() is an important one, and I've
answered that against his mail, but here's some other easier points.

On Mon, 24 Sep 2012, Petr Holasek wrote:

> Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which control merging pages across different numa nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
> 
> Typical use-case could be a lot of KVM guests on NUMA machine
> and cpus from more distant nodes would have significant increase
> of access latency to the merged ksm page. Sysfs knob was choosen
> for higher variability when some users still prefers higher amount
> of saved physical memory regardless of access latency.
> 
> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_nodes value is possible only

merge_across_nodes

> when there are not any ksm shared pages in system.
> 
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times in percentage of average
> were following:
> 
> merge_nodes=1

merge_across_nodes

> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes	1.7%
> 
> merge_nodes=0

merge_across_nodes

> 2 nodes	1%
> 4 nodes	0.32%
> 8 nodes	0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> 
> Changelog:
> 
> v2: Andrew's objections were reflected:
> 	- value of merge_nodes can't be changed while there are some ksm
> 	pages in system
> 	- merge_nodes sysfs entry appearance depends on CONFIG_NUMA
> 	- more verbose documentation
> 	- added some performance testing results
> 
> v3:	- more verbose documentation
> 	- fixed race in merge_nodes store function
> 	- introduced share_all debugging knob proposed by Andrew
> 	- minor cleanups
> 
> v4:	- merge_nodes was renamed to merge_across_nodes
> 	- share_all debug knob was dropped

Yes, you were right to drop the share_all knob for now.  I do like the
idea, but it was quite inappropriate to mix it in with this NUMAnode
patch.  And although I like the idea, I think it wants a bit more: I
already have a hacky "allksm" boot option patch to mm/mmap.c for my
own testing, which adds VM_MERGEABLE everywhere.  If I gave that up
(I'd like to!) in favour of yours, I think I would still be missing
all the mms that are created before there's a chance to switch the
share_all mode on.  Or maybe I misread your v3.  Anyway, that's a
different topic, happily taken off today's agenda.

> 	- get_kpfn_nid helper
> 	- fixed page migration behaviour
> 
> Signed-off-by: Petr Holasek <pholasek@redhat.com>
> ---
>  Documentation/vm/ksm.txt |   7 +++
>  mm/ksm.c                 | 135 ++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 122 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index b392e49..100d58d 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_nodes      - specifies if pages from different numa nodes can be merged.

merge_across_nodes

> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +
>  run              - set 0 to stop ksmd from running but keep merged pages,
>                     set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>                     set 2 to stop ksmd and unmerge all pages currently merged,
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..7c82032 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -36,6 +36,7 @@
>  #include <linux/hash.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -140,7 +141,10 @@ struct rmap_item {
>  	unsigned long address;		/* + low bits used for flags below */
>  	unsigned int oldchecksum;	/* when unstable */
>  	union {
> -		struct rb_node node;	/* when node of unstable tree */
> +		struct {
> +			struct rb_node node;	/* when node of unstable tree */
> +			struct rb_root *root;
> +		};

This worries me a little, enlarging struct rmap_item: there may
be very many of them in the system, best to minimize their size.

(This struct rb_root *root is used for one thing only, isn't it?  For the
unstable rb_erase() in remove_rmap_item_from_tree().  It annoys me that
we need to record root just for that, but I don't see an alternative.)

The 64-bit case can be easily improved by locating unsigned int nid
after oldchecksum instead.  The 32-bit case (which supports smaller
NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
but 2 bits would be nice for keeping the BUG_ON(age > 1).

But you may think I'm going too far there, and prefer just to put
#ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
not enlarge the more basic 32-bit configuration.
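
The simpler variant would look something like this (the surrounding fields
are abridged from the current struct, so treat it as a sketch rather than
the final layout):

struct rmap_item {
	struct rmap_item *rmap_list;
	struct anon_vma *anon_vma;	/* when stable */
	struct mm_struct *mm;
	unsigned long address;		/* + low bits used for flags below */
	unsigned int oldchecksum;	/* when unstable */
#ifdef CONFIG_NUMA
	unsigned int nid;		/* fills the hole after oldchecksum on 64-bit */
#endif
	union {
		struct rb_node node;	/* when node of unstable tree */
		struct {		/* when listed from stable tree */
			struct stable_node *head;
			struct hlist_node hlist;
		};
	};
};

remove_rmap_item_from_tree() would then do
rb_erase(&rmap_item->node, &root_unstable_tree[rmap_item->nid])
and the union stays as it was.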

>  		struct {		/* when listed from stable tree */
>  			struct stable_node *head;
>  			struct hlist_node hlist;
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG	0x200	/* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };
> +static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };
>  
>  #define MM_SLOTS_HASH_SHIFT 10
>  #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
> @@ -189,6 +193,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2
> @@ -447,10 +454,25 @@ out:		page = NULL;
>  	return page;
>  }
>  
> +/*
> + * This helper is used for getting right index into array of tree roots.
> + * When merge_across_nodes knob is set to 1, there are only two rb-trees for
> + * stable and unstable pages from all nodes with roots in index 0. Otherwise,
> + * every node has its own stable and unstable tree.
> + */
> +static inline int get_kpfn_nid(unsigned long kpfn)
> +{
> +	if (ksm_merge_across_nodes)
> +		return 0;
> +	else
> +		return pfn_to_nid(kpfn);
> +}
> +
>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>  	struct rmap_item *rmap_item;
>  	struct hlist_node *hlist;
> +	int nid;
>  
>  	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
> @@ -462,7 +484,10 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  		cond_resched();
>  	}
>  
> -	rb_erase(&stable_node->node, &root_stable_tree);
> +	nid = get_kpfn_nid(stable_node->kpfn);
> +
> +	rb_erase(&stable_node->node,
> +			&root_stable_tree[nid]);

Oh, if I ever get my fingers on that, it's going to join the line above :)

>  	free_stable_node(stable_node);
>  }
>  
> @@ -560,7 +585,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>  		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
>  		BUG_ON(age > 1);
>  		if (!age)
> -			rb_erase(&rmap_item->node, &root_unstable_tree);
> +			rb_erase(&rmap_item->node, rmap_item->root);

Right, this is where we have to use rmap_item->root
or &root_unstable_tree[rmap_item->nid].

>  
>  		ksm_pages_unshared--;
>  		rmap_item->address &= PAGE_MASK;
> @@ -989,8 +1014,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
>   */
>  static struct page *stable_tree_search(struct page *page)
>  {
> -	struct rb_node *node = root_stable_tree.rb_node;
> +	struct rb_node *node;
>  	struct stable_node *stable_node;
> +	int nid;
>  
>  	stable_node = page_stable_node(page);
>  	if (stable_node) {			/* ksm page forked */
> @@ -998,6 +1024,9 @@ static struct page *stable_tree_search(struct page *page)
>  		return page;
>  	}
>  
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	node = root_stable_tree[nid].rb_node;
> +
>  	while (node) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1032,10 +1061,14 @@ static struct page *stable_tree_search(struct page *page)
>   */
>  static struct stable_node *stable_tree_insert(struct page *kpage)
>  {
> -	struct rb_node **new = &root_stable_tree.rb_node;
> +	int nid;
> +	struct rb_node **new = NULL;

No need to initialize new.

>  	struct rb_node *parent = NULL;
>  	struct stable_node *stable_node;
>  
> +	nid = get_kpfn_nid(page_to_pfn(kpage));
> +	new = &root_stable_tree[nid].rb_node;
> +
>  	while (*new) {
>  		struct page *tree_page;
>  		int ret;
> @@ -1069,7 +1102,7 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
>  		return NULL;
>  
>  	rb_link_node(&stable_node->node, parent, new);
> -	rb_insert_color(&stable_node->node, &root_stable_tree);
> +	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
>  
>  	INIT_HLIST_HEAD(&stable_node->hlist);
>  
> @@ -1097,10 +1130,16 @@ static
>  struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>  					      struct page *page,
>  					      struct page **tree_pagep)
> -

Thank you!

>  {
> -	struct rb_node **new = &root_unstable_tree.rb_node;
> +	struct rb_node **new = NULL;

No need to initialize new.

> +	struct rb_root *root;
>  	struct rb_node *parent = NULL;
> +	int nid;
> +
> +	nid = get_kpfn_nid(page_to_pfn(page));
> +	root = &root_unstable_tree[nid];
> +
> +	new = &root->rb_node;
>  
>  	while (*new) {
>  		struct rmap_item *tree_rmap_item;
> @@ -1138,8 +1177,9 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>  
>  	rmap_item->address |= UNSTABLE_FLAG;
>  	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
> +	rmap_item->root = root;
>  	rb_link_node(&rmap_item->node, parent, new);
> -	rb_insert_color(&rmap_item->node, &root_unstable_tree);
> +	rb_insert_color(&rmap_item->node, root);
>  
>  	ksm_pages_unshared++;
>  	return NULL;
> @@ -1282,6 +1322,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  	struct mm_slot *slot;
>  	struct vm_area_struct *vma;
>  	struct rmap_item *rmap_item;
> +	int i;

I'd call it nid myself.

>  
>  	if (list_empty(&ksm_mm_head.mm_list))
>  		return NULL;
> @@ -1300,7 +1341,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		 */
>  		lru_add_drain_all();
>  
> -		root_unstable_tree = RB_ROOT;
> +		for (i = 0; i < MAX_NUMNODES; i++)
> +			root_unstable_tree[i] = RB_ROOT;
>  
>  		spin_lock(&ksm_mmlist_lock);
>  		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> @@ -1758,7 +1800,12 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
>  	stable_node = page_stable_node(newpage);
>  	if (stable_node) {
>  		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> -		stable_node->kpfn = page_to_pfn(newpage);
> +
> +		if (ksm_merge_across_nodes ||
> +				page_to_nid(oldpage) == page_to_nid(newpage))
> +			stable_node->kpfn = page_to_pfn(newpage);
> +		else
> +			remove_node_from_stable_tree(stable_node);

Important point from Andrea discussed elsewhere.

>  	}
>  }
>  #endif /* CONFIG_MIGRATION */
> @@ -1768,15 +1815,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
>  						 unsigned long end_pfn)
>  {
>  	struct rb_node *node;
> +	int i;

I'd call it nid myself.

>  
> -	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> -		struct stable_node *stable_node;
> +	for (i = 0; i < MAX_NUMNODES; i++)

It's irritating to have to do this outer nid loop, but I think you're
right, that the memory hotremove notification does not quite tell us
which node to look at.  Or can we derive that from start_pfn?  Would
it be safe to assume that end_pfn-1 must be in the same node?

> +		for (node = rb_first(&root_stable_tree[i]); node;
> +				node = rb_next(node)) {
> +			struct stable_node *stable_node;
> +
> +			stable_node = rb_entry(node, struct stable_node, node);
> +			if (stable_node->kpfn >= start_pfn &&
> +			    stable_node->kpfn < end_pfn)
> +				return stable_node;
> +		}
>  
> -		stable_node = rb_entry(node, struct stable_node, node);
> -		if (stable_node->kpfn >= start_pfn &&
> -		    stable_node->kpfn < end_pfn)
> -			return stable_node;
> -	}
>  	return NULL;
>  }
>  
> @@ -1926,6 +1977,47 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
>  }
>  KSM_ATTR(run);
>  
> +#ifdef CONFIG_NUMA
> +static ssize_t merge_across_nodes_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
> +}
> +
> +static ssize_t merge_across_nodes_store(struct kobject *kobj,
> +				   struct kobj_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long knob;
> +
> +	err = kstrtoul(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	if (knob > 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	if (ksm_run & KSM_RUN_MERGE) {
> +		err = -EBUSY;
> +	} else {
> +		if (ksm_merge_across_nodes != knob) {
> +			if (ksm_pages_shared > 0)
> +				err = -EBUSY;
> +			else
> +				ksm_merge_across_nodes = knob;
> +		}
> +	}
> +
> +	if (err)
> +		count = err;
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return count;
> +}
> +KSM_ATTR(merge_across_nodes);
> +#endif
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>  				 struct kobj_attribute *attr, char *buf)
>  {
> @@ -1980,6 +2072,9 @@ static struct attribute *ksm_attrs[] = {
>  	&pages_unshared_attr.attr,
>  	&pages_volatile_attr.attr,
>  	&full_scans_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&merge_across_nodes_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> -- 
> 1.7.11.4

Looks nice - thank you.

Hugh


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-10-01  1:37 ` Hugh Dickins
@ 2012-10-01  8:14   ` Hugh Dickins
  2012-10-08 23:00   ` Petr Holasek
  1 sibling, 0 replies; 9+ messages in thread
From: Hugh Dickins @ 2012-10-01  8:14 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Andrea Arcangeli, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

On Sun, 30 Sep 2012, Hugh Dickins wrote:
> Andrea's point about ksm_migrate_page() is an important one, and I've
> answered that against his mail, but here's some other easier points.

There's another point that I completely forgot to make once I got down
to the details of your patch.

Somewhere, I didn't decide exactly where, perhaps near the memcmp_pages()
call in unstable_tree_search_insert(), you do need to check that the page
"in" the unstable tree still belongs to the NUMAnode of the page we're
comparing with.

While that is, of course, the NUMAnode of the unstable tree we're
searching, the unstable tree places no hold on the pages "in" it (it's
actually a tree of rmap_items, not of pages), so they could get migrated
to a different NUMAnode (or faulted out and then faulted back in on a
different NUMAnode) since the rmap_item was placed in that tree.

This is little different from the other instabilities of the unstable
tree, it's not a big deal, and gets corrected (usually) next time around;
but you do want to check, to avoid promoting such a mismatch into the
stable tree.
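
Concretely, something along these lines next to the get_mergeable_page()
call would do (just a sketch, the exact form is up to you):

		tree_page = get_mergeable_page(tree_rmap_item);
		if (!tree_page)
			return NULL;

		/*
		 * If tree_page has been migrated (or faulted back in) on
		 * another node since its rmap_item went into this tree,
		 * don't promote a cross-node match into the stable tree;
		 * it will land in its proper unstable tree next time.
		 */
		if (!ksm_merge_across_nodes &&
		    page_to_nid(tree_page) != page_to_nid(page)) {
			put_page(tree_page);
			return NULL;
		}

		ret = memcmp_pages(page, tree_page);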

Hugh


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-10-01  0:36   ` Hugh Dickins
@ 2012-10-01 11:53     ` Andrea Arcangeli
  2012-10-01 23:47       ` Hugh Dickins
  0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-01 11:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Petr Holasek, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

Hi Hugh,

On Sun, Sep 30, 2012 at 05:36:33PM -0700, Hugh Dickins wrote:
> I'm all for the simplest solution, but here in ksm_migrate_page()
> is not a good place for COW breaking - we don't want to get into
> an indefinite number of page allocations, and the risk of failure.

Agreed, not a good place to break_cow.

> I was toying with the idea of leaving the new page in the old NUMAnode's
> stable tree temporarily, until ksmd comes around again, and let that
> clean it up.  Which would imply less reliance on get_kpfn_nid(),
> and not skipping PageKsm in ksm_do_scan(), and...

There a break_cow could more easily run to cleanup the errors in the
stable tree. It'd be one way to avoid altering migrate.

> But it's not all that simple, and I think we can do better.

Agreed.

> It's only just fully dawned on me that ksm_migrate_page() is actually
> a very convenient place: no pagetable mangling required, because we
> know that neither old nor new page is at this instant mapped into
> userspace at all - don't we?  Instead there are swap-like migration
> entries plugging all ptes until we're ready to put in the new page.

Yes.

> So I think what we really want to do is change the ksm_migrate_page()
> interface a little, and probably the precise position it's called from,
> to allow it to update mm/migrate.c's newpage - in the collision case

I agree your proposed modification to the ->migratepage protocol
should be able to deal with that. We should notify the caller the
"newpage" has been freed and we transferred all ownership to an
"alternate_newpage". So then migrate will restore the ptes pointing to
the alternate_newpage (not the allocated newpage). It should be also
possible to get an hold on the alternate_newpage, before having to
allocate newpage.

> when the new NUMAnode already has a stable copy of this page.  But when
> it doesn't, just move KSMnode from old NUMAnode's stable tree to new.

Agreed, that is the easy case and doesn't require interface changes.

> How well the existing ksm.c primitives are suited to this, I've not
> checked.  Probably not too well, but shouldn't be hard to add what's
> needed.
> 
> What do you think?  Does that sound reasonable, Petr?

Sounds like a plan, I agree the modification to migrate is the best
way to go here. Only cons: it's not the simplest solution.

> By the way, this is probably a good occasion to remind ourselves,
> that page migration is still usually disabled on PageKsm pages:
> ksm_migrate_page() is only being called for memory hotremove.  I had
> been about to complain that calling remove_node_from_stable_tree()
> from ksm_migrate_page() is also unsafe from a locking point of view;
> until I remembered that MEM_GOING_OFFLINE has previously acquired
> ksm_thread_mutex.
> 
> But page migration is much more important now than three years ago,
> with compaction relying upon it, CMA and THP relying upon compaction,
> and lumpy reclaim gone.

Agreed. AutoNUMA needs it too: AutoNUMA migrates all types of memory,
not just anonymous memory, as long as the mapcount == 1.

If all users break_cow except one, then the KSM page can move around
if it has left just one user, we don't need to wait this last user to
break_cow (which may never happen) before can move it.

> Whilst it should not be mixed up in the NUMA patch itself, I think we
> need now to relax that restriction.  I found re-reading my 62b61f611e
> "ksm: memory hotremove migration only" was helpful.  Petr, is that
> something you could take on also?  I _think_ it's just a matter of
> protecting the stable tree(s) with an additional mutex (which ought
> not to be contended, since ksm_thread_mutex is normally held above
> it, except in migration); then removing a number of PageKsm refusals
> (and the offlining arg to unmap_and_move() etc).  But perhaps there's
> more to it, I haven't gone over it properly.

Removing the restriction sounds good. In addition to
compaction/AutoNUMA etc.. KSM pages are marked MOVABLE so it's likely
not good for the anti frag pageblock types.

So if I understand this correctly, there would be no way to trigger
the stable tree corruption in current v4, without memory hotremove.

> Yes, I agree; but a few more comments I'll make against the v4 post.

Cool.

Thanks for the help!
Andrea


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-10-01 11:53     ` Andrea Arcangeli
@ 2012-10-01 23:47       ` Hugh Dickins
  0 siblings, 0 replies; 9+ messages in thread
From: Hugh Dickins @ 2012-10-01 23:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Petr Holasek, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

On Mon, 1 Oct 2012, Andrea Arcangeli wrote:
> On Sun, Sep 30, 2012 at 05:36:33PM -0700, Hugh Dickins wrote:
> > I'm all for the simplest solution, but here in ksm_migrate_page()
> > is not a good place for COW breaking - we don't want to get into
> > an indefinite number of page allocations, and the risk of failure.
> 
> Agreed, not a good place to break_cow.
> 
> > I was toying with the idea of leaving the new page in the old NUMAnode's
> > stable tree temporarily, until ksmd comes around again, and let that
> > clean it up.  Which would imply less reliance on get_kpfn_nid(),
> > and not skipping PageKsm in ksm_do_scan(), and...
> 
> There a break_cow could more easily run to cleanup the errors in the
> stable tree. It'd be one way to avoid altering migrate.

I get the feeling that you've thought this path through further
than I have, and are pretty sure that it will stay simple.

Whereas I was a little anxious about leaving a KSMnode behind in a
tree which ostensibly belongs to an already-offlined NUMAnode.

Don't let me push you into more far-reaching changes than are
necessary.  I didn't think we needed to change ->migrate, just
migrate_page_copy() in the anonymous case (the only one KSM affects).

Maybe that's bad practice, and maybe I'm wrong, that it would have to
percolate up a few levels of callers.  And I wouldn't want to think
through the memcg page charging in a hurry.

If you can see that another ksm_do_scan() break_cow() will sort it
all out simply and safely (together with extending the node pointer
or nid usage from unstable to stable tree, instead of get_kpfn_nid()),
please just go for that.

We can always be cleverer later - but I doubt it'll be a priority.

H(always good for saying the opposite in alternate mails)ugh

> 
> > But it's not all that simple, and I think we can do better.
> 
> Agreed.
> 
> > It's only just fully dawned on me that ksm_migrate_page() is actually
> > a very convenient place: no pagetable mangling required, because we
> > know that neither old nor new page is at this instant mapped into
> > userspace at all - don't we?  Instead there are swap-like migration
> > entries plugging all ptes until we're ready to put in the new page.
> 
> Yes.
> 
> > So I think what we really want to do is change the ksm_migrate_page()
> > interface a little, and probably the precise position it's called from,
> > to allow it to update mm/migrate.c's newpage - in the collision case
> 
> I agree your proposed modification to the ->migratepage protocol
> should be able to deal with that. We should notify the caller the
> "newpage" has been freed and we transferred all ownership to an
> "alternate_newpage". So then migrate will restore the ptes pointing to
> the alternate_newpage (not the allocated newpage). It should be also
> possible to get an hold on the alternate_newpage, before having to
> allocate newpage.
> 
> > when the new NUMAnode already has a stable copy of this page.  But when
> > it doesn't, just move KSMnode from old NUMAnode's stable tree to new.
> 
> Agreed, that is the easy case and doesn't require interface changes.
> 
> > How well the existing ksm.c primitives are suited to this, I've not
> > checked.  Probably not too well, but shouldn't be hard to add what's
> > needed.
> > 
> > What do you think?  Does that sound reasonable, Petr?
> 
> Sounds like a plan, I agree the modification to migrate is the best
> way to go here. Only cons: it's not the simplest solution.
> 
> > By the way, this is probably a good occasion to remind ourselves,
> > that page migration is still usually disabled on PageKsm pages:
> > ksm_migrate_page() is only being called for memory hotremove.  I had
> > been about to complain that calling remove_node_from_stable_tree()
> > from ksm_migrate_page() is also unsafe from a locking point of view;
> > until I remembered that MEM_GOING_OFFLINE has previously acquired
> > ksm_thread_mutex.
> > 
> > But page migration is much more important now than three years ago,
> > with compaction relying upon it, CMA and THP relying upon compaction,
> > and lumpy reclaim gone.
> 
> Agreed. AutoNUMA needs it too: AutoNUMA migrates all types of memory,
> not just anonymous memory, as long as the mapcount == 1.
> 
> If all users break_cow except one, then the KSM page can move around
> if it has left just one user, we don't need to wait this last user to
> break_cow (which may never happen) before can move it.
> 
> > Whilst it should not be mixed up in the NUMA patch itself, I think we
> > need now to relax that restriction.  I found re-reading my 62b61f611e
> > "ksm: memory hotremove migration only" was helpful.  Petr, is that
> > something you could take on also?  I _think_ it's just a matter of
> > protecting the stable tree(s) with an additional mutex (which ought
> > not to be contended, since ksm_thread_mutex is normally held above
> > it, except in migration); then removing a number of PageKsm refusals
> > (and the offlining arg to unmap_and_move() etc).  But perhaps there's
> > more to it, I haven't gone over it properly.
> 
> Removing the restriction sounds good. In addition to
> compaction/AutoNUMA etc.. KSM pages are marked MOVABLE so it's likely
> not good for the anti frag pageblock types.
> 
> So if I understand this correctly, there would be no way to trigger
> the stable tree corruption in current v4, without memory hotremove.
> 
> > Yes, I agree; but a few more comments I'll make against the v4 post.
> 
> Cool.
> 
> Thanks for the help!
> Andrea


* Re: [PATCH v4] KSM: numa awareness sysfs knob
  2012-10-01  1:37 ` Hugh Dickins
  2012-10-01  8:14   ` Hugh Dickins
@ 2012-10-08 23:00   ` Petr Holasek
  1 sibling, 0 replies; 9+ messages in thread
From: Petr Holasek @ 2012-10-08 23:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, Chris Wright, Izik Eidus,
	Rik van Riel, David Rientjes, linux-kernel, linux-mm,
	Anton Arapov

Hi Hugh,

first of all, please let me apologize for the delay in my response and thank
you for your extensive review!

On Sun, 30 Sep 2012, Hugh Dickins wrote:
> Andrea's point about ksm_migrate_page() is an important one, and I've
> answered that against his mail, but here's some other easier points.
> 
> On Mon, 24 Sep 2012, Petr Holasek wrote:
> 
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which control merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> > 

...

> > 
> > v4:	- merge_nodes was renamed to merge_across_nodes
> > 	- share_all debug knob was dropped
> 
> Yes, you were right to drop the share_all knob for now.  I do like the
> idea, but it was quite inappropriate to mix it in with this NUMAnode
> patch.  And although I like the idea, I think it wants a bit more: I
> already have a hacky "allksm" boot option patch to mm/mmap.c for my
> own testing, which adds VM_MERGEABLE everywhere.  If I gave that up
> (I'd like to!) in favour of yours, I think I would still be missing
> all the mms that are created before there's a chance to switch the
> share_all mode on.  Or maybe I misread your v3.  Anyway, that's a
> different topic, happily taken off today's agenda.
> 

Agreed, it hid the original purpose of this patch and made eventual merging
more difficult. So let's move it lower on the ksm todo list for now :)

> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 47c8853..7c82032 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -36,6 +36,7 @@
> >  #include <linux/hash.h>
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> > +#include <linux/numa.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -140,7 +141,10 @@ struct rmap_item {
> >  	unsigned long address;		/* + low bits used for flags below */
> >  	unsigned int oldchecksum;	/* when unstable */
> >  	union {
> > -		struct rb_node node;	/* when node of unstable tree */
> > +		struct {
> > +			struct rb_node node;	/* when node of unstable tree */
> > +			struct rb_root *root;
> > +		};
> 
> This worries me a little, enlarging struct rmap_item: there may
> be very many of them in the system, best to minimize their size.
> 
> (This struct rb_root *root is used for one thing only, isn't it?  For the
> unstable rb_erase() in remove_rmap_item_from_tree().  It annoys me that
> we need to record root just for that, but I don't see an alternative.)

Yes, I've played quite a lot with this issue, but wasn't able to find an
alternative solution either.

> 
> The 64-bit case can be easily improved by locating unsigned int nid
> after oldchecksum instead.  The 32-bit case (which supports smaller
> NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
> nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
> STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
> but 2 bits would be nice for keeping the BUG_ON(age > 1).
> 
> But you may think I'm going too far there, and prefer just to put
> #ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
> not enlarge the more basic 32-bit configuration.
> 

I like your idea of the unsigned int nid, I will implement it in the next version.

...

> >  
> > -	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> > -		struct stable_node *stable_node;
> > +	for (i = 0; i < MAX_NUMNODES; i++)
> 
> It's irritating to have to do this outer nid loop, but I think you're
> right, that the memory hotremove notification does not quite tell us
> which node to look at.  Or can we derive that from start_pfn?  Would
> it be safe to assume that end_pfn-1 must be in the same node?
> 

I had assumed that we can't rely on end_pfn-1 being in the same node, but now
mm/memory_hotplug.c looks to me like we can rely on it, because the memory
hotremove callback is triggered for each zone from which we are removing memory.
So I think yes, we can optimize it in the way you mentioned above. If I am
wrong, please correct me :)
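
If that holds, the hotremove walk could collapse to a single tree, roughly
like this (assuming start_pfn and end_pfn-1 really do sit in the same node):

static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
						 unsigned long end_pfn)
{
	struct stable_node *stable_node;
	struct rb_node *node;
	int nid = get_kpfn_nid(start_pfn);

	for (node = rb_first(&root_stable_tree[nid]); node;
			node = rb_next(node)) {
		stable_node = rb_entry(node, struct stable_node, node);
		if (stable_node->kpfn >= start_pfn &&
		    stable_node->kpfn < end_pfn)
			return stable_node;
	}
	return NULL;
}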

...

> 
> Looks nice - thank you.
> 

Thanks for your help!

Petr

