linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] KSM: numa awareness sysfs knob
@ 2012-06-29 11:49 Petr Holasek
       [not found] ` <jsk93p$32e$1@dough.gmane.org>
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Petr Holasek @ 2012-06-29 11:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Andrea Arcangeli, Chris Wright, Izik Eidus,
	Rik van Riel, linux-kernel, linux-mm, Anton Arapov, Petr Holasek

Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).

A typical use case is a large number of KVM guests on a NUMA machine,
where CPUs on more distant nodes would see a significant increase
in access latency to a merged KSM page. A sysfs knob was chosen
for higher scalability.

Every NUMA node has its own stable and unstable trees, which makes
searching and inserting faster. The merge_nodes value can be changed
only when there are no KSM shared pages in the system.
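
As a rough userspace illustration of how an admin tool might flip the knob
under the constraint above, here is a minimal sketch in C. It assumes the
knob keeps the name merge_nodes proposed in this patch and that writing 2
to the run file unmerges all pages first; write_knob() and
ksm_restrict_merging_to_node() are hypothetical helper names, not part of
the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a short string to a sysfs-style file.
 * Returns 0 on success or a negative errno value on failure
 * (for example -EBUSY if the kernel refuses the change).
 */
int write_knob(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t written;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, value, len);
	if (written < 0)
		err = -errno;
	else if ((size_t)written != len)
		err = -EIO;

	if (close(fd) < 0 && !err)
		err = -errno;
	return err;
}

/*
 * Hypothetical sequence: unmerge everything, restrict merging to
 * pages of the same node, then start ksmd again. ksm_dir would
 * normally be "/sys/kernel/mm/ksm".
 */
int ksm_restrict_merging_to_node(const char *ksm_dir)
{
	static const struct {
		const char *file;
		const char *value;
	} steps[] = {
		{ "run", "2" },		/* stop ksmd and unmerge all pages */
		{ "merge_nodes", "0" },	/* knob name as proposed in this patch */
		{ "run", "1" },		/* resume merging */
	};
	char path[256];
	size_t i;
	int err;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		snprintf(path, sizeof(path), "%s/%s", ksm_dir, steps[i].file);
		err = write_knob(path, steps[i].value);
		if (err)
			return err;	/* e.g. -EBUSY while pages are still shared */
	}
	return 0;
}
```

Calling ksm_restrict_merging_to_node("/sys/kernel/mm/ksm") as root would
run the three writes in order and stop at the first one that fails.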

I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
measured the speed of memory access inside KVM guests with memory pinned
to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

The population standard deviations of access times, as a percentage of the
average, were as follows:

merge_nodes=1
2 nodes  1.4%
4 nodes  1.6%
8 nodes  1.7%

merge_nodes=0
2 nodes  1%
4 nodes  0.32%
8 nodes  0.018%
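
For readers who want to reproduce the metric, the following is a minimal
sketch of how a population standard deviation expressed as a percentage of
the mean could be computed from a set of access times. It is not taken from
alloc_pg.c; the function names are hypothetical, and the square root is
hand-rolled only so the example builds without linking libm.

```c
#include <stddef.h>

/* Newton iteration for sqrt(v), v >= 0; avoids a libm dependency. */
static double sqrt_newton(double v)
{
	double x = v;
	int i;

	if (v <= 0.0)
		return 0.0;
	for (i = 0; i < 60; i++)
		x = 0.5 * (x + v / x);
	return x;
}

/*
 * Population standard deviation of the samples, as a percentage of
 * their mean. Returns 0.0 for an empty set or a zero mean.
 */
double pstdev_percent(const double *samples, size_t n)
{
	double sum = 0.0, sq = 0.0, mean;
	size_t i;

	if (n == 0)
		return 0.0;

	for (i = 0; i < n; i++)
		sum += samples[i];
	mean = sum / (double)n;
	if (mean == 0.0)
		return 0.0;

	for (i = 0; i < n; i++) {
		double d = samples[i] - mean;

		sq += d * d;
	}
	/* divide by n (population), not n - 1 (sample) */
	return 100.0 * sqrt_newton(sq / (double)n) / mean;
}
```

A lower percentage means access times inside the guest vary less around
their average.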

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46

v2: addressed Andrew's objections:
	- value of merge_nodes can't be changed while there are some ksm
	pages in the system
	- merge_nodes sysfs entry appearance depends on CONFIG_NUMA
	- more verbose documentation
	- added some performance testing results

Signed-off-by: Petr Holasek <pholasek@redhat.com>
---
 Documentation/vm/ksm.txt |    6 +++
 mm/ksm.c                 |  117 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 104 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e49..79e5760 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,12 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
                    e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
                    Default: 20 (chosen for demonstration purposes)
 
+merge_nodes      - specifies if pages from different numa nodes can be merged.
+                   When set to 0, ksm merges only pages which physically
+                   resides in the memory area of same NUMA node. It brings
+                   lower latency to access to shared page.
+                   Default: 1
+
 run              - set 0 to stop ksmd from running but keep merged pages,
                    set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
                    set 2 to stop ksmd and unmerge all pages currently merged,
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..0f3aa63 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -36,6 +36,7 @@
 #include <linux/hash.h>
 #include <linux/freezer.h>
 #include <linux/oom.h>
+#include <linux/numa.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -120,6 +121,7 @@ struct stable_node {
 	struct rb_node node;
 	struct hlist_head hlist;
 	unsigned long kpfn;
+	struct rb_root *root;
 };
 
 /**
@@ -140,7 +142,10 @@ struct rmap_item {
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
 	union {
-		struct rb_node node;	/* when node of unstable tree */
+		struct {
+			struct rb_node node;	/* when node of unstable tree */
+			struct rb_root *root;
+		};
 		struct {		/* when listed from stable tree */
 			struct stable_node *head;
 			struct hlist_node hlist;
@@ -153,8 +158,8 @@ struct rmap_item {
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES] = { RB_ROOT, };
+static struct rb_root root_stable_tree[MAX_NUMNODES] = { RB_ROOT, };
 
 #define MM_SLOTS_HASH_SHIFT 10
 #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
@@ -189,6 +194,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_nodes = 1;
+
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
@@ -462,7 +470,7 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node)
 		cond_resched();
 	}
 
-	rb_erase(&stable_node->node, &root_stable_tree);
+	rb_erase(&stable_node->node, stable_node->root);
 	free_stable_node(stable_node);
 }
 
@@ -560,7 +568,7 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
 		if (!age)
-			rb_erase(&rmap_item->node, &root_unstable_tree);
+			rb_erase(&rmap_item->node, rmap_item->root);
 
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
@@ -989,8 +997,9 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
  */
 static struct page *stable_tree_search(struct page *page)
 {
-	struct rb_node *node = root_stable_tree.rb_node;
+	struct rb_node *node;
 	struct stable_node *stable_node;
+	int nid;
 
 	stable_node = page_stable_node(page);
 	if (stable_node) {			/* ksm page forked */
@@ -998,6 +1007,13 @@ static struct page *stable_tree_search(struct page *page)
 		return page;
 	}
 
+	if (ksm_merge_nodes)
+		nid = 0;
+	else
+		nid = page_to_nid(page);
+
+	node = root_stable_tree[nid].rb_node;
+
 	while (node) {
 		struct page *tree_page;
 		int ret;
@@ -1032,10 +1048,18 @@ static struct page *stable_tree_search(struct page *page)
  */
 static struct stable_node *stable_tree_insert(struct page *kpage)
 {
-	struct rb_node **new = &root_stable_tree.rb_node;
+	int nid;
+	struct rb_node **new = NULL;
 	struct rb_node *parent = NULL;
 	struct stable_node *stable_node;
 
+	if (ksm_merge_nodes)
+		nid = 0;
+	else
+		nid = page_to_nid(kpage);
+
+	new = &root_stable_tree[nid].rb_node;
+
 	while (*new) {
 		struct page *tree_page;
 		int ret;
@@ -1069,11 +1093,12 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
 		return NULL;
 
 	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
 	INIT_HLIST_HEAD(&stable_node->hlist);
 
 	stable_node->kpfn = page_to_pfn(kpage);
+	stable_node->root = &root_stable_tree[nid];
 	set_page_stable_node(kpage, stable_node);
 
 	return stable_node;
@@ -1097,11 +1122,18 @@ static
 struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 					      struct page *page,
 					      struct page **tree_pagep)
-
 {
-	struct rb_node **new = &root_unstable_tree.rb_node;
+	struct rb_node **new = NULL;
+	struct rb_root *root;
 	struct rb_node *parent = NULL;
 
+	if (ksm_merge_nodes)
+		root = &root_unstable_tree[0];
+	else
+		root = &root_unstable_tree[page_to_nid(page)];
+
+	new = &root->rb_node;
+
 	while (*new) {
 		struct rmap_item *tree_rmap_item;
 		struct page *tree_page;
@@ -1138,8 +1170,9 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 
 	rmap_item->address |= UNSTABLE_FLAG;
 	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+	rmap_item->root = root;
 	rb_link_node(&rmap_item->node, parent, new);
-	rb_insert_color(&rmap_item->node, &root_unstable_tree);
+	rb_insert_color(&rmap_item->node, root);
 
 	ksm_pages_unshared++;
 	return NULL;
@@ -1282,6 +1315,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	int i;
 
 	if (list_empty(&ksm_mm_head.mm_list))
 		return NULL;
@@ -1300,7 +1334,8 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 		 */
 		lru_add_drain_all();
 
-		root_unstable_tree = RB_ROOT;
+		for (i = 0; i < MAX_NUMNODES; i++)
+			root_unstable_tree[i] = RB_ROOT;
 
 		spin_lock(&ksm_mmlist_lock);
 		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
@@ -1768,15 +1803,19 @@ static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
 						 unsigned long end_pfn)
 {
 	struct rb_node *node;
+	int i;
 
-	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
-		struct stable_node *stable_node;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		for (node = rb_first(&root_stable_tree[i]); node;
+				node = rb_next(node)) {
+			struct stable_node *stable_node;
+
+			stable_node = rb_entry(node, struct stable_node, node);
+			if (stable_node->kpfn >= start_pfn &&
+			    stable_node->kpfn < end_pfn)
+				return stable_node;
+		}
 
-		stable_node = rb_entry(node, struct stable_node, node);
-		if (stable_node->kpfn >= start_pfn &&
-		    stable_node->kpfn < end_pfn)
-			return stable_node;
-	}
 	return NULL;
 }
 
@@ -1926,6 +1965,43 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 KSM_ATTR(run);
 
+#ifdef CONFIG_NUMA
+static ssize_t merge_nodes_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_merge_nodes);
+}
+
+static ssize_t merge_nodes_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long knob;
+
+	err = kstrtoul(buf, 10, &knob);
+	if (err)
+		return err;
+	if (knob > 1)
+		return -EINVAL;
+
+	if (ksm_run & KSM_RUN_MERGE)
+		return -EBUSY;
+
+	mutex_lock(&ksm_thread_mutex);
+	if (ksm_merge_nodes != knob) {
+		if (ksm_pages_shared > 0)
+			return -EBUSY;
+		else
+			ksm_merge_nodes = knob;
+	}
+	mutex_unlock(&ksm_thread_mutex);
+
+	return count;
+}
+KSM_ATTR(merge_nodes);
+#endif
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
@@ -1980,6 +2056,9 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+	&merge_nodes_attr.attr,
+#endif
 	NULL,
 };
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
       [not found] ` <jsk93p$32e$1@dough.gmane.org>
@ 2012-06-29 13:23   ` Petr Holasek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Holasek @ 2012-06-29 13:23 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-mm, linux-kernel

On Fri, 29 Jun 2012, Cong Wang wrote:
> On Fri, 29 Jun 2012 at 11:49 GMT, Petr Holasek <pholasek@redhat.com> wrote:
> > -		root_unstable_tree = RB_ROOT;
> > +		for (i = 0; i < MAX_NUMNODES; i++)
> > +			root_unstable_tree[i] = RB_ROOT;
> 
> 
> This is not aware of memory-hotplug, right?
> 

What makes you think so?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 11:49 [PATCH v2] KSM: numa awareness sysfs knob Petr Holasek
       [not found] ` <jsk93p$32e$1@dough.gmane.org>
@ 2012-06-29 16:05 ` Johannes Weiner
  2012-06-29 16:30   ` Petr Holasek
  2012-06-29 21:17 ` Andrew Morton
  2 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2012-06-29 16:05 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Andrew Morton, Andrea Arcangeli, Chris Wright,
	Izik Eidus, Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, Jun 29, 2012 at 01:49:52PM +0200, Petr Holasek wrote:
> Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).

Is it conceivable that admins may (in the future) want to merge only
across nodes that are below a given distance threshold?

I'm not asking to implement this, just whether the knob can be
introduced such that it's future-compatible.  Make it default to a
Very High Number and only allow setting it to 0 for now e.g.?  And
name it max_node_merge_distance (I'm bad at names)?

What do you think?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 16:05 ` Johannes Weiner
@ 2012-06-29 16:30   ` Petr Holasek
  2012-06-29 16:47     ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Petr Holasek @ 2012-06-29 16:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, Andrea Arcangeli, Chris Wright,
	Izik Eidus, Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, 29 Jun 2012, Johannes Weiner wrote:
> On Fri, Jun 29, 2012 at 01:49:52PM +0200, Petr Holasek wrote:
> > Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
> > which controls merging of pages across different NUMA nodes.
> > When it is set to zero, only pages from the same node are merged;
> > otherwise pages from all nodes can be merged together (default behavior).
> 
> Is it conceivable that admins may (in the future) want to merge only
> across nodes that are below a given distance threshold?
> 
> I'm not asking to implement this, just whether the knob can be
> introduced such that it's future-compatible.  Make it default to a
> Very High Number and only allow setting it to 0 for now e.g.?  And
> name it max_node_merge_distance (I'm bad at names)?
> 
> What do you think?

I started with exactly the same idea as you describe above in the first
RFC: https://lkml.org/lkml/2011/11/30/91
But this approach turned out to be more complicated than it looked
(see the last two emails in that thread), and the complexity of the
solution would rise a lot.

Regards,
Petr H

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 16:30   ` Petr Holasek
@ 2012-06-29 16:47     ` Johannes Weiner
  2012-06-29 22:30       ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2012-06-29 16:47 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Andrew Morton, Andrea Arcangeli, Chris Wright,
	Izik Eidus, Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, Jun 29, 2012 at 06:30:33PM +0200, Petr Holasek wrote:
> On Fri, 29 Jun 2012, Johannes Weiner wrote:
> > On Fri, Jun 29, 2012 at 01:49:52PM +0200, Petr Holasek wrote:
> > > Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
> > > which controls merging of pages across different NUMA nodes.
> > > When it is set to zero, only pages from the same node are merged;
> > > otherwise pages from all nodes can be merged together (default behavior).
> > 
> > Is it conceivable that admins may (in the future) want to merge only
> > across nodes that are below a given distance threshold?
> > 
> > I'm not asking to implement this, just whether the knob can be
> > introduced such that it's future-compatible.  Make it default to a
> > Very High Number and only allow setting it to 0 for now e.g.?  And
> > name it max_node_merge_distance (I'm bad at names)?
> > 
> > What do you think?
> 
> I started with exactly the same idea as you describe above in the first
> RFC: https://lkml.org/lkml/2011/11/30/91
> But this approach turned out to be more complicated than it looked
> (see the last two emails in that thread), and the complexity of the
> solution would rise a lot.

Oh, I should have checked the archives given that it's v2.  I expected
it to get complex but didn't put enough thought into it to see /that/
amount of complexity.  Sorry.

Carry on, then :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 11:49 [PATCH v2] KSM: numa awareness sysfs knob Petr Holasek
       [not found] ` <jsk93p$32e$1@dough.gmane.org>
  2012-06-29 16:05 ` Johannes Weiner
@ 2012-06-29 21:17 ` Andrew Morton
  2012-06-29 22:50   ` David Rientjes
  2012-06-30 12:29   ` Petr Holasek
  2 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2012-06-29 21:17 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Andrea Arcangeli, Chris Wright, Izik Eidus,
	Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, 29 Jun 2012 13:49:52 +0200
Petr Holasek <pholasek@redhat.com> wrote:

> Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
> 
> A typical use case is a large number of KVM guests on a NUMA machine,
> where CPUs on more distant nodes would see a significant increase
> in access latency to a merged KSM page. A sysfs knob was chosen
> for higher scalability.
> 
> Every NUMA node has its own stable and unstable trees, which makes
> searching and inserting faster. The merge_nodes value can be changed
> only when there are no KSM shared pages in the system.

It would be neat to have a knob which enables KSM for all anon
mappings.  ie: pretend that MADV_MERGEABLE is always set.  For testing
coverage purposes.

> I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> The population standard deviations of access times, as a percentage of the
> average, were as follows:
> 
> merge_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes	1.7%
> 
> merge_nodes=0
> 2 nodes	1%
> 4 nodes	0.32%
> 8 nodes	0.018%

ooh, numbers!  Thanks.

> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -58,6 +58,12 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>  
> +merge_nodes      - specifies if pages from different numa nodes can be merged.
> +                   When set to 0, ksm merges only pages which physically
> +                   resides in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page.
> +                   Default: 1

s/resides/reside/.

This doc should mention that /sys/kernel/mm/ksm/run should be zeroed to
alter merge_nodes.  Otherwise confusion will reign.

>
> ...
>
> +static ssize_t merge_nodes_store(struct kobject *kobj,
> +				   struct kobj_attribute *attr,
> +				   const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long knob;
> +
> +	err = kstrtoul(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	if (knob > 1)
> +		return -EINVAL;
> +
> +	if (ksm_run & KSM_RUN_MERGE)
> +		return -EBUSY;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	if (ksm_merge_nodes != knob) {
> +		if (ksm_pages_shared > 0)
> +			return -EBUSY;
> +		else
> +			ksm_merge_nodes = knob;
> +	}
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return count;
> +}

Seems a bit racy.  Shouldn't the test of ksm_run be inside the locked
region?

> +KSM_ATTR(merge_nodes);
> +#endif
>
> ...
>
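
The locking concern can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: struct ksm_state, state_lock() and
merge_nodes_store_model() are hypothetical stand-ins for the kernel's globals,
ksm_thread_mutex and merge_nodes_store(). It moves the ksm_run test inside
the locked region and uses a single exit path, so the lock is also dropped
on the -EBUSY return that the quoted hunk takes between mutex_lock() and
mutex_unlock().

```c
#include <errno.h>

#define KSM_RUN_MERGE	1

/* Userspace stand-in for the KSM globals touched by the store handler. */
struct ksm_state {
	int lock_depth;			/* models ksm_thread_mutex: 1 while held */
	unsigned int run;		/* models ksm_run */
	unsigned long pages_shared;	/* models ksm_pages_shared */
	unsigned int merge_nodes;	/* models ksm_merge_nodes */
};

static void state_lock(struct ksm_state *s)
{
	s->lock_depth++;
}

static void state_unlock(struct ksm_state *s)
{
	s->lock_depth--;
}

/*
 * Model of the store handler: returns count on success or a negative
 * errno value. Every path after state_lock() leaves through "out".
 */
long merge_nodes_store_model(struct ksm_state *s, unsigned long knob,
			     long count)
{
	long ret = count;

	if (knob > 1)
		return -EINVAL;

	state_lock(s);
	if (s->run & KSM_RUN_MERGE) {
		/* checked under the lock, so it cannot change underneath us */
		ret = -EBUSY;
		goto out;
	}
	if (s->merge_nodes != knob) {
		if (s->pages_shared > 0)
			ret = -EBUSY;
		else
			s->merge_nodes = (unsigned int)knob;
	}
out:
	state_unlock(s);
	return ret;
}
```

Because lock_depth is just a counter, a test can check that it is back to
zero after every call, including the calls that fail.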

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 16:47     ` Johannes Weiner
@ 2012-06-29 22:30       ` David Rientjes
  2012-06-30 11:40         ` Petr Holasek
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2012-06-29 22:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Petr Holasek, Hugh Dickins, Andrew Morton, Andrea Arcangeli,
	Chris Wright, Izik Eidus, Rik van Riel, linux-kernel, linux-mm,
	Anton Arapov

On Fri, 29 Jun 2012, Johannes Weiner wrote:

> > I started with exactly the same idea as you describe above in the first
> > RFC: https://lkml.org/lkml/2011/11/30/91
> > But this approach turned out to be more complicated than it looked
> > (see the last two emails in that thread), and the complexity of the
> > solution would rise a lot.
> 
> Oh, I should have checked the archives given that it's v2.  I expected
> it to get complex but didn't put enough thought into it to see /that/
> amount of complexity.  Sorry.
> 
> Carry on, then :-)
> 

I don't think it's an unfair amount of complexity to ask for, and I don't 
see the problem with ksm merging two pages that have a distance under the 
configured threshold and leaving the third page unmerged; by configuring 
the threshold (which should be a char, not an int) the admin has specified 
the locality that is necessary for optimal performance so has knowingly 
restricted ksm in that way.

I'd rename it to ksm_merge_distance, which is more similar to 
reclaim_distance, and return to the first version of this patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 21:17 ` Andrew Morton
@ 2012-06-29 22:50   ` David Rientjes
  2012-06-30  9:43     ` Izik Eidus
  2012-06-30 12:29   ` Petr Holasek
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2012-06-29 22:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Petr Holasek, Hugh Dickins, Andrea Arcangeli, Chris Wright,
	Izik Eidus, Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, 29 Jun 2012, Andrew Morton wrote:

> > I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
> > measured the speed of memory access inside KVM guests with memory pinned
> > to one of the nodes, using this benchmark:
> > 
> > http://pholasek.fedorapeople.org/alloc_pg.c
> > 
> > The population standard deviations of access times, as a percentage of the
> > average, were as follows:
> > 
> > merge_nodes=1
> > 2 nodes 1.4%
> > 4 nodes 1.6%
> > 8 nodes	1.7%
> > 
> > merge_nodes=0
> > 2 nodes	1%
> > 4 nodes	0.32%
> > 8 nodes	0.018%
> 
> ooh, numbers!  Thanks.
> 

Ok, the standard deviation increases when merging pages from nodes with 
remote distance, that makes sense.  But if that's true, then you would 
restrict either the entire application to local memory with mempolicies or 
cpusets, or you would use mbind() to restrict this memory to that set of 
nodes already so that accesses, even with ksm merging, would have 
affinity.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 22:50   ` David Rientjes
@ 2012-06-30  9:43     ` Izik Eidus
  0 siblings, 0 replies; 13+ messages in thread
From: Izik Eidus @ 2012-06-30  9:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Petr Holasek, Hugh Dickins, Andrea Arcangeli,
	Chris Wright, Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On 06/30/2012 01:50 AM, David Rientjes wrote:
> On Fri, 29 Jun 2012, Andrew Morton wrote:
>
>>> I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
>>> measured the speed of memory access inside KVM guests with memory pinned
>>> to one of the nodes, using this benchmark:
>>>
>>> http://pholasek.fedorapeople.org/alloc_pg.c
>>>
>>> The population standard deviations of access times, as a percentage of the
>>> average, were as follows:
>>>
>>> merge_nodes=1
>>> 2 nodes 1.4%
>>> 4 nodes 1.6%
>>> 8 nodes	1.7%
>>>
>>> merge_nodes=0
>>> 2 nodes	1%
>>> 4 nodes	0.32%
>>> 8 nodes	0.018%
>> ooh, numbers!  Thanks.
>>
> Ok, the standard deviation increases when merging pages from nodes with
> remote distance, that makes sense.  But if that's true, then you would
> restrict either the entire application to local memory with mempolicies or
> cpusets, or you would use mbind() to restrict this memory to that set of
> nodes already so that accesses, even with ksm merging, would have
> affinity.

You are right for the case where you write your own custom application,
but I think the KVM guest case is a little bit more problematic when
the guest memory must be split across several nodes.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 22:30       ` David Rientjes
@ 2012-06-30 11:40         ` Petr Holasek
  2012-07-02 21:26           ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Petr Holasek @ 2012-06-30 11:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, Andrea Arcangeli,
	Chris Wright, Izik Eidus, Rik van Riel, linux-kernel, linux-mm,
	Anton Arapov

On Fri, 29 Jun 2012, David Rientjes wrote:
> On Fri, 29 Jun 2012, Johannes Weiner wrote:
> 
> > > I started with exactly the same idea as you describe above in the first
> > > RFC: https://lkml.org/lkml/2011/11/30/91
> > > But this approach turned out to be more complicated than it looked
> > > (see the last two emails in that thread), and the complexity of the
> > > solution would rise a lot.
> > 
> > Oh, I should have checked the archives given that it's v2.  I expected
> > it to get complex but didn't put enough thought into it to see /that/
> > amount of complexity.  Sorry.
> > 
> > Carry on, then :-)
> > 
> 
> I don't think it's an unfair amount of complexity to ask for, and I don't 
> see the problem with ksm merging two pages that have a distance under the 
> configured threshold and leaving the third page unmerged; by configuring 
> the threshold (which should be a char, not an int) the admin has specified 
> the locality that is necessary for optimal performance so has knowingly 
> restricted ksm in that way.
> 
> I'd rename it to ksm_merge_distance, which is more similar to 
> reclaim_distance, and return to the first version of this patch.

The problem with the first patch/RFC was that the merging algorithm was unstable
and could merge pages at a distance higher than the configured one (described by
Nai Xia in the RFC thread [1]). Sure, this instability could be solved, but for
ksm pages shared by many other pages on different nodes we would still have to
recalculate which page is "in the middle" and, if that changes, migrate it
between nodes every time ksmd reaches a new shareable page or some
sharing page is removed.

But please correct me if I have misunderstood you.

[1] https://lkml.org/lkml/2011/12/1/167

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-29 21:17 ` Andrew Morton
  2012-06-29 22:50   ` David Rientjes
@ 2012-06-30 12:29   ` Petr Holasek
  1 sibling, 0 replies; 13+ messages in thread
From: Petr Holasek @ 2012-06-30 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Andrea Arcangeli, Chris Wright, Izik Eidus,
	Rik van Riel, linux-kernel, linux-mm, Anton Arapov

On Fri, 29 Jun 2012, Andrew Morton wrote:

> On Fri, 29 Jun 2012 13:49:52 +0200
> Petr Holasek <pholasek@redhat.com> wrote:
> 
> > Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_nodes,
> > which controls merging of pages across different NUMA nodes.
> > When it is set to zero, only pages from the same node are merged;
> > otherwise pages from all nodes can be merged together (default behavior).
> > 
> > A typical use case is a large number of KVM guests on a NUMA machine,
> > where CPUs on more distant nodes would see a significant increase
> > in access latency to a merged KSM page. A sysfs knob was chosen
> > for higher scalability.
> > 
> > Every NUMA node has its own stable and unstable trees, which makes
> > searching and inserting faster. The merge_nodes value can be changed
> > only when there are no KSM shared pages in the system.
> 
> It would be neat to have a knob which enables KSM for all anon
> mappings.  ie: pretend that MADV_MERGEABLE is always set.  For testing
> coverage purposes.

Interesting idea. I'll try to add it in the next release, if the
/sys/kernel/mm/ksm directory is the right place for such a debug knob.

> > --- a/Documentation/vm/ksm.txt
> > +++ b/Documentation/vm/ksm.txt
> > @@ -58,6 +58,12 @@ sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
> >                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> >                     Default: 20 (chosen for demonstration purposes)
> >  
> > +merge_nodes      - specifies if pages from different numa nodes can be merged.
> > +                   When set to 0, ksm merges only pages which physically
> > +                   resides in the memory area of same NUMA node. It brings
> > +                   lower latency to access to shared page.
> > +                   Default: 1
> 
> s/resides/reside/.
> 
> This doc should mention that /sys/kernel/mm/ksm/run should be zeroed to
> alter merge_nodes.  Otherwise confusion will reign.
> 

Oh, forgot to mention it. I'll fix it.

> >
> > ...
> >
> > +static ssize_t merge_nodes_store(struct kobject *kobj,
> > +				   struct kobj_attribute *attr,
> > +				   const char *buf, size_t count)
> > +{
> > +	int err;
> > +	unsigned long knob;
> > +
> > +	err = kstrtoul(buf, 10, &knob);
> > +	if (err)
> > +		return err;
> > +	if (knob > 1)
> > +		return -EINVAL;
> > +
> > +	if (ksm_run & KSM_RUN_MERGE)
> > +		return -EBUSY;
> > +
> > +	mutex_lock(&ksm_thread_mutex);
> > +	if (ksm_merge_nodes != knob) {
> > +		if (ksm_pages_shared > 0)
> > +			return -EBUSY;
> > +		else
> > +			ksm_merge_nodes = knob;
> > +	}
> > +	mutex_unlock(&ksm_thread_mutex);
> > +
> > +	return count;
> > +}
> 
> Seems a bit racy.  Shouldn't the test of ksm_run be inside the locked
> region?
> 

Agreed.

Thanks for your review!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-06-30 11:40         ` Petr Holasek
@ 2012-07-02 21:26           ` David Rientjes
  2012-07-03 17:02             ` Petr Holasek
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2012-07-02 21:26 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, Andrea Arcangeli,
	Chris Wright, Izik Eidus, Rik van Riel, linux-kernel, linux-mm,
	Anton Arapov

On Sat, 30 Jun 2012, Petr Holasek wrote:

> The problem with the first patch/RFC was that the merging algorithm was unstable
> and could merge pages at a distance higher than the configured one (described by
> Nai Xia in the RFC thread [1]). Sure, this instability could be solved, but for
> ksm pages shared by many other pages on different nodes we would still have to
> recalculate which page is "in the middle" and, if that changes, migrate it
> between nodes every time ksmd reaches a new shareable page or some
> sharing page is removed.
> 

Or you could simply refuse to ever merge any page that is identical to a 
page on a node with a distance greater than the threshold, i.e. never 
merge pages even under the threshold if a page exists on a node higher 
than the threshold.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] KSM: numa awareness sysfs knob
  2012-07-02 21:26           ` David Rientjes
@ 2012-07-03 17:02             ` Petr Holasek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Holasek @ 2012-07-03 17:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, Andrea Arcangeli,
	Chris Wright, Izik Eidus, Rik van Riel, linux-kernel, linux-mm,
	Anton Arapov

On Mon, 2 Jul 2012, David Rientjes wrote:

> > The problem with the first patch/RFC was that the merging algorithm was unstable
> > and could merge pages at a distance higher than the configured one (described by
> > Nai Xia in the RFC thread [1]). Sure, this instability could be solved, but for
> > ksm pages shared by many other pages on different nodes we would still have to
> > recalculate which page is "in the middle" and, if that changes, migrate it
> > between nodes every time ksmd reaches a new shareable page or some
> > sharing page is removed.
> > 
> 
> Or you could simply refuse to ever merge any page that is identical to a 
> page on a node with a distance greater than the threshold, i.e. never 
> merge pages even under the threshold if a page exists on a node higher 
> than the threshold.

The position of the shared page would then depend heavily on the order of
scanning. We could end up with the shared page in the ideal middle, or just as
well on the edge, where it would block merging of other pages even under the
threshold.
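
The order dependence can be shown with a deliberately simplified model. The
sketch below assumes three nodes in a line with a made-up distance table, and
a greedy rule where the first scanned page becomes the shared page and never
migrates; node_distance_model and merged_pages() are hypothetical names, and
real ksmd scanning is considerably more involved.

```c
#include <stddef.h>

#define NR_NODES 3

/* Hypothetical distance table for three nodes in a line: 0 - 1 - 2 */
static const int node_distance_model[NR_NODES][NR_NODES] = {
	{ 10, 20, 30 },
	{ 20, 10, 20 },
	{ 30, 20, 10 },
};

/*
 * scan_order[i] is the node holding the i-th scanned copy of one
 * identical page. The first copy becomes the shared page and stays on
 * its node; a later copy is merged only if its node is within
 * 'threshold' of that node. Returns how many copies end up sharing.
 */
size_t merged_pages(const int *scan_order, size_t n, int threshold)
{
	size_t i, merged;
	int home;

	if (n == 0)
		return 0;

	home = scan_order[0];
	merged = 1;
	for (i = 1; i < n; i++)
		if (node_distance_model[home][scan_order[i]] <= threshold)
			merged++;
	return merged;
}
```

With a threshold of 20, scanning in the order 1, 0, 2 puts the shared page on
the middle node and all three copies merge, while the order 0, 1, 2 puts it on
the edge and the copy on node 2 stays unmerged.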


^ permalink raw reply	[flat|nested] 13+ messages in thread

