* [PATCH 0/12] ksm: stats, oom, doc, misc
From: Hugh Dickins @ 2009-08-03 12:08 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hi Izik,

Here's a set of twelve patches, which I think complete what I want
to do with KSM for current mmotm and 2.6.32: it's as I sent you in
a rollup last week, but with 1/12 and 10/12 added.  Patches apply
to 2.6.31-rc5 plus our previous ten KSM patches, or to mmotm which
already includes those.

[PATCH  1/12] ksm: rename kernel_pages_allocated
[PATCH  2/12] ksm: move pages_sharing updates
[PATCH  3/12] ksm: pages_unshared and pages_volatile
[PATCH  4/12] ksm: break cow once unshared
[PATCH  5/12] ksm: keep quiet while list empty
[PATCH  6/12] ksm: five little cleanups
[PATCH  7/12] ksm: fix endless loop on oom
[PATCH  8/12] ksm: distribute remove_mm_from_lists
[PATCH  9/12] ksm: fix oom deadlock
[PATCH 10/12] ksm: sysfs and defaults
[PATCH 11/12] ksm: add some documentation
[PATCH 12/12] ksm: remove VM_MERGEABLE_FLAGS

 Documentation/vm/00-INDEX |    2 
 Documentation/vm/ksm.txt  |   89 +++++
 include/linux/ksm.h       |   31 +
 kernel/fork.c             |    1 
 mm/Kconfig                |    1 
 mm/ksm.c                  |  574 ++++++++++++++++++++++--------------
 mm/memory.c               |    5 
 mm/mmap.c                 |   15 
 8 files changed, 498 insertions(+), 220 deletions(-)

If you and others are happy with these, please send them on to Andrew
(or else just point him to them); if not, then let's fix them first.

Thanks,
Hugh


* [PATCH 1/12] ksm: rename kernel_pages_allocated
From: Hugh Dickins @ 2009-08-03 12:10 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

We're not implementing swapping of KSM pages in this first release;
but when that follows, "kernel_pages_allocated" will be a very poor
name for the sysfs file showing the number of nodes in the stable tree:
rename that to "pages_shared" throughout.

But we already have a "pages_shared", counting the page slots that
share those pages: so first rename that one to... "pages_sharing".

What will become of "max_kernel_pages" when the shared pages can
be swapped?  I guess it will just be removed, so keep that name.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   57 ++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 29 deletions(-)

--- ksm0/mm/ksm.c	2009-08-01 05:02:09.000000000 +0100
+++ ksm1/mm/ksm.c	2009-08-02 13:49:36.000000000 +0100
@@ -150,10 +150,10 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *mm_slot_cache;
 
 /* The number of nodes in the stable tree */
-static unsigned long ksm_kernel_pages_allocated;
+static unsigned long ksm_pages_shared;
 
 /* The number of page slots sharing those nodes */
-static unsigned long ksm_pages_shared;
+static unsigned long ksm_pages_sharing;
 
 /* Limit on the number of unswappable pages used */
 static unsigned long ksm_max_kernel_pages;
@@ -384,7 +384,7 @@ static void remove_rmap_item_from_tree(s
 				next_item->address |= NODE_FLAG;
 			} else {
 				rb_erase(&rmap_item->node, &root_stable_tree);
-				ksm_kernel_pages_allocated--;
+				ksm_pages_shared--;
 			}
 		} else {
 			struct rmap_item *prev_item = rmap_item->prev;
@@ -398,7 +398,7 @@ static void remove_rmap_item_from_tree(s
 		}
 
 		rmap_item->next = NULL;
-		ksm_pages_shared--;
+		ksm_pages_sharing--;
 
 	} else if (rmap_item->address & NODE_FLAG) {
 		unsigned char age;
@@ -748,7 +748,7 @@ static int try_to_merge_two_pages(struct
 	 * is the number of kernel pages that we hold.
 	 */
 	if (ksm_max_kernel_pages &&
-	    ksm_max_kernel_pages <= ksm_kernel_pages_allocated)
+	    ksm_max_kernel_pages <= ksm_pages_shared)
 		return err;
 
 	kpage = alloc_page(GFP_HIGHUSER);
@@ -787,7 +787,7 @@ static int try_to_merge_two_pages(struct
 		if (err)
 			break_cow(mm1, addr1);
 		else
-			ksm_pages_shared += 2;
+			ksm_pages_sharing += 2;
 	}
 
 	put_page(kpage);
@@ -817,7 +817,7 @@ static int try_to_merge_with_ksm_page(st
 	up_read(&mm1->mmap_sem);
 
 	if (!err)
-		ksm_pages_shared++;
+		ksm_pages_sharing++;
 
 	return err;
 }
@@ -935,7 +935,7 @@ static struct rmap_item *stable_tree_ins
 		}
 	}
 
-	ksm_kernel_pages_allocated++;
+	ksm_pages_shared++;
 
 	rmap_item->address |= NODE_FLAG | STABLE_FLAG;
 	rmap_item->next = NULL;
@@ -1051,7 +1051,7 @@ static void cmp_and_merge_page(struct pa
 	tree_rmap_item = stable_tree_search(page, page2, rmap_item);
 	if (tree_rmap_item) {
 		if (page == page2[0]) {			/* forked */
-			ksm_pages_shared++;
+			ksm_pages_sharing++;
 			err = 0;
 		} else
 			err = try_to_merge_with_ksm_page(rmap_item->mm,
@@ -1114,7 +1114,7 @@ static void cmp_and_merge_page(struct pa
 				break_cow(tree_rmap_item->mm,
 						tree_rmap_item->address);
 				break_cow(rmap_item->mm, rmap_item->address);
-				ksm_pages_shared -= 2;
+				ksm_pages_sharing -= 2;
 			}
 		}
 
@@ -1430,7 +1430,7 @@ static ssize_t run_store(struct kobject
 	/*
 	 * KSM_RUN_MERGE sets ksmd running, and 0 stops it running.
 	 * KSM_RUN_UNMERGE stops it running and unmerges all rmap_items,
-	 * breaking COW to free the kernel_pages_allocated (but leaves
+	 * breaking COW to free the unswappable pages_shared (but leaves
 	 * mm_slots on the list for when ksmd may be set running again).
 	 */
 
@@ -1449,22 +1449,6 @@ static ssize_t run_store(struct kobject
 }
 KSM_ATTR(run);
 
-static ssize_t pages_shared_show(struct kobject *kobj,
-				 struct kobj_attribute *attr, char *buf)
-{
-	return sprintf(buf, "%lu\n",
-			ksm_pages_shared - ksm_kernel_pages_allocated);
-}
-KSM_ATTR_RO(pages_shared);
-
-static ssize_t kernel_pages_allocated_show(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   char *buf)
-{
-	return sprintf(buf, "%lu\n", ksm_kernel_pages_allocated);
-}
-KSM_ATTR_RO(kernel_pages_allocated);
-
 static ssize_t max_kernel_pages_store(struct kobject *kobj,
 				      struct kobj_attribute *attr,
 				      const char *buf, size_t count)
@@ -1488,13 +1472,28 @@ static ssize_t max_kernel_pages_show(str
 }
 KSM_ATTR(max_kernel_pages);
 
+static ssize_t pages_shared_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", ksm_pages_shared);
+}
+KSM_ATTR_RO(pages_shared);
+
+static ssize_t pages_sharing_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n",
+			ksm_pages_sharing - ksm_pages_shared);
+}
+KSM_ATTR_RO(pages_sharing);
+
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
 	&run_attr.attr,
-	&pages_shared_attr.attr,
-	&kernel_pages_allocated_attr.attr,
 	&max_kernel_pages_attr.attr,
+	&pages_shared_attr.attr,
+	&pages_sharing_attr.attr,
 	NULL,
 };
 

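Not part of the patch, but for illustration: a minimal userspace sketch of
how the renamed counters could be read once this lands.  It assumes the KSM
attribute group ends up as /sys/kernel/mm/ksm/ (registered under
/sys/kernel/mm by ksm_init), and the helper name read_ksm_stat is made up
for this example.

#include <stdio.h>

/* Read one KSM counter from the assumed /sys/kernel/mm/ksm/ directory;
 * returns -1 on any failure.
 */
static long read_ksm_stat(const char *name)
{
	char path[128];
	long value = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

int main(void)
{
	printf("pages_shared:  %ld\n", read_ksm_stat("pages_shared"));
	printf("pages_sharing: %ld\n", read_ksm_stat("pages_sharing"));
	return 0;
}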

* [PATCH 2/12] ksm: move pages_sharing updates
From: Hugh Dickins @ 2009-08-03 12:11 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

The pages_shared count is incremented and decremented when adding a node
to and removing a node from the stable tree: easy to understand.  But the
pages_sharing count was hard to follow, being adjusted in various places:
instead, increment and decrement it only where rmap_items are attached to
and detached from a node of the stable tree.

And the pages_sharing variable used to include the pages_shared, which
were then subtracted when shown in the pages_sharing sysfs file: now keep
it as an exclusive count of leaves hanging off the stable tree nodes,
throughout.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- ksm1/mm/ksm.c	2009-08-02 13:49:36.000000000 +0100
+++ ksm2/mm/ksm.c	2009-08-02 13:49:43.000000000 +0100
@@ -152,7 +152,7 @@ static struct kmem_cache *mm_slot_cache;
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
-/* The number of page slots sharing those nodes */
+/* The number of page slots additionally sharing those nodes */
 static unsigned long ksm_pages_sharing;
 
 /* Limit on the number of unswappable pages used */
@@ -382,6 +382,7 @@ static void remove_rmap_item_from_tree(s
 						&next_item->node,
 						&root_stable_tree);
 				next_item->address |= NODE_FLAG;
+				ksm_pages_sharing--;
 			} else {
 				rb_erase(&rmap_item->node, &root_stable_tree);
 				ksm_pages_shared--;
@@ -395,10 +396,10 @@ static void remove_rmap_item_from_tree(s
 				BUG_ON(next_item->prev != rmap_item);
 				next_item->prev = rmap_item->prev;
 			}
+			ksm_pages_sharing--;
 		}
 
 		rmap_item->next = NULL;
-		ksm_pages_sharing--;
 
 	} else if (rmap_item->address & NODE_FLAG) {
 		unsigned char age;
@@ -786,8 +787,6 @@ static int try_to_merge_two_pages(struct
 		 */
 		if (err)
 			break_cow(mm1, addr1);
-		else
-			ksm_pages_sharing += 2;
 	}
 
 	put_page(kpage);
@@ -816,9 +815,6 @@ static int try_to_merge_with_ksm_page(st
 	err = try_to_merge_one_page(vma, page1, kpage);
 	up_read(&mm1->mmap_sem);
 
-	if (!err)
-		ksm_pages_sharing++;
-
 	return err;
 }
 
@@ -935,13 +931,12 @@ static struct rmap_item *stable_tree_ins
 		}
 	}
 
-	ksm_pages_shared++;
-
 	rmap_item->address |= NODE_FLAG | STABLE_FLAG;
 	rmap_item->next = NULL;
 	rb_link_node(&rmap_item->node, parent, new);
 	rb_insert_color(&rmap_item->node, &root_stable_tree);
 
+	ksm_pages_shared++;
 	return rmap_item;
 }
 
@@ -1026,6 +1021,8 @@ static void stable_tree_append(struct rm
 
 	tree_rmap_item->next = rmap_item;
 	rmap_item->address |= STABLE_FLAG;
+
+	ksm_pages_sharing++;
 }
 
 /*
@@ -1050,10 +1047,9 @@ static void cmp_and_merge_page(struct pa
 	/* We first start with searching the page inside the stable tree */
 	tree_rmap_item = stable_tree_search(page, page2, rmap_item);
 	if (tree_rmap_item) {
-		if (page == page2[0]) {			/* forked */
-			ksm_pages_sharing++;
+		if (page == page2[0])			/* forked */
 			err = 0;
-		} else
+		else
 			err = try_to_merge_with_ksm_page(rmap_item->mm,
 							 rmap_item->address,
 							 page, page2[0]);
@@ -1114,7 +1110,6 @@ static void cmp_and_merge_page(struct pa
 				break_cow(tree_rmap_item->mm,
 						tree_rmap_item->address);
 				break_cow(rmap_item->mm, rmap_item->address);
-				ksm_pages_sharing -= 2;
 			}
 		}
 
@@ -1482,8 +1477,7 @@ KSM_ATTR_RO(pages_shared);
 static ssize_t pages_sharing_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%lu\n",
-			ksm_pages_sharing - ksm_pages_shared);
+	return sprintf(buf, "%lu\n", ksm_pages_sharing);
 }
 KSM_ATTR_RO(pages_sharing);
 

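A worked example of the exclusive counting (hypothetical numbers, not taken
from the patch): if four anonymous pages with identical contents are merged
into a single KSM page, that page is one stable tree node, so pages_shared
reports 1 and pages_sharing reports the 4 - 1 = 3 additional page slots
mapping it.  The displayed values are unchanged; what changes is that the
internal ksm_pages_sharing variable now holds that exclusive count directly,
rather than including the shared pages and subtracting them at read time.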

* [PATCH 3/12] ksm: pages_unshared and pages_volatile
From: Hugh Dickins @ 2009-08-03 12:11 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

The pages_shared and pages_sharing counts give a good picture of how
successful KSM is at sharing; but they give no clue to how much wasted
work it's doing to get there.  Add pages_unshared (the count of unique
pages waiting in the unstable tree, hoping to find a mate) and
pages_volatile.

pages_volatile is harder to define.  It includes those pages changing
too fast to get into the unstable tree, but also whatever other edge
conditions prevent a page from getting into the trees: a high value may
deserve investigation.  Don't try to calculate it from the various
conditions: it's simply the total of rmap_items less those accounted for.

Also show full_scans: the number of completed scans of everything
registered in the mm list.

The locking for all these counts is simply ksm_thread_mutex.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

--- ksm2/mm/ksm.c	2009-08-02 13:49:43.000000000 +0100
+++ ksm3/mm/ksm.c	2009-08-02 13:49:51.000000000 +0100
@@ -155,6 +155,12 @@ static unsigned long ksm_pages_shared;
 /* The number of page slots additionally sharing those nodes */
 static unsigned long ksm_pages_sharing;
 
+/* The number of nodes in the unstable tree */
+static unsigned long ksm_pages_unshared;
+
+/* The number of rmap_items in use: to calculate pages_volatile */
+static unsigned long ksm_rmap_items;
+
 /* Limit on the number of unswappable pages used */
 static unsigned long ksm_max_kernel_pages;
 
@@ -204,11 +210,17 @@ static void __init ksm_slab_free(void)
 
 static inline struct rmap_item *alloc_rmap_item(void)
 {
-	return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
+	struct rmap_item *rmap_item;
+
+	rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
+	if (rmap_item)
+		ksm_rmap_items++;
+	return rmap_item;
 }
 
 static inline void free_rmap_item(struct rmap_item *rmap_item)
 {
+	ksm_rmap_items--;
 	rmap_item->mm = NULL;	/* debug safety */
 	kmem_cache_free(rmap_item_cache, rmap_item);
 }
@@ -419,6 +431,7 @@ static void remove_rmap_item_from_tree(s
 		BUG_ON(age > 2);
 		if (!age)
 			rb_erase(&rmap_item->node, &root_unstable_tree);
+		ksm_pages_unshared--;
 	}
 
 	rmap_item->address &= PAGE_MASK;
@@ -1002,6 +1015,7 @@ static struct rmap_item *unstable_tree_s
 	rb_link_node(&rmap_item->node, parent, new);
 	rb_insert_color(&rmap_item->node, &root_unstable_tree);
 
+	ksm_pages_unshared++;
 	return NULL;
 }
 
@@ -1098,6 +1112,8 @@ static void cmp_and_merge_page(struct pa
 		if (!err) {
 			rb_erase(&tree_rmap_item->node, &root_unstable_tree);
 			tree_rmap_item->address &= ~NODE_FLAG;
+			ksm_pages_unshared--;
+
 			/*
 			 * If we fail to insert the page into the stable tree,
 			 * we will have 2 virtual addresses that are pointing
@@ -1481,6 +1497,37 @@ static ssize_t pages_sharing_show(struct
 }
 KSM_ATTR_RO(pages_sharing);
 
+static ssize_t pages_unshared_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", ksm_pages_unshared);
+}
+KSM_ATTR_RO(pages_unshared);
+
+static ssize_t pages_volatile_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	long ksm_pages_volatile;
+
+	ksm_pages_volatile = ksm_rmap_items - ksm_pages_shared
+				- ksm_pages_sharing - ksm_pages_unshared;
+	/*
+	 * It was not worth any locking to calculate that statistic,
+	 * but it might therefore sometimes be negative: conceal that.
+	 */
+	if (ksm_pages_volatile < 0)
+		ksm_pages_volatile = 0;
+	return sprintf(buf, "%ld\n", ksm_pages_volatile);
+}
+KSM_ATTR_RO(pages_volatile);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+			       struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", ksm_scan.seqnr);
+}
+KSM_ATTR_RO(full_scans);
+
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
@@ -1488,6 +1535,9 @@ static struct attribute *ksm_attrs[] = {
 	&max_kernel_pages_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
+	&pages_unshared_attr.attr,
+	&pages_volatile_attr.attr,
+	&full_scans_attr.attr,
 	NULL,
 };
 

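A worked example of the remainder calculation, as a hypothetical standalone
program rather than anything in the patch (the numbers are invented):
pages_volatile is whatever rmap_items are left once the shared, sharing and
unshared pages have been accounted for, clamped at zero because the unlocked
snapshot can momentarily go negative.

#include <stdio.h>

int main(void)
{
	/* Hypothetical snapshot of the KSM counters. */
	long rmap_items     = 10;	/* rmap_items currently in use */
	long pages_shared   = 2;	/* stable tree nodes */
	long pages_sharing  = 5;	/* extra slots mapping those nodes */
	long pages_unshared = 1;	/* unstable tree nodes */

	long pages_volatile = rmap_items - pages_shared
				- pages_sharing - pages_unshared;
	if (pages_volatile < 0)		/* racy snapshot: conceal negatives */
		pages_volatile = 0;

	printf("pages_volatile = %ld\n", pages_volatile);	/* prints 2 */
	return 0;
}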

* [PATCH 4/12] ksm: break cow once unshared
From: Hugh Dickins @ 2009-08-03 12:12 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

We kept agreeing not to bother about the unswappable shared KSM pages
which later become unshared by others: observation suggests they're not
a significant proportion.  But they are disadvantageous, and it is easier
to break COW to replace them with swappable pages than to offer statistics
to show that they don't matter; then we can stop worrying about them.

Since this is done in ksm_do_scan, such pages don't go through
cmp_and_merge_page on this pass: give them a good chance of getting into
the unstable tree on the next pass, or back into the stable tree, by
computing their checksum now.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- ksm3/mm/ksm.c	2009-08-02 13:49:51.000000000 +0100
+++ ksm4/mm/ksm.c	2009-08-02 13:49:59.000000000 +0100
@@ -1275,6 +1275,14 @@ static void ksm_do_scan(unsigned int sca
 			return;
 		if (!PageKsm(page) || !in_stable_tree(rmap_item))
 			cmp_and_merge_page(page, rmap_item);
+		else if (page_mapcount(page) == 1) {
+			/*
+			 * Replace now-unshared ksm page by ordinary page.
+			 */
+			break_cow(rmap_item->mm, rmap_item->address);
+			remove_rmap_item_from_tree(rmap_item);
+			rmap_item->oldchecksum = calc_checksum(page);
+		}
 		put_page(page);
 	}
 }


* [PATCH 5/12] ksm: keep quiet while list empty
From: Hugh Dickins @ 2009-08-03 12:14 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

ksm_scan_thread already sleeps in wait_event_interruptible until setting
ksm_run activates it; but if there's nothing on its list to look at, i.e.
nobody has yet called madvise(MADV_MERGEABLE), it's a shame to be clocking
up system time and full_scans: add ksmd_should_run to check that too.

And move the mutex_lock out around that check: the new counts showed that
when ksm_run was stopped, a little work often still got done afterwards,
because ksm_run had been read before taking the mutex.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

--- ksm4/mm/ksm.c	2009-08-02 13:49:59.000000000 +0100
+++ ksm5/mm/ksm.c	2009-08-02 13:50:07.000000000 +0100
@@ -1287,21 +1287,27 @@ static void ksm_do_scan(unsigned int sca
 	}
 }
 
+static int ksmd_should_run(void)
+{
+	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+}
+
 static int ksm_scan_thread(void *nothing)
 {
 	set_user_nice(current, 5);
 
 	while (!kthread_should_stop()) {
-		if (ksm_run & KSM_RUN_MERGE) {
-			mutex_lock(&ksm_thread_mutex);
+		mutex_lock(&ksm_thread_mutex);
+		if (ksmd_should_run())
 			ksm_do_scan(ksm_thread_pages_to_scan);
-			mutex_unlock(&ksm_thread_mutex);
+		mutex_unlock(&ksm_thread_mutex);
+
+		if (ksmd_should_run()) {
 			schedule_timeout_interruptible(
 				msecs_to_jiffies(ksm_thread_sleep_millisecs));
 		} else {
 			wait_event_interruptible(ksm_thread_wait,
-					(ksm_run & KSM_RUN_MERGE) ||
-					kthread_should_stop());
+				ksmd_should_run() || kthread_should_stop());
 		}
 	}
 	return 0;
@@ -1346,10 +1352,16 @@ int ksm_madvise(struct vm_area_struct *v
 
 int __ksm_enter(struct mm_struct *mm)
 {
-	struct mm_slot *mm_slot = alloc_mm_slot();
+	struct mm_slot *mm_slot;
+	int needs_wakeup;
+
+	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
+	/* Check ksm_run too?  Would need tighter locking */
+	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
+
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1361,6 +1373,10 @@ int __ksm_enter(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+
+	if (needs_wakeup)
+		wake_up_interruptible(&ksm_thread_wait);
+
 	return 0;
 }
 

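For context, an illustrative userspace sketch (not part of the patch): ksmd
only has work to do once some task has registered an area with
madvise(MADV_MERGEABLE), which reaches __ksm_enter and, with this change,
wakes the otherwise-sleeping thread.  The fallback MADV_MERGEABLE definition
and the 4K page-size assumption below are illustrative, not taken from this
series.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12	/* assumed value if headers lack it */
#endif

int main(void)
{
	size_t len = 16 * 4096;		/* assumes 4K pages */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);	/* identical contents: good merge candidates */

	/* Registering the first mergeable area is what now wakes ksmd. */
	if (madvise(buf, len, MADV_MERGEABLE) != 0)
		perror("madvise(MADV_MERGEABLE)");
	return 0;
}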

* [PATCH 6/12] ksm: five little cleanups
From: Hugh Dickins @ 2009-08-03 12:15 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

1. We no longer use the __break_cow entry point: merge it into break_cow.
2. remove_all_slot_rmap_items is just a special case of
   remove_trailing_rmap_items: use the latter instead.
3. Extend the comment on unmerge_ksm_pages and rmap_items.
4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
   instead of duplicating its code; so swap them around.
5. The comment on cmp_and_merge_page described last year's code:
   update it.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |  112 ++++++++++++++++++++---------------------------------
 1 file changed, 44 insertions(+), 68 deletions(-)

--- ksm5/mm/ksm.c	2009-08-02 13:50:07.000000000 +0100
+++ ksm6/mm/ksm.c	2009-08-02 13:50:15.000000000 +0100
@@ -315,22 +315,18 @@ static void break_ksm(struct vm_area_str
 	/* Which leaves us looping there if VM_FAULT_OOM: hmmm... */
 }
 
-static void __break_cow(struct mm_struct *mm, unsigned long addr)
+static void break_cow(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma;
 
+	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
 	if (!vma || vma->vm_start > addr)
-		return;
+		goto out;
 	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
-		return;
+		goto out;
 	break_ksm(vma, addr);
-}
-
-static void break_cow(struct mm_struct *mm, unsigned long addr)
-{
-	down_read(&mm->mmap_sem);
-	__break_cow(mm, addr);
+out:
 	up_read(&mm->mmap_sem);
 }
 
@@ -439,17 +435,6 @@ static void remove_rmap_item_from_tree(s
 	cond_resched();		/* we're called from many long loops */
 }
 
-static void remove_all_slot_rmap_items(struct mm_slot *mm_slot)
-{
-	struct rmap_item *rmap_item, *node;
-
-	list_for_each_entry_safe(rmap_item, node, &mm_slot->rmap_list, link) {
-		remove_rmap_item_from_tree(rmap_item);
-		list_del(&rmap_item->link);
-		free_rmap_item(rmap_item);
-	}
-}
-
 static void remove_trailing_rmap_items(struct mm_slot *mm_slot,
 				       struct list_head *cur)
 {
@@ -471,6 +456,11 @@ static void remove_trailing_rmap_items(s
  * page and upping mmap_sem.  Nor does it fit with the way we skip dup'ing
  * rmap_items from parent to child at fork time (so as not to waste time
  * if exit comes before the next scan reaches it).
+ *
+ * Similarly, although we'd like to remove rmap_items (so updating counts
+ * and freeing memory) when unmerging an area, it's easier to leave that
+ * to the next pass of ksmd - consider, for example, how ksmd might be
+ * in cmp_and_merge_page on one of the rmap_items we would be removing.
  */
 static void unmerge_ksm_pages(struct vm_area_struct *vma,
 			      unsigned long start, unsigned long end)
@@ -495,7 +485,7 @@ static void unmerge_and_remove_all_rmap_
 				continue;
 			unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
 		}
-		remove_all_slot_rmap_items(mm_slot);
+		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -533,7 +523,7 @@ static void remove_mm_from_lists(struct
 	list_del(&mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	remove_all_slot_rmap_items(mm_slot);
+	remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
 	free_mm_slot(mm_slot);
 	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
 }
@@ -740,6 +730,29 @@ out:
 }
 
 /*
+ * try_to_merge_with_ksm_page - like try_to_merge_two_pages,
+ * but no new kernel page is allocated: kpage must already be a ksm page.
+ */
+static int try_to_merge_with_ksm_page(struct mm_struct *mm1,
+				      unsigned long addr1,
+				      struct page *page1,
+				      struct page *kpage)
+{
+	struct vm_area_struct *vma;
+	int err = -EFAULT;
+
+	down_read(&mm1->mmap_sem);
+	vma = find_vma(mm1, addr1);
+	if (!vma || vma->vm_start > addr1)
+		goto out;
+
+	err = try_to_merge_one_page(vma, page1, kpage);
+out:
+	up_read(&mm1->mmap_sem);
+	return err;
+}
+
+/*
  * try_to_merge_two_pages - take two identical pages and prepare them
  * to be merged into one page.
  *
@@ -772,9 +785,8 @@ static int try_to_merge_two_pages(struct
 	down_read(&mm1->mmap_sem);
 	vma = find_vma(mm1, addr1);
 	if (!vma || vma->vm_start > addr1) {
-		put_page(kpage);
 		up_read(&mm1->mmap_sem);
-		return err;
+		goto out;
 	}
 
 	copy_user_highpage(kpage, page1, addr1, vma);
@@ -782,56 +794,20 @@ static int try_to_merge_two_pages(struct
 	up_read(&mm1->mmap_sem);
 
 	if (!err) {
-		down_read(&mm2->mmap_sem);
-		vma = find_vma(mm2, addr2);
-		if (!vma || vma->vm_start > addr2) {
-			put_page(kpage);
-			up_read(&mm2->mmap_sem);
-			break_cow(mm1, addr1);
-			return -EFAULT;
-		}
-
-		err = try_to_merge_one_page(vma, page2, kpage);
-		up_read(&mm2->mmap_sem);
-
+		err = try_to_merge_with_ksm_page(mm2, addr2, page2, kpage);
 		/*
-		 * If the second try_to_merge_one_page failed, we have a
-		 * ksm page with just one pte pointing to it, so break it.
+		 * If that fails, we have a ksm page with only one pte
+		 * pointing to it: so break it.
 		 */
 		if (err)
 			break_cow(mm1, addr1);
 	}
-
+out:
 	put_page(kpage);
 	return err;
 }
 
 /*
- * try_to_merge_with_ksm_page - like try_to_merge_two_pages,
- * but no new kernel page is allocated: kpage must already be a ksm page.
- */
-static int try_to_merge_with_ksm_page(struct mm_struct *mm1,
-				      unsigned long addr1,
-				      struct page *page1,
-				      struct page *kpage)
-{
-	struct vm_area_struct *vma;
-	int err = -EFAULT;
-
-	down_read(&mm1->mmap_sem);
-	vma = find_vma(mm1, addr1);
-	if (!vma || vma->vm_start > addr1) {
-		up_read(&mm1->mmap_sem);
-		return err;
-	}
-
-	err = try_to_merge_one_page(vma, page1, kpage);
-	up_read(&mm1->mmap_sem);
-
-	return err;
-}
-
-/*
  * stable_tree_search - search page inside the stable tree
  * @page: the page that we are searching identical pages to.
  * @page2: pointer into identical page that we are holding inside the stable
@@ -1040,10 +1016,10 @@ static void stable_tree_append(struct rm
 }
 
 /*
- * cmp_and_merge_page - take a page computes its hash value and check if there
- * is similar hash value to different page,
- * in case we find that there is similar hash to different page we call to
- * try_to_merge_two_pages().
+ * cmp_and_merge_page - first see if page can be merged into the stable tree;
+ * if not, compare checksum to previous and if it's the same, see if page can
+ * be inserted into the unstable tree, or merged with a page already there and
+ * both transferred to the stable tree.
  *
  * @page: the page that we are searching identical page to.
  * @rmap_item: the reverse mapping into the virtual address of this page


* [PATCH 7/12] ksm: fix endless loop on oom
From: Hugh Dickins @ 2009-08-03 12:16 UTC
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

break_ksm has been looping endlessly, ignoring VM_FAULT_OOM: that should
only be a problem for ksmd when a memory control group imposes limits
(normally the OOM killer will kill others with an mm until it succeeds);
but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
do need to route the error (or kill) back to the caller (or signal
handling).

Test signal_pending in unmerge_ksm_pages, which could be a lengthy
procedure if it has to spill into swap: return -ERESTARTSYS so that
trivial signals will restart it but fatal signals will terminate it
(is that right?  we do different things in different places in mm,
none exactly this).

unmerge_and_remove_all_rmap_items was forgetting to lock when going
down the mm_list: fix that.  Whether or not it is successful, reset the
ksm_scan cursor to the head; but reset seqnr (shown in full_scans) only
if it is successful - the page counts will have gone down to zero.

This patch leaves a significant OOM deadlock, but it's a good step
on the way, and that deadlock is fixed in a subsequent patch.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |  108 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 85 insertions(+), 23 deletions(-)

--- ksm6/mm/ksm.c	2009-08-02 13:50:15.000000000 +0100
+++ ksm7/mm/ksm.c	2009-08-02 13:50:25.000000000 +0100
@@ -294,10 +294,10 @@ static inline int in_stable_tree(struct
  * Could a ksm page appear anywhere else?  Actually yes, in a VM_PFNMAP
  * mmap of /dev/mem or /dev/kmem, where we would not want to touch it.
  */
-static void break_ksm(struct vm_area_struct *vma, unsigned long addr)
+static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	int ret;
+	int ret = 0;
 
 	do {
 		cond_resched();
@@ -310,9 +310,36 @@ static void break_ksm(struct vm_area_str
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
-	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS)));
-
-	/* Which leaves us looping there if VM_FAULT_OOM: hmmm... */
+	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_OOM)));
+	/*
+	 * We must loop because handle_mm_fault() may back out if there's
+	 * any difficulty e.g. if pte accessed bit gets updated concurrently.
+	 *
+	 * VM_FAULT_WRITE is what we have been hoping for: it indicates that
+	 * COW has been broken, even if the vma does not permit VM_WRITE;
+	 * but note that a concurrent fault might break PageKsm for us.
+	 *
+	 * VM_FAULT_SIGBUS could occur if we race with truncation of the
+	 * backing file, which also invalidates anonymous pages: that's
+	 * okay, that truncation will have unmapped the PageKsm for us.
+	 *
+	 * VM_FAULT_OOM: at the time of writing (late July 2009), setting
+	 * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the
+	 * current task has TIF_MEMDIE set, and will be OOM killed on return
+	 * to user; and ksmd, having no mm, would never be chosen for that.
+	 *
+	 * But if the mm is in a limited mem_cgroup, then the fault may fail
+	 * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and
+	 * even ksmd can fail in this way - though it's usually breaking ksm
+	 * just to undo a merge it made a moment before, so unlikely to oom.
+	 *
+	 * That's a pity: we might therefore have more kernel pages allocated
+	 * than we're counting as nodes in the stable tree; but ksm_do_scan
+	 * will retry to break_cow on each pass, so should recover the page
+	 * in due course.  The important thing is to not let VM_MERGEABLE
+	 * be cleared while any such pages might remain in the area.
+	 */
+	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
 static void break_cow(struct mm_struct *mm, unsigned long addr)
@@ -462,39 +489,61 @@ static void remove_trailing_rmap_items(s
  * to the next pass of ksmd - consider, for example, how ksmd might be
  * in cmp_and_merge_page on one of the rmap_items we would be removing.
  */
-static void unmerge_ksm_pages(struct vm_area_struct *vma,
-			      unsigned long start, unsigned long end)
+static int unmerge_ksm_pages(struct vm_area_struct *vma,
+			     unsigned long start, unsigned long end)
 {
 	unsigned long addr;
+	int err = 0;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE)
-		break_ksm(vma, addr);
+	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+		if (signal_pending(current))
+			err = -ERESTARTSYS;
+		else
+			err = break_ksm(vma, addr);
+	}
+	return err;
 }
 
-static void unmerge_and_remove_all_rmap_items(void)
+static int unmerge_and_remove_all_rmap_items(void)
 {
 	struct mm_slot *mm_slot;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	int err = 0;
+
+	spin_lock(&ksm_mmlist_lock);
+	mm_slot = list_entry(ksm_mm_head.mm_list.next,
+						struct mm_slot, mm_list);
+	spin_unlock(&ksm_mmlist_lock);
 
-	list_for_each_entry(mm_slot, &ksm_mm_head.mm_list, mm_list) {
+	while (mm_slot != &ksm_mm_head) {
 		mm = mm_slot->mm;
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 				continue;
-			unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
+			err = unmerge_ksm_pages(vma,
+						vma->vm_start, vma->vm_end);
+			if (err) {
+				up_read(&mm->mmap_sem);
+				goto out;
+			}
 		}
 		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
 		up_read(&mm->mmap_sem);
+
+		spin_lock(&ksm_mmlist_lock);
+		mm_slot = list_entry(mm_slot->mm_list.next,
+						struct mm_slot, mm_list);
+		spin_unlock(&ksm_mmlist_lock);
 	}
 
+	ksm_scan.seqnr = 0;
+out:
 	spin_lock(&ksm_mmlist_lock);
-	if (ksm_scan.mm_slot != &ksm_mm_head) {
-		ksm_scan.mm_slot = &ksm_mm_head;
-		ksm_scan.seqnr++;
-	}
+	ksm_scan.mm_slot = &ksm_mm_head;
 	spin_unlock(&ksm_mmlist_lock);
+	return err;
 }
 
 static void remove_mm_from_lists(struct mm_struct *mm)
@@ -1058,6 +1107,8 @@ static void cmp_and_merge_page(struct pa
 	/*
 	 * A ksm page might have got here by fork, but its other
 	 * references have already been removed from the stable tree.
+	 * Or it might be left over from a break_ksm which failed
+	 * when the mem_cgroup had reached its limit: try again now.
 	 */
 	if (PageKsm(page))
 		break_cow(rmap_item->mm, rmap_item->address);
@@ -1293,6 +1344,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	int err;
 
 	switch (advice) {
 	case MADV_MERGEABLE:
@@ -1305,9 +1357,11 @@ int ksm_madvise(struct vm_area_struct *v
 				 VM_MIXEDMAP  | VM_SAO))
 			return 0;		/* just ignore the advice */
 
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
-			if (__ksm_enter(mm) < 0)
-				return -EAGAIN;
+		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+			err = __ksm_enter(mm);
+			if (err)
+				return err;
+		}
 
 		*vm_flags |= VM_MERGEABLE;
 		break;
@@ -1316,8 +1370,11 @@ int ksm_madvise(struct vm_area_struct *v
 		if (!(*vm_flags & VM_MERGEABLE))
 			return 0;		/* just ignore the advice */
 
-		if (vma->anon_vma)
-			unmerge_ksm_pages(vma, start, end);
+		if (vma->anon_vma) {
+			err = unmerge_ksm_pages(vma, start, end);
+			if (err)
+				return err;
+		}
 
 		*vm_flags &= ~VM_MERGEABLE;
 		break;
@@ -1448,8 +1505,13 @@ static ssize_t run_store(struct kobject
 	mutex_lock(&ksm_thread_mutex);
 	if (ksm_run != flags) {
 		ksm_run = flags;
-		if (flags & KSM_RUN_UNMERGE)
-			unmerge_and_remove_all_rmap_items();
+		if (flags & KSM_RUN_UNMERGE) {
+			err = unmerge_and_remove_all_rmap_items();
+			if (err) {
+				ksm_run = KSM_RUN_STOP;
+				count = err;
+			}
+		}
 	}
 	mutex_unlock(&ksm_thread_mutex);
 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 8/12] ksm: distribute remove_mm_from_lists
  2009-08-03 12:08 ` Hugh Dickins
@ 2009-08-03 12:17   ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:17 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Do some housekeeping in ksm.c, to help make the next patch easier
to understand: remove the function remove_mm_from_lists, distributing
its code to its callsites scan_get_next_rmap_item and __ksm_exit.

That turns out to be a win in scan_get_next_rmap_item: move its
remove_trailing_rmap_items and cursor advancement up, and it becomes
simpler than before.  __ksm_exit becomes messier, but will change
again; and moving its remove_trailing_rmap_items up lets us strengthen
the unstable tree item's age condition in remove_rmap_item_from_tree.
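
For illustration (not part of the patch), a tiny stand-alone C
demonstration of why the age arithmetic stays valid across seqnr
wraparound; the sample numbers are arbitrary, and exactly how KSM
stamps the sequence number into rmap_item->address is not shown here:

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned long seqnr = 256;	/* current scan number */
		unsigned long stamp = 255;	/* recorded at insertion */
		unsigned char age = (unsigned char)(seqnr - stamp);

		/* one scan ago, even though the low byte of seqnr wrapped */
		assert(age == 1);
		printf("age = %u\n", age);
		return 0;
	}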

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   97 ++++++++++++++++++++++-------------------------------
 1 file changed, 42 insertions(+), 55 deletions(-)

--- ksm7/mm/ksm.c	2009-08-02 13:50:25.000000000 +0100
+++ ksm8/mm/ksm.c	2009-08-02 13:50:32.000000000 +0100
@@ -444,14 +444,9 @@ static void remove_rmap_item_from_tree(s
 		 * But __ksm_exit has to be careful: do the rb_erase
 		 * if it's interrupting a scan, and this rmap_item was
 		 * inserted by this scan rather than left from before.
-		 *
-		 * Because of the case in which remove_mm_from_lists
-		 * increments seqnr before removing rmaps, unstable_nr
-		 * may even be 2 behind seqnr, but should never be
-		 * further behind.  Yes, I did have trouble with this!
 		 */
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
-		BUG_ON(age > 2);
+		BUG_ON(age > 1);
 		if (!age)
 			rb_erase(&rmap_item->node, &root_unstable_tree);
 		ksm_pages_unshared--;
@@ -546,37 +541,6 @@ out:
 	return err;
 }
 
-static void remove_mm_from_lists(struct mm_struct *mm)
-{
-	struct mm_slot *mm_slot;
-
-	spin_lock(&ksm_mmlist_lock);
-	mm_slot = get_mm_slot(mm);
-
-	/*
-	 * This mm_slot is always at the scanning cursor when we're
-	 * called from scan_get_next_rmap_item; but it's a special
-	 * case when we're called from __ksm_exit.
-	 */
-	if (ksm_scan.mm_slot == mm_slot) {
-		ksm_scan.mm_slot = list_entry(
-			mm_slot->mm_list.next, struct mm_slot, mm_list);
-		ksm_scan.address = 0;
-		ksm_scan.rmap_item = list_entry(
-			&ksm_scan.mm_slot->rmap_list, struct rmap_item, link);
-		if (ksm_scan.mm_slot == &ksm_mm_head)
-			ksm_scan.seqnr++;
-	}
-
-	hlist_del(&mm_slot->link);
-	list_del(&mm_slot->mm_list);
-	spin_unlock(&ksm_mmlist_lock);
-
-	remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
-	free_mm_slot(mm_slot);
-	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
-}
-
 static u32 calc_checksum(struct page *page)
 {
 	u32 checksum;
@@ -1248,33 +1212,31 @@ next_mm:
 		}
 	}
 
-	if (!ksm_scan.address) {
-		/*
-		 * We've completed a full scan of all vmas, holding mmap_sem
-		 * throughout, and found no VM_MERGEABLE: so do the same as
-		 * __ksm_exit does to remove this mm from all our lists now.
-		 */
-		remove_mm_from_lists(mm);
-		up_read(&mm->mmap_sem);
-		slot = ksm_scan.mm_slot;
-		if (slot != &ksm_mm_head)
-			goto next_mm;
-		return NULL;
-	}
-
 	/*
 	 * Nuke all the rmap_items that are above this current rmap:
 	 * because there were no VM_MERGEABLE vmas with such addresses.
 	 */
 	remove_trailing_rmap_items(slot, ksm_scan.rmap_item->link.next);
-	up_read(&mm->mmap_sem);
 
 	spin_lock(&ksm_mmlist_lock);
-	slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
-	ksm_scan.mm_slot = slot;
+	ksm_scan.mm_slot = list_entry(slot->mm_list.next,
+						struct mm_slot, mm_list);
+	if (ksm_scan.address == 0) {
+		/*
+		 * We've completed a full scan of all vmas, holding mmap_sem
+		 * throughout, and found no VM_MERGEABLE: so do the same as
+		 * __ksm_exit does to remove this mm from all our lists now.
+		 */
+		hlist_del(&slot->link);
+		list_del(&slot->mm_list);
+		free_mm_slot(slot);
+		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	}
 	spin_unlock(&ksm_mmlist_lock);
+	up_read(&mm->mmap_sem);
 
 	/* Repeat until we've completed scanning the whole list */
+	slot = ksm_scan.mm_slot;
 	if (slot != &ksm_mm_head)
 		goto next_mm;
 
@@ -1415,13 +1377,38 @@ int __ksm_enter(struct mm_struct *mm)
 
 void __ksm_exit(struct mm_struct *mm)
 {
+	struct mm_slot *mm_slot;
+
 	/*
 	 * This process is exiting: doesn't hold and doesn't need mmap_sem;
 	 * but we do need to exclude ksmd and other exiters while we modify
 	 * the various lists and trees.
 	 */
 	mutex_lock(&ksm_thread_mutex);
-	remove_mm_from_lists(mm);
+	spin_lock(&ksm_mmlist_lock);
+	mm_slot = get_mm_slot(mm);
+	if (!list_empty(&mm_slot->rmap_list)) {
+		spin_unlock(&ksm_mmlist_lock);
+		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
+		spin_lock(&ksm_mmlist_lock);
+	}
+
+	if (ksm_scan.mm_slot == mm_slot) {
+		ksm_scan.mm_slot = list_entry(
+			mm_slot->mm_list.next, struct mm_slot, mm_list);
+		ksm_scan.address = 0;
+		ksm_scan.rmap_item = list_entry(
+			&ksm_scan.mm_slot->rmap_list, struct rmap_item, link);
+		if (ksm_scan.mm_slot == &ksm_mm_head)
+			ksm_scan.seqnr++;
+	}
+
+	hlist_del(&mm_slot->link);
+	list_del(&mm_slot->mm_list);
+	spin_unlock(&ksm_mmlist_lock);
+
+	free_mm_slot(mm_slot);
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
 	mutex_unlock(&ksm_thread_mutex);
 }
 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 9/12] ksm: fix oom deadlock
  2009-08-03 12:08 ` Hugh Dickins
@ 2009-08-03 12:18   ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:18 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

There's a now-obvious deadlock in KSM's out-of-memory handling:
imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
trying to allocate a page to break KSM in an mm which becomes the
OOM victim (quite likely in the unmerge case): it's killed and goes
to exit, and hangs there waiting to acquire ksm_thread_mutex.

Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
though that made everything else: perhaps use mmap_sem somehow?
And part of the answer lies in the comments on unmerge_ksm_pages:
__ksm_exit should also leave all the rmap_item removal to ksmd.

But there's a fundamental problem: KSM relies upon mmap_sem to
guarantee the consistency of the mm it's dealing with, yet exit_mmap
tears down an mm without taking mmap_sem.  And bumping mm_users won't
help at all: that just ensures that the pages the OOM killer assumes
are on their way to being freed will not be freed.

The best answer seems to be to move the ksm_exit callout from just
before exit_mmap to the middle of exit_mmap: after the mm's pages
have been freed (if the mmu_gather is flushed), but before its page
tables and vma structures have been freed; and to down_write then
up_write mmap_sem there, to serialize with KSM's own reliance on
mmap_sem.

But KSM then needs to be careful, whenever it downs mmap_sem, to
check that the mm is not already exiting: there's a danger of using
find_vma on a layout that's being torn apart, or writing into page
tables which have been freed for reuse; and even do_anonymous_page
and __do_fault need to check they're not being called by break_ksm
to reinstate a pte after zap_pte_range has zapped that page table.

Though it might be clearer to add an exiting flag, set while holding
mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
a zapped pte.  All we need is to check whether mm_users is 0 - but we
must remember that ksmd may detect that before __ksm_exit is reached.
So ksm_test_exit(mm) is added to annotate such checks on mm->mm_users.
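
For illustration (not part of the patch), a user-space model of that
check; struct mm_model and model_test_exit are names invented for this
sketch, not the kernel's:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct mm_model {
		atomic_int mm_users;
	};

	/* mirrors the idea of ksm_test_exit(): no users left => exiting */
	static bool model_test_exit(struct mm_model *mm)
	{
		return atomic_load(&mm->mm_users) == 0;
	}

	int main(void)
	{
		struct mm_model mm = { .mm_users = 1 };

		printf("exiting? %d\n", model_test_exit(&mm));	/* 0 */
		atomic_fetch_sub(&mm.mm_users, 1);		/* last user gone */
		printf("exiting? %d\n", model_test_exit(&mm));	/* 1 */
		return 0;
	}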

__ksm_exit now has to leave clearing up the rmap_items to ksmd, since
that needs ksm_thread_mutex; but it shifts the exiting mm to just after
the ksm_scan cursor, so that it will soon be dealt with.  __ksm_enter
raises mm_count to hold the mm_struct, and ksmd's exit processing
(exactly like its processing when it finds all VM_MERGEABLEs unmapped)
mmdrops it; a similar procedure applies for KSM_RUN_UNMERGE (which has
stopped ksmd).

But also give __ksm_exit a fast path: when there's no complication
(no rmap_items attached to mm and it's not at the ksm_scan cursor),
it can safely do all the exiting work itself.  This is not just an
optimization: when ksmd is not running, the raised mm_count would
otherwise leak mm_structs.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 include/linux/ksm.h |   31 +++++++--
 kernel/fork.c       |    1 
 mm/ksm.c            |  144 ++++++++++++++++++++++++++++--------------
 mm/memory.c         |    5 -
 mm/mmap.c           |    9 ++
 5 files changed, 137 insertions(+), 53 deletions(-)

--- ksm8/include/linux/ksm.h	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/include/linux/ksm.h	2009-08-02 13:50:41.000000000 +0100
@@ -12,11 +12,14 @@
 #include <linux/sched.h>
 #include <linux/vmstat.h>
 
+struct mmu_gather;
+
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
-void __ksm_exit(struct mm_struct *mm);
+void __ksm_exit(struct mm_struct *mm,
+		struct mmu_gather **tlbp, unsigned long end);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -25,10 +28,24 @@ static inline int ksm_fork(struct mm_str
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
+/*
+ * For KSM to handle OOM without deadlock when it's breaking COW in a
+ * likely victim of the OOM killer, exit_mmap() has to serialize with
+ * ksm_exit() after freeing mm's pages but before freeing its page tables.
+ * That leaves a window in which KSM might refault pages which have just
+ * been finally unmapped: guard against that with ksm_test_exit(), and
+ * use it after getting mmap_sem in ksm.c, to check if mm is exiting.
+ */
+static inline bool ksm_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static inline void ksm_exit(struct mm_struct *mm,
+			    struct mmu_gather **tlbp, unsigned long end)
 {
 	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
-		__ksm_exit(mm);
+		__ksm_exit(mm, tlbp, end);
 }
 
 /*
@@ -64,7 +81,13 @@ static inline int ksm_fork(struct mm_str
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
+static inline bool ksm_test_exit(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void ksm_exit(struct mm_struct *mm,
+			    struct mmu_gather **tlbp, unsigned long end)
 {
 }
 
--- ksm8/kernel/fork.c	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/kernel/fork.c	2009-08-02 13:50:41.000000000 +0100
@@ -492,7 +492,6 @@ void mmput(struct mm_struct *mm)
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
-		ksm_exit(mm);
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
--- ksm8/mm/ksm.c	2009-08-02 13:50:32.000000000 +0100
+++ ksm9/mm/ksm.c	2009-08-02 13:50:41.000000000 +0100
@@ -32,6 +32,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/ksm.h>
 
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
 /*
@@ -347,6 +348,8 @@ static void break_cow(struct mm_struct *
 	struct vm_area_struct *vma;
 
 	down_read(&mm->mmap_sem);
+	if (ksm_test_exit(mm))
+		goto out;
 	vma = find_vma(mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
@@ -365,6 +368,8 @@ static struct page *get_mergeable_page(s
 	struct page *page;
 
 	down_read(&mm->mmap_sem);
+	if (ksm_test_exit(mm))
+		goto out;
 	vma = find_vma(mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
@@ -439,11 +444,11 @@ static void remove_rmap_item_from_tree(s
 	} else if (rmap_item->address & NODE_FLAG) {
 		unsigned char age;
 		/*
-		 * ksm_thread can and must skip the rb_erase, because
+		 * Usually ksmd can and must skip the rb_erase, because
 		 * root_unstable_tree was already reset to RB_ROOT.
-		 * But __ksm_exit has to be careful: do the rb_erase
-		 * if it's interrupting a scan, and this rmap_item was
-		 * inserted by this scan rather than left from before.
+		 * But be careful when an mm is exiting: do the rb_erase
+		 * if this rmap_item was inserted by this scan, rather
+		 * than left over from before.
 		 */
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
@@ -491,6 +496,8 @@ static int unmerge_ksm_pages(struct vm_a
 	int err = 0;
 
 	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+		if (ksm_test_exit(vma->vm_mm))
+			break;
 		if (signal_pending(current))
 			err = -ERESTARTSYS;
 		else
@@ -507,34 +514,50 @@ static int unmerge_and_remove_all_rmap_i
 	int err = 0;
 
 	spin_lock(&ksm_mmlist_lock);
-	mm_slot = list_entry(ksm_mm_head.mm_list.next,
+	ksm_scan.mm_slot = list_entry(ksm_mm_head.mm_list.next,
 						struct mm_slot, mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	while (mm_slot != &ksm_mm_head) {
+	for (mm_slot = ksm_scan.mm_slot;
+			mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) {
 		mm = mm_slot->mm;
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (ksm_test_exit(mm))
+				break;
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 				continue;
 			err = unmerge_ksm_pages(vma,
 						vma->vm_start, vma->vm_end);
-			if (err) {
-				up_read(&mm->mmap_sem);
-				goto out;
-			}
+			if (err)
+				goto error;
 		}
+
 		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
-		up_read(&mm->mmap_sem);
 
 		spin_lock(&ksm_mmlist_lock);
-		mm_slot = list_entry(mm_slot->mm_list.next,
+		ksm_scan.mm_slot = list_entry(mm_slot->mm_list.next,
 						struct mm_slot, mm_list);
-		spin_unlock(&ksm_mmlist_lock);
+		if (ksm_test_exit(mm)) {
+			hlist_del(&mm_slot->link);
+			list_del(&mm_slot->mm_list);
+			spin_unlock(&ksm_mmlist_lock);
+
+			free_mm_slot(mm_slot);
+			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			up_read(&mm->mmap_sem);
+			mmdrop(mm);
+		} else {
+			spin_unlock(&ksm_mmlist_lock);
+			up_read(&mm->mmap_sem);
+		}
 	}
 
 	ksm_scan.seqnr = 0;
-out:
+	return 0;
+
+error:
+	up_read(&mm->mmap_sem);
 	spin_lock(&ksm_mmlist_lock);
 	ksm_scan.mm_slot = &ksm_mm_head;
 	spin_unlock(&ksm_mmlist_lock);
@@ -755,6 +778,9 @@ static int try_to_merge_with_ksm_page(st
 	int err = -EFAULT;
 
 	down_read(&mm1->mmap_sem);
+	if (ksm_test_exit(mm1))
+		goto out;
+
 	vma = find_vma(mm1, addr1);
 	if (!vma || vma->vm_start > addr1)
 		goto out;
@@ -796,6 +822,10 @@ static int try_to_merge_two_pages(struct
 		return err;
 
 	down_read(&mm1->mmap_sem);
+	if (ksm_test_exit(mm1)) {
+		up_read(&mm1->mmap_sem);
+		goto out;
+	}
 	vma = find_vma(mm1, addr1);
 	if (!vma || vma->vm_start > addr1) {
 		up_read(&mm1->mmap_sem);
@@ -1181,7 +1211,12 @@ next_mm:
 
 	mm = slot->mm;
 	down_read(&mm->mmap_sem);
-	for (vma = find_vma(mm, ksm_scan.address); vma; vma = vma->vm_next) {
+	if (ksm_test_exit(mm))
+		vma = NULL;
+	else
+		vma = find_vma(mm, ksm_scan.address);
+
+	for (; vma; vma = vma->vm_next) {
 		if (!(vma->vm_flags & VM_MERGEABLE))
 			continue;
 		if (ksm_scan.address < vma->vm_start)
@@ -1190,6 +1225,8 @@ next_mm:
 			ksm_scan.address = vma->vm_end;
 
 		while (ksm_scan.address < vma->vm_end) {
+			if (ksm_test_exit(mm))
+				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
 			if (*page && PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
@@ -1212,6 +1249,11 @@ next_mm:
 		}
 	}
 
+	if (ksm_test_exit(mm)) {
+		ksm_scan.address = 0;
+		ksm_scan.rmap_item = list_entry(&slot->rmap_list,
+						struct rmap_item, link);
+	}
 	/*
 	 * Nuke all the rmap_items that are above this current rmap:
 	 * because there were no VM_MERGEABLE vmas with such addresses.
@@ -1226,24 +1268,29 @@ next_mm:
 		 * We've completed a full scan of all vmas, holding mmap_sem
 		 * throughout, and found no VM_MERGEABLE: so do the same as
 		 * __ksm_exit does to remove this mm from all our lists now.
+		 * This applies either when cleaning up after __ksm_exit
+		 * (but beware: we can reach here even before __ksm_exit),
+		 * or when all VM_MERGEABLE areas have been unmapped (and
+		 * mmap_sem then protects against race with MADV_MERGEABLE).
 		 */
 		hlist_del(&slot->link);
 		list_del(&slot->mm_list);
+		spin_unlock(&ksm_mmlist_lock);
+
 		free_mm_slot(slot);
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		up_read(&mm->mmap_sem);
+		mmdrop(mm);
+	} else {
+		spin_unlock(&ksm_mmlist_lock);
+		up_read(&mm->mmap_sem);
 	}
-	spin_unlock(&ksm_mmlist_lock);
-	up_read(&mm->mmap_sem);
 
 	/* Repeat until we've completed scanning the whole list */
 	slot = ksm_scan.mm_slot;
 	if (slot != &ksm_mm_head)
 		goto next_mm;
 
-	/*
-	 * Bump seqnr here rather than at top, so that __ksm_exit
-	 * can skip rb_erase on unstable tree until we run again.
-	 */
 	ksm_scan.seqnr++;
 	return NULL;
 }
@@ -1368,6 +1415,7 @@ int __ksm_enter(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	atomic_inc(&mm->mm_count);
 
 	if (needs_wakeup)
 		wake_up_interruptible(&ksm_thread_wait);
@@ -1375,41 +1423,45 @@ int __ksm_enter(struct mm_struct *mm)
 	return 0;
 }
 
-void __ksm_exit(struct mm_struct *mm)
+void __ksm_exit(struct mm_struct *mm,
+		struct mmu_gather **tlbp, unsigned long end)
 {
 	struct mm_slot *mm_slot;
+	int easy_to_free = 0;
 
 	/*
-	 * This process is exiting: doesn't hold and doesn't need mmap_sem;
-	 * but we do need to exclude ksmd and other exiters while we modify
-	 * the various lists and trees.
+	 * This process is exiting: if it's straightforward (as is the
+	 * case when ksmd was never running), free mm_slot immediately.
+	 * But if it's at the cursor or has rmap_items linked to it, use
+	 * mmap_sem to synchronize with any break_cows before pagetables
+	 * are freed, and leave the mm_slot on the list for ksmd to free.
+	 * Beware: ksm may already have noticed it exiting and freed the slot.
 	 */
-	mutex_lock(&ksm_thread_mutex);
+
 	spin_lock(&ksm_mmlist_lock);
 	mm_slot = get_mm_slot(mm);
-	if (!list_empty(&mm_slot->rmap_list)) {
-		spin_unlock(&ksm_mmlist_lock);
-		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
-		spin_lock(&ksm_mmlist_lock);
-	}
-
-	if (ksm_scan.mm_slot == mm_slot) {
-		ksm_scan.mm_slot = list_entry(
-			mm_slot->mm_list.next, struct mm_slot, mm_list);
-		ksm_scan.address = 0;
-		ksm_scan.rmap_item = list_entry(
-			&ksm_scan.mm_slot->rmap_list, struct rmap_item, link);
-		if (ksm_scan.mm_slot == &ksm_mm_head)
-			ksm_scan.seqnr++;
+	if (mm_slot && ksm_scan.mm_slot != mm_slot) {
+		if (list_empty(&mm_slot->rmap_list)) {
+			hlist_del(&mm_slot->link);
+			list_del(&mm_slot->mm_list);
+			easy_to_free = 1;
+		} else {
+			list_move(&mm_slot->mm_list,
+				  &ksm_scan.mm_slot->mm_list);
+		}
 	}
-
-	hlist_del(&mm_slot->link);
-	list_del(&mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	free_mm_slot(mm_slot);
-	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
-	mutex_unlock(&ksm_thread_mutex);
+	if (easy_to_free) {
+		free_mm_slot(mm_slot);
+		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		tlb_finish_mmu(*tlbp, 0, end);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+		*tlbp = tlb_gather_mmu(mm, 1);
+	}
 }
 
 #define KSM_ATTR_RO(_name) \
--- ksm8/mm/memory.c	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/mm/memory.c	2009-08-02 13:50:41.000000000 +0100
@@ -2647,8 +2647,9 @@ static int do_anonymous_page(struct mm_s
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table))
+	if (!pte_none(*page_table) || ksm_test_exit(mm))
 		goto release;
+
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
@@ -2790,7 +2791,7 @@ static int __do_fault(struct mm_struct *
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte))) {
+	if (likely(pte_same(*page_table, orig_pte) && !ksm_test_exit(mm))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
--- ksm8/mm/mmap.c	2009-06-25 05:18:10.000000000 +0100
+++ ksm9/mm/mmap.c	2009-08-02 13:50:41.000000000 +0100
@@ -27,6 +27,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/ksm.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_counter.h>
 
@@ -2114,6 +2115,14 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+
+	/*
+	 * For KSM to handle OOM without deadlock when it's breaking COW in a
+	 * likely victim of the OOM killer, we must serialize with ksm_exit()
+	 * after freeing mm's pages but before freeing its page tables.
+	 */
+	ksm_exit(mm, &tlb, end);
+
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 9/12] ksm: fix oom deadlock
@ 2009-08-03 12:18   ` Hugh Dickins
  0 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:18 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

There's a now-obvious deadlock in KSM's out-of-memory handling:
imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
trying to allocate a page to break KSM in an mm which becomes the
OOM victim (quite likely in the unmerge case): it's killed and goes
to exit, and hangs there waiting to acquire ksm_thread_mutex.

Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
though that made everything else: perhaps use mmap_sem somehow?
And part of the answer lies in the comments on unmerge_ksm_pages:
__ksm_exit should also leave all the rmap_item removal to ksmd.

But there's a fundamental problem, that KSM relies upon mmap_sem to
guarantee the consistency of the mm it's dealing with, yet exit_mmap
tears down an mm without taking mmap_sem.  And bumping mm_users won't
help at all, that just ensures that the pages the OOM killer assumes
are on their way to being freed will not be freed.

The best answer seems to be, to move the ksm_exit callout from just
before exit_mmap, to the middle of exit_mmap: after the mm's pages
have been freed (if the mmu_gather is flushed), but before its page
tables and vma structures have been freed; and down_write,up_write
mmap_sem there to serialize with KSM's own reliance on mmap_sem.

But KSM then needs to be careful, whenever it downs mmap_sem, to
check that the mm is not already exiting: there's a danger of using
find_vma on a layout that's being torn apart, or writing into page
tables which have been freed for reuse; and even do_anonymous_page
and __do_fault need to check they're not being called by break_ksm
to reinstate a pte after zap_pte_range has zapped that page table.

Though it might be clearer to add an exiting flag, set while holding
mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
a zapped pte.  All we need is to check whether mm_users is 0 - but
must remember that ksmd may detect that before __ksm_exit is reached.
So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.

__ksm_exit now has to leave clearing up the rmap_items to ksmd,
that needs ksm_thread_mutex; but shift the exiting mm just after the
ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
mm_count to hold the mm_struct, ksmd's exit processing (exactly like
its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

But also give __ksm_exit a fast path: when there's no complication
(no rmap_items attached to mm and it's not at the ksm_scan cursor),
it can safely do all the exiting work itself.  This is not just an
optimization: when ksmd is not running, the raised mm_count would
otherwise leak mm_structs.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 include/linux/ksm.h |   31 +++++++--
 kernel/fork.c       |    1 
 mm/ksm.c            |  144 ++++++++++++++++++++++++++++--------------
 mm/memory.c         |    5 -
 mm/mmap.c           |    9 ++
 5 files changed, 137 insertions(+), 53 deletions(-)

--- ksm8/include/linux/ksm.h	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/include/linux/ksm.h	2009-08-02 13:50:41.000000000 +0100
@@ -12,11 +12,14 @@
 #include <linux/sched.h>
 #include <linux/vmstat.h>
 
+struct mmu_gather;
+
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
-void __ksm_exit(struct mm_struct *mm);
+void __ksm_exit(struct mm_struct *mm,
+		struct mmu_gather **tlbp, unsigned long end);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -25,10 +28,24 @@ static inline int ksm_fork(struct mm_str
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
+/*
+ * For KSM to handle OOM without deadlock when it's breaking COW in a
+ * likely victim of the OOM killer, exit_mmap() has to serialize with
+ * ksm_exit() after freeing mm's pages but before freeing its page tables.
+ * That leaves a window in which KSM might refault pages which have just
+ * been finally unmapped: guard against that with ksm_test_exit(), and
+ * use it after getting mmap_sem in ksm.c, to check if mm is exiting.
+ */
+static inline bool ksm_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static inline void ksm_exit(struct mm_struct *mm,
+			    struct mmu_gather **tlbp, unsigned long end)
 {
 	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
-		__ksm_exit(mm);
+		__ksm_exit(mm, tlbp, end);
 }
 
 /*
@@ -64,7 +81,13 @@ static inline int ksm_fork(struct mm_str
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
+static inline bool ksm_test_exit(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void ksm_exit(struct mm_struct *mm,
+			    struct mmu_gather **tlbp, unsigned long end)
 {
 }
 
--- ksm8/kernel/fork.c	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/kernel/fork.c	2009-08-02 13:50:41.000000000 +0100
@@ -492,7 +492,6 @@ void mmput(struct mm_struct *mm)
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
-		ksm_exit(mm);
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
--- ksm8/mm/ksm.c	2009-08-02 13:50:32.000000000 +0100
+++ ksm9/mm/ksm.c	2009-08-02 13:50:41.000000000 +0100
@@ -32,6 +32,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/ksm.h>
 
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
 /*
@@ -347,6 +348,8 @@ static void break_cow(struct mm_struct *
 	struct vm_area_struct *vma;
 
 	down_read(&mm->mmap_sem);
+	if (ksm_test_exit(mm))
+		goto out;
 	vma = find_vma(mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
@@ -365,6 +368,8 @@ static struct page *get_mergeable_page(s
 	struct page *page;
 
 	down_read(&mm->mmap_sem);
+	if (ksm_test_exit(mm))
+		goto out;
 	vma = find_vma(mm, addr);
 	if (!vma || vma->vm_start > addr)
 		goto out;
@@ -439,11 +444,11 @@ static void remove_rmap_item_from_tree(s
 	} else if (rmap_item->address & NODE_FLAG) {
 		unsigned char age;
 		/*
-		 * ksm_thread can and must skip the rb_erase, because
+		 * Usually ksmd can and must skip the rb_erase, because
 		 * root_unstable_tree was already reset to RB_ROOT.
-		 * But __ksm_exit has to be careful: do the rb_erase
-		 * if it's interrupting a scan, and this rmap_item was
-		 * inserted by this scan rather than left from before.
+		 * But be careful when an mm is exiting: do the rb_erase
+		 * if this rmap_item was inserted by this scan, rather
+		 * than left over from before.
 		 */
 		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 		BUG_ON(age > 1);
@@ -491,6 +496,8 @@ static int unmerge_ksm_pages(struct vm_a
 	int err = 0;
 
 	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+		if (ksm_test_exit(vma->vm_mm))
+			break;
 		if (signal_pending(current))
 			err = -ERESTARTSYS;
 		else
@@ -507,34 +514,50 @@ static int unmerge_and_remove_all_rmap_i
 	int err = 0;
 
 	spin_lock(&ksm_mmlist_lock);
-	mm_slot = list_entry(ksm_mm_head.mm_list.next,
+	ksm_scan.mm_slot = list_entry(ksm_mm_head.mm_list.next,
 						struct mm_slot, mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	while (mm_slot != &ksm_mm_head) {
+	for (mm_slot = ksm_scan.mm_slot;
+			mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) {
 		mm = mm_slot->mm;
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (ksm_test_exit(mm))
+				break;
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 				continue;
 			err = unmerge_ksm_pages(vma,
 						vma->vm_start, vma->vm_end);
-			if (err) {
-				up_read(&mm->mmap_sem);
-				goto out;
-			}
+			if (err)
+				goto error;
 		}
+
 		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
-		up_read(&mm->mmap_sem);
 
 		spin_lock(&ksm_mmlist_lock);
-		mm_slot = list_entry(mm_slot->mm_list.next,
+		ksm_scan.mm_slot = list_entry(mm_slot->mm_list.next,
 						struct mm_slot, mm_list);
-		spin_unlock(&ksm_mmlist_lock);
+		if (ksm_test_exit(mm)) {
+			hlist_del(&mm_slot->link);
+			list_del(&mm_slot->mm_list);
+			spin_unlock(&ksm_mmlist_lock);
+
+			free_mm_slot(mm_slot);
+			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			up_read(&mm->mmap_sem);
+			mmdrop(mm);
+		} else {
+			spin_unlock(&ksm_mmlist_lock);
+			up_read(&mm->mmap_sem);
+		}
 	}
 
 	ksm_scan.seqnr = 0;
-out:
+	return 0;
+
+error:
+	up_read(&mm->mmap_sem);
 	spin_lock(&ksm_mmlist_lock);
 	ksm_scan.mm_slot = &ksm_mm_head;
 	spin_unlock(&ksm_mmlist_lock);
@@ -755,6 +778,9 @@ static int try_to_merge_with_ksm_page(st
 	int err = -EFAULT;
 
 	down_read(&mm1->mmap_sem);
+	if (ksm_test_exit(mm1))
+		goto out;
+
 	vma = find_vma(mm1, addr1);
 	if (!vma || vma->vm_start > addr1)
 		goto out;
@@ -796,6 +822,10 @@ static int try_to_merge_two_pages(struct
 		return err;
 
 	down_read(&mm1->mmap_sem);
+	if (ksm_test_exit(mm1)) {
+		up_read(&mm1->mmap_sem);
+		goto out;
+	}
 	vma = find_vma(mm1, addr1);
 	if (!vma || vma->vm_start > addr1) {
 		up_read(&mm1->mmap_sem);
@@ -1181,7 +1211,12 @@ next_mm:
 
 	mm = slot->mm;
 	down_read(&mm->mmap_sem);
-	for (vma = find_vma(mm, ksm_scan.address); vma; vma = vma->vm_next) {
+	if (ksm_test_exit(mm))
+		vma = NULL;
+	else
+		vma = find_vma(mm, ksm_scan.address);
+
+	for (; vma; vma = vma->vm_next) {
 		if (!(vma->vm_flags & VM_MERGEABLE))
 			continue;
 		if (ksm_scan.address < vma->vm_start)
@@ -1190,6 +1225,8 @@ next_mm:
 			ksm_scan.address = vma->vm_end;
 
 		while (ksm_scan.address < vma->vm_end) {
+			if (ksm_test_exit(mm))
+				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
 			if (*page && PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
@@ -1212,6 +1249,11 @@ next_mm:
 		}
 	}
 
+	if (ksm_test_exit(mm)) {
+		ksm_scan.address = 0;
+		ksm_scan.rmap_item = list_entry(&slot->rmap_list,
+						struct rmap_item, link);
+	}
 	/*
 	 * Nuke all the rmap_items that are above this current rmap:
 	 * because there were no VM_MERGEABLE vmas with such addresses.
@@ -1226,24 +1268,29 @@ next_mm:
 		 * We've completed a full scan of all vmas, holding mmap_sem
 		 * throughout, and found no VM_MERGEABLE: so do the same as
 		 * __ksm_exit does to remove this mm from all our lists now.
+		 * This applies either when cleaning up after __ksm_exit
+		 * (but beware: we can reach here even before __ksm_exit),
+		 * or when all VM_MERGEABLE areas have been unmapped (and
+		 * mmap_sem then protects against race with MADV_MERGEABLE).
 		 */
 		hlist_del(&slot->link);
 		list_del(&slot->mm_list);
+		spin_unlock(&ksm_mmlist_lock);
+
 		free_mm_slot(slot);
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		up_read(&mm->mmap_sem);
+		mmdrop(mm);
+	} else {
+		spin_unlock(&ksm_mmlist_lock);
+		up_read(&mm->mmap_sem);
 	}
-	spin_unlock(&ksm_mmlist_lock);
-	up_read(&mm->mmap_sem);
 
 	/* Repeat until we've completed scanning the whole list */
 	slot = ksm_scan.mm_slot;
 	if (slot != &ksm_mm_head)
 		goto next_mm;
 
-	/*
-	 * Bump seqnr here rather than at top, so that __ksm_exit
-	 * can skip rb_erase on unstable tree until we run again.
-	 */
 	ksm_scan.seqnr++;
 	return NULL;
 }
@@ -1368,6 +1415,7 @@ int __ksm_enter(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	atomic_inc(&mm->mm_count);
 
 	if (needs_wakeup)
 		wake_up_interruptible(&ksm_thread_wait);
@@ -1375,41 +1423,45 @@ int __ksm_enter(struct mm_struct *mm)
 	return 0;
 }
 
-void __ksm_exit(struct mm_struct *mm)
+void __ksm_exit(struct mm_struct *mm,
+		struct mmu_gather **tlbp, unsigned long end)
 {
 	struct mm_slot *mm_slot;
+	int easy_to_free = 0;
 
 	/*
-	 * This process is exiting: doesn't hold and doesn't need mmap_sem;
-	 * but we do need to exclude ksmd and other exiters while we modify
-	 * the various lists and trees.
+	 * This process is exiting: if it's straightforward (as is the
+	 * case when ksmd was never running), free mm_slot immediately.
+	 * But if it's at the cursor or has rmap_items linked to it, use
+	 * mmap_sem to synchronize with any break_cows before pagetables
+	 * are freed, and leave the mm_slot on the list for ksmd to free.
+	 * Beware: ksm may already have noticed it exiting and freed the slot.
 	 */
-	mutex_lock(&ksm_thread_mutex);
+
 	spin_lock(&ksm_mmlist_lock);
 	mm_slot = get_mm_slot(mm);
-	if (!list_empty(&mm_slot->rmap_list)) {
-		spin_unlock(&ksm_mmlist_lock);
-		remove_trailing_rmap_items(mm_slot, mm_slot->rmap_list.next);
-		spin_lock(&ksm_mmlist_lock);
-	}
-
-	if (ksm_scan.mm_slot == mm_slot) {
-		ksm_scan.mm_slot = list_entry(
-			mm_slot->mm_list.next, struct mm_slot, mm_list);
-		ksm_scan.address = 0;
-		ksm_scan.rmap_item = list_entry(
-			&ksm_scan.mm_slot->rmap_list, struct rmap_item, link);
-		if (ksm_scan.mm_slot == &ksm_mm_head)
-			ksm_scan.seqnr++;
+	if (mm_slot && ksm_scan.mm_slot != mm_slot) {
+		if (list_empty(&mm_slot->rmap_list)) {
+			hlist_del(&mm_slot->link);
+			list_del(&mm_slot->mm_list);
+			easy_to_free = 1;
+		} else {
+			list_move(&mm_slot->mm_list,
+				  &ksm_scan.mm_slot->mm_list);
+		}
 	}
-
-	hlist_del(&mm_slot->link);
-	list_del(&mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	free_mm_slot(mm_slot);
-	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
-	mutex_unlock(&ksm_thread_mutex);
+	if (easy_to_free) {
+		free_mm_slot(mm_slot);
+		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		tlb_finish_mmu(*tlbp, 0, end);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+		*tlbp = tlb_gather_mmu(mm, 1);
+	}
 }
 
 #define KSM_ATTR_RO(_name) \
--- ksm8/mm/memory.c	2009-08-01 05:02:09.000000000 +0100
+++ ksm9/mm/memory.c	2009-08-02 13:50:41.000000000 +0100
@@ -2647,8 +2647,9 @@ static int do_anonymous_page(struct mm_s
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table))
+	if (!pte_none(*page_table) || ksm_test_exit(mm))
 		goto release;
+
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
@@ -2790,7 +2791,7 @@ static int __do_fault(struct mm_struct *
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte))) {
+	if (likely(pte_same(*page_table, orig_pte) && !ksm_test_exit(mm))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
--- ksm8/mm/mmap.c	2009-06-25 05:18:10.000000000 +0100
+++ ksm9/mm/mmap.c	2009-08-02 13:50:41.000000000 +0100
@@ -27,6 +27,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/ksm.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_counter.h>
 
@@ -2114,6 +2115,14 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+
+	/*
+	 * For KSM to handle OOM without deadlock when it's breaking COW in a
+	 * likely victim of the OOM killer, we must serialize with ksm_exit()
+	 * after freeing mm's pages but before freeing its page tables.
+	 */
+	ksm_exit(mm, &tlb, end);
+
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 


^ permalink raw reply	[flat|nested] 106+ messages in thread
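The back-to-back down_write()/up_write() on mmap_sem in the hunk above does no
work under the lock on purpose: it is only a barrier, making the exiting task
wait until any break_cow() that already holds mmap_sem for read has finished
with the page tables, before exit_mmap() frees them.  A minimal user-space
sketch of that idiom follows; it is an analogy with invented names, not kernel
code, and it omits the exiting-flag re-check that KSM does under the lock.

/*
 * Sketch of the lock-as-barrier idiom (hypothetical, user-space only).
 * Build with: cc -pthread barrier_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem_like = PTHREAD_RWLOCK_INITIALIZER;

static void *cow_breaker(void *arg)
{
	/* models break_cow() running with mmap_sem held for read */
	pthread_rwlock_rdlock(&mmap_sem_like);
	puts("worker: faulting pages back in under the lock");
	usleep(100 * 1000);
	pthread_rwlock_unlock(&mmap_sem_like);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, cow_breaker, NULL);
	usleep(10 * 1000);	/* give the worker a chance to take the lock */

	/* exit path: the empty write-lock/unlock pair acts as the barrier */
	pthread_rwlock_wrlock(&mmap_sem_like);
	pthread_rwlock_unlock(&mmap_sem_like);
	puts("exit path: earlier critical sections are done;");
	puts("exit path: now safe to free the page tables");

	pthread_join(t, NULL);
	return 0;
}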

* [PATCH 10/12] ksm: sysfs and defaults
  2009-08-03 12:08 ` Hugh Dickins
@ 2009-08-03 12:19   ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:19 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
to provide the /sys/kernel/mm/ksm files to tune and activate it.

Make KSM depend on SYSFS?  Could do, but it might be better to provide
some defaults so that KSM works out-of-the-box, ready for testers to
madvise MADV_MERGEABLE, even without SYSFS.

Though anyone serious is likely to want to retune the numbers to their
taste once they have experience; and whether these settings ever reach
2.6.32 can be discussed along the way.  

Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

--- ksm9/mm/ksm.c	2009-08-02 13:50:41.000000000 +0100
+++ ksm10/mm/ksm.c	2009-08-02 13:50:48.000000000 +0100
@@ -163,18 +163,18 @@ static unsigned long ksm_pages_unshared;
 static unsigned long ksm_rmap_items;
 
 /* Limit on the number of unswappable pages used */
-static unsigned long ksm_max_kernel_pages;
+static unsigned long ksm_max_kernel_pages = 2000;
 
 /* Number of pages ksmd should scan in one batch */
-static unsigned int ksm_thread_pages_to_scan;
+static unsigned int ksm_thread_pages_to_scan = 200;
 
 /* Milliseconds ksmd should sleep between batches */
-static unsigned int ksm_thread_sleep_millisecs;
+static unsigned int ksm_thread_sleep_millisecs = 20;
 
 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
 #define KSM_RUN_UNMERGE	2
-static unsigned int ksm_run;
+static unsigned int ksm_run = KSM_RUN_MERGE;
 
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
@@ -506,6 +506,10 @@ static int unmerge_ksm_pages(struct vm_a
 	return err;
 }
 
+#ifdef CONFIG_SYSFS
+/*
+ * Only called through the sysfs control interface:
+ */
 static int unmerge_and_remove_all_rmap_items(void)
 {
 	struct mm_slot *mm_slot;
@@ -563,6 +567,7 @@ error:
 	spin_unlock(&ksm_mmlist_lock);
 	return err;
 }
+#endif /* CONFIG_SYSFS */
 
 static u32 calc_checksum(struct page *page)
 {
@@ -1464,6 +1469,11 @@ void __ksm_exit(struct mm_struct *mm,
 	}
 }
 
+#ifdef CONFIG_SYSFS
+/*
+ * This all compiles without CONFIG_SYSFS, but is a waste of space.
+ */
+
 #define KSM_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 #define KSM_ATTR(_name) \
@@ -1646,6 +1656,7 @@ static struct attribute_group ksm_attr_g
 	.attrs = ksm_attrs,
 	.name = "ksm",
 };
+#endif /* CONFIG_SYSFS */
 
 static int __init ksm_init(void)
 {
@@ -1667,16 +1678,17 @@ static int __init ksm_init(void)
 		goto out_free2;
 	}
 
+#ifdef CONFIG_SYSFS
 	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (err) {
 		printk(KERN_ERR "ksm: register sysfs failed\n");
-		goto out_free3;
+		kthread_stop(ksm_thread);
+		goto out_free2;
 	}
+#endif /* CONFIG_SYSFS */
 
 	return 0;
 
-out_free3:
-	kthread_stop(ksm_thread);
 out_free2:
 	mm_slots_hash_free();
 out_free1:

^ permalink raw reply	[flat|nested] 106+ messages in thread
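Since the defaults above make ksmd start merging as soon as an area is
registered, anyone retuning them will do so through the sysfs files this patch
keeps under CONFIG_SYSFS.  A rough user-space sketch, assuming CONFIG_SYSFS=y
and the /sys/kernel/mm/ksm/ paths documented in the next patch (the writes
need root, and error handling is pared down):

/* sketch: read back and override the KSM defaults via sysfs */
#include <stdio.h>

static long read_knob(const char *name)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

static void write_knob(const char *name, long val)	/* needs root */
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "%ld\n", val);
		fclose(f);
	}
}

int main(void)
{
	printf("run=%ld pages_to_scan=%ld sleep_millisecs=%ld max_kernel_pages=%ld\n",
	       read_knob("run"), read_knob("pages_to_scan"),
	       read_knob("sleep_millisecs"), read_knob("max_kernel_pages"));

	/* example retuning: bigger batches, slightly longer sleeps */
	write_knob("pages_to_scan", 1000);
	write_knob("sleep_millisecs", 50);
	return 0;
}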

* [PATCH 11/12] ksm: add some documentation
  2009-08-03 12:08 ` Hugh Dickins
@ 2009-08-03 12:21   ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:21 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Michael Kerrisk, Randy Dunlap, linux-kernel,
	linux-mm

Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
---

 Documentation/vm/00-INDEX |    2 
 Documentation/vm/ksm.txt  |   89 ++++++++++++++++++++++++++++++++++++
 mm/Kconfig                |    1 
 3 files changed, 92 insertions(+)

--- ksm10/Documentation/vm/00-INDEX	2009-06-10 04:05:27.000000000 +0100
+++ ksm11/Documentation/vm/00-INDEX	2009-08-02 13:50:57.000000000 +0100
@@ -6,6 +6,8 @@ balance
 	- various information on memory balancing.
 hugetlbpage.txt
 	- a brief summary of hugetlbpage support in the Linux kernel.
+ksm.txt
+	- how to use the Kernel Samepage Merging feature.
 locking
 	- info on how locking and synchronization is done in the Linux vm code.
 numa
--- ksm10/Documentation/vm/ksm.txt	1970-01-01 01:00:00.000000000 +0100
+++ ksm11/Documentation/vm/ksm.txt	2009-08-02 13:50:57.000000000 +0100
@@ -0,0 +1,89 @@
+How to use the Kernel Samepage Merging feature
+----------------------------------------------
+
+KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
+added to the Linux kernel in 2.6.32.  See mm/ksm.c for its implementation,
+and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
+
+The KSM daemon ksmd periodically scans those areas of user memory which
+have been registered with it, looking for pages of identical content which
+can be replaced by a single write-protected page (which is automatically
+copied if a process later wants to update its content).
+
+KSM was originally developed for use with KVM (where it was known as
+Kernel Shared Memory), to fit more virtual machines into physical memory,
+by sharing the data common between them.  But it can be useful to any
+application which generates many instances of the same data.
+
+KSM only merges anonymous (private) pages, never pagecache (file) pages.
+KSM's merged pages are at present locked into kernel memory for as long
+as they are shared: so cannot be swapped out like the user pages they
+replace (but swapping KSM pages should follow soon in a later release).
+
+KSM only operates on those areas of address space which an application
+has advised to be likely candidates for merging, by using the madvise(2)
+system call: int madvise(addr, length, MADV_MERGEABLE).
+
+The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel
+that advice and restore unshared pages: whereupon KSM unmerges whatever
+it merged in that range.  Note: this unmerging call may suddenly require
+more memory than is available - possibly failing with EAGAIN, but more
+probably arousing the Out-Of-Memory killer.
+
+If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
+and MADV_UNMERGEABLE simply fail with EINVAL.  If the running kernel was
+built with CONFIG_KSM=y, those calls will normally succeed: even if
+the KSM daemon is not currently running, MADV_MERGEABLE still registers
+the range for whenever the KSM daemon is started; even if the range
+cannot contain any pages which KSM could actually merge; even if
+MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
+
+Like other madvise calls, they are intended for use on mapped areas of
+the user address space: they will report ENOMEM if the specified range
+includes unmapped gaps (though working on the intervening mapped areas),
+and might fail with EAGAIN if not enough memory for internal structures.
+
+Applications should be considerate in their use of MADV_MERGEABLE,
+restricting its use to areas likely to benefit.  KSM's scans may use
+a lot of processing power, and its kernel-resident pages are a limited
+resource.  Some installations will disable KSM for these reasons.
+
+The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/,
+readable by all but writable only by root:
+
+max_kernel_pages - set to maximum number of kernel pages that KSM may use
+                   e.g. "echo 2000 > /sys/kernel/mm/ksm/max_kernel_pages"
+                   Value 0 imposes no limit on the kernel pages KSM may use;
+                   but note that any process using MADV_MERGEABLE can cause
+                   KSM to allocate these pages, unswappable until it exits.
+                   Default: 2000 (chosen for demonstration purposes)
+
+pages_to_scan    - how many present pages to scan before ksmd goes to sleep
+                   e.g. "echo 200 > /sys/kernel/mm/ksm/pages_to_scan"
+                   Default: 200 (chosen for demonstration purposes)
+
+sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
+                   e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
+                   Default: 20 (chosen for demonstration purposes)
+
+run              - set 0 to stop ksmd from running but keep merged pages,
+                   set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
+                   set 2 to stop ksmd and unmerge all pages currently merged,
+                         but leave mergeable areas registered for next run
+                   Default: 1 (for immediate use by apps which register)
+
+The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
+
+pages_shared     - how many shared unswappable kernel pages KSM is using
+pages_sharing    - how many more sites are sharing them i.e. how much saved
+pages_unshared   - how many pages unique but repeatedly checked for merging
+pages_volatile   - how many pages changing too fast to be placed in a tree
+full_scans       - how many times all mergeable areas have been scanned
+
+A high ratio of pages_sharing to pages_shared indicates good sharing, but
+a high ratio of pages_unshared to pages_sharing indicates wasted effort.
+pages_volatile embraces several different kinds of activity, but a high
+proportion there would also indicate poor use of madvise MADV_MERGEABLE.
+
+Izik Eidus,
+Hugh Dickins, 30 July 2009
--- ksm10/mm/Kconfig	2009-08-01 05:02:09.000000000 +0100
+++ ksm11/mm/Kconfig	2009-08-02 13:50:57.000000000 +0100
@@ -224,6 +224,7 @@ config KSM
 	  the many instances by a single resident page with that content, so
 	  saving memory until one or another app needs to modify the content.
 	  Recommended for use with KVM, or with other duplicative applications.
+	  See Documentation/vm/ksm.txt for more information.
 
 config DEFAULT_MMAP_MIN_ADDR
         int "Low address space to protect from user allocation"

^ permalink raw reply	[flat|nested] 106+ messages in thread
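To make the madvise interface documented above concrete, here is a small
hypothetical user of MADV_MERGEABLE.  The fallback #define is an assumption
(the value matches the asm-generic madvise numbering used with this series);
on a kernel without CONFIG_KSM the calls fail with EINVAL, exactly as ksm.txt
says.

/* sketch: register two identical anonymous areas with KSM */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE	12	/* assumed value; see asm-generic mman-common.h */
#endif

int main(void)
{
	size_t len = 256 * getpagesize();
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (a == MAP_FAILED || b == MAP_FAILED)
		return 1;

	memset(a, 0x5a, len);			/* identical content in both areas */
	memset(b, 0x5a, len);

	if (madvise(a, len, MADV_MERGEABLE) ||
	    madvise(b, len, MADV_MERGEABLE)) {
		perror("madvise(MADV_MERGEABLE)");	/* EINVAL without CONFIG_KSM */
		return 1;
	}

	pause();	/* leave the areas in place while ksmd scans them */
	return 0;
}

With run=1 (the default from the previous patch), pages_shared and
pages_sharing in /sys/kernel/mm/ksm/ should rise within a few scan batches of
starting this.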

* [PATCH 12/12] ksm: remove VM_MERGEABLE_FLAGS
  2009-08-03 12:08 ` Hugh Dickins
@ 2009-08-03 12:22   ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 12:22 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

KSM originally stood for Kernel Shared Memory: but the kernel has long
supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
something else.  So we switched to saying "merge" instead of "share".

But Chris Wright points out that this is confusing where mmap.c merges
adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
is_mergeable_vma() to let vmas be merged despite flags being different.

Call it VMA_MERGE_DESPITE_FLAGS?  Perhaps, but at present it consists
only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
that directly, with a comment on it in is_mergeable_vma().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---
This patch got lost along the way last time: no big deal but try again.

 mm/mmap.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- ksm11/mm/mmap.c	2009-08-02 13:50:41.000000000 +0100
+++ ksm12/mm/mmap.c	2009-08-02 13:51:04.000000000 +0100
@@ -660,9 +660,6 @@ again:			remove_next = 1 + (end > next->
 	validate_mm(mm);
 }
 
-/* Flags that can be inherited from an existing mapping when merging */
-#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
-
 /*
  * If the vma has a ->close operation then the driver probably needs to release
  * per-vma resources, so we don't attempt to merge those.
@@ -670,7 +667,8 @@ again:			remove_next = 1 + (end > next->
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 			struct file *file, unsigned long vm_flags)
 {
-	if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
+	/* VM_CAN_NONLINEAR may get set later by f_op->mmap() */
+	if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
 		return 0;
 	if (vma->vm_file != file)
 		return 0;

^ permalink raw reply	[flat|nested] 106+ messages in thread
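For anyone skimming the vm_flags arithmetic: the open-coded replacement
tolerates a difference in VM_CAN_NONLINEAR and nothing else.  A stand-alone
sketch of the same xor-and-mask test, with made-up flag values, behaves like
this:

/* sketch of the xor-and-mask test used by is_mergeable_vma() */
#include <stdio.h>

#define VM_READ			0x0001UL	/* illustrative values only */
#define VM_WRITE		0x0002UL
#define VM_CAN_NONLINEAR	0x0800UL

static int flags_allow_merge(unsigned long a, unsigned long b)
{
	/* flags may differ only in VM_CAN_NONLINEAR */
	return !((a ^ b) & ~VM_CAN_NONLINEAR);
}

int main(void)
{
	printf("%d\n", flags_allow_merge(VM_READ | VM_WRITE,
			VM_READ | VM_WRITE | VM_CAN_NONLINEAR));	/* 1 */
	printf("%d\n", flags_allow_merge(VM_READ, VM_READ | VM_WRITE));	/* 0 */
	return 0;
}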

* Re: [PATCH 1/12] ksm: rename kernel_pages_allocated
  2009-08-03 12:10   ` Hugh Dickins
@ 2009-08-03 14:21     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-03 14:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> We're not implementing swapping of KSM pages in its first release;
> but when that follows, "kernel_pages_allocated" will be a very poor
> name for the sysfs file showing number of nodes in the stable tree:
> rename that to "pages_shared" throughout.
>
> But we already have a "pages_shared", counting those page slots
> sharing the shared pages: first rename that to... "pages_sharing".
>
> What will become of "max_kernel_pages" when the pages shared can
> be swapped?  I guess it will just be removed, so keep that name.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>
>   
ACK.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 2/12] ksm: move pages_sharing updates
  2009-08-03 12:11   ` Hugh Dickins
@ 2009-08-03 14:34     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-03 14:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> The pages_shared count is incremented and decremented when adding a node
> to and removing a node from the stable tree: easy to understand.  But the
> pages_sharing count was hard to follow, being adjusted in various places:
> increment and decrement it when adding to and removing from the stable tree.
>
> And the pages_sharing variable used to include the pages_shared, then those
> were subtracted when shown in the pages_sharing sysfs file: now keep it as
> an exclusive count of leaves hanging off the stable tree nodes, throughout.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>   
ACK (code is simpler that way).

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 3/12] ksm: pages_unshared and pages_volatile
  2009-08-03 12:11   ` Hugh Dickins
@ 2009-08-03 14:54     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-03 14:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> The pages_shared and pages_sharing counts give a good picture of how
> successful KSM is at sharing; but no clue to how much wasted work it's
> doing to get there.  Add pages_unshared (count of unique pages waiting
> in the unstable tree, hoping to find a mate) and pages_volatile.
>
> pages_volatile is harder to define.  It includes those pages changing
> too fast to get into the unstable tree, but also whatever other edge
> conditions prevent a page getting into the trees: a high value may
> deserve investigation.  Don't try to calculate it from the various
> conditions: it's the total of rmap_items less those accounted for.
>
> Also show full_scans: the number of completed scans of everything
> registered in the mm list.
>
> The locking for all these counts is simply ksm_thread_mutex.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>  mm/ksm.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 51 insertions(+), 1 deletion(-)
>
> --- ksm2/mm/ksm.c	2009-08-02 13:49:43.000000000 +0100
> +++ ksm3/mm/ksm.c	2009-08-02 13:49:51.000000000 +0100
> @@ -155,6 +155,12 @@ static unsigned long ksm_pages_shared;
>  /* The number of page slots additionally sharing those nodes */
>  static unsigned long ksm_pages_sharing;
>  
> +/* The number of nodes in the unstable tree */
> +static unsigned long ksm_pages_unshared;
> +
> +/* The number of rmap_items in use: to calculate pages_volatile */
> +static unsigned long ksm_rmap_items;
> +
>  /* Limit on the number of unswappable pages used */
>  static unsigned long ksm_max_kernel_pages;
>  
> @@ -204,11 +210,17 @@ static void __init ksm_slab_free(void)
>  
>  static inline struct rmap_item *alloc_rmap_item(void)
>  {
> -	return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> +	struct rmap_item *rmap_item;
> +
> +	rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> +	if (rmap_item)
> +		ksm_rmap_items++;
> +	return rmap_item;
>  }
>  
>  static inline void free_rmap_item(struct rmap_item *rmap_item)
>  {
> +	ksm_rmap_items--;
>  	rmap_item->mm = NULL;	/* debug safety */
>  	kmem_cache_free(rmap_item_cache, rmap_item);
>  }
> @@ -419,6 +431,7 @@ static void remove_rmap_item_from_tree(s
>  		BUG_ON(age > 2);
>  		if (!age)
>  			rb_erase(&rmap_item->node, &root_unstable_tree);
> +		ksm_pages_unshared--;
>  	}
>  
>  	rmap_item->address &= PAGE_MASK;
> @@ -1002,6 +1015,7 @@ static struct rmap_item *unstable_tree_s
>  	rb_link_node(&rmap_item->node, parent, new);
>  	rb_insert_color(&rmap_item->node, &root_unstable_tree);
>  
> +	ksm_pages_unshared++;
>  	return NULL;
>  }
>  
> @@ -1098,6 +1112,8 @@ static void cmp_and_merge_page(struct pa
>  		if (!err) {
>  			rb_erase(&tree_rmap_item->node, &root_unstable_tree);
>  			tree_rmap_item->address &= ~NODE_FLAG;
> +			ksm_pages_unshared--;
> +
>  			/*
>  			 * If we fail to insert the page into the stable tree,
>  			 * we will have 2 virtual addresses that are pointing
> @@ -1481,6 +1497,37 @@ static ssize_t pages_sharing_show(struct
>  }
>  KSM_ATTR_RO(pages_sharing);
>  
> +static ssize_t pages_unshared_show(struct kobject *kobj,
> +				   struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lu\n", ksm_pages_unshared);
> +}
> +KSM_ATTR_RO(pages_unshared);
> +
> +static ssize_t pages_volatile_show(struct kobject *kobj,
> +				   struct kobj_attribute *attr, char *buf)
> +{
> +	long ksm_pages_volatile;
> +
> +	ksm_pages_volatile = ksm_rmap_items - ksm_pages_shared
> +				- ksm_pages_sharing - ksm_pages_unshared;
> +	/*
> +	 * It was not worth any locking to calculate that statistic,
> +	 * but it might therefore sometimes be negative: conceal that.
> +	 */
> +	if (ksm_pages_volatile < 0)
> +		ksm_pages_volatile = 0;
> +	return sprintf(buf, "%ld\n", ksm_pages_volatile);

ACK.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 4/12] ksm: break cow once unshared
  2009-08-03 12:12   ` Hugh Dickins
@ 2009-08-03 16:00     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-03 16:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> We kept agreeing not to bother about the unswappable shared KSM pages
> which later become unshared by others: observation suggests they're not
> a significant proportion.  But they are disadvantageous, and it is easier
> to break COW to replace them by swappable pages, than offer statistics
> to show that they don't matter; then we can stop worrying about them.
>
> Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on
> this pass: give them a good chance of getting into the unstable tree
> on the next pass, or back into the stable, by computing checksum now.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
ACK

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 1/12] ksm: rename kernel_pages_allocated
  2009-08-03 14:21     ` Izik Eidus
@ 2009-08-03 16:48       ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-03 16:48 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Hugh Dickins, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

On Mon, Aug 03, 2009 at 05:21:50PM +0300, Izik Eidus wrote:
> Hugh Dickins wrote:
> > We're not implementing swapping of KSM pages in its first release;
> > but when that follows, "kernel_pages_allocated" will be a very poor
> > name for the sysfs file showing number of nodes in the stable tree:
> > rename that to "pages_shared" throughout.
> >
> > But we already have a "pages_shared", counting those page slots
> > sharing the shared pages: first rename that to... "pages_sharing".
> >
> > What will become of "max_kernel_pages" when the pages shared can
> > be swapped?  I guess it will just be removed, so keep that name.
> >
> > Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > ---
> >
> >
> >   
> ACK.

Not sure if it's worth doing the subtraction between the number of user
mappings pointing to the ksm_pages_shared.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 2/12] ksm: move pages_sharing updates
  2009-08-03 12:11   ` Hugh Dickins
@ 2009-08-03 16:53     ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-03 16:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

On Mon, Aug 03, 2009 at 01:11:00PM +0100, Hugh Dickins wrote:
> The pages_shared count is incremented and decremented when adding a node
> to and removing a node from the stable tree: easy to understand.  But the
> pages_sharing count was hard to follow, being adjusted in various places:
> increment and decrement it when adding to and removing from the stable tree.
> 
> And the pages_sharing variable used to include the pages_shared, then those
> were subtracted when shown in the pages_sharing sysfs file: now keep it as
> an exclusive count of leaves hanging off the stable tree nodes, throughout.
> 
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>

If we stick to the subtraction semantics (which I think is less
intuitive for users, as they need to understand more of the ksm code to
figure out what it means) then sure, ack...

I don't see the big deal of just printing the total number of ksm pages
in the stable tree, and the actual _total_ number of userland mappings
that are mapping those.  The subtraction to see the actual sharing,
i.e. the difference between the two numbers, can be done by the user
itself.

But then I'm fine if we stick to the subtraction logic, this is a
minor detail, I just usually prefer "raw" values. (if removing the
inc/dec is beneficial at runtime as it seems then doing an addition
will provide the info I would find more intuitive in a more efficient
way than before)

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 5/12] ksm: keep quiet while list empty
  2009-08-03 12:14   ` Hugh Dickins
@ 2009-08-03 16:55     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-03 16:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> ksm_scan_thread already sleeps in wait_event_interruptible until setting
> ksm_run activates it; but if there's nothing on its list to look at, i.e.
> nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking
> up system time and full_scans: ksmd_should_run added to check that too.
>
> And move the mutex_lock out around it: the new counts showed that when
> ksm_run is stopped, a little work often got done afterwards, because it
> had been read before taking the mutex.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>  mm/ksm.c |   28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
>
> --- ksm4/mm/ksm.c	2009-08-02 13:49:59.000000000 +0100
> +++ ksm5/mm/ksm.c	2009-08-02 13:50:07.000000000 +0100
> @@ -1287,21 +1287,27 @@ static void ksm_do_scan(unsigned int sca
>  	}
>  }
>  
> +static int ksmd_should_run(void)
> +{
> +	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
> +}
> +
>  static int ksm_scan_thread(void *nothing)
>  {
>  	set_user_nice(current, 5);
>  
>  	while (!kthread_should_stop()) {
> -		if (ksm_run & KSM_RUN_MERGE) {
> -			mutex_lock(&ksm_thread_mutex);
> +		mutex_lock(&ksm_thread_mutex);
> +		if (ksmd_should_run())
>  			ksm_do_scan(ksm_thread_pages_to_scan);
> -			mutex_unlock(&ksm_thread_mutex);
> +		mutex_unlock(&ksm_thread_mutex);
> +
> +		if (ksmd_should_run()) {
>  			schedule_timeout_interruptible(
>  				msecs_to_jiffies(ksm_thread_sleep_millisecs));
>  		} else {
>  			wait_event_interruptible(ksm_thread_wait,
> -					(ksm_run & KSM_RUN_MERGE) ||
> -					kthread_should_stop());
> +				ksmd_should_run() || kthread_should_stop());
>  		}
>  	}
>  	return 0;
> @@ -1346,10 +1352,16 @@ int ksm_madvise(struct vm_area_struct *v
>  
>  int __ksm_enter(struct mm_struct *mm)
>  {
> -	struct mm_slot *mm_slot = alloc_mm_slot();
> +	struct mm_slot *mm_slot;
> +	int needs_wakeup;
> +
> +	mm_slot = alloc_mm_slot();
>  	if (!mm_slot)
>  		return -ENOMEM;
>  
> +	/* Check ksm_run too?  Would need tighter locking */
> +	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
> +
>  	spin_lock(&ksm_mmlist_lock);
>  	insert_to_mm_slots_hash(mm, mm_slot);
>  	/*
> @@ -1361,6 +1373,10 @@ int __ksm_enter(struct mm_struct *mm)
>  	spin_unlock(&ksm_mmlist_lock);
>  
>  	set_bit(MMF_VM_MERGEABLE, &mm->flags);
> +
> +	if (needs_wakeup)
> +		wake_up_interruptible(&ksm_thread_wait);
> +
>  	return 0;
>  }
>  
>   
ACK

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 2/12] ksm: move pages_sharing updates
  2009-08-03 16:53     ` Andrea Arcangeli
@ 2009-08-03 17:34       ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-03 17:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

On Mon, 3 Aug 2009, Andrea Arcangeli wrote:
> 
> If we stick to the subtraction semantics (that I think for users is
> less intuitive as they need to understand more of the ksm code to
> figure out what it means) sure ack...
> 
> I don't see the big deal of just printing total number of ksm pages in
> stable tree, and the actual _total_ number of userland mappings that
> are mapping those. The subtraction to see the actual sharing that is
> the difference between the two numbers, can be done by the user
> itself.

Yes, I know just what you mean.  When I first came to this, I rather
disliked that subtraction (though with this patch we have no actual
subtraction - but as you indicate, yes, that's an accident of the
internal implementation).  And it gets in the way of calculating
the ratio of ptes serviced to pages used.

But something I always find bothersome with /proc/meminfo is the
uncertainty about which numbers are included in which other numbers,
and which are exclusive.

So once I'd come to add the pages_unshared and pages_volatile,
I was really appreciating that these numbers are all simply
exclusive.

> 
> But then I'm fine if we stick to the substraction logic, this is a
> minor detail, I just usually prefer "raw" values. (if removing the
> inc/dec is beneficial at runtime as it seems then doing an addition
> will provide the info I would find more intuitive in  a more efficient
> way than before)

If you're also okay to stick with the subtraction logic, I think
let's continue to stick with it; but certainly it's something we
can easily change before 2.6.32, if people keep questioning it.

(I've tried to make it clear in ksm.txt, but people are likely
to question which way it is, whichever way it is - but it's
easier to say "they're all exclusive" than "this includes that,
but the others are exclusive").

> 
> Acked-by: Andrea Arcangeli <aarcange@redhat.com>

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread
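For anyone scripting against these counters, the all-exclusive semantics keep
the arithmetic simple: the number of unswappable pages KSM uses is
pages_shared, and the total number of ptes those pages service is
pages_shared + pages_sharing, so the ratio Hugh mentions falls out of one
addition and one division.  A hypothetical snippet reading the sysfs files
(paths as in ksm.txt) could compute it, together with the wasted-effort ratio
from ksm.txt:

/* sketch: derive sharing ratios from the exclusive KSM counters */
#include <stdio.h>

static long knob(const char *name)
{
	char path[128];
	long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	long shared = knob("pages_shared");
	long sharing = knob("pages_sharing");
	long unshared = knob("pages_unshared");

	if (shared)
		printf("ptes serviced per KSM page: %.2f\n",
		       (double)(shared + sharing) / shared);
	if (sharing)
		printf("unshared pages per sharing pte: %.2f\n",
		       (double)unshared / sharing);
	return 0;
}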

* Re: [PATCH 6/12] ksm: five little cleanups
  2009-08-03 12:15   ` Hugh Dickins
@ 2009-08-04 12:41     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 12:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> 1. We don't use __break_cow entry point now: merge it into break_cow.
> 2. remove_all_slot_rmap_items is just a special case of
>    remove_trailing_rmap_items: use the latter instead.
> 3. Extend comment on unmerge_ksm_pages and rmap_items.
> 4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
>    instead of duplicating its code; and so swap them around.
> 5. Comment on cmp_and_merge_page described last year's: update it.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>
>   
Acked-by: Izik Eidus <ieidus@redhat.com>


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 7/12] ksm: fix endless loop on oom
  2009-08-03 12:16   ` Hugh Dickins
@ 2009-08-04 12:55     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 12:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
> only be a problem for ksmd when a memory control group imposes limits
> (normally the OOM killer will kill others with an mm until it succeeds);
> but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
> do need to route the error (or kill) back to the caller (or sighandling).
>
> Test signal_pending in unmerge_ksm_pages, which could be a lengthy
> procedure if it has to spill into swap: returning -ERESTARTSYS so that
> trivial signals will restart but fatals will terminate (is that right?
> we do different things in different places in mm, none exactly this).
>
> unmerge_and_remove_all_rmap_items was forgetting to lock when going
> down the mm_list: fix that.  Whether it's successful or not, reset
> ksm_scan cursor to head; but only if it's successful, reset seqnr
> (shown in full_scans) - page counts will have gone down to zero.
>
> This patch leaves a significant OOM deadlock, but it's a good step
> on the way, and that deadlock is fixed in a subsequent patch.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>
>   
Better than before for sure, and I don't have in mind a better yet
simple solution for the "failing to break the pages" case than to just
wait and catch them in the next scan, so ACK.

Acked-by: Izik Eidus <ieidus@redhat.com>
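
For reference, the unmerge loop shape described in the changelog is
roughly this (simplified, not the exact mm/ksm.c code):

	static int unmerge_ksm_pages(struct vm_area_struct *vma,
				     unsigned long start, unsigned long end)
	{
		unsigned long addr;
		int err = 0;

		for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
			if (signal_pending(current))
				err = -ERESTARTSYS;	/* trivial signals restart, fatal ones terminate */
			else
				err = break_ksm(vma, addr);	/* may fault a page in, so may hit OOM */
		}
		return err;
	}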


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 8/12] ksm: distribute remove_mm_from_lists
  2009-08-03 12:17   ` Hugh Dickins
@ 2009-08-04 13:03     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 13:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> Do some housekeeping in ksm.c, to help make the next patch easier
> to understand: remove the function remove_mm_from_lists, distributing
> its code to its callsites scan_get_next_rmap_item and __ksm_exit.
>
> That turns out to be a win in scan_get_next_rmap_item: move its
> remove_trailing_rmap_items and cursor advancement up, and it becomes
> simpler than before.  __ksm_exit becomes messier, but will change
> again; and moving its remove_trailing_rmap_items up lets us strengthen
> the unstable tree item's age condition in remove_rmap_item_from_tree.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>   
Acked-by: Izik Eidus <ieidus@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 9/12] ksm: fix oom deadlock
  2009-08-03 12:18   ` Hugh Dickins
@ 2009-08-04 19:32     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 19:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> There's a now-obvious deadlock in KSM's out-of-memory handling:
> imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
> trying to allocate a page to break KSM in an mm which becomes the
> OOM victim (quite likely in the unmerge case): it's killed and goes
> to exit, and hangs there waiting to acquire ksm_thread_mutex.
>
> Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
> though that made everything else: perhaps use mmap_sem somehow?
> And part of the answer lies in the comments on unmerge_ksm_pages:
> __ksm_exit should also leave all the rmap_item removal to ksmd.
>
> But there's a fundamental problem, that KSM relies upon mmap_sem to
> guarantee the consistency of the mm it's dealing with, yet exit_mmap
> tears down an mm without taking mmap_sem.  And bumping mm_users won't
> help at all, that just ensures that the pages the OOM killer assumes
> are on their way to being freed will not be freed.
>
> The best answer seems to be, to move the ksm_exit callout from just
> before exit_mmap, to the middle of exit_mmap: after the mm's pages
> have been freed (if the mmu_gather is flushed), but before its page
> tables and vma structures have been freed; and down_write,up_write
> mmap_sem there to serialize with KSM's own reliance on mmap_sem.
>
> But KSM then needs to be careful, whenever it downs mmap_sem, to
> check that the mm is not already exiting: there's a danger of using
> find_vma on a layout that's being torn apart, or writing into page
> tables which have been freed for reuse; and even do_anonymous_page
> and __do_fault need to check they're not being called by break_ksm
> to reinstate a pte after zap_pte_range has zapped that page table.
>
> Though it might be clearer to add an exiting flag, set while holding
> mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
> a zapped pte.  All we need is to check whether mm_users is 0 - but
> must remember that ksmd may detect that before __ksm_exit is reached.
> So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
>
> __ksm_exit now has to leave clearing up the rmap_items to ksmd,
> that needs ksm_thread_mutex; but shift the exiting mm just after the
> ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
> mm_count to hold the mm_struct, ksmd's exit processing (exactly like
> its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
> similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
>
> But also give __ksm_exit a fast path: when there's no complication
> (no rmap_items attached to mm and it's not at the ksm_scan cursor),
> it can safely do all the exiting work itself.  This is not just an
> optimization: when ksmd is not running, the raised mm_count would
> otherwise leak mm_structs.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>   
Acked-by: Izik Eidus <ieidus@redhat.com>
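
For reference, the ksm_test_exit() check introduced here boils down to
something like the following (simplified), applied after taking mmap_sem
and before touching the mm:

	static inline bool ksm_test_exit(struct mm_struct *mm)
	{
		/*
		 * Once mm_users has reached 0, exit_mmap owns the mm:
		 * finding a vma or reinstating a pte would race with the
		 * final zap, so callers must bail out instead.
		 */
		return atomic_read(&mm->mm_users) == 0;
	}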

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/12] ksm: sysfs and defaults
  2009-08-03 12:19   ` Hugh Dickins
@ 2009-08-04 19:34     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 19:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
> to provide the /sys/kernel/mm/ksm files to tune and activate it.
>
> Make KSM depend on SYSFS?  Could do, but it might be better to provide
> some defaults so that KSM works out-of-the-box, ready for testers to
> madvise MADV_MERGEABLE, even without SYSFS.
>
> Though anyone serious is likely to want to retune the numbers to their
> taste once they have experience; and whether these settings ever reach
> 2.6.32 can be discussed along the way.  
>
> Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>   

Acked-By: Izik Eidus <ieidus@redhat.com>
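
Illustratively, the split being described looks something like this
(the tunable names match the sysfs files; the default values here are
only placeholders):

	static unsigned int ksm_thread_sleep_millisecs = 20;	/* placeholder default */
	static unsigned int ksm_thread_pages_to_scan = 100;	/* placeholder default */

	#ifdef CONFIG_SYSFS
	/*
	 * The /sys/kernel/mm/ksm/ attributes and their show/store handlers
	 * live under here, so a CONFIG_SYSFS=n build drops them entirely
	 * and runs with the built-in defaults above.
	 */
	#endif /* CONFIG_SYSFS */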

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 11/12] ksm: add some documentation
  2009-08-03 12:21   ` Hugh Dickins
@ 2009-08-04 19:35     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 19:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Michael Kerrisk, Randy Dunlap, linux-kernel,
	linux-mm

Hugh Dickins wrote:
> Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
> Cc: Randy Dunlap <randy.dunlap@oracle.com>
> ---
>   
Acked-by: Izik Eidus <ieidus@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 12/12] ksm: remove VM_MERGEABLE_FLAGS
  2009-08-03 12:22   ` Hugh Dickins
@ 2009-08-04 19:35     ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-04 19:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> KSM originally stood for Kernel Shared Memory: but the kernel has long
> supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
> something else.  So we switched to saying "merge" instead of "share".
>
> But Chris Wright points out that this is confusing where mmap.c merges
> adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
> is_mergeable_vma() to let vmas be merged despite flags being different.
>
> Call it VMA_MERGE_DESPITE_FLAGS?  Perhaps, but at present it consists
> only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
> that directly, with a comment on it in is_mergeable_vma().
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>   
Acked-by: Izik Eidus <ieidus@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 3/12] ksm: pages_unshared and pages_volatile
  2009-08-03 12:11   ` Hugh Dickins
@ 2009-08-04 21:49     ` Andrew Morton
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrew Morton @ 2009-08-04 21:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: ieidus, aarcange, riel, chrisw, nickpiggin, linux-kernel, linux-mm

On Mon, 3 Aug 2009 13:11:53 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> The pages_shared and pages_sharing counts give a good picture of how
> successful KSM is at sharing; but no clue to how much wasted work it's
> doing to get there.  Add pages_unshared (count of unique pages waiting
> in the unstable tree, hoping to find a mate) and pages_volatile.
> 
> pages_volatile is harder to define.  It includes those pages changing
> too fast to get into the unstable tree, but also whatever other edge
> conditions prevent a page getting into the trees: a high value may
> deserve investigation.  Don't try to calculate it from the various
> conditions: it's the total of rmap_items less those accounted for.
> 
> Also show full_scans: the number of completed scans of everything
> registered in the mm list.
> 
> The locking for all these counts is simply ksm_thread_mutex.
> 
> ...
>
>  static inline struct rmap_item *alloc_rmap_item(void)
>  {
> -	return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> +	struct rmap_item *rmap_item;
> +
> +	rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> +	if (rmap_item)
> +		ksm_rmap_items++;
> +	return rmap_item;
>  }

ksm_rmap_items was already available via /proc/slabinfo.  I guess that
wasn't a particularly nice user interface ;)


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 5/12] ksm: keep quiet while list empty
  2009-08-03 12:14   ` Hugh Dickins
@ 2009-08-04 21:59     ` Andrew Morton
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrew Morton @ 2009-08-04 21:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: ieidus, aarcange, riel, chrisw, nickpiggin, linux-kernel, linux-mm

On Mon, 3 Aug 2009 13:14:03 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> +		if (ksmd_should_run()) {
>  			schedule_timeout_interruptible(
>  				msecs_to_jiffies(ksm_thread_sleep_millisecs));
>  		} else {
>  			wait_event_interruptible(ksm_thread_wait,
> -					(ksm_run & KSM_RUN_MERGE) ||
> -					kthread_should_stop());
> +				ksmd_should_run() || kthread_should_stop());
>  		}

Yields


		if (ksmd_should_run()) {
			schedule_timeout_interruptible(
				msecs_to_jiffies(ksm_thread_sleep_millisecs));
		} else {
			wait_event_interruptible(ksm_thread_wait,
				ksmd_should_run() || kthread_should_stop());
		}

can it be something like

		wait_event_interruptible_timeout(ksm_thread_wait,
			ksmd_should_run() || kthread_should_stop(),
			msecs_to_jiffies(ksm_thread_sleep_millisecs));

?

That would also reduce the latency in responding to kthread_should_stop().

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 3/12] ksm: pages_unshared and pages_volatile
  2009-08-04 21:49     ` Andrew Morton
@ 2009-08-05 11:39       ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-05 11:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ieidus, aarcange, riel, chrisw, nickpiggin, linux-kernel, linux-mm

On Tue, 4 Aug 2009, Andrew Morton wrote:
> On Mon, 3 Aug 2009 13:11:53 +0100 (BST)
> Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> 
> > pages_volatile is harder to define.  It includes those pages changing
> > too fast to get into the unstable tree, but also whatever other edge
> > conditions prevent a page getting into the trees: a high value may
> > deserve investigation.  Don't try to calculate it from the various
> > conditions: it's the total of rmap_items less those accounted for.
...
> >  static inline struct rmap_item *alloc_rmap_item(void)
> >  {
> > -	return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> > +	struct rmap_item *rmap_item;
> > +
> > +	rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL);
> > +	if (rmap_item)
> > +		ksm_rmap_items++;
> > +	return rmap_item;
> >  }
> 
> ksm_rmap_items was already available via /proc/slabinfo.  I guess that
> wasn't a particularly nice user interface ;)

procfs is not a nice interface for sysfs to be reading
when it's asked to show pages_volatile!

And not even always available, I think: SLOB wouldn't be able to report
the number of objects of any particular type, SLUB would need slub_nomerge
(or a debug flag) to keep the kmem_cache separate; and even SLAB would
have to assemble numbers from different cpus and queues, I guess.
Easier and more reliable for KSM to do its own thing here.

I do agree that slabinfo or slqbinfo or /proc/slabinfo is a good enough
interface for checking up on the number of rmap_items in use; it's what
I was using most of the time.  I did once fleetingly wonder whether to
show the count of rmap_items under /sys/kernel/mm/ksm, but rejected it
as not interesting enough to deserve more than what slabinfo tells.

But here the rmap_item count is being used to deduce something more
interesting (though more obscure), and I don't want SL?B divergences
to perturb the resulting number more than it is already.  I did start
out without the rmap_items count, incrementing and decrementing
pages_volatile all over the place; but soon abandoned that in
favour of the difference calculation.
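
That difference calculation amounts to something like this (a sketch,
not the exact sysfs handler):

	long pages_volatile;

	pages_volatile = ksm_rmap_items - ksm_pages_shared
			 - ksm_pages_sharing - ksm_pages_unshared;
	/* counts move underneath us, so clamp any transient negative */
	if (pages_volatile < 0)
		pages_volatile = 0;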

But I think I'm taking you more seriously than you intended,
sorry for my humourlessness!

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 5/12] ksm: keep quiet while list empty
  2009-08-04 21:59     ` Andrew Morton
@ 2009-08-05 11:54       ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-05 11:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ieidus, aarcange, riel, chrisw, nickpiggin, linux-kernel, linux-mm

On Tue, 4 Aug 2009, Andrew Morton wrote:
> On Mon, 3 Aug 2009 13:14:03 +0100 (BST)
> Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> 
> > +		if (ksmd_should_run()) {
> >  			schedule_timeout_interruptible(
> >  				msecs_to_jiffies(ksm_thread_sleep_millisecs));
> >  		} else {
> >  			wait_event_interruptible(ksm_thread_wait,
> > -					(ksm_run & KSM_RUN_MERGE) ||
> > -					kthread_should_stop());
> > +				ksmd_should_run() || kthread_should_stop());
> >  		}
> 
> Yields

(Phew, for a moment I thought you were asking us to use yield() here.)

> 
> 
> 		if (ksmd_should_run()) {
> 			schedule_timeout_interruptible(
> 				msecs_to_jiffies(ksm_thread_sleep_millisecs));
> 		} else {
> 			wait_event_interruptible(ksm_thread_wait,
> 				ksmd_should_run() || kthread_should_stop());
> 		}
> 
> can it be something like
> 
> 		wait_event_interruptible_timeout(ksm_thread_wait,
> 			ksmd_should_run() || kthread_should_stop(),
> 			msecs_to_jiffies(ksm_thread_sleep_millisecs));
> 
> ?

I'd be glad to simplify what we have there, but I think your proposal
ends up doing exactly what we're trying to avoid, doesn't it?  Won't
it briefly wake up ksmd every ksm_thread_sleep_millisecs, even when
there's nothing for it to do?
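
For reference, the surrounding loop is roughly (scan step simplified):

	while (!kthread_should_stop()) {
		mutex_lock(&ksm_thread_mutex);
		if (ksmd_should_run())
			ksm_do_scan(ksm_thread_pages_to_scan);
		mutex_unlock(&ksm_thread_mutex);

		if (ksmd_should_run()) {
			/* between scan batches: just a timed nap */
			schedule_timeout_interruptible(
				msecs_to_jiffies(ksm_thread_sleep_millisecs));
		} else {
			/* nothing registered (or merging off): sleep with
			 * no periodic wakeups at all */
			wait_event_interruptible(ksm_thread_wait,
				ksmd_should_run() || kthread_should_stop());
		}
	}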

> 
> That would also reduce the latency in responding to kthread_should_stop().

That's not a high priority consideration.  So far as I can tell, the only
use for that test is at startup, if the sysfs_create_group mysteriously
fails.  It's mostly a leftover from when you could have CONFIG_KSM=m:

I did wonder whether to go back and add some SLAB_PANICs etc now,
but in the end I was either too lazy or too deferential to Izik's
fine error handling (you choose which to believe: both, actually).

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 3/12] ksm: pages_unshared and pages_volatile
  2009-08-05 11:39       ` Hugh Dickins
@ 2009-08-05 15:11         ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-05 15:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, ieidus, riel, chrisw, nickpiggin, linux-kernel, linux-mm

On Wed, Aug 05, 2009 at 12:39:06PM +0100, Hugh Dickins wrote:
> procfs is not a nice interface for sysfs to be reading
> when it's asked to show pages_volatile!

Agreed, that is the real reason, grabbing that info from
slub/slab/slob (not so much from procfs) would be tricky.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 9/12] ksm: fix oom deadlock
  2009-08-03 12:18   ` Hugh Dickins
@ 2009-08-25 14:58     ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-25 14:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

On Mon, Aug 03, 2009 at 01:18:16PM +0100, Hugh Dickins wrote:
> tables which have been freed for reuse; and even do_anonymous_page
> and __do_fault need to check they're not being called by break_ksm
> to reinstate a pte after zap_pte_range has zapped that page table.

This deadlocks exit_mmap in an infinite loop when there's some region
locked. mlock calls gup and pretends to page fault successfully if
there's a vma existing on the region, but it doesn't page fault
anymore because of the mm_count being 0 already, so follow_page fails
and gup retries the page fault forever. And generally I don't like to
add those checks to page fault fast path.

Given we check mm_users == 0 (ksm_test_exit) after taking mmap_sem in
unmerge_and_remove_all_rmap_items, why do we actually need to care
that a page fault happens? We hold mmap_sem so we're guaranteed to see
mm_users == 0 and we won't ever break COW on that mm with mm_users ==
0 so I think those troublesome checks from page fault can be simply
removed.
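
The retry in question, very much simplified from __get_user_pages():

	while (!(page = follow_page(vma, start, foll_flags))) {
		/*
		 * With the ksm_test_exit() checks in do_anonymous_page and
		 * __do_fault, the fault "succeeds" without instantiating a
		 * pte once the mm is exiting, so follow_page() keeps
		 * failing and munlock's gup spins here forever.
		 */
		ret = handle_mm_fault(mm, vma, start,
				      write ? FAULT_FLAG_WRITE : 0);
		if (ret & VM_FAULT_ERROR)
			break;		/* real fault errors do get out */
		cond_resched();
	}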

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 14:58     ` Andrea Arcangeli
@ 2009-08-25 15:22       ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-25 15:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, linux-kernel, linux-mm

From: Andrea Arcangeli <aarcange@redhat.com>

We can't stop page faults from happening during exit_mmap or munlock
fails. The fundamental issue is the absolute lack of serialization
after mm_users reaches 0. mmap_sem should be hot in the cache as we
just released it a few nanoseconds before in exit_mm, we just need to
take it one last time after mm_users is 0 to allow drivers to
serialize safely against it so that taking mmap_sem and checking
mm_users > 0 is enough for ksm to serialize against exit_mmap while
still noticing when oom killer or something else wants to release all
memory of the mm. When ksm notices it bails out and it allows memory
to be released.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/kernel/fork.c b/kernel/fork.c
index 9a16c21..f5af0d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -515,7 +515,18 @@ void mmput(struct mm_struct *mm)
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
+
+		/*
+		 * Allow drivers tracking mm without pinning mm_users
+		 * (so that mm_users is allowed to reach 0 while they
+		 * do their tracking) to serialize against exit_mmap
+		 * by taking mmap_sem and checking mm_users is still >
+		 * 0 before working on the mm they're tracking.
+		 */
+		down_read(&mm->mmap_sem);
+		up_read(&mm->mmap_sem);
 		exit_mmap(mm);
+
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
diff --git a/mm/memory.c b/mm/memory.c
index 4a2c60d..025431e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2603,7 +2603,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table) || ksm_test_exit(mm))
+	if (!pte_none(*page_table))
 		goto release;
 
 	inc_mm_counter(mm, anon_rss);
@@ -2753,7 +2753,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte) && !ksm_test_exit(mm))) {
+	if (likely(pte_same(*page_table, orig_pte))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
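
On the ksm side, the pattern this enables is sketched below (only a
sketch of what the mmput comment above describes, not actual mm/ksm.c
code):

	down_read(&mm->mmap_sem);
	if (atomic_read(&mm->mm_users) > 0) {
		/* mm still live: work on it (find_vma(), break_ksm(), ...) */
	} else {
		/* mm is exiting: bail out and let its memory be released */
	}
	up_read(&mm->mmap_sem);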

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [PATCH 9/12] ksm: fix oom deadlock
  2009-08-25 14:58     ` Andrea Arcangeli
@ 2009-08-25 17:35       ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-25 17:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> On Mon, Aug 03, 2009 at 01:18:16PM +0100, Hugh Dickins wrote:
> > tables which have been freed for reuse; and even do_anonymous_page
> > and __do_fault need to check they're not being called by break_ksm
> > to reinstate a pte after zap_pte_range has zapped that page table.
> 
> This deadlocks exit_mmap in an infinite loop when there's some region
> locked. mlock calls gup and pretends to page fault successfully if
> there's a vma existing on the region, but it doesn't page fault
> anymore because of the mm_count being 0 already, so follow_page fails
> and gup retries the page fault forever.

That's right.  Justin alerted me to this issue last night, and at first
I was utterly mystified (and couldn't reproduce).  But a look at the
.jpg in the Fedora bugzilla, and another look at KSM 9/12, brought
me to the same conclusion that you've reached.

The _right_ solution (without even knowing of this problem) is
coincidentally being discussed currently in a different thread,
"make munlock fast when mlock is canceled by sigkill".  It's just
idiotic that munlock (in this case, munlocking pages on exit) should
be trying to fault in pages, and that causes its own problems when
mlock of a large area goes OOM and invokes the OOM killer on itself
(the munlock hangs trying to fault in what the mlock failed to do:
at this instant I forget whether that deadlocks the system, or
causes the wrong processes to be killed - I've several other OOM
fixes to make).

I have now made a patch with munlock_vma_pages_range() doing a
follow_page() loop instead of faulting in; but I've not yet tested
it properly, and it's rather mixed up with three other topics
(a coredump GUP flag to __get_user_pages to govern the ZERO_PAGE
shortcut, instead of confused guesses; reinstating do_anonymous
ZERO_PAGE; cleaning away unnecessary GUP flags).  It's something
that will need exposure in mmotm before going any further, whereas
this ksm_test_exit() issue needs a safe fix quicker than that.
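
(Sketch of that munlock direction, assuming the usual mm/mlock.c
helpers: walk the range with follow_page() and munlock only what is
actually present, never faulting anything in.)

	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct page *page = follow_page(vma, addr, FOLL_GET);

		if (page && !IS_ERR(page)) {
			lock_page(page);
			munlock_vma_page(page);
			unlock_page(page);
			put_page(page);
		}
		cond_resched();
	}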

I was pondering what to do when you wrote in.

> And generally I don't like to add those checks to page fault fast path.

I'd prefer not to have them too, but haven't yet worked out how to
get along safely without them.

> 
> Given we check mm_users == 0 (ksm_test_exit) after taking mmap_sem in
> unmerge_and_remove_all_rmap_items, why do we actually need to care
> that a page fault happens? We hold mmap_sem so we're guaranteed to see
> mm_users == 0 and we won't ever break COW on that mm with mm_users ==
> 0 so I think those troublesome checks from page fault can be simply
> removed.

break_ksm called from madvise(,,MADV_UNMERGEABLE) does have down_write
of mmap_sem.  break_ksm called from "echo 2 >/sys/kernel/mm/ksm/run"
has down_read of mmap_sem (taken in unmerge_and_remove_all_rmap_items).
break_ksm called from any of ksmd's break_cows has down_read of mmap_sem.

But the mmap_sem is not enough to exclude the mm exiting
(until __ksm_exit does its little down_write,up_write dance):
break_cow etc. do the ksm_test_exit check on mm_users before
proceeding any further, but that's just not enough to prevent
break_ksm's handle_pte_fault racing with exit_mmap - hence the
ksm_test_exits in mm/memory.c, to stop ptes being instantiated
after the final zap thinks it's wiped the pagetables.

Let's look at your actual patch...

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 9/12] ksm: fix oom deadlock
  2009-08-25 17:35       ` Hugh Dickins
@ 2009-08-25 17:47         ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-25 17:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, Aug 25, 2009 at 06:35:56PM +0100, Hugh Dickins wrote:
> "make munlock fast when mlock is canceled by sigkill".  It's just
> idiotic that munlock (in this case, munlocking pages on exit) should
> be trying to fault in pages, and that causes its own problems when

I also pondered whether to address the thing by fixing automatic munlock,
but then I think that, just as it's asking for trouble to cause page
faults with mm_users == 0 in munlock, it's also asking for trouble to
cause page faults with mm_users == 0 in ksm. So if munlock is wrong,
ksm was also wrong, and I tried to fix ksm not to do that, while
leaving the munlock fix for later/others.. ;)

> I have now made a patch with munlock_vma_pages_range() doing a
> follow_page() loop instead of faulting in; but I've not yet tested

That is a separate problem in my view.

> I'd prefer not to have them too, but haven't yet worked out how to
> get along safely without them.

ok.

> But the mmap_sem is not enough to exclude the mm exiting
> (until __ksm_exit does its little down_write,up_write dance):
> break_cow etc. do the ksm_test_exit check on mm_users before
> proceeding any further, but that's just not enough to prevent
> break_ksm's handle_pte_fault racing with exit_mmap - hence the
> ksm_test_exits in mm/memory.c, to stop ptes being instantiated
> after the final zap thinks it's wiped the pagetables.
> 
> Let's look at your actual patch...

I tried to work out how to get along safely without them; in short, my
patch makes the mmap_sem plus the ksm_test_exit check on mm_users before
proceeding any further "enough" (while still allowing the ksm loop to bail
out if mm_users suddenly reaches zero because of the oom killer).

Furthermore the mmap_sem is already guaranteed L1-hot and exclusive,
because we wrote to it a few nanoseconds before calling mmput (to be
fair, locked ops are not cheap, but I'd rather add two locked ops to the
last exit syscall of a thread group than a new branch to every single
page fault, as there are tons more page faults than exit syscalls).
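In code, the pattern I have in mind is roughly this (a sketch only: exit_side and
tracker_side are illustrative names, not real functions, and whether a plain
down_read is enough here is exactly what's in question):

/* exit side: in mmput(), once mm_users has dropped to zero */
static void exit_side(struct mm_struct *mm)
{
	down_read(&mm->mmap_sem);	/* wait out any tracker holding it */
	up_read(&mm->mmap_sem);
	exit_mmap(mm);
}

/* tracker side: ksm (or a similar driver) working on someone else's mm */
static void tracker_side(struct mm_struct *mm)
{
	down_read(&mm->mmap_sem);
	if (atomic_read(&mm->mm_users) == 0) {	/* i.e. ksm_test_exit() */
		up_read(&mm->mmap_sem);
		return;		/* bail out so exit_mmap can proceed */
	}
	/* ... safe to walk vmas and break COW under mmap_sem ... */
	up_read(&mm->mmap_sem);
}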

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 15:22       ` Andrea Arcangeli
@ 2009-08-25 17:49         ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-25 17:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> We can't stop page faults from happening during exit_mmap or munlock
> fails. The fundamental issue is the absolute lack of serialization
> after mm_users reaches 0. mmap_sem should be hot in the cache as we
> just released it a few nanoseconds before in exit_mm, we just need to
> take it one last time after mm_users is 0 to allow drivers to
> serialize safely against it so that taking mmap_sem and checking
> mm_users > 0 is enough for ksm to serialize against exit_mmap while
> still noticing when oom killer or something else wants to release all
> memory of the mm. When ksm notices it bails out and it allows memory
> to be released.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9a16c21..f5af0d3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -515,7 +515,18 @@ void mmput(struct mm_struct *mm)
>  
>  	if (atomic_dec_and_test(&mm->mm_users)) {
>  		exit_aio(mm);
> +
> +		/*
> +		 * Allow drivers tracking mm without pinning mm_users
> +		 * (so that mm_users is allowed to reach 0 while they
> +		 * do their tracking) to serialize against exit_mmap
> +		 * by taking mmap_sem and checking mm_users is still >
> +		 * 0 before working on the mm they're tracking.
> +		 */
> +		down_read(&mm->mmap_sem);
> +		up_read(&mm->mmap_sem);

Sorry, I just don't get it.  How does down_read here help?
Perhaps you thought ksm.c had down_write of mmap_sem in all cases?

No, and I don't think we want to change its down_reads to down_writes.
Nor do we want to change your down_read here to down_write; that will
just reintroduce the OOM deadlock that 9/12 was about solving.

(If this does work, and I'm just missing it, then I think we'd want a
ksm_prep_exit or something to make them conditional on MMF_VM_MERGEABLE.)
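Roughly like this, as a sketch only - ksm_prep_exit is a made-up name here, and
mmput() would then call it just before exit_mmap(mm) instead of taking mmap_sem
unconditionally for every exiting mm:

/* hypothetical helper, name as suggested above */
static inline void ksm_prep_exit(struct mm_struct *mm)
{
	if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
		down_read(&mm->mmap_sem);	/* the dance from your patch */
		up_read(&mm->mmap_sem);
	}
}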

Hugh

>  		exit_mmap(mm);
> +
>  		set_mm_exe_file(mm, NULL);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
> diff --git a/mm/memory.c b/mm/memory.c
> index 4a2c60d..025431e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2603,7 +2603,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> -	if (!pte_none(*page_table) || ksm_test_exit(mm))
> +	if (!pte_none(*page_table))
>  		goto release;
>  
>  	inc_mm_counter(mm, anon_rss);
> @@ -2753,7 +2753,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * handle that later.
>  	 */
>  	/* Only go through if we didn't race with anybody else... */
> -	if (likely(pte_same(*page_table, orig_pte) && !ksm_test_exit(mm))) {
> +	if (likely(pte_same(*page_table, orig_pte))) {
>  		flush_icache_page(vma, page);
>  		entry = mk_pte(page, vma->vm_page_prot);
>  		if (flags & FAULT_FLAG_WRITE)

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 17:49         ` Hugh Dickins
@ 2009-08-25 18:10           ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-25 18:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, Aug 25, 2009 at 06:49:09PM +0100, Hugh Dickins wrote:
> Sorry, I just don't get it.  How does down_read here help?
> Perhaps you thought ksm.c had down_write of mmap_sem in all cases?
> 
> No, and I don't think we want to change its down_reads to down_writes.

Looking at ksm.c, it should have been down_write indeed...

> Nor do we want to change your down_read here to down_write, that will
> just reintroduce the OOM deadlock that 9/12 was about solving.

I'm not sure anymore that I get what this fix is about... mm_users is
allowed to go to 0. If mm_users is allowed to go to 0, it's up to ksm
to check inside its inner loops that mm_users is 0 and bail
out. Bailing out will unblock exit so that exit_mmap can run. What
exactly is the unfixable issue?
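By bailing out I mean something like this simplified unmerge loop (a sketch, not
the exact ksm.c code):

static int unmerge_ksm_pages(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	unsigned long addr;
	int err = 0;

	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
		if (ksm_test_exit(vma->vm_mm))
			break;	/* mm_users == 0: stop and let exit_mmap run */
		err = break_ksm(vma, addr);	/* breaks COW, may allocate */
	}
	return err;
}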

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 18:10           ` Andrea Arcangeli
@ 2009-08-25 18:58             ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-25 18:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> On Tue, Aug 25, 2009 at 06:49:09PM +0100, Hugh Dickins wrote:
> Looking ksm.c it should have been down_write indeed...
> 
> > Nor do we want to change your down_read here to down_write, that will
> > just reintroduce the OOM deadlock that 9/12 was about solving.
> 
> I'm not sure anymore I get what this fix is about...

Yes, it's easy to drop one end of the string while picking up the other ;)

And it wouldn't be exactly the same deadlock, but similar.
The original deadlock that 9/12 was about was:
    There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

Whereas with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem)
just before calling exit_mmap(), the deadlock comes on mmap_sem
instead: the exiting OOM-killed task waiting there (for break_cow
or the like to up_read mmap_sem), before it has freed any memory
to allow break_cow etc. to proceed.

> mm_users is
> allowed to go to 0. If mm_users is allowed to go to 0, it's up to ksm
> to check inside its inner loops that mm_users is 0 and bail
> out. Bailing out it will unblock exit so that exit_mmap can run.

Yes, but one of those checks that mm_users is 0 has to lie below
handle_mm_fault, because mm_users may go to 0 and exit_mmap proceed
while one of handle_pte_fault's helpers is waiting to allocate a page
(for example; but SMP could race anywhere).  Hence ksm_test_exit()s
in mm/memory.c.

(And as I remarked in the 9/12 comments, it's no use bumping up
mm_users in break_ksm, say, though that would be a normal thing to
do: that just ensures the memory we'd be waiting for cannot be freed.)

> What exactly is the unfixable issue?

Oh, there's no unfixable issue,
just an issue we've not yet found the right fix for ;)

The idea I'm currently playing with, which would fix one of your objections
but violate another, is to remove the ksm_test_exit()s from mm/memory.c,
allow KSM to racily fault in too late, but observe mm_users 0 afterwards
and zap it then.

I agree with you that it seems _wrong_ for KSM to fault into an area
being exited, which was why the ksm_test_exit()s; but the neatest
answer might turn out to be to allow it to do so after all.
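Very roughly, the shape of break_cow I'm imagining (an untested sketch, not a
patch):

static void break_cow(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma || vma->vm_start > addr)
		goto out;
	handle_mm_fault(mm, vma, addr, FAULT_FLAG_WRITE);
	/*
	 * If the mm went into exit_mmap while we were faulting, undo
	 * whatever pte we may just have instantiated, instead of
	 * refusing the fault down in mm/memory.c.
	 */
	if (ksm_test_exit(mm))
		zap_page_range(vma, addr, PAGE_SIZE, NULL);
out:
	up_read(&mm->mmap_sem);
}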

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 18:58             ` Hugh Dickins
@ 2009-08-25 19:45               ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-25 19:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Tue, Aug 25, 2009 at 07:58:43PM +0100, Hugh Dickins wrote:
> On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> > On Tue, Aug 25, 2009 at 06:49:09PM +0100, Hugh Dickins wrote:
> > Looking ksm.c it should have been down_write indeed...
> > 
> > > Nor do we want to change your down_read here to down_write, that will
> > > just reintroduce the OOM deadlock that 9/12 was about solving.
> > 
> > I'm not sure anymore I get what this fix is about...
> 
> Yes, it's easy to drop one end of the string while picking up the other ;)
> 
> And it wouldn't be exactly the same deadlock, but similar.
> The original deadlock that 9/12 was about was:
>     There's a now-obvious deadlock in KSM's out-of-memory handling:
>     imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
>     trying to allocate a page to break KSM in an mm which becomes the
>     OOM victim (quite likely in the unmerge case): it's killed and goes
>     to exit, and hangs there waiting to acquire ksm_thread_mutex.

Yes, I see that; that was before ksm was capable of noticing that it
was looping indefinitely, triggering COW (allocating memory) on a mm
with mm_users == 0 selected by the OOM killer for release. Not true
anymore now that ksm_test_exit has been introduced in the KSM inner
paths... I mean that part of the fix is enough.

> Whereas with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem)
> just before calling exit_mmap(), the deadlock comes on mmap_sem
> instead: the exiting OOM-killed task waiting there (for break_cow
> or the like to up_read mmap_sem), before it has freed any memory
> to allow break_cow etc. to proceed.

The whole difference is that now KSM will notice that mm_users is
already zero and it will release the mmap_sem promptly allowing
exit_mmap to run...

> Yes, but one of those checks that mm_users is 0 has to be lie below
> handle_mm_fault, because mm_users may go to 0 and exit_mmap proceed
> while one of handle_pte_fault's helpers is waiting to allocate a page
> (for example; but SMP could race anywhere).  Hence ksm_test_exit()s
> in mm/memory.c.

Hmm, but you're trying here to perfect something that doesn't need to
be perfected... and that is a generic issue that always happens with
the OOM killer. It doesn't make any difference whether it's KSM or the
application that triggered a page fault on the MM. If mmap_sem is held
in read mode by a regular application page fault while the OOM killer
fires, the exit_mmap routine will not run until the page fault is
complete. The SMP race anywhere is the reason the OOM killer has to
stop a moment before killing a second task, to give the task a chance
to run exit_mmap...

> (And as I remarked in the 9/12 comments, it's no use bumping up
> mm_users in break_ksm, say, though that would be a normal thing to
> do: that just ensures the memory we'd be waiting for cannot be freed.)

Yes, that would also prevent KSM from noticing that the OOM killer
selected the mm for release. Well, unless we check against mm_users ==
1, which only works as long as only ksm does that and no other driver
similar to KSM does ;) so it's not a real solution...

> just an issue we've not yet found the right fix for ;)

I think you already did the right fix in simply doing ksm_test_exit
inside the KSM inner loops, and adding as well a dummy
down_write;up_write in the ksm_exit case where rmap_items still exist on
the mm_slot that is exiting. But there was no need to actually teach
the page faults to bail out in order to react immediately to the OOM
killer (the task itself will not react immediately); and second,
ksm_exit with its serializing down_write should be moved back before
exit_mmap, where it will have the same effect as my previous patch with
down_write (s/read/write) just before exit_mmap.

> The idea I'm currently playing with, would fix one of your objections
> but violate another, is to remove the ksm_test_exit()s from mm/memory.c,
> allow KSM to racily fault in too late, but observe mm_users 0 afterwards
> and zap it then.

;)

> I agree with you that it seems _wrong_ for KSM to fault into an area
> being exited, which was why the ksm_test_exit()s; but the neatest
> answer might turn out to be to allow it to do so after all.

Hmm, no... I think it's definitely asking for trouble. I would agree
with you if an immediate reaction to the OOM killer actually provided
any benefit, but I don't see the benefit, and this makes exit_mmap
simpler, and it avoids messing with tlb_gather and putting a
definitive stop on KSM before pagetables are freed.

I did this new patch; what do you think? And any further change in the
anti-oom-deadlock area, if still needed, should reside in ksm.c.

--------
From: Andrea Arcangeli <aarcange@redhat.com>

Allowing page faults triggered by drivers tracking the mm during
exit_mmap with mm_users already zero is asking for trouble. And we
can't stop page faults from happening during exit_mmap or munlock fails
(munlock had also better stop triggering page faults with mm_users zero).

ksm_exit, if there are rmap_items still chained on this mm slot, will
take mmap_sem write side, preventing ksm from continuing to work on a mm
while exit_mmap runs. And ksm will bail out as soon as it notices that
mm_users is already zero, thanks to the ksm_test_exit checks. So when a
task is killed by the OOM killer or the user, ksm will not indefinitely
prevent it from running exit_mmap and releasing its memory.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 2d64ff3..0e26de6 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,8 +18,7 @@ struct mmu_gather;
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
-void __ksm_exit(struct mm_struct *mm,
-		struct mmu_gather **tlbp, unsigned long end);
+void __ksm_exit(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -41,11 +40,10 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm,
-			    struct mmu_gather **tlbp, unsigned long end)
+static inline void ksm_exit(struct mm_struct *mm)
 {
 	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
-		__ksm_exit(mm, tlbp, end);
+		__ksm_exit(mm);
 }
 
 /*
@@ -86,8 +84,7 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm,
-			    struct mmu_gather **tlbp, unsigned long end)
+static inline void ksm_exit(struct mm_struct *mm)
 {
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9a16c21..6f93809 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -515,6 +515,7 @@ void mmput(struct mm_struct *mm)
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
+		ksm_exit(mm);
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/ksm.c b/mm/ksm.c
index d03627f..329ebe9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1428,8 +1428,7 @@ int __ksm_enter(struct mm_struct *mm)
 	return 0;
 }
 
-void __ksm_exit(struct mm_struct *mm,
-		struct mmu_gather **tlbp, unsigned long end)
+void __ksm_exit(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
 	int easy_to_free = 0;
@@ -1462,10 +1461,8 @@ void __ksm_exit(struct mm_struct *mm,
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
 		mmdrop(mm);
 	} else if (mm_slot) {
-		tlb_finish_mmu(*tlbp, 0, end);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
-		*tlbp = tlb_gather_mmu(mm, 1);
 	}
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4a2c60d..025431e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2603,7 +2603,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table) || ksm_test_exit(mm))
+	if (!pte_none(*page_table))
 		goto release;
 
 	inc_mm_counter(mm, anon_rss);
@@ -2753,7 +2753,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte) && !ksm_test_exit(mm))) {
+	if (likely(pte_same(*page_table, orig_pte))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
diff --git a/mm/mmap.c b/mm/mmap.c
index 1b0a709..f3f2a22 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2111,13 +2111,6 @@ void exit_mmap(struct mm_struct *mm)
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	/*
-	 * For KSM to handle OOM without deadlock when it's breaking COW in a
-	 * likely victim of the OOM killer, we must serialize with ksm_exit()
-	 * after freeing mm's pages but before freeing its page tables.
-	 */
-	ksm_exit(mm, &tlb, end);
-
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 19:45               ` Andrea Arcangeli
@ 2009-08-26 16:18                 ` Justin M. Forbes
  -1 siblings, 0 replies; 106+ messages in thread
From: Justin M. Forbes @ 2009-08-26 16:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Izik Eidus, Rik van Riel, Chris Wright,
	Nick Piggin, Andrew Morton, linux-kernel, linux-mm

On Tue, 2009-08-25 at 21:45 +0200, Andrea Arcangeli wrote:
> On Tue, Aug 25, 2009 at 07:58:43PM +0100, Hugh Dickins wrote:
> > On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> > > On Tue, Aug 25, 2009 at 06:49:09PM +0100, Hugh Dickins wrote:
> > > Looking ksm.c it should have been down_write indeed...
> > > 
> > > > Nor do we want to change your down_read here to down_write, that will
> > > > just reintroduce the OOM deadlock that 9/12 was about solving.
> > > 
> > > I'm not sure anymore I get what this fix is about...
> > 
> > Yes, it's easy to drop one end of the string while picking up the other ;)
> > 
> > And it wouldn't be exactly the same deadlock, but similar.
> > The original deadlock that 9/12 was about was:
> >     There's a now-obvious deadlock in KSM's out-of-memory handling:
> >     imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
> >     trying to allocate a page to break KSM in an mm which becomes the
> >     OOM victim (quite likely in the unmerge case): it's killed and goes
> >     to exit, and hangs there waiting to acquire ksm_thread_mutex.
> 
> Yes I see that, that was before ksm was capable of noticing that it
> was looping indefinitely triggering COW (allocating memory) on a mm
> with mm_users == 0 selected by the OOM killer for release. Not true
> anymore after ksm_test_exit is introduced in the KSM inner paths... I
> mean that part of the fix is enough.
> 
> > Whereas with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem)
> > just before calling exit_mmap(), the deadlock comes on mmap_sem
> > instead: the exiting OOM-killed task waiting there (for break_cow
> > or the like to up_read mmap_sem), before it has freed any memory
> > to allow break_cow etc. to proceed.
> 
> The whole difference is that now KSM will notice that mm_users is
> already zero and it will release the mmap_sem promptly allowing
> exit_mmap to run...
> 
> > Yes, but one of those checks that mm_users is 0 has to be lie below
> > handle_mm_fault, because mm_users may go to 0 and exit_mmap proceed
> > while one of handle_pte_fault's helpers is waiting to allocate a page
> > (for example; but SMP could race anywhere).  Hence ksm_test_exit()s
> > in mm/memory.c.
> 
> Hmm but you're trying here to perfect something that isn't needed to
> be perfected... and that is a generic issue that always happens with
> the OOM killer. I doesn't make any difference if it's KSM or the
> application that triggered a page fault on the MM. If mmap_sem is hold
> in read mode by a regular application page fault while OOM killer
> fires, the exit_mmap routine will not run until the page fault is
> complete. The SMP race anywhere is the reason the OOM killer has to
> stop a moment before killing a second task to give a chance to the
> task to run exit_mmap...
> 
> > (And as I remarked in the 9/12 comments, it's no use bumping up
> > mm_users in break_ksm, say, though that would be a normal thing to
> > do: that just ensures the memory we'd be waiting for cannot be freed.)
> 
> Yes, that would also prevent KSM to notice that the OOM killer
> selected the mm for release. Well unless we check against mm_users ==
> 1, which only works as only as only ksm does that and no other driver
> similar to KSM ;) so it's not a real solution...
> 
> > just an issue we've not yet found the right fix for ;)
> 
> I think you already did the right fix in simply doing ksm_test_exit
> inside the KSM inner loops and adding as well a dummy
> down_write;up_write in the ksm_exit case where rmap_items exists on
> the mm_slot that is exiting. But there was no need of actually
> teaching the page faults to bail out to react immediately to the OOM
> killer (the task itself will not react immediately) and second
> ksm_exit with its serializing down_write should be moved back before
> exit_mmap and it will have the same effect of my previous patch with
> down_write (s/read/write) just before exit_mmap.
> 
> > The idea I'm currently playing with, would fix one of your objections
> > but violate another, is to remove the ksm_test_exit()s from mm/memory.c,
> > allow KSM to racily fault in too late, but observe mm_users 0 afterwards
> > and zap it then.
> 
> ;)
> 
> > I agree with you that it seems _wrong_ for KSM to fault into an area
> > being exited, which was why the ksm_test_exit()s; but the neatest
> > answer might turn out to be to allow it to do so after all.
> 
> Hmm no... I think it's definitely asking for troubles, I would agree
> with you if an immediate reaction to OOM killer would actually provide
> any benefit, but I don't see the benefit, and this makes exit_mmap
> simpler, and it avoids messing with tlb_gather and putting a
> definitive stop on KSM before pagetables are freed.
> 
> I did this new patch what you think? And any further change in the
> anti-oom-deadlock area if still needed, should reside on ksm.c.
> 
> --------
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allowing page faults triggered by drivers tracking the mm during
> exit_mmap with mm_users already zero is asking for troubles. And we
> can't stop page faults from happening during exit_mmap or munlock fails
> (munlock also better stop triggering page faults with mm_users zero).
> 
> ksm_exit if there are rmap_items still chained on this mm slot, will
> take mmap_sem write side so preventing ksm to keep working on a mm while
> exit_mmap runs. And ksm will bail out as soon as it notices that
> mm_users is already zero thanks to the ksm_test_exit checks. So that
> when a task is killed by OOM killer or the user, ksm will not
> indefinitely prevent it to run exit_mmap and release its memory. 
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 

After review and testing, this patch passes my tests with KSM enabled.

Acked-by: Justin M. Forbes <jforbes@redhat.com>



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-25 19:45               ` Andrea Arcangeli
@ 2009-08-26 19:17                 ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-26 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

Before getting down to details, let me say I'm giving your patch an
Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>

On Tue, 25 Aug 2009, Andrea Arcangeli wrote:
> On Tue, Aug 25, 2009 at 07:58:43PM +0100, Hugh Dickins wrote:
> > The original deadlock that 9/12 was about was:
> >     There's a now-obvious deadlock in KSM's out-of-memory handling:
> >     imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
> >     trying to allocate a page to break KSM in an mm which becomes the
> >     OOM victim (quite likely in the unmerge case): it's killed and goes
> >     to exit, and hangs there waiting to acquire ksm_thread_mutex.
> 
> Yes I see that, that was before ksm was capable of noticing that it
> was looping indefinitely triggering COW (allocating memory) on a mm
> with mm_users == 0 selected by the OOM killer for release. Not true
> anymore after ksm_test_exit is introduced in the KSM inner paths... I
> mean that part of the fix is enough.

Not enough, actually - but it would be fair to conclude that I'm being
too much of a perfectionist here, and that your patch easily fixes a
much more important and prevalent issue than I'm trying to address.

Plus it strips out complications introduced by 9/12 which we'd all
be glad to do without if we can - and would deserve a patch of its
own to reintroduce if necessary later.

I've now tried several tests and combinations of patches (with and
without your patch, with and without OOM killer patches I worked on
when testing KSM, and found how often the wrong process is killed).

The case I tested last night, where your patch works fine, is
when a process balloons itself up to OOMing point with an madvise
MADV_UNMERGEABLE: it correctly detects, not mm_users == 0, but the 
pending SIGKILL, and backs out to the point where it gets killed
and all is well.

The case I didn't try last night is doing KSM_RUN_UNMERGE (which
behaves as could happen when ksmd is COWing, though a testcase for
that would be much harder to create).  With your patch, that does
not deadlock as I suggest above, but it cannot proceed without the
OOM killer killing the wrong process, i.e. the next candidate after
the mergeable process I'm imagining, which now gets ballooned up to
OOMing point.

The OOM killer first selects the right candidate, the process being
ballooned by unmerging; but while break_ksm is trying to get memory
to COW another page in it, holding down_read of mmap_sem, the killed
process reaches the down_write of mmap_sem and hangs there, before
it has freed the memory wanted (because you repositioned ksm_exit).

Page allocation has no way to notice that mm_users is 0 for the mm
being faulted into, nor that there's a TIF_MEMDIE on that task.
But it's not a deadlock, because p->mm is NULL now, so next time
around the OOM killer will skip it (I somewhat disagree with that
behaviour, but that's another story) and select another candidate
to kill, and that should free up a page to let break_ksm get out
to the next mm_users 0 test.

If you say that the OOM killer very often selects the wrong candidate
anyway, so what's the big deal here, I'd have to agree with you: I'd
been hoping to clean up some of that, but until I'm satisfied with and
sent in patches for that, maybe I'm silly to worry about this KSM case.

Certainly we need to worry more, and more urgently, about the hang
I've introduced to Rawhide, which your patch fixes.  (Side note:
I'm not for a moment saying we don't need to fix the KSM end, but
it does seem strange to me that this issue is so easily reproducible
on Rawhide with just an mlockall(MCL_CURRENT|MCL_FUTURE) program.
I couldn't get it with that, had to mmap beyond EOF then ftruncate
up to get it to happen; and nobody has reported the issue with mmotm.
Has anyone looked at why Rawhide's mlockall is not faulting in the
pages, I wonder if there's a separate bug there?)
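
For the record, the program in question is nothing more than this,
give or take error checking:

#include <sys/mman.h>

int main(void)
{
	mlockall(MCL_CURRENT | MCL_FUTURE);
	return 0;	/* exit_mmap then munlocks the VM_LOCKED areas */
}

and my variant added an mmap beyond EOF plus an ftruncate up before
the mlockall.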

> 
> > Whereas with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem)
> > just before calling exit_mmap(), the deadlock comes on mmap_sem
> > instead: the exiting OOM-killed task waiting there (for break_cow
> > or the like to up_read mmap_sem), before it has freed any memory
> > to allow break_cow etc. to proceed.
> 
> The whole difference is that now KSM will notice that mm_users is
> already zero and it will release the mmap_sem promptly allowing
> exit_mmap to run...

No, not while it's down inside page allocation.

> 
> > Yes, but one of those checks that mm_users is 0 has to be lie below
> > handle_mm_fault, because mm_users may go to 0 and exit_mmap proceed
> > while one of handle_pte_fault's helpers is waiting to allocate a page
> > (for example; but SMP could race anywhere).  Hence ksm_test_exit()s
> > in mm/memory.c.
> 
> Hmm but you're trying here to perfect something that isn't needed to
> be perfected... and that is a generic issue that always happens with
> the OOM killer. I doesn't make any difference if it's KSM or the
> application that triggered a page fault on the MM. If mmap_sem is hold
> in read mode by a regular application page fault while OOM killer
> fires, the exit_mmap routine will not run until the page fault is
> complete. The SMP race anywhere is the reason the OOM killer has to
> stop a moment before killing a second task to give a chance to the
> task to run exit_mmap...

I think you're imagining the MADV_UNMERGEABLE case which I first
tested: what happens when a process does it to itself.  The problem
arises when one process (ksmd or "echo 2 >run") does it to another.

> 
> > (And as I remarked in the 9/12 comments, it's no use bumping up
> > mm_users in break_ksm, say, though that would be a normal thing to
> > do: that just ensures the memory we'd be waiting for cannot be freed.)
> 
> Yes, that would also prevent KSM to notice that the OOM killer
> selected the mm for release. Well unless we check against mm_users ==
> 1, which only works as only as only ksm does that and no other driver
> similar to KSM ;) so it's not a real solution...
> 
> > just an issue we've not yet found the right fix for ;)
> 
> I think you already did the right fix in simply doing ksm_test_exit
> inside the KSM inner loops and adding as well a dummy
> down_write;up_write in the ksm_exit case where rmap_items exists on
> the mm_slot that is exiting. But there was no need of actually
> teaching the page faults to bail out to react immediately to the OOM
> killer (the task itself will not react immediately) and second
> ksm_exit with its serializing down_write should be moved back before
> exit_mmap and it will have the same effect of my previous patch with
> down_write (s/read/write) just before exit_mmap.
> 
> > The idea I'm currently playing with, would fix one of your objections
> > but violate another, is to remove the ksm_test_exit()s from mm/memory.c,
> > allow KSM to racily fault in too late, but observe mm_users 0 afterwards
> > and zap it then.
> 
> ;)
> 
> > I agree with you that it seems _wrong_ for KSM to fault into an area
> > being exited, which was why the ksm_test_exit()s; but the neatest
> > answer might turn out to be to allow it to do so after all.
> 
> Hmm no... I think it's definitely asking for troubles, I would agree
> with you if an immediate reaction to OOM killer would actually provide
> any benefit, but I don't see the benefit, and this makes exit_mmap
> simpler, and it avoids messing with tlb_gather and putting a
> definitive stop on KSM before pagetables are freed.

I still like the idea of that solution (if ksm_test_exit, zap_page_range
in break_ksm); however, I believe it too hits a problem with exit_mmap's
munlocking - since I didn't have any serialization there, it couldn't
know which side of the munlocking it is, so would be liable to do the
wrong thing with respect to VM_LOCKED areas, one way or the other.

But you don't like that approach at all, hmm.  It sounds like we'll
have a fight if I try either that or to reintroduce the ksm_test_exits
in memory.c, once the munlock faulting is eliminated.  Well, I'll give
it more thought: your patch is a lot better than the status quo,
and should go in for now - thanks.

> 
> I did this new patch what you think? And any further change in the
> anti-oom-deadlock area if still needed, should reside on ksm.c.
> 
> --------
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allowing page faults triggered by drivers tracking the mm during
> exit_mmap with mm_users already zero is asking for troubles. And we
> can't stop page faults from happening during exit_mmap or munlock fails
> (munlock also better stop triggering page faults with mm_users zero).
> 
> ksm_exit if there are rmap_items still chained on this mm slot, will
> take mmap_sem write side so preventing ksm to keep working on a mm while
> exit_mmap runs. And ksm will bail out as soon as it notices that
> mm_users is already zero thanks to the ksm_test_exit checks. So that
> when a task is killed by OOM killer or the user, ksm will not
> indefinitely prevent it to run exit_mmap and release its memory. 
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I disagree with quite a lot of your description, which doesn't even
mention the problem being fixed and how it is fixed; what drivers?
and ksm's ksm_test_exit checks are too high up to really help.
But this is a good patch until we've got a better...

Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 19:17                 ` Hugh Dickins
@ 2009-08-26 19:44                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-26 19:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Wed, Aug 26, 2009 at 08:17:50PM +0100, Hugh Dickins wrote:
> Has anyone looked at why Rawhide's mlockall is not faulting in the
> pages, I wonder if there's a separate bug there?)

I reproduced it only with a self-compiled mmotm kernel and fully
self-compiled userland, with just a mlockall and exit (not the Rawhide
kernel/userland), so there's certainly no bug in Rawhide, or at least
nothing special about it.

> No, not while it's down inside page allocation.

There's a slight difference if it's ksm inside page allocation and not
the task itself for its own reasons. See the TIF_MEMDIE check in
page_alloc.c: that won't trigger when it's ksm causing the page
fault. So that's the problem left to tackle to make the oom killer fully
happy with KSM unshare.
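
To spell it out (paraphrasing the allocator, not quoting it verbatim):
the check that lets an OOM victim allocate below the watermarks is made
on current, so it only fires for the victim's own faults:

	if (!in_interrupt() &&
	    ((current->flags & PF_MEMALLOC) ||
	     unlikely(test_thread_flag(TIF_MEMDIE))))
		alloc_flags |= ALLOC_NO_WATERMARKS;

test_thread_flag() looks at current's thread info, so when it is ksm
faulting into the victim mm, the victim's TIF_MEMDIE is never seen
there.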

> But you don't like that approach at all, hmm.  It sounds like we'll
> have a fight if I try either that or to reintroduce the ksm_test_exits

;) Well I'd rather have a more unfixable issue if we have to
reintroduce the mm_users check in the page faults.

All that is left to address is to teach page_alloc.c, in a second patch,
that the mm is going away. That might also help when it's aio triggering
gup page allocations, or other kernel threads using use_mm just like ksm,
and the oom killer has selected those "mm" for release.

Having ksm use use_mm before triggering the handle_mm_fault (so
tsk->mm points to the mm of the task), and adding an MMF_MEMDIE to
mm->flags checked by page_alloc, would work just fine and should solve
the double task kill... but then I'm unsure... this is just the first
idea I had.
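
Purely hypothetical sketch of where the pieces would go (MMF_MEMDIE
doesn't exist, the names are made up to illustrate the idea):

	/* oom_kill.c, when a victim mm is chosen: */
	set_bit(MMF_MEMDIE, &mm->flags);

	/* ksm.c, around break_cow()/handle_mm_fault(): */
	use_mm(mm);		/* current->mm now points at the victim mm */
	handle_mm_fault(mm, vma, addr, FAULT_FLAG_WRITE);
	unuse_mm(mm);

	/* page_alloc.c, next to the existing TIF_MEMDIE test: */
	if (current->mm && test_bit(MMF_MEMDIE, &current->mm->flags))
		alloc_flags |= ALLOC_NO_WATERMARKS;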

> in memory.c, once the munlock faulting is eliminated.  Well, I'll give
> it more thought: your patch is a lot better than the status quo,
> and should go in for now - thanks.

Ok, agreed!

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 19:44                   ` Andrea Arcangeli
@ 2009-08-26 19:57                     ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-26 19:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Wed, 26 Aug 2009, Andrea Arcangeli wrote:
> 
> All is left to address is to teach page_alloc.c that the mm is going
> away in a second patch. That might also help when it's aio triggering
> gup page allocations or other kernel threads with use_mm just like ksm
> and the oom killer selected those "mm" for release.
> 
> Having ksm using use_mm before triggering the handle_mm_fault (so
> tsk->mm points to the mm of the task) and adding a MMF_MEMDIE to
> mm->flags checked by page_alloc would work just fine and should solve
> the double task killed... but then I'm unsure.. this is just the first
> idea I had.

Yes, I began to have thoughts along those lines too as I was writing
my reply.  It is a different angle on the problem, I hadn't looked at
it that way before, and it does seem worth pursuing.  MMF_MEMDIE, yes,
that might be useful.  But KSM_RUN_UNMERGE wouldn't be able to use_mm
since it's coming from a normal user process - perhaps it should be a
kill-me-first like swapoff via PF_SWAPOFF.
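
The swapoff precedent, quoting the gist from memory rather than the
exact oom_kill.c lines: a task which has marked itself PF_SWAPOFF gets
the maximum badness score, so it is the first candidate to be killed:

	if (p->flags & PF_SWAPOFF)
		return ULONG_MAX;

The KSM_RUN_UNMERGE path could set an analogous kill-me-first flag
around its unmerge loop.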

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 19:57                     ` Hugh Dickins
@ 2009-08-26 20:28                       ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-26 20:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Wed, Aug 26, 2009 at 08:57:27PM +0100, Hugh Dickins wrote:
> that might be useful.  But KSM_RUN_UNMERGE wouldn't be able to use_mm
> since it's coming from a normal user process - perhaps it should be a
> kill-me-first like swapoff via PF_SWAPOFF.

That would sound just perfect if only there weren't also a break_cow in
the ksmd context that will trigger page allocation, as it can't
take over the KSM page like it would normally be guaranteed to do for a
cow on a regular anon page mapped readonly in the pte after read
swapin, for example. Still, for the echo 2 case kill-me-first definitely
makes sense, so maybe we should differentiate the two cases (ksmd and
sysfs).

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 19:44                   ` Andrea Arcangeli
@ 2009-08-26 20:29                     ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-26 20:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Wed, 26 Aug 2009, Andrea Arcangeli wrote:
> On Wed, Aug 26, 2009 at 08:17:50PM +0100, Hugh Dickins wrote:
> > Has anyone looked at why Rawhide's mlockall is not faulting in the
> > pages, I wonder if there's a separate bug there?)
> 
> I reproduced only with self compiled mmotm kernel with full self
> compiled userland with just a mlockall and exit (not rawhide
> kernel/userland) so there's certainly no bug in rawhide, or at least
> nothing special about it.

I just tried again and got it myself: the faulting page is a PROT_NONE
page of libc, yes, that figures: mlocking would not fault it in, but
munlocking would (in its current implementation) insist on faulting it.
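
A distilled way to hit the same thing, without depending on what libc
happens to map - my guess at a minimal reproducer, untested in this
exact form:

#include <sys/mman.h>

int main(void)
{
	/* a PROT_NONE page in an mlocked address space: mlockall cannot
	 * fault it in, but the munlock pass in exit_mmap (as currently
	 * implemented) insists on trying to fault it */
	mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mlockall(MCL_CURRENT | MCL_FUTURE);
	return 0;
}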

I don't know what my difficulty was yesterday: perhaps that page
isn't always PROT_NONE, perhaps I got confused and was testing the
wrong kernel, one without KSM or one with my follow_page munlock.

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 19:57                     ` Hugh Dickins
@ 2009-08-26 20:54                       ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-26 20:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Wed, 26 Aug 2009, Andrea Arcangeli wrote:
>   
>> All is left to address is to teach page_alloc.c that the mm is going
>> away in a second patch. That might also help when it's aio triggering
>> gup page allocations or other kernel threads with use_mm just like ksm
>> and the oom killer selected those "mm" for release.
>>
>> Having ksm using use_mm before triggering the handle_mm_fault (so
>> tsk->mm points to the mm of the task) and adding a MMF_MEMDIE to
>> mm->flags checked by page_alloc would work just fine and should solve
>> the double task killed... but then I'm unsure.. this is just the first
>> idea I had.
>>     
>
> Yes, I began to have thoughts along those lines too as I was writing
> my reply.  It is a different angle on the problem, I hadn't looked at
> it that way before, and it does seem worth pursuing.  MMF_MEMDIE, yes,
> that might be useful.  But KSM_RUN_UNMERGE wouldn't be able to use_mm
> since it's coming from a normal user process - perhaps it should be a
> kill-me-first like swapoff via PF_SWAPOFF.
>
> Hugh
>   
About the KSM case:
The oom killer should go after problematic processes, ones that allocate
a big amount of memory.
But then, as we now plan it to be, what might be a perfectly fine
application that used ksm and told it to stop merging its pages might be
what is considered a "bad application that needs to be killed".

Is this what we really want?

But before getting into this, why is it so important to break the ksm
pages when madvise(UNMERGEABLE) gets called?

When thinking about it, let's say I want to use ksm to scan 2
applications and merge their STATIC identical data, and then I want to
stop scanning them once I know ksm merged the pages: as soon as I try
to unregister these 2 applications ksm will unmerge the pages, so we
don't allow such a thing for the user (we can of course tell him in such
a case to use the normal way of sharing, so this isn't a really strong
case for this).

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 20:54                       ` Izik Eidus
@ 2009-08-26 21:14                         ` Andrea Arcangeli
  -1 siblings, 0 replies; 106+ messages in thread
From: Andrea Arcangeli @ 2009-08-26 21:14 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Hugh Dickins, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Wed, Aug 26, 2009 at 11:54:36PM +0300, Izik Eidus wrote:
> But before getting into this, why is it so important to break the ksm 
> pages when madvise(UNMERGEABLE) get called?

The moment ksm pages are swappable, there's no apparent reason why
anybody should ask the kernel to break any ksm page if the applications
themselves aren't writing to them in the first place (triggering
copy-on-write in app context which already handles TIF_MEMDIE just
fine).

In oom deadlock terms madvise(UNMERGEABLE) is the only place that is
100% fine at breaking KSM pages, because it runs with right tsk->mm
and page allocation will notice TIF_MEMDIE set on tsk.

If we remove "echo 2", the only remaining "unsafe" spot is the break_cow
in ksmd context when memcmp fails, and similar spots during the scan.

> When thinking about it, lets say I want to use ksm to scan 2 
> applications and merged their STATIC identical data, and then i want to 
> stop scanning them after i know ksm merged the pages, as soon as i will 
> try to unregister this 2 applications ksm will unmerge the pages, so we 
> dont allow such thing for the user (we can tell him ofcurse for such 
> case to use normal way of sharing, so this isnt a really strong case for 
> this)

For the app it will be tricky to know when the pages are merged
though, right now it could only wait a "while"... so I don't really
see madvise(UNMERGEABLE) as useful regardless of how we implement
it... but then this goes beyond the scope of this bug because, as said,
madvise(UNMERGEABLE) is the only place that breaks ksm pages as safely
as a regular write fault in oom context, because it runs in the
process context (not echo 2 or ksmd context).

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 21:14                         ` Andrea Arcangeli
@ 2009-08-26 21:49                           ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-26 21:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 26, 2009 at 11:54:36PM +0300, Izik Eidus wrote:
>   
>> But before getting into this, why is it so important to break the ksm 
>> pages when madvise(UNMERGEABLE) get called?
>>     
>
> The moment ksm pages are swappable, there's no apparent reason why
> anybody should ask the kernel to break any ksm page if the application
> themselfs aren't writing to them in the first place (triggering
> copy-on-write in app context which already handles TIF_MEMDIE just
> fine).
>   

I think I am the one to blame for breaking the ksm pages
when running unmergeable (if I remember right),
but I think Hugh had a good case for why we want to keep it...? (if I
remember right again...)

> In oom deadlock terms madvise(UNMERGEABLE) is the only place that is
> 100% fine at breaking KSM pages, because it runs with right tsk->mm
> and page allocation will notice TIF_MEMDIE set on tsk.
>
> If we remove "echo 2" only remaining "unsafe" spot is the break_cow in
> kksmd context when memcmp fails and similar during the scan.
>
>   
I wasn't talking here about the bug..., I was talking about the behavior...
It is the feeling that the oom killer will kill applications calling into
UNMERGEABLE, even though this application shouldn't die, just because
it had a big amount of memory shared and it unmerged it at the wrong
time?...

But probably these thoughts have no end, and we are better off sticking
with something practical that can work clean and simple...

So what I think is this:
echo 2 is something we want in this version because we don't support
swapping of the shared pages, so we have to allow some way to break the
pages...

and echo 2 has to have UNMERGEABLE break the shared pages when its
madvise gets called...

So maybe it is just better to leave it like that?
>> When thinking about it, lets say I want to use ksm to scan 2 
>> applications and merged their STATIC identical data, and then i want to 
>> stop scanning them after i know ksm merged the pages, as soon as i will 
>> try to unregister this 2 applications ksm will unmerge the pages, so we 
>> dont allow such thing for the user (we can tell him ofcurse for such 
>> case to use normal way of sharing, so this isnt a really strong case for 
>> this)
>>     
>
> For the app it will be tricky to know when the pages are merged
> though, right now it could only wait a "while"... so I don't really
> see madvise(UNMERGEABLE) as useful regardless how we implement
> it... but then this goes beyond the scope of this bug because as said
> madvise(UNMERGEABLE) is the only place that breaks ksm pages as safe
> as regular write fault in oom context because of it running in the
> process context (not echo 2 or kksmd context).
>   
Yeah, I agree that this case was idiotic :). Actually I thought about
a case where the application gets a little bit more info, but leave it,
it is not worth it; traditional sharing is much better for such cases.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 21:14                         ` Andrea Arcangeli
@ 2009-08-26 22:00                           ` David Rientjes
  -1 siblings, 0 replies; 106+ messages in thread
From: David Rientjes @ 2009-08-26 22:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Hugh Dickins, Rik van Riel, Chris Wright,
	Nick Piggin, Andrew Morton, Justin M. Forbes, linux-kernel,
	linux-mm

On Wed, 26 Aug 2009, Andrea Arcangeli wrote:

> In oom deadlock terms madvise(UNMERGEABLE) is the only place that is
> 100% fine at breaking KSM pages, because it runs with right tsk->mm
> and page allocation will notice TIF_MEMDIE set on tsk.
> 

Be aware that the page allocator in Linus' git will not notice TIF_MEMDIE 
for current if that task is chosen for oom kill since alloc_flags are not 
updated for that particular allocation.  My patch in -mm,
mm-update-alloc_flags-after-oom-killer-has-been-called.patch, fixes that 
but is not yet merged (I assume it's on hold for 2.6.32?).

I'd hate for you to run into this in testing and spend time debugging it 
when the problem already has a fix.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-26 21:49                           ` Izik Eidus
@ 2009-08-27 19:11                             ` Hugh Dickins
  -1 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-27 19:11 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Thu, 27 Aug 2009, Izik Eidus wrote:
> Andrea Arcangeli wrote:
> > On Wed, Aug 26, 2009 at 11:54:36PM +0300, Izik Eidus wrote:
> >   
> > > But before getting into this, why is it so important to break the ksm
> > > pages when madvise(UNMERGEABLE) get called?
> > >     

Good question.

> >
> > The moment ksm pages are swappable, there's no apparent reason why
> > anybody should ask the kernel to break any ksm page if the application
> > themselfs aren't writing to them in the first place (triggering
> > copy-on-write in app context which already handles TIF_MEMDIE just
> > fine).
> >   
> 
> I think I am the one that should be blamed for breaking the ksm pages when
> running unmeregable (If I remember right),

No, I think it was me to blame for that: looking back at the /dev/ksm KSM
and your draft of madvise KSM, I don't see breaking on unmerge in either.

> but I think Hugh had a good case why we want to keep it... ? (If I remember
> right again...)

There were several reasons for adding it, but all rather weak.

The "good case" you're half-remembering is probably that if we didn't
break ksm when doing madvise MADV_UNMERGEABLE, we'd lose track of any
KSM pages in the vma we're removing VM_MERGEABLE from (since we only
ever scan VM_MERGEABLEs), and so would be liable to build up more and
more unswappable KSM pages, well beyond the limit which can be imposed
by max_kernel_pages.

It is important to keep that accounting right, at least for so long as
they're unswappable; but it amounts to only a weak case, because we
could perfectly well have two vm_flags, one to say actively try to
merge here, and another to say there might still be KSM pages here.
(Perhaps with some restructuring that could instead be driven from
the stable tree end, rather than through the mm_slots and vmas.)

Reasons for avoiding two vm_flags: simplicity; minimizing KSM
footprint outside of ksm.c; and... VM_MERGEABLE has selfishly taken
bit 31 of vm_flags, the next vm_flag is going to involve some thought
on the best way to expand it (on 32-bit arches).  It's not an atomic
field so it shouldn't be hard; we used to minimize the use of 64-bit
on 32-bit but maybe gcc's unsigned long long handling is pretty good
nowadays and it's no issue?  Or maybe the issue was always avoiding
64-bit arithmetic, no issue with bitflags?  I don't know myself.
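
(The flag in question, as this series defines it in mm.h - hence the
bit 31 remark above:

#define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */

so a second flag would mean expanding vm_flags on 32-bit arches.)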

Other reasons for breaking ksm when unmerging: my sense that we ought
to provide a way to undo whatever is done (useful when testing, but
does that make it worth the effort? particularly if much of the testing
goes into testing precisely that feature!); and a notion that people
worried about covert channels would want to be able to undo merging
absolutely (but would they have been using MADV_MERGEABLE in the
first place, even have KSM configured in and running? seems unlikely).

It may be that MADV_UNMERGEABLE isn't really needed (I think I even
admitted once that probably nobody would use it other than we testing
it).  Yet I hesitate to rip it out: somehow it still seems right to
have it in there.  Why did you have unregistering in the /dev/ksm KSM?

> 
> > In oom deadlock terms madvise(UNMERGEABLE) is the only place that is
> > 100% fine at breaking KSM pages, because it runs with right tsk->mm
> > and page allocation will notice TIF_MEMDIE set on tsk.
> >
> > If we remove "echo 2" only remaining "unsafe" spot is the break_cow in
> > kksmd context when memcmp fails and similar during the scan.
> >
> >   
> I didnt talk here about the bug..., I talked about the behavior...
> It is the feeling that the oom will kill applications calling into
> UNMERGEABLE, even thought this application shouldn't die, just because it had
> big amount of memory shared and it unmerged it in the wrong time?...

The OOM killer makes its choices based upon total_vm (and some other
things): doesn't even consider rss, and certainly not KSM sharing.
So whenever out-of-memory, the OOM killer might choose to kill a very
highly KSM-merged process, freeing very little memory, just because
it _appears_ big from the outside.  There's nothing special to the
unmerging case, other than that being a good way to require lots of
page allocations in a single system call.  I don't see anything
unfair about it being killed at the point of unmerging: much more
unfair that it be killed when highly merged.
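
To make that concrete, the gist of badness() from memory (not the
exact lines):

	points = mm->total_vm;	/* the score starts from virtual size... */
	/* ...then gets adjusted for children, runtime, nice level,
	 * capabilities and oom_adj - but never for rss, and never for
	 * how much of that total_vm is KSM-merged or otherwise shared */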

> 
> But probably this thoughts have no end, and we are better stick with something
> practical that can work clean and simple...
> 
> So what I think is this:
> echo 2 is something we want in this version beacuse we dont support swapping
> of the shared pages, so we got to allow some how to break the pages...
> 
> and echo 2 got to have UNMERGEABLE break the shared pages when its madvise get
> called...
> 
> So maybe it is just better to leave it like that?
> > > When thinking about it, lets say I want to use ksm to scan 2 applications
> > > and merged their STATIC identical data, and then i want to stop scanning
> > > them after i know ksm merged the pages, as soon as i will try to
> > > unregister this 2 applications ksm will unmerge the pages, so we dont
> > > allow such thing for the user (we can tell him ofcurse for such case to
> > > use normal way of sharing, so this isnt a really strong case for this)
> > >     
> >
> > For the app it will be tricky to know when the pages are merged
> > though, right now it could only wait a "while"... so I don't really
> > see madvise(UNMERGEABLE) as useful regardless how we implement
> > it... but then this goes beyond the scope of this bug because as said
> > madvise(UNMERGEABLE) is the only place that breaks ksm pages as safe
> > as regular write fault in oom context because of it running in the
> > process context (not echo 2 or kksmd context).
> >   
> Yea, I agree about that this case was idiotic :), Actually I thought about
> case where application get little bit more info, but leave it, it is not worth
> it, traditional sharing is much better for such cases.

It didn't seem idiotic to me, but I hadn't realized the ksmd timelapse
uncertainty Andrea points out.  Well, I'm not keen to change the way
it's working at present, but I do think you're right to question all
these aspects of unmerging.

Hugh

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
@ 2009-08-27 19:11                             ` Hugh Dickins
  0 siblings, 0 replies; 106+ messages in thread
From: Hugh Dickins @ 2009-08-27 19:11 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

On Thu, 27 Aug 2009, Izik Eidus wrote:
> Andrea Arcangeli wrote:
> > On Wed, Aug 26, 2009 at 11:54:36PM +0300, Izik Eidus wrote:
> >   
> > > But before getting into this, why is it so important to break the ksm
> > > pages when madvise(UNMERGEABLE) gets called?
> > >     

Good question.

> >
> > The moment ksm pages are swappable, there's no apparent reason why
> > anybody should ask the kernel to break any ksm page if the application
> > themselfs aren't writing to them in the first place (triggering
> > copy-on-write in app context which already handles TIF_MEMDIE just
> > fine).
> >   
> 
> I think I am the one that should be blamed for breaking the ksm pages when
> running unmergeable (if I remember right),

No, I think it was me to blame for that: looking back at the /dev/ksm KSM
and your draft of madvise KSM, I don't see breaking on unmerge in either.

> but I think Hugh had a good case for why we want to keep it...? (If I remember
> right again...)

There were several reasons for adding it, but all rather weak.

The "good case" you're half-remembering is probably that if we didn't
break ksm when doing madvise MADV_UNMERGEABLE, we'd lose track of any
KSM pages in the vma we're removing VM_MERGEABLE from (since we only
ever scan VM_MERGEABLEs), and so would be liable to build up more and
more unswappable KSM pages, well beyond the limit which can be imposed
by max_kernel_pages.

It is important to keep that accounting right, at least for so long as
they're unswappable; but it amounts to only a weak case, because we
could perfectly well have two vm_flags, one to say actively try to
merge here, and another to say there might still be KSM pages here.
(Perhaps with some restructuring that could instead be driven from
the stable tree end, rather than through the mm_slots and vmas.)
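
In code terms the ordering is the point -- roughly this shape (a sketch
of the idea, not a copy of mm/ksm.c, and it assumes the
unmerge_ksm_pages() helper from these patches):

	/* Sketch: the MADV_UNMERGEABLE side of the madvise handling. */
	static int unmerge_sketch(struct vm_area_struct *vma,
				  unsigned long start, unsigned long end,
				  unsigned long *vm_flags)
	{
		int err;

		if (!(*vm_flags & VM_MERGEABLE))
			return 0;			/* nothing to undo */

		/* Break (COW back) every KSM page in the range first... */
		err = unmerge_ksm_pages(vma, start, end);
		if (err)
			return err;			/* e.g. oom: give up */

		/*
		 * ...and only then stop scanning this vma.  Clearing the
		 * flag without breaking the pages would leave KSM pages
		 * we'd lose track of (we only ever scan VM_MERGEABLEs),
		 * liable to build up well beyond what max_kernel_pages
		 * is meant to bound.
		 */
		*vm_flags &= ~VM_MERGEABLE;
		return 0;
	}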

Reasons for avoiding two vm_flags: simplicity; minimizing KSM
footprint outside of ksm.c; and... VM_MERGEABLE has selfishly taken
bit 31 of vm_flags, so the next vm_flag is going to involve some thought
on the best way to expand it (on 32-bit arches).  It's not an atomic
field so it shouldn't be hard; we used to minimize the use of 64-bit
on 32-bit but maybe gcc's unsigned long long handling is pretty good
nowadays and it's no issue?  Or maybe the issue was always avoiding
64-bit arithmetic, no issue with bitflags?  I don't know myself.
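
For reference, this is the corner we're in (the second flag below is
purely hypothetical, just to show what widening would mean):

	/* include/linux/mm.h today: KSM took the last bit of a 32-bit word. */
	#define VM_MERGEABLE	0x80000000	/* bit 31: KSM may merge pages in this vma */

	/*
	 * A second flag ("KSM pages may still be here, though we no longer
	 * actively merge") would push vm_flags past 32 bits on 32-bit
	 * arches -- hypothetically something like:
	 *
	 *	typedef unsigned long long vm_flags_t;
	 *	#define VM_MAYHAVE_KSM	(1ULL << 32)
	 *
	 * vm_flags isn't manipulated atomically, so it's mostly a question
	 * of how well gcc handles 64-bit bitflags on 32-bit.
	 */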

Other reasons for breaking ksm when unmerging: my sense that we ought
to provide a way to undo whatever is done (useful when testing, but
does that make it worth the effort? particularly if much of the testing
goes into testing precisely that feature!); and a notion that people
worried about covert channels would want to be able to undo merging
absolutely (but would they have been using MADV_MERGEABLE in the
first place, even have KSM configured in and running? seems unlikely).

It may be that MADV_UNMERGEABLE isn't really needed (I think I even
admitted once that probably nobody would use it other than us testing
it).  Yet I hesitate to rip it out: somehow it still seems right to
have it in there.  Why did you have unregistering in the /dev/ksm KSM?

> 
> > In oom deadlock terms madvise(UNMERGEABLE) is the only place that is
> > 100% fine at breaking KSM pages, because it runs with the right tsk->mm
> > and page allocation will notice TIF_MEMDIE set on tsk.
> >
> > If we remove "echo 2", the only remaining "unsafe" spot is the break_cow in
> > ksmd context when memcmp fails, and similar during the scan.
> >
> >   
> I didn't talk here about the bug..., I talked about the behavior...
> It is the feeling that the OOM killer will kill applications calling into
> UNMERGEABLE, even though the application shouldn't die, just because it had
> a big amount of memory shared and unmerged it at the wrong time?...

The OOM killer makes its choices based upon total_vm (and some other
things): doesn't even consider rss, and certainly not KSM sharing.
So whenever out-of-memory, the OOM killer might choose to kill a very
highly KSM-merged process, freeing very little memory, just because
it _appears_ big from the outside.  There's nothing special to the
unmerging case, other than that being a good way to require lots of
page allocations in a single system call.  I don't see anything
unfair about it being killed at the point of unmerging: much more
unfair that it be killed when highly merged.

> 
> But probably these thoughts have no end, and we are better off sticking with
> something practical that can work cleanly and simply...
> 
> So what I think is this:
> echo 2 is something we want in this version because we don't support swapping
> of the shared pages, so we have to allow some way to break the pages...
> 
> and, like echo 2, UNMERGEABLE has to break the shared pages when its madvise
> gets called...
> 
> So maybe it is just better to leave it like that?
> > > When thinking about it, let's say I want to use ksm to scan 2 applications
> > > and merge their STATIC identical data, and then I want to stop scanning
> > > them after I know ksm has merged the pages; as soon as I try to
> > > unregister these 2 applications ksm will unmerge the pages, so we don't
> > > allow such a thing for the user (we can of course tell him for such a case
> > > to use the normal way of sharing, so this isn't a really strong case for this)
> > >     
> >
> > For the app it will be tricky to know when the pages are merged
> > though; right now it could only wait a "while"... so I don't really
> > see madvise(UNMERGEABLE) as useful regardless of how we implement
> > it... but then this goes beyond the scope of this bug because, as said,
> > madvise(UNMERGEABLE) is the only place that breaks ksm pages as safely
> > as a regular write fault in oom context, because it runs in the
> > process context (not the echo 2 or ksmd context).
> >   
> Yeah, I agree that this case was idiotic :). Actually I thought about a
> case where the application gets a little bit more info, but leave it, it is
> not worth it; traditional sharing is much better for such cases.

It didn't seem idiotic to me, but I hadn't realized the ksmd timelapse
uncertainty Andrea points out.  Well, I'm not keen to change the way
it's working at present, but I do think you're right to question all
these aspects of unmerging.

Hugh


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock
  2009-08-27 19:11                             ` Hugh Dickins
@ 2009-08-27 19:35                               ` Izik Eidus
  -1 siblings, 0 replies; 106+ messages in thread
From: Izik Eidus @ 2009-08-27 19:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Chris Wright, Nick Piggin,
	Andrew Morton, Justin M. Forbes, linux-kernel, linux-mm

Hugh Dickins wrote:
>
> It may be that MADV_UNMERGEABLE isn't really needed (I think I even
> admitted once that probably nobody would use it other than us testing
> it).  Yet I hesitate to rip it out: somehow it still seems right to
> have it in there.  Why did you have unregistering in the /dev/ksm KSM?
>   

It was more to give applications the possibility to save ksmd cpu cycles,
so it won't continue to scan memory regions that don't need ksm anymore.
As you said, whether someone will ever use it? I have no idea...

>   
> It didn't seem idiotic to me, but I hadn't realized the ksmd timelapse
> uncertainty Andrea points out.  Well, I'm not keen to change the way
> it's working at present, but I do think you're right to question all
> these aspects of unmerging.
>   

Yes, let's keep it like that; UNMERGEABLE sounds anyway like something that
is going to break the pages... we can always add STOPMERGE later as a call
that tells ksm to stop merging the pages but not break the shared pages...
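
From the application side it would look something like this (just a
sketch: MADV_STOPMERGE is made up, only MERGEABLE/UNMERGEABLE exist,
and older libc headers may not define even those values):

	#include <stddef.h>
	#include <sys/mman.h>

	#ifndef MADV_MERGEABLE
	#define MADV_MERGEABLE		12	/* values from the madvise KSM patches */
	#define MADV_UNMERGEABLE	13
	#endif
	#define MADV_STOPMERGE		14	/* hypothetical, does not exist */

	static void ksm_hints(void *buf, size_t len)
	{
		madvise(buf, len, MADV_MERGEABLE);	/* let ksmd scan and merge this range */
		/* ... once the static data has (hopefully) been merged ... */
		madvise(buf, len, MADV_UNMERGEABLE);	/* today: also breaks the merged pages */
		/*
		 * The idea above: stop scanning but keep the sharing --
		 * madvise(buf, len, MADV_STOPMERGE);
		 */
	}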

> Hugh
>   


^ permalink raw reply	[flat|nested] 106+ messages in thread


end of thread, other threads:[~2009-08-27 19:28 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-03 12:08 [PATCH 0/12] ksm: stats, oom, doc, misc Hugh Dickins
2009-08-03 12:08 ` Hugh Dickins
2009-08-03 12:10 ` [PATCH 1/12] ksm: rename kernel_pages_allocated Hugh Dickins
2009-08-03 12:10   ` Hugh Dickins
2009-08-03 14:21   ` Izik Eidus
2009-08-03 14:21     ` Izik Eidus
2009-08-03 16:48     ` Andrea Arcangeli
2009-08-03 16:48       ` Andrea Arcangeli
2009-08-03 12:11 ` [PATCH 2/12] ksm: move pages_sharing updates Hugh Dickins
2009-08-03 12:11   ` Hugh Dickins
2009-08-03 14:34   ` Izik Eidus
2009-08-03 14:34     ` Izik Eidus
2009-08-03 16:53   ` Andrea Arcangeli
2009-08-03 16:53     ` Andrea Arcangeli
2009-08-03 17:34     ` Hugh Dickins
2009-08-03 17:34       ` Hugh Dickins
2009-08-03 12:11 ` [PATCH 3/12] ksm: pages_unshared and pages_volatile Hugh Dickins
2009-08-03 12:11   ` Hugh Dickins
2009-08-03 14:54   ` Izik Eidus
2009-08-03 14:54     ` Izik Eidus
2009-08-04 21:49   ` Andrew Morton
2009-08-04 21:49     ` Andrew Morton
2009-08-05 11:39     ` Hugh Dickins
2009-08-05 11:39       ` Hugh Dickins
2009-08-05 15:11       ` Andrea Arcangeli
2009-08-05 15:11         ` Andrea Arcangeli
2009-08-03 12:12 ` [PATCH 4/12] ksm: break cow once unshared Hugh Dickins
2009-08-03 12:12   ` Hugh Dickins
2009-08-03 16:00   ` Izik Eidus
2009-08-03 16:00     ` Izik Eidus
2009-08-03 12:14 ` [PATCH 5/12] ksm: keep quiet while list empty Hugh Dickins
2009-08-03 12:14   ` Hugh Dickins
2009-08-03 16:55   ` Izik Eidus
2009-08-03 16:55     ` Izik Eidus
2009-08-04 21:59   ` Andrew Morton
2009-08-04 21:59     ` Andrew Morton
2009-08-05 11:54     ` Hugh Dickins
2009-08-05 11:54       ` Hugh Dickins
2009-08-03 12:15 ` [PATCH 6/12] ksm: five little cleanups Hugh Dickins
2009-08-03 12:15   ` Hugh Dickins
2009-08-04 12:41   ` Izik Eidus
2009-08-04 12:41     ` Izik Eidus
2009-08-03 12:16 ` [PATCH 7/12] ksm: fix endless loop on oom Hugh Dickins
2009-08-03 12:16   ` Hugh Dickins
2009-08-04 12:55   ` Izik Eidus
2009-08-04 12:55     ` Izik Eidus
2009-08-03 12:17 ` [PATCH 8/12] ksm: distribute remove_mm_from_lists Hugh Dickins
2009-08-03 12:17   ` Hugh Dickins
2009-08-04 13:03   ` Izik Eidus
2009-08-04 13:03     ` Izik Eidus
2009-08-03 12:18 ` [PATCH 9/12] ksm: fix oom deadlock Hugh Dickins
2009-08-03 12:18   ` Hugh Dickins
2009-08-04 19:32   ` Izik Eidus
2009-08-04 19:32     ` Izik Eidus
2009-08-25 14:58   ` Andrea Arcangeli
2009-08-25 14:58     ` Andrea Arcangeli
2009-08-25 15:22     ` [PATCH 13/12] ksm: fix munlock during exit_mmap deadlock Andrea Arcangeli
2009-08-25 15:22       ` Andrea Arcangeli
2009-08-25 17:49       ` Hugh Dickins
2009-08-25 17:49         ` Hugh Dickins
2009-08-25 18:10         ` Andrea Arcangeli
2009-08-25 18:10           ` Andrea Arcangeli
2009-08-25 18:58           ` Hugh Dickins
2009-08-25 18:58             ` Hugh Dickins
2009-08-25 19:45             ` Andrea Arcangeli
2009-08-25 19:45               ` Andrea Arcangeli
2009-08-26 16:18               ` Justin M. Forbes
2009-08-26 16:18                 ` Justin M. Forbes
2009-08-26 19:17               ` Hugh Dickins
2009-08-26 19:17                 ` Hugh Dickins
2009-08-26 19:44                 ` Andrea Arcangeli
2009-08-26 19:44                   ` Andrea Arcangeli
2009-08-26 19:57                   ` Hugh Dickins
2009-08-26 19:57                     ` Hugh Dickins
2009-08-26 20:28                     ` Andrea Arcangeli
2009-08-26 20:28                       ` Andrea Arcangeli
2009-08-26 20:54                     ` Izik Eidus
2009-08-26 20:54                       ` Izik Eidus
2009-08-26 21:14                       ` Andrea Arcangeli
2009-08-26 21:14                         ` Andrea Arcangeli
2009-08-26 21:49                         ` Izik Eidus
2009-08-26 21:49                           ` Izik Eidus
2009-08-27 19:11                           ` Hugh Dickins
2009-08-27 19:11                             ` Hugh Dickins
2009-08-27 19:35                             ` Izik Eidus
2009-08-27 19:35                               ` Izik Eidus
2009-08-26 22:00                         ` David Rientjes
2009-08-26 22:00                           ` David Rientjes
2009-08-26 20:29                   ` Hugh Dickins
2009-08-26 20:29                     ` Hugh Dickins
2009-08-25 17:35     ` [PATCH 9/12] ksm: fix oom deadlock Hugh Dickins
2009-08-25 17:35       ` Hugh Dickins
2009-08-25 17:47       ` Andrea Arcangeli
2009-08-25 17:47         ` Andrea Arcangeli
2009-08-03 12:19 ` [PATCH 10/12] ksm: sysfs and defaults Hugh Dickins
2009-08-03 12:19   ` Hugh Dickins
2009-08-04 19:34   ` Izik Eidus
2009-08-04 19:34     ` Izik Eidus
2009-08-03 12:21 ` [PATCH 11/12] ksm: add some documentation Hugh Dickins
2009-08-03 12:21   ` Hugh Dickins
2009-08-04 19:35   ` Izik Eidus
2009-08-04 19:35     ` Izik Eidus
2009-08-03 12:22 ` [PATCH 12/12] ksm: remove VM_MERGEABLE_FLAGS Hugh Dickins
2009-08-03 12:22   ` Hugh Dickins
2009-08-04 19:35   ` Izik Eidus
2009-08-04 19:35     ` Izik Eidus
