linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] RFC - ksm api change into madvise
@ 2009-05-14  0:30 Izik Eidus
  2009-05-14  0:30 ` [PATCH 1/4] madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls Izik Eidus
  2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
  0 siblings, 2 replies; 22+ messages in thread
From: Izik Eidus @ 2009-05-14  0:30 UTC (permalink / raw)
  To: hugh
  Cc: linux-kernel, aarcange, akpm, nickpiggin, chrisw, linux-mm, riel,
	Izik Eidus

This is a request for comments on the ksm api changes.
The following patches move the api to use madvise instead of ioctls.

Before I describe the patches, I want to note that I rewrote this
patch series a lot of times; all the other methods that I tried had some
fundamental issues with them.
The current implementation does have some issues, but I believe they are
all solvable and that it is better than the other ways to do it.
If you feel you have a better way to do it, please tell me :).

OK, when we changed ksm to use madvise instead of ioctls, we wanted to keep
the following rules:

Not to increase the host memory usage when ksm is not being used (even when it
is compiled in); this means not adding fields to mm_struct / vm_area_struct...

Not to affect system performance with notifiers that would have to block
while ksm code is running under some lock - ksm is a helper, it should do its
work quietly. This is why I dropped a patch I had that added mmu notifier
support inside ksm.c and received notifications from the MM (for example
when a vma is destroyed (invalidate_range...)).

Not to change the MM logic.

Trying to touch as little code as we can outside ksm.c.


Taking all these rules into account, the end result we came up with is:
mmlist is now used not only by swapoff but by ksm as well. This means that
each time madvise is called to set a vma as MERGEABLE, madvise will check
whether the mm_struct is already on the mmlist and will insert it if it isn't.
We believe it is better to hurt the performance of swapoff a little bit
than to add another list into the mm_struct.
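
In pseudo-C, the idea at madvise(MADV_SHAREABLE) time is roughly the following
(a simplified sketch of what patch 2 does in madvise_shareable(), not the
exact code):

	spin_lock(&mmlist_lock);
	if (list_empty(&mm->mmlist))
		list_add(&mm->mmlist, &init_mm.mmlist);
	mmlist_mask |= MMLIST_KSM;	/* mark ksm as a user of the list */
	spin_unlock(&mmlist_lock);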

One issue that should be noted: after an mm_struct goes onto the mmlist, it
won't be removed from it until the process dies (even if there are no more
VM_MERGEABLE vmas). This doesn't mean memory is wasted, but it does mean ksm
will spend a little more time doing cur = cur->next, if (...).

Another issue: when a process dies, ksm has to discover (while scanning)
that its mm_users == 1 and then do mmput(); this means there might be a delay
from the time someone does a kill until the mm is really freed -
I am open to suggestions on how to improve this...

(When someone does echo 0 > /sys/kernel/mm/ksm/run, ksm will throw away all the
memory, so a situation where the memory is never freed cannot happen.)


Another important thing: this is a request for comments, and I am still not
sure that a few of the things we have done here are totally safe
(the mmlist synchronization with drain_mmlist, the handle_vmas() function in
madvise, the logic inside ksm for searching the next virtual address in the
vmas, and so on...).
The main purpose of this posting is to ask whether the new interface is what
you guys want, and whether you like the implementation design.

(I have also added an option to scan applications that were not modified to
use ksm, e.g. closed-source applications.)


Thanks.

Izik Eidus (4):
  madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.
  mmlist: share mmlist with ksm.
  ksm: change ksm api to use madvise instead of ioctls.
  ksm: add support for scanning procsses that were not modifided to use
    ksm

 include/asm-generic/mman.h |    2 +
 include/linux/ksm.h        |   40 --
 include/linux/mm.h         |    2 +
 include/linux/sched.h      |    3 +
 include/linux/swap.h       |    4 +
 mm/Kconfig                 |    2 +-
 mm/ksm.c                   | 1102 ++++++++++++++++++++++----------------------
 mm/madvise.c               |  124 ++++--
 mm/rmap.c                  |    8 +
 mm/swapfile.c              |    9 +-
 10 files changed, 686 insertions(+), 610 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.
  2009-05-14  0:30 [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
@ 2009-05-14  0:30 ` Izik Eidus
  2009-05-14  0:30   ` [PATCH 2/4] mmlist: share mmlist with ksm Izik Eidus
  2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
  1 sibling, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-05-14  0:30 UTC (permalink / raw)
  To: hugh
  Cc: linux-kernel, aarcange, akpm, nickpiggin, chrisw, linux-mm, riel,
	Izik Eidus

This patch adds the MADV_SHAREABLE and MADV_UNSHAREABLE madvise calls.
These calls are used to mark vm memory areas with the VM_MERGEABLE flag,
which specifies whether the memory inside the vma is allowed to be dynamically
shared with other memory.

(This is needed for ksm vma scanning support.)
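
For illustration, userspace would use the new calls roughly like this (a
minimal sketch; the MADV_* values below are the ones this patch adds to
asm-generic/mman.h and are not yet in the libc headers):

	#include <sys/mman.h>

	#ifndef MADV_SHAREABLE
	#define MADV_SHAREABLE   12	/* values from this patch */
	#define MADV_UNSHAREABLE 13
	#endif

	size_t len = 16 * 4096;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf != MAP_FAILED)
		madvise(buf, len, MADV_SHAREABLE); /* allow ksm to merge this range */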

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/asm-generic/mman.h |    2 +
 include/linux/mm.h         |    2 +
 include/linux/sched.h      |    2 +
 mm/madvise.c               |  116 +++++++++++++++++++++++++++++++++----------
 4 files changed, 95 insertions(+), 27 deletions(-)

diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 5e3dde2..830295d 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -34,6 +34,8 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_SHAREABLE	12		/* can share identical pages */
+#define MADV_UNSHAREABLE 13		/* can not share identical pages */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0ddfb5..61328a4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,8 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 
+#define VM_MERGEABLE    0x80000000	/* Memory may be merged */
+
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..7dc786a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -439,6 +439,8 @@ extern int get_dumpable(struct mm_struct *mm);
 # define MMF_DUMP_MASK_DEFAULT_ELF	0
 #endif
 
+#define MMF_VM_MERGEABLE	9
+
 struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
diff --git a/mm/madvise.c b/mm/madvise.c
index b9ce574..bd215ce 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -30,36 +30,12 @@ static int madvise_need_mmap_write(int behavior)
 	}
 }
 
-/*
- * We can potentially split a vm area into separate
- * areas, each area with its own behavior.
- */
-static long madvise_behavior(struct vm_area_struct * vma,
-		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end, int behavior)
+static int handle_vmas(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		       unsigned long start, unsigned long end, int new_flags)
 {
 	struct mm_struct * mm = vma->vm_mm;
-	int error = 0;
 	pgoff_t pgoff;
-	int new_flags = vma->vm_flags;
-
-	switch (behavior) {
-	case MADV_NORMAL:
-		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
-		break;
-	case MADV_SEQUENTIAL:
-		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
-		break;
-	case MADV_RANDOM:
-		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
-		break;
-	case MADV_DONTFORK:
-		new_flags |= VM_DONTCOPY;
-		break;
-	case MADV_DOFORK:
-		new_flags &= ~VM_DONTCOPY;
-		break;
-	}
+	int error = 0;
 
 	if (new_flags == vma->vm_flags) {
 		*prev = vma;
@@ -101,6 +77,37 @@ out:
 }
 
 /*
+ * We can potentially split a vm area into separate
+ * areas, each area with its own behavior.
+ */
+static long madvise_behavior(struct vm_area_struct * vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end, int behavior)
+{
+	int new_flags = vma->vm_flags;
+
+	switch (behavior) {
+	case MADV_NORMAL:
+		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+		break;
+	case MADV_SEQUENTIAL:
+		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
+		break;
+	case MADV_RANDOM:
+		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+		break;
+	case MADV_DONTFORK:
+		new_flags |= VM_DONTCOPY;
+		break;
+	case MADV_DOFORK:
+		new_flags &= ~VM_DONTCOPY;
+		break;
+	}
+
+	return handle_vmas(vma, prev, start, end, new_flags);
+}
+
+/*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct * vma,
@@ -208,6 +215,54 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+/*
+ * Application allows pages to be shared with other pages of identical
+ * content.
+ *
+ */
+static long madvise_shareable(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end,
+				int behavior)
+{
+	int ret;
+	struct mm_struct *mm;
+
+	switch (behavior) {
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+	case MADV_SHAREABLE:
+		ret = handle_vmas(vma, prev, start, end,
+				  vma->vm_flags | VM_MERGEABLE);
+
+		if (!ret) {
+			mm = vma->vm_mm;
+			set_bit(MMF_VM_MERGEABLE, &mm->flags);
+		}
+
+		return ret;
+	case MADV_UNSHAREABLE:
+		ret = handle_vmas(vma, prev, start, end,
+				  vma->vm_flags & ~VM_MERGEABLE);
+
+		if (!ret) {
+			mm = vma->vm_mm;
+			vma = mm->mmap;
+			while (vma) {
+				if (vma->vm_flags & VM_MERGEABLE)
+					break;
+				vma = vma->vm_next;
+			}
+			if (!vma)
+				clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		}
+
+		return ret;
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -238,6 +293,11 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		error = madvise_dontneed(vma, prev, start, end);
 		break;
 
+	case MADV_SHAREABLE:
+	case MADV_UNSHAREABLE:
+		error = madvise_shareable(vma, prev, start, end, behavior);
+		break;
+
 	default:
 		error = -EINVAL;
 		break;
@@ -269,6 +329,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
  *		so the kernel can free resources associated with it.
  *  MADV_REMOVE - the application wants to free up the given range of
  *		pages and associated backing store.
+ *  MADV_SHAREABLE - the application agrees that pages in the given
+ *		range can be shared w/ other pages of identical content.
  *
  * return values:
  *  zero    - success
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] mmlist: share mmlist with ksm.
  2009-05-14  0:30 ` [PATCH 1/4] madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls Izik Eidus
@ 2009-05-14  0:30   ` Izik Eidus
  2009-05-14  0:30     ` [PATCH 3/4] ksm: change ksm api to use madvise instead of ioctls Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-05-14  0:30 UTC (permalink / raw)
  To: hugh
  Cc: linux-kernel, aarcange, akpm, nickpiggin, chrisw, linux-mm, riel,
	Izik Eidus

This changes the logic of drain_mmlist() so that the mmlist can be shared
with ksm.

Right now mmlist is used to track the mm structs whose pages have been swapped,
and it is used by swapoff. This patch changes it so that, in addition to
holding the mm structs whose pages have been swapped, it will hold the mm
structs that have VM_MERGEABLE vm areas.

The tradeoff is a little more work when swapoff is running, but that is
probably better than adding another pointer into mm_struct and increasing
its size.

This patch adds mmlist_mask, which has 2 bits that can be set:
MMLIST_SWAP and MMLIST_KSM. This mmlist_mask controls the behavior of
drain_mmlist() so that it does not drain the mmlist while ksm is still using it.
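
In other words, after this patch drain_mmlist() only empties the list once no
user of it is left; the resulting logic is roughly (simplified sketch, the
real change is in the diff below):

	spin_lock(&mmlist_lock);
	mmlist_mask &= ~MMLIST_SWAP;	/* swapoff no longer needs the list */
	if (!mmlist_mask) {		/* ...and neither does ksm */
		list_for_each_safe(p, next, &init_mm.mmlist)
			list_del_init(p);
	}
	spin_unlock(&mmlist_lock);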

Implementation note: if a program called madvise with MADV_SHAREABLE, and this
                     vma later goes away, the mm_struct will still be kept
                     inside the mmlist until the process exits.

Another interesting point is this code inside rmap.c:
	if (list_empty(&mm->mmlist)) {
		spin_lock(&mmlist_lock);
		if (list_empty(&mm->mmlist))
			list_add(&mm->mmlist, &init_mm.mmlist);
		spin_unlock(&mmlist_lock);
	}

Andrea and I have discussed this code, and we are not sure whether it can
race with drain_mmlist() when swapoff is run while swapping is happening.

Is it safe? If it is safe, it is safe for this patch as well; if it isn't,
then this patch isn't safe either.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/swap.h |    4 ++++
 mm/madvise.c         |    8 ++++++++
 mm/rmap.c            |    8 ++++++++
 mm/swapfile.c        |    9 +++++++--
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62d8143..3919dc3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,10 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr);
 
 /* linux/mm/swapfile.c */
+#define MMLIST_SWAP (1 << 0)
+#define MMLIST_KSM  (1 << 1)
+extern int mmlist_mask;
+
 extern long nr_swap_pages;
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
diff --git a/mm/madvise.c b/mm/madvise.c
index bd215ce..40a0036 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,7 @@
 #include <linux/mempolicy.h>
 #include <linux/hugetlb.h>
 #include <linux/sched.h>
+#include <linux/swap.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -237,6 +238,13 @@ static long madvise_shareable(struct vm_area_struct *vma,
 		if (!ret) {
 			mm = vma->vm_mm;
 			set_bit(MMF_VM_MERGEABLE, &mm->flags);
+
+			spin_lock(&mmlist_lock);
+			if (unlikely(list_empty(&mm->mmlist)))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			if (unlikely(!(mmlist_mask & MMLIST_KSM)))
+				mmlist_mask |= MMLIST_KSM;
+			spin_unlock(&mmlist_lock);
 		}
 
 		return ret;
diff --git a/mm/rmap.c b/mm/rmap.c
index 95c55ea..71f378a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -951,7 +951,15 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				spin_lock(&mmlist_lock);
 				if (list_empty(&mm->mmlist))
 					list_add(&mm->mmlist, &init_mm.mmlist);
+				if (unlikely(!(mmlist_mask & MMLIST_SWAP)))
+					mmlist_mask |= MMLIST_SWAP;
 				spin_unlock(&mmlist_lock);
+			} else {
+				if (unlikely(!(mmlist_mask & MMLIST_SWAP))) {
+					spin_lock(&mmlist_lock);
+					mmlist_mask |= MMLIST_SWAP;
+					spin_unlock(&mmlist_lock);
+				}
 			}
 			dec_mm_counter(mm, anon_rss);
 		} else if (PAGE_MIGRATION) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..dadaa15 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,8 @@ long total_swap_pages;
 static int swap_overflow;
 static int least_priority;
 
+int mmlist_mask = 0;
+
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
 static const char Bad_offset[] = "Bad swap offset entry ";
@@ -1149,8 +1151,11 @@ static void drain_mmlist(void)
 		if (swap_info[i].inuse_pages)
 			return;
 	spin_lock(&mmlist_lock);
-	list_for_each_safe(p, next, &init_mm.mmlist)
-		list_del_init(p);
+	mmlist_mask &= ~MMLIST_SWAP;
+	if (!mmlist_mask) {
+		list_for_each_safe(p, next, &init_mm.mmlist)
+			list_del_init(p);
+	}
 	spin_unlock(&mmlist_lock);
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] ksm: change ksm api to use madvise instead of ioctls.
  2009-05-14  0:30   ` [PATCH 2/4] mmlist: share mmlist with ksm Izik Eidus
@ 2009-05-14  0:30     ` Izik Eidus
  2009-05-14  0:30       ` [PATCH 4/4] ksm: add support for scanning procsses that were not modifided to use ksm Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-05-14  0:30 UTC (permalink / raw)
  To: hugh
  Cc: linux-kernel, aarcange, akpm, nickpiggin, chrisw, linux-mm, riel,
	Izik Eidus

Now ksm uses madvise to know what memory regions it should scan.

ksm walks over all the mm_structs on the mmlist and searches for mm_structs
that have MMF_VM_MERGEABLE set; for those mm_structs it searches for every vma
that is VM_MERGEABLE and tries to share its pages.
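
Conceptually the scan is therefore (a rough sketch only - the real,
incremental implementation lives in scan_get_next_index() and
ksm_scan_start() in the diff below):

	/* conceptual sketch, not the literal code */
	list_for_each_entry(mm, &init_mm.mmlist, mmlist) {
		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
			continue;
		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			if (!(vma->vm_flags & VM_MERGEABLE))
				continue;
			/* walk the present anonymous pages of this vma and
			 * call cmp_and_merge_page() on each of them */
		}
	}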

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/ksm.h |   40 --
 mm/Kconfig          |    2 +-
 mm/ksm.c            | 1000 +++++++++++++++++++++++---------------------------
 3 files changed, 461 insertions(+), 581 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index c0849c7..ca17782 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -1,46 +1,6 @@
 #ifndef __LINUX_KSM_H
 #define __LINUX_KSM_H
 
-/*
- * Userspace interface for /dev/ksm - kvm shared memory
- */
-
-#include <linux/types.h>
-#include <linux/ioctl.h>
-
-#define KSM_API_VERSION 1
-
 #define ksm_control_flags_run 1
 
-/* for KSM_REGISTER_MEMORY_REGION */
-struct ksm_memory_region {
-	__u32 npages; /* number of pages to share */
-	__u32 pad;
-	__u64 addr; /* the begining of the virtual address */
-        __u64 reserved_bits;
-};
-
-#define KSMIO 0xAB
-
-/* ioctls for /dev/ksm */
-
-#define KSM_GET_API_VERSION              _IO(KSMIO,   0x00)
-/*
- * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
- */
-#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
-
-/* ioctls for SMA fds */
-
-/*
- * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
- * scanned by kvm.
- */
-#define KSM_REGISTER_MEMORY_REGION       _IOW(KSMIO,  0x20,\
-					      struct ksm_memory_region)
-/*
- * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
- */
-#define KSM_REMOVE_MEMORY_REGION         _IO(KSMIO,   0x21)
-
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index fb8ac63..73c4463 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -227,7 +227,7 @@ config MMU_NOTIFIER
 	bool
 
 config KSM
-	tristate "Enable KSM for page sharing"
+	bool "Enable KSM for page sharing"
 	help
 	  Enable the KSM kernel module to allow page sharing of equal pages
 	  among different tasks.
diff --git a/mm/ksm.c b/mm/ksm.c
index 8a0489b..901cce3 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1,7 +1,7 @@
 /*
- * Memory merging driver for Linux
+ * Memory merging support.
  *
- * This module enables dynamic sharing of identical pages found in different
+ * This code enables dynamic sharing of identical pages found in different
  * memory areas, even if they are not shared by fork()
  *
  * Copyright (C) 2008 Red Hat, Inc.
@@ -17,7 +17,6 @@
 #include <linux/errno.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
-#include <linux/miscdevice.h>
 #include <linux/vmalloc.h>
 #include <linux/file.h>
 #include <linux/mman.h>
@@ -37,6 +36,7 @@
 #include <linux/swap.h>
 #include <linux/rbtree.h>
 #include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
 #include <linux/ksm.h>
 
 #include <asm/tlbflush.h>
@@ -44,45 +44,32 @@
 MODULE_AUTHOR("Red Hat, Inc.");
 MODULE_LICENSE("GPL");
 
-static int rmap_hash_size;
-module_param(rmap_hash_size, int, 0);
-MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
-
-static int regions_per_fd;
-module_param(regions_per_fd, int, 0);
-
-/*
- * ksm_mem_slot - hold information for an userspace scanning range
- * (the scanning for this region will be from addr untill addr +
- *  npages * PAGE_SIZE inside mm)
+/**
+ * struct mm_slot - ksm information per mm that is being scanned
+ * @link: link to the mm_slots list
+ * @rmap_list: head for the rmap_list list
+ * @mm: the mm that this information is valid for
+ * @touched: 0 - mm_slot should be removed
  */
-struct ksm_mem_slot {
-	struct list_head link;
-	struct list_head sma_link;
+struct mm_slot {
+	struct hlist_node link;
+	struct list_head rmap_list;
 	struct mm_struct *mm;
-	unsigned long addr;	/* the begining of the virtual address */
-	unsigned npages;	/* number of pages to share */
-};
-
-/*
- * ksm_sma - shared memory area, each process have its own sma that contain the
- * information about the slots that it own
- */
-struct ksm_sma {
-	struct list_head sma_slots;
-	int nregions;
+	char touched;
 };
 
 /**
  * struct ksm_scan - cursor for scanning
- * @slot_index: the current slot we are scanning
- * @page_index: the page inside the sma that is currently being scanned
+ * @cur_mm_slot: the current mm_slot we are scanning
+ * @add_index: the address inside that is currently being scanned
+ * @cur_rmap: the current rmap that we are scanning inside the rmap_list
  *
  * ksm uses it to know what are the next pages it need to scan
  */
 struct ksm_scan {
-	struct ksm_mem_slot *slot_index;
-	unsigned long page_index;
+	struct mm_slot *cur_mm_slot;
+	unsigned long addr_index;
+	struct rmap_item *cur_rmap;
 };
 
 /*
@@ -139,14 +126,14 @@ struct tree_item {
 };
 
 /*
- * rmap_item - object of the rmap_hash hash table
+ * rmap_item - object of rmap_list per mm
  * (it is holding the previous hash value (oldindex),
  *  pointer into the page_hash_item, and pointer into the tree_item)
  */
 
 /**
  * struct rmap_item - reverse mapping item for virtual addresses
- * @link: link into the rmap_hash hash table.
+ * @link: link into rmap_list (rmap_list is per mm)
  * @mm: the memory strcture the rmap_item is pointing to.
  * @address: the virtual address the rmap_item is pointing to.
  * @oldchecksum: old checksum result for the page belong the virtual address
@@ -159,7 +146,7 @@ struct tree_item {
  */
 
 struct rmap_item {
-	struct hlist_node link;
+	struct list_head link;
 	struct mm_struct *mm;
 	unsigned long address;
 	unsigned int oldchecksum; /* old checksum value */
@@ -170,29 +157,19 @@ struct rmap_item {
 	struct rmap_item *prev;
 };
 
-/*
- * slots is linked list that hold all the memory regions that were registred
- * to be scanned.
- */
-static LIST_HEAD(slots);
-/*
- * slots_lock protects against removing and adding memory regions while a scanner
- * is in the middle of scanning.
- */
-static DECLARE_RWSEM(slots_lock);
-
 /* The stable and unstable trees heads. */
 struct rb_root root_stable_tree = RB_ROOT;
 struct rb_root root_unstable_tree = RB_ROOT;
 
 
-/* The number of linked list members inside the hash table */
-static unsigned int nrmaps_hash;
-/* rmap_hash hash table */
-static struct hlist_head *rmap_hash;
+/* The number of linked list members inside the mm slots hash table */
+static unsigned int nmm_slots_hash;
+/* mm_slots_hash hash table */
+static struct hlist_head *mm_slots_hash;
 
 static struct kmem_cache *tree_item_cache;
 static struct kmem_cache *rmap_item_cache;
+static struct kmem_cache *mm_slot_item_cache;
 
 /* the number of nodes inside the stable tree */
 static unsigned long nnodes_stable_tree;
@@ -223,8 +200,14 @@ static int ksm_slab_init(void)
 	if (!rmap_item_cache)
 		goto out_free;
 
+	mm_slot_item_cache = KMEM_CACHE(mm_slot, 0);
+	if (!mm_slot_item_cache)
+		goto out_free1;
+
 	return 0;
 
+out_free1:
+	kmem_cache_destroy(rmap_item_cache);
 out_free:
 	kmem_cache_destroy(tree_item_cache);
 out:
@@ -233,6 +216,7 @@ out:
 
 static void ksm_slab_free(void)
 {
+	kmem_cache_destroy(mm_slot_item_cache);
 	kmem_cache_destroy(rmap_item_cache);
 	kmem_cache_destroy(tree_item_cache);
 }
@@ -257,6 +241,16 @@ static inline void free_rmap_item(struct rmap_item *rmap_item)
 	kmem_cache_free(rmap_item_cache, rmap_item);
 }
 
+static inline struct mm_slot *alloc_mm_slot_item(void)
+{
+	return kmem_cache_zalloc(mm_slot_item_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot_item(struct mm_slot *mm_slot_item)
+{
+	kmem_cache_free(mm_slot_item_cache, mm_slot_item);
+}
+
 static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -303,71 +297,20 @@ static inline int PageKsm(struct page *page)
 	return !PageAnon(page);
 }
 
-static int rmap_hash_init(void)
-{
-	if (!rmap_hash_size) {
-		struct sysinfo sinfo;
-
-		si_meminfo(&sinfo);
-		rmap_hash_size = sinfo.totalram / 10;
-	}
-	nrmaps_hash = rmap_hash_size;
-	rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
-	if (!rmap_hash)
-		return -ENOMEM;
-	memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
-	return 0;
-}
-
-static void rmap_hash_free(void)
-{
-	int i;
-	struct hlist_head *bucket;
-	struct hlist_node *node, *n;
-	struct rmap_item *rmap_item;
-
-	for (i = 0; i < nrmaps_hash; ++i) {
-		bucket = &rmap_hash[i];
-		hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
-			hlist_del(&rmap_item->link);
-			free_rmap_item(rmap_item);
-		}
-	}
-	vfree(rmap_hash);
-}
-
-static inline u32 calc_checksum(struct page *page)
-{
-	u32 checksum;
-	void *addr = kmap_atomic(page, KM_USER0);
-	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
-	kunmap_atomic(addr, KM_USER0);
-	return checksum;
-}
-
-/*
- * Return rmap_item for a given virtual address.
- */
-static struct rmap_item *get_rmap_item(struct mm_struct *mm, unsigned long addr)
+static void break_cow(struct mm_struct *mm, unsigned long addr)
 {
-	struct rmap_item *rmap_item;
-	struct hlist_head *bucket;
-	struct hlist_node *node;
+	struct page *page[1];
 
-	bucket = &rmap_hash[addr % nrmaps_hash];
-	hlist_for_each_entry(rmap_item, node, bucket, link) {
-		if (mm == rmap_item->mm && rmap_item->address == addr) {
-			return rmap_item;
-		}
-	}
-	return NULL;
+	down_read(&mm->mmap_sem);
+	if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL) == 1)
+		put_page(page[0]);
+	up_read(&mm->mmap_sem);
 }
 
 /*
  * Removing rmap_item from stable or unstable tree.
- * This function will free the rmap_item object, and if that rmap_item was
- * insde the stable or unstable trees, it would remove the link from there
- * as well.
+ * This function will clean the information from the stable/unstable tree
+ * and will free the tree_item if needed.
  */
 static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 {
@@ -404,222 +347,92 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 			}
 		} else {
 			/*
-			 * We dont rb_erase(&tree_item->node) here, beacuse
-			 * that the unstable tree will get flushed before we are
-			 * here.
+			 * We dont rb_erase(&tree_item->node) here, beacuse at
+			 * the time we are here root_unstable_tree = RB_ROOT
+			 * should had been called.
 			 */
 			free_tree_item(tree_item);
 		}
 	}
 
-	hlist_del(&rmap_item->link);
-	free_rmap_item(rmap_item);
-}
-
-static void break_cow(struct mm_struct *mm, unsigned long addr)
-{
-	struct page *page[1];
-
-	down_read(&mm->mmap_sem);
-	if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL) == 1)
-		put_page(page[0]);
-	up_read(&mm->mmap_sem);
+	rmap_item->stable_tree = 0;
+	rmap_item->oldchecksum = 0;
+	rmap_item->kpage_outside_tree = 0;
+	rmap_item->next = NULL;
+	rmap_item->prev = NULL;
 }
 
-static void remove_page_from_tree(struct mm_struct *mm,
-				  unsigned long addr)
+static void remove_all_slot_rmap_items(struct mm_slot *mm_slot, int do_break)
 {
-	struct rmap_item *rmap_item;
-
-	rmap_item = get_rmap_item(mm, addr);
-	if (!rmap_item)
-		return;
-
-	if (rmap_item->stable_tree) {
-		/* We are breaking all the KsmPages of area that is removed */
-		break_cow(mm, addr);
-	} else {
-		/*
-		 * If kpage_outside_tree is set, this item is KsmPage outside
-		 * the stable tree, therefor we have to break the COW and
-		 * in addition we have to dec nkpage_out_tree.
-		 */
-		if (rmap_item->kpage_outside_tree)
-			break_cow(mm, addr);
+	struct rmap_item *rmap_item, *node;
+
+	list_for_each_entry_safe(rmap_item, node, &mm_slot->rmap_list, link) {
+		if (do_break) {
+			if (rmap_item->stable_tree) {
+				/*
+				 * we are breaking all the ksmpages of the area
+				 * that is being removed
+			 	*/
+				break_cow(mm_slot->mm, rmap_item->address);
+			} else {
+				/*
+				 * if kpage_outside_tree is set, this item is
+				 * ksmpage outside the stable tree, therefor we
+				 * have to break the cow and in addition we have
+				 * to dec nkpage_out_tree.
+		 		 */
+				if (rmap_item->kpage_outside_tree)
+					break_cow(mm_slot->mm,
+						  rmap_item->address);
+			}
+		}
+		remove_rmap_item_from_tree(rmap_item);
+		list_del(&rmap_item->link);
+		free_rmap_item(rmap_item);
 	}
-
-	remove_rmap_item_from_tree(rmap_item);
 }
 
-static inline int is_intersecting_address(unsigned long addr,
-					  unsigned long begin,
-					  unsigned long end)
+static void remove_mm_slot(struct mm_slot *mm_slot, int do_break)
 {
-	if (addr >= begin && addr < end)
-		return 1;
-	return 0;
+	hlist_del(&mm_slot->link);
+	remove_all_slot_rmap_items(mm_slot, do_break);
+	mmput(mm_slot->mm);
+	free_mm_slot_item(mm_slot);
 }
 
-/*
- * is_overlap_mem - check if there is overlapping with memory that was already
- * registred.
- *
- * note - this function must to be called under slots_lock
- */
-static int is_overlap_mem(struct ksm_memory_region *mem)
+static int mm_slots_hash_init(void)
 {
-	struct ksm_mem_slot *slot;
-
-	list_for_each_entry(slot, &slots, link) {
-		unsigned long mem_end;
-		unsigned long slot_end;
-
-		cond_resched();
-
-		if (current->mm != slot->mm)
-			continue;
-
-		mem_end = mem->addr + (unsigned long)mem->npages * PAGE_SIZE;
-		slot_end = slot->addr + (unsigned long)slot->npages * PAGE_SIZE;
-
-		if (is_intersecting_address(mem->addr, slot->addr, slot_end) ||
-		    is_intersecting_address(mem_end - 1, slot->addr, slot_end))
-			return 1;
-		if (is_intersecting_address(slot->addr, mem->addr, mem_end) ||
-		    is_intersecting_address(slot_end - 1, mem->addr, mem_end))
-			return 1;
-	}
-
-	return 0;
-}
-
-static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
-						struct ksm_memory_region *mem)
-{
-	struct ksm_mem_slot *slot;
-	int ret = -EPERM;
-
-	if (!mem->npages)
-		goto out;
+	nmm_slots_hash = 4096;
 
-	down_write(&slots_lock);
-
-	if ((ksm_sma->nregions + 1) > regions_per_fd) {
-		ret = -EBUSY;
-		goto out_unlock;
-	}
-
-	if (is_overlap_mem(mem))
-		goto out_unlock;
-
-	slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
-	if (!slot) {
-		ret = -ENOMEM;
-		goto out_unlock;
-	}
-
-	/*
-	 * We will hold refernce to the task_mm untill the file descriptor
-	 * will be closed, or KSM_REMOVE_MEMORY_REGION will be called.
-	 */
-	slot->mm = get_task_mm(current);
-	if (!slot->mm)
-		goto out_free;
-	slot->addr = mem->addr;
-	slot->npages = mem->npages;
-
-	list_add_tail(&slot->link, &slots);
-	list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
-	ksm_sma->nregions++;
-
-	up_write(&slots_lock);
+	mm_slots_hash = vmalloc(nmm_slots_hash * sizeof(struct hlist_head));
+	if (!mm_slots_hash)
+		return -ENOMEM;
+	memset(mm_slots_hash, 0, nmm_slots_hash * sizeof(struct hlist_head));
 	return 0;
-
-out_free:
-	kfree(slot);
-out_unlock:
-	up_write(&slots_lock);
-out:
-	return ret;
-}
-
-static void remove_slot_from_hash_and_tree(struct ksm_mem_slot *slot)
-{
-	int pages_count;
-
-	root_unstable_tree = RB_ROOT;
-	for (pages_count = 0; pages_count < slot->npages; ++pages_count)
-		remove_page_from_tree(slot->mm, slot->addr +
-				      pages_count * PAGE_SIZE);
-	/* Called under slots_lock */
-	list_del(&slot->link);
 }
 
-static int ksm_sma_ioctl_remove_memory_region(struct ksm_sma *ksm_sma,
-					      unsigned long addr)
+static void mm_slots_hash_free(void)
 {
-	int ret = -EFAULT;
-	struct ksm_mem_slot *slot, *node;
-
-	down_write(&slots_lock);
-	list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
-		if (addr == slot->addr) {
-			remove_slot_from_hash_and_tree(slot);
-			mmput(slot->mm);
-			list_del(&slot->sma_link);
-			kfree(slot);
-			ksm_sma->nregions--;
-			ret = 0;
-		}
-	}
-	up_write(&slots_lock);
-	return ret;
-}
+	int i;
+	struct hlist_head *bucket;
+	struct hlist_node *node, *n;
+	struct mm_slot *mm_slot;
 
-static int ksm_sma_release(struct inode *inode, struct file *filp)
-{
-	struct ksm_mem_slot *slot, *node;
-	struct ksm_sma *ksm_sma = filp->private_data;
-
-	down_write(&slots_lock);
-	list_for_each_entry_safe(slot, node, &ksm_sma->sma_slots, sma_link) {
-		remove_slot_from_hash_and_tree(slot);
-		mmput(slot->mm);
-		list_del(&slot->sma_link);
-		kfree(slot);
+	for (i = 0; i < nmm_slots_hash; ++i) {
+		bucket = &mm_slots_hash[i];
+		hlist_for_each_entry_safe(mm_slot, node, n, bucket, link)
+			remove_mm_slot(mm_slot, 1);
 	}
-	up_write(&slots_lock);
-
-	kfree(ksm_sma);
-	return 0;
+	vfree(mm_slots_hash);
 }
 
-static long ksm_sma_ioctl(struct file *filp,
-			  unsigned int ioctl, unsigned long arg)
+static inline u32 calc_checksum(struct page *page)
 {
-	struct ksm_sma *sma = filp->private_data;
-	void __user *argp = (void __user *)arg;
-	int r = EINVAL;
-
-	switch (ioctl) {
-	case KSM_REGISTER_MEMORY_REGION: {
-		struct ksm_memory_region ksm_memory_region;
-
-		r = -EFAULT;
-		if (copy_from_user(&ksm_memory_region, argp,
-				   sizeof(ksm_memory_region)))
-			goto out;
-		r = ksm_sma_ioctl_register_memory_region(sma,
-							 &ksm_memory_region);
-		break;
-	}
-	case KSM_REMOVE_MEMORY_REGION:
-		r = ksm_sma_ioctl_remove_memory_region(sma, arg);
-		break;
-	}
-
-out:
-	return r;
+	u32 checksum;
+	void *addr = kmap_atomic(page, KM_USER0);
+	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
+	kunmap_atomic(addr, KM_USER0);
+	return checksum;
 }
 
 static int memcmp_pages(struct page *page1, struct page *page2)
@@ -666,6 +479,9 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 	unsigned long page_addr_in_vma;
 	pte_t orig_pte, *orig_ptep;
 
+	if(!(vma->vm_flags & VM_MERGEABLE))
+		goto out;
+
 	if (!PageAnon(oldpage))
 		goto out;
 
@@ -731,16 +547,16 @@ out:
  */
 
 static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
-					struct page *page1,
-					struct mm_struct *mm2,
-					struct page *page2,
-					unsigned long addr1,
-					unsigned long addr2)
+				        struct page *page1,
+				        struct mm_struct *mm2,
+				        struct page *page2,
+				        unsigned long addr1,
+				        unsigned long addr2)
 {
 	struct vm_area_struct *vma;
 	pgprot_t prot;
-	int ret = 1;
 	struct page *kpage;
+	int ret = 1;
 
 	/*
 	 * The number of the nodes inside the stable tree +
@@ -757,7 +573,7 @@ static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
 		return ret;
 	down_read(&mm1->mmap_sem);
 	vma = find_vma(mm1, addr1);
-	if (!vma) {
+	if (!vma || vma->vm_start > addr1) {
 		put_page(kpage);
 		up_read(&mm1->mmap_sem);
 		return ret;
@@ -772,7 +588,7 @@ static int try_to_merge_two_pages_alloc(struct mm_struct *mm1,
 	if (!ret) {
 		down_read(&mm2->mmap_sem);
 		vma = find_vma(mm2, addr2);
-		if (!vma) {
+		if (!vma || vma->vm_start > addr2) {
 			put_page(kpage);
 			up_read(&mm2->mmap_sem);
 			break_cow(mm1, addr1);
@@ -822,7 +638,7 @@ static int try_to_merge_two_pages_noalloc(struct mm_struct *mm1,
 	BUG_ON(!PageKsm(page2));
 	down_read(&mm1->mmap_sem);
 	vma = find_vma(mm1, addr1);
-	if (!vma) {
+	if (!vma || vma->vm_start > addr1) {
 		up_read(&mm1->mmap_sem);
 		return ret;
 	}
@@ -1100,7 +916,7 @@ static struct tree_item *unstable_tree_search_insert(struct page *page,
  *
  * we return 1 in case we removed the rmap_item.
  */
-int update_tree(struct rmap_item *rmap_item)
+static int update_tree(struct rmap_item *rmap_item)
 {
 	if (!rmap_item->stable_tree) {
 		if (unlikely(rmap_item->kpage_outside_tree)) {
@@ -1131,24 +947,6 @@ int update_tree(struct rmap_item *rmap_item)
 	return 1;
 }
 
-static void create_new_rmap_item(struct rmap_item *rmap_item,
-				 struct mm_struct *mm,
-				 unsigned long addr,
-				 unsigned int checksum)
-{
-	struct hlist_head *bucket;
-
-	rmap_item->mm = mm;
-	rmap_item->address = addr;
-	rmap_item->oldchecksum = checksum;
-	rmap_item->stable_tree = 0;
-	rmap_item->kpage_outside_tree = 0;
-	rmap_item->tree_item = NULL;
-
-	bucket = &rmap_hash[addr % nrmaps_hash];
-	hlist_add_head(&rmap_item->link, bucket);
-}
-
 /*
  * insert_to_stable_tree_list - insert another rmap_item into the linked list
  * rmap_items of a given node inside the stable tree.
@@ -1174,99 +972,75 @@ static void insert_to_stable_tree_list(struct rmap_item *rmap_item,
  * in case we find that there is similar hash to different page we call to
  * try_to_merge_two_pages().
  *
- * @ksm_scan: the ksm scanner strcture.
  * @page: the page that we are searching identical page to.
+ * @rmap_item: the reverse mapping into the virtual address of this page
  */
-static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
+
+static int cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 {
 	struct page *page2[1];
-	struct ksm_mem_slot *slot;
 	struct tree_item *tree_item;
-	struct rmap_item *rmap_item;
 	struct rmap_item *tree_rmap_item;
 	unsigned int checksum;
-	unsigned long addr;
-	int wait = 0;
+	int wait;
 	int ret = 0;
 
-	slot = ksm_scan->slot_index;
-	addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
-	rmap_item = get_rmap_item(slot->mm, addr);
-	if (rmap_item) {
-		if (update_tree(rmap_item)) {
-			rmap_item = NULL;
-			wait = 1;
-		}
-	}
+	wait = update_tree(rmap_item);
 
 	/* We first start with searching the page inside the stable tree */
 	tree_rmap_item = stable_tree_search(page, page2, rmap_item);
 	if (tree_rmap_item) {
-		struct rmap_item *tmp_rmap_item = NULL;
+		BUG_ON(!tree_rmap_item->tree_item);
 
-		if (!rmap_item) {
-			tmp_rmap_item = alloc_rmap_item();
-			if (!tmp_rmap_item)
-				return ret;
-		}
+		ret = try_to_merge_two_pages_noalloc(rmap_item->mm, page,
+						     page2[0],
+						     rmap_item->address);
 
-		BUG_ON(!tree_rmap_item->tree_item);
-		ret = try_to_merge_two_pages_noalloc(slot->mm, page, page2[0],
-						     addr);
 		put_page(page2[0]);
+
 		if (!ret) {
 			/*
 			 * The page was successuly merged, lets insert its
 			 * rmap_item into the stable tree.
 			 */
-
-			if (!rmap_item) {
-				create_new_rmap_item(tmp_rmap_item, slot->mm,
-						     addr, 0);
-				rmap_item = tmp_rmap_item;
-			}
-
 			insert_to_stable_tree_list(rmap_item, tree_rmap_item);
-		} else {
-			if (tmp_rmap_item)
-				free_rmap_item(tmp_rmap_item);
 		}
+
 		ret = !ret;
 		goto out;
 	}
 
+	if (wait)
+		goto out;
+
 	/*
 	 * In case the hash value of the page was changed from the last time we
 	 * have calculated it, this page to be changed frequely, therefore we
 	 * dont want to insert it to the unstable tree, and we dont want to
 	 * waste our time to search if there is something identical to it there.
 	 */
-	if (rmap_item) {
-		checksum = calc_checksum(page);
-		if (rmap_item->oldchecksum != checksum) {
-			rmap_item->oldchecksum = checksum;
-			goto out;
-		}
+	checksum = calc_checksum(page);
+	if (rmap_item->oldchecksum != checksum) {
+		rmap_item->oldchecksum = checksum;
+		goto out;
 	}
 
 	tree_item = unstable_tree_search_insert(page, page2, rmap_item);
 	if (tree_item) {
-		struct rmap_item *tmp_rmap_item = NULL;
 		struct rmap_item *merge_rmap_item;
+		struct mm_struct *tree_mm;
+		unsigned long tree_addr;
 
 		merge_rmap_item = tree_item->rmap_item;
 		BUG_ON(!merge_rmap_item);
 
-		if (!rmap_item) {
-			tmp_rmap_item = alloc_rmap_item();
-			if (!tmp_rmap_item)
-				return ret;
-		}
+		tree_mm = merge_rmap_item->mm;
+		tree_addr = merge_rmap_item->address;
+
+		ret = try_to_merge_two_pages_alloc(rmap_item->mm, page, tree_mm,
+						   page2[0], rmap_item->address,
+						   tree_addr);
 
-		ret = try_to_merge_two_pages_alloc(slot->mm, page,
-						   merge_rmap_item->mm,
-						   page2[0], addr,
-						   merge_rmap_item->address);
 		/*
 		 * As soon as we successuly merged this page, we want to remove
 		 * the rmap_item object of the page that we have merged with
@@ -1283,102 +1057,319 @@ static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
 			 * their rmap as tree_item->kpage_outside_tree = 1
 			 * and to inc nkpage_out_tree by 2.
 			 */
-			if (stable_tree_insert(page2[0],
-					       tree_item, merge_rmap_item)) {
+			if (stable_tree_insert(page2[0], tree_item,
+					       merge_rmap_item)) {
 				merge_rmap_item->kpage_outside_tree = 1;
-				if (!rmap_item) {
-					create_new_rmap_item(tmp_rmap_item,
-							     slot->mm,
-							     addr, 0);
-					rmap_item = tmp_rmap_item;
-				}
 				rmap_item->kpage_outside_tree = 1;
 				nkpage_out_tree += 2;
 			} else {
-				if (tmp_rmap_item) {
-					create_new_rmap_item(tmp_rmap_item,
-							     slot->mm, addr, 0);
-					rmap_item = tmp_rmap_item;
-				}
 				insert_to_stable_tree_list(rmap_item,
 							   merge_rmap_item);
 			}
-		} else {
-			if (tmp_rmap_item)
-				free_rmap_item(tmp_rmap_item);
 		}
+
 		put_page(page2[0]);
+
 		ret = !ret;
-		goto out;
-	}
-	/*
-	 * When wait is 1, we dont want to calculate the hash value of the page
-	 * right now, instead we prefer to wait.
-	 */
-	if (!wait && !rmap_item) {
-		rmap_item = alloc_rmap_item();
-		if (!rmap_item)
-			return ret;
-		checksum = calc_checksum(page);
-		create_new_rmap_item(rmap_item, slot->mm, addr, checksum);
 	}
+
 out:
 	return ret;
 }
 
-/* return -EAGAIN - no slots registered, nothing to be done */
-static int scan_get_next_index(struct ksm_scan *ksm_scan)
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
 {
-	struct ksm_mem_slot *slot;
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
 
-	if (list_empty(&slots))
-		return -EAGAIN;
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof (struct mm_struct))
+				% nmm_slots_hash];
+	hlist_for_each_entry(mm_slot, node, bucket, link) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+}
 
-	slot = ksm_scan->slot_index;
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
 
-	/* Are there pages left in this slot to scan? */
-	if ((slot->npages - ksm_scan->page_index - 1) > 0) {
-		ksm_scan->page_index++;
-		return 0;
-	}
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof (struct mm_struct))
+				% nmm_slots_hash];
+	mm_slot->mm = mm;
+	atomic_inc(&mm_slot->mm->mm_users);
+	INIT_LIST_HEAD(&mm_slot->rmap_list);
+	hlist_add_head(&mm_slot->link, bucket);
+}
 
-	list_for_each_entry_from(slot, &slots, link) {
-		if (slot == ksm_scan->slot_index)
-			continue;
-		ksm_scan->page_index = 0;
-		ksm_scan->slot_index = slot;
-		return 0;
-	}
+static void create_new_rmap_item(struct list_head *cur,
+				 struct rmap_item *rmap_item,
+				 struct mm_struct *mm,
+				 unsigned long addr,
+				 unsigned int checksum)
+{
+	rmap_item->address = addr;
+	rmap_item->mm = mm;
+	rmap_item->oldchecksum = checksum;
+	rmap_item->stable_tree = 0;
+	rmap_item->kpage_outside_tree = 0;
+	rmap_item->tree_item = NULL;
 
-	/* look like we finished scanning the whole memory, starting again */
-	root_unstable_tree = RB_ROOT;
-	ksm_scan->page_index = 0;
-	ksm_scan->slot_index = list_first_entry(&slots,
-						struct ksm_mem_slot, link);
-	return 0;
+	list_add(&rmap_item->link, cur);
 }
 
 /*
- * update slot_index - make sure ksm_scan will point to vaild data,
- * it is possible that by the time we are here the data that ksm_scan was
- * pointed to was released so we have to call this function every time after
- * taking the slots_lock
+ * update_rmap_list - nuke every rmap_item above the current rmap_item.
  */
-static void scan_update_old_index(struct ksm_scan *ksm_scan)
+static void update_rmap_list(struct list_head *head, struct list_head *cur)
+{
+	struct rmap_item *rmap_item;
+
+	cur = cur->next;
+	while (cur != head) {
+		rmap_item = list_entry(cur, struct rmap_item, link);
+		cur = cur->next;
+		remove_rmap_item_from_tree(rmap_item);
+		list_del(&rmap_item->link);
+		free_rmap_item(rmap_item);
+	}
+}
+
+static struct rmap_item *get_next_rmap_index(unsigned long addr,
+					     struct mm_struct *mm,
+					     struct list_head *head,
+					     struct list_head *cur,
+					     struct rmap_item *pre_alloc_rmap,
+					     int *used_pre_alloc)
+{
+	struct rmap_item *rmap_item;
+
+	cur = cur->next;
+	while (cur != head) {
+		rmap_item = list_entry(cur, struct rmap_item, link);
+		if (rmap_item->address == addr) {
+			return rmap_item;
+		} else if (rmap_item->address < addr) {
+			cur = cur->next;
+			remove_rmap_item_from_tree(rmap_item);
+			list_del(&rmap_item->link);
+			free_rmap_item(rmap_item);
+		} else {
+			*used_pre_alloc = 1;
+			create_new_rmap_item(cur->prev, pre_alloc_rmap, mm,
+					     addr, 0);
+			return pre_alloc_rmap;
+		}
+	}
+
+	*used_pre_alloc = 1;
+	create_new_rmap_item(cur->prev, pre_alloc_rmap, mm, addr, 0);
+
+	return pre_alloc_rmap;
+}
+
+static void remove_all_mm_slots(void)
+{
+	int i;
+	struct hlist_head *bucket;
+	struct hlist_node *node, *n;
+	struct mm_slot *mm_slot;
+
+	for (i = 0; i < nmm_slots_hash; ++i) {
+		bucket = &mm_slots_hash[i];
+		hlist_for_each_entry_safe(mm_slot, node, n, bucket, link)
+			remove_mm_slot(mm_slot, 1);
+	}
+}
+
+static void remove_all_untouched_mm_slots(void)
+{
+	int i;
+	struct hlist_head *bucket;
+	struct hlist_node *node, *n;
+	struct mm_slot *mm_slot;
+
+	for (i = 0; i < nmm_slots_hash; ++i) {
+		bucket = &mm_slots_hash[i];
+		hlist_for_each_entry_safe(mm_slot, node, n, bucket, link) {
+			if (!mm_slot->touched)
+				remove_mm_slot(mm_slot, 1);
+			else
+				mm_slot->touched = 0;
+		}
+	}
+}
+
+static struct mm_slot *get_next_mmlist(struct list_head *cur,
+				       struct mm_slot *pre_alloc_mm_slot,
+				       int *used_pre_alloc)
 {
-	struct ksm_mem_slot *slot;
+	struct mm_struct *mm;
+	struct mm_slot *mm_slot;
+
+	cur = cur->next;
+	while (cur != &init_mm.mmlist) {
+		mm = list_entry(cur, struct mm_struct, mmlist);
+		if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+			mm_slot = get_mm_slot(mm);
+			if (unlikely(atomic_read(&mm->mm_users) == 1)) {
+				if (mm_slot)
+					mm_slot->touched = 0;
+			} else {
+				if (!mm_slot) {
+					insert_to_mm_slots_hash(mm,
+							     pre_alloc_mm_slot);
+					*used_pre_alloc = 1;
+					mm_slot = pre_alloc_mm_slot;
+				}
+				mm_slot->touched = 1;
+				return mm_slot;
+			}
+		}
 
-	if (list_empty(&slots))
-		return;
+		cur = cur->next;
+	}
+	return NULL;
+}
 
-	list_for_each_entry(slot, &slots, link) {
-		if (ksm_scan->slot_index == slot)
-			return;
+ /* return -EAGAIN - no slots registered, nothing to be done */
+ static int scan_get_next_index(struct ksm_scan *ksm_scan)
+ {
+	struct mm_slot *slot;
+	struct vm_area_struct *vma;
+	struct rmap_item *pre_alloc_rmap_item;
+	struct mm_slot *pre_alloc_mm_slot;
+	int used_slot = 0;
+	int used_rmap = 0;
+	int ret = -EAGAIN;
+
+	pre_alloc_rmap_item = alloc_rmap_item();
+	if (!pre_alloc_rmap_item)
+		return -ENOMEM;
+	pre_alloc_mm_slot = alloc_mm_slot_item();
+	if (!pre_alloc_mm_slot) {
+		free_rmap_item(pre_alloc_rmap_item);
+		return -ENOMEM;
 	}
 
-	ksm_scan->slot_index = list_first_entry(&slots,
-						struct ksm_mem_slot, link);
-	ksm_scan->page_index = 0;
+	if (!ksm_scan->cur_mm_slot)
+		remove_all_untouched_mm_slots();
+
+	spin_lock(&mmlist_lock);
+
+ 	if (list_empty(&init_mm.mmlist))
+		goto out_unlock;
+
+	if (!ksm_scan->cur_mm_slot) {
+		ksm_scan->cur_mm_slot = get_next_mmlist(&init_mm.mmlist,
+							pre_alloc_mm_slot,
+							&used_slot);
+		if (!ksm_scan->cur_mm_slot)
+			goto out_unlock;
+
+		ksm_scan->addr_index = (unsigned long) -PAGE_SIZE;
+		ksm_scan->cur_rmap = list_entry(
+					      &ksm_scan->cur_mm_slot->rmap_list,
+					      struct rmap_item, link);
+
+		root_unstable_tree = RB_ROOT;
+	}
+
+	spin_unlock(&mmlist_lock);
+
+ 	slot = ksm_scan->cur_mm_slot;
+ 
+	down_read(&slot->mm->mmap_sem);
+
+	ksm_scan->addr_index += PAGE_SIZE;
+
+again:
+	vma = find_vma(slot->mm, ksm_scan->addr_index);
+	if (vma && vma->vm_flags & VM_MERGEABLE) {
+		if (ksm_scan->addr_index < vma->vm_start)
+			ksm_scan->addr_index = vma->vm_start;
+		up_read(&slot->mm->mmap_sem);
+
+		ksm_scan->cur_rmap =
+				  get_next_rmap_index(ksm_scan->addr_index,
+						      slot->mm,
+						      &slot->rmap_list,
+						      &ksm_scan->cur_rmap->link,
+						      pre_alloc_rmap_item,
+						      &used_rmap);
+
+		ret = 0;
+		goto out_free;
+	} else {
+		while (vma && !(vma->vm_flags & VM_MERGEABLE))
+			vma = vma->vm_next;
+
+		if (vma) {
+			ksm_scan->addr_index = vma->vm_start;
+			up_read(&slot->mm->mmap_sem);
+
+			ksm_scan->cur_rmap =
+				  get_next_rmap_index(ksm_scan->addr_index,
+						      slot->mm,
+						      &slot->rmap_list,
+						      &ksm_scan->cur_rmap->link,
+						      pre_alloc_rmap_item,
+						      &used_rmap);
+
+			ret = 0;
+			goto out_free;
+		}
+ 	}
+ 
+	up_read(&slot->mm->mmap_sem);
+
+	/*
+	 * Lets nuke all the rmap_items that above this current rmap
+	 * the reason that we do it is beacuse there were no vmas with the
+	 * VM_MERGEABLE flag set that had such addresses.
+	 */
+	update_rmap_list(&slot->rmap_list, &ksm_scan->cur_rmap->link);
+
+	/*
+	 * We have already used our pre allocated mm_slot, so we return and wait
+	 * this function will get called again.
+	 */
+	if (used_slot)
+		goto out_free;
+
+	spin_lock(&mmlist_lock);
+
+	ksm_scan->cur_mm_slot =
+			     get_next_mmlist(&ksm_scan->cur_mm_slot->mm->mmlist,
+					     pre_alloc_mm_slot,
+					     &used_slot);
+
+	/* look like we finished scanning the whole memory, starting again */
+	if (!ksm_scan->cur_mm_slot)
+		goto out_unlock;
+
+	spin_unlock(&mmlist_lock);
+
+	ksm_scan->addr_index = 0;
+	ksm_scan->cur_rmap = list_entry(&ksm_scan->cur_mm_slot->rmap_list,
+					struct rmap_item, link);
+	slot = ksm_scan->cur_mm_slot;
+
+	down_read(&slot->mm->mmap_sem);
+
+	goto again;
+
+out_unlock:
+	spin_unlock(&mmlist_lock);
+out_free:
+	if (!used_slot)
+		free_mm_slot_item(pre_alloc_mm_slot);
+	if (!used_rmap)
+		free_rmap_item(pre_alloc_rmap_item);
+	return ret;
 }
 
 /**
@@ -1394,21 +1385,23 @@ static void scan_update_old_index(struct ksm_scan *ksm_scan)
  */
 static int ksm_scan_start(struct ksm_scan *ksm_scan, unsigned int scan_npages)
 {
-	struct ksm_mem_slot *slot;
+	struct mm_slot *slot;
 	struct page *page[1];
+	struct mm_struct *mm;
+	struct rmap_item *rmap_item;
+	unsigned long addr;
 	int val;
 	int ret = 0;
 
-	down_read(&slots_lock);
-
-	scan_update_old_index(ksm_scan);
-
 	while (scan_npages > 0) {
 		ret = scan_get_next_index(ksm_scan);
 		if (ret)
 			goto out;
 
-		slot = ksm_scan->slot_index;
+		slot = ksm_scan->cur_mm_slot;
+		mm = slot->mm;
+		addr = ksm_scan->addr_index;
+		rmap_item = ksm_scan->cur_rmap;
 
 		cond_resched();
 
@@ -1416,97 +1409,34 @@ static int ksm_scan_start(struct ksm_scan *ksm_scan, unsigned int scan_npages)
 		 * If the page is swapped out or in swap cache, we don't want to
 		 * scan it (it is just for performance).
 		 */
-		down_read(&slot->mm->mmap_sem);
-		if (is_present_pte(slot->mm, slot->addr +
-				   ksm_scan->page_index * PAGE_SIZE)) {
-			val = get_user_pages(current, slot->mm, slot->addr +
-					     ksm_scan->page_index * PAGE_SIZE ,
-					      1, 0, 0, page, NULL);
-			up_read(&slot->mm->mmap_sem);
+		down_read(&mm->mmap_sem);
+		if (is_present_pte(mm, addr)) {
+			val = get_user_pages(current, mm, addr, 1, 0, 0, page,
+					     NULL);
+			up_read(&mm->mmap_sem);
 			if (val == 1) {
 				if (!PageKsm(page[0]))
-					cmp_and_merge_page(ksm_scan, page[0]);
+					cmp_and_merge_page(page[0], rmap_item);
 				put_page(page[0]);
 			}
 		} else {
-			up_read(&slot->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 		}
+
 		scan_npages--;
 	}
-	scan_get_next_index(ksm_scan);
 out:
-	up_read(&slots_lock);
 	return ret;
 }
 
-static const struct file_operations ksm_sma_fops = {
-	.release        = ksm_sma_release,
-	.unlocked_ioctl = ksm_sma_ioctl,
-	.compat_ioctl   = ksm_sma_ioctl,
-};
-
-static int ksm_dev_ioctl_create_shared_memory_area(void)
-{
-	int fd = -1;
-	struct ksm_sma *ksm_sma;
-
-	ksm_sma = kmalloc(sizeof(struct ksm_sma), GFP_KERNEL);
-	if (!ksm_sma) {
-		fd = -ENOMEM;
-		goto out;
-	}
-
-	INIT_LIST_HEAD(&ksm_sma->sma_slots);
-	ksm_sma->nregions = 0;
-
-	fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
-	if (fd < 0)
-		goto out_free;
-
-	return fd;
-out_free:
-	kfree(ksm_sma);
-out:
-	return fd;
-}
-
-static long ksm_dev_ioctl(struct file *filp,
-			  unsigned int ioctl, unsigned long arg)
-{
-	long r = -EINVAL;
-
-	switch (ioctl) {
-	case KSM_GET_API_VERSION:
-		r = KSM_API_VERSION;
-		break;
-	case KSM_CREATE_SHARED_MEMORY_AREA:
-		r = ksm_dev_ioctl_create_shared_memory_area();
-		break;
-	default:
-		break;
-	}
-	return r;
-}
-
-static const struct file_operations ksm_chardev_ops = {
-	.unlocked_ioctl = ksm_dev_ioctl,
-	.compat_ioctl   = ksm_dev_ioctl,
-	.owner          = THIS_MODULE,
-};
-
-static struct miscdevice ksm_dev = {
-	KSM_MINOR,
-	"ksm",
-	&ksm_chardev_ops,
-};
-
 int ksm_scan_thread(void *nothing)
 {
 	while (!kthread_should_stop()) {
 		if (ksmd_flags & ksm_control_flags_run) {
 			down_read(&ksm_thread_lock);
-			ksm_scan_start(&ksm_thread_scan,
-				       ksm_thread_pages_to_scan);
+			if (ksmd_flags & ksm_control_flags_run)
+				ksm_scan_start(&ksm_thread_scan,
+					       ksm_thread_pages_to_scan);
 			up_read(&ksm_thread_lock);
 			schedule_timeout_interruptible(
 					usecs_to_jiffies(ksm_thread_sleep));
@@ -1613,6 +1543,8 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 
 	down_write(&ksm_thread_lock);
 	ksmd_flags = k_flags;
+	if (!ksmd_flags)
+		remove_all_mm_slots();
 	up_write(&ksm_thread_lock);
 
 	if (ksmd_flags)
@@ -1696,13 +1628,10 @@ static int __init ksm_init(void)
 	if (r)
 		goto out;
 
-	r = rmap_hash_init();
+	r = mm_slots_hash_init();
 	if (r)
 		goto out_free1;
 
-	if (!regions_per_fd)
-		regions_per_fd = 1024;
-
 	ksm_thread = kthread_run(ksm_scan_thread, NULL, "kksmd");
 	if (IS_ERR(ksm_thread)) {
 		printk(KERN_ERR "ksm: creating kthread failed\n");
@@ -1710,27 +1639,19 @@ static int __init ksm_init(void)
 		goto out_free2;
 	}
 
-	r = misc_register(&ksm_dev);
-	if (r) {
-		printk(KERN_ERR "ksm: misc device register failed\n");
-		goto out_free3;
-	}
-
 	r = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (r) {
 		printk(KERN_ERR "ksm: register sysfs failed\n");
-		goto out_free4;
+		goto out_free3;
 	}
 
 	printk(KERN_WARNING "ksm loaded\n");
 	return 0;
 
-out_free4:
-	misc_deregister(&ksm_dev);
 out_free3:
 	kthread_stop(ksm_thread);
 out_free2:
-	rmap_hash_free();
+	mm_slots_hash_free();
 out_free1:
 	ksm_slab_free();
 out:
@@ -1740,10 +1661,9 @@ out:
 static void __exit ksm_exit(void)
 {
 	sysfs_remove_group(mm_kobj, &ksm_attr_group);
-	misc_deregister(&ksm_dev);
 	ksmd_flags = ksm_control_flags_run;
 	kthread_stop(ksm_thread);
-	rmap_hash_free();
+	mm_slots_hash_free();
 	ksm_slab_free();
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] ksm: add support for scanning procsses that were not modifided to use ksm
  2009-05-14  0:30     ` [PATCH 3/4] ksm: change ksm api to use madvise instead of ioctls Izik Eidus
@ 2009-05-14  0:30       ` Izik Eidus
  0 siblings, 0 replies; 22+ messages in thread
From: Izik Eidus @ 2009-05-14  0:30 UTC (permalink / raw)
  To: hugh
  Cc: linux-kernel, aarcange, akpm, nickpiggin, chrisw, linux-mm, riel,
	Izik Eidus

This patch adds merge_pid and unmerge_pid files to /sys/kernel/mm/ksm.
These files allow merging the memory of any application in the system;
just run:
echo pid_num > /sys/kernel/mm/ksm/merge_pid - and the memory will be merged.

This patch adds the MMF_VM_MERGEALL flag to the mm flags; this flag means that
all the vmas inside this mm_struct are mergeable by ksm.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/sched.h |    1 +
 mm/ksm.c              |  110 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7dc786a..c23af0c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -440,6 +440,7 @@ extern int get_dumpable(struct mm_struct *mm);
 #endif
 
 #define MMF_VM_MERGEABLE	9
+#define MMF_VM_MERGEALL		10
 
 struct sighand_struct {
 	atomic_t		count;
diff --git a/mm/ksm.c b/mm/ksm.c
index 901cce3..5bb42d8 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -479,7 +479,10 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 	unsigned long page_addr_in_vma;
 	pte_t orig_pte, *orig_ptep;
 
-	if(!(vma->vm_flags & VM_MERGEABLE))
+
+
+	if(!(vma->vm_flags & VM_MERGEABLE) &&
+	   !test_bit(MMF_VM_MERGEALL, &mm->flags))
 		goto out;
 
 	if (!PageAnon(oldpage))
@@ -1213,7 +1216,8 @@ static struct mm_slot *get_next_mmlist(struct list_head *cur,
 	cur = cur->next;
 	while (cur != &init_mm.mmlist) {
 		mm = list_entry(cur, struct mm_struct, mmlist);
-		if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		if (test_bit(MMF_VM_MERGEABLE, &mm->flags) ||
+		    test_bit(MMF_VM_MERGEALL, &mm->flags)) {
 			mm_slot = get_mm_slot(mm);
 			if (unlikely(atomic_read(&mm->mm_users) == 1)) {
 				if (mm_slot)
@@ -1245,6 +1249,7 @@ static struct mm_slot *get_next_mmlist(struct list_head *cur,
 	int used_slot = 0;
 	int used_rmap = 0;
 	int ret = -EAGAIN;
+	int merge_all;
 
 	pre_alloc_rmap_item = alloc_rmap_item();
 	if (!pre_alloc_rmap_item)
@@ -1287,8 +1292,10 @@ static struct mm_slot *get_next_mmlist(struct list_head *cur,
 	ksm_scan->addr_index += PAGE_SIZE;
 
 again:
+	merge_all = test_bit(MMF_VM_MERGEALL, &slot->mm->flags);
+
 	vma = find_vma(slot->mm, ksm_scan->addr_index);
-	if (vma && vma->vm_flags & VM_MERGEABLE) {
+	if (vma && (vma->vm_flags & VM_MERGEABLE || merge_all)) {
 		if (ksm_scan->addr_index < vma->vm_start)
 			ksm_scan->addr_index = vma->vm_start;
 		up_read(&slot->mm->mmap_sem);
@@ -1304,7 +1311,7 @@ again:
 		ret = 0;
 		goto out_free;
 	} else {
-		while (vma && !(vma->vm_flags & VM_MERGEABLE))
+		while (vma && (!(vma->vm_flags & VM_MERGEABLE) && !merge_all))
 			vma = vma->vm_next;
 
 		if (vma) {
@@ -1455,6 +1462,99 @@ int ksm_scan_thread(void *nothing)
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
 
+static ssize_t merge_pid_show(struct kobject *kobj, struct kobj_attribute *attr,
+			      char *buf)
+{
+	unsigned int usecs;
+
+	down_read(&ksm_thread_lock);
+	usecs = ksm_thread_sleep;
+	up_read(&ksm_thread_lock);
+
+	return sprintf(buf, "\n");
+}
+
+static ssize_t merge_pid_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	struct task_struct *task;
+	struct mm_struct *mm = NULL;
+	unsigned long pid;
+	int err;
+
+	err = strict_strtoul(buf, 10, &pid);
+	if (err)
+		return 0;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		mm = get_task_mm(task);
+	read_unlock(&tasklist_lock);
+
+	if (mm) {
+		down_write(&mm->mmap_sem);
+		set_bit(MMF_VM_MERGEALL, &mm->flags);
+		up_write(&mm->mmap_sem);
+
+		spin_lock(&mmlist_lock);
+		if (unlikely(list_empty(&mm->mmlist)))
+			list_add(&mm->mmlist, &init_mm.mmlist);
+		if (unlikely(!(mmlist_mask & MMLIST_KSM)))
+			mmlist_mask |= MMLIST_KSM;
+		spin_unlock(&mmlist_lock);
+
+		mmput(mm);
+	}
+
+	return count;
+}
+KSM_ATTR(merge_pid);
+
+static ssize_t unmerge_pid_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	unsigned int usecs;
+
+	down_read(&ksm_thread_lock);
+	usecs = ksm_thread_sleep;
+	up_read(&ksm_thread_lock);
+
+	return sprintf(buf, "\n");
+}
+
+static ssize_t unmerge_pid_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct task_struct *task;
+	struct mm_struct *mm = NULL;
+	unsigned long pid;
+	int err;
+
+	err = strict_strtoul(buf, 10, &pid);
+	if (err)
+		return 0;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		mm = get_task_mm(task);
+	read_unlock(&tasklist_lock);
+
+	if (mm) {
+		down_write(&mm->mmap_sem);
+		clear_bit(MMF_VM_MERGEALL, &mm->flags);
+		up_write(&mm->mmap_sem);
+
+		mmput(mm);
+	}
+
+	return count;
+}
+KSM_ATTR(unmerge_pid);
+
 static ssize_t sleep_show(struct kobject *kobj, struct kobj_attribute *attr,
 			  char *buf)
 {
@@ -1605,6 +1705,8 @@ static ssize_t max_kernel_pages_show(struct kobject *kobj,
 KSM_ATTR(max_kernel_pages);
 
 static struct attribute *ksm_attrs[] = {
+	&merge_pid_attr.attr,
+	&unmerge_pid_attr.attr,
 	&sleep_attr.attr,
 	&pages_to_scan_attr.attr,
 	&run_attr.attr,
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-05-14  0:30 [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
  2009-05-14  0:30 ` [PATCH 1/4] madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls Izik Eidus
@ 2009-06-08 16:18 ` Hugh Dickins
  2009-06-08 16:35   ` [PATCH mmotm] ksm: stop scan skipping pages Hugh Dickins
                     ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Hugh Dickins @ 2009-06-08 16:18 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Thu, 14 May 2009, Izik Eidus wrote:

> This is comment request for ksm api changes.
> The following patchs move the api to use madvise instead of ioctls.

Thanks a lot for doing this.

I'm afraid more than three weeks have gone past, and the 2.6.31
merge window is almost upon us, and you haven't even got a comment
out of me: I apologize for that.

Although my (lack of) response is indistinguishable from a conspiracy
to keep KSM out of the kernel, I beg to assure you that's not the case.
I do want KSM to go in - though I never shared Andrew's optimism that it
is 2.6.31 material: I've too long a list of notes/doubts on the existing
implementation, which I've not had time to expand upon to you; but I
don't think there are any killer issues, we should be able to work
things out as 2.6.31 goes through its -rcs, and aim for 2.6.32.

But let's get this change of interface sorted out first.

I remain convinced that it's right to go the madvise() route,
though I don't necessarily like the details in your patches.
And I've come to the conclusion that the only way I can force
myself to contribute constructively, is to start from these
patches, and shift things around until it's as I think it
should be, then see what you think of the result.

I notice that you chose to integrate fully (though not fully enough)
with vmas, adding a VM_MERGEABLE flag.  Fine, that's probably going
to be safest in the end, and I'll follow you; but it is further than
I was necessarily asking you to go - it might have been okay to use
the madvise() interface, but just to declare areas of address space
(not necessarily backed by mappings) to ksm.c, as you did via /dev/ksm.
But it's fairly likely that if you had stayed with that, it would have
proved problematic later, so let's go forward with the full integration
with vmas.
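
For concreteness, the application-side view of the interface we're
converging on is just an madvise() call on the region - a sketch only,
assuming the MADV_SHAREABLE definition that patch 1/4 adds is visible
in the headers being compiled against:

/*
 * Sketch of the userspace call under the proposed interface.
 * MADV_SHAREABLE is the value patch 1/4 adds; it is not in vanilla
 * headers, so this is illustration rather than a test case.
 */
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
	size_t len = 16 * 4096;
	void *buf;

	/* madvise() wants a page-aligned region */
	if (posix_memalign(&buf, 4096, len))
		return 1;

	/* ask ksm to consider this region for merging */
	if (madvise(buf, len, MADV_SHAREABLE)) {
		perror("madvise");
		return 1;
	}
	return 0;
}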

> 
> Before i will describe the patchs, i want to note that i rewrote this
> patch seires alot of times, all the other methods that i have tried had some
> fandumatel issues with them.
> The current implemantion does have some issues with it, but i belive they are
> all solveable and better than the other ways to do it.
> If you feel you have better way how to do it, please tell me :).
> 
> Ok when we changed ksm to use madvise instead of ioctls we wanted to keep
> the following rules:
> 
> Not to increase the host memory usage if ksm is not being used (even when it
> is compiled), this mean not to add fields into mm_struct / vm_area_struct...
> 
> Not to effect the system performence with notifiers that will have to block
> while ksm code is running under some lock - ksm is helper, it should do it
> work quitely, - this why i dropped patch that i did that add mmu notifiers
> support inside ksm.c and recived notifications from the MM (for example
> when vma is destroyed (invalidate_range...)
> 
> Not to change the MM logic.
> 
> Trying to touch as less code as we can outisde ksm.c

These are well-intentioned goals, and thank you for making the effort
to follow them; but I'm probably going to depart from them.  I'd
rather we put in what's necessary and appropriate, and then cut
that down if necessary.

> 
> Taking into account all this rules, the end result that we have came with is:
> mmlist is now not used only by swapoff, but by ksm as well, this mean that
> each time you call to madvise for to set vma as MERGEABLE, madvise will check
> if the mm_struct is inside the mmlist and will insert it in case it isnt.
> It is belived that it is better to hurt little bit the performence of swapoff
> than adding another list into the mm_struct.

That was a perfectly sensible thing for you to do, given your rules
above; but I don't really like the result, and think it'll be clearer
to have your own list.  Whether by mm or by vma, I've not yet decided:
by mm won't add enough #ifdef CONFIG_KSM bloat to worry about; by vma,
we might be able to reuse some prio_tree fields, I've not checked yet.

> 
> One issue that should be note is: after mm_struct is going into the mmlist, it
> wont be kicked from it until the procsses is die (even if there are no more
> VM_MERGEABLE vmas), this doesnt mean memory is wasted, but it does mean ksm
> will spend little more time in doing cur = cur->next if(...).
> 
> Another issue is: when procsess is die, ksm will have to find (when scanning)
> that its mm_users == 1 and then do mmput(), this mean that there might be dealy
> from the time that someone do kill until the mm is really free -
> i am open for suggestions on how to improve this...

You've resisted putting in the callbacks you need.  I think they were
always (i.e. even when using /dev/ksm) necessary, but should become
more obvious now we have this tighter integration with mm's vmas.

You seem to have no callback in fork: doesn't that mean that KSM
pages get into mms of which mm/ksm.c has no knowledge?  You had
no callback in mremap move: doesn't that mean that KSM pages could
be moved into areas which mm/ksm.c never tracked?  Though that's
probably no issue now we move over to vmas: they should now travel
with their VM flag.  You have no callback in unmap: doesn't that
mean that KSM never knows when its pages have gone away?
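
Roughly the kind of hooks I have in mind - a sketch only, with names
invented here for illustration, not anything in the current patchset:

/*
 * Hypothetical entry points mm/ksm.c could export, to be called from
 * the core mm at the points mentioned above.  None of these names
 * exist in the patches as posted.
 */
#ifdef CONFIG_KSM
void ksm_dup_mm(struct mm_struct *oldmm, struct mm_struct *mm);	/* fork */
void ksm_unmap_range(struct mm_struct *mm,
		     unsigned long start, unsigned long end);	/* unmap, mremap move */
void ksm_exit_mm(struct mm_struct *mm);				/* exit_mmap/mmput */
#else
static inline void ksm_dup_mm(struct mm_struct *oldmm, struct mm_struct *mm)
{
}
static inline void ksm_unmap_range(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
}
static inline void ksm_exit_mm(struct mm_struct *mm)
{
}
#endif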

(Closing the /dev/ksm fd used to clean up some of this, in the
end; but the lifetime of the fd can be so different from that of
the mapped area, I've felt very unsafe with that technique - a good
technique when you're trying to sneak in special handling for your
special driver, but not a good technique once you go to mainline.)

I haven't worked out the full consequences of these lost pages:
perhaps it's no worse than that you could never properly enforce
your ksm_thread_max_kernel_pages quota.

> 
> (when someone do echo 0 > /sys/kernel/mm/ksm/run ksm will throw away all the
> memory, so condtion when the memory wont ever be free wont happen)
> 
> 
> Another important thing is: this is request for comment, i still not sure few
> things that we have made here are totaly safe:
> (the mmlist sync with drain_mmlist, and the handle_vmas() function in madvise,
> the logic inside ksm for searching the next virtual address on the vmas,
> and so on...)
> The main purpuse of this is to ask if the new interface is what you guys
> want..., and if you like the impelmantion desgin.

It's in the right direction.  But it would be silly for me to start
criticizing your details now: I need to try doing the same, that will
force me to think deeply enough about it, and I may then be led to
the same decisions as you made.

> 
> (I have added option to scan closed support applications as well)

That's a nice detail that I'll find very useful for testing,
but we might want to hold it back longer than the rest.  I just get
naturally more cautious when we consider interfaces for doing things
to other processes, and want to spend even longer over it.

> 
> 
> Thanks.
> 
> Izik Eidus (4):
>   madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.

I didn't understand why you went over to VM_MERGEABLE but stuck
with MADV_SHAREABLE: there's a confusing mix of shareables and
mergeables, I'll head for mergeables throughout, though keep to "KSM".

>   mmlist: share mmlist with ksm.
>   ksm: change ksm api to use madvise instead of ioctls.
>   ksm: add support for scanning procsses that were not modifided to use
>     ksm

While I'm being communicative, let me mention two things,
not related to this RFC patchset, but to what's currently in mmotm.

I've a bugfix patch to scan_get_next_index(), I'll send that to you
in a few moments.

And a question on your page_wrprotect() addition to mm/rmap.c: though
it may contain some important checks (I'm thinking of the get_user_pages
protection), isn't it essentially redundant, and should be removed from
the patchset?  If we have a private page which is mapped into more than
the one address space by which we arrive at it, then, quite independent
of KSM, it needs to be write-protected already to prevent mods in one
address space leaking into another - doesn't it?  So I see no need for
the rmap'ped write-protection there, just make the checks and write
protect the pte you have in ksm.c.  Or am I missing something?
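
Something of this shape inside ksm.c is all I mean - a sketch only,
with the PageSwapCache and get_user_pages checks deliberately left out:

/*
 * Sketch: write protect the one pte ksm.c already holds for
 * (vma, addr), relying on COW having already write-protected any
 * other mapping of an anonymous page.  Caller holds the pte lock for
 * ptep; the O_DIRECT/get_user_pages checks discussed above are
 * omitted from this illustration.
 */
static void wrprotect_single_pte(struct vm_area_struct *vma,
				 unsigned long addr, pte_t *ptep)
{
	pte_t entry;

	if (!pte_write(*ptep))
		return;		/* already read-only, nothing to do */

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	entry = ptep_clear_flush(vma, addr, ptep);
	entry = pte_wrprotect(entry);
	set_pte_at(vma->vm_mm, addr, ptep, entry);
}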

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH mmotm] ksm: stop scan skipping pages
  2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
@ 2009-06-08 16:35   ` Hugh Dickins
  2009-06-08 17:42     ` Izik Eidus
  2009-06-08 17:17   ` [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
  2009-06-08 22:57   ` Andrea Arcangeli
  2 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-08 16:35 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

KSM can be slow to identify all mergeable pages.  There's an off-by-one
in how ksm_scan_start() proceeds (see how it does a scan_get_next_index
at the head of its loop, but also on leaving its loop), which causes it
to skip over a page occasionally.  Fix that.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 mm/ksm.c |   46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

--- mmotm/mm/ksm.c	2009-06-08 13:14:36.000000000 +0100
+++ fixed/mm/ksm.c	2009-06-08 13:18:30.000000000 +0100
@@ -1331,30 +1331,26 @@ out:
 /* return -EAGAIN - no slots registered, nothing to be done */
 static int scan_get_next_index(struct ksm_scan *ksm_scan)
 {
-	struct ksm_mem_slot *slot;
+	struct list_head *next;
 
 	if (list_empty(&slots))
 		return -EAGAIN;
 
-	slot = ksm_scan->slot_index;
-
 	/* Are there pages left in this slot to scan? */
-	if ((slot->npages - ksm_scan->page_index - 1) > 0) {
-		ksm_scan->page_index++;
+	ksm_scan->page_index++;
+	if (ksm_scan->page_index < ksm_scan->slot_index->npages)
 		return 0;
-	}
 
-	list_for_each_entry_from(slot, &slots, link) {
-		if (slot == ksm_scan->slot_index)
-			continue;
-		ksm_scan->page_index = 0;
-		ksm_scan->slot_index = slot;
+	ksm_scan->page_index = 0;
+	next = ksm_scan->slot_index->link.next;
+	if (next != &slots) {
+		ksm_scan->slot_index =
+			list_entry(next, struct ksm_mem_slot, link);
 		return 0;
 	}
 
 	/* look like we finished scanning the whole memory, starting again */
 	root_unstable_tree = RB_ROOT;
-	ksm_scan->page_index = 0;
 	ksm_scan->slot_index = list_first_entry(&slots,
 						struct ksm_mem_slot, link);
 	return 0;
@@ -1366,21 +1362,22 @@ static int scan_get_next_index(struct ks
  * pointed to was released so we have to call this function every time after
  * taking the slots_lock
  */
-static void scan_update_old_index(struct ksm_scan *ksm_scan)
+static int scan_update_old_index(struct ksm_scan *ksm_scan)
 {
 	struct ksm_mem_slot *slot;
 
 	if (list_empty(&slots))
-		return;
+		return -EAGAIN;
 
 	list_for_each_entry(slot, &slots, link) {
 		if (ksm_scan->slot_index == slot)
-			return;
+			return 0;
 	}
 
 	ksm_scan->slot_index = list_first_entry(&slots,
 						struct ksm_mem_slot, link);
 	ksm_scan->page_index = 0;
+	return 0;
 }
 
 /**
@@ -1399,20 +1396,14 @@ static int ksm_scan_start(struct ksm_sca
 	struct ksm_mem_slot *slot;
 	struct page *page[1];
 	int val;
-	int ret = 0;
+	int ret;
 
 	down_read(&slots_lock);
+	ret = scan_update_old_index(ksm_scan);
 
-	scan_update_old_index(ksm_scan);
-
-	while (scan_npages > 0) {
-		ret = scan_get_next_index(ksm_scan);
-		if (ret)
-			goto out;
-
-		slot = ksm_scan->slot_index;
-
+	while (scan_npages && !ret) {
 		cond_resched();
+		slot = ksm_scan->slot_index;
 
 		/*
 		 * If the page is swapped out or in swap cache, we don't want to
@@ -1433,10 +1424,11 @@ static int ksm_scan_start(struct ksm_sca
 		} else {
 			up_read(&slot->mm->mmap_sem);
 		}
+
+		ret = scan_get_next_index(ksm_scan);
 		scan_npages--;
 	}
-	scan_get_next_index(ksm_scan);
-out:
+
 	up_read(&slots_lock);
 	return ret;
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
  2009-06-08 16:35   ` [PATCH mmotm] ksm: stop scan skipping pages Hugh Dickins
@ 2009-06-08 17:17   ` Izik Eidus
  2009-06-08 18:32     ` Hugh Dickins
  2009-06-08 22:57   ` Andrea Arcangeli
  2 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-08 17:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Thu, 14 May 2009, Izik Eidus wrote:
>
>   
>> This is comment request for ksm api changes.
>> The following patchs move the api to use madvise instead of ioctls.
>>     
>
> Thanks a lot for doing this.
>
> I'm afraid more than three weeks have gone past, and the 2.6.31
> merge window is almost upon us, and you haven't even got a comment
> out of me: I apologize for that.
>
> Although my (lack of) response is indistinguishable from a conspiracy
> to keep KSM out of the kernel, I beg to assure you that's not the case.
> I do want KSM to go in - though I never shared Andrew's optimism that it
> is 2.6.31 material: I've too long a list of notes/doubts on the existing
> implementation, which I've not had time to expand upon to you; but I
> don't think there are any killer issues, we should be able to work
> things out as 2.6.31 goes through its -rcs, and aim for 2.6.32.
>
> But let's get this change of interface sorted out first.
>   

Agreed.

> I remain convinced that it's right to go the madvise() route,
> though I don't necessarily like the details in your patches.
> And I've come to the conclusion that the only way I can force
> myself to contribute constructively, is to start from these
> patches, and shift things around until it's as I think it
> should be, then see what you think of the result.
>   

Sounds like the perfect way to go.

> I notice that you chose to integrate fully (though not fully enough)
> with vmas, adding a VM_MERGEABLE flag.  Fine, that's probably going
> to be safest in the end, and I'll follow you; but it is further than
> I was necessarily asking you to go - it might have been okay to use
> the madvise() interface, but just to declare areas of address space
> (not necessarily backed by mappings) to ksm.c, as you did via /dev/ksm.
> But it's fairly likely that if you had stayed with that, it would have
> proved problematic later, so let's go forward with the full integration
> with vmas.
>
>   
>> Before i will describe the patchs, i want to note that i rewrote this
>> patch seires alot of times, all the other methods that i have tried had some
>> fandumatel issues with them.
>> The current implemantion does have some issues with it, but i belive they are
>> all solveable and better than the other ways to do it.
>> If you feel you have better way how to do it, please tell me :).
>>
>> Ok when we changed ksm to use madvise instead of ioctls we wanted to keep
>> the following rules:
>>
>> Not to increase the host memory usage if ksm is not being used (even when it
>> is compiled), this mean not to add fields into mm_struct / vm_area_struct...
>>
>> Not to effect the system performence with notifiers that will have to block
>> while ksm code is running under some lock - ksm is helper, it should do it
>> work quitely, - this why i dropped patch that i did that add mmu notifiers
>> support inside ksm.c and recived notifications from the MM (for example
>> when vma is destroyed (invalidate_range...)
>>
>> Not to change the MM logic.
>>
>> Trying to touch as less code as we can outisde ksm.c
>>     
>
> These are well-intentioned goals, and thank you for making the effort
> to follow them; but I'm probably going to depart from them.  I'd
> rather we put in what's necessary and appropriate, and then cut
> that down if necessary.
>   

That's the way to go, I just didn't want to scare anyone (it was obvious to 
me that it is needed, I just wanted you to say it is needed)

>   
>> Taking into account all this rules, the end result that we have came with is:
>> mmlist is now not used only by swapoff, but by ksm as well, this mean that
>> each time you call to madvise for to set vma as MERGEABLE, madvise will check
>> if the mm_struct is inside the mmlist and will insert it in case it isnt.
>> It is belived that it is better to hurt little bit the performence of swapoff
>> than adding another list into the mm_struct.
>>     
>
> That was a perfectly sensible thing for you to do, given your rules
> above; but I don't really like the result, and think it'll be clearer
> to have your own list.  Whether by mm or by vma, I've not yet decided:
> by mm won't add enough #idef CONFIG_KSM bloat to worry about; by vma,
> we might be able to reuse some prio_tree fields, I've not checked yet.
>
>   
>> One issue that should be note is: after mm_struct is going into the mmlist, it
>> wont be kicked from it until the procsses is die (even if there are no more
>> VM_MERGEABLE vmas), this doesnt mean memory is wasted, but it does mean ksm
>> will spend little more time in doing cur = cur->next if(...).
>>
>> Another issue is: when procsess is die, ksm will have to find (when scanning)
>> that its mm_users == 1 and then do mmput(), this mean that there might be dealy
>> from the time that someone do kill until the mm is really free -
>> i am open for suggestions on how to improve this...
>>     
>
> You've resisted putting in the callbacks you need.  I think they were
> always (i.e. even when using /dev/ksm) necessary, but should become
> more obvious now we have this tighter integration with mm's vmas.
>
> You seem to have no callback in fork: doesn't that mean that KSM
> pages get into mms of which mm/ksm.c has no knowledge?  
What do you mean by this? Should the vma flags be copied into the child, 
so that ksm will scan the vma?
(The only thing I have to check is: maybe the process itself won't go into 
the mmlist, and therefore ksm won't know about it)

> You had
> no callback in mremap move: doesn't that mean that KSM pages could
> be moved into areas which mm/ksm.c never tracked?  Though that's
> probably no issue now we move over to vmas: they should now travel
> with their VM flag.  You have no callback in unmap: doesn't that
> mean that KSM never knows when its pages have gone away?
>   

Yes, adding all these callbacks would make ksm much happier. Again, I 
didn't want to scare anyone...

> (Closing the /dev/ksm fd used to clean up some of this, in the
> end; but the lifetime of the fd can be so different from that of
> the mapped area, I've felt very unsafe with that technique - a good
> technique when you're trying to sneak in special handling for your
> special driver, but not a good technique once you go to mainline.)
>
> I haven't worked out the full consequences of these lost pages:
> perhaps it's no worse than that you could never properly enforce
> your ksm_thread_max_kernel_pages quota.
>   

You mean the shared pages outside the stable tree comment?

>   
>> (when someone do echo 0 > /sys/kernel/mm/ksm/run ksm will throw away all the
>> memory, so condtion when the memory wont ever be free wont happen)
>>
>>
>> Another important thing is: this is request for comment, i still not sure few
>> things that we have made here are totaly safe:
>> (the mmlist sync with drain_mmlist, and the handle_vmas() function in madvise,
>> the logic inside ksm for searching the next virtual address on the vmas,
>> and so on...)
>> The main purpuse of this is to ask if the new interface is what you guys
>> want..., and if you like the impelmantion desgin.
>>     
>
> It's in the right direction.  But it would be silly for me to start
> criticizing your details now: I need to try doing the same, that will
> force me to think deeply enough about it, and I may then be led to
> the same decisions as you made.
>
>   
>> (I have added option to scan closed support applications as well)
>>     
>
> That's a nice detail that I'll find very useful for testing,
> but we might want to hold it back longer than the rest.  I just get
> naturally more cautious when we consider interfaces for doing things
> to other processes, and want to spend even longer over it.
>
>   
>> Thanks.
>>
>> Izik Eidus (4):
>>   madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls.
>>     
>
> I didn't understand why you went over to VM_MERGEABLE but stuck
> with MADV_SHAREABLE: there's a confusing mix of shareables and
> mergeables, I'll head for mergeables throughout, though keep to "KSM".
>
>   
>>   mmlist: share mmlist with ksm.
>>   ksm: change ksm api to use madvise instead of ioctls.
>>   ksm: add support for scanning procsses that were not modifided to use
>>     ksm
>>     
>
> While I'm being communicative, let me mention two things,
> not related to this RFC patchset, but to what's currently in mmotm.
>
> I've a bugfix patch to scan_get_next_index(), I'll send that to you
>   

Thanks.


> in a few moments.
>
> And a question on your page_wrprotect() addition to mm/rmap.c: though
> it may contain some important checks (I'm thinking of the get_user_pages
> protection), isn't it essentially redundant, and should be removed from
> the patchset?  If we have a private page which is mapped into more than
> the one address space by which we arrive at it, then, quite independent
> of KSM, it needs to be write-protected already to prevent mods in one
> address space leaking into another - doesn't it?  So I see no need for
> the rmap'ped write-protection there, just make the checks and write
> protect the pte you have in ksm.c.  Or am I missing something?
>   

Ok, so we have here 2 cases for ksm:
1:
    When the page is anonymous and is mapped readonly between several 
processes:
         for this you say we shouldn't walk over the rmap and try to 
write protect what is already write protected...

2:
   When the page is anonymous and is mapped writable by just one process:
       for this you say it is better to handle it directly from inside 
ksm because we already know
       the virtual address mapping of this page?

       So about this: you are right about the fact that we might not 
have to walk over the rmap of the page for pages with mapcount 1,
       but isn't it cleaner to deal with it inside rmap.c?
       Another thing: get_user_pages() protection is needed even in 
that case, because get_user_pages_fast() is lockless, so O_DIRECT
       can run under our legs after we write protect the page.


      Anyway, nothing critical, I don't mind moving 
page_write_protect_one() into ksm.c; I still think the get_user_pages 
protection is needed.


Thanks a lot for your time.
> Hugh
>   


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mmotm] ksm: stop scan skipping pages
  2009-06-08 16:35   ` [PATCH mmotm] ksm: stop scan skipping pages Hugh Dickins
@ 2009-06-08 17:42     ` Izik Eidus
  2009-06-08 18:01       ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-08 17:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hugh Dickins wrote:
> KSM can be slow to identify all mergeable pages.  There's an off-by-one
> in how ksm_scan_start() proceeds (see how it does a scan_get_next_index
> at the head of its loop, but also on leaving its loop), which causes it
> to skip over a page occasionally.  Fix that.
>
> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>
>  mm/ksm.c |   46 +++++++++++++++++++---------------------------
>  1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- mmotm/mm/ksm.c	2009-06-08 13:14:36.000000000 +0100
> +++ fixed/mm/ksm.c	2009-06-08 13:18:30.000000000 +0100
> @@ -1331,30 +1331,26 @@ out:
>  /* return -EAGAIN - no slots registered, nothing to be done */
>  static int scan_get_next_index(struct ksm_scan *ksm_scan)
>  {
> -	struct ksm_mem_slot *slot;
> +	struct list_head *next;
>  
>  	if (list_empty(&slots))
>  		return -EAGAIN;
>  
> -	slot = ksm_scan->slot_index;
> -
>  	/* Are there pages left in this slot to scan? */
> -	if ((slot->npages - ksm_scan->page_index - 1) > 0) {
> -		ksm_scan->page_index++;
> +	ksm_scan->page_index++;
> +	if (ksm_scan->page_index < ksm_scan->slot_index->npages)
>  		return 0;
> -	}
>  
> -	list_for_each_entry_from(slot, &slots, link) {
> -		if (slot == ksm_scan->slot_index)
> -			continue;
> -		ksm_scan->page_index = 0;
> -		ksm_scan->slot_index = slot;
> +	ksm_scan->page_index = 0;
> +	next = ksm_scan->slot_index->link.next;
> +	if (next != &slots) {
> +		ksm_scan->slot_index =
> +			list_entry(next, struct ksm_mem_slot, link);
>  		return 0;
>  	}
>  
>  	/* look like we finished scanning the whole memory, starting again */
>  	root_unstable_tree = RB_ROOT;
> -	ksm_scan->page_index = 0;
>  	ksm_scan->slot_index = list_first_entry(&slots,
>  						struct ksm_mem_slot, link);
>  	return 0;
> @@ -1366,21 +1362,22 @@ static int scan_get_next_index(struct ks
>   * pointed to was released so we have to call this function every time after
>   * taking the slots_lock
>   */
> -static void scan_update_old_index(struct ksm_scan *ksm_scan)
> +static int scan_update_old_index(struct ksm_scan *ksm_scan)
>  {
>  	struct ksm_mem_slot *slot;
>  
>  	if (list_empty(&slots))
> -		return;
> +		return -EAGAIN;
>  
>  	list_for_each_entry(slot, &slots, link) {
>  		if (ksm_scan->slot_index == slot)
> -			return;
> +			return 0;
>  	}
>  
>  	ksm_scan->slot_index = list_first_entry(&slots,
>  						struct ksm_mem_slot, link);
>  	ksm_scan->page_index = 0;
> +	return 0;
>  }
>  
>  /**
> @@ -1399,20 +1396,14 @@ static int ksm_scan_start(struct ksm_sca
>  	struct ksm_mem_slot *slot;
>  	struct page *page[1];
>  	int val;
> -	int ret = 0;
> +	int ret;
>  
>  	down_read(&slots_lock);
> +	ret = scan_update_old_index(ksm_scan);
>  
> -	scan_update_old_index(ksm_scan);
> -
> -	while (scan_npages > 0) {
> -		ret = scan_get_next_index(ksm_scan);
> -		if (ret)
> -			goto out;
> -
> -		slot = ksm_scan->slot_index;
> -
> +	while (scan_npages && !ret) {
>  		cond_resched();
> +		slot = ksm_scan->slot_index;
>  
>  		/*
>  		 * If the page is swapped out or in swap cache, we don't want to
> @@ -1433,10 +1424,11 @@ static int ksm_scan_start(struct ksm_sca
>  		} else {
>  			up_read(&slot->mm->mmap_sem);
>  		}
> +
> +		ret = scan_get_next_index(ksm_scan);
>  		scan_npages--;
>  	}
> -	scan_get_next_index(ksm_scan);
> -out:
> +
>  	up_read(&slots_lock);
>  	return ret;
>  }
>   
ACK.

Thanks for the fix.
(I saw it while I wrote the RFC patch for the madvise change, but because 
I thought that the RFC fixes this (you can see the removal of the second 
call to scan_get_next_index()), and that we are moving to madvise, I thought 
that no patch was needed for this code; guess I was wrong.)

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mmotm] ksm: stop scan skipping pages
  2009-06-08 17:42     ` Izik Eidus
@ 2009-06-08 18:01       ` Hugh Dickins
  2009-06-08 20:12         ` Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-08 18:01 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Mon, 8 Jun 2009, Izik Eidus wrote:
> 
> Thanks for the fix,
> (I saw it while i wrote the RFC patch for the madvise, but beacuse that i
> thought that the RFC fix this (you can see the removel of the second call to
> scan_get_next_index()), and we move to madvise, I thought that no patch is
> needed for this code, guess I was wrong)

Ah, no, I hadn't noticed that, this patch is from several weeks ago,
before you even posted the madvise() version.

I think myself that we ought to fix the algorithm as it stands now in
mmotm, rather than hiding the fix in amongst later interface changes.

But it's not a big deal, so long as it gets fixed in the end.

By the time Andrew sends KVM to Linus, it shouldn't be the
patches currently in mmotm with more on top: they should be
re-presented with all trace of /dev/ksm gone by then.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 17:17   ` [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
@ 2009-06-08 18:32     ` Hugh Dickins
  2009-06-08 20:10       ` Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-08 18:32 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Mon, 8 Jun 2009, Izik Eidus wrote:
> Hugh Dickins wrote:
> >
> > You seem to have no callback in fork: doesn't that mean that KSM
> > pages get into mms of which mm/ksm.c has no knowledge?  
> What you mean by this?, should the vma flags be copyed into the child and
> therefore ksm will scan the vma?

That part should be happening automatically now
(you'd have to add code to avoid copying VM_ flags).

> (only thing i have to check is: maybe the process itself wont go into the
> mmlist, and therefore ksm wont know about it)

But that part needs something to make it happen.  In the case of
swap, it's done over in copy_one_pte; the important thing is to
do it where the locking is such that it cannot get missed,
I can't say where is the right place yet.
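
For illustration only, the swap case does it along these lines (the
helper name here is invented, and where to call it is exactly the open
question):

/*
 * Sketch of the "make sure the child mm is on the list" step fork
 * would need, modelled on what copy_one_pte() already does for swap
 * entries.  ksm_get_mm_on_fork() is a hypothetical name.
 */
static void ksm_get_mm_on_fork(struct mm_struct *src_mm,
			       struct mm_struct *dst_mm)
{
	if (unlikely(list_empty(&dst_mm->mmlist))) {
		spin_lock(&mmlist_lock);
		if (list_empty(&dst_mm->mmlist))
			list_add(&dst_mm->mmlist, &src_mm->mmlist);
		spin_unlock(&mmlist_lock);
	}
}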

> 
> > You had
> > no callback in mremap move: doesn't that mean that KSM pages could
> > be moved into areas which mm/ksm.c never tracked?  Though that's
> > probably no issue now we move over to vmas: they should now travel
> > with their VM flag.  You have no callback in unmap: doesn't that
> > mean that KSM never knows when its pages have gone away?
> >   
> 
> Yes, Adding all this callbacks would make ksm much more happy, Again, i didnt
> want to scare anyone...

Izik, you're too kindly ;)

> 
> > (Closing the /dev/ksm fd used to clean up some of this, in the
> > end; but the lifetime of the fd can be so different from that of
> > the mapped area, I've felt very unsafe with that technique - a good
> > technique when you're trying to sneak in special handling for your
> > special driver, but not a good technique once you go to mainline.)
> >
> > I haven't worked out the full consequences of these lost pages:
> > perhaps it's no worse than that you could never properly enforce
> > your ksm_thread_max_kernel_pages quota.
> >   
> 
> You mean the shared pages outside the stable tree comment?

I don't understand or even parse that question: which probably
means, no.  I mean that if the merged pages end up in areas
outside of where your trees expect them, then you've got pages
counted which cannot be found to have gone.  Whether you can get
too many resident pages that way, or just more counted than are
really there, I don't know: I've not worked through the accounting
yet, and would really really prefer it to be exact, instead of
pages getting lost outside the trees and raising these questions.

But you may tell me that I simply haven't understood it yet.

> > And a question on your page_wrprotect() addition to mm/rmap.c: though
> > it may contain some important checks (I'm thinking of the get_user_pages
> > protection), isn't it essentially redundant, and should be removed from
> > the patchset?  If we have a private page which is mapped into more than
> > the one address space by which we arrive at it, then, quite independent
> > of KSM, it needs to be write-protected already to prevent mods in one
> > address space leaking into another - doesn't it?  So I see no need for
> > the rmap'ped write-protection there, just make the checks and write
> > protect the pte you have in ksm.c.  Or am I missing something?
> >   
> 
> Ok, so we have here 2 cases for ksm:
> 1:
>    When the page is anonymous and is mapped readonly beteween serveal
> processes:
>         for this you say we shouldnt walk over the rmap and try to
> writeprotect what is already writeprtected...

Yes.

> 
> 2:
>   When the page is anonymous and is mapped write by just one process:
>       for this you say it is better to handle it directly from inside ksm
> beacuse we already know
>       the virtual address mapping of this page?

Yes.

> 
>       so about this: you are right about the fact that we might dont have to
> walk over the rmap of the page for pages with mapcount 1
>       but isnt it cleaner to deal it inside rmap.c?

If you needed to writeprotect pages with mapcount 2, 3, ... then
indeed you'd want to do it in rmap.c, and you wouldn't want to
specialcase mapcount 1.  But I'm saying you don't need to do the
writeprotection with mapcount 2, 3, ...; therefore better just
to do it (again, no need to special case mapcount 1) in ksm.c.

Cutting out a body of code: that's as clean as clean can be.

>       another thing,  get_user_pages() protection is needed even in 
> that case, beacuse get_user_pages_fast is lockless, so odirect
>       can run under our legs after we write protecting the page.

That I don't dispute, I made a point of saying that "it may contain
some important checks (I'm thinking of the get_user_pages protection)",
meaning that you'd want to transfer any such checks to ksm.c.

(In my own preferred patchset structure, I'd actually like that
O_DIRECT get_user_pages consideration added as a separate patch
over the base: one needs to focus in a very different way to get
those issues, it's better to hold them up for separate inspection.
The patch before would be buggy, yes, but not in such a way that
anyone bisecting for an unrelated bug would get inconvenienced.)

> 
>      anyway, nothing critical, i dont mind to move page_write_protect_one()
> into ksm.c, i still think get_user_pages protection is needed.

Yes, something like that.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 18:32     ` Hugh Dickins
@ 2009-06-08 20:10       ` Izik Eidus
  2009-06-09  4:48         ` Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-08 20:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Mon, 8 Jun 2009, Izik Eidus wrote:
>   
>> Hugh Dickins wrote:
>>     
>>> You seem to have no callback in fork: doesn't that mean that KSM
>>> pages get into mms of which mm/ksm.c has no knowledge?  
>>>       
>> What you mean by this?, should the vma flags be copyed into the child and
>> therefore ksm will scan the vma?
>>     
>
> That part should be happening automatically now
> (you'd have to add code to avoid copying VM_ flags).
>   

Yea, should -> shouldn't

>   
>> (only thing i have to check is: maybe the process itself wont go into the
>> mmlist, and therefore ksm wont know about it)
>>     
>
> But that part needs something to make it happen.  In the case of
> swap, it's done over in copy_one_pte; the important thing is to
> do it where the locking is such that it cannot get missed,
> I can't say where is the right place yet.
>
>   
>>> You had
>>> no callback in mremap move: doesn't that mean that KSM pages could
>>> be moved into areas which mm/ksm.c never tracked?  Though that's
>>> probably no issue now we move over to vmas: they should now travel
>>> with their VM flag.  You have no callback in unmap: doesn't that
>>> mean that KSM never knows when its pages have gone away?
>>>   
>>>       
>> Yes, Adding all this callbacks would make ksm much more happy, Again, i didnt
>> want to scare anyone...
>>     
>
> Izik, you're too kindly ;)
>
>   
>>> (Closing the /dev/ksm fd used to clean up some of this, in the
>>> end; but the lifetime of the fd can be so different from that of
>>> the mapped area, I've felt very unsafe with that technique - a good
>>> technique when you're trying to sneak in special handling for your
>>> special driver, but not a good technique once you go to mainline.)
>>>
>>> I haven't worked out the full consequences of these lost pages:
>>> perhaps it's no worse than that you could never properly enforce
>>> your ksm_thread_max_kernel_pages quota.
>>>   
>>>       
>> You mean the shared pages outside the stable tree comment?
>>     
>
> I don't understand or even parse that question: which probably
> means, no.  I mean that if the merged pages end up in areas
> outside of where your trees expect them, then you've got pages
> counted which cannot be found to have gone.  Whether you can get
> too many resident pages that way, or just more counted than are
> really there, I don't know: I've not worked through the accounting
> yet, and would really really prefer it to be exact, instead of
> pages getting lost outside the trees and raising these questions.
>   

Agreed, but to have the exact number you would need a notification from 
do_wp_page when it breaks the write protected ksm pages, no?

(Right now ksm counts these pages somewhat "out of sync" with the 
linux VM; it doesn't mean that more unswappable pages can be allocated
by ksm than allowed by ksm_thread_max_kernel_pages, but it does mean that 
for some period of time fewer kernel pages might be allocated than allowed,
because ksm will find out too late that some of the kernel pages that it 
already allocated were broken and became swappable.)

> But you may tell me that I simply haven't understood it yet.
>
>   
>>> And a question on your page_wrprotect() addition to mm/rmap.c: though
>>> it may contain some important checks (I'm thinking of the get_user_pages
>>> protection), isn't it essentially redundant, and should be removed from
>>> the patchset?  If we have a private page which is mapped into more than
>>> the one address space by which we arrive at it, then, quite independent
>>> of KSM, it needs to be write-protected already to prevent mods in one
>>> address space leaking into another - doesn't it?  So I see no need for
>>> the rmap'ped write-protection there, just make the checks and write
>>> protect the pte you have in ksm.c.  Or am I missing something?
>>>   
>>>       
>> Ok, so we have here 2 cases for ksm:
>> 1:
>>    When the page is anonymous and is mapped readonly beteween serveal
>> processes:
>>         for this you say we shouldnt walk over the rmap and try to
>> writeprotect what is already writeprtected...
>>     
>
> Yes.
>
>   
>> 2:
>>   When the page is anonymous and is mapped write by just one process:
>>       for this you say it is better to handle it directly from inside ksm
>> beacuse we already know
>>       the virtual address mapping of this page?
>>     
>
> Yes.
>
>   
>>       so about this: you are right about the fact that we might dont have to
>> walk over the rmap of the page for pages with mapcount 1
>>       but isnt it cleaner to deal it inside rmap.c?
>>     
>
> If you needed to writeprotect pages with mapcount 2, 3, ... then
> indeed you'd want to do it in rmap.c, and you wouldn't want to
> specialcase mapcount 1.  But I'm saying you don't need to do the
> writeprotection with mapcount 2, 3, ...; therefore better just
> to do it (again, no need to special case mapcount 1) in ksm.c.
>
> Cutting out a body of code: that's as clean as clean can be.
>
>
>   
Makes sense, I will send a patch today that merges that code into ksm.c 
(without all the rmap walking).


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mmotm] ksm: stop scan skipping pages
  2009-06-08 18:01       ` Hugh Dickins
@ 2009-06-08 20:12         ` Izik Eidus
  2009-06-08 21:05           ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-08 20:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Mon, 8 Jun 2009, Izik Eidus wrote:
>   
>> Thanks for the fix,
>> (I saw it while i wrote the RFC patch for the madvise, but beacuse that i
>> thought that the RFC fix this (you can see the removel of the second call to
>> scan_get_next_index()), and we move to madvise, I thought that no patch is
>> needed for this code, guess I was wrong)
>>     
>
> Ah, no, I hadn't noticed that, this patch is from several weeks ago,
> before you even posted the madvise() version.
>
> I think myself that we ought to fix the algorithm as it stands now in
> mmotm, rather than hiding the fix in amongst later interface changes.
>
> But it's not a big deal, so long as it gets fixed in the end.
>
> By the time Andrew sends KVM to Linus, it shouldn't be the
> patches currently in mmotm with more on top: 

So you want a repost of all the patches? In that case, is there any value 
in keeping this stuff in Andrew's tree right now?
Wouldn't it speed up our development if we kept it outside, and then 
reposted the whole thing?

> they should be
> re-presented with all trace of /dev/ksm gone by then.
>
> Hugh
>   


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mmotm] ksm: stop scan skipping pages
  2009-06-08 20:12         ` Izik Eidus
@ 2009-06-08 21:05           ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2009-06-08 21:05 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Hugh Dickins, aarcange, akpm, nickpiggin, chrisw, riel,
	linux-kernel, linux-mm

On Mon, 8 Jun 2009, Izik Eidus wrote:
> Hugh Dickins wrote:
> >
> > By the time Andrew sends KVM to Linus, it shouldn't be the
> > patches currently in mmotm with more on top: 
> 
> So you want a repost of all the patchs?

Will do eventually.  In the same way that Andrew merges fixes to
patches into the patches before sending to Linus.  But in this
case, rather large pieces would vanish in such a process, plus it
would be unfair to expect Andrew to do that work; so it would
be better for you to submit replacements when we're ready.

> in that case any value to keep this
> stuff in Andrew tree right now?

I guess that depends on whether it's actually receiving any
testing or review while it's in there.  I'm happy for it to remain
there for now - and the core of it is going to remain unchanged
or only trivially changed.  Really a matter for you to decide
with Andrew: I don't want to pull it out of circulation myself.

> Wouldnt it speed our development if we will keep it outside,
> and then repost the whole thing?

I don't see that having a version there in mmotm will slow us
down (what could slow me down more than myself ;-?).  But you
and Andrew are free to decide it's counterproductive to keep the
current version there, that's okay by me if you prefer to pull it.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
  2009-06-08 16:35   ` [PATCH mmotm] ksm: stop scan skipping pages Hugh Dickins
  2009-06-08 17:17   ` [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
@ 2009-06-08 22:57   ` Andrea Arcangeli
  2009-06-13 15:04     ` Hugh Dickins
  2 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2009-06-08 22:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Izik Eidus, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hi Hugh!

On Mon, Jun 08, 2009 at 05:18:57PM +0100, Hugh Dickins wrote:
> I notice that you chose to integrate fully (though not fully enough)
> with vmas, adding a VM_MERGEABLE flag.  Fine, that's probably going
> to be safest in the end, and I'll follow you; but it is further than

After we decided to switch to madvise, I suggested taking the
approach in the RFC in this thread: an rmap_item list ordered by
address that is scanned nested inside the core vma loop that searches
for vmas with VM_MERGEABLE set, so that madvise only has to set the
VM_MERGEABLE bit, and it still avoids altering the VM at all.
Replacing the ksm slots (a kind of out-of-sync ksm vmas) with the
nested loop allows us to provide madvise semantics while still being
out of sync with rmap_item and tree_item.

> You've resisted putting in the callbacks you need.  I think they were
> always (i.e. even when using /dev/ksm) necessary, but should become
> more obvious now we have this tighter integration with mm's vmas.

There is no need to litter the VM with new callbacks; KSM only has
to register with the mmu notifier to tear down rmap_items (and the
tree_item when the last rmap_item on its list is gone) synchronously
when the mappings are invalidated during munmap/mremap or swapping
(anon pages can already be swapped while they are tracked by the
unstable tree, and later we hope stable tree ksm pages can be swapped
too, and we get rid of those from the rbtree as we walk it with
is_present_pte and gup).
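
Just to make the shape of that alternative concrete, a sketch only
(the ksm_remove_rmap_items() helper is invented for illustration;
this is not what the current patches do):

/*
 * Sketch of the in-sync alternative: ksm would register one
 * mmu_notifier per tracked mm and drop rmap_items/tree_items from the
 * invalidate callbacks.  ksm_remove_rmap_items() is an invented helper
 * that would forget any rmap_item covering [start, end) in this mm.
 */
static void ksm_remove_rmap_items(struct mm_struct *mm,
				  unsigned long start, unsigned long end);

static void ksm_mn_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	ksm_remove_rmap_items(mm, start, end);
}

static const struct mmu_notifier_ops ksm_mmu_notifier_ops = {
	.invalidate_range_start	= ksm_mn_invalidate_range_start,
};

struct ksm_notifier {
	struct mmu_notifier mn;		/* one per tracked mm */
};

static int ksm_register_notifier(struct ksm_notifier *kn, struct mm_struct *mm)
{
	kn->mn.ops = &ksm_mmu_notifier_ops;
	return mmu_notifier_register(&kn->mn, mm);
}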

After that we can also have tree_item store the kernel vaddr (or
struct page *) to avoid calling gup() during the walk of the rbtree.
However such a change involves huge complexities and little gain, and
the part about getting rid of orphaned rmap_item/tree_item synchronously
doesn't provide practical advantages. It takes a single pass to get
rid of all orphaned rmap_items/tree_items, and shutting down ksm gets
rid of everything, releasing any memory allocated by KSM (or at least
it should, especially if we don't allow it to be a module anymore
because that might not be worth it; the =N option is still useful for
embedded users who are sure they won't get great benefit from KSM in
their apps).

Izik has already been brave enough to try having rmap_item/tree_item
in sync with the mmu notifier, but one of the complexities was in the
locking: we can't schedule anywhere in the mmu notifier methods (this
could be relaxed, but it makes no sense to depend on it just to get
rid of orphaned rmap_item/tree_item synchronously). Another complexity
was the fact that replace_page, when it does set_pte_at_notify, will
recurse into KSM itself if KSM registers with the MMU notifier.

Admittedly I didn't yet spend much time thinking about how everything
is supposed to work with the mmu notifier and in-sync
rmap_item/tree_item, with tree_item pointing to a physical page
address without any struct page pinning whatsoever (the very cool
thing the mmu notifier is all about: no pinning and full swapping),
but I doubt it will provide any significant advantage, and I think
being simpler has some value too, at least until there are more users
of the feature. Hence the current implementation still has an
out-of-sync scan and garbage collects rmap_item/tree_item lazily.
This is not KVM, which without mmu notifier can't swap at all. We can
do all our work here in KSM out of sync by just garbage collecting the
orphaned entries lazily: KSM works with virtual addresses and only
takes temporary gup pins, so it's not forced to use the mmu notifier
to give good behavior.

So my suggestion is to go with out of sync, and then once this is all
solid and finished, we can try to use the mmu notifier to make it in
sync and benchmark to see if it is worth it, with the primary benefit
being the ability to remove gup from the tree lookup (collecting
orphaned rmap_item/tree_item a bit sooner doesn't matter so much IMHO,
at least for KVM).

> And a question on your page_wrprotect() addition to mm/rmap.c: though
> it may contain some important checks (I'm thinking of the get_user_pages
> protection), isn't it essentially redundant, and should be removed from
> the patchset?  If we have a private page which is mapped into more than
> the one address space by which we arrive at it, then, quite independent
> of KSM, it needs to be write-protected already to prevent mods in one
> address space leaking into another - doesn't it?  So I see no need for
> the rmap'ped write-protection there, just make the checks and write
> protect the pte you have in ksm.c.  Or am I missing something?

Makes sense: basically if the pte is already wrprotected we have
nothing to do, and if it's writable we only need to care about the
current pte because the mapcount is guaranteed to be 1. That can
probably provide a minor optimization to ksm performance. OTOH if we
ever decide to merge more than anon pages this won't hold. That said,
anon pages are orders of magnitude simpler to merge than anything else
because they don't belong to other data structures like the pagecache,
and I guess we'll never merge pagecache given the significant
additional complexities involved, so I'm not against keeping
replace_page local and not walking the anon_vma list at all.

So let us know what you think about keeping rmap_item/tree_item out of
sync, or in sync with the mmu notifier. As said, Izik already did a
preliminary patch with mmu notifier registration. I doubt we want to
invest in that direction unless there's 100% agreement that it is
definitely the way to go, and the expectation that it will make a
substantial difference to KSM users. Minor optimizations that
increase complexity a lot can be left for later.

Thanks for looking into KSM!
Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 20:10       ` Izik Eidus
@ 2009-06-09  4:48         ` Izik Eidus
  2009-06-09 17:24           ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-09  4:48 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Hugh Dickins, aarcange, akpm, nickpiggin, chrisw, riel,
	linux-kernel, linux-mm

On Mon, 08 Jun 2009 23:10:30 +0300
Izik Eidus <ieidus@redhat.com> wrote:

> Hugh Dickins wrote:
> > On Mon, 8 Jun 2009, Izik Eidus wrote:

> >
> > If you needed to writeprotect pages with mapcount 2, 3, ... then
> > indeed you'd want to do it in rmap.c, and you wouldn't want to
> > specialcase mapcount 1.  But I'm saying you don't need to do the
> > writeprotection with mapcount 2, 3, ...; therefore better just
> > to do it (again, no need to special case mapcount 1) in ksm.c.
> >
> > Cutting out a body of code: that's as clean as clean can be.
> >
> >
> >   
> Make sense, i will send patch today that merge that code into ksm.c 
> (without all the rmap walking)
> 
> 

How does this look?


From f41b092bee1437f6f436faa74f5da56403f61009 Mon Sep 17 00:00:00 2001
From: Izik Eidus <ieidus@redhat.com>
Date: Tue, 9 Jun 2009 07:24:25 +0300
Subject: [PATCH] ksm: remove page_wrprotect() from rmap.c

Remove page_wrprotect() from rmap.c and instead embed the needed code
into ksm.c.

Hugh pointed out that for the ksm use case we don't have to walk over the rmap
and write protect page after page: when an anonymous page is mapped
more than once, it has to be write protected already, and when it is
mapped just once, there is no need to walk over the rmap; we can instead
write protect it from inside ksm.c.

Thanks.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/rmap.h |   12 ----
 mm/ksm.c             |   83 ++++++++++++++++++++++++++----
 mm/rmap.c            |  139 --------------------------------------------------
 3 files changed, 73 insertions(+), 161 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 469376d..350e76d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
-#endif
-
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
@@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
 	return 0;
 }
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-static inline int page_wrprotect(struct page *page, int *odirect_sync,
-				 int count_offset)
-{
-	return 0;
-}
-#endif
-
 #endif	/* CONFIG_MMU */
 
 /*
diff --git a/mm/ksm.c b/mm/ksm.c
index 74d921b..9fce82b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
 #include <linux/swap.h>
 #include <linux/rbtree.h>
 #include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
 #include <linux/ksm.h>
 
 #include <asm/tlbflush.h>
@@ -642,6 +643,75 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 	return !memcmp_pages(page1, page2);
 }
 
+static inline int write_protect_page(struct page *page,
+				     struct vm_area_struct *vma,
+				     unsigned long addr,
+				     pte_t orig_pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	int swapped;
+	int ret = 1;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	if (!pte_same(*ptep, orig_pte)) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
+	if (pte_write(*ptep)) {
+		pte_t entry;
+
+		swapped = PageSwapCache(page);
+		flush_cache_page(vma, addr, page_to_pfn(page));
+		/*
+		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
+		 * take any lock, therefore the check that we are going to make
+		 * with the pagecount against the mapcount is racey and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check
+		 * this assure us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 */
+		entry = ptep_clear_flush(vma, addr, ptep);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
+			set_pte_at_notify(mm, addr, ptep, entry);
+			goto out_unlock;
+		}
+		entry = pte_wrprotect(entry);
+		set_pte_at_notify(mm, addr, ptep, entry);
+	}
+	ret = 0;
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	return ret;
+}
+
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @mm: mm_struct that hold vma pointing into oldpage
@@ -661,7 +731,6 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 				 pgprot_t newprot)
 {
 	int ret = 1;
-	int odirect_sync;
 	unsigned long page_addr_in_vma;
 	pte_t orig_pte, *orig_ptep;
 
@@ -686,25 +755,19 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 		goto out_putpage;
 	/*
 	 * we need the page lock to read a stable PageSwapCache in
-	 * page_wrprotect().
+	 * write_protect_page().
 	 * we use trylock_page() instead of lock_page(), beacuse we dont want to
 	 * wait here, we prefer to continue scanning and merging diffrent pages
 	 * and to come back to this page when it is unlocked.
 	 */
 	if (!trylock_page(oldpage))
 		goto out_putpage;
-	/*
-	 * page_wrprotect check if the page is swapped or in swap cache,
-	 * in the future we might want to run here if_present_pte and then
-	 * swap_free
-	 */
-	if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+
+	if (write_protect_page(oldpage, vma, page_addr_in_vma, orig_pte)) {
 		unlock_page(oldpage);
 		goto out_putpage;
 	}
 	unlock_page(oldpage);
-	if (!odirect_sync)
-		goto out_putpage;
 
 	orig_pte = pte_wrprotect(orig_pte);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index f53074c..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-
-static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
-			      int *odirect_sync, int count_offset)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	spinlock_t *ptl;
-	int ret = 0;
-
-	address = vma_address(page, vma);
-	if (address == -EFAULT)
-		goto out;
-
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
-	if (pte_write(*pte)) {
-		pte_t entry;
-
-		flush_cache_page(vma, address, pte_pfn(*pte));
-		/*
-		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
-		 * take any lock, therefore the check that we are going to make
-		 * with the pagecount against the mapcount is racey and
-		 * O_DIRECT can happen right after the check.
-		 * So we clear the pte and flush the tlb before the check
-		 * this assure us that no O_DIRECT can happen after the check
-		 * or in the middle of the check.
-		 */
-		entry = ptep_clear_flush(vma, address, pte);
-		/*
-		 * Check that no O_DIRECT or similar I/O is in progress on the
-		 * page
-		 */
-		if ((page_mapcount(page) + count_offset) != page_count(page)) {
-			*odirect_sync = 0;
-			set_pte_at_notify(mm, address, pte, entry);
-			goto out_unlock;
-		}
-		entry = pte_wrprotect(entry);
-		set_pte_at_notify(mm, address, pte, entry);
-	}
-	ret = 1;
-
-out_unlock:
-	pte_unmap_unlock(pte, ptl);
-out:
-	return ret;
-}
-
-static int page_wrprotect_file(struct page *page, int *odirect_sync,
-			       int count_offset)
-{
-	struct address_space *mapping;
-	struct prio_tree_iter iter;
-	struct vm_area_struct *vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-	int ret = 0;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return ret;
-
-	spin_lock(&mapping->i_mmap_lock);
-
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
-		ret += page_wrprotect_one(page, vma, odirect_sync,
-					  count_offset);
-
-	spin_unlock(&mapping->i_mmap_lock);
-
-	return ret;
-}
-
-static int page_wrprotect_anon(struct page *page, int *odirect_sync,
-			       int count_offset)
-{
-	struct vm_area_struct *vma;
-	struct anon_vma *anon_vma;
-	int ret = 0;
-
-	anon_vma = page_lock_anon_vma(page);
-	if (!anon_vma)
-		return ret;
-
-	/*
-	 * If the page is inside the swap cache, its _count number was
-	 * increased by one, therefore we have to increase count_offset by one.
-	 */
-	if (PageSwapCache(page))
-		count_offset++;
-
-	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
-		ret += page_wrprotect_one(page, vma, odirect_sync,
-					  count_offset);
-
-	page_unlock_anon_vma(anon_vma);
-
-	return ret;
-}
-
-/**
- * page_wrprotect - set all ptes pointing to a page as readonly
- * @page:         the page to set as readonly
- * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
- *                marked as readonly beacuse page_wrprotect_one() was not able
- *                to mark this ptes as readonly without opening window to a race
- *                with odirect
- * @count_offset: number of times page_wrprotect() caller had called get_page()
- *                on the page
- *
- * returns the number of ptes which were marked as readonly.
- * (ptes that were readonly before this function was called are counted as well)
- */
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
-{
-	int ret = 0;
-
-	/*
-	 * Page lock is needed for anon pages for the PageSwapCache check,
-	 * and for page_mapping for filebacked pages
-	 */
-	BUG_ON(!PageLocked(page));
-
-	*odirect_sync = 1;
-	if (PageAnon(page))
-		ret = page_wrprotect_anon(page, odirect_sync, count_offset);
-	else
-		ret = page_wrprotect_file(page, odirect_sync, count_offset);
-
-	return ret;
-}
-EXPORT_SYMBOL(page_wrprotect);
-
-#endif
-
 /**
  * __page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-09  4:48         ` Izik Eidus
@ 2009-06-09 17:24           ` Hugh Dickins
  2009-06-09 19:27             ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-09 17:24 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Tue, 9 Jun 2009, Izik Eidus wrote:
> How does this look?

It looks about right to me, and how delightful to be rid of that
horrid odirect_sync argument!

For now I'll turn a blind eye to your idiosyncratic return code
convention (0 for success, 1 for failure: we're already schizoid
enough with 0 for failure, 1 for success boolean functions versus
0 for success, -errno for failure functions): you're being consistent
with yourself, let's run through ksm.c at a later date to sort that out.

One improvement to make now, though: you've elsewhere avoided
the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
page_check_address() is not static to rmap.c (filemap_xip wanted it),
so please continue to use that.  It's not exported, right, but I think
Chris was already decisive that we should abandon modular KSM, yes?
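
(Editorial sketch, not from the original mail: using the page_check_address()
signature visible elsewhere in this thread, the suggested lookup collapses to
something like the following, replacing the pgd/pud/pmd/pte descent.)

	/* assumes page, mm, addr and an out: label from the surrounding
	 * function, as in the sketches later in this thread */
	pte_t *ptep;
	spinlock_t *ptl;

	ptep = page_check_address(page, mm, addr, &ptl, 0);
	if (!ptep)
		goto out;
	/* ... inspect or modify *ptep under the pte lock ... */
	pte_unmap_unlock(ptep, ptl);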

So I think slip in a patch before this one to remove the modular
option, so as not to generate traffic from people who build all
modular kernels.  Or if you think the modular option should remain,
append an EXPORT_SYMBOL to page_check_address().
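
(Editorial sketch of the alternative just mentioned, in case modular KSM
were kept:)

	/* in mm/rmap.c, after the page_check_address() definition */
	EXPORT_SYMBOL(page_check_address);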

Thanks,
Hugh

> 
> From f41b092bee1437f6f436faa74f5da56403f61009 Mon Sep 17 00:00:00 2001
> From: Izik Eidus <ieidus@redhat.com>
> Date: Tue, 9 Jun 2009 07:24:25 +0300
> Subject: [PATCH] ksm: remove page_wrprotect() from rmap.c
> 
> Remove page_wrprotect() from rmap.c and instead embed the needed code
> into ksm.c.
> 
> Hugh pointed out that for the ksm use case we don't have to walk the rmap
> and write-protect page after page: when an anonymous page is mapped more
> than once, it must already be write protected, and when it is mapped just
> once there is no need to walk the rmap at all; we can write protect it
> from inside ksm.c.
> 
> Thanks.
> 
> Signed-off-by: Izik Eidus <ieidus@redhat.com>
> ---
>  include/linux/rmap.h |   12 ----
>  mm/ksm.c             |   83 ++++++++++++++++++++++++++----
>  mm/rmap.c            |  139 --------------------------------------------------
>  3 files changed, 73 insertions(+), 161 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 469376d..350e76d 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
>  }
>  #endif
>  
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
> -#endif
> -
>  #else	/* !CONFIG_MMU */
>  
>  #define anon_vma_init()		do {} while (0)
> @@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
>  	return 0;
>  }
>  
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -static inline int page_wrprotect(struct page *page, int *odirect_sync,
> -				 int count_offset)
> -{
> -	return 0;
> -}
> -#endif
> -
>  #endif	/* CONFIG_MMU */
>  
>  /*
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 74d921b..9fce82b 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -37,6 +37,7 @@
>  #include <linux/swap.h>
>  #include <linux/rbtree.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/ksm.h>
>  
>  #include <asm/tlbflush.h>
> @@ -642,6 +643,75 @@ static inline int pages_identical(struct page *page1, struct page *page2)
>  	return !memcmp_pages(page1, page2);
>  }
>  
> +static inline int write_protect_page(struct page *page,
> +				     struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     pte_t orig_pte)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +	int swapped;
> +	int ret = 1;
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, addr);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, addr);
> +	if (!pmd_present(*pmd))
> +		goto out;
> +
> +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	if (!ptep)
> +		goto out;
> +
> +	if (!pte_same(*ptep, orig_pte)) {
> +		pte_unmap_unlock(ptep, ptl);
> +		goto out;
> +	}
> +
> +	if (pte_write(*ptep)) {
> +		pte_t entry;
> +
> +		swapped = PageSwapCache(page);
> +		flush_cache_page(vma, addr, page_to_pfn(page));
> +		/*
> +		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
> +		 * take any lock, therefore the check that we are going to make
> +		 * with the pagecount against the mapcount is racey and
> +		 * O_DIRECT can happen right after the check.
> +		 * So we clear the pte and flush the tlb before the check
> +		 * this assure us that no O_DIRECT can happen after the check
> +		 * or in the middle of the check.
> +		 */
> +		entry = ptep_clear_flush(vma, addr, ptep);
> +		/*
> +		 * Check that no O_DIRECT or similar I/O is in progress on the
> +		 * page
> +		 */
> +		if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
> +			set_pte_at_notify(mm, addr, ptep, entry);
> +			goto out_unlock;
> +		}
> +		entry = pte_wrprotect(entry);
> +		set_pte_at_notify(mm, addr, ptep, entry);
> +	}
> +	ret = 0;
> +
> +out_unlock:
> +	pte_unmap_unlock(ptep, ptl);
> +out:
> +	return ret;
> +}
> +
>  /*
>   * try_to_merge_one_page - take two pages and merge them into one
>   * @mm: mm_struct that hold vma pointing into oldpage
> @@ -661,7 +731,6 @@ static int try_to_merge_one_page(struct mm_struct *mm,
>  				 pgprot_t newprot)
>  {
>  	int ret = 1;
> -	int odirect_sync;
>  	unsigned long page_addr_in_vma;
>  	pte_t orig_pte, *orig_ptep;
>  
> @@ -686,25 +755,19 @@ static int try_to_merge_one_page(struct mm_struct *mm,
>  		goto out_putpage;
>  	/*
>  	 * we need the page lock to read a stable PageSwapCache in
> -	 * page_wrprotect().
> +	 * write_protect_page().
>  	 * we use trylock_page() instead of lock_page(), beacuse we dont want to
>  	 * wait here, we prefer to continue scanning and merging diffrent pages
>  	 * and to come back to this page when it is unlocked.
>  	 */
>  	if (!trylock_page(oldpage))
>  		goto out_putpage;
> -	/*
> -	 * page_wrprotect check if the page is swapped or in swap cache,
> -	 * in the future we might want to run here if_present_pte and then
> -	 * swap_free
> -	 */
> -	if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
> +
> +	if (write_protect_page(oldpage, vma, page_addr_in_vma, orig_pte)) {
>  		unlock_page(oldpage);
>  		goto out_putpage;
>  	}
>  	unlock_page(oldpage);
> -	if (!odirect_sync)
> -		goto out_putpage;
>  
>  	orig_pte = pte_wrprotect(orig_pte);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f53074c..c3ba0b9 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(page_mkclean);
>  
> -#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
> -
> -static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
> -			      int *odirect_sync, int count_offset)
> -{
> -	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long address;
> -	pte_t *pte;
> -	spinlock_t *ptl;
> -	int ret = 0;
> -
> -	address = vma_address(page, vma);
> -	if (address == -EFAULT)
> -		goto out;
> -
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
> -	if (pte_write(*pte)) {
> -		pte_t entry;
> -
> -		flush_cache_page(vma, address, pte_pfn(*pte));
> -		/*
> -		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
> -		 * take any lock, therefore the check that we are going to make
> -		 * with the pagecount against the mapcount is racey and
> -		 * O_DIRECT can happen right after the check.
> -		 * So we clear the pte and flush the tlb before the check
> -		 * this assure us that no O_DIRECT can happen after the check
> -		 * or in the middle of the check.
> -		 */
> -		entry = ptep_clear_flush(vma, address, pte);
> -		/*
> -		 * Check that no O_DIRECT or similar I/O is in progress on the
> -		 * page
> -		 */
> -		if ((page_mapcount(page) + count_offset) != page_count(page)) {
> -			*odirect_sync = 0;
> -			set_pte_at_notify(mm, address, pte, entry);
> -			goto out_unlock;
> -		}
> -		entry = pte_wrprotect(entry);
> -		set_pte_at_notify(mm, address, pte, entry);
> -	}
> -	ret = 1;
> -
> -out_unlock:
> -	pte_unmap_unlock(pte, ptl);
> -out:
> -	return ret;
> -}
> -
> -static int page_wrprotect_file(struct page *page, int *odirect_sync,
> -			       int count_offset)
> -{
> -	struct address_space *mapping;
> -	struct prio_tree_iter iter;
> -	struct vm_area_struct *vma;
> -	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> -	int ret = 0;
> -
> -	mapping = page_mapping(page);
> -	if (!mapping)
> -		return ret;
> -
> -	spin_lock(&mapping->i_mmap_lock);
> -
> -	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
> -		ret += page_wrprotect_one(page, vma, odirect_sync,
> -					  count_offset);
> -
> -	spin_unlock(&mapping->i_mmap_lock);
> -
> -	return ret;
> -}
> -
> -static int page_wrprotect_anon(struct page *page, int *odirect_sync,
> -			       int count_offset)
> -{
> -	struct vm_area_struct *vma;
> -	struct anon_vma *anon_vma;
> -	int ret = 0;
> -
> -	anon_vma = page_lock_anon_vma(page);
> -	if (!anon_vma)
> -		return ret;
> -
> -	/*
> -	 * If the page is inside the swap cache, its _count number was
> -	 * increased by one, therefore we have to increase count_offset by one.
> -	 */
> -	if (PageSwapCache(page))
> -		count_offset++;
> -
> -	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> -		ret += page_wrprotect_one(page, vma, odirect_sync,
> -					  count_offset);
> -
> -	page_unlock_anon_vma(anon_vma);
> -
> -	return ret;
> -}
> -
> -/**
> - * page_wrprotect - set all ptes pointing to a page as readonly
> - * @page:         the page to set as readonly
> - * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
> - *                marked as readonly beacuse page_wrprotect_one() was not able
> - *                to mark this ptes as readonly without opening window to a race
> - *                with odirect
> - * @count_offset: number of times page_wrprotect() caller had called get_page()
> - *                on the page
> - *
> - * returns the number of ptes which were marked as readonly.
> - * (ptes that were readonly before this function was called are counted as well)
> - */
> -int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
> -{
> -	int ret = 0;
> -
> -	/*
> -	 * Page lock is needed for anon pages for the PageSwapCache check,
> -	 * and for page_mapping for filebacked pages
> -	 */
> -	BUG_ON(!PageLocked(page));
> -
> -	*odirect_sync = 1;
> -	if (PageAnon(page))
> -		ret = page_wrprotect_anon(page, odirect_sync, count_offset);
> -	else
> -		ret = page_wrprotect_file(page, odirect_sync, count_offset);
> -
> -	return ret;
> -}
> -EXPORT_SYMBOL(page_wrprotect);
> -
> -#endif
> -
>  /**
>   * __page_set_anon_rmap - setup new anonymous rmap
>   * @page:	the page to add the mapping to
> -- 
> 1.5.6.5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-09 17:24           ` Hugh Dickins
@ 2009-06-09 19:27             ` Hugh Dickins
  2009-06-10  6:28               ` Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-09 19:27 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Tue, 9 Jun 2009, Hugh Dickins wrote:
> On Tue, 9 Jun 2009, Izik Eidus wrote:
> > How does this look?
> 
> > One improvement to make now, though: you've elsewhere avoided
> the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
> page_check_address() is not static to rmap.c (filemap_xip wanted it),
> so please continue to use that.  It's not exported, right, but I think
> Chris was already decisive that we should abandon modular KSM, yes?

I think you can simplify it further, can't you?  Isn't the get_pte()
preamble in try_to_merge_one_page() just unnecessary overhead now?  See
untested code below.  Or even move the trylock/unlock of the page into
write_protect_page if you prefer.  Later on we'll uninline rmap.c's
vma_address() so you can use it instead of your addr_in_vma() copy.

Hugh

static inline int write_protect_page(struct page *page,
				     struct vm_area_struct *vma,
				     pte_t *orig_pte)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long addr;
	pte_t *ptep;
	spinlock_t *ptl;
	int swapped;
	int ret = 1;

	addr = addr_in_vma(vma, page);
	if (addr == -EFAULT)
		goto out;

	ptep = page_check_address(page, mm, addr, &ptl, 0);
	if (!ptep)
		goto out;

	if (pte_write(*ptep)) {
		pte_t entry;

		swapped = PageSwapCache(page);
		flush_cache_page(vma, addr, page_to_pfn(page));
		/*
		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
		 * take any lock, therefore the check that we are going to make
		 * with the pagecount against the mapcount is racey and
		 * O_DIRECT can happen right after the check.
		 * So we clear the pte and flush the tlb before the check
		 * this assure us that no O_DIRECT can happen after the check
		 * or in the middle of the check.
		 */
		entry = ptep_clear_flush(vma, addr, ptep);
		/*
		 * Check that no O_DIRECT or similar I/O is in progress on the
		 * page
		 */
		if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
			set_pte_at_notify(mm, addr, ptep, entry);
			goto out_unlock;
		}
		entry = pte_wrprotect(entry);
		set_pte_at_notify(mm, addr, ptep, entry);
		*orig_pte = *ptep;
	}
	ret = 0;

out_unlock:
	pte_unmap_unlock(ptep, ptl);
out:
	return ret;
}

/*
 * try_to_merge_one_page - take two pages and merge them into one
 * @mm: mm_struct that hold vma pointing into oldpage
 * @vma: the vma that hold the pte pointing into oldpage
 * @oldpage: the page that we want to replace with newpage
 * @newpage: the page that we want to map instead of oldpage
 * @newprot: the new permission of the pte inside vma
 * note:
 * oldpage should be anon page while newpage should be file mapped page
 *
 * this function return 0 if the pages were merged, 1 otherwise.
 */
static int try_to_merge_one_page(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 struct page *oldpage,
				 struct page *newpage,
				 pgprot_t newprot)
{
	int ret = 1;
	pte_t orig_pte;

	if (!PageAnon(oldpage))
		goto out;

	get_page(newpage);
	get_page(oldpage);

	/*
	 * we need the page lock to read a stable PageSwapCache in
	 * write_protect_page().
	 * we use trylock_page() instead of lock_page(), beacuse we dont want to
	 * wait here, we prefer to continue scanning and merging diffrent pages
	 * and to come back to this page when it is unlocked.
	 */
	if (!trylock_page(oldpage))
		goto out_putpage;

	if (write_protect_page(oldpage, vma, &orig_pte)) {
		unlock_page(oldpage);
		goto out_putpage;
	}
	unlock_page(oldpage);

	if (pages_identical(oldpage, newpage))
		ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);

out_putpage:
	put_page(oldpage);
	put_page(newpage);
out:
	return ret;
}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-09 19:27             ` Hugh Dickins
@ 2009-06-10  6:28               ` Izik Eidus
  2009-06-11 16:57                 ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Izik Eidus @ 2009-06-10  6:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Tue, 9 Jun 2009 20:27:29 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> On Tue, 9 Jun 2009, Hugh Dickins wrote:
> > On Tue, 9 Jun 2009, Izik Eidus wrote:
> > > How does this look?
> > 
> > > One improvement to make now, though: you've elsewhere avoided
> > the pgd,pud,pmd,pte descent in ksm.c (using get_pte instead), and
> > page_check_address() is not static to rmap.c (filemap_xip wanted
> > it), so please continue to use that.  It's not exported, right, but
> > I think Chris was already decisive that we should abandon modular
> > KSM, yes?
> 
> I think you can simplify it further, can't you?  Isn't the get_pte()
> preamble in try_to_merge_one_page() just unnecessary overhead now?
> See untested code below.  Or even move the trylock/unlock of the page
> into write_protect_page if you prefer.  Later on we'll uninline
> rmap.c's vma_address() so you can use it instead of your
> addr_in_vma() copy.
> 
> Hugh


Great! What do you think about the below? Is there anything else we want
to add, or should we start sending it to Andrew?

Btw, may I add your Signed-off-by to this patch?
(The only things I changed were moving *orig_pte = *ptep down, out of the
pte_write() branch, so that we also merge pages that were already write
protected, and adding orig_pte = __pte(0) to avoid the annoying warning
about it being used uninitialized.)
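
(Editorial sketch, not from the original mail, of the two adjustments
relative to Hugh's version; both can be seen in the 2/2 patch below.)

	/* in try_to_merge_one_page(): initialized only to silence gcc's
	 * "may be used uninitialized" warning */
	pte_t orig_pte = __pte(0);

	/* in write_protect_page(): record the pte after the pte_write()
	 * block rather than inside it, so pages that are already
	 * write protected can still be merged */
	*orig_pte = *ptep;
	ret = 0;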

From 7304f4404d91a40e234b0530de6d3bfc8c5925a2 Mon Sep 17 00:00:00 2001
From: Izik Eidus <ieidus@redhat.com>
Date: Tue, 9 Jun 2009 21:00:55 +0300
Subject: [PATCH 1/2] ksm: remove ksm from being a module.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/mm.h   |    2 +-
 include/linux/rmap.h |    4 ++--
 mm/Kconfig           |    5 ++---
 mm/memory.c          |    2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e617bab..cdc08d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1258,7 +1258,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
 int replace_page(struct vm_area_struct *vma, struct page *oldpage,
 		 struct page *newpage, pte_t orig_pte, pgprot_t prot);
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 469376d..939c171 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,7 +118,7 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
 int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
 #endif
 
@@ -136,7 +136,7 @@ static inline int page_mkclean(struct page *page)
 	return 0;
 }
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
 static inline int page_wrprotect(struct page *page, int *odirect_sync,
 				 int count_offset)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 5ebfd18..e7c118f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -227,10 +227,9 @@ config MMU_NOTIFIER
 	bool
 
 config KSM
-	tristate "Enable KSM for page sharing"
+	bool "Enable KSM for page sharing"
 	help
-	  Enable the KSM kernel module to allow page sharing of equal pages
-	  among different tasks.
+	  Enable KSM to allow page sharing of equal pages among different tasks.
 
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
diff --git a/mm/memory.c b/mm/memory.c
index 8b4e40e..e23d4dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1617,7 +1617,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+#if defined(CONFIG_KSM)
 
 /**
  * replace_page - replace page in vma with new page
-- 
1.5.6.5





From 3d9975dea43ae848f21875b5f99accecf1366765 Mon Sep 17 00:00:00 2001
From: Izik Eidus <ieidus@redhat.com>
Date: Wed, 10 Jun 2009 09:16:26 +0300
Subject: [PATCH 2/2] ksm: remove page_wrprotect() from rmap.c

Remove page_wrprotect() from rmap.c and instead embed the needed code
into ksm.c.

Hugh pointed out that for the ksm use case we don't have to walk the rmap
and write-protect page after page: when an anonymous page is mapped more
than once, it must already be write protected, and when it is mapped just
once there is no need to walk the rmap at all; we can write protect it
from inside ksm.c.

Thanks.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/rmap.h |   12 ----
 mm/ksm.c             |   86 +++++++++++++++++++++----------
 mm/rmap.c            |  139 --------------------------------------------------
 3 files changed, 59 insertions(+), 178 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 939c171..350e76d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,10 +118,6 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
-#if defined(CONFIG_KSM)
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
-#endif
-
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
@@ -136,14 +132,6 @@ static inline int page_mkclean(struct page *page)
 	return 0;
 }
 
-#if defined(CONFIG_KSM)
-static inline int page_wrprotect(struct page *page, int *odirect_sync,
-				 int count_offset)
-{
-	return 0;
-}
-#endif
-
 #endif	/* CONFIG_MMU */
 
 /*
diff --git a/mm/ksm.c b/mm/ksm.c
index 74d921b..9d4be62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
 #include <linux/swap.h>
 #include <linux/rbtree.h>
 #include <linux/anon_inodes.h>
+#include <linux/mmu_notifier.h>
 #include <linux/ksm.h>
 
 #include <asm/tlbflush.h>
@@ -642,6 +643,60 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 	return !memcmp_pages(page1, page2);
 }
 
+static inline int write_protect_page(struct page *page,
+				     struct vm_area_struct *vma,
+				     pte_t *orig_pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	int swapped;
+	int ret = 1;
+
+	addr = addr_in_vma(vma, page);
+	if (addr == -EFAULT)
+		goto out;
+
+	ptep = page_check_address(page, mm, addr, &ptl, 0);
+	if (!ptep)
+		goto out;
+
+	if (pte_write(*ptep)) {
+		pte_t entry;
+
+		swapped = PageSwapCache(page);
+		flush_cache_page(vma, addr, page_to_pfn(page));
+		/*
+		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
+		 * take any lock, therefore the check that we are going to make
+		 * with the pagecount against the mapcount is racey and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check
+		 * this assure us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 */
+		entry = ptep_clear_flush(vma, addr, ptep);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
+			set_pte_at_notify(mm, addr, ptep, entry);
+			goto out_unlock;
+		}
+		entry = pte_wrprotect(entry);
+		set_pte_at_notify(mm, addr, ptep, entry);
+	}
+	*orig_pte = *ptep;
+	ret = 0;
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	return ret;
+}
+
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @mm: mm_struct that hold vma pointing into oldpage
@@ -661,9 +716,7 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 				 pgprot_t newprot)
 {
 	int ret = 1;
-	int odirect_sync;
-	unsigned long page_addr_in_vma;
-	pte_t orig_pte, *orig_ptep;
+	pte_t orig_pte = __pte(0);
 
 	if (!PageAnon(oldpage))
 		goto out;
@@ -671,42 +724,21 @@ static int try_to_merge_one_page(struct mm_struct *mm,
 	get_page(newpage);
 	get_page(oldpage);
 
-	page_addr_in_vma = addr_in_vma(vma, oldpage);
-	if (page_addr_in_vma == -EFAULT)
-		goto out_putpage;
-
-	orig_ptep = get_pte(mm, page_addr_in_vma);
-	if (!orig_ptep)
-		goto out_putpage;
-	orig_pte = *orig_ptep;
-	pte_unmap(orig_ptep);
-	if (!pte_present(orig_pte))
-		goto out_putpage;
-	if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
-		goto out_putpage;
 	/*
 	 * we need the page lock to read a stable PageSwapCache in
-	 * page_wrprotect().
+	 * write_protect_page().
 	 * we use trylock_page() instead of lock_page(), beacuse we dont want to
 	 * wait here, we prefer to continue scanning and merging diffrent pages
 	 * and to come back to this page when it is unlocked.
 	 */
 	if (!trylock_page(oldpage))
 		goto out_putpage;
-	/*
-	 * page_wrprotect check if the page is swapped or in swap cache,
-	 * in the future we might want to run here if_present_pte and then
-	 * swap_free
-	 */
-	if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
+
+	if (write_protect_page(oldpage, vma, &orig_pte)) {
 		unlock_page(oldpage);
 		goto out_putpage;
 	}
 	unlock_page(oldpage);
-	if (!odirect_sync)
-		goto out_putpage;
-
-	orig_pte = pte_wrprotect(orig_pte);
 
 	if (pages_identical(oldpage, newpage))
 		ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
diff --git a/mm/rmap.c b/mm/rmap.c
index f53074c..c3ba0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,145 +585,6 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
-#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
-
-static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
-			      int *odirect_sync, int count_offset)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	spinlock_t *ptl;
-	int ret = 0;
-
-	address = vma_address(page, vma);
-	if (address == -EFAULT)
-		goto out;
-
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
-	if (pte_write(*pte)) {
-		pte_t entry;
-
-		flush_cache_page(vma, address, pte_pfn(*pte));
-		/*
-		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
-		 * take any lock, therefore the check that we are going to make
-		 * with the pagecount against the mapcount is racey and
-		 * O_DIRECT can happen right after the check.
-		 * So we clear the pte and flush the tlb before the check
-		 * this assure us that no O_DIRECT can happen after the check
-		 * or in the middle of the check.
-		 */
-		entry = ptep_clear_flush(vma, address, pte);
-		/*
-		 * Check that no O_DIRECT or similar I/O is in progress on the
-		 * page
-		 */
-		if ((page_mapcount(page) + count_offset) != page_count(page)) {
-			*odirect_sync = 0;
-			set_pte_at_notify(mm, address, pte, entry);
-			goto out_unlock;
-		}
-		entry = pte_wrprotect(entry);
-		set_pte_at_notify(mm, address, pte, entry);
-	}
-	ret = 1;
-
-out_unlock:
-	pte_unmap_unlock(pte, ptl);
-out:
-	return ret;
-}
-
-static int page_wrprotect_file(struct page *page, int *odirect_sync,
-			       int count_offset)
-{
-	struct address_space *mapping;
-	struct prio_tree_iter iter;
-	struct vm_area_struct *vma;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-	int ret = 0;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return ret;
-
-	spin_lock(&mapping->i_mmap_lock);
-
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
-		ret += page_wrprotect_one(page, vma, odirect_sync,
-					  count_offset);
-
-	spin_unlock(&mapping->i_mmap_lock);
-
-	return ret;
-}
-
-static int page_wrprotect_anon(struct page *page, int *odirect_sync,
-			       int count_offset)
-{
-	struct vm_area_struct *vma;
-	struct anon_vma *anon_vma;
-	int ret = 0;
-
-	anon_vma = page_lock_anon_vma(page);
-	if (!anon_vma)
-		return ret;
-
-	/*
-	 * If the page is inside the swap cache, its _count number was
-	 * increased by one, therefore we have to increase count_offset by one.
-	 */
-	if (PageSwapCache(page))
-		count_offset++;
-
-	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
-		ret += page_wrprotect_one(page, vma, odirect_sync,
-					  count_offset);
-
-	page_unlock_anon_vma(anon_vma);
-
-	return ret;
-}
-
-/**
- * page_wrprotect - set all ptes pointing to a page as readonly
- * @page:         the page to set as readonly
- * @odirect_sync: boolean value that is set to 0 when some of the ptes were not
- *                marked as readonly beacuse page_wrprotect_one() was not able
- *                to mark this ptes as readonly without opening window to a race
- *                with odirect
- * @count_offset: number of times page_wrprotect() caller had called get_page()
- *                on the page
- *
- * returns the number of ptes which were marked as readonly.
- * (ptes that were readonly before this function was called are counted as well)
- */
-int page_wrprotect(struct page *page, int *odirect_sync, int count_offset)
-{
-	int ret = 0;
-
-	/*
-	 * Page lock is needed for anon pages for the PageSwapCache check,
-	 * and for page_mapping for filebacked pages
-	 */
-	BUG_ON(!PageLocked(page));
-
-	*odirect_sync = 1;
-	if (PageAnon(page))
-		ret = page_wrprotect_anon(page, odirect_sync, count_offset);
-	else
-		ret = page_wrprotect_file(page, odirect_sync, count_offset);
-
-	return ret;
-}
-EXPORT_SYMBOL(page_wrprotect);
-
-#endif
-
 /**
  * __page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
-- 
1.5.6.5




> 
> static inline int write_protect_page(struct page *page,
> 				     struct vm_area_struct *vma,
> 				     pte_t *orig_pte)
> {
> 	struct mm_struct *mm = vma->vm_mm;
> 	unsigned long addr;
> 	pte_t *ptep;
> 	spinlock_t *ptl;
> 	int swapped;
> 	int ret = 1;
> 
> 	addr = addr_in_vma(vma, page);
> 	if (addr == -EFAULT)
> 		goto out;
> 
> 	ptep = page_check_address(page, mm, addr, &ptl, 0);
> 	if (!ptep)
> 		goto out;
> 
> 	if (pte_write(*ptep)) {
> 		pte_t entry;
> 
> 		swapped = PageSwapCache(page);
> 		flush_cache_page(vma, addr, page_to_pfn(page));
> 		/*
> 		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
> 		 * take any lock, therefore the check that we are going to make
> 		 * with the pagecount against the mapcount is racey and
> 		 * O_DIRECT can happen right after the check.
> 		 * So we clear the pte and flush the tlb before the check
> 		 * this assure us that no O_DIRECT can happen after the check
> 		 * or in the middle of the check.
> 		 */
> 		entry = ptep_clear_flush(vma, addr, ptep);
> 		/*
> 		 * Check that no O_DIRECT or similar I/O is in progress on the
> 		 * page
> 		 */
> 		if ((page_mapcount(page) + 2 + swapped) != page_count(page)) {
> 			set_pte_at_notify(mm, addr, ptep, entry);
> 			goto out_unlock;
> 		}
> 		entry = pte_wrprotect(entry);
> 		set_pte_at_notify(mm, addr, ptep, entry);
> 		*orig_pte = *ptep;
> 	}
> 	ret = 0;
> 
> out_unlock:
> 	pte_unmap_unlock(ptep, ptl);
> out:
> 	return ret;
> }
> 
> /*
>  * try_to_merge_one_page - take two pages and merge them into one
>  * @mm: mm_struct that hold vma pointing into oldpage
>  * @vma: the vma that hold the pte pointing into oldpage
>  * @oldpage: the page that we want to replace with newpage
>  * @newpage: the page that we want to map instead of oldpage
>  * @newprot: the new permission of the pte inside vma
>  * note:
>  * oldpage should be anon page while newpage should be file mapped page
>  *
>  * this function return 0 if the pages were merged, 1 otherwise.
>  */
> static int try_to_merge_one_page(struct mm_struct *mm,
> 				 struct vm_area_struct *vma,
> 				 struct page *oldpage,
> 				 struct page *newpage,
> 				 pgprot_t newprot)
> {
> 	int ret = 1;
> 	pte_t orig_pte;
> 
> 	if (!PageAnon(oldpage))
> 		goto out;
> 
> 	get_page(newpage);
> 	get_page(oldpage);
> 
> 	/*
> 	 * we need the page lock to read a stable PageSwapCache in
> 	 * write_protect_page().
> 	 * we use trylock_page() instead of lock_page(), beacuse we dont want to
> 	 * wait here, we prefer to continue scanning and merging diffrent pages
> 	 * and to come back to this page when it is unlocked.
> 	 */
> 	if (!trylock_page(oldpage))
> 		goto out_putpage;
> 
> 	if (write_protect_page(oldpage, vma, &orig_pte)) {
> 		unlock_page(oldpage);
> 		goto out_putpage;
> 	}
> 	unlock_page(oldpage);
> 
> 	if (pages_identical(oldpage, newpage))
> 		ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
> 
> out_putpage:
> 	put_page(oldpage);
> 	put_page(newpage);
> out:
> 	return ret;
> }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-10  6:28               ` Izik Eidus
@ 2009-06-11 16:57                 ` Hugh Dickins
  2009-06-12 21:49                   ` Izik Eidus
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-06-11 16:57 UTC (permalink / raw)
  To: Izik Eidus
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

On Wed, 10 Jun 2009, Izik Eidus wrote:
> 
> Great! What do you think about the below?

A couple of things: please use #ifdef CONFIG_KSM throughout
now that you can, rather than #if defined(CONFIG_KSM).
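
(Editorial illustration, not from the mail: with that change, the
include/linux/mm.h declaration from the 1/2 patch above would read as
follows.)

	#ifdef CONFIG_KSM
	int replace_page(struct vm_area_struct *vma, struct page *oldpage,
			 struct page *newpage, pte_t orig_pte, pgprot_t prot);
	#endif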

And we ought to add a comment above the write_protect_page() call:
	/*
	 * If this anonymous page is mapped only here, its pte may need
	 * to be write-protected.  If it's mapped elsewhere, all its
	 * ptes are necessarily already write-protected.  In either
	 * case, we need to lock and check page_count is not raised.
	 */

> Is there anything else we want to add, or should we start sending it
> to Andrew?

Umm, umm, umm, let's hold off sending Andrew for now.  It would be
a bit silly to send him 2/2, when what's actually required is to
withdraw ksm-add-page_wrprotect-write-protecting-page.patch,
providing adjustments as necessary to the other patches.

But I don't want us fiddling with Andrew's collection every day
or two (forget my earlier scan fix if you prefer, if you're sure
you caught all of that in what you have now); and I rather doubt
mmotm's KSM is getting much testing at present.  It would be nice
to get the withdrawn rmap.c changes out of everybody else's face,
but not worth messing mmotm around for that right now.

Whilst we can't let our divergence drift on very long, I think
it's better we give him a new collection of patches once we've
more or less settled the madvise switchover.  That new collection
won't contain much in mm/rmap.c if at all.

Unless you or Andrew feel strongly that I'm wrong on this.

> 
> Btw, may I add your Signed-off-by to this patch?

If you've tested and found it works, please, by all means.
If you've not tested, or found it broken, then I deny all
knowledge of these changes ;)  But it's hypothetical: the
replacement collection to Andrew won't contain either of
these patches as such.

> (The only things I changed were moving *orig_pte = *ptep down, out of the
> pte_write() branch, so that we also merge pages that were already write
> protected, and adding orig_pte = __pte(0) to avoid the annoying warning
> about it being used uninitialized.)

Okay.  We do have a macro to annotate such pacifying initializations,
but I've not used it before, and forget what it is or where to look
for an example; I don't see it in kernel.h or compiler.h.  Maybe
Andrew will chime in and remind us.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-11 16:57                 ` Hugh Dickins
@ 2009-06-12 21:49                   ` Izik Eidus
  0 siblings, 0 replies; 22+ messages in thread
From: Izik Eidus @ 2009-06-12 21:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: aarcange, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hugh Dickins wrote:
>
> Okay.  We do have a macro to annotate such pacifying initializations,
> but I've not used it before, and forget what it is or where to look
> for an example; I don't see it in kernel.h or compiler.h.  Maybe
> Andrew will chime in and remind us.
>
> Hugh
>   
I have looked at compiler.h - this file has something that deals with
warnings, but I didn't find anything that is good for our case...

Does anyone know where this thing is found?
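
(Editorial aside, not part of the original thread: the macro being
recalled here is most likely uninitialized_var(), which in kernels of
this era was defined in the gcc-specific headers, e.g.
include/linux/compiler-gcc3.h and compiler-gcc4.h, rather than in
compiler.h itself.)

	/* assumed definition, from the gcc-specific compiler headers */
	#define uninitialized_var(x) x = x

	/* which would allow, in try_to_merge_one_page(): */
	pte_t uninitialized_var(orig_pte);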


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] RFC - ksm api change into madvise
  2009-06-08 22:57   ` Andrea Arcangeli
@ 2009-06-13 15:04     ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2009-06-13 15:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, akpm, nickpiggin, chrisw, riel, linux-kernel, linux-mm

Hi Andrea,

On Tue, 9 Jun 2009, Andrea Arcangeli wrote:
> 
> So let us know what you think about keeping the rmap_item/tree_item out
> of sync, or in sync with the mmu notifier. As said, Izik already did a
> preliminary patch with mmu notifier registration. I doubt we want to
> invest in that direction unless there's 100% agreement that it is
> definitely the way to go, and the expectation that it will make a
> substantial difference to the KSM users. Minor optimizations that
> increase complexity a lot can be left for later.

Thanks for your detailed mail, of which this is merely the final
paragraph.  Thought I'd better respond with a little reassurance,
though I'm not yet ready to write in detail.

I agree 100% that KSM is entitled to be as "lazy" about clearing
up pages as it is about merging them in the first place: you're
absolutely right to avoid the unnecessary overhead of keeping
strictly in synch, and I recognize the lock ordering problems
that keeping strictly in synch would be likely to entail.

My remarks about "lost" pages came from the belief that operations
upon the vmas could move pages to where they thereafter escaped
the attention of KSM's scans for an _indefinite_ period (until
the process exited or even after): that's what has worried me,
but I've yet to demonstrate such a problem, and the rework
naturally changes what happens here.

So, rest assured, I'm not wanting to impose a stricter discipline and
tighter linkage, unless it's to fix a proven indefinite discrepancy.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-06-13 15:05 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-14  0:30 [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
2009-05-14  0:30 ` [PATCH 1/4] madvice: add MADV_SHAREABLE and MADV_UNSHAREABLE calls Izik Eidus
2009-05-14  0:30   ` [PATCH 2/4] mmlist: share mmlist with ksm Izik Eidus
2009-05-14  0:30     ` [PATCH 3/4] ksm: change ksm api to use madvise instead of ioctls Izik Eidus
2009-05-14  0:30       ` [PATCH 4/4] ksm: add support for scanning procsses that were not modifided to use ksm Izik Eidus
2009-06-08 16:18 ` [PATCH 0/4] RFC - ksm api change into madvise Hugh Dickins
2009-06-08 16:35   ` [PATCH mmotm] ksm: stop scan skipping pages Hugh Dickins
2009-06-08 17:42     ` Izik Eidus
2009-06-08 18:01       ` Hugh Dickins
2009-06-08 20:12         ` Izik Eidus
2009-06-08 21:05           ` Hugh Dickins
2009-06-08 17:17   ` [PATCH 0/4] RFC - ksm api change into madvise Izik Eidus
2009-06-08 18:32     ` Hugh Dickins
2009-06-08 20:10       ` Izik Eidus
2009-06-09  4:48         ` Izik Eidus
2009-06-09 17:24           ` Hugh Dickins
2009-06-09 19:27             ` Hugh Dickins
2009-06-10  6:28               ` Izik Eidus
2009-06-11 16:57                 ` Hugh Dickins
2009-06-12 21:49                   ` Izik Eidus
2009-06-08 22:57   ` Andrea Arcangeli
2009-06-13 15:04     ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).