* [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
@ 2013-05-09  9:50 wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 1/6] mm: add parameter remove_old in move_huge_pmd() wenchaolinux
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

  This series tries to enable the mremap syscall to CoW-map some private memory
regions, just like fork() does. As a result, a user-space application gets a
mirror of those regions, which can be used as a snapshot for further processing.

This series is based on commit
a12183c62717ac4579319189a00f5883a18dff08, pulled from upstream (Linux 3.9) on
2013-04-04. I am sending it first to see whether there are cases I have missed
handling; I will rebase onto the latest upstream code in the next version.

simple test code:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#ifndef MREMAP_DUP
#define MREMAP_DUP 4	/* from patch 6/6 of this series; not yet in libc headers */
#endif

int main(void)
{
    int len = 4096 * 2;
    /* Over-allocate so that a page-aligned region of len bytes fits
     * inside the heap allocation. */
    char *buf = malloc(len + 4096);
    void *old_addr = (void *)(((unsigned long)buf + 4095) & ~0xFFFUL);
    printf("mapping addr %p with len %d.\n", old_addr, len);
    char *oldc = old_addr;
    oldc[0] = 0;
    oldc[1] = 1;
    oldc[2] = 2;
    oldc[3] = 3;
    void *new_addr = mremap(old_addr, len, 0, MREMAP_DUP);
    assert(new_addr != MAP_FAILED);
    printf("result new addr %p.\n", new_addr);
    char *newc = new_addr;
    printf("old value is 0x%lx.\n", *((unsigned long *)oldc));
    printf("new value is 0x%lx.\n", *((unsigned long *)newc));
    newc[0] = 6;
    printf("old value is 0x%lx.\n", *((unsigned long *)oldc));
    printf("new value is 0x%lx.\n", *((unsigned long *)newc));
    oldc[0] = 9;
    printf("old value is 0x%lx.\n", *((unsigned long *)oldc));
    printf("new value is 0x%lx.\n", *((unsigned long *)newc));
    assert(0 == munmap(new_addr, len));
    return 0;
}

Wenchao Xia (6):
  mm: add parameter remove_old in move_huge_pmd()
  mm : allow copy between different addresses for copy_one_pte()
  mm : export rss vec helper functions
  mm : export is_cow_mapping()
  mm : add parameter remove_old in move_page_tables
  mm : add new option MREMAP_DUP to mremap() syscall

 fs/exec.c                 |    2 +-
 include/linux/huge_mm.h   |    2 +-
 include/linux/mm.h        |    9 ++-
 include/uapi/linux/mman.h |    1 +
 mm/huge_memory.c          |    6 +-
 mm/memory.c               |   33 ++++----
 mm/mremap.c               |  200 +++++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 224 insertions(+), 29 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 1/6] mm: add parameter remove_old in move_huge_pmd()
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 2/6] mm : allow copy between different addresses for copy_one_pte() wenchaolinux
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 include/linux/huge_mm.h |    2 +-
 mm/huge_memory.c        |    6 ++++--
 mm/mremap.c             |    2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..567dc1e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -29,7 +29,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 struct vm_area_struct *new_vma,
 			 unsigned long old_addr,
 			 unsigned long new_addr, unsigned long old_end,
-			 pmd_t *old_pmd, pmd_t *new_pmd);
+			 pmd_t *old_pmd, pmd_t *new_pmd, bool remove_old);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5a..f752388 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1402,10 +1402,11 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+/* This function copies or moves a pmd within the same mm */
 int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		  unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
-		  pmd_t *old_pmd, pmd_t *new_pmd)
+		  pmd_t *old_pmd, pmd_t *new_pmd, bool remove_old)
 {
 	int ret = 0;
 	pmd_t pmd;
@@ -1429,7 +1430,8 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 
 	ret = __pmd_trans_huge_lock(old_pmd, vma);
 	if (ret == 1) {
-		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		pmd = remove_old ?
+			pmdp_get_and_clear(mm, old_addr, old_pmd) : *old_pmd;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
diff --git a/mm/mremap.c b/mm/mremap.c
index 463a257..0f3c5be 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -178,7 +178,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (extent == HPAGE_PMD_SIZE)
 				err = move_huge_pmd(vma, new_vma, old_addr,
 						    new_addr, old_end,
-						    old_pmd, new_pmd);
+						    old_pmd, new_pmd, true);
 			if (err > 0) {
 				need_flush = true;
 				continue;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 2/6] mm : allow copy between different addresses for copy_one_pte()
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 1/6] mm: add parameter remove_old in move_huge_pmd() wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 3/6] mm : export rss vec helper functions wenchaolinux
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

This function can now be used to copy a pte within the same process,
between different source and destination addresses. It is also exported.

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 include/linux/mm.h |    4 ++++
 mm/memory.c        |   27 ++++++++++++++-------------
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7acc9dc..68f52bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,10 @@ int walk_page_range(unsigned long addr, unsigned long end,
 		struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
+unsigned long copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte,
+			   unsigned long dst_addr, unsigned long src_addr,
+			   struct vm_area_struct *vma, int *rss);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..0357cf1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -824,15 +824,15 @@ out:
 }
 
 /*
- * copy one vm_area from one task to the other. Assumes the page tables
- * already present in the new task to be cleared in the whole range
- * covered by this vma.
+ * Copy one pte from @src_addr to @dst_addr. Assumes the page tables and vma
+ * are already present at @dst_addr, and that @src_addr and @src_pte are
+ * covered by @vma. @rss is an array of size NR_MM_COUNTERS used by the caller
+ * to sync counters. @dst_mm may be equal to @src_mm. Returns 0 or a swap entry.
  */
-
-static inline unsigned long
-copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+unsigned long copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte,
+			   unsigned long dst_addr, unsigned long src_addr,
+			   struct vm_area_struct *vma, int *rss)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -872,7 +872,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					 */
 					make_migration_entry_read(&entry);
 					pte = swp_entry_to_pte(entry);
-					set_pte_at(src_mm, addr, src_pte, pte);
+					set_pte_at(src_mm, src_addr,
+						   src_pte, pte);
 				}
 			}
 		}
@@ -884,7 +885,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotect(src_mm, src_addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 
@@ -896,7 +897,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, src_addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
@@ -907,7 +908,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 out_set_pte:
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	set_pte_at(dst_mm, dst_addr, dst_pte, pte);
 	return 0;
 }
 
@@ -951,7 +952,7 @@ again:
 			continue;
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
-							vma, addr, rss);
+						addr, addr, vma, rss);
 		if (entry.val)
 			break;
 		progress += 8;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 3/6] mm : export rss vec helper functions
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 1/6] mm: add parameter remove_old in move_huge_pmd() wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 2/6] mm : allow copy between different addresses for copy_one_pte() wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 4/6] mm : export is_cow_mapping() wenchaolinux
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 include/linux/mm.h |    2 ++
 mm/memory.c        |    4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 68f52bc..5071a44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,8 @@ int walk_page_range(unsigned long addr, unsigned long end,
 		struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
+void init_rss_vec(int *rss);
+void add_mm_rss_vec(struct mm_struct *mm, int *rss);
 unsigned long copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			   pte_t *dst_pte, pte_t *src_pte,
 			   unsigned long dst_addr, unsigned long src_addr,
diff --git a/mm/memory.c b/mm/memory.c
index 0357cf1..add1562 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -643,12 +643,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	return 0;
 }
 
-static inline void init_rss_vec(int *rss)
+void init_rss_vec(int *rss)
 {
 	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
 }
 
-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 {
 	int i;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 4/6] mm : export is_cow_mapping()
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
                   ` (2 preceding siblings ...)
  2013-05-09  9:50 ` [RFC PATCH V1 3/6] mm : export rss vec helper functions wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 5/6] mm : add parameter remove_old in move_page_tables wenchaolinux
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5071a44..9bd01f5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -965,6 +965,7 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 void init_rss_vec(int *rss);
 void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+bool is_cow_mapping(vm_flags_t flags);
 unsigned long copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			   pte_t *dst_pte, pte_t *src_pte,
 			   unsigned long dst_addr, unsigned long src_addr,
diff --git a/mm/memory.c b/mm/memory.c
index add1562..e5456e1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -723,7 +723,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static inline bool is_cow_mapping(vm_flags_t flags)
+bool is_cow_mapping(vm_flags_t flags)
 {
 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 5/6] mm : add parameter remove_old in move_page_tables
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
                   ` (3 preceding siblings ...)
  2013-05-09  9:50 ` [RFC PATCH V1 4/6] mm : export is_cow_mapping() wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09  9:50 ` [RFC PATCH V1 6/6] mm : add new option MREMAP_DUP to mremap() syscall wenchaolinux
  2013-05-09 14:13 ` [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall Mel Gorman
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 fs/exec.c          |    2 +-
 include/linux/mm.h |    2 +-
 mm/mremap.c        |   97 ++++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a96a488..12721e1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -603,7 +603,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * process cleanup to remove whatever mess we made.
 	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length, false))
+				       vma, new_start, length, false, true))
 		return -ENOMEM;
 
 	lru_add_drain();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9bd01f5..a5eb34c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1085,7 +1085,7 @@ vm_is_stack(struct task_struct *task, struct vm_area_struct *vma, int in_group);
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks);
+		bool need_rmap_locks, bool remove_old);
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
diff --git a/mm/mremap.c b/mm/mremap.c
index 0f3c5be..2cc1cae 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -140,18 +140,93 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		mutex_unlock(&mapping->i_mmap_mutex);
 }
 
+static unsigned long dup_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+			unsigned long old_addr, unsigned long old_end,
+			struct vm_area_struct *new_vma, pmd_t *new_pmd,
+			unsigned long new_addr, bool need_rmap_locks)
+{
+	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = NULL;
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *old_pte, *new_pte;
+	spinlock_t *old_ptl, *new_ptl;
+	pte_t *orig_old_pte, *orig_new_pte;
+	int rss[NR_MM_COUNTERS];
+	swp_entry_t entry = (swp_entry_t){0};
+
+again:
+	init_rss_vec(rss);
+
+	/* Same with move_ptes */
+	if (need_rmap_locks) {
+		if (vma->vm_file) {
+			mapping = vma->vm_file->f_mapping;
+			mutex_lock(&mapping->i_mmap_mutex);
+		}
+		if (vma->anon_vma) {
+			anon_vma = vma->anon_vma;
+			anon_vma_lock_write(anon_vma);
+		}
+	}
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * pte locks because exclusive mmap_sem prevents deadlock.
+	 */
+	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	new_pte = pte_offset_map(new_pmd, new_addr);
+	new_ptl = pte_lockptr(mm, new_pmd);
+	if (new_ptl != old_ptl)
+		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	arch_enter_lazy_mmu_mode();
+	orig_old_pte = old_pte;
+	orig_new_pte = new_pte;
+
+	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
+				   new_pte++, new_addr += PAGE_SIZE) {
+		if (pte_none(*old_pte))
+			continue;
+		entry.val = copy_one_pte(mm, mm, new_pte, old_pte,
+					 new_addr, old_addr, vma, rss);
+		if (entry.val)
+			break;
+	}
+
+	arch_leave_lazy_mmu_mode();
+	add_mm_rss_vec(mm, rss);
+	if (new_ptl != old_ptl)
+		spin_unlock(new_ptl);
+	pte_unmap(orig_new_pte);
+	pte_unmap_unlock(orig_old_pte, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock_write(anon_vma);
+	if (mapping)
+		mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (entry.val) {
+		cond_resched();
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+			goto out;
+	}
+	if (old_addr < old_end)
+		goto again;
+ out:
+	return old_addr;
+}
+
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
 
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks)
+		bool need_rmap_locks, bool remove_old)
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	unsigned long t;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -178,7 +253,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (extent == HPAGE_PMD_SIZE)
 				err = move_huge_pmd(vma, new_vma, old_addr,
 						    new_addr, old_end,
-						    old_pmd, new_pmd, true);
+						    old_pmd, new_pmd,
+						    remove_old);
 			if (err > 0) {
 				need_flush = true;
 				continue;
@@ -195,8 +271,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			extent = next - new_addr;
 		if (extent > LATENCY_LIMIT)
 			extent = LATENCY_LIMIT;
-		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
-			  new_vma, new_pmd, new_addr, need_rmap_locks);
+		if (remove_old) {
+			move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+				  new_vma, new_pmd, new_addr, need_rmap_locks);
+		} else {
+			t = dup_ptes(vma, old_pmd, old_addr, old_addr + extent,
+				  new_vma, new_pmd, new_addr, need_rmap_locks);
+			if (t < old_addr + extent) {
+				old_addr = t;
+				break;
+			}
+		}
 		need_flush = true;
 	}
 	if (likely(need_flush))
@@ -248,7 +333,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
-				     need_rmap_locks);
+				     need_rmap_locks, true);
 	if (moved_len < old_len) {
 		/*
 		 * On error, move entries back from new area to old,
@@ -256,7 +341,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * and then proceed to unmap new area instead of old.
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
-				 true);
+				 true, true);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH V1 6/6] mm : add new option MREMAP_DUP to mremap() syscall
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
                   ` (4 preceding siblings ...)
  2013-05-09  9:50 ` [RFC PATCH V1 5/6] mm : add parameter remove_old in move_page_tables wenchaolinux
@ 2013-05-09  9:50 ` wenchaolinux
  2013-05-09 14:13 ` [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall Mel Gorman
  6 siblings, 0 replies; 22+ messages in thread
From: wenchaolinux @ 2013-05-09  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mgorman, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, Wenchao Xia

From: Wenchao Xia <wenchaolinux@gmail.com>

This option allows a user-space program to get a mirror of a memory
region, i.e. two virtual mappings of the same contents. The contents are
CoW'd, so the mirror can be used as a snapshot of a region of memory.

Shared memory is not CoW'd yet.
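
A rough usage sketch (taking MREMAP_DUP as 4 from the mman.h hunk below,
since libc headers do not define it; addr/len are placeholders):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MREMAP_DUP
#define MREMAP_DUP 4	/* value proposed by this patch */
#endif

/* Ask the kernel to pick an address for a CoW mirror of [addr, addr+len).
 * new_size is ignored for MREMAP_DUP; combine with MREMAP_FIXED and a
 * fifth argument to place the mirror at a chosen address instead. */
static void *dup_region(void *addr, size_t len)
{
    void *mirror = mremap(addr, len, 0, MREMAP_DUP);

    if (mirror == MAP_FAILED)
        perror("mremap(MREMAP_DUP)");
    return mirror;
}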

Signed-off-by: Wenchao Xia <wenchaolinux@gmail.com>
---
 include/uapi/linux/mman.h |    1 +
 mm/mremap.c               |  103 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..5cf7816 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -5,6 +5,7 @@
 
 #define MREMAP_MAYMOVE	1
 #define MREMAP_FIXED	2
+#define MREMAP_DUP	4
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
diff --git a/mm/mremap.c b/mm/mremap.c
index 2cc1cae..f6cc29f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -391,6 +391,45 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	return new_addr;
 }
 
+static unsigned long dup_vma(struct vm_area_struct *vma,
+			     unsigned long old_addr, unsigned long new_addr,
+			     unsigned long len, bool *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *new_vma;
+	unsigned long vm_flags = vma->vm_flags;
+	unsigned long new_pgoff;
+	unsigned long duped_len;
+	int err;
+	bool need_rmap_locks;
+
+	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
+	new_vma = copy_vma(&vma, new_addr, len, new_pgoff,
+			   &need_rmap_locks);
+	if (!new_vma)
+		return -ENOMEM;
+
+	duped_len = move_page_tables(vma, old_addr, new_vma, new_addr, len,
+				     need_rmap_locks, false);
+	if (duped_len < len) {
+		/* remove new duplicated area */
+		move_page_tables(new_vma, new_addr, vma, old_addr, duped_len,
+				 true, true);
+		err = do_munmap(mm, new_addr, duped_len);
+		VM_BUG_ON(err < 0);
+		return -ENOMEM;
+	}
+
+	vm_stat_account(mm, vma->vm_flags, vma->vm_file, len>>PAGE_SHIFT);
+
+	if (vm_flags & VM_LOCKED) {
+		mm->locked_vm += len >> PAGE_SHIFT;
+		*locked = true;
+	}
+
+	return new_addr;
+}
+
 static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	unsigned long old_len, unsigned long new_len, unsigned long *p)
 {
@@ -511,6 +550,59 @@ out:
 	return ret;
 }
 
+static unsigned long mremap_dup(unsigned long old_addr, unsigned long new_addr,
+				unsigned long len, unsigned long flags,
+				bool *locked)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long ret = -EINVAL, map_flags = 0;
+
+	if (flags & MREMAP_FIXED) {
+		if (new_addr & ~PAGE_MASK)
+			goto out;
+		if (len > TASK_SIZE || new_addr > TASK_SIZE - len)
+			goto out;
+		/* Overlap */
+		if ((new_addr <= old_addr) && (new_addr + len) > old_addr)
+			goto out;
+		if ((old_addr <= new_addr) && (old_addr + len) > new_addr)
+			goto out;
+
+		map_flags = MAP_FIXED;
+	} else {
+		new_addr = 0;
+	}
+
+	vma = find_vma(mm, old_addr);
+
+	/* We can't remap across vm area boundaries */
+	if (!vma || vma->vm_start > old_addr || len > vma->vm_end - old_addr)
+		goto out;
+
+	/* Currently, shared mem can't be cowed */
+	if (vma->vm_flags & VM_MAYSHARE)
+		map_flags |= MAP_SHARED;
+
+	ret = get_unmapped_area(vma->vm_file, new_addr, len, vma->vm_pgoff +
+				((old_addr - vma->vm_start) >> PAGE_SHIFT),
+				map_flags);
+	if (ret & ~PAGE_MASK)
+		goto out;
+
+	new_addr = ret;
+
+	/* for debug */
+	printk(KERN_WARNING
+		"mremap dup %lx with len %lx to %lx, original vm_flag %lx.",
+		old_addr, len, new_addr, vma->vm_flags);
+
+	ret = dup_vma(vma, old_addr, new_addr, len, locked);
+
+out:
+	return ret;
+}
+
 static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 {
 	unsigned long end = vma->vm_end + delta;
@@ -543,7 +635,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 
 	down_write(&current->mm->mmap_sem);
 
-	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DUP))
 		goto out;
 
 	if (addr & ~PAGE_MASK)
@@ -552,6 +644,10 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	old_len = PAGE_ALIGN(old_len);
 	new_len = PAGE_ALIGN(new_len);
 
+	if (flags & MREMAP_DUP) {
+		ret = mremap_dup(addr, new_addr, old_len, flags, &locked);
+		goto out;
+	}
 	/*
 	 * We allow a zero old-len as a special case
 	 * for DOS-emu "duplicate shm area" thing. But
@@ -638,7 +734,10 @@ out:
 	if (ret & ~PAGE_MASK)
 		vm_unacct_memory(charged);
 	up_write(&current->mm->mmap_sem);
-	if (locked && new_len > old_len)
+	/* locked == true only when operation success */
+	if ((flags & MREMAP_DUP) && (!IS_ERR_VALUE(ret)) && locked)
+		mm_populate(ret, old_len);
+	else if (locked && new_len > old_len)
 		mm_populate(new_addr + old_len, new_len - old_len);
 	return ret;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
                   ` (5 preceding siblings ...)
  2013-05-09  9:50 ` [RFC PATCH V1 6/6] mm : add new option MREMAP_DUP to mremap() syscall wenchaolinux
@ 2013-05-09 14:13 ` Mel Gorman
  2013-05-10  2:28   ` wenchao
  6 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2013-05-09 14:13 UTC (permalink / raw)
  To: wenchaolinux
  Cc: linux-mm, akpm, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha

On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
> From: Wenchao Xia <wenchaolinux@gmail.com>
> 
>   This serial try to enable mremap syscall to cow some private memory region,
> just like what fork() did. As a result, user space application would got a
> mirror of those region, and it can be used as a snapshot for further processing.
> 

Why not just fork()? Even if the application was threaded it should be
manageable to handle fork just for processing the private memory region
in question. I'm having trouble figuring out what sort of application
would require an interface like this.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-09 14:13 ` [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall Mel Gorman
@ 2013-05-10  2:28   ` wenchao
  2013-05-10  5:11     ` Stefan Hajnoczi
  2013-05-10  9:22     ` Kirill A. Shutemov
  0 siblings, 2 replies; 22+ messages in thread
From: wenchao @ 2013-05-10  2:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha

On 2013-5-9 22:13, Mel Gorman wrote:
> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>
>>    This serial try to enable mremap syscall to cow some private memory region,
>> just like what fork() did. As a result, user space application would got a
>> mirror of those region, and it can be used as a snapshot for further processing.
>>
>
> What not just fork()? Even if the application was threaded it should be
> managable to handle fork just for processing the private memory region
> in question. I'm having trouble figuring out what sort of application
> would require an interface like this.
>
   fork() has some troubles: parent-child communication, and sometimes
extra page copies.
   I'd like to snapshot a qemu guest's RAM; the current solution is:
1) fork()
2) pipe the guest RAM data from the child to the parent.
3) the parent writes down the contents.

   To avoid complex communication for data control, and to protect the
file contents, the parent rather than the child handles the data, via
a pipe, but that brings an additional copy. I think an explicit API to
CoW-map a memory region inside one process could avoid it, would be
faster, would CoW fewer pages, and would make the user-space code nicer.
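
Roughly, that scheme looks like the following (a simplified sketch only;
guest_ram, ram_size and snapshot_fd are placeholder names, error handling
is omitted, and in the real code the parent keeps servicing the guest
while it drains the pipe):

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* fork() gives the child an implicit CoW snapshot of guest RAM; the child
 * feeds it into a pipe and the parent writes it out. */
static int snapshot_via_fork(const char *guest_ram, size_t ram_size,
                             int snapshot_fd)
{
    int pipefd[2];
    char buf[65536];

    if (pipe(pipefd) < 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                 /* child: push the frozen RAM image */
        close(pipefd[0]);
        size_t off = 0;
        while (off < ram_size) {
            ssize_t n = write(pipefd[1], guest_ram + off, ram_size - off);
            if (n <= 0)
                break;
            off += n;
        }
        close(pipefd[1]);
        _exit(0);
    }

    /* parent: the extra copy this patch set wants to avoid */
    close(pipefd[1]);
    ssize_t n;
    while ((n = read(pipefd[0], buf, sizeof(buf))) > 0)
        write(snapshot_fd, buf, n);
    close(pipefd[0]);
    waitpid(pid, NULL, 0);
    return 0;
}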

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-10  2:28   ` wenchao
@ 2013-05-10  5:11     ` Stefan Hajnoczi
  2013-12-17  5:59         ` Xiao Guangrong
  2013-05-10  9:22     ` Kirill A. Shutemov
  1 sibling, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2013-05-10  5:11 UTC (permalink / raw)
  To: wenchao
  Cc: Mel Gorman, linux-mm, Andrew Morton, hughd, walken,
	Alexander Viro, kirill.shutemov, Xiao Guangrong, Anthony Liguori

On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
> On 2013-5-9 22:13, Mel Gorman wrote:
>
>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>
>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>
>>>    This serial try to enable mremap syscall to cow some private memory
>>> region,
>>> just like what fork() did. As a result, user space application would got
>>> a
>>> mirror of those region, and it can be used as a snapshot for further
>>> processing.
>>>
>>
>> What not just fork()? Even if the application was threaded it should be
>> managable to handle fork just for processing the private memory region
>> in question. I'm having trouble figuring out what sort of application
>> would require an interface like this.
>>
>   It have some troubles: parent - child communication, sometimes
> page copy.
>   I'd like to snapshot qemu guest's RAM, currently solution is:
> 1) fork()
> 2) pipe guest RAM data from child to parent.
> 3) parent write down the contents.
>
>   To avoid complex communication for data control, and file content
> protecting, So let parent instead of child handling the data with
> a pipe, but this brings additional copy(). I think an explicit API
> cow mapping an memory region inside one process, could avoid it,
> and faster and cow less pages, also make user space code nicer.

A new Linux-specific API is not portable and not available on existing
hosts.  Since QEMU supports non-Linux host operating systems the
fork() approach is preferable.

If you're worried about the memory copy - which should be benchmarked
- then vmsplice(2) can be used in the child process and splice(2) can
be used in the parent.  It probably doesn't help though since QEMU
scans RAM pages to find all-zero pages before sending them over the
socket, and at that point the memory copy might not make much
difference.
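
Something like this, roughly (an untested sketch; pipe_wr/pipe_rd are the
two ends of a pipe between child and parent, and guest_ram, chunk and
snapshot_fd are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <sys/types.h>

/* Child: hand a chunk of guest RAM to the pipe without copying it into a
 * bounce buffer. */
static ssize_t child_push(int pipe_wr, void *guest_ram, size_t chunk)
{
    struct iovec iov = { .iov_base = guest_ram, .iov_len = chunk };

    return vmsplice(pipe_wr, &iov, 1, 0);
}

/* Parent: move the data from the pipe straight into the snapshot file,
 * again without a userspace copy. */
static ssize_t parent_pull(int pipe_rd, int snapshot_fd, size_t chunk)
{
    return splice(pipe_rd, NULL, snapshot_fd, NULL, chunk, SPLICE_F_MOVE);
}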

Perhaps other applications can use this new flag better, but for QEMU
I think fork()'s portability is more important than the convenience of
accessing the CoW pages in the same process.

Stefan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-10  2:28   ` wenchao
  2013-05-10  5:11     ` Stefan Hajnoczi
@ 2013-05-10  9:22     ` Kirill A. Shutemov
  2013-05-11 14:16       ` Pavel Emelyanov
  1 sibling, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2013-05-10  9:22 UTC (permalink / raw)
  To: wenchao
  Cc: Mel Gorman, linux-mm, akpm, hughd, walken, viro, kirill.shutemov,
	xiaoguangrong, anthony, stefanha, xemul

wenchao wrote:
> On 2013-5-9 22:13, Mel Gorman wrote:
> > On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
> >> From: Wenchao Xia <wenchaolinux@gmail.com>
> >>
> >>    This serial try to enable mremap syscall to cow some private memory region,
> >> just like what fork() did. As a result, user space application would got a
> >> mirror of those region, and it can be used as a snapshot for further processing.
> >>
> >
> > What not just fork()? Even if the application was threaded it should be
> > managable to handle fork just for processing the private memory region
> > in question. I'm having trouble figuring out what sort of application
> > would require an interface like this.
> >
>    It have some troubles: parent - child communication, sometimes
> page copy.
>    I'd like to snapshot qemu guest's RAM, currently solution is:
> 1) fork()
> 2) pipe guest RAM data from child to parent.
> 3) parent write down the contents.

CC Pavel

I wonder if you can reuse the CRIU approach for memory snapshotting.

http://thread.gmane.org/gmane.linux.kernel/1483158/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-10  9:22     ` Kirill A. Shutemov
@ 2013-05-11 14:16       ` Pavel Emelyanov
  2013-05-13  2:40         ` wenchao
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Emelyanov @ 2013-05-11 14:16 UTC (permalink / raw)
  To: Kirill A. Shutemov, wenchao
  Cc: Mel Gorman, linux-mm, akpm, hughd, walken, viro, xiaoguangrong,
	anthony, stefanha

On 05/10/2013 01:22 PM, Kirill A. Shutemov wrote:
> wenchao wrote:
>> On 2013-5-9 22:13, Mel Gorman wrote:
>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>>
>>>>    This serial try to enable mremap syscall to cow some private memory region,
>>>> just like what fork() did. As a result, user space application would got a
>>>> mirror of those region, and it can be used as a snapshot for further processing.
>>>>
>>>
>>> What not just fork()? Even if the application was threaded it should be
>>> managable to handle fork just for processing the private memory region
>>> in question. I'm having trouble figuring out what sort of application
>>> would require an interface like this.
>>>
>>    It have some troubles: parent - child communication, sometimes
>> page copy.
>>    I'd like to snapshot qemu guest's RAM, currently solution is:
>> 1) fork()
>> 2) pipe guest RAM data from child to parent.
>> 3) parent write down the contents.
> 
> CC Pavel

Thank you!

> I wounder if you can reuse the CRIU approach for memory snapshoting.

I doubt it. First of all, we need to have the task's memory in an existing external
process which is not its child; with MREMAP_DUP we can't have this. And most
importantly, we don't need page duplication on modification; that is a waste of
memory for our case. We just need to know that a page has changed.

Wenchao, why can't you use the existing KVM dirty tracking for making a memory
snapshot? As per my understanding of how the KVM MMU works, you can:

1 turn dirty track on
2 read pages from their original places
3 pick dirty bitmap and read changed pages several times
4 freeze guest
5 repeat step 3
6 release guest

Does it work for you?
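
For step 3, a rough sketch of pulling the bitmap via KVM's dirty-log
interface (vm_fd, slot and bitmap are placeholders; the slot has to be
registered with KVM_MEM_LOG_DIRTY_PAGES for this to return anything):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fetch (and reset) the dirty-page bitmap of one memory slot; bitmap must
 * hold one bit per page in the slot. */
static int fetch_dirty_bitmap(int vm_fd, unsigned int slot, void *bitmap)
{
    struct kvm_dirty_log log;

    memset(&log, 0, sizeof(log));
    log.slot = slot;
    log.dirty_bitmap = bitmap;
    return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}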

This is very similar to how we do memory snapshots with CRIU (the dirty tracking
is the soft-dirty patch set from the link Kirill provided).

> http://thread.gmane.org/gmane.linux.kernel/1483158/
> 


Thanks,
Pavel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-11 14:16       ` Pavel Emelyanov
@ 2013-05-13  2:40         ` wenchao
  0 siblings, 0 replies; 22+ messages in thread
From: wenchao @ 2013-05-13  2:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Kirill A. Shutemov, Mel Gorman, linux-mm, akpm, hughd, walken,
	viro, xiaoguangrong, anthony, stefanha

On 2013-5-11 22:16, Pavel Emelyanov wrote:
> On 05/10/2013 01:22 PM, Kirill A. Shutemov wrote:
>> wenchao wrote:
>>> On 2013-5-9 22:13, Mel Gorman wrote:
>>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>>>
>>>>>     This serial try to enable mremap syscall to cow some private memory region,
>>>>> just like what fork() did. As a result, user space application would got a
>>>>> mirror of those region, and it can be used as a snapshot for further processing.
>>>>>
>>>>
>>>> What not just fork()? Even if the application was threaded it should be
>>>> managable to handle fork just for processing the private memory region
>>>> in question. I'm having trouble figuring out what sort of application
>>>> would require an interface like this.
>>>>
>>>     It have some troubles: parent - child communication, sometimes
>>> page copy.
>>>     I'd like to snapshot qemu guest's RAM, currently solution is:
>>> 1) fork()
>>> 2) pipe guest RAM data from child to parent.
>>> 3) parent write down the contents.
>>
>> CC Pavel
>
> Thank you!
>
   Sorry, I forgot to CC you. I had looked through the contents on the CRIU
website before writing these patches. :>

>> I wounder if you can reuse the CRIU approach for memory snapshoting.
>
> I doubt it. First of all, we need to have task's memory in existing external process
> which is not its child. With MREMAP_DUP we can't have this. And the most important
> thing is that we don't need pages duplication on modification. It's the waste of
> memory for our case. We just need to know the fact that the page has changed.
>
> Wenchao, why can't you use existing KVM dirty-tracking for making mem snapshot? As
> per my understanding of how KVM MMU works you can
>
> 1 turn dirty track on
> 2 read pages from their original places
> 3 pick dirty bitmap and read changed pages several times
> 4 freeze guest
> 5 repeat step 3
> 6 release guest
>
> Does it work for you?
>
> This is very very similar to how we do mem snapshot with CRIU (dirty tracking is the
> soft-dirty patches from the link Kirill provided).
>
   It is different. This approach is already used in qemu for migration,
and also as a workaround for snapshots. Dirty tracking effectively builds
a mirror of the latest memory contents, but for a snapshot we do not need
to keep the two mirrors in sync, and syncing up requires frequently
changed pages to be saved several times. That brings extra trouble for
the following steps, and also requires the transfer speed to exceed the
rate at which memory changes.
   Look at what the block layer provides, for example LVM2: most such
facilities offer an API to take a snapshot by CoW, hence the idea: why
not add the same thing for memory, so that the user has a complete set
of APIs? Later in the discussion fork() came up as one way to do it,
with the disadvantage that an additional process is involved, so I hope
to improve on that.
   Comparison:
        Dirty tracking   VS   CoW
CPU       higher              minimal
Memory    less                higher
I/O       higher              minimal

   Since dirty tracking keeps in sync with the latest memory data, it is
better suited to migration than to snapshots, although in principle
migration differs somewhat from snapshotting. Thinking further, dirty
tracking could work together with CoW to form an incremental snapshot
chain, reducing the pages that need to be written, but that is not
related to this patch.

Base->delta->delta

>> http://thread.gmane.org/gmane.linux.kernel/1483158/
>>
>
>
> Thanks,
> Pavel
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-05-10  5:11     ` Stefan Hajnoczi
@ 2013-12-17  5:59         ` Xiao Guangrong
  0 siblings, 0 replies; 22+ messages in thread
From: Xiao Guangrong @ 2013-12-17  5:59 UTC (permalink / raw)
  To: Stefan Hajnoczi, wenchao
  Cc: Mel Gorman, linux-mm, Andrew Morton, hughd, walken,
	Alexander Viro, kirill.shutemov, Anthony Liguori, KVM


CCed KVM guys.

On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
> On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
>> On 2013-5-9 22:13, Mel Gorman wrote:
>>
>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>>
>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>>
>>>>    This serial try to enable mremap syscall to cow some private memory
>>>> region,
>>>> just like what fork() did. As a result, user space application would got
>>>> a
>>>> mirror of those region, and it can be used as a snapshot for further
>>>> processing.
>>>>
>>>
>>> What not just fork()? Even if the application was threaded it should be
>>> managable to handle fork just for processing the private memory region
>>> in question. I'm having trouble figuring out what sort of application
>>> would require an interface like this.
>>>
>>   It have some troubles: parent - child communication, sometimes
>> page copy.
>>   I'd like to snapshot qemu guest's RAM, currently solution is:
>> 1) fork()
>> 2) pipe guest RAM data from child to parent.
>> 3) parent write down the contents.
>>
>>   To avoid complex communication for data control, and file content
>> protecting, So let parent instead of child handling the data with
>> a pipe, but this brings additional copy(). I think an explicit API
>> cow mapping an memory region inside one process, could avoid it,
>> and faster and cow less pages, also make user space code nicer.
> 
> A new Linux-specific API is not portable and not available on existing
> hosts.  Since QEMU supports non-Linux host operating systems the
> fork() approach is preferable.
> 
> If you're worried about the memory copy - which should be benchmarked
> - then vmsplice(2) can be used in the child process and splice(2) can
> be used in the parent.  It probably doesn't help though since QEMU
> scans RAM pages to find all-zero pages before sending them over the
> socket, and at that point the memory copy might not make much
> difference.
> 
> Perhaps other applications can use this new flag better, but for QEMU
> I think fork()'s portability is more important than the convenience of
> accessing the CoW pages in the same process.

Yup, I agree with you that a new syscall is sometimes not a good solution.

Currently, we're working on live-update[1], which will be enabled in Qemu first.
This feature lets the guest keep running smoothly on a new Qemu binary without
a restart, which is good for doing security updates.

In this case, we need to move the guest memory from the old qemu instance to the
new one. fork() cannot help because we need to exec() a new instance, and after
that all memory mappings are destroyed.

We tried enabling SPLICE_F_MOVE[2] for vmsplice() to move the memory without a
memory copy, but the performance isn't as good as we expected, due to several
limitations: the page size, locking, the message-size limit on pipes, etc.
Of course, we will continue to improve this, but wenchao's patch looks like a
new direction for us.

To coordinate with your fork() approach, maybe we can introduce a new VMA flag,
something like VM_KEEP_ONEXEC, to tell exec() not to destroy this VMA. What do
you think, or do you have another idea? We'd really appreciate your suggestions.

[1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
[2] https://lkml.org/lkml/2013/10/25/285



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-12-17  5:59         ` Xiao Guangrong
@ 2013-12-30 20:23           ` Marcelo Tosatti
  -1 siblings, 0 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2013-12-30 20:23 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Stefan Hajnoczi, wenchao, Mel Gorman, linux-mm, Andrew Morton,
	hughd, walken, Alexander Viro, kirill.shutemov, Anthony Liguori,
	KVM

On Tue, Dec 17, 2013 at 01:59:04PM +0800, Xiao Guangrong wrote:
> 
> CCed KVM guys.
> 
> On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
> > On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
> >> On 2013-5-9 22:13, Mel Gorman wrote:
> >>
> >>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
> >>>>
> >>>> From: Wenchao Xia <wenchaolinux@gmail.com>
> >>>>
> >>>>    This serial try to enable mremap syscall to cow some private memory
> >>>> region,
> >>>> just like what fork() did. As a result, user space application would got
> >>>> a
> >>>> mirror of those region, and it can be used as a snapshot for further
> >>>> processing.
> >>>>
> >>>
> >>> What not just fork()? Even if the application was threaded it should be
> >>> managable to handle fork just for processing the private memory region
> >>> in question. I'm having trouble figuring out what sort of application
> >>> would require an interface like this.
> >>>
> >>   It have some troubles: parent - child communication, sometimes
> >> page copy.
> >>   I'd like to snapshot qemu guest's RAM, currently solution is:
> >> 1) fork()
> >> 2) pipe guest RAM data from child to parent.
> >> 3) parent write down the contents.
> >>
> >>   To avoid complex communication for data control, and file content
> >> protecting, So let parent instead of child handling the data with
> >> a pipe, but this brings additional copy(). I think an explicit API
> >> cow mapping an memory region inside one process, could avoid it,
> >> and faster and cow less pages, also make user space code nicer.
> > 
> > A new Linux-specific API is not portable and not available on existing
> > hosts.  Since QEMU supports non-Linux host operating systems the
> > fork() approach is preferable.
> > 
> > If you're worried about the memory copy - which should be benchmarked
> > - then vmsplice(2) can be used in the child process and splice(2) can
> > be used in the parent.  It probably doesn't help though since QEMU
> > scans RAM pages to find all-zero pages before sending them over the
> > socket, and at that point the memory copy might not make much
> > difference.
> > 
> > Perhaps other applications can use this new flag better, but for QEMU
> > I think fork()'s portability is more important than the convenience of
> > accessing the CoW pages in the same process.
> 
> Yup, I agree with you that the new syscall sometimes is not a good solution.
> 
> Currently, we're working on live-update[1] that will be enabled on Qemu firstly,
> this feature let the guest run on the new Qemu binary smoothly without
> restart, it's good for us to do security-update.
> 
> In this case, we need to move the guest memory on old qemu instance to the
> new one, fork() can not help because we need to exec() a new instance, after
> that all memory mapping will be destroyed.
> 
> We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory without
> memory-copy but the performance isn't so good as we expected: it's due to
> some limitations: the page-size, lock, message-size limitation on pipe, etc.
> Of course, we will continue to improve this, but wenchao's patch seems a new
> direction for us.
> 
> To coordinate with your fork() approach, maybe we can introduce a new flag
> for VMA, something like: VM_KEEP_ONEXEC, to tell exec() to do not destroy
> this VMA. How about this or you guy have new idea? Really appreciate for your
> suggestion.
> 
> [1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
> [2] https://lkml.org/lkml/2013/10/25/285

Hi,

What is the purpose of snapshotting guest RAM here, in the context of
local migration?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-12-30 20:23           ` Marcelo Tosatti
  (?)
@ 2013-12-31 12:06           ` Xiao Guangrong
  2013-12-31 18:53               ` Marcelo Tosatti
  -1 siblings, 1 reply; 22+ messages in thread
From: Xiao Guangrong @ 2013-12-31 12:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stefan Hajnoczi, wenchao, Mel Gorman, linux-mm, Andrew Morton,
	hughd, walken, Alexander Viro, kirill.shutemov, Anthony Liguori,
	KVM


On Dec 31, 2013, at 4:23 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Tue, Dec 17, 2013 at 01:59:04PM +0800, Xiao Guangrong wrote:
>> 
>> CCed KVM guys.
>> 
>> On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
>>>> On 2013-5-9 22:13, Mel Gorman wrote:
>>>> 
>>>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>>>> 
>>>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>>>> 
>>>>>>  This serial try to enable mremap syscall to cow some private memory
>>>>>> region,
>>>>>> just like what fork() did. As a result, user space application would got
>>>>>> a
>>>>>> mirror of those region, and it can be used as a snapshot for further
>>>>>> processing.
>>>>>> 
>>>>> 
>>>>> What not just fork()? Even if the application was threaded it should be
>>>>> managable to handle fork just for processing the private memory region
>>>>> in question. I'm having trouble figuring out what sort of application
>>>>> would require an interface like this.
>>>>> 
>>>> It have some troubles: parent - child communication, sometimes
>>>> page copy.
>>>> I'd like to snapshot qemu guest's RAM, currently solution is:
>>>> 1) fork()
>>>> 2) pipe guest RAM data from child to parent.
>>>> 3) parent write down the contents.
>>>> 
>>>> To avoid complex communication for data control, and file content
>>>> protecting, So let parent instead of child handling the data with
>>>> a pipe, but this brings additional copy(). I think an explicit API
>>>> cow mapping an memory region inside one process, could avoid it,
>>>> and faster and cow less pages, also make user space code nicer.
>>> 
>>> A new Linux-specific API is not portable and not available on existing
>>> hosts.  Since QEMU supports non-Linux host operating systems the
>>> fork() approach is preferable.
>>> 
>>> If you're worried about the memory copy - which should be benchmarked
>>> - then vmsplice(2) can be used in the child process and splice(2) can
>>> be used in the parent.  It probably doesn't help though since QEMU
>>> scans RAM pages to find all-zero pages before sending them over the
>>> socket, and at that point the memory copy might not make much
>>> difference.
>>> 
>>> Perhaps other applications can use this new flag better, but for QEMU
>>> I think fork()'s portability is more important than the convenience of
>>> accessing the CoW pages in the same process.
>> 
>> Yup, I agree with you that the new syscall sometimes is not a good solution.
>> 
>> Currently, we're working on live-update[1] that will be enabled on Qemu firstly,
>> this feature let the guest run on the new Qemu binary smoothly without
>> restart, it's good for us to do security-update.
>> 
>> In this case, we need to move the guest memory on old qemu instance to the
>> new one, fork() can not help because we need to exec() a new instance, after
>> that all memory mapping will be destroyed.
>> 
>> We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory without
>> memory-copy but the performance isn't so good as we expected: it's due to
>> some limitations: the page-size, lock, message-size limitation on pipe, etc.
>> Of course, we will continue to improve this, but wenchao's patch seems a new
>> direction for us.
>> 
>> To coordinate with your fork() approach, maybe we can introduce a new flag
>> for VMA, something like: VM_KEEP_ONEXEC, to tell exec() to do not destroy
>> this VMA. How about this or you guy have new idea? Really appreciate for your
>> suggestion.
>> 
>> [1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
>> [2] https://lkml.org/lkml/2013/10/25/285
> 
> Hi,
> 

Hi Marcelo,


> What is the purpose of snapshotting guest RAM here, in the context of
> local migration?

RAM snapshotting and local migration are different use cases.
The reason I asked for your suggestions here is that I thought
they both need to do the same thing: move memory from one process
to another efficiently. What do you think? :)
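
For concreteness, the pipe-based hand-off discussed above (the plain
vmsplice()/splice() path, not the experimental SPLICE_F_MOVE variant
from [2]) might look roughly like the sketch below. It is an
illustration only: fork() stands in for the old/new QEMU pair, the
region size and output file name are invented, and error handling is
omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 16 * 4096;               /* stands in for a chunk of guest RAM */
    char *ram;
    if (posix_memalign((void **)&ram, 4096, len))
        return 1;
    memset(ram, 0xab, len);

    int p[2];
    pipe(p);

    if (fork() == 0) {                    /* sender: push the pages into the pipe */
        close(p[0]);
        struct iovec iov = { .iov_base = ram, .iov_len = len };
        while (iov.iov_len > 0) {
            ssize_t n = vmsplice(p[1], &iov, 1, SPLICE_F_GIFT);
            if (n <= 0)
                _exit(1);
            iov.iov_base = (char *)iov.iov_base + n;
            iov.iov_len  -= n;
        }
        close(p[1]);
        _exit(0);
    }

    close(p[1]);                          /* receiver: drain the pipe into a file */
    int out = open("ram-snapshot.img", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    while (splice(p[0], NULL, out, NULL, 1 << 20, 0) > 0)
        ;
    close(out);
    close(p[0]);
    wait(NULL);
    return 0;
}

Even here everything is still funneled through the pipe a buffer at a
time, which is exactly the page-size/message-size limitation mentioned
above.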



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-12-31 12:06           ` Xiao Guangrong
@ 2013-12-31 18:53               ` Marcelo Tosatti
  0 siblings, 0 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2013-12-31 18:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Stefan Hajnoczi, wenchao, Mel Gorman, linux-mm, Andrew Morton,
	hughd, walken, Alexander Viro, kirill.shutemov, Anthony Liguori,
	KVM

On Tue, Dec 31, 2013 at 08:06:51PM +0800, Xiao Guangrong wrote:
> 
> On Dec 31, 2013, at 4:23 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > On Tue, Dec 17, 2013 at 01:59:04PM +0800, Xiao Guangrong wrote:
> >> 
> >> CCed KVM guys.
> >> 
> >> On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
> >>>> On 2013-5-9 22:13, Mel Gorman wrote:
> >>>> 
> >>>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
> >>>>>> 
> >>>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
> >>>>>> 
> >>>>>>  This serial try to enable mremap syscall to cow some private memory
> >>>>>> region,
> >>>>>> just like what fork() did. As a result, user space application would got
> >>>>>> a
> >>>>>> mirror of those region, and it can be used as a snapshot for further
> >>>>>> processing.
> >>>>>> 
> >>>>> 
> >>>>> What not just fork()? Even if the application was threaded it should be
> >>>>> managable to handle fork just for processing the private memory region
> >>>>> in question. I'm having trouble figuring out what sort of application
> >>>>> would require an interface like this.
> >>>>> 
> >>>> It have some troubles: parent - child communication, sometimes
> >>>> page copy.
> >>>> I'd like to snapshot qemu guest's RAM, currently solution is:
> >>>> 1) fork()
> >>>> 2) pipe guest RAM data from child to parent.
> >>>> 3) parent write down the contents.
> >>>> 
> >>>> To avoid complex communication for data control, and file content
> >>>> protecting, So let parent instead of child handling the data with
> >>>> a pipe, but this brings additional copy(). I think an explicit API
> >>>> cow mapping an memory region inside one process, could avoid it,
> >>>> and faster and cow less pages, also make user space code nicer.
> >>> 
> >>> A new Linux-specific API is not portable and not available on existing
> >>> hosts.  Since QEMU supports non-Linux host operating systems the
> >>> fork() approach is preferable.
> >>> 
> >>> If you're worried about the memory copy - which should be benchmarked
> >>> - then vmsplice(2) can be used in the child process and splice(2) can
> >>> be used in the parent.  It probably doesn't help though since QEMU
> >>> scans RAM pages to find all-zero pages before sending them over the
> >>> socket, and at that point the memory copy might not make much
> >>> difference.
> >>> 
> >>> Perhaps other applications can use this new flag better, but for QEMU
> >>> I think fork()'s portability is more important than the convenience of
> >>> accessing the CoW pages in the same process.
> >> 
> >> Yup, I agree with you that the new syscall sometimes is not a good solution.
> >> 
> >> Currently, we're working on live-update[1] that will be enabled on Qemu firstly,
> >> this feature let the guest run on the new Qemu binary smoothly without
> >> restart, it's good for us to do security-update.
> >> 
> >> In this case, we need to move the guest memory on old qemu instance to the
> >> new one, fork() can not help because we need to exec() a new instance, after
> >> that all memory mapping will be destroyed.
> >> 
> >> We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory without
> >> memory-copy but the performance isn't so good as we expected: it's due to
> >> some limitations: the page-size, lock, message-size limitation on pipe, etc.
> >> Of course, we will continue to improve this, but wenchao's patch seems a new
> >> direction for us.
> >> 
> >> To coordinate with your fork() approach, maybe we can introduce a new flag
> >> for VMA, something like: VM_KEEP_ONEXEC, to tell exec() to do not destroy
> >> this VMA. How about this or you guy have new idea? Really appreciate for your
> >> suggestion.
> >> 
> >> [1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
> >> [2] https://lkml.org/lkml/2013/10/25/285
> > 
> > Hi,
> > 
> 
> Hi Marcelo,
> 
> 
> > What is the purpose of snapshotting guest RAM here, in the context of
> > local migration?
> 
> RAM snapshotting and local migration are different use cases.
> The reason I asked for your suggestions here is that I thought
> they both need to do the same thing: move memory from one process
> to another efficiently. What do you think? :)

Another possibility is to use memory that is not anonymous for guest
RAM, such as hugetlbfs or tmpfs. 

IIRC ksm and thp have limitations wrt tmpfs.
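
As a sketch of that alternative (the path and size below are
illustrative and error handling is omitted): back the guest RAM with a
tmpfs or hugetlbfs file mapped MAP_SHARED, keep the descriptor open
across exec(), and mmap() it again from the new binary.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_guest_ram(const char *path, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);   /* e.g. a file on /dev/shm */
    ftruncate(fd, size);

    /* Make sure close-on-exec is NOT set, so the descriptor survives the
     * exec() of the new binary and can simply be mmap()ed again there. */
    fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);

    *fd_out = fd;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(void)
{
    int fd;
    size_t size = 128UL << 20;
    void *ram = map_guest_ram("/dev/shm/guest-ram", size, &fd);

    printf("guest RAM at %p, backed by fd %d\n", ram, fd);
    /* ... run the guest; at live-update time hand "fd" (or the path) and
     * "size" to the re-exec()ed binary, which repeats the mmap() above ... */

    munmap(ram, size);
    close(fd);
    return 0;
}

The new binary only needs the descriptor (or the path) plus the size to
rebuild an identical mapping.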

Still curious about RAM snapshotting.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
  2013-12-31 18:53               ` Marcelo Tosatti
@ 2014-01-06  7:41                 ` Xiao Guangrong
  -1 siblings, 0 replies; 22+ messages in thread
From: Xiao Guangrong @ 2014-01-06  7:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stefan Hajnoczi, wenchao, Mel Gorman, linux-mm, Andrew Morton,
	hughd, walken, Alexander Viro, kirill.shutemov, Anthony Liguori,
	KVM

On 01/01/2014 02:53 AM, Marcelo Tosatti wrote:
> On Tue, Dec 31, 2013 at 08:06:51PM +0800, Xiao Guangrong wrote:
>>
>> On Dec 31, 2013, at 4:23 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>>> On Tue, Dec 17, 2013 at 01:59:04PM +0800, Xiao Guangrong wrote:
>>>>
>>>> CCed KVM guys.
>>>>
>>>> On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 10, 2013 at 4:28 AM, wenchao <wenchaolinux@gmail.com> wrote:
>>>>>> On 2013-5-9 22:13, Mel Gorman wrote:
>>>>>>
>>>>>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>>>>>>
>>>>>>>> From: Wenchao Xia <wenchaolinux@gmail.com>
>>>>>>>>
>>>>>>>>  This serial try to enable mremap syscall to cow some private memory
>>>>>>>> region,
>>>>>>>> just like what fork() did. As a result, user space application would got
>>>>>>>> a
>>>>>>>> mirror of those region, and it can be used as a snapshot for further
>>>>>>>> processing.
>>>>>>>>
>>>>>>>
>>>>>>> What not just fork()? Even if the application was threaded it should be
>>>>>>> managable to handle fork just for processing the private memory region
>>>>>>> in question. I'm having trouble figuring out what sort of application
>>>>>>> would require an interface like this.
>>>>>>>
>>>>>> It have some troubles: parent - child communication, sometimes
>>>>>> page copy.
>>>>>> I'd like to snapshot qemu guest's RAM, currently solution is:
>>>>>> 1) fork()
>>>>>> 2) pipe guest RAM data from child to parent.
>>>>>> 3) parent write down the contents.
>>>>>>
>>>>>> To avoid complex communication for data control, and file content
>>>>>> protecting, So let parent instead of child handling the data with
>>>>>> a pipe, but this brings additional copy(). I think an explicit API
>>>>>> cow mapping an memory region inside one process, could avoid it,
>>>>>> and faster and cow less pages, also make user space code nicer.
>>>>>
>>>>> A new Linux-specific API is not portable and not available on existing
>>>>> hosts.  Since QEMU supports non-Linux host operating systems the
>>>>> fork() approach is preferable.
>>>>>
>>>>> If you're worried about the memory copy - which should be benchmarked
>>>>> - then vmsplice(2) can be used in the child process and splice(2) can
>>>>> be used in the parent.  It probably doesn't help though since QEMU
>>>>> scans RAM pages to find all-zero pages before sending them over the
>>>>> socket, and at that point the memory copy might not make much
>>>>> difference.
>>>>>
>>>>> Perhaps other applications can use this new flag better, but for QEMU
>>>>> I think fork()'s portability is more important than the convenience of
>>>>> accessing the CoW pages in the same process.
>>>>
>>>> Yup, I agree with you that the new syscall sometimes is not a good solution.
>>>>
>>>> Currently, we're working on live-update[1] that will be enabled on Qemu firstly,
>>>> this feature let the guest run on the new Qemu binary smoothly without
>>>> restart, it's good for us to do security-update.
>>>>
>>>> In this case, we need to move the guest memory on old qemu instance to the
>>>> new one, fork() can not help because we need to exec() a new instance, after
>>>> that all memory mapping will be destroyed.
>>>>
>>>> We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory without
>>>> memory-copy but the performance isn't so good as we expected: it's due to
>>>> some limitations: the page-size, lock, message-size limitation on pipe, etc.
>>>> Of course, we will continue to improve this, but wenchao's patch seems a new
>>>> direction for us.
>>>>
>>>> To coordinate with your fork() approach, maybe we can introduce a new flag
>>>> for VMA, something like: VM_KEEP_ONEXEC, to tell exec() to do not destroy
>>>> this VMA. How about this or you guy have new idea? Really appreciate for your
>>>> suggestion.
>>>>
>>>> [1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
>>>> [2] https://lkml.org/lkml/2013/10/25/285
>>>
>>> Hi,
>>>
>>
>> Hi Marcelo,
>>
>>
>>> What is the purpose of snapshotting guest RAM here, in the context of
>>> local migration?
>>
>> RAM snapshotting and local migration are different use cases.
>> The reason I asked for your suggestions here is that I thought
>> they both need to do the same thing: move memory from one process
>> to another efficiently. What do you think? :)
> 
> Another possibility is to use memory that is not anonymous for guest
> RAM, such as hugetlbfs or tmpfs. 
> 
> IIRC ksm and thp have limitations wrt tmpfs.

Yes, KSM and THP are what we're concerned about.
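
For reference, both are opted into per region with madvise() on
anonymous memory, which is what a tmpfs/hugetlbfs-backed mapping could
not take advantage of at the time. A minimal sketch (the wrapper is
hypothetical and error handling is omitted):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_guest_ram(size_t size)
{
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED)
        return NULL;

    madvise(ram, size, MADV_HUGEPAGE);    /* ask for transparent hugepages */
    madvise(ram, size, MADV_MERGEABLE);   /* make the region a KSM candidate */
    return ram;
}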

> 
> Still curious about RAM snapshotting.

Wen Chao, could you please tell us more about it?


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-01-06  7:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-09  9:50 [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 1/6] mm: add parameter remove_old in move_huge_pmd() wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 2/6] mm : allow copy between different addresses for copy_one_pte() wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 3/6] mm : export rss vec helper functions wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 4/6] mm : export is_cow_mapping() wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 5/6] mm : add parameter remove_old in move_page_tables wenchaolinux
2013-05-09  9:50 ` [RFC PATCH V1 6/6] mm : add new option MREMAP_DUP to mremap() syscall wenchaolinux
2013-05-09 14:13 ` [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall Mel Gorman
2013-05-10  2:28   ` wenchao
2013-05-10  5:11     ` Stefan Hajnoczi
2013-12-17  5:59       ` Xiao Guangrong
2013-12-17  5:59         ` Xiao Guangrong
2013-12-30 20:23         ` Marcelo Tosatti
2013-12-31 12:06           ` Xiao Guangrong
2013-12-31 18:53             ` Marcelo Tosatti
2014-01-06  7:41               ` Xiao Guangrong
2013-05-10  9:22     ` Kirill A. Shutemov
2013-05-11 14:16       ` Pavel Emelyanov
2013-05-13  2:40         ` wenchao
