linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] mm: Ability to monitor task memory changes (v3)
@ 2013-04-11 11:28 Pavel Emelyanov
  2013-04-11 11:28 ` [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Pavel Emelyanov
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:28 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

Hello,

This is the implementation of the soft-dirty bit concept that should help
keep track of changes in user memory, which in turn is much needed by the
checkpoint-restore project (http://criu.org). Let me briefly recap what
the issue is.

<< EOF
To create a dump of an application we save all the information about it
to files, and the biggest part of such a dump is the contents of the tasks'
memory. However, there are usage scenarios where it's not necessary to fetch
_all_ the task memory while creating a dump. For example, when doing periodic
dumps, a full memory dump is required only at the first step; after that only
the incremental changes of memory need to be taken. Another example is live
migration: we copy all the memory to the destination node without stopping
the tasks, then stop them, check which pages have changed, dump them together
with the rest of the state, and copy that to the destination node. This
decreases the freeze time significantly.

That said, some help from the kernel to watch how processes modify the
contents of their memory is required.
EOF

The proposal is to track changes with the help of a new soft-dirty bit, this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point the kernel clears the soft-dirty _and_ the writable bits from
   all ptes of process $pid. From then on, every write to any page will result
   in a #PF and a subsequent call to pte_mkdirty/pmd_mkdirty, which in turn
   will set the soft-dirty flag.

2. Then read /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (the 55th one). If it is set, the respective pte was written to since the
   last call to clear_refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although that bit is also
used by kmemcheck, the latter marks kernel pages with it, while soft-dirty
is only set on user pages, so the two do not conflict with each other.

The set is against v3.9-rc5.
It includes preparations to the /proc/pid clear_refs file, adds the pagemap2
file, and introduces the soft-dirty concept itself, with Andrew's comments
on the previous version (hopefully) addressed.


History of the set:

* Previous version of this patch set, commented on by Andrew:
  http://lwn.net/Articles/546184/

* Pre-previous ftrace-based approach:
  http://permalink.gmane.org/gmane.linux.kernel.mm/91428

  This one was not nice, because ftrace could drop events, so we might
  miss significant information about page updates.

  Another issue with it -- it was impossible to use it to watch an arbitrary
  task -- the task had to mark memory areas with madvise itself to make the
  events occur.

  Also, a program that monitored the update events could interfere with
  anyone else trying to use ftrace.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/5] clear_refs: Sanitize accepted commands declaration
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
@ 2013-04-11 11:28 ` Pavel Emelyanov
  2013-04-11 21:17   ` Andrew Morton
  2013-04-11 11:29 ` [PATCH 2/5] clear_refs: Introduce private struct for mm_walk Pavel Emelyanov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:28 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

A new clear-refs type will be added in the next patch, so prepare
code for that.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/proc/task_mmu.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..67c2586 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -688,6 +688,13 @@ const struct file_operations proc_tid_smaps_operations = {
 	.release	= seq_release_private,
 };
 
+enum clear_refs_types {
+	CLEAR_REFS_ALL = 1,
+	CLEAR_REFS_ANON,
+	CLEAR_REFS_MAPPED,
+	CLEAR_REFS_LAST,
+};
+
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
@@ -719,10 +726,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-#define CLEAR_REFS_ALL 1
-#define CLEAR_REFS_ANON 2
-#define CLEAR_REFS_MAPPED 3
-
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	char buffer[PROC_NUMBUF];
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
-	int type;
+	enum clear_refs_types type;
 	int rv;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
-	rv = kstrtoint(strstrip(buffer), 10, &type);
+	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
 	if (rv < 0)
 		return rv;
-	if (type < CLEAR_REFS_ALL || type > CLEAR_REFS_MAPPED)
+	if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
 		return -EINVAL;
 	task = get_proc_task(file_inode(file));
 	if (!task)
-- 
1.7.6.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/5] clear_refs: Introduce private struct for mm_walk
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
  2013-04-11 11:28 ` [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Pavel Emelyanov
@ 2013-04-11 11:29 ` Pavel Emelyanov
  2013-04-11 11:29 ` [PATCH 3/5] pagemap: Introduce pagemap_entry_t without pmshift bits Pavel Emelyanov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:29 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

In the next patch the clear-refs type will be required in the
clear_refs_pte_range function, so prepare walk->private to carry this info.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/proc/task_mmu.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 67c2586..c59a148 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -695,10 +695,15 @@ enum clear_refs_types {
 	CLEAR_REFS_LAST,
 };
 
+struct clear_refs_private {
+	struct vm_area_struct *vma;
+};
+
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = walk->private;
+	struct clear_refs_private *cp = walk->private;
+	struct vm_area_struct *vma = cp->vma;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
@@ -751,13 +756,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
+		struct clear_refs_private cp = {
+		};
 		struct mm_walk clear_refs_walk = {
 			.pmd_entry = clear_refs_pte_range,
 			.mm = mm,
+			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			clear_refs_walk.private = vma;
+			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
 				continue;
 			/*
-- 
1.7.6.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/5] pagemap: Introduce pagemap_entry_t without pmshift bits
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
  2013-04-11 11:28 ` [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Pavel Emelyanov
  2013-04-11 11:29 ` [PATCH 2/5] clear_refs: Introduce private struct for mm_walk Pavel Emelyanov
@ 2013-04-11 11:29 ` Pavel Emelyanov
  2013-04-11 11:29 ` [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Pavel Emelyanov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:29 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

These bits are always constant (== PAGE_SHIFT) and just occupy space in the
entry. Moreover, in the next patch we will need to report one more bit in
the pagemap, but all the bits in an entry are already busy.

Thus, describe a pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/proc/task_mmu.c |   50 ++++++++++++++++++++++++++++++--------------------
 1 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c59a148..7f9b66c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -805,6 +805,7 @@ typedef struct {
 struct pagemapread {
 	int pos, len;
 	pagemap_entry_t *buffer;
+	bool v2;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -818,14 +819,16 @@ struct pagemapread {
 #define PM_PSHIFT_BITS      6
 #define PM_PSHIFT_OFFSET    (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
 #define PM_PSHIFT_MASK      (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define PM_PSHIFT(x)        (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
+#define __PM_PSHIFT(x)      (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
 #define PM_PFRAME_MASK      ((1LL << PM_PSHIFT_OFFSET) - 1)
 #define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
+/* in pagemap2 pshift bits are occupied with more status bits */
+#define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
 
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
 #define PM_FILE             PM_STATUS(1LL)
-#define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
+#define PM_NOT_PRESENT(v2)  PM_STATUS2(v2, 0)
 #define PM_END_OF_BUFFER    1
 
 static inline pagemap_entry_t make_pme(u64 val)
@@ -848,7 +851,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
-	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
 
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		err = add_to_pagemap(addr, &pme, pm);
@@ -858,7 +861,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
-static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	u64 frame, flags;
@@ -877,18 +880,18 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme,
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
 	} else {
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 		return;
 	}
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
 
-	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
+	*pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 					pmd_t pmd, int offset)
 {
 	/*
@@ -898,12 +901,12 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
 	 */
 	if (pmd_present(pmd))
 		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+				| PM_STATUS2(pm->v2, 0) | PM_PRESENT);
 	else
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 }
 #else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 						pmd_t pmd, int offset)
 {
 }
@@ -916,7 +919,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
@@ -926,7 +929,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 			offset = (addr & ~PAGEMAP_WALK_MASK) >>
 					PAGE_SHIFT;
-			thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+			thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
@@ -943,7 +946,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end)) {
 			vma = find_vma(walk->mm, addr);
-			pme = make_pme(PM_NOT_PRESENT);
+			pme = make_pme(PM_NOT_PRESENT(pm->v2));
 		}
 
 		/* check that 'vma' actually covers this address,
@@ -951,7 +954,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, vma, addr, *pte);
+			pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
@@ -966,14 +969,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 					pte_t pte, int offset)
 {
 	if (pte_present(pte))
 		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+				| PM_STATUS2(pm->v2, 0) | PM_PRESENT);
 	else
-		*pme = make_pme(PM_NOT_PRESENT);
+		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -987,7 +990,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		huge_pte_to_pagemap_entry(&pme, *pte, offset);
+		huge_pte_to_pagemap_entry(&pme, pm, *pte, offset);
 		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
@@ -1023,8 +1026,8 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-static ssize_t pagemap_read(struct file *file, char __user *buf,
-			    size_t count, loff_t *ppos)
+static ssize_t do_pagemap_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos, bool v2)
 {
 	struct task_struct *task = get_proc_task(file_inode(file));
 	struct mm_struct *mm;
@@ -1049,6 +1052,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!count)
 		goto out_task;
 
+	pm.v2 = v2;
 	pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
 	pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
 	ret = -ENOMEM;
@@ -1121,6 +1125,12 @@ out:
 	return ret;
 }
 
+static ssize_t pagemap_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	return do_pagemap_read(file, buf, count, ppos, false);
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
-- 
1.7.6.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2013-04-11 11:29 ` [PATCH 3/5] pagemap: Introduce pagemap_entry_t without pmshift bits Pavel Emelyanov
@ 2013-04-11 11:29 ` Pavel Emelyanov
  2013-04-11 21:19   ` Andrew Morton
  2013-05-02 17:08   ` Matt Helsley
  2013-04-11 11:30 ` [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Pavel Emelyanov
  2013-04-16 19:51 ` [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly Pavel Emelyanov
  5 siblings, 2 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:29 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

This file is the same as the pagemap one, but shows entries with bits
55-60 set to zero (reserved for future use). The next patch will occupy
one of them.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 Documentation/filesystems/proc.txt |    2 ++
 Documentation/vm/pagemap.txt       |    3 +++
 fs/proc/base.c                     |    2 ++
 fs/proc/internal.h                 |    1 +
 fs/proc/task_mmu.c                 |   11 +++++++++++
 5 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fd8d0d5..22c47ec 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -487,6 +487,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
 using /proc/kpageflags and number of times a page is mapped using
 /proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
+(There's also a /proc/pid/pagemap2 file which is the 2nd version of the
+ pagemap one).
 
 1.2 Kernel data
 ---------------
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 7587493..4350397 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -30,6 +30,9 @@ There are three components to pagemap:
    determine which areas of memory are actually mapped and llseek to
    skip over unmapped regions.
 
+ * /proc/pid/pagemap2.  This file provides the same info as the pagemap
+   does, but bits 55-60 are reserved for future use and thus zero
+
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
    times each page is mapped, indexed by PFN.
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..34966ce 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2537,6 +2537,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("pagemap",    S_IRUGO, proc_pagemap_operations),
+	REG("pagemap2",   S_IRUGO, proc_pagemap2_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -2882,6 +2883,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
 	REG("smaps",     S_IRUGO, proc_tid_smaps_operations),
 	REG("pagemap",    S_IRUGO, proc_pagemap_operations),
+	REG("pagemap2",   S_IRUGO, proc_pagemap2_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",      S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..cc12bb7 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -67,6 +67,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_pagemap2_operations;
 extern const struct file_operations proc_net_operations;
 extern const struct inode_operations proc_net_inode_operations;
 extern const struct inode_operations proc_pid_link_inode_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7f9b66c..3138009 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1135,6 +1135,17 @@ const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 };
+
+static ssize_t pagemap2_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	return do_pagemap_read(file, buf, count, ppos, true);
+}
+
+const struct file_operations proc_pagemap2_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= pagemap2_read,
+};
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
-- 
1.7.6.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
                   ` (3 preceding siblings ...)
  2013-04-11 11:29 ` [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Pavel Emelyanov
@ 2013-04-11 11:30 ` Pavel Emelyanov
  2013-04-11 21:24   ` Andrew Morton
  2013-04-12 15:53   ` [PATCH 6/5] selftest: Add simple test for soft-dirty bit Pavel Emelyanov
  2013-04-16 19:51 ` [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly Pavel Emelyanov
  5 siblings, 2 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-11 11:30 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

  1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs")
  2. Wait some time.
  3. Read the soft-dirty bits (the 55th in /proc/PID/pagemap2 entries)

To make this tracking possible, the writable bit is cleared from PTEs when
the soft-dirty bit is. Thus, after this, when the task tries to modify a
page at some virtual address, a #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.

Note that although all of the task's address space is marked as r/o after
the soft-dirty bits are cleared, the #PF-s that occur after that are
processed quickly. This is because the pages are still mapped to physical
memory, so all the kernel has to do is find this fact out and put the
writable, dirty and soft-dirty bits back on the PTE.

Another thing to note is that when mremap moves PTEs they are marked with
soft-dirty as well, since from the user perspective mremap modifies the
virtual memory at mremap's new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 Documentation/filesystems/proc.txt   |    7 +++++-
 Documentation/vm/pagemap.txt         |    4 ++-
 Documentation/vm/soft-dirty.txt      |   36 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h       |   26 ++++++++++++++++++++++-
 arch/x86/include/asm/pgtable_types.h |    6 +++++
 fs/proc/task_mmu.c                   |   36 +++++++++++++++++++++++++++++----
 include/asm-generic/pgtable.h        |   22 ++++++++++++++++++++
 mm/Kconfig                           |   12 +++++++++++
 mm/huge_memory.c                     |    2 +-
 mm/mremap.c                          |    2 +-
 10 files changed, 142 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/vm/soft-dirty.txt

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22c47ec..488c094 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -473,7 +473,8 @@ This file is only present if the CONFIG_MMU kernel configuration option is
 enabled.
 
 The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
-bits on both physical and virtual pages associated with a process.
+bits on both physical and virtual pages associated with a process, and the
+soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
 To clear the bits for all the pages associated with the process
     > echo 1 > /proc/PID/clear_refs
 
@@ -482,6 +483,10 @@ To clear the bits for the anonymous pages associated with the process
 
 To clear the bits for the file mapped pages associated with the process
     > echo 3 > /proc/PID/clear_refs
+
+To clear the soft-dirty bit
+    > echo 4 > /proc/PID/clear_refs
+
 Any other value written to /proc/PID/clear_refs will have no effect.
 
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 4350397..394cc03 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -31,7 +31,9 @@ There are three components to pagemap:
    skip over unmapped regions.
 
  * /proc/pid/pagemap2.  This file provides the same info as the pagemap
-   does, but bits 55-60 are reserved for future use and thus zero
+   does, but bits 56-60 are reserved for future use and thus zero
+
+      Bit 55 means pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
 
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
    times each page is mapped, indexed by PFN.
diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt
new file mode 100644
index 0000000..9a12a59
--- /dev/null
+++ b/Documentation/vm/soft-dirty.txt
@@ -0,0 +1,36 @@
+                            SOFT-DIRTY PTEs
+
+  The soft-dirty is a bit on a PTE which helps to track which pages a task
+writes to. In order to do this tracking one should
+
+  1. Clear soft-dirty bits from the task's PTEs.
+
+     This is done by writing "4" into the /proc/PID/clear_refs file of the
+     task in question.
+
+  2. Wait some time.
+
+  3. Read soft-dirty bits from the PTEs.
+
+     This is done by reading from the /proc/PID/pagemap. The bit 55 of the
+     64-bit qword is the soft-dirty one. If set, the respective PTE was
+     written to since step 1.
+
+
+  Internally, to do this tracking, the writable bit is cleared from PTEs
+when the soft-dirty bit is cleared. So, after this, when the task tries to
+modify a page at some virtual address the #PF occurs and the kernel sets
+the soft-dirty bit on the respective PTE.
+
+  Note, that although all the task's address space is marked as r/o after the
+soft-dirty bits clear, the #PF-s that occur after that are processed fast.
+This is so, since the pages are still mapped to physical memory, and thus all
+the kernel does is finds this fact out and puts both writable and soft-dirty
+bits on the PTE.
+
+
+  This feature is actively used by the checkpoint-restore project. You
+can find more details about it on http://criu.org
+
+
+-- Pavel Emelyanov, Apr 9, 2013
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..eb97470 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -207,7 +207,7 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -271,7 +271,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
@@ -294,6 +294,28 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
+#define __HAVE_SOFT_DIRTY
+
+static inline int pte_soft_dirty(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SOFT_DIRTY;
+}
+
+static inline int pmd_soft_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SOFT_DIRTY;
+}
+
+static inline pte_t pte_mksoft_dirty(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
+}
+
 /*
  * Mask out unsupported bits in a present pgprot.  Non-present pgprots
  * can use those bits for other purposes, so leave them be.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 567b5d0..dcf718c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -55,6 +55,18 @@
 #define _PAGE_HIDDEN	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The same hidden bit is used by kmemcheck, but since kmemcheck
+ * works on kernel pages while soft-dirty engine on user space,
+ * they do not conflict with each other.
+ */
+
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define _PAGE_SOFT_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
+#else
+#define _PAGE_SOFT_DIRTY	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #else
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3138009..aae2474 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -692,13 +692,32 @@ enum clear_refs_types {
 	CLEAR_REFS_ALL = 1,
 	CLEAR_REFS_ANON,
 	CLEAR_REFS_MAPPED,
+	CLEAR_REFS_SOFT_DIRTY,
 	CLEAR_REFS_LAST,
 };
 
 struct clear_refs_private {
 	struct vm_area_struct *vma;
+	enum clear_refs_types type;
 };
 
+static inline void clear_soft_dirty(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *pte)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	/*
+	 * The soft-dirty tracker uses #PF-s to catch writes
+	 * to pages, so write-protect the pte as well. See the
+	 * Documentation/vm/soft-dirty.txt for full description
+	 * of how soft-dirty works.
+	 */
+	pte_t ptent = *pte;
+	ptent = pte_wrprotect(ptent);
+	ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY);
+	set_pte_at(vma->vm_mm, addr, pte, ptent);
+#endif
+}
+
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
@@ -718,6 +731,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!pte_present(ptent))
 			continue;
 
+		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
+			clear_soft_dirty(vma, addr, pte);
+			continue;
+		}
+
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
 			continue;
@@ -757,6 +775,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	mm = get_task_mm(task);
 	if (mm) {
 		struct clear_refs_private cp = {
+			.type = type,
 		};
 		struct mm_walk clear_refs_walk = {
 			.pmd_entry = clear_refs_pte_range,
@@ -825,6 +844,7 @@ struct pagemapread {
 /* in pagemap2 pshift bits are occupied with more status bits */
 #define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
 
+#define __PM_SOFT_DIRTY      (1LL)
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
 #define PM_FILE             PM_STATUS(1LL)
@@ -866,6 +886,7 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 {
 	u64 frame, flags;
 	struct page *page = NULL;
+	int flags2 = 0;
 
 	if (pte_present(pte)) {
 		frame = pte_pfn(pte);
@@ -886,13 +907,15 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
+	if (pte_soft_dirty(pte))
+		flags2 |= __PM_SOFT_DIRTY;
 
-	*pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
+	*pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
-					pmd_t pmd, int offset)
+		pmd_t pmd, int offset, int pmd_flags2)
 {
 	/*
 	 * Currently pmd for thp is always present because thp can not be
@@ -901,13 +924,13 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
 	 */
 	if (pmd_present(pmd))
 		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
-				| PM_STATUS2(pm->v2, 0) | PM_PRESENT);
+				| PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
 	else
 		*pme = make_pme(PM_NOT_PRESENT(pm->v2));
 }
 #else
 static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
-						pmd_t pmd, int offset)
+		pmd_t pmd, int offset, int pmd_flags2)
 {
 }
 #endif
@@ -924,12 +947,15 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 	if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
+		int pmd_flags2;
+
+		pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0);
 		for (; addr != end; addr += PAGE_SIZE) {
 			unsigned long offset;
 
 			offset = (addr & ~PAGEMAP_WALK_MASK) >>
 					PAGE_SHIFT;
-			thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
+			thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bfd8768..d74bdd2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,6 +386,28 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+#ifndef __HAVE_SOFT_DIRTY
+static inline int pte_soft_dirty(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pte_t pte_mksoft_dirty(pte_t pte)
+{
+	return pte;
+}
+
+static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef __HAVE_PFNMAP_TRACKING
 /*
  * Interfaces that can be used by architecture code to keep track of
diff --git a/mm/Kconfig b/mm/Kconfig
index 3bea74f..147689e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -471,3 +471,15 @@ config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config MEM_SOFT_DIRTY
+	bool "Track memory changes"
+	depends on CHECKPOINT_RESTORE && X86
+	select PROC_PAGE_MONITOR
+	help
+	  This option enables memory changes tracking by introducing a
+	  soft-dirty bit on pte-s. This bit is set when someone writes
+	  into a page, just as the regular dirty bit is, but unlike the
+	  latter it can be cleared by hand.
+
+	  See Documentation/vm/soft-dirty.txt for more details.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..eef1606 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1431,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	if (ret == 1) {
 		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
-		set_pmd_at(mm, new_addr, new_pmd, pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
 		spin_unlock(&mm->page_table_lock);
 	}
 out:
diff --git a/mm/mremap.c b/mm/mremap.c
index 463a257..3708655 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte);
+		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
 	}
 
 	arch_leave_lazy_mmu_mode();
-- 
1.7.6.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/5] clear_refs: Sanitize accepted commands declaration
  2013-04-11 11:28 ` [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Pavel Emelyanov
@ 2013-04-11 21:17   ` Andrew Morton
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2013-04-11 21:17 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux MM, Linux Kernel Mailing List

On Thu, 11 Apr 2013 15:28:51 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:

> A new clear-refs type will be added in the next patch, so prepare
> code for that.
> 
> @@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  	char buffer[PROC_NUMBUF];
>  	struct mm_struct *mm;
>  	struct vm_area_struct *vma;
> -	int type;
> +	enum clear_refs_types type;
>  	int rv;
>  
>  	memset(buffer, 0, sizeof(buffer));
> @@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		count = sizeof(buffer) - 1;
>  	if (copy_from_user(buffer, buf, count))
>  		return -EFAULT;
> -	rv = kstrtoint(strstrip(buffer), 10, &type);
> +	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);

This is naughty.  The compiler is allowed to put the enum into storage
which is smaller (or, I guess, larger) than sizeof(int).  I've seen one
compiler which puts such an enum into a 16-bit word.

--- a/fs/proc/task_mmu.c~clear_refs-sanitize-accepted-commands-declaration-fix
+++ a/fs/proc/task_mmu.c
@@ -734,6 +734,7 @@ static ssize_t clear_refs_write(struct f
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	enum clear_refs_types type;
+	int itype;
 	int rv;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -741,9 +742,10 @@ static ssize_t clear_refs_write(struct f
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
-	rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
+	rv = kstrtoint(strstrip(buffer), 10, &itype);
 	if (rv < 0)
 		return rv;
+	type = (enum clear_refs_types)itype;
 	if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
 		return -EINVAL;
 	task = get_proc_task(file_inode(file));
_
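For illustration, the safe pattern in the fix above can be sketched in plain
userspace C. The enum values below are illustrative stand-ins for the patch's
clear_refs_types, and atoi() stands in for the kernel's kstrtoint(); this is a
sketch, not the kernel code:

```c
#include <stdlib.h>

/* Illustrative stand-in for the kernel's clear_refs_types. */
enum clear_refs_types {
	CLEAR_REFS_ALL = 1,
	CLEAR_REFS_ANON,
	CLEAR_REFS_MAPPED,
	CLEAR_REFS_SOFT_DIRTY,
	CLEAR_REFS_LAST,
};

/*
 * Parse into a plain int first, then convert.  Casting &type to
 * (int *) would be unsafe, since the compiler is free to pick a
 * storage type narrower (or wider) than int for the enum.
 */
static int parse_clear_refs(const char *buf, enum clear_refs_types *type)
{
	int itype = atoi(buf);	/* userspace stand-in for kstrtoint() */

	if (itype < CLEAR_REFS_ALL || itype >= CLEAR_REFS_LAST)
		return -1;
	*type = (enum clear_refs_types)itype;
	return 0;
}
```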


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
  2013-04-11 11:29 ` [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Pavel Emelyanov
@ 2013-04-11 21:19   ` Andrew Morton
  2013-04-12 13:10     ` Pavel Emelyanov
  2013-05-02 17:08   ` Matt Helsley
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2013-04-11 21:19 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux MM, Linux Kernel Mailing List

On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:

> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.

I'm not understanding the motivation for this.  What does the current
/proc/pid/pagemap have in those bit positions?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-11 11:30 ` [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Pavel Emelyanov
@ 2013-04-11 21:24   ` Andrew Morton
  2013-04-12 13:14     ` Pavel Emelyanov
  2013-04-12 15:53   ` [PATCH 6/5] selftest: Add simple test for soft-dirty bit Pavel Emelyanov
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2013-04-11 21:24 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux MM, Linux Kernel Mailing List

On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:

> The soft-dirty is a bit on a PTE which helps to track which pages a task
> writes to. In order to do this tracking one should
> 
>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
>   2. Wait some time.
>   3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
> 
> To do this tracking, the writable bit is cleared from PTEs when the
> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> bit on the respective PTE.
> 
> Note, that although all the task's address space is marked as r/o after the
> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> This is so, since the pages are still mapped to physical memory, and thus
> all the kernel does is finds this fact out and puts back writable, dirty
> and soft-dirty bits on the PTE.
> 
> Another thing to note, is that when mremap moves PTEs they are marked with
> soft-dirty as well, since from the user perspective mremap modifies the
> virtual memory at mremap's new address.
> 
> ...
>
> +config MEM_SOFT_DIRTY
> +	bool "Track memory changes"
> +	depends on CHECKPOINT_RESTORE && X86

I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
a general facility and I expect others will want to get their hands on
it for unrelated things.

From that perspective, the dependency on X86 is awful.  What's the
problem here and what do other architectures need to do to be able to
support the feature?


You have a test application, I assume.  It would be helpful if we could
get that into tools/testing/selftests.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
  2013-04-11 21:19   ` Andrew Morton
@ 2013-04-12 13:10     ` Pavel Emelyanov
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-12 13:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux MM, Linux Kernel Mailing List

On 04/12/2013 01:19 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:
> 
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
> 
> I'm not understanding the motivation for this.  What does the current
> /proc/pid/pagemap have in those bit positions?

A constant PAGE_SHIFT value.

> 
> .
> 

Thanks,
Pavel
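For reference, the entry layout under discussion can be decoded with a few
masks. The field positions follow the PM_STATUS/__PM_PSHIFT macros in the
patch above; this is a userspace sketch, not kernel code. In legacy pagemap,
bits 55-60 carry a constant PAGE_SHIFT, while pagemap2 leaves them zero so
that bit 55 can later carry soft-dirty:

```c
#include <stdint.h>

/* Field layout per the PM_* macros in the patch. */
#define PME_PFN(e)	((uint64_t)(e) & ((1ULL << 55) - 1))	/* bits 0-54 */
#define PME_PSHIFT(e)	(((uint64_t)(e) >> 55) & 0x3f)		/* bits 55-60 */
#define PME_FILE	(1ULL << 61)
#define PME_SWAP	(1ULL << 62)
#define PME_PRESENT	(1ULL << 63)

/* In pagemap2, bits 55-60 are reserved (zero); bit 55 becomes soft-dirty. */
#define PME2_SOFT_DIRTY	(1ULL << 55)

/* Build a legacy pagemap entry for a present page (illustrative helper). */
static uint64_t make_legacy_entry(uint64_t pfn, unsigned int shift)
{
	return pfn | ((uint64_t)shift << 55) | PME_PRESENT;
}
```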

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-11 21:24   ` Andrew Morton
@ 2013-04-12 13:14     ` Pavel Emelyanov
  2013-04-15 21:46       ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-12 13:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux MM, Linux Kernel Mailing List

On 04/12/2013 01:24 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:
> 
>> The soft-dirty is a bit on a PTE which helps to track which pages a task
>> writes to. In order to do this tracking one should
>>
>>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
>>   2. Wait some time.
>>   3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
>>
>> To do this tracking, the writable bit is cleared from PTEs when the
>> soft-dirty bit is. Thus, after this, when the task tries to modify a page
>> at some virtual address the #PF occurs and the kernel sets the soft-dirty
>> bit on the respective PTE.
>>
>> Note, that although all the task's address space is marked as r/o after the
>> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
>> This is so, since the pages are still mapped to physical memory, and thus
>> all the kernel does is finds this fact out and puts back writable, dirty
>> and soft-dirty bits on the PTE.
>>
>> Another thing to note, is that when mremap moves PTEs they are marked with
>> soft-dirty as well, since from the user perspective mremap modifies the
>> virtual memory at mremap's new address.
>>
>> ...
>>
>> +config MEM_SOFT_DIRTY
>> +	bool "Track memory changes"
>> +	depends on CHECKPOINT_RESTORE && X86
> 
> I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> a general facility and I expect others will want to get their hands on
> it for unrelated things.

OK. Just tell me when you need the dependency removing patch.

> From that perspective, the dependency on X86 is awful.  What's the
> problem here and what do other architectures need to do to be able to
> support the feature?

The problem here is that I don't know what free bits are available on
page table entries on other architectures. I was about to resolve this
for ARM very soon, but for the rest of them I need help from other people.

> You have a test application, I assume.  It would be helpful if we could
> get that into tools/testing/selftests.

If a very stupid 10-line test is OK, then I can cook a patch with it.

Other than that, I test this using the whole CRIU project, which is too
big for inclusion.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/5] selftest: Add simple test for soft-dirty bit
  2013-04-11 11:30 ` [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Pavel Emelyanov
  2013-04-11 21:24   ` Andrew Morton
@ 2013-04-12 15:53   ` Pavel Emelyanov
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-12 15:53 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List

It creates a mapping of 3 pages and checks that reads, writes and clear-refs
result in the present and soft-dirty bits reported by pagemap2 being set as
expected.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 575ef80..827f2c0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += cpu-hotplug
 TARGETS += memory-hotplug
 TARGETS += efivarfs
 TARGETS += ptrace
+TARGETS += soft-dirty
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/soft-dirty/Makefile b/tools/testing/selftests/soft-dirty/Makefile
new file mode 100644
index 0000000..a9cdc82
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -iquote../../../../include/uapi -Wall
+soft-dirty: soft-dirty.c
+
+all: soft-dirty
+
+clean:
+	rm -f soft-dirty
+
+run_tests: all
+	@./soft-dirty || echo "soft-dirty selftests: [FAIL]"
diff --git a/tools/testing/selftests/soft-dirty/soft-dirty.c b/tools/testing/selftests/soft-dirty/soft-dirty.c
new file mode 100644
index 0000000..aba4f87
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/soft-dirty.c
@@ -0,0 +1,114 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+
+typedef unsigned long long u64;
+
+#define PME_PRESENT	(1ULL << 63)
+#define PME_SOFT_DIRTY	(1ULL << 55)
+
+#define PAGES_TO_TEST	3
+#ifndef PAGE_SIZE
+#define PAGE_SIZE	4096
+#endif
+
+static void get_pagemap2(char *mem, u64 *map)
+{
+	int fd;
+
+	fd = open("/proc/self/pagemap2", O_RDONLY);
+	if (fd < 0) {
+		perror("Can't open pagemap2");
+		exit(1);
+	}
+
+	lseek(fd, (unsigned long)mem / PAGE_SIZE * sizeof(u64), SEEK_SET);
+	read(fd, map, sizeof(u64) * PAGES_TO_TEST);
+	close(fd);
+}
+
+static inline char map_p(u64 map)
+{
+	return map & PME_PRESENT ? 'p' : '-';
+}
+
+static inline char map_sd(u64 map)
+{
+	return map & PME_SOFT_DIRTY ? 'd' : '-';
+}
+
+static int check_pte(int step, int page, u64 *map, u64 want)
+{
+	if ((map[page] & want) != want) {
+		printf("Step %d Page %d has %c%c, want %c%c\n",
+				step, page,
+				map_p(map[page]), map_sd(map[page]),
+				map_p(want), map_sd(want));
+		return 1;
+	}
+
+	return 0;
+}
+
+static void clear_refs(void)
+{
+	int fd;
+	char *v = "4";
+
+	fd = open("/proc/self/clear_refs", O_WRONLY);
+	if (fd < 0 || write(fd, v, 1) != 1) {
+		perror("Can't clear soft-dirty bit");
+		exit(1);
+	}
+	close(fd);
+}
+
+int main(void)
+{
+	char *mem, x;
+	u64 map[PAGES_TO_TEST];
+
+	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
+			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, 0, 0);
+
+	x = mem[0];
+	mem[2 * PAGE_SIZE] = 'c';
+	get_pagemap2(mem, map);
+
+	if (check_pte(1, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(1, 1, map, 0))
+		return 1;
+	if (check_pte(1, 2, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+
+	clear_refs();
+	get_pagemap2(mem, map);
+
+	if (check_pte(2, 0, map, PME_PRESENT))
+		return 1;
+	if (check_pte(2, 1, map, 0))
+		return 1;
+	if (check_pte(2, 2, map, PME_PRESENT))
+		return 1;
+
+	mem[0] = 'a';
+	mem[PAGE_SIZE] = 'b';
+	x = mem[2 * PAGE_SIZE];
+	get_pagemap2(mem, map);
+
+	if (check_pte(3, 0, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 1, map, PME_PRESENT | PME_SOFT_DIRTY))
+		return 1;
+	if (check_pte(3, 2, map, PME_PRESENT))
+		return 1;
+
+	(void)x; /* gcc warn */
+
+	printf("PASS\n");
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-12 13:14     ` Pavel Emelyanov
@ 2013-04-15 21:46       ` Andrew Morton
  2013-04-15 23:57         ` Stephen Rothwell
  2013-04-16 19:58         ` Pavel Emelyanov
  0 siblings, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2013-04-15 21:46 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux MM, Linux Kernel Mailing List

On Fri, 12 Apr 2013 17:14:03 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:

> On 04/12/2013 01:24 AM, Andrew Morton wrote:
> > On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:
> > 
> >> The soft-dirty is a bit on a PTE which helps to track which pages a task
> >> writes to. In order to do this tracking one should
> >>
> >>   1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
> >>   2. Wait some time.
> >>   3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
> >>
> >> To do this tracking, the writable bit is cleared from PTEs when the
> >> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> >> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> >> bit on the respective PTE.
> >>
> >> Note, that although all the task's address space is marked as r/o after the
> >> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> >> This is so, since the pages are still mapped to physical memory, and thus
> >> all the kernel does is finds this fact out and puts back writable, dirty
> >> and soft-dirty bits on the PTE.
> >>
> >> Another thing to note, is that when mremap moves PTEs they are marked with
> >> soft-dirty as well, since from the user perspective mremap modifies the
> >> virtual memory at mremap's new address.
> >>
> >> ...
> >>
> >> +config MEM_SOFT_DIRTY
> >> +	bool "Track memory changes"
> >> +	depends on CHECKPOINT_RESTORE && X86
> > 
> > I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> > a general facility and I expect others will want to get their hands on
> > it for unrelated things.
> 
> OK. Just tell me when you need the dependency removing patch.
> 
> > From that perspective, the dependency on X86 is awful.  What's the
> > problem here and what do other architectures need to do to be able to
> > support the feature?
> 
> The problem here is that I don't know what free bits are available on
> page table entries on other architectures. I was about to resolve this
> for ARM very soon, but for the rest of them I need help from other people.

Well, this is also a thing arch maintainers can do when they feel a
need to support the feature on their architecture.  To support them at
that time we should provide them with a) adequate information in an
easy-to-find place (eg, a nice comment at the site of the reference x86
implementation) and b) a userspace test app.

> > You have a test application, I assume.  It would be helpful if we could
> > get that into tools/testing/selftests.
> 
> If a very stupid 10-lines test is OK, then I can cook a patch with it.

I think that would be good.  As a low-priority thing, please.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-15 21:46       ` Andrew Morton
@ 2013-04-15 23:57         ` Stephen Rothwell
  2013-04-16 19:58         ` Pavel Emelyanov
  1 sibling, 0 replies; 19+ messages in thread
From: Stephen Rothwell @ 2013-04-15 23:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pavel Emelyanov, Linux MM, Linux Kernel Mailing List


On Mon, 15 Apr 2013 14:46:19 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture.  To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.

and c) a CONFIG symbol (maybe CONFIG_HAVE_MEM_SOFT_DIRTY, maybe in
arch/Kconfig) that they can select to get this feature (so that this
feature then depends on that CONFIG symbol instead of X86).  That way we
don't have to go back and tidy this up when 15 or so architectures
implement it.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly
  2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
                   ` (4 preceding siblings ...)
  2013-04-11 11:30 ` [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Pavel Emelyanov
@ 2013-04-16 19:51 ` Pavel Emelyanov
  2013-04-16 23:24   ` Stephen Rothwell
  5 siblings, 1 reply; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-16 19:51 UTC (permalink / raw)
  To: Andrew Morton, Linux MM, Linux Kernel Mailing List; +Cc: Stephen Rothwell

As Stephen Rothwell pointed out, config options that depend on
architecture support are better wrapped into a select + depends-on
scheme.

Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
for X86.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>

---

diff --git a/arch/Kconfig b/arch/Kconfig
index 1455579..71c06ab 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_IRQ_TIME_ACCOUNTING
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
+config HAVE_ARCH_SOFT_DIRTY
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 70c0f3d..81c0843 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
 	select OLD_SIGACTION if X86_32
 	select COMPAT_OLD_SIGACTION if IA32_EMULATION
+	select HAVE_ARCH_SOFT_DIRTY
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index eb97470..ebf9373 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -294,8 +294,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
-#define __HAVE_SOFT_DIRTY
-
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_SOFT_DIRTY;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d74bdd2..a2ca78f 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,7 +386,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef __HAVE_SOFT_DIRTY
+#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 147689e..7deac66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -474,7 +474,7 @@ config FRONTSWAP
 
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
-	depends on CHECKPOINT_RESTORE && X86
+	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
 	select PROC_PAGE_MONITOR
 	help
 	  This option enables memory changes tracking by introducing a

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking
  2013-04-15 21:46       ` Andrew Morton
  2013-04-15 23:57         ` Stephen Rothwell
@ 2013-04-16 19:58         ` Pavel Emelyanov
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-04-16 19:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux MM, Linux Kernel Mailing List

>>> From that perspective, the dependency on X86 is awful.  What's the
>>> problem here and what do other architectures need to do to be able to
>>> support the feature?
>>
>> The problem here is that I don't know what free bits are available on
>> page table entries on other architectures. I was about to resolve this
>> for ARM very soon, but for the rest of them I need help from other people.
> 
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture.  To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.

Item a) is presumably covered by two things -- the required arch-specific
PTE manipulations are all collected in asm-generic/pgtable.h under
!CONFIG_HAVE_ARCH_SOFT_DIRTY, and Documentation/vm/soft-dirty.txt is
pointed to by the comment on the clear_refs_soft_dirty() API.

Item b) was recently merged.

Item c) from Stephen has already been sent.

Thank you for your time and help,
Pavel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly
  2013-04-16 19:51 ` [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly Pavel Emelyanov
@ 2013-04-16 23:24   ` Stephen Rothwell
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen Rothwell @ 2013-04-16 23:24 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List


Hi Pavel,

On Tue, 16 Apr 2013 23:51:36 +0400 Pavel Emelyanov <xemul@parallels.com> wrote:
>
> As Stephen Rothwell pointed out, config options, that depend on
> architecture support, are better to be wrapped into a select +
> depends on scheme.
> 
> Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
> for X86.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>

Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
  2013-04-11 11:29 ` [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Pavel Emelyanov
  2013-04-11 21:19   ` Andrew Morton
@ 2013-05-02 17:08   ` Matt Helsley
  2013-05-04  9:47     ` Pavel Emelyanov
  1 sibling, 1 reply; 19+ messages in thread
From: Matt Helsley @ 2013-05-02 17:08 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List

On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.

This approach doesn't scale as well as it could. As best I can see
CRIU would do:

for each vma in /proc/<pid>/smaps
	for each page in /proc/<pid>/pagemap2
		if soft dirty bit
			copy page

(possibly with pfn checks to avoid copying the same page mapped in
multiple locations..)

However, if soft-dirty bit changes could be queued up (from, say, the
fault handler and the page table ops that map/unmap pages) and accumulated
in something like an interval tree, the walk could become:

for each range of changed pages
	for each page in range
		copy page

IOW something that scales with the number of changed pages rather
than the number of mapped pages.

So I wonder if CRIU would abandon pagemap2 in the future for something
like this.

Cheers,
	-Matt Helsley
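For illustration, the scaling difference described above can be sketched as
follows. The pg_range structure and both scan loops are hypothetical userspace
stand-ins; the kernel-side interval tree is not implemented here:

```c
#include <stddef.h>

/* Hypothetical: one run of consecutively dirtied pages. */
struct pg_range {
	unsigned long start;	/* first page index */
	unsigned long end;	/* one past the last page index */
};

/* O(mapped pages): test every pagemap2 entry for the soft-dirty bit. */
static unsigned long scan_all(const unsigned long long *map, size_t npages)
{
	unsigned long copied = 0;

	for (size_t i = 0; i < npages; i++)
		if (map[i] & (1ULL << 55))	/* soft-dirty bit */
			copied++;
	return copied;
}

/* O(changed pages): walk only the queued-up dirty ranges. */
static unsigned long scan_ranges(const struct pg_range *r, size_t nranges)
{
	unsigned long copied = 0;

	for (size_t i = 0; i < nranges; i++)
		copied += r[i].end - r[i].start;
	return copied;
}
```

Both loops visit the same changed pages, but the second does work proportional
to the number of changes rather than the size of the address space.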


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file
  2013-05-02 17:08   ` Matt Helsley
@ 2013-05-04  9:47     ` Pavel Emelyanov
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2013-05-04  9:47 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List

On 05/02/2013 09:08 PM, Matt Helsley wrote:
> On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
> 
> This approach doesn't scale as well as it could. As best I can see
> CRIU would do:
> 
> for each vma in /proc/<pid>/smaps
> 	for each page in /proc/<pid>/pagemap2
> 		if soft dirty bit
> 			copy page
> 
> (possibly with pfn checks to avoid copying the same page mapped in
> multiple locations..)

Comparing pfns obtained from two subsequent pagemap reads doesn't help at all.
If they are equal, this can mean either that the page is shared or (less
likely, but still possible) that the page that used to be at the 1st address
was reclaimed and mapped at the 2nd between the two reads. If they differ, it
can again mean either not-shared (most likely) or shared (the pfns were equal,
but the page got reclaimed and swapped back in).

A better API for detecting page sharing would be nice; such an API could
probably also be re-used for user-space KSM :)

> However, if soft dirty bit changes could be queued up (from say the
> fault handler and page table ops that map/unmap pages) and accumulated
> in something like an interval tree it could be something like:
> 
> for each range of changed pages
> 	for each page in range
> 		copy page
> 
> IOW something that scales with the number of changed pages rather
> than the number of mapped pages.
> 
> So I wonder if CRIU would abandon pagemap2 in the future for something
> like this.

We'd surely adopt such an API if one exists. One thing to note about it is that
we'd also appreciate it if this API could batch the "present" bits as well
as the "swapped" and "page-file" ones. We use these three in CRIU as well, and
scanning for these bits can also be optimized.

> Cheers,
> 	-Matt Helsley
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 19+ messages in thread

Thread overview: 19+ messages
2013-04-11 11:28 [PATCH 0/5] mm: Ability to monitor task memory changes (v3) Pavel Emelyanov
2013-04-11 11:28 ` [PATCH 1/5] clear_refs: Sanitize accepted commands declaration Pavel Emelyanov
2013-04-11 21:17   ` Andrew Morton
2013-04-11 11:29 ` [PATCH 2/5] clear_refs: Introduce private struct for mm_walk Pavel Emelyanov
2013-04-11 11:29 ` [PATCH 3/5] pagemap: Introduce pagemap_entry_t without pmshift bits Pavel Emelyanov
2013-04-11 11:29 ` [PATCH 4/5] pagemap: Introduce the /proc/PID/pagemap2 file Pavel Emelyanov
2013-04-11 21:19   ` Andrew Morton
2013-04-12 13:10     ` Pavel Emelyanov
2013-05-02 17:08   ` Matt Helsley
2013-05-04  9:47     ` Pavel Emelyanov
2013-04-11 11:30 ` [PATCH 5/5] mm: Soft-dirty bits for user memory changes tracking Pavel Emelyanov
2013-04-11 21:24   ` Andrew Morton
2013-04-12 13:14     ` Pavel Emelyanov
2013-04-15 21:46       ` Andrew Morton
2013-04-15 23:57         ` Stephen Rothwell
2013-04-16 19:58         ` Pavel Emelyanov
2013-04-12 15:53   ` [PATCH 6/5] selftest: Add simple test for soft-dirty bit Pavel Emelyanov
2013-04-16 19:51 ` [PATCH 7/5] mem-soft-dirty: Reshuffle CONFIG_ options to be more Arch-friendly Pavel Emelyanov
2013-04-16 23:24   ` Stephen Rothwell
