linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
@ 2022-08-26  6:45 Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 1/4] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-26  6:45 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: Muhammad Usama Anjum, kernel, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg, Greg KH


Hello,

This patch series implements a new ioctl on the pagemap proc fs file to
get, clear and perform both get and clear at the same time atomically on
the specified range of the memory.

Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of the
process can be cleared by writing to the clear_refs file. This series
adds features that weren't present earlier.
- There is no atomic get soft-dirty PTE bit status and clear operation
  present.
- The soft-dirty PTE bit of only a part of memory cannot be cleared.

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The proc fs interface is enough for that as I think the process
is frozen. We have the use case where we need to track the soft-dirty
PTE bit for the running processes. We need this tracking and clear
mechanism of a region of memory while the process is running to emulate
the getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages and keep processing only the dirty pages. This
new ioctl can be used by the CRIU project and other applications which
require soft-dirty PTE bit information.

As in the current kernel there is no way to clear a part of memory (instead
of clearing the Soft-Dirty bits for the entire process) and get+clear
operation cannot be performed atomically, there are other methods to mimic
this information entirely in userspace with poor performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
Some benchmarks can be seen [1].

This ioctl can be used by the CRIU project and other applications which
require soft-dirty PTE bit information. The following operations are
supported in this ioctl:
- Get the pages that are soft-dirty.
- Clear the pages which are soft-dirty.
- The optional flag to ignore the VM_SOFTDIRTY and only track per page
soft-dirty PTE bit

There are two decisions which have been taken about how to get the output
from the syscall.
- Return offsets of the pages from the start in the vec
- Stop execution when vec is filled with dirty pages
These two arguments doesn't follow the mincore() philosophy where the
output array corresponds to the address range in one to one fashion, hence
the output buffer length isn't passed and only a flag is set if the page
is present. This makes mincore() easy to use with less control. We are
passing the size of the output array and putting return data consecutively
which is offset of dirty pages from the start. The user can convert these
offsets back into the dirty page addresses easily. Suppose, the user want
to get first 10 dirty pages from a total memory of 100 pages. He'll
allocate output buffer of size 10 and the ioctl will abort after finding the
10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
behaviour like mincore() can be achieved by passing output buffer of 100
size. This interface can be used for any desired behaviour.

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (4):
  fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
  fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
  selftests: vm: add pagemap ioctl tests
  mm: add documentation of the new ioctl on pagemap

 Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
 fs/proc/task_mmu.c                          | 342 ++++++++++-
 include/uapi/linux/fs.h                     |  23 +
 tools/include/uapi/linux/fs.h               |  23 +
 tools/testing/selftests/vm/.gitignore       |   1 +
 tools/testing/selftests/vm/Makefile         |   2 +
 tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
 7 files changed, 1050 insertions(+), 32 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

-- 
2.30.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/4] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
@ 2022-08-26  6:45 ` Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 2/4] fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty " Muhammad Usama Anjum
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-26  6:45 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: Muhammad Usama Anjum, kernel, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg, Greg KH

Update the clear_soft_dirty() and clear_soft_dirty_pmd() to optionally
clear and return the status if page is dirty.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v2:
- Move back the functions back to their original file
---
 fs/proc/task_mmu.c | 82 ++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8b4f3073f8f5..f66674033207 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1095,8 +1095,8 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 	return page_maybe_dma_pinned(page);
 }
 
-static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+static inline bool check_soft_dirty(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte, bool clear)
 {
 	/*
 	 * The soft-dirty tracker uses #PF-s to catch writes
@@ -1105,55 +1105,75 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 	 * of how soft-dirty works.
 	 */
 	pte_t ptent = *pte;
+	int dirty = 0;
 
 	if (pte_present(ptent)) {
 		pte_t old_pte;
 
-		if (pte_is_pinned(vma, addr, ptent))
-			return;
-		old_pte = ptep_modify_prot_start(vma, addr, pte);
-		ptent = pte_wrprotect(old_pte);
-		ptent = pte_clear_soft_dirty(ptent);
-		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+		dirty = pte_soft_dirty(ptent);
+
+		if (dirty && clear && !pte_is_pinned(vma, addr, ptent)) {
+			old_pte = ptep_modify_prot_start(vma, addr, pte);
+			ptent = pte_wrprotect(old_pte);
+			ptent = pte_clear_soft_dirty(ptent);
+			ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+		}
 	} else if (is_swap_pte(ptent)) {
-		ptent = pte_swp_clear_soft_dirty(ptent);
-		set_pte_at(vma->vm_mm, addr, pte, ptent);
+		dirty = pte_swp_soft_dirty(ptent);
+
+		if (dirty && clear) {
+			ptent = pte_swp_clear_soft_dirty(ptent);
+			set_pte_at(vma->vm_mm, addr, pte, ptent);
+		}
 	}
+
+	return !!dirty;
 }
 #else
-static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+static inline bool check_soft_dirty(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte, bool clear)
 {
+	return false;
 }
 #endif
 
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+static inline bool check_soft_dirty_pmd(struct vm_area_struct *vma,
+					unsigned long addr, pmd_t *pmdp, bool clear)
 {
 	pmd_t old, pmd = *pmdp;
+	int dirty = 0;
 
 	if (pmd_present(pmd)) {
-		/* See comment in change_huge_pmd() */
-		old = pmdp_invalidate(vma, addr, pmdp);
-		if (pmd_dirty(old))
-			pmd = pmd_mkdirty(pmd);
-		if (pmd_young(old))
-			pmd = pmd_mkyoung(pmd);
-
-		pmd = pmd_wrprotect(pmd);
-		pmd = pmd_clear_soft_dirty(pmd);
-
-		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		dirty = pmd_soft_dirty(pmd);
+		if (dirty && clear) {
+			/* See comment in change_huge_pmd() */
+			old = pmdp_invalidate(vma, addr, pmdp);
+			if (pmd_dirty(old))
+				pmd = pmd_mkdirty(pmd);
+			if (pmd_young(old))
+				pmd = pmd_mkyoung(pmd);
+
+			pmd = pmd_wrprotect(pmd);
+			pmd = pmd_clear_soft_dirty(pmd);
+
+			set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		}
 	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
-		pmd = pmd_swp_clear_soft_dirty(pmd);
-		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		dirty = pmd_swp_soft_dirty(pmd);
+
+		if (dirty && clear) {
+			pmd = pmd_swp_clear_soft_dirty(pmd);
+			set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		}
 	}
+	return !!dirty;
 }
 #else
-static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+static inline bool check_soft_dirty_pmd(struct vm_area_struct *vma,
+					unsigned long addr, pmd_t *pmdp, bool clear)
 {
+	return false;
 }
 #endif
 
@@ -1169,7 +1189,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty_pmd(vma, addr, pmd);
+			check_soft_dirty_pmd(vma, addr, pmd, true);
 			goto out;
 		}
 
@@ -1195,7 +1215,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = *pte;
 
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty(vma, addr, pte);
+			check_soft_dirty(vma, addr, pte, true);
 			continue;
 		}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 2/4] fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 1/4] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
@ 2022-08-26  6:45 ` Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 3/4] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-26  6:45 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: Muhammad Usama Anjum, kernel, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg, Greg KH

This ioctl can be used to watch the process's memory and perform
atomic operations which aren't possible through procfs. Three
operations have been implemented:

- PAGEMAP_SD_GET gets the soft dirty pages in a address range.
- PAGEMAP_SD_CLEAR clears the soft dirty bit from dirty pages in a
  address range.
- PAGEMAP_SD_GET_AND_CLEAR gets and clears the soft dirty bit in a
  address range.

struct pagemap_sd_args is used as the argument of the IOCTL. In this
struct:
- The range is specified through start and len.
- The output buffer and size is specified as vec and vec_len.
- The flags can be specified in the flags field. Currently only one
  PAGEMAP_SD_NO_REUSED_REGIONS is supported which can be specified to
  ignore the VMA dirty flags.

This is based on a patch from Gabriel Krisman Bertazi.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c            | 260 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h       |  23 +++
 tools/include/uapi/linux/fs.h |  23 +++
 3 files changed, 306 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f66674033207..33d3d5c2ab40 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,8 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <uapi/linux/fs.h>
+#include <linux/vmalloc.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1775,11 +1777,269 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define IS_CLEAR_SD_OP(op) (op == PAGEMAP_SD_CLEAR || op == PAGEMAP_SD_GET_AND_CLEAR)
+#define IS_GET_SD_OP(op) (op == PAGEMAP_SD_GET || op == PAGEMAP_SD_GET_AND_CLEAR)
+#define PAGEMAP_SD_FLAGS_MASK (PAGEMAP_SD_NO_REUSED_REGIONS)
+
+struct pagemap_sd_private {
+	unsigned long start;
+	__u64 *vec;
+	unsigned long vec_len;
+	unsigned long index;
+	unsigned int op;
+	unsigned int flags;
+};
+
+static int pagemap_sd_pmd_entry(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_sd_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long start = addr;
+	spinlock_t *ptl;
+	pte_t *pte;
+	int dirty;
+	bool dirty_vma = (p->flags & PAGEMAP_SD_NO_REUSED_REGIONS) ? 0 :
+			 (vma->vm_flags & VM_SOFTDIRTY);
+
+	end = min(end, walk->vma->vm_end);
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
+			/*
+			 * Break huge page into small pages if operation needs to be performed is
+			 * on a portion of the huge page or the return buffer cannot store complete
+			 * data. Then process this PMD as having normal pages.
+			 */
+			if ((IS_CLEAR_SD_OP(p->op) && (end - addr < HPAGE_SIZE)) ||
+			    (IS_GET_SD_OP(p->op) && (p->index + HPAGE_SIZE/PAGE_SIZE > p->vec_len))) {
+				spin_unlock(ptl);
+				split_huge_pmd(vma, pmd, addr);
+				goto process_smaller_pages;
+			} else {
+				dirty = check_soft_dirty_pmd(vma, addr, pmd, IS_CLEAR_SD_OP(p->op));
+				if (IS_GET_SD_OP(p->op) && (dirty_vma || dirty)) {
+					for (; addr != end && p->index < p->vec_len;
+					     addr += PAGE_SIZE)
+						p->vec[p->index++] = addr - p->start;
+				}
+			}
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		dirty = check_soft_dirty(vma, addr, pte, IS_CLEAR_SD_OP(p->op));
+
+		if (IS_GET_SD_OP(p->op) && (dirty_vma || dirty)) {
+			p->vec[p->index++] = addr - p->start;
+			WARN_ON(p->index > p->vec_len);
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	if (IS_CLEAR_SD_OP(p->op))
+		flush_tlb_mm_range(vma->vm_mm, start, end, PAGE_SHIFT, false);
+
+	return 0;
+}
+
+static int pagemap_sd_pte_hole(unsigned long addr, unsigned long end, int depth,
+			       struct mm_walk *walk)
+{
+	struct pagemap_sd_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (p->flags & PAGEMAP_SD_NO_REUSED_REGIONS)
+		return 0;
+
+	if (vma && (vma->vm_flags & VM_SOFTDIRTY) && IS_GET_SD_OP(p->op)) {
+		for (; addr != end && p->index < p->vec_len; addr += PAGE_SIZE)
+			p->vec[p->index++] = addr - p->start;
+	}
+
+	return 0;
+}
+
+static int pagemap_sd_pre_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_sd_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret;
+	unsigned long end_cut = end;
+
+	if (p->flags & PAGEMAP_SD_NO_REUSED_REGIONS)
+		return 0;
+
+	if (IS_CLEAR_SD_OP(p->op) && (vma->vm_flags & VM_SOFTDIRTY)) {
+		if (vma->vm_start < start) {
+			ret = split_vma(vma->vm_mm, vma, start, 1);
+			if (ret)
+				return ret;
+		}
+
+		if (IS_GET_SD_OP(p->op))
+			end_cut = min(start + p->vec_len * PAGE_SIZE, end);
+
+		if (vma->vm_end > end_cut) {
+			ret = split_vma(vma->vm_mm, vma, end_cut, 0);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void pagemap_sd_post_vma(struct mm_walk *walk)
+{
+	struct pagemap_sd_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (p->flags & PAGEMAP_SD_NO_REUSED_REGIONS)
+		return;
+
+	if (IS_CLEAR_SD_OP(p->op) && (vma->vm_flags & VM_SOFTDIRTY)) {
+		vma->vm_flags &= ~VM_SOFTDIRTY;
+		vma_set_page_prot(vma);
+	}
+}
+
+static int pagemap_sd_pmd_test_walk(unsigned long start, unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct pagemap_sd_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (IS_GET_SD_OP(p->op) && (p->index == p->vec_len))
+		return -1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static const struct mm_walk_ops pagemap_sd_ops = {
+	.test_walk = pagemap_sd_pmd_test_walk,
+	.pre_vma = pagemap_sd_pre_vma,
+	.pmd_entry = pagemap_sd_pmd_entry,
+	.pte_hole = pagemap_sd_pte_hole,
+	.post_vma = pagemap_sd_post_vma,
+};
+
+static long do_pagemap_sd_cmd(struct mm_struct *mm, unsigned int cmd, struct pagemap_sd_args *arg)
+{
+	struct pagemap_sd_private sd_data;
+	struct mmu_notifier_range range;
+	unsigned long start, end;
+	int ret;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+		return -EINVAL;
+
+	if (IS_GET_SD_OP(cmd) &&
+	    ((arg->vec_len == 0) || (!arg->vec) || (!access_ok((loff_t *)arg->vec, arg->vec_len))))
+		return -EINVAL;
+
+	if ((arg->flags & ~PAGEMAP_SD_FLAGS_MASK) || (arg->__reserved))
+		return -EINVAL;
+
+	end = start + arg->len;
+	sd_data.start = start;
+	sd_data.op = cmd;
+	sd_data.flags = arg->flags;
+	sd_data.index = 0;
+	sd_data.vec_len = arg->vec_len;
+
+	if (IS_GET_SD_OP(cmd)) {
+		sd_data.vec = vzalloc(arg->vec_len * sizeof(loff_t));
+		if (!sd_data.vec)
+			return -ENOMEM;
+	}
+
+	if (IS_CLEAR_SD_OP(cmd)) {
+		mmap_write_lock(mm);
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, 0, NULL,
+					mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+		inc_tlb_flush_pending(mm);
+	} else {
+		mmap_read_lock(mm);
+	}
+
+	ret = walk_page_range(mm, start, end, &pagemap_sd_ops, &sd_data);
+
+	if (IS_CLEAR_SD_OP(cmd)) {
+		mmu_notifier_invalidate_range_end(&range);
+		dec_tlb_flush_pending(mm);
+
+		mmap_write_unlock(mm);
+	} else {
+		mmap_read_unlock(mm);
+	}
+
+	if (ret < 0)
+		goto free_sd_data;
+
+	if (IS_GET_SD_OP(cmd)) {
+		ret = copy_to_user((loff_t *)arg->vec, sd_data.vec, sd_data.index * sizeof(loff_t));
+		if (ret) {
+			ret = -EIO;
+			goto free_sd_data;
+		}
+		ret = sd_data.index;
+	} else {
+		ret = 0;
+	}
+
+free_sd_data:
+	if (IS_GET_SD_OP(cmd))
+		vfree(sd_data.vec);
+
+	return ret;
+}
+
+static long pagemap_sd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct pagemap_sd_args __user *uarg = (struct pagemap_sd_args __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pagemap_sd_args arguments;
+
+	switch (cmd) {
+	case PAGEMAP_SD_GET:
+		fallthrough;
+	case PAGEMAP_SD_CLEAR:
+		fallthrough;
+	case PAGEMAP_SD_GET_AND_CLEAR:
+		if (copy_from_user(&arguments, uarg, sizeof(struct pagemap_sd_args)))
+			return -EFAULT;
+		return do_pagemap_sd_cmd(mm, cmd, &arguments);
+	default:
+		return -EINVAL;
+	}
+}
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	.unlocked_ioctl = pagemap_sd_ioctl,
+	.compat_ioctl = pagemap_sd_ioctl,
+#endif /* CONFIG_MEM_SOFT_DIRTY */
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..4f6d1c0ae524 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,27 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/**
+ * struct pagemap_sd_args - Soft-dirty IOCTL argument
+ * @start:	Starting address
+ * @len:	Length of the region
+ * @vec:	Output buffer address
+ * @vec_len:	Length of the output buffer
+ * @flags:	Special flags for the IOCTL
+ * @__reserved:	Reserved member to preserve data alignment. Should be 0.
+ */
+struct pagemap_sd_args {
+	__u64 __user start;
+	__u64 len;
+	__u64 __user vec;
+	__u64 vec_len;
+	__u32 flags;
+	__u32 __reserved;
+};
+
+#define PAGEMAP_SD_GET			_IOWR('f', 16, struct pagemap_sd_args)
+#define PAGEMAP_SD_CLEAR		_IOWR('f', 17, struct pagemap_sd_args)
+#define PAGEMAP_SD_GET_AND_CLEAR	_IOWR('f', 18, struct pagemap_sd_args)
+#define PAGEMAP_SD_NO_REUSED_REGIONS	0x1
+
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index b7b56871029c..4f6d1c0ae524 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -305,4 +305,27 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/**
+ * struct pagemap_sd_args - Soft-dirty IOCTL argument
+ * @start:	Starting address
+ * @len:	Length of the region
+ * @vec:	Output buffer address
+ * @vec_len:	Length of the output buffer
+ * @flags:	Special flags for the IOCTL
+ * @__reserved:	Reserved member to preserve data alignment. Should be 0.
+ */
+struct pagemap_sd_args {
+	__u64 __user start;
+	__u64 len;
+	__u64 __user vec;
+	__u64 vec_len;
+	__u32 flags;
+	__u32 __reserved;
+};
+
+#define PAGEMAP_SD_GET			_IOWR('f', 16, struct pagemap_sd_args)
+#define PAGEMAP_SD_CLEAR		_IOWR('f', 17, struct pagemap_sd_args)
+#define PAGEMAP_SD_GET_AND_CLEAR	_IOWR('f', 18, struct pagemap_sd_args)
+#define PAGEMAP_SD_NO_REUSED_REGIONS	0x1
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 3/4] selftests: vm: add pagemap ioctl tests
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 1/4] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 2/4] fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty " Muhammad Usama Anjum
@ 2022-08-26  6:45 ` Muhammad Usama Anjum
  2022-08-26  6:45 ` [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap Muhammad Usama Anjum
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-26  6:45 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: Muhammad Usama Anjum, kernel, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg, Greg KH

Add pagemap ioctl tests. Add several different types of tests to judge
the correction of the interface.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
TAP version 13
1..44
ok 1 sanity_tests no cmd specified
ok 2 sanity_tests wrong flag specified
ok 3 sanity_tests reserved field specified specified
ok 4 sanity_tests wrong cmd specified
ok 5 sanity_tests mixture of correct and wrong cmds
ok 6 sanity_tests Clear area with larger vec size 1 22
ok 7 Page testing: all new pages must be soft dirty
ok 8 Page testing: all pages must not be soft dirty
ok 9 Page testing: all pages dirty other than first and the last one
ok 10 Page testing: only middle page dirty
ok 11 Page testing: only two middle pages dirty
ok 12 Page testing: only get 2 dirty pages and clear them as well
ok 13 Page testing: Range clear only
ok 14 Large Page testing: all new pages must be soft dirty
ok 15 Large Page testing: all pages must not be soft dirty
ok 16 Large Page testing: all pages dirty other than first and the last one
ok 17 Large Page testing: only middle page dirty
ok 18 Large Page testing: only two middle pages dirty
ok 19 Large Page testing: only get 2 dirty pages and clear them as well
ok 20 Large Page testing: Range clear only
ok 21 Huge page testing: all new pages must be soft dirty
ok 22 Huge page testing: all pages must not be soft dirty
ok 23 Huge page testing: all pages dirty other than first and the last one
ok 24 Huge page testing: only middle page dirty
ok 25 Huge page testing: only two middle pages dirty
ok 26 Huge page testing: only get 2 dirty pages and clear them as well
ok 27 Huge page testing: Range clear only
ok 28 Performance Page testing: page isn't dirty
ok 29 Performance Page testing: all pages must not be soft dirty
ok 30 Performance Page testing: all pages dirty other than first and the last one
ok 31 Performance Page testing: only middle page dirty
ok 32 Performance Page testing: only two middle pages dirty
ok 33 Performance Page testing: only get 2 dirty pages and clear them as well
ok 34 Performance Page testing: Range clear only
ok 35 hpage_unit_tests all new huge page must be dirty
ok 36 hpage_unit_tests all the huge page must not be dirty
ok 37 hpage_unit_tests all the huge page must be dirty and clear
ok 38 hpage_unit_tests only middle page dirty
ok 39 hpage_unit_tests clear first half of huge page
ok 40 hpage_unit_tests clear first half of huge page with limited buffer
ok 41 hpage_unit_tests clear second half huge page
ok 42 unmapped_region_tests Get dirty pages
ok 43 unmapped_region_tests Get dirty pages
ok 44 Test test_simple
 # Totals: pass:44 fail:0 xfail:0 xpass:0 skip:0 error:0

Changes in v3:
- Add another test to do sanity of flags

Changes in v2:
- Update the tests to use the ioctl interface instead of syscall
---
 tools/testing/selftests/vm/.gitignore      |   1 +
 tools/testing/selftests/vm/Makefile        |   2 +
 tools/testing/selftests/vm/pagemap_ioctl.c | 649 +++++++++++++++++++++
 3 files changed, 652 insertions(+)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31e5eea2a9b9..334fde556499 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -16,6 +16,7 @@ mremap_dontunmap
 mremap_test
 on-fault-limit
 transhuge-stress
+pagemap_ioctl
 protection_keys
 protection_keys_32
 protection_keys_64
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 266e965e724c..4296c3268f64 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -51,6 +51,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
+TEST_GEN_PROGS += pagemap_ioctl
 TEST_GEN_PROGS += soft-dirty
 TEST_GEN_PROGS += split_huge_page_test
 TEST_GEN_FILES += ksm_tests
@@ -98,6 +99,7 @@ TEST_FILES += va_128TBswitch.sh
 include ../lib.mk
 
 $(OUTPUT)/madv_populate: vm_util.c
+$(OUTPUT)/pagemap_ioctl: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
 
diff --git a/tools/testing/selftests/vm/pagemap_ioctl.c b/tools/testing/selftests/vm/pagemap_ioctl.c
new file mode 100644
index 000000000000..7775247b9cdc
--- /dev/null
+++ b/tools/testing/selftests/vm/pagemap_ioctl.c
@@ -0,0 +1,649 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <malloc.h>
+#include <asm-generic/unistd.h>
+#include "vm_util.h"
+#include "../kselftest.h"
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
+
+#define TEST_ITERATIONS 10000
+#define PAGEMAP "/proc/self/pagemap"
+int pagemap_fd;
+
+static long pagemap_ioctl(void *start, int len, unsigned int cmd, loff_t *vec,
+			  int vec_len, int flag)
+{
+	struct pagemap_sd_args arg;
+	int ret;
+
+	arg.start = (uintptr_t)start;
+	arg.len = len;
+	arg.vec = (uintptr_t)vec;
+	arg.vec_len = vec_len;
+	arg.flags = flag;
+	arg.__reserved = 0;
+
+	ret = ioctl(pagemap_fd, cmd, &arg);
+
+	return ret;
+}
+
+static long pagemap_ioctl_res(void *start, int len, unsigned int cmd, loff_t *vec,
+			      int vec_len, int flag, int res)
+{
+	struct pagemap_sd_args arg;
+	int ret;
+
+	arg.start = (uintptr_t)start;
+	arg.len = len;
+	arg.vec = (uintptr_t)vec;
+	arg.vec_len = vec_len;
+	arg.flags = flag;
+	arg.__reserved = res;
+
+	ret = ioctl(pagemap_fd, cmd, &arg);
+
+	return ret;
+}
+
+int sanity_tests(int page_size)
+{
+	char *mem;
+	int mem_size, vec_size, ret;
+	loff_t *vec;
+
+	/* 1. wrong operation */
+	vec_size = 100;
+	mem_size = page_size;
+
+	vec = malloc(sizeof(loff_t) * vec_size);
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem || !vec)
+		ksft_exit_fail_msg("error nomem\n");
+
+	ksft_test_result(pagemap_ioctl(mem, mem_size, 0, vec, vec_size, 0) < 0,
+			 "%s no cmd specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size, 8)
+			 < 0, "%s wrong flag specified\n", __func__);
+	ksft_test_result(pagemap_ioctl_res(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size, 8, 10)
+			 < 0, "%s reserved field specified specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, 0x01000000, vec, vec_size, 0) < 0,
+			 "%s wrong cmd specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET | 0xFF,
+			 vec, vec_size, 0) < 0,
+			 "%s mixture of correct and wrong cmds\n", __func__);
+
+	/* 2. Clear area with larger vec size */
+	ret = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR,
+			    vec, vec_size, 0);
+	ksft_test_result(ret >= 0, "%s Clear area with larger vec size %d %d\n", __func__, ret, errno);
+
+	free(vec);
+	munmap(mem, mem_size);
+	return 0;
+}
+
+void *gethugepage(int map_size)
+{
+	int ret;
+	char *map;
+	size_t hpage_len = read_pmd_pagesize();
+
+	map = memalign(hpage_len, map_size);
+	if (!map)
+		ksft_exit_fail_msg("memalign failed %d %s\n", errno, strerror(errno));
+
+	ret = madvise(map, map_size, MADV_HUGEPAGE);
+	if (ret)
+		ksft_exit_fail_msg("madvise failed %d %d %s\n", ret, errno, strerror(errno));
+
+	memset(map, 0, map_size);
+
+	if (check_huge(map))
+		return map;
+
+	free(map);
+	return NULL;
+
+}
+
+int hpage_unit_tests(int page_size)
+{
+	char *map;
+	int i, ret;
+	size_t hpage_len = read_pmd_pagesize();
+	size_t num_pages = 1;
+	int map_size = hpage_len * num_pages;
+	int vec_size = map_size/page_size;
+	loff_t *vec, *vec2;
+
+	vec = malloc(sizeof(loff_t) * vec_size);
+	vec2 = malloc(sizeof(loff_t) * vec_size);
+	if (!vec || !vec2)
+		ksft_exit_fail_msg("malloc failed\n");
+
+	map = gethugepage(map_size);
+	if (map) {
+		// 1. all new huge page must be dirty
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size; i++)
+			if (vec[i] != i * page_size)
+				break;
+
+		ksft_test_result(i == vec_size, "%s all new huge page must be dirty\n", __func__);
+
+		// 2. all the huge page must not be dirty
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 0, "%s all the huge page must not be dirty\n", __func__);
+
+		// 3. all the huge page must be dirty and clear dirty as well
+		memset(map, -1, map_size);
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size; i++)
+			if (vec[i] != i * page_size)
+				break;
+
+		ksft_test_result(ret == vec_size && i == vec_size,
+				 "%s all the huge page must be dirty and clear\n", __func__);
+
+		// 4. only middle page dirty
+		free(map);
+		map = gethugepage(map_size);
+		clear_softdirty();
+		map[vec_size/2 * page_size]++;
+
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size; i++) {
+			if (vec[i] == vec_size/2 * page_size)
+				break;
+		}
+		ksft_test_result(vec[i] == vec_size/2 * page_size,
+				 "%s only middle page dirty\n", __func__);
+
+		free(map);
+	} else {
+		ksft_test_result_skip("all new huge page must be dirty\n");
+		ksft_test_result_skip("all the huge page must not be dirty\n");
+		ksft_test_result_skip("all the huge page must be dirty and clear\n");
+		ksft_test_result_skip("only middle page dirty\n");
+	}
+
+	// 5. clear first half of huge page
+	map = gethugepage(map_size);
+	if (map) {
+		ret = pagemap_ioctl(map, map_size/2, PAGEMAP_SD_CLEAR, NULL, 0, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size/2; i++)
+			if (vec[i] != (i + vec_size/2) * page_size)
+				break;
+
+		ksft_test_result(i == vec_size/2 && ret == vec_size/2,
+				 "%s clear first half of huge page\n", __func__);
+		free(map);
+	} else {
+		ksft_test_result_skip("clear first half of huge page\n");
+	}
+
+	// 6. clear first half of huge page with limited buffer
+	map = gethugepage(map_size);
+	if (map) {
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size/2, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size/2; i++)
+			if (vec[i] != (i + vec_size/2) * page_size)
+				break;
+
+		ksft_test_result(i == vec_size/2 && ret == vec_size/2,
+				 "%s clear first half of huge page with limited buffer\n",
+				 __func__);
+		free(map);
+	} else {
+		ksft_test_result_skip("clear first half of huge page with limited buffer\n");
+	}
+
+	// 7. clear second half of huge page
+	map = gethugepage(map_size);
+	if (map) {
+		memset(map, -1, map_size);
+		ret = pagemap_ioctl(map + map_size/2, map_size/2, PAGEMAP_SD_CLEAR, NULL, 0, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		for (i = 0; i < vec_size/2; i++)
+			if (vec[i] != i * page_size)
+				break;
+
+		ksft_test_result(i == vec_size/2, "%s clear second half huge page\n", __func__);
+		free(map);
+	} else {
+		ksft_test_result_skip("clear second half huge page\n");
+	}
+
+	free(vec);
+	free(vec2);
+	return 0;
+}
+
+int base_tests(char *prefix, char *mem, int mem_size, int page_size, int skip)
+{
+	int vec_size, i, j, ret, dirty_pages, dirty_pages2;
+	loff_t *vec, *vec2;
+
+	if (skip) {
+		ksft_test_result_skip("%s all new pages must be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages must not be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages dirty other than first and the last one\n",
+				      prefix);
+		ksft_test_result_skip("%s only middle page dirty\n", prefix);
+		ksft_test_result_skip("%s only two middle pages dirty\n", prefix);
+		ksft_test_result_skip("%s only get 2 dirty pages and clear them as well\n", prefix);
+		ksft_test_result_skip("%s Range clear only\n", prefix);
+		return 0;
+	}
+
+	vec_size = mem_size/page_size;
+	vec = malloc(sizeof(loff_t) * vec_size);
+	vec2 = malloc(sizeof(loff_t) * vec_size);
+
+	/* 1. all new pages must be soft dirty and clear the range for next test */
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR, vec, vec_size - 2, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	dirty_pages2 = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR, vec2, vec_size, 0);
+	if (dirty_pages2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages2, errno, strerror(errno));
+
+	for (i = 0; i < dirty_pages; i++)
+		if (vec[i] != i * page_size)
+			break;
+	for (j = 0; j < dirty_pages2; j++)
+		if (vec2[j] != (j + vec_size - 2) * page_size)
+			break;
+
+	ksft_test_result(dirty_pages == vec_size - 2 && i == dirty_pages &&
+			 dirty_pages2 == 2 && j == dirty_pages2,
+			 "%s all new pages must be soft dirty\n", prefix);
+
+	// 2. all pages must not be soft dirty
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0, "%s all pages must not be soft dirty\n", prefix);
+
+	// 3. all pages dirty other than first and the last one
+	memset(mem + page_size, -1, (mem_size - (2 * page_size)));
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages >= vec_size - 2 && dirty_pages <= vec_size,
+			 "%s all pages dirty other than first and the last one\n", prefix);
+
+	// 4. only middle page dirty
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	for (i = 0; i < vec_size; i++) {
+		if (vec[i] == vec_size/2 * page_size)
+			break;
+	}
+	ksft_test_result(vec[i] == vec_size/2 * page_size,
+			 "%s only middle page dirty\n", prefix);
+
+	// 5. only two middle pages dirty and walk over only middle pages
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+	mem[(vec_size/2 + 1) * page_size]++;
+
+	dirty_pages = pagemap_ioctl(&mem[vec_size/2 * page_size], 2 * page_size,
+				    PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 2 && vec[0] == 0 && vec[1] == page_size,
+			 "%s only two middle pages dirty\n", prefix);
+
+	/* 6. only get 2 dirty pages and clear them as well */
+	memset(mem, -1, mem_size);
+
+	/* get and clear second and third pages */
+	ret = pagemap_ioctl(mem + page_size, 2 * page_size, PAGEMAP_SD_GET_AND_CLEAR, vec, 2, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET, vec2, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	for (i = 0; i < vec_size - 2; i++) {
+		if (i == 0 && (vec[i] != 0 || vec2[i] != 0))
+			break;
+		else if (i == 1 && (vec[i] != page_size || vec2[i] != (i + 2) * page_size))
+			break;
+		else if (i > 1 && (vec2[i] != (i + 2) * page_size))
+			break;
+	}
+
+	ksft_test_result(dirty_pages == vec_size - 2 && i == vec_size - 2,
+			 "%s only get 2 dirty pages and clear them as well\n", prefix);
+	/* 7. Range clear only */
+	memset(mem, -1, mem_size);
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_CLEAR, NULL, 0, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	dirty_pages2 = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages2, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0 && dirty_pages2 == 0, "%s Range clear only\n",
+			 prefix);
+
+	free(vec);
+	free(vec2);
+	return 0;
+}
+
+int performance_base_tests(char *prefix, char *mem, int mem_size, int page_size, int skip)
+{
+	int vec_size, i, ret, dirty_pages, dirty_pages2;
+	loff_t *vec, *vec2;
+
+	if (skip) {
+		ksft_test_result_skip("%s all new pages must be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages must not be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages dirty other than first and the last one\n",
+				      prefix);
+		ksft_test_result_skip("%s only middle page dirty\n", prefix);
+		ksft_test_result_skip("%s only two middle pages dirty\n", prefix);
+		ksft_test_result_skip("%s only get 2 dirty pages and clear them as well\n", prefix);
+		ksft_test_result_skip("%s Range clear only\n", prefix);
+		return 0;
+	}
+
+	vec_size = mem_size/page_size;
+	vec = malloc(sizeof(loff_t) * vec_size);
+	vec2 = malloc(sizeof(loff_t) * vec_size);
+
+	/* 1. all new pages must be soft dirty and clear the range for next test */
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR,
+				    vec, vec_size - 2, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	dirty_pages2 = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET_AND_CLEAR,
+				     vec2, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages2, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0 && dirty_pages2 == 0,
+			 "%s page isn't dirty\n", prefix);
+
+	// 2. all pages must not be soft dirty
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET,
+				    vec, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0, "%s all pages must not be soft dirty\n", prefix);
+
+	// 3. all pages dirty other than first and the last one
+	memset(mem + page_size, -1, (mem_size - 2 * page_size));
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET,
+				    vec, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	for (i = 0; i < dirty_pages; i++) {
+		if (vec[i] != (i + 1) * page_size)
+			break;
+	}
+
+	ksft_test_result(dirty_pages == vec_size - 2 && i == vec_size - 2,
+			 "%s all pages dirty other than first and the last one\n", prefix);
+
+	// 4. only middle page dirty
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET,
+				    vec, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	for (i = 0; i < vec_size; i++) {
+		if (vec[i] == vec_size/2 * page_size)
+			break;
+	}
+	ksft_test_result(vec[i] == vec_size/2 * page_size,
+			 "%s only middle page dirty\n", prefix);
+
+	// 5. only two middle pages dirty and walk over only middle pages
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+	mem[(vec_size/2 + 1) * page_size]++;
+
+	dirty_pages = pagemap_ioctl(&mem[vec_size/2 * page_size], 2 * page_size,
+				    PAGEMAP_SD_GET, vec, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 2 && vec[0] == 0 && vec[1] == page_size,
+			 "%s only two middle pages dirty\n", prefix);
+
+	/* 6. only get 2 dirty pages and clear them as well */
+	memset(mem, -1, mem_size);
+
+	/* get and clear second and third pages */
+	ret = pagemap_ioctl(mem + page_size, 2 * page_size, PAGEMAP_SD_GET_AND_CLEAR,
+			    vec, 2, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET,
+				    vec2, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	for (i = 0; i < vec_size - 2; i++) {
+		if (i == 0 && (vec[i] != 0 || vec2[i] != 0))
+			break;
+		else if (i == 1 && (vec[i] != page_size || vec2[i] != (i + 2) * page_size))
+			break;
+		else if (i > 1 && (vec2[i] != (i + 2) * page_size))
+			break;
+	}
+
+	ksft_test_result(dirty_pages == vec_size - 2 && i == vec_size - 2,
+			 "%s only get 2 dirty pages and clear them as well\n", prefix);
+
+	/* 7. Range clear only */
+	memset(mem, -1, mem_size);
+	dirty_pages = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_CLEAR,
+				    NULL, 0, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	dirty_pages2 = pagemap_ioctl(mem, mem_size, PAGEMAP_SD_GET,
+				     vec, vec_size, PAGEMAP_SD_NO_REUSED_REGIONS);
+	if (dirty_pages2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages2, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0 && dirty_pages2 == 0, "%s Range clear only\n",
+			 prefix);
+
+	free(vec);
+	free(vec2);
+	return 0;
+}
+
+int unmapped_region_tests(int page_size)
+{
+	void *start = (void *)0x10000000;
+	int dirty_pages, len = 0x00040000;
+	int vec_size = len / page_size;
+	loff_t *vec = malloc(sizeof(loff_t) * vec_size);
+
+	/* 1. Get dirty pages */
+	dirty_pages = pagemap_ioctl(start, len, PAGEMAP_SD_GET, vec, vec_size, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages >= 0, "%s Get dirty pages\n", __func__);
+
+	/* 2. Clear dirty bit of whole address space */
+	dirty_pages = pagemap_ioctl(0, 0x7FFFFFFF, PAGEMAP_SD_CLEAR, NULL, 0, 0);
+	if (dirty_pages < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty_pages, errno, strerror(errno));
+
+	ksft_test_result(dirty_pages == 0, "%s Get dirty pages\n", __func__);
+
+	free(vec);
+	return 0;
+}
+
+static void test_simple(int page_size)
+{
+	int i;
+	char *map;
+	loff_t *vec = NULL;
+
+	map = aligned_alloc(page_size, page_size);
+	if (!map)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	clear_softdirty();
+
+	for (i = 0 ; i < TEST_ITERATIONS; i++) {
+		if (pagemap_ioctl(map, page_size, PAGEMAP_SD_GET, vec, 1, 0) == 1) {
+			ksft_print_msg("dirty bit was 1, but should be 0 (i=%d)\n", i);
+			break;
+		}
+
+		clear_softdirty();
+		// Write something to the page to get the dirty bit enabled on the page
+		map[0]++;
+
+		if (pagemap_ioctl(map, page_size, PAGEMAP_SD_GET, vec, 1, 0) == 0) {
+			ksft_print_msg("dirty bit was 0, but should be 1 (i=%d)\n", i);
+			break;
+		}
+
+		clear_softdirty();
+	}
+	free(map);
+
+	ksft_test_result(i == TEST_ITERATIONS, "Test %s\n", __func__);
+}
+
+int main(int argc, char **argv)
+{
+	int page_size = getpagesize();
+	size_t hpage_len = read_pmd_pagesize();
+	char *mem, *map;
+	int mem_size;
+
+	ksft_print_header();
+	ksft_set_plan(44);
+
+	pagemap_fd = open(PAGEMAP, O_RDWR);
+	if (pagemap_fd < 0)
+		return -EINVAL;
+
+	/* 1. Sanity testing */
+	sanity_tests(page_size);
+
+	/* 2. Normal page testing */
+	mem_size = 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+
+	base_tests("Page testing:", mem, mem_size, page_size, 0);
+
+	munmap(mem, mem_size);
+
+	/* 3. Large page testing */
+	mem_size = 512 * 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+
+	base_tests("Large Page testing:", mem, mem_size, page_size, 0);
+
+	munmap(mem, mem_size);
+
+	/* 4. Huge page testing */
+	map = gethugepage(hpage_len);
+	if (check_huge(map))
+		base_tests("Huge page testing:", map, hpage_len, page_size, 0);
+	else
+		base_tests("Huge page testing:", NULL, 0, 0, 1);
+
+	free(map);
+
+	/* 5. Normal page testing */
+	mem_size = 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+
+	performance_base_tests("Performance Page testing:", mem, mem_size, page_size, 0);
+
+	munmap(mem, mem_size);
+
+	/* 6. Huge page tests */
+	hpage_unit_tests(page_size);
+
+	/* 7. Unmapped address test */
+	unmapped_region_tests(page_size);
+
+	/* 8. Iterative test */
+	test_simple(page_size);
+
+	close(pagemap_fd);
+	return ksft_exit_pass();
+}
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (2 preceding siblings ...)
  2022-08-26  6:45 ` [PATCH v3 3/4] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
@ 2022-08-26  6:45 ` Muhammad Usama Anjum
  2022-08-26  8:22   ` Bagas Sanjaya
  2022-09-07  9:40 ` [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-08-26  6:45 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: Muhammad Usama Anjum, kernel, Gabriel Krisman Bertazi,
	David Hildenbrand, Peter Enderborg, Greg KH

Add the explanation of the added ioctl on pagemap file. It can be used
to get, clear or both soft-dirty PTE bit of the specified range.
or both at the same time.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v2:
- Update documentation to mention ioctl instead of the syscall
---
 Documentation/admin-guide/mm/soft-dirty.rst | 42 ++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst
index cb0cfd6672fa..d3d33e63a965 100644
--- a/Documentation/admin-guide/mm/soft-dirty.rst
+++ b/Documentation/admin-guide/mm/soft-dirty.rst
@@ -5,7 +5,12 @@ Soft-Dirty PTEs
 ===============
 
 The soft-dirty is a bit on a PTE which helps to track which pages a task
-writes to. In order to do this tracking one should
+writes to.
+
+Using Proc FS
+-------------
+
+In order to do this tracking one should
 
   1. Clear soft-dirty bits from the task's PTEs.
 
@@ -20,6 +25,41 @@ writes to. In order to do this tracking one should
      64-bit qword is the soft-dirty one. If set, the respective PTE was
      written to since step 1.
 
+Using IOCTL
+-----------
+
+The IOCTL on the ``/proc/PID/pagemap`` can be can be used to find the dirty pages
+atomically. The following commands are supported::
+
+	MEMWATCH_SD_GET
+		Get the page offsets which are soft dirty.
+
+	MEMWATCH_SD_CLEAR
+		Clear the pages which are soft dirty.
+
+	MEMWATCH_SD_GET_AND_CLEAR
+		Get and clear the pages which are soft dirty.
+
+The struct :c:type:`pagemap_sd_args` is used as the argument. In this struct:
+
+  1. The range is specified through start and len. The len argument need not be
+     the multiple of the page size, but since the information is returned for the
+     whole pages, len is effectively rounded up to the next multiple of the page
+     size.
+
+  2. The output buffer and size is specified in vec and vec_len. The offsets of
+     the dirty pages from start are returned in vec. The ioctl returns when the
+     whole range has been searched or vec is completely filled. The whole range
+     isn't cleared if vec fills up completely.
+
+  3. The flags can be specified in flags field. Currently only one flag,
+     PAGEMAP_SD_NO_REUSED_REGIONS is supported which can be specified to ignore
+     the VMA dirty flags for better performance. This flag shows only those pages
+     dirty which have been written to by the user. All new allocations aren't returned
+     to be dirty.
+
+Explanation
+-----------
 
 Internally, to do this tracking, the writable bit is cleared from PTEs
 when the soft-dirty bit is cleared. So, after this, when the task tries to
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap
  2022-08-26  6:45 ` [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap Muhammad Usama Anjum
@ 2022-08-26  8:22   ` Bagas Sanjaya
  0 siblings, 0 replies; 24+ messages in thread
From: Bagas Sanjaya @ 2022-08-26  8:22 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Shuah Khan, open list:DOCUMENTATION, open list,
	open list:PROC FILESYSTEM, open list:MEMORY MANAGEMENT,
	open list:KERNEL SELFTEST FRAMEWORK
  Cc: kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH

On 8/26/22 13:45, Muhammad Usama Anjum wrote:
>  The soft-dirty is a bit on a PTE which helps to track which pages a task
> -writes to. In order to do this tracking one should
> +writes to.
> +
> +Using Proc FS
> +-------------
> +
> +In order to do this tracking one should
>  
>    1. Clear soft-dirty bits from the task's PTEs.
>  
> @@ -20,6 +25,41 @@ writes to. In order to do this tracking one should
>       64-bit qword is the soft-dirty one. If set, the respective PTE was
>       written to since step 1.
>  
> +Using IOCTL
> +-----------
> +
> +The IOCTL on the ``/proc/PID/pagemap`` can be can be used to find the dirty pages
> +atomically. The following commands are supported::
> +
> +	MEMWATCH_SD_GET
> +		Get the page offsets which are soft dirty.
> +
> +	MEMWATCH_SD_CLEAR
> +		Clear the pages which are soft dirty.
> +
> +	MEMWATCH_SD_GET_AND_CLEAR
> +		Get and clear the pages which are soft dirty.
> +

Definition lists are enough, no need to use code block.

> +The struct :c:type:`pagemap_sd_args` is used as the argument. In this struct:
> +
> +  1. The range is specified through start and len. The len argument need not be
> +     the multiple of the page size, but since the information is returned for the
> +     whole pages, len is effectively rounded up to the next multiple of the page
> +     size.
> +
> +  2. The output buffer and size is specified in vec and vec_len. The offsets of
> +     the dirty pages from start are returned in vec. The ioctl returns when the
> +     whole range has been searched or vec is completely filled. The whole range
> +     isn't cleared if vec fills up completely.
> +
> +  3. The flags can be specified in flags field. Currently only one flag,
> +     PAGEMAP_SD_NO_REUSED_REGIONS is supported which can be specified to ignore
> +     the VMA dirty flags for better performance. This flag shows only those pages
> +     dirty which have been written to by the user. All new allocations aren't returned
> +     to be dirty.
> +
> +Explanation
> +-----------
>  
>  Internally, to do this tracking, the writable bit is cleared from PTEs
>  when the soft-dirty bit is cleared. So, after this, when the task tries to

I'd like to see identifier keywords (such as filename, function and variable name)
are consistently formatted either with inline code (``identifier``) or no
formatting (all or nothing).

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (3 preceding siblings ...)
  2022-08-26  6:45 ` [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap Muhammad Usama Anjum
@ 2022-09-07  9:40 ` Muhammad Usama Anjum
  2022-09-19 14:58 ` Andrei Vagin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-09-07  9:40 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: usama.anjum, kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH

On 8/26/22 11:45 AM, Muhammad Usama Anjum wrote:
> 
> Hello,
> 
> This patch series implements a new ioctl on the pagemap proc fs file to
> get, clear and perform both get and clear at the same time atomically on
> the specified range of the memory.
> 
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. This series
> adds features that weren't present earlier.
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   present.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The proc fs interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for the running processes. We need this tracking and clear
> mechanism of a region of memory while the process is running to emulate
> the getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages and keep processing only the dirty pages. This
> new ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
> 
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen [1].
> 
> This ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this ioctl:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
Thoughts?

> 
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec
> - Stop execution when vec is filled with dirty pages
> These two arguments doesn't follow the mincore() philosophy where the
> output array corresponds to the address range in one to one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use with less control. We are
> passing the size of the output array and putting return data consecutively
> which is offset of dirty pages from the start. The user can convert these
> offsets back into the dirty page addresses easily. Suppose, the user want
> to get first 10 dirty pages from a total memory of 100 pages. He'll
> allocate output buffer of size 10 and the ioctl will abort after finding the
> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> behaviour like mincore() can be achieved by passing output buffer of 100
> size. This interface can be used for any desired behaviour.
> 
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> 
> Regards,
> Muhammad Usama Anjum
> 
> Muhammad Usama Anjum (4):
>   fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
>   fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
>   selftests: vm: add pagemap ioctl tests
>   mm: add documentation of the new ioctl on pagemap
> 
>  Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
>  fs/proc/task_mmu.c                          | 342 ++++++++++-
>  include/uapi/linux/fs.h                     |  23 +
>  tools/include/uapi/linux/fs.h               |  23 +
>  tools/testing/selftests/vm/.gitignore       |   1 +
>  tools/testing/selftests/vm/Makefile         |   2 +
>  tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
>  7 files changed, 1050 insertions(+), 32 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
> 

-- 
Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (4 preceding siblings ...)
  2022-09-07  9:40 ` [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
@ 2022-09-19 14:58 ` Andrei Vagin
  2022-09-21 18:26   ` Muhammad Usama Anjum
  2022-09-21 18:30 ` Muhammad Usama Anjum
  2022-09-28  6:03 ` Muhammad Usama Anjum
  7 siblings, 1 reply; 24+ messages in thread
From: Andrei Vagin @ 2022-09-19 14:58 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK,
	kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH

On Fri, Aug 26, 2022 at 11:45:31AM +0500, Muhammad Usama Anjum wrote:
> 
> Hello,
> 
> This patch series implements a new ioctl on the pagemap proc fs file to
> get, clear and perform both get and clear at the same time atomically on
> the specified range of the memory.
> 
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. This series
> adds features that weren't present earlier.
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   present.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The proc fs interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for the running processes. We need this tracking and clear
> mechanism of a region of memory while the process is running to emulate
> the getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages and keep processing only the dirty pages. This
> new ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
> 
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen [1].
> 
> This ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this ioctl:
> - Get the pages that are soft-dirty.

I think this interface doesn't have to be limited by the soft-dirty
bits only. For example, CRIU needs to know whether file, present and swap bits
are set or not.

I mean we should be able to specify for what pages we need to get info
for. An ioctl argument can have these four fields:
* required bits (rmask & mask == mask) - all bits from this mask have to be set.
* any of these bits (amask & mask != 0) - any of these bits is set.
* exclude masks (emask & mask == 0) = none of these bits are set.
* return mask - bits that have to be reported to user.

> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
> 
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec

We can conside to return regions that contains pages with the same set
of bits.

struct page_region {
	void *start;
	long size;
	u64 bitmap;
}

And ioctl returns arrays of page_region-s. I believe it will be more
compact form for many cases.

> - Stop execution when vec is filled with dirty pages
> These two arguments doesn't follow the mincore() philosophy where the
> output array corresponds to the address range in one to one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use with less control. We are
> passing the size of the output array and putting return data consecutively
> which is offset of dirty pages from the start. The user can convert these
> offsets back into the dirty page addresses easily. Suppose, the user want
> to get first 10 dirty pages from a total memory of 100 pages. He'll
> allocate output buffer of size 10 and the ioctl will abort after finding the
> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> behaviour like mincore() can be achieved by passing output buffer of 100
> size. This interface can be used for any desired behaviour.
> 
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> 
> Regards,
> Muhammad Usama Anjum
> 
> Muhammad Usama Anjum (4):
>   fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
>   fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
>   selftests: vm: add pagemap ioctl tests
>   mm: add documentation of the new ioctl on pagemap
> 
>  Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
>  fs/proc/task_mmu.c                          | 342 ++++++++++-
>  include/uapi/linux/fs.h                     |  23 +
>  tools/include/uapi/linux/fs.h               |  23 +
>  tools/testing/selftests/vm/.gitignore       |   1 +
>  tools/testing/selftests/vm/Makefile         |   2 +
>  tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
>  7 files changed, 1050 insertions(+), 32 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
> 
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-09-19 14:58 ` Andrei Vagin
@ 2022-09-21 18:26   ` Muhammad Usama Anjum
  2022-09-28 17:24     ` Andrei Vagin
  0 siblings, 1 reply; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-09-21 18:26 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: usama.anjum, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Shuah Khan, open list:DOCUMENTATION, open list,
	open list:PROC FILESYSTEM, open list:MEMORY MANAGEMENT,
	open list:KERNEL SELFTEST FRAMEWORK, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	Greg KH

Hi,

Thank you for reviewing.

On 9/19/22 7:58 PM, Andrei Vagin wrote:
>> This ioctl can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information. The following operations are
>> supported in this ioctl:
>> - Get the pages that are soft-dirty.
> 
> I think this interface doesn't have to be limited by the soft-dirty
> bits only. For example, CRIU needs to know whether file, present and swap bits
> are set or not.
These operations can be performed by pagemap procfs file. Definitely
performing them through IOCTL will be faster. But I'm trying to add a
simple IOCTL by which some specific PTE bit can be read and cleared
atomically. This IOCTL can be extended to include other bits like file,
present and swap bits by keeping the interface simple. The following
mask advice is nice. But if we add that kind of masking, it'll start to
look like a filter on top of pagemap. My intention is to not duplicate
the functionality already provided by the pagemap. One may ask, then why
am I adding "get the soft-dirty pages" functionality? I'm adding it to
complement the get and clear operation. The "get" and "get and clear"
operations with special flag (PAGEMAP_SD_NO_REUSED_REGIONS) can give
results quicker by not splitting the VMAs.

> 
> I mean we should be able to specify for what pages we need to get info
> for. An ioctl argument can have these four fields:
> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
> * any of these bits (amask & mask != 0) - any of these bits is set.
> * exclude masks (emask & mask == 0) = none of these bits are set.
> * return mask - bits that have to be reported to user.
> 
>> - Clear the pages which are soft-dirty.
>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>> soft-dirty PTE bit
>>
>> There are two decisions which have been taken about how to get the output
>> from the syscall.
>> - Return offsets of the pages from the start in the vec
> 
> We can conside to return regions that contains pages with the same set
> of bits.
> 
> struct page_region {
> 	void *start;
> 	long size;
> 	u64 bitmap;
> }
> 
> And ioctl returns arrays of page_region-s. I believe it will be more
> compact form for many cases.
Thank you for mentioning this. I'd considered this while development.
But I gave up and used the simple array to return the offsets of the
pages as in the problem I'm trying to solve, the dirty pages may be
present amid non-dirty pages. The range may not be useful in that case.
Also we want to return only a specific number of pages of interest. The
following paragraph explains it.

> 
>> - Stop execution when vec is filled with dirty pages
>> These two arguments doesn't follow the mincore() philosophy where the
>> output array corresponds to the address range in one to one fashion, hence
>> the output buffer length isn't passed and only a flag is set if the page
>> is present. This makes mincore() easy to use with less control. We are
>> passing the size of the output array and putting return data consecutively
>> which is offset of dirty pages from the start. The user can convert these
>> offsets back into the dirty page addresses easily. Suppose, the user want
>> to get first 10 dirty pages from a total memory of 100 pages. He'll
>> allocate output buffer of size 10 and the ioctl will abort after finding the
>> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
>> behaviour like mincore() can be achieved by passing output buffer of 100
>> size. This interface can be used for any desired behaviour.
>>
>> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
>>
>> Regards,
>> Muhammad Usama Anjum
>>
>> Muhammad Usama Anjum (4):
>>   fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
>>   fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
>>   selftests: vm: add pagemap ioctl tests
>>   mm: add documentation of the new ioctl on pagemap
>>
>>  Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
>>  fs/proc/task_mmu.c                          | 342 ++++++++++-
>>  include/uapi/linux/fs.h                     |  23 +
>>  tools/include/uapi/linux/fs.h               |  23 +
>>  tools/testing/selftests/vm/.gitignore       |   1 +
>>  tools/testing/selftests/vm/Makefile         |   2 +
>>  tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
>>  7 files changed, 1050 insertions(+), 32 deletions(-)
>>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
>>
>> -- 
>> 2.30.2
>>

-- 
Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (5 preceding siblings ...)
  2022-09-19 14:58 ` Andrei Vagin
@ 2022-09-21 18:30 ` Muhammad Usama Anjum
  2022-09-28  6:03 ` Muhammad Usama Anjum
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-09-21 18:30 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: usama.anjum, kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH, Jonathan Corbet, Alexander Viro,
	Andrew Morton, Shuah Khan, open list:DOCUMENTATION, open list,
	open list:PROC FILESYSTEM, open list:MEMORY MANAGEMENT,
	open list:KERNEL SELFTEST FRAMEWORK

Hi Suren,

I'd shared this problem with you at Plumbers. Please review and share
your thoughts.

Thanks,
Usama


On 8/26/22 11:45 AM, Muhammad Usama Anjum wrote:
> 
> Hello,
> 
> This patch series implements a new ioctl on the pagemap proc fs file to
> get, clear and perform both get and clear at the same time atomically on
> the specified range of the memory.
> 
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. This series
> adds features that weren't present earlier.
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   present.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The proc fs interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for the running processes. We need this tracking and clear
> mechanism of a region of memory while the process is running to emulate
> the getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages and keep processing only the dirty pages. This
> new ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
> 
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen [1].
> 
> This ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this ioctl:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
> 
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec
> - Stop execution when vec is filled with dirty pages
> These two arguments doesn't follow the mincore() philosophy where the
> output array corresponds to the address range in one to one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use with less control. We are
> passing the size of the output array and putting return data consecutively
> which is offset of dirty pages from the start. The user can convert these
> offsets back into the dirty page addresses easily. Suppose, the user want
> to get first 10 dirty pages from a total memory of 100 pages. He'll
> allocate output buffer of size 10 and the ioctl will abort after finding the
> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> behaviour like mincore() can be achieved by passing output buffer of 100
> size. This interface can be used for any desired behaviour.
> 
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> 
> Regards,
> Muhammad Usama Anjum
> 
> Muhammad Usama Anjum (4):
>   fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
>   fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
>   selftests: vm: add pagemap ioctl tests
>   mm: add documentation of the new ioctl on pagemap
> 
>  Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
>  fs/proc/task_mmu.c                          | 342 ++++++++++-
>  include/uapi/linux/fs.h                     |  23 +
>  tools/include/uapi/linux/fs.h               |  23 +
>  tools/testing/selftests/vm/.gitignore       |   1 +
>  tools/testing/selftests/vm/Makefile         |   2 +
>  tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
>  7 files changed, 1050 insertions(+), 32 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
> 

-- 
Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
                   ` (6 preceding siblings ...)
  2022-09-21 18:30 ` Muhammad Usama Anjum
@ 2022-09-28  6:03 ` Muhammad Usama Anjum
  7 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-09-28  6:03 UTC (permalink / raw)
  To: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK
  Cc: usama.anjum, kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH, Suren Baghdasaryan

Any thoughts about it?

On 8/26/22 11:45 AM, Muhammad Usama Anjum wrote:
> 
> Hello,
> 
> This patch series implements a new ioctl on the pagemap proc fs file to
> get, clear and perform both get and clear at the same time atomically on
> the specified range of the memory.
> 
> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. This series
> adds features that weren't present earlier.
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   present.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The proc fs interface is enough for that as I think the process
> is frozen. We have the use case where we need to track the soft-dirty
> PTE bit for the running processes. We need this tracking and clear
> mechanism of a region of memory while the process is running to emulate
> the getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages and keep processing only the dirty pages. This
> new ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information.
> 
> As in the current kernel there is no way to clear a part of memory (instead
> of clearing the Soft-Dirty bits for the entire process) and get+clear
> operation cannot be performed atomically, there are other methods to mimic
> this information entirely in userspace with poor performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen [1].
> 
> This ioctl can be used by the CRIU project and other applications which
> require soft-dirty PTE bit information. The following operations are
> supported in this ioctl:
> - Get the pages that are soft-dirty.
> - Clear the pages which are soft-dirty.
> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> soft-dirty PTE bit
> 
> There are two decisions which have been taken about how to get the output
> from the syscall.
> - Return offsets of the pages from the start in the vec
> - Stop execution when vec is filled with dirty pages
> These two arguments doesn't follow the mincore() philosophy where the
> output array corresponds to the address range in one to one fashion, hence
> the output buffer length isn't passed and only a flag is set if the page
> is present. This makes mincore() easy to use with less control. We are
> passing the size of the output array and putting return data consecutively
> which is offset of dirty pages from the start. The user can convert these
> offsets back into the dirty page addresses easily. Suppose, the user want
> to get first 10 dirty pages from a total memory of 100 pages. He'll
> allocate output buffer of size 10 and the ioctl will abort after finding the
> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> behaviour like mincore() can be achieved by passing output buffer of 100
> size. This interface can be used for any desired behaviour.
> 
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> 
> Regards,
> Muhammad Usama Anjum
> 
> Muhammad Usama Anjum (4):
>   fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
>   fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty PTE bit
>   selftests: vm: add pagemap ioctl tests
>   mm: add documentation of the new ioctl on pagemap
> 
>  Documentation/admin-guide/mm/soft-dirty.rst |  42 +-
>  fs/proc/task_mmu.c                          | 342 ++++++++++-
>  include/uapi/linux/fs.h                     |  23 +
>  tools/include/uapi/linux/fs.h               |  23 +
>  tools/testing/selftests/vm/.gitignore       |   1 +
>  tools/testing/selftests/vm/Makefile         |   2 +
>  tools/testing/selftests/vm/pagemap_ioctl.c  | 649 ++++++++++++++++++++
>  7 files changed, 1050 insertions(+), 32 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
> 

-- 
Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-09-21 18:26   ` Muhammad Usama Anjum
@ 2022-09-28 17:24     ` Andrei Vagin
  2022-10-03 11:21       ` Muhammad Usama Anjum
  0 siblings, 1 reply; 24+ messages in thread
From: Andrei Vagin @ 2022-09-28 17:24 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK,
	kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH

On Wed, Sep 21, 2022 at 11:26 AM Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> Hi,
>
> Thank you for reviewing.
>
> On 9/19/22 7:58 PM, Andrei Vagin wrote:
> >> This ioctl can be used by the CRIU project and other applications which
> >> require soft-dirty PTE bit information. The following operations are
> >> supported in this ioctl:
> >> - Get the pages that are soft-dirty.
> >
> > I think this interface doesn't have to be limited by the soft-dirty
> > bits only. For example, CRIU needs to know whether file, present and swap bits
> > are set or not.
> These operations can be performed by pagemap procfs file. Definitely
> performing them through IOCTL will be faster. But I'm trying to add a
> simple IOCTL by which some specific PTE bit can be read and cleared
> atomically. This IOCTL can be extended to include other bits like file,
> present and swap bits by keeping the interface simple. The following
> mask advice is nice. But if we add that kind of masking, it'll start to
> look like a filter on top of pagemap. My intention is to not duplicate
> the functionality already provided by the pagemap. One may ask, then why
> am I adding "get the soft-dirty pages" functionality? I'm adding it to
> complement the get and clear operation. The "get" and "get and clear"
> operations with special flag (PAGEMAP_SD_NO_REUSED_REGIONS) can give
> results quicker by not splitting the VMAs.

This simple interface is good only for a limited number of use-cases.
The interface
that I suggest doesn't duplicate more code than this one, but it is much more
universal. It will be a big mess if you add a separate API for each
specific use-case.

>
> >
> > I mean we should be able to specify for what pages we need to get info
> > for. An ioctl argument can have these four fields:
> > * required bits (rmask & mask == mask) - all bits from this mask have to be set.
> > * any of these bits (amask & mask != 0) - any of these bits is set.
> > * exclude masks (emask & mask == 0) = none of these bits are set.
> > * return mask - bits that have to be reported to user.
> >
> >> - Clear the pages which are soft-dirty.
> >> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> >> soft-dirty PTE bit
> >>
> >> There are two decisions which have been taken about how to get the output
> >> from the syscall.
> >> - Return offsets of the pages from the start in the vec
> >
> > We can conside to return regions that contains pages with the same set
> > of bits.
> >
> > struct page_region {
> >       void *start;
> >       long size;
> >       u64 bitmap;
> > }
> >
> > And ioctl returns arrays of page_region-s. I believe it will be more
> > compact form for many cases.
> Thank you for mentioning this. I'd considered this while development.
> But I gave up and used the simple array to return the offsets of the
> pages as in the problem I'm trying to solve, the dirty pages may be
> present amid non-dirty pages. The range may not be useful in that case.

This is a good example. If we expect more than two consequent pages
on average, the "region" interface looks more prefered. I don't know your
use-case, but in the case of CRIU, this assumption looks reasonable.

> Also we want to return only a specific number of pages of interest. The
> following paragraph explains it.
>
> >
> >> - Stop execution when vec is filled with dirty pages
> >> These two arguments doesn't follow the mincore() philosophy where the
> >> output array corresponds to the address range in one to one fashion, hence
> >> the output buffer length isn't passed and only a flag is set if the page
> >> is present. This makes mincore() easy to use with less control. We are
> >> passing the size of the output array and putting return data consecutively
> >> which is offset of dirty pages from the start. The user can convert these
> >> offsets back into the dirty page addresses easily. Suppose, the user want
> >> to get first 10 dirty pages from a total memory of 100 pages. He'll
> >> allocate output buffer of size 10 and the ioctl will abort after finding the
> >> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> >> behaviour like mincore() can be achieved by passing output buffer of 100
> >> size. This interface can be used for any desired behaviour.

Now, it is more clear where this interface came from. It repeats the interface
of Windows' getWriteWatch. I think we have to look wider. The
interface that reports
regions will be more efficient for many use-cases. As for
getWriteWatch, it will require
a bit more code in user-space, but this code is trivial.

Thanks,
Andrei


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-09-28 17:24     ` Andrei Vagin
@ 2022-10-03 11:21       ` Muhammad Usama Anjum
  2022-10-11  4:52         ` Andrei Vagin
  0 siblings, 1 reply; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-03 11:21 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: usama.anjum, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Shuah Khan, open list:DOCUMENTATION, open list,
	open list:PROC FILESYSTEM, open list:MEMORY MANAGEMENT,
	open list:KERNEL SELFTEST FRAMEWORK, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	Greg KH, Suren Baghdasaryan, Matthew Wilcox

On 9/28/22 10:24 PM, Andrei Vagin wrote:
> On Wed, Sep 21, 2022 at 11:26 AM Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> Hi,
>>
>> Thank you for reviewing.
>>
>> On 9/19/22 7:58 PM, Andrei Vagin wrote:
>>>> This ioctl can be used by the CRIU project and other applications which
>>>> require soft-dirty PTE bit information. The following operations are
>>>> supported in this ioctl:
>>>> - Get the pages that are soft-dirty.
>>>
>>> I think this interface doesn't have to be limited by the soft-dirty
>>> bits only. For example, CRIU needs to know whether file, present and swap bits
>>> are set or not.
>> These operations can be performed by pagemap procfs file. Definitely
>> performing them through IOCTL will be faster. But I'm trying to add a
>> simple IOCTL by which some specific PTE bit can be read and cleared
>> atomically. This IOCTL can be extended to include other bits like file,
>> present and swap bits by keeping the interface simple. The following
>> mask advice is nice. But if we add that kind of masking, it'll start to
>> look like a filter on top of pagemap. My intention is to not duplicate
>> the functionality already provided by the pagemap. One may ask, then why
>> am I adding "get the soft-dirty pages" functionality? I'm adding it to
>> complement the get and clear operation. The "get" and "get and clear"
>> operations with special flag (PAGEMAP_SD_NO_REUSED_REGIONS) can give
>> results quicker by not splitting the VMAs.
> 
> This simple interface is good only for a limited number of use-cases.
> The interface
> that I suggest doesn't duplicate more code than this one, but it is much more
> universal. It will be a big mess if you add a separate API for each
> specific use-case.
>
>
>>> I mean we should be able to specify for what pages we need to get info
>>> for. An ioctl argument can have these four fields:
>>> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
>>> * any of these bits (amask & mask != 0) - any of these bits is set.
>>> * exclude masks (emask & mask == 0) = none of these bits are set.
>>> * return mask - bits that have to be reported to user.
The required mask (rmask) makes sense to me. At the moment, I only know
about the practical use case for the required mask. Can you share how
can any and exclude masks help for the CRIU?

>>>
>>>> - Clear the pages which are soft-dirty.
>>>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>>>> soft-dirty PTE bit
>>>>
>>>> There are two decisions which have been taken about how to get the output
>>>> from the syscall.
>>>> - Return offsets of the pages from the start in the vec
>>>
>>> We can conside to return regions that contains pages with the same set
>>> of bits.
>>>
>>> struct page_region {
>>>       void *start;
>>>       long size;
>>>       u64 bitmap;
>>> }
>>>
>>> And ioctl returns arrays of page_region-s. I believe it will be more
>>> compact form for many cases.
>> Thank you for mentioning this. I'd considered this while development.
>> But I gave up and used the simple array to return the offsets of the
>> pages as in the problem I'm trying to solve, the dirty pages may be
>> present amid non-dirty pages. The range may not be useful in that case.
> 
> This is a good example. If we expect more than two consequent pages
> on average, the "region" interface looks more prefered. I don't know your
> use-case, but in the case of CRIU, this assumption looks reasonable.
> 
>> Also we want to return only a specific number of pages of interest. The
>> following paragraph explains it.
>>
>>>
>>>> - Stop execution when vec is filled with dirty pages
>>>> These two arguments doesn't follow the mincore() philosophy where the
>>>> output array corresponds to the address range in one to one fashion, hence
>>>> the output buffer length isn't passed and only a flag is set if the page
>>>> is present. This makes mincore() easy to use with less control. We are
>>>> passing the size of the output array and putting return data consecutively
>>>> which is offset of dirty pages from the start. The user can convert these
>>>> offsets back into the dirty page addresses easily. Suppose, the user want
>>>> to get first 10 dirty pages from a total memory of 100 pages. He'll
>>>> allocate output buffer of size 10 and the ioctl will abort after finding the
>>>> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
>>>> behaviour like mincore() can be achieved by passing output buffer of 100
>>>> size. This interface can be used for any desired behaviour.
> 
> Now, it is more clear where this interface came from. It repeats the interface
> of Windows' getWriteWatch. I think we have to look wider. The
> interface that reports
> regions will be more efficient for many use-cases. As for
> getWriteWatch, it will require
> a bit more code in user-space, but this code is trivial.
> 
> Thanks,
> Andrei

-- 
Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-03 11:21       ` Muhammad Usama Anjum
@ 2022-10-11  4:52         ` Andrei Vagin
  2022-10-14 13:48           ` Danylo Mocherniuk
  0 siblings, 1 reply; 24+ messages in thread
From: Andrei Vagin @ 2022-10-11  4:52 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Shuah Khan,
	open list:DOCUMENTATION, open list, open list:PROC FILESYSTEM,
	open list:MEMORY MANAGEMENT, open list:KERNEL SELFTEST FRAMEWORK,
	kernel, Gabriel Krisman Bertazi, David Hildenbrand,
	Peter Enderborg, Greg KH, Suren Baghdasaryan, Matthew Wilcox,
	Danylo Mocherniuk

On Mon, Oct 03, 2022 at 04:21:22PM +0500, Muhammad Usama Anjum wrote:
> On 9/28/22 10:24 PM, Andrei Vagin wrote:
> > On Wed, Sep 21, 2022 at 11:26 AM Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >>
> >> Hi,
> >>
> >> Thank you for reviewing.
> >>
> >> On 9/19/22 7:58 PM, Andrei Vagin wrote:
> >>>> This ioctl can be used by the CRIU project and other applications which
> >>>> require soft-dirty PTE bit information. The following operations are
> >>>> supported in this ioctl:
> >>>> - Get the pages that are soft-dirty.
> >>>
> >>> I think this interface doesn't have to be limited by the soft-dirty
> >>> bits only. For example, CRIU needs to know whether file, present and swap bits
> >>> are set or not.
> >> These operations can be performed by pagemap procfs file. Definitely
> >> performing them through IOCTL will be faster. But I'm trying to add a
> >> simple IOCTL by which some specific PTE bit can be read and cleared
> >> atomically. This IOCTL can be extended to include other bits like file,
> >> present and swap bits by keeping the interface simple. The following
> >> mask advice is nice. But if we add that kind of masking, it'll start to
> >> look like a filter on top of pagemap. My intention is to not duplicate
> >> the functionality already provided by the pagemap. One may ask, then why
> >> am I adding "get the soft-dirty pages" functionality? I'm adding it to
> >> complement the get and clear operation. The "get" and "get and clear"
> >> operations with special flag (PAGEMAP_SD_NO_REUSED_REGIONS) can give
> >> results quicker by not splitting the VMAs.
> > 
> > This simple interface is good only for a limited number of use-cases.
> > The interface
> > that I suggest doesn't duplicate more code than this one, but it is much more
> > universal. It will be a big mess if you add a separate API for each
> > specific use-case.
> >
> >
> >>> I mean we should be able to specify for what pages we need to get info
> >>> for. An ioctl argument can have these four fields:
> >>> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
> >>> * any of these bits (amask & mask != 0) - any of these bits is set.
> >>> * exclude masks (emask & mask == 0) = none of these bits are set.
> >>> * return mask - bits that have to be reported to user.
> The required mask (rmask) makes sense to me. At the moment, I only know
> about the practical use case for the required mask. Can you share how
> can any and exclude masks help for the CRIU?
> 

I looked at should_dump_page in the CRIU code:
https://github.com/checkpoint-restore/criu/blob/45641ab26d7bb78706a6215fdef8f9133abf8d10/criu/mem.c#L102

When CRIU dumps file private mappings, it needs to get pages that have
PME_PRESENT or PME_SWAP but don't have PME_FILE.

> >>>> - Clear the pages which are soft-dirty.
> >>>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> >>>> soft-dirty PTE bit
> >>>>
> >>>> There are two decisions which have been taken about how to get the output
> >>>> from the syscall.
> >>>> - Return offsets of the pages from the start in the vec
> >>>
> >>> We can conside to return regions that contains pages with the same set
> >>> of bits.
> >>>
> >>> struct page_region {
> >>>       void *start;
> >>>       long size;
> >>>       u64 bitmap;
> >>> }
> >>>
> >>> And ioctl returns arrays of page_region-s. I believe it will be more
> >>> compact form for many cases.
> >> Thank you for mentioning this. I'd considered this while development.
> >> But I gave up and used the simple array to return the offsets of the
> >> pages as in the problem I'm trying to solve, the dirty pages may be
> >> present amid non-dirty pages. The range may not be useful in that case.
> > 
> > This is a good example. If we expect more than two consequent pages
> > on average, the "region" interface looks more prefered. I don't know your
> > use-case, but in the case of CRIU, this assumption looks reasonable.
> > 
> >> Also we want to return only a specific number of pages of interest. The
> >> following paragraph explains it.
> >>
> >>>
> >>>> - Stop execution when vec is filled with dirty pages
> >>>> These two arguments doesn't follow the mincore() philosophy where the
> >>>> output array corresponds to the address range in one to one fashion, hence
> >>>> the output buffer length isn't passed and only a flag is set if the page
> >>>> is present. This makes mincore() easy to use with less control. We are
> >>>> passing the size of the output array and putting return data consecutively
> >>>> which is offset of dirty pages from the start. The user can convert these
> >>>> offsets back into the dirty page addresses easily. Suppose, the user want
> >>>> to get first 10 dirty pages from a total memory of 100 pages. He'll
> >>>> allocate output buffer of size 10 and the ioctl will abort after finding the
> >>>> 10 pages. This behaviour is needed to support Windows' getWriteWatch(). The
> >>>> behaviour like mincore() can be achieved by passing output buffer of 100
> >>>> size. This interface can be used for any desired behaviour.
> > 
> > Now, it is more clear where this interface came from. It repeats the
> > interface of Windows' getWriteWatch. I think we have to look wider.
> > The interface that reports regions will be more efficient for many
> > use-cases. As for getWriteWatch, it will require a bit more code in
> > user-space, but this code is trivial.

I added Danylo to CC. I think he has a good use-case for the new
interface. Danylo, could you describe it here.

> > 
> > Thanks,
> > Andrei
> 
> -- 
> Muhammad Usama Anjum


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-11  4:52         ` Andrei Vagin
@ 2022-10-14 13:48           ` Danylo Mocherniuk
  2022-10-18 10:36             ` Muhammad Usama Anjum
  0 siblings, 1 reply; 24+ messages in thread
From: Danylo Mocherniuk @ 2022-10-14 13:48 UTC (permalink / raw)
  To: avagin
  Cc: akpm, corbet, david, gregkh, kernel, krisman, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm, mdanylo,
	peter.enderborg, shuah, surenb, usama.anjum, viro, willy, emmir,
	figiel, kyurtsever

On Tue, Oct 11, 2022 at 6:52 AM Andrei Vagin <avagin@gmail.com> wrote:
>
> On Mon, Oct 03, 2022 at 04:21:22PM +0500, Muhammad Usama Anjum wrote:
> > On 9/28/22 10:24 PM, Andrei Vagin wrote:
> > > On Wed, Sep 21, 2022 at 11:26 AM Muhammad Usama Anjum
> > > <usama.anjum@collabora.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Thank you for reviewing.
> > >>
> > >> On 9/19/22 7:58 PM, Andrei Vagin wrote:
> > >>>> This ioctl can be used by the CRIU project and other applications which
> > >>>> require soft-dirty PTE bit information. The following operations are
> > >>>> supported in this ioctl:
> > >>>> - Get the pages that are soft-dirty.
> > >>>
> > >>> I think this interface doesn't have to be limited by the soft-dirty
> > >>> bits only. For example, CRIU needs to know whether file, present and swap bits
> > >>> are set or not.
> > >> These operations can be performed by pagemap procfs file. Definitely
> > >> performing them through IOCTL will be faster. But I'm trying to add a
> > >> simple IOCTL by which some specific PTE bit can be read and cleared
> > >> atomically. This IOCTL can be extended to include other bits like file,
> > >> present and swap bits by keeping the interface simple. The following
> > >> mask advice is nice. But if we add that kind of masking, it'll start to
> > >> look like a filter on top of pagemap. My intention is to not duplicate
> > >> the functionality already provided by the pagemap. One may ask, then why
> > >> am I adding "get the soft-dirty pages" functionality? I'm adding it to
> > >> complement the get and clear operation. The "get" and "get and clear"
> > >> operations with special flag (PAGEMAP_SD_NO_REUSED_REGIONS) can give
> > >> results quicker by not splitting the VMAs.
> > >
> > > This simple interface is good only for a limited number of use-cases.
> > > The interface
> > > that I suggest doesn't duplicate more code than this one, but it is much more
> > > universal. It will be a big mess if you add a separate API for each
> > > specific use-case.
> > >
> > >
> > >>> I mean we should be able to specify for what pages we need to get info
> > >>> for. An ioctl argument can have these four fields:
> > >>> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
> > >>> * any of these bits (amask & mask != 0) - any of these bits is set.
> > >>> * exclude masks (emask & mask == 0) = none of these bits are set.
> > >>> * return mask - bits that have to be reported to user.
> > The required mask (rmask) makes sense to me. At the moment, I only know
> > about the practical use case for the required mask. Can you share how
> > can any and exclude masks help for the CRIU?
> >
>
> I looked at should_dump_page in the CRIU code:
> https://github.com/checkpoint-restore/criu/blob/45641ab26d7bb78706a6215fdef8f9133abf8d10/criu/mem.c#L102
>
> When CRIU dumps file private mappings, it needs to get pages that have
> PME_PRESENT or PME_SWAP but don't have PME_FILE.

I would really like to see the mask discussed will be adopted. With it CRIU will
be able to migrate huge sparse VMAs assuming that a single hole is processed in 
O(1) time. 

Use cases for migrating sparse VMAs are binaries sanitized with ASAN, MSAN or
TSAN [1]. All of these sanitizers produce sparse mappings of shadow memory [2].
Being able to migrate such binaries allows to highly reduce the amount of work
needed to identify and fix post-migration crashes, which happen constantly.

>
> > >>>> - Clear the pages which are soft-dirty.
> > >>>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
> > >>>> soft-dirty PTE bit
> > >>>>
> > >>>> There are two decisions which have been taken about how to get the output
> > >>>> from the syscall.
> > >>>> - Return offsets of the pages from the start in the vec
> > >>>
> > >>> We can conside to return regions that contains pages with the same set
> > >>> of bits.
> > >>>
> > >>> struct page_region {
> > >>>       void *start;
> > >>>       long size;
> > >>>       u64 bitmap;
> > >>> }
> > >>>
> > >>> And ioctl returns arrays of page_region-s. I believe it will be more
> > >>> compact form for many cases.
> > >> Thank you for mentioning this. I'd considered this while development.
> > >> But I gave up and used the simple array to return the offsets of the
> > >> pages as in the problem I'm trying to solve, the dirty pages may be
> > >> present amid non-dirty pages. The range may not be useful in that case.
> > >
> > > This is a good example. If we expect more than two consequent pages
> > > on average, the "region" interface looks more prefered. I don't know your
> > > use-case, but in the case of CRIU, this assumption looks reasonable.

Plus one for page_region data structure. It will make ASAN shadow memory
representation much more compact as well as any other classical use-case. 

[1] https://github.com/google/sanitizers	
[2] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit	

Best,
Danylo



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-14 13:48           ` Danylo Mocherniuk
@ 2022-10-18 10:36             ` Muhammad Usama Anjum
  2022-10-18 10:48               ` Greg KH
                                 ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-18 10:36 UTC (permalink / raw)
  To: Danylo Mocherniuk, avagin, linux-mm, akpm, gregkh
  Cc: corbet, david, kernel, krisman, linux-doc, linux-fsdevel,
	linux-kernel, linux-kselftest, peter.enderborg, shuah, viro,
	willy, emmir, figiel, kyurtsever, Paul Gofman, surenb

>>>>>> I mean we should be able to specify for what pages we need to get info
>>>>>> for. An ioctl argument can have these four fields:
>>>>>> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
>>>>>> * any of these bits (amask & mask != 0) - any of these bits is set.
>>>>>> * exclude masks (emask & mask == 0) = none of these bits are set.
>>>>>> * return mask - bits that have to be reported to user.
>>> The required mask (rmask) makes sense to me. At the moment, I only know
>>> about the practical use case for the required mask. Can you share how
>>> can any and exclude masks help for the CRIU?
>>>
>>
>> I looked at should_dump_page in the CRIU code:
>> https://github.com/checkpoint-restore/criu/blob/45641ab26d7bb78706a6215fdef8f9133abf8d10/criu/mem.c#L102
>>
>> When CRIU dumps file private mappings, it needs to get pages that have
>> PME_PRESENT or PME_SWAP but don't have PME_FILE.
> 
> I would really like to see the mask discussed will be adopted. With it CRIU will
> be able to migrate huge sparse VMAs assuming that a single hole is processed in
> O(1) time.
> 
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN, MSAN or
> TSAN [1]. All of these sanitizers produce sparse mappings of shadow memory [2].
> Being able to migrate such binaries allows to highly reduce the amount of work
> needed to identify and fix post-migration crashes, which happen constantly.
> 

Hello all,

I've included the masks which the CRIU developers have specified. 
max_out_page is another new optional variable which is needed to 
terminate the operation without visiting all the pages after finding the 
max_out_page number of desired pages. There is no way to terminate the 
operation without this variable.

How does the interface looks now? Please comment.

/* PAGEMAP IOCTL */
#define PAGEMAP_GET		_IOWR('f', 16, struct pagemap_sd_args)
#define PAGEMAP_CLEAR		_IOWR('f', 17, struct pagemap_sd_args)
#define PAGEMAP_GET_AND_CLEAR	_IOWR('f', 18, struct pagemap_sd_args)

/* Bits are set in the bitmap of the page_region and masks in 
pagemap_sd_args */
#define PAGE_IS_SD	1 << 0
#define PAGE_IS_FILE	1 << 1
#define PAGE_IS_PRESENT	1 << 2
#define PAGE_IS_SWAPED	1 << 3

/**
  * struct page_region - Page region with bitmap flags
  * @start:	Start of the region
  * @len:	Length of the region
  * bitmap:	Bits sets for the region
  */
struct page_region {
	__u64 start;
	__u64 len;
	__u64 bitmap;
};

/**
  * struct pagemap_sd_args - Soft-dirty IOCTL argument
  * @start:		Starting address
  * @len:		Length of the region
  * @vec:		Output page_region struct array
  * @vec_len:		Length of the page_region struct array
  * @max_out_page:	Optional max output pages (It must be less than 
vec_len if specified)
  * @flags:		Special flags for the IOCTL
  * @rmask:		Special flags for the IOCTL
  * @amask:		Special flags for the IOCTL
  * @emask:		Special flags for the IOCTL
  * @__reserved:		Reserved member to preserve data alignment. Must be 0.
  */
struct pagemap_sd_args {
	__u64 __user start;
	__u64 len;
	__u64 __user vec; // page_region
	__u64 vec_len;    // sizeof(page_region)
	__u32 flags;      // special flags
	__u32 rmask;
	__u32 amask;
	__u32 emask;
	__u32 max_out_page;
	__u32 __reserved;
};

/* Special flags */
#define PAGEMAP_NO_REUSED_REGIONS	0x1


>>
>>>>>>> - Clear the pages which are soft-dirty.
>>>>>>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>>>>>>> soft-dirty PTE bit
>>>>>>>
>>>>>>> There are two decisions which have been taken about how to get the output
>>>>>>> from the syscall.
>>>>>>> - Return offsets of the pages from the start in the vec
>>>>>>
>>>>>> We can conside to return regions that contains pages with the same set
>>>>>> of bits.
>>>>>>
>>>>>> struct page_region {
>>>>>>        void *start;
>>>>>>        long size;
>>>>>>        u64 bitmap;
>>>>>> }
>>>>>>
>>>>>> And ioctl returns arrays of page_region-s. I believe it will be more
>>>>>> compact form for many cases.
>>>>> Thank you for mentioning this. I'd considered this while development.
>>>>> But I gave up and used the simple array to return the offsets of the
>>>>> pages as in the problem I'm trying to solve, the dirty pages may be
>>>>> present amid non-dirty pages. The range may not be useful in that case.
>>>>
>>>> This is a good example. If we expect more than two consequent pages
>>>> on average, the "region" interface looks more prefered. I don't know your
>>>> use-case, but in the case of CRIU, this assumption looks reasonable.
> 
> Plus one for page_region data structure. It will make ASAN shadow memory
> representation much more compact as well as any other classical use-case.
> 
> [1] https://github.com/google/sanitizers	
> [2] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit	
> 
> Best,
> Danylo
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 10:36             ` Muhammad Usama Anjum
@ 2022-10-18 10:48               ` Greg KH
  2022-10-18 12:30                 ` Muhammad Usama Anjum
  2022-10-18 11:11               ` Michał Mirosław
  2022-10-18 13:32               ` Muhammad Usama Anjum
  2 siblings, 1 reply; 24+ messages in thread
From: Greg KH @ 2022-10-18 10:48 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Danylo Mocherniuk, avagin, linux-mm, akpm, corbet, david, kernel,
	krisman, linux-doc, linux-fsdevel, linux-kernel, linux-kselftest,
	peter.enderborg, shuah, viro, willy, emmir, figiel, kyurtsever,
	Paul Gofman, surenb

On Tue, Oct 18, 2022 at 03:36:24PM +0500, Muhammad Usama Anjum wrote:
> /**
>  * struct pagemap_sd_args - Soft-dirty IOCTL argument
>  * @start:		Starting address
>  * @len:		Length of the region
>  * @vec:		Output page_region struct array
>  * @vec_len:		Length of the page_region struct array
>  * @max_out_page:	Optional max output pages (It must be less than vec_len if
> specified)
>  * @flags:		Special flags for the IOCTL
>  * @rmask:		Special flags for the IOCTL
>  * @amask:		Special flags for the IOCTL
>  * @emask:		Special flags for the IOCTL

What do you mean exactly by "special flags"?

>  * @__reserved:		Reserved member to preserve data alignment. Must be 0.
>  */
> struct pagemap_sd_args {
> 	__u64 __user start;
> 	__u64 len;
> 	__u64 __user vec; // page_region

__user is a marking for a pointer, not a u64, right?  Now the fact that
you treat it like a pointer later in the kernel is different, but that
shouldn't be on the uapi header file.  You can put it in the kerneldoc,
which you did not do.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 10:36             ` Muhammad Usama Anjum
  2022-10-18 10:48               ` Greg KH
@ 2022-10-18 11:11               ` Michał Mirosław
  2022-10-18 13:22                 ` Muhammad Usama Anjum
  2022-10-18 13:32               ` Muhammad Usama Anjum
  2 siblings, 1 reply; 24+ messages in thread
From: Michał Mirosław @ 2022-10-18 11:11 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Danylo Mocherniuk, avagin, linux-mm, akpm, gregkh, corbet, david,
	kernel, krisman, linux-doc, linux-fsdevel, linux-kernel,
	linux-kselftest, peter.enderborg, shuah, viro, willy, figiel,
	kyurtsever, Paul Gofman, surenb

On Tue, 18 Oct 2022 at 12:36, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
[...]
> I've included the masks which the CRIU developers have specified.
> max_out_page is another new optional variable which is needed to
> terminate the operation without visiting all the pages after finding the
> max_out_page number of desired pages. There is no way to terminate the
> operation without this variable.
>
> How does the interface looks now? Please comment.
>
> /* PAGEMAP IOCTL */
> #define PAGEMAP_GET             _IOWR('f', 16, struct pagemap_sd_args)
> #define PAGEMAP_CLEAR           _IOWR('f', 17, struct pagemap_sd_args)
> #define PAGEMAP_GET_AND_CLEAR   _IOWR('f', 18, struct pagemap_sd_args)

Why are three IOCTLs needed? Could CLEAR be a flag (like the
PAGEMAP_NO_REUSED_REGIONS) or 'cmask' and GET be implied when vec !=
NULL?

> /* Bits are set in the bitmap of the page_region and masks in
> pagemap_sd_args */
> #define PAGE_IS_SD      1 << 0
> #define PAGE_IS_FILE    1 << 1
> #define PAGE_IS_PRESENT 1 << 2
> #define PAGE_IS_SWAPED  1 << 3
>
> /**
>   * struct page_region - Page region with bitmap flags
>   * @start:     Start of the region
>   * @len:       Length of the region
>   * bitmap:     Bits sets for the region
>   */
> struct page_region {
>         __u64 start;
>         __u64 len;
>         __u64 bitmap;
> };

Could you explain what units start and len are using? Are they bytes
or pages (what size)?

> /**
>   * struct pagemap_sd_args - Soft-dirty IOCTL argument

Nit: it's not soft-dirty-specific anymore. Maybe "pagemap_scan_args"?

>   * @start:             Starting address
>   * @len:               Length of the region
>   * @vec:               Output page_region struct array
>   * @vec_len:           Length of the page_region struct array
>   * @max_out_page:      Optional max output pages (It must be less than
> vec_len if specified)

Why is it required to be less than vec_len? vec_len effectively
specifies max number of ranges to find, and this new additional field
counts pages, I suppose?
BTW, if we count pages, then what size of them? Maybe using bytes
(matching start/len fields) would be more consistent?

>   * @flags:             Special flags for the IOCTL
>   * @rmask:             Special flags for the IOCTL
>   * @amask:             Special flags for the IOCTL
>   * @emask:             Special flags for the IOCTL
>   * @__reserved:                Reserved member to preserve data alignment. Must be 0.
>   */
> struct pagemap_sd_args {
>         __u64 __user start;
>         __u64 len;
>         __u64 __user vec; // page_region
>         __u64 vec_len;    // sizeof(page_region)
>         __u32 flags;      // special flags
>         __u32 rmask;
>         __u32 amask;
>         __u32 emask;
>         __u32 max_out_page;
>         __u32 __reserved;
> };
>
> /* Special flags */
> #define PAGEMAP_NO_REUSED_REGIONS       0x1

What does this flag do?

Best Regards
Michał Mirosław


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 10:48               ` Greg KH
@ 2022-10-18 12:30                 ` Muhammad Usama Anjum
  0 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-18 12:30 UTC (permalink / raw)
  To: Greg KH
  Cc: Muhammad Usama Anjum, Danylo Mocherniuk, avagin, linux-mm, akpm,
	corbet, david, kernel, krisman, linux-doc, linux-fsdevel,
	linux-kernel, linux-kselftest, peter.enderborg, shuah, viro,
	willy, emmir, figiel, kyurtsever, Paul Gofman, surenb

On 10/18/22 3:48 PM, Greg KH wrote:
> On Tue, Oct 18, 2022 at 03:36:24PM +0500, Muhammad Usama Anjum wrote:
>> /**
>>   * struct pagemap_sd_args - Soft-dirty IOCTL argument
>>   * @start:		Starting address
>>   * @len:		Length of the region
>>   * @vec:		Output page_region struct array
>>   * @vec_len:		Length of the page_region struct array
>>   * @max_out_page:	Optional max output pages (It must be less than vec_len if
>> specified)
>>   * @flags:		Special flags for the IOCTL
>>   * @rmask:		Special flags for the IOCTL
>>   * @amask:		Special flags for the IOCTL
>>   * @emask:		Special flags for the IOCTL
> 
> What do you mean exactly by "special flags"?
Sorry typo in the comments above. Optional flag can be specified in the 
flag. At the moment, there is only one flag(PAGEMAP_NO_REUSED_REGIONS).

/**
  * struct pagemap_sd_args - Soft-dirty IOCTL argument
  * @start:		Starting address
  * @len:		Length of the region
  * @vec:		Output page_region struct array
  * @vec_len:		Length of the page_region struct array
  * @max_out_page:	Optional max output pages (It must be less than 
vec_len if specified)
  * @flags:		Special flags for the IOCTL
  * @rmask:		Required mask - All of these bits have to be set
  * @amask:		Any mask - Any of these bits are set
  * @emask:		Exclude mask - None of these bits are set
  * @rmask:		Bits that have to be reported to the user in page_region
  */
struct pagemap_scan_args {
	__u64 __user start;
	__u64 len;
	__u64 __user vec;
	__u64 vec_len;
	__u32 max_out_page;
	__u32 flags;
	__u32 rmask;
	__u32 amask;
	__u32 emask;
	__u32 rmask;
};

> 
>>   * @__reserved:		Reserved member to preserve data alignment. Must be 0.
>>   */
>> struct pagemap_sd_args {
>> 	__u64 __user start;
>> 	__u64 len;
>> 	__u64 __user vec; // page_region
> 
> __user is a marking for a pointer, not a u64, right?  Now the fact that
> you treat it like a pointer later in the kernel is different, but that
> shouldn't be on the uapi header file.  You can put it in the kerneldoc,
> which you did not do.
I'll update.

> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 11:11               ` Michał Mirosław
@ 2022-10-18 13:22                 ` Muhammad Usama Anjum
  2022-10-18 17:17                   ` Michał Mirosław
  0 siblings, 1 reply; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-18 13:22 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Muhammad Usama Anjum, Danylo Mocherniuk, avagin, linux-mm, akpm,
	gregkh, corbet, david, kernel, krisman, linux-doc, linux-fsdevel,
	linux-kernel, linux-kselftest, peter.enderborg, shuah, viro,
	willy, figiel, kyurtsever, Paul Gofman, surenb

On 10/18/22 4:11 PM, Michał Mirosław wrote:
> On Tue, 18 Oct 2022 at 12:36, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> [...]
>> I've included the masks which the CRIU developers have specified.
>> max_out_page is another new optional variable which is needed to
>> terminate the operation without visiting all the pages after finding the
>> max_out_page number of desired pages. There is no way to terminate the
>> operation without this variable.
>>
>> How does the interface looks now? Please comment.
>>
>> /* PAGEMAP IOCTL */
>> #define PAGEMAP_GET             _IOWR('f', 16, struct pagemap_sd_args)
>> #define PAGEMAP_CLEAR           _IOWR('f', 17, struct pagemap_sd_args)
>> #define PAGEMAP_GET_AND_CLEAR   _IOWR('f', 18, struct pagemap_sd_args)
> 
> Why are three IOCTLs needed? Could CLEAR be a flag (like the
> PAGEMAP_NO_REUSED_REGIONS) or 'cmask' and GET be implied when vec !=
> NULL?
Makes sense. Yes, this can be done.

> 
>> /* Bits are set in the bitmap of the page_region and masks in
>> pagemap_sd_args */
>> #define PAGE_IS_SD      1 << 0
>> #define PAGE_IS_FILE    1 << 1
>> #define PAGE_IS_PRESENT 1 << 2
>> #define PAGE_IS_SWAPED  1 << 3
>>
>> /**
>>    * struct page_region - Page region with bitmap flags
>>    * @start:     Start of the region
>>    * @len:       Length of the region
>>    * bitmap:     Bits sets for the region
>>    */
>> struct page_region {
>>          __u64 start;
>>          __u64 len;
>>          __u64 bitmap;
>> };
> 
> Could you explain what units start and len are using? Are they bytes
> or pages (what size)?
These are page sizes. Start must be the starting address of the page. 
Length don't need to be the exact multiple of the page size. All the 
pages in the length are included just like mincore() syscall.

> 
>> /**
>>    * struct pagemap_sd_args - Soft-dirty IOCTL argument
> 
> Nit: it's not soft-dirty-specific anymore. Maybe "pagemap_scan_args"?
> 
>>    * @start:             Starting address
>>    * @len:               Length of the region
>>    * @vec:               Output page_region struct array
>>    * @vec_len:           Length of the page_region struct array
>>    * @max_out_page:      Optional max output pages (It must be less than
>> vec_len if specified)
> 
> Why is it required to be less than vec_len? vec_len effectively
> specifies max number of ranges to find, and this new additional field
> counts pages, I suppose?
> BTW, if we count pages, then what size of them? Maybe using bytes
> (matching start/len fields) would be more consistent?
Yes, it if for counting pages. As the regions can have multiple pages, 
user cannot specify through the number of regions that how many pages 
does he need. Page size is used here as well like the start and len. 
This is optional argument as this is only needed to emulate the Windows 
syscall getWriteWatch.

> 
>>    * @flags:             Special flags for the IOCTL
>>    * @rmask:             Special flags for the IOCTL
>>    * @amask:             Special flags for the IOCTL
>>    * @emask:             Special flags for the IOCTL
>>    * @__reserved:                Reserved member to preserve data alignment. Must be 0.
>>    */
>> struct pagemap_sd_args {
>>          __u64 __user start;
>>          __u64 len;
>>          __u64 __user vec; // page_region
>>          __u64 vec_len;    // sizeof(page_region)
>>          __u32 flags;      // special flags
>>          __u32 rmask;
>>          __u32 amask;
>>          __u32 emask;
>>          __u32 max_out_page;
>>          __u32 __reserved;
>> };
>>
>> /* Special flags */
>> #define PAGEMAP_NO_REUSED_REGIONS       0x1
> 
> What does this flag do?
Some non-dirty pages get marked as dirty because of the kernel's 
internal activity. The dirty bit of the pages is stored in the VMA flags 
and in the per page flags. If any of these two bits are set, the page is 
considered to be dirty. Suppose you have cleared the dirty bit of half 
of VMA which will be done by splitting the VMA and clearing dirty flag 
in the half VMA and the pages in it. Now kernel may decide to merge the 
VMAs again as dirty bit of VMAs isn't considered if the VMAs should be 
merged. So the half VMA becomes dirty again. This splitting/merging 
costs performance. The application receives a lot of pages which aren't 
dirty in reality but marked as dirty. Performance is lost again here.

This PAGEMAP_NO_REUSED_REGIONS flag is used to don't depend on the dirty 
flag in the VMA flags. It only depends on the individual page dirty bit. 
With doing this, the new memory regions which are just created, doesn't 
look like dirty when seen with the IOCTL, but look dirty when seen from 
pagemap. This seems okay as the user of this flag know the implication 
of using it.

> 
> Best Regards
> Michał Mirosław


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 10:36             ` Muhammad Usama Anjum
  2022-10-18 10:48               ` Greg KH
  2022-10-18 11:11               ` Michał Mirosław
@ 2022-10-18 13:32               ` Muhammad Usama Anjum
  2022-10-18 18:20                 ` Greg KH
  2 siblings, 1 reply; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-18 13:32 UTC (permalink / raw)
  To: Danylo Mocherniuk, avagin, linux-mm, akpm, gregkh
  Cc: corbet, david, kernel, krisman, linux-doc, linux-fsdevel,
	linux-kernel, linux-kselftest, peter.enderborg, shuah, viro,
	willy, emmir, figiel, kyurtsever, Paul Gofman, surenb

On 10/18/22 3:36 PM, Muhammad Usama Anjum wrote:
>>>>>>> I mean we should be able to specify for what pages we need to get 
>>>>>>> info
>>>>>>> for. An ioctl argument can have these four fields:
>>>>>>> * required bits (rmask & mask == mask) - all bits from this mask 
>>>>>>> have to be set.
>>>>>>> * any of these bits (amask & mask != 0) - any of these bits is set.
>>>>>>> * exclude masks (emask & mask == 0) = none of these bits are set.
>>>>>>> * return mask - bits that have to be reported to user.
>>>> The required mask (rmask) makes sense to me. At the moment, I only know
>>>> about the practical use case for the required mask. Can you share how
>>>> can any and exclude masks help for the CRIU?
>>>>
>>>
>>> I looked at should_dump_page in the CRIU code:
>>> https://github.com/checkpoint-restore/criu/blob/45641ab26d7bb78706a6215fdef8f9133abf8d10/criu/mem.c#L102
>>>
>>> When CRIU dumps file private mappings, it needs to get pages that have
>>> PME_PRESENT or PME_SWAP but don't have PME_FILE.
>>
>> I would really like to see the mask discussed will be adopted. With it 
>> CRIU will
>> be able to migrate huge sparse VMAs assuming that a single hole is 
>> processed in
>> O(1) time.
>>
>> Use cases for migrating sparse VMAs are binaries sanitized with ASAN, 
>> MSAN or
>> TSAN [1]. All of these sanitizers produce sparse mappings of shadow 
>> memory [2].
>> Being able to migrate such binaries allows to highly reduce the amount 
>> of work
>> needed to identify and fix post-migration crashes, which happen 
>> constantly.
>>
> 
> Hello all,
> 
> I've included the masks which the CRIU developers have specified. 
> max_out_page is another new optional variable which is needed to 
> terminate the operation without visiting all the pages after finding the 
> max_out_page number of desired pages. There is no way to terminate the 
> operation without this variable.
> 
> How does the interface looks now? Please comment.
> 
Updated interface with only one IOCTL. If vec is defined, get operation 
will be performed. If PAGEMAP_SD_CLEAR flag is specified, soft dirty bit 
will be cleared as well. CLEAR flag can only be specified for clearing 
soft dirty bit.

/* PAGEMAP IOCTL */
#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_sd_args)

/* Bits are set in the bitmap of the page_region and masks in 
pagemap_sd_args */
#define PAGE_IS_SD	1 << 0
#define PAGE_IS_FILE	1 << 1
#define PAGE_IS_PRESENT	1 << 2
#define PAGE_IS_SWAPED	1 << 3

/**
  * struct page_region - Page region with bitmap flags
  * @start:	Start of the region
  * @len:	Length of the region
  * bitmap:	Bits sets for the region
  */
struct page_region {
	__u64 start;
	__u64 len;
	__u64 bitmap;
};

/**
  * struct pagemap_sd_args - Soft-dirty IOCTL argument
  * @start:		Starting address of the page
  * @len:		Length of the region (All the pages in this length are included)
  * @vec:		Output page_region struct array
  * @vec_len:		Length of the page_region struct array
  * @max_out_page:	Optional max output pages (It must be less than 
vec_len if specified)
  * @flags:		Special flags for the IOCTL
  * @rmask:		Required mask - All of these bits have to be set
  * @amask:		Any mask - Any of these bits are set
  * @emask:		Exclude mask - None of these bits are set
  * @rmask:		Bits that have to be reported to the user in page_region
  */
struct pagemap_scan_args {
	__u64 __user start;
	__u64 len;
	__u64 __user vec;
	__u64 vec_len;
	__u32 max_out_page;
	__u32 flags;
	__u32 rmask;
	__u32 amask;
	__u32 emask;
	__u32 rmask;
};

/* Special flags */
#define PAGEMAP_SD_CLEAR		1 << 0
#define PAGEMAP_NO_REUSED_REGIONS	1 << 1


> /* PAGEMAP IOCTL */
> #define PAGEMAP_GET        _IOWR('f', 16, struct pagemap_sd_args)
> #define PAGEMAP_CLEAR        _IOWR('f', 17, struct pagemap_sd_args)
> #define PAGEMAP_GET_AND_CLEAR    _IOWR('f', 18, struct pagemap_sd_args)
> 
> /* Bits are set in the bitmap of the page_region and masks in 
> pagemap_sd_args */
> #define PAGE_IS_SD    1 << 0
> #define PAGE_IS_FILE    1 << 1
> #define PAGE_IS_PRESENT    1 << 2
> #define PAGE_IS_SWAPED    1 << 3
> 
> /**
>   * struct page_region - Page region with bitmap flags
>   * @start:    Start of the region
>   * @len:    Length of the region
>   * bitmap:    Bits sets for the region
>   */
> struct page_region {
>      __u64 start;
>      __u64 len;
>      __u64 bitmap;
> };
> 
> /**
>   * struct pagemap_sd_args - Soft-dirty IOCTL argument
>   * @start:        Starting address
>   * @len:        Length of the region
>   * @vec:        Output page_region struct array
>   * @vec_len:        Length of the page_region struct array
>   * @max_out_page:    Optional max output pages (It must be less than 
> vec_len if specified)
>   * @flags:        Special flags for the IOCTL
>   * @rmask:        Special flags for the IOCTL
>   * @amask:        Special flags for the IOCTL
>   * @emask:        Special flags for the IOCTL
>   * @__reserved:        Reserved member to preserve data alignment. Must 
> be 0.
>   */
> struct pagemap_sd_args {
>      __u64 __user start;
>      __u64 len;
>      __u64 __user vec; // page_region
>      __u64 vec_len;    // sizeof(page_region)
>      __u32 flags;      // special flags
>      __u32 rmask;
>      __u32 amask;
>      __u32 emask;
>      __u32 max_out_page;
>      __u32 __reserved;
> };
> 
> /* Special flags */
> #define PAGEMAP_NO_REUSED_REGIONS    0x1
> 
> 
>>>
>>>>>>>> - Clear the pages which are soft-dirty.
>>>>>>>> - The optional flag to ignore the VM_SOFTDIRTY and only track 
>>>>>>>> per page
>>>>>>>> soft-dirty PTE bit
>>>>>>>>
>>>>>>>> There are two decisions which have been taken about how to get 
>>>>>>>> the output
>>>>>>>> from the syscall.
>>>>>>>> - Return offsets of the pages from the start in the vec
>>>>>>>
>>>>>>> We can conside to return regions that contains pages with the 
>>>>>>> same set
>>>>>>> of bits.
>>>>>>>
>>>>>>> struct page_region {
>>>>>>>        void *start;
>>>>>>>        long size;
>>>>>>>        u64 bitmap;
>>>>>>> }
>>>>>>>
>>>>>>> And ioctl returns arrays of page_region-s. I believe it will be more
>>>>>>> compact form for many cases.
>>>>>> Thank you for mentioning this. I'd considered this while development.
>>>>>> But I gave up and used the simple array to return the offsets of the
>>>>>> pages as in the problem I'm trying to solve, the dirty pages may be
>>>>>> present amid non-dirty pages. The range may not be useful in that 
>>>>>> case.
>>>>>
>>>>> This is a good example. If we expect more than two consequent pages
>>>>> on average, the "region" interface looks more prefered. I don't 
>>>>> know your
>>>>> use-case, but in the case of CRIU, this assumption looks reasonable.
>>
>> Plus one for page_region data structure. It will make ASAN shadow memory
>> representation much more compact as well as any other classical use-case.
>>
>> [1] https://github.com/google/sanitizers
>> [2] 
>> https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
>>
>> Best,
>> Danylo
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 13:22                 ` Muhammad Usama Anjum
@ 2022-10-18 17:17                   ` Michał Mirosław
  2022-10-19  7:18                     ` Muhammad Usama Anjum
  0 siblings, 1 reply; 24+ messages in thread
From: Michał Mirosław @ 2022-10-18 17:17 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Danylo Mocherniuk, avagin, linux-mm, akpm, gregkh, corbet, david,
	kernel, krisman, linux-doc, linux-fsdevel, linux-kernel,
	linux-kselftest, peter.enderborg, shuah, viro, willy, figiel,
	kyurtsever, Paul Gofman, surenb

On Tue, 18 Oct 2022 at 15:23, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> On 10/18/22 4:11 PM, Michał Mirosław wrote:
> > On Tue, 18 Oct 2022 at 12:36, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >>    * @start:             Starting address
> >>    * @len:               Length of the region
> >>    * @vec:               Output page_region struct array
> >>    * @vec_len:           Length of the page_region struct array
> >>    * @max_out_page:      Optional max output pages (It must be less than
> >> vec_len if specified)
> >
> > Why is it required to be less than vec_len? vec_len effectively
> > specifies max number of ranges to find, and this new additional field
> > counts pages, I suppose?
> > BTW, if we count pages, then what size of them? Maybe using bytes
> > (matching start/len fields) would be more consistent?
> Yes, it if for counting pages. As the regions can have multiple pages,
> user cannot specify through the number of regions that how many pages
> does he need. Page size is used here as well like the start and len.
> This is optional argument as this is only needed to emulate the Windows
> syscall getWriteWatch.

I'm wondering about the condition that max_out_page < vec_len. Since
both count different things (pages vs ranges) I would expect there is
no strict relation between them and information returned is as much as
fits both (IOW: at most vec_len ranges spanning not more than
max_out_page pages). The field's name and description I'd suggest
improving: maybe 'max_pages' with a comment that 0 = unlimited?

[...]
> >> /* Special flags */
> >> #define PAGEMAP_NO_REUSED_REGIONS       0x1
> >
> > What does this flag do?
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity. The dirty bit of the pages is stored in the VMA flags
> and in the per page flags. If any of these two bits are set, the page is
> considered to be dirty. Suppose you have cleared the dirty bit of half
> of VMA which will be done by splitting the VMA and clearing dirty flag
> in the half VMA and the pages in it. Now kernel may decide to merge the
> VMAs again as dirty bit of VMAs isn't considered if the VMAs should be
> merged. So the half VMA becomes dirty again. This splitting/merging
> costs performance. The application receives a lot of pages which aren't
> dirty in reality but marked as dirty. Performance is lost again here.
>
> This PAGEMAP_NO_REUSED_REGIONS flag is used to don't depend on the dirty
> flag in the VMA flags. It only depends on the individual page dirty bit.
> With doing this, the new memory regions which are just created, doesn't
> look like dirty when seen with the IOCTL, but look dirty when seen from
> pagemap. This seems okay as the user of this flag know the implication
> of using it.

Thanks for explaining! Could you include this as a comment in the patch?

Best Regards
Michał Mirosław


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 13:32               ` Muhammad Usama Anjum
@ 2022-10-18 18:20                 ` Greg KH
  0 siblings, 0 replies; 24+ messages in thread
From: Greg KH @ 2022-10-18 18:20 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Danylo Mocherniuk, avagin, linux-mm, akpm, corbet, david, kernel,
	krisman, linux-doc, linux-fsdevel, linux-kernel, linux-kselftest,
	peter.enderborg, shuah, viro, willy, emmir, figiel, kyurtsever,
	Paul Gofman, surenb

On Tue, Oct 18, 2022 at 06:32:46PM +0500, Muhammad Usama Anjum wrote:
> On 10/18/22 3:36 PM, Muhammad Usama Anjum wrote:
> > > > > > > > I mean we should be able to specify for what
> > > > > > > > pages we need to get info
> > > > > > > > for. An ioctl argument can have these four fields:
> > > > > > > > * required bits (rmask & mask == mask) - all
> > > > > > > > bits from this mask have to be set.
> > > > > > > > * any of these bits (amask & mask != 0) - any of these bits is set.
> > > > > > > > * exclude masks (emask & mask == 0) = none of these bits are set.
> > > > > > > > * return mask - bits that have to be reported to user.
> > > > > The required mask (rmask) makes sense to me. At the moment, I only know
> > > > > about the practical use case for the required mask. Can you share how
> > > > > can any and exclude masks help for the CRIU?
> > > > > 
> > > > 
> > > > I looked at should_dump_page in the CRIU code:
> > > > https://github.com/checkpoint-restore/criu/blob/45641ab26d7bb78706a6215fdef8f9133abf8d10/criu/mem.c#L102
> > > > 
> > > > When CRIU dumps file private mappings, it needs to get pages that have
> > > > PME_PRESENT or PME_SWAP but don't have PME_FILE.
> > > 
> > > I would really like to see the mask discussed will be adopted. With
> > > it CRIU will
> > > be able to migrate huge sparse VMAs assuming that a single hole is
> > > processed in
> > > O(1) time.
> > > 
> > > Use cases for migrating sparse VMAs are binaries sanitized with
> > > ASAN, MSAN or
> > > TSAN [1]. All of these sanitizers produce sparse mappings of shadow
> > > memory [2].
> > > Being able to migrate such binaries allows to highly reduce the
> > > amount of work
> > > needed to identify and fix post-migration crashes, which happen
> > > constantly.
> > > 
> > 
> > Hello all,
> > 
> > I've included the masks which the CRIU developers have specified.
> > max_out_page is another new optional variable which is needed to
> > terminate the operation without visiting all the pages after finding the
> > max_out_page number of desired pages. There is no way to terminate the
> > operation without this variable.
> > 
> > How does the interface looks now? Please comment.
> > 
> Updated interface with only one IOCTL. If vec is defined, get operation will
> be performed. If PAGEMAP_SD_CLEAR flag is specified, soft dirty bit will be
> cleared as well. CLEAR flag can only be specified for clearing soft dirty
> bit.
> 
> /* PAGEMAP IOCTL */
> #define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_sd_args)
> 
> /* Bits are set in the bitmap of the page_region and masks in
> pagemap_sd_args */
> #define PAGE_IS_SD	1 << 0
> #define PAGE_IS_FILE	1 << 1
> #define PAGE_IS_PRESENT	1 << 2
> #define PAGE_IS_SWAPED	1 << 3
> 
> /**
>  * struct page_region - Page region with bitmap flags
>  * @start:	Start of the region
>  * @len:	Length of the region
>  * bitmap:	Bits sets for the region
>  */
> struct page_region {
> 	__u64 start;
> 	__u64 len;
> 	__u64 bitmap;
> };
> 
> /**
>  * struct pagemap_sd_args - Soft-dirty IOCTL argument
>  * @start:		Starting address of the page
>  * @len:		Length of the region (All the pages in this length are included)
>  * @vec:		Output page_region struct array
>  * @vec_len:		Length of the page_region struct array
>  * @max_out_page:	Optional max output pages (It must be less than vec_len if
> specified)
>  * @flags:		Special flags for the IOCTL
>  * @rmask:		Required mask - All of these bits have to be set

Why have it at all if it always has to be all 1s?

>  * @amask:		Any mask - Any of these bits are set

which ones?

>  * @emask:		Exclude mask - None of these bits are set

Why have it, if none are ever set?

>  * @rmask:		Bits that have to be reported to the user in page_region

I feel like I have no idea what these bits are...

Anyway, please send a real patch, with real code, so we have a better
idea of what is happening.  AFTER you have tested and made it all work
properly.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE
  2022-10-18 17:17                   ` Michał Mirosław
@ 2022-10-19  7:18                     ` Muhammad Usama Anjum
  0 siblings, 0 replies; 24+ messages in thread
From: Muhammad Usama Anjum @ 2022-10-19  7:18 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Muhammad Usama Anjum, Danylo Mocherniuk, avagin, linux-mm, akpm,
	gregkh, corbet, david, kernel, krisman, linux-doc, linux-fsdevel,
	linux-kernel, linux-kselftest, peter.enderborg, shuah, viro,
	willy, figiel, kyurtsever, Paul Gofman, surenb

On 10/18/22 10:17 PM, Michał Mirosław wrote:
> On Tue, 18 Oct 2022 at 15:23, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> On 10/18/22 4:11 PM, Michał Mirosław wrote:
>>> On Tue, 18 Oct 2022 at 12:36, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>     * @start:             Starting address
>>>>     * @len:               Length of the region
>>>>     * @vec:               Output page_region struct array
>>>>     * @vec_len:           Length of the page_region struct array
>>>>     * @max_out_page:      Optional max output pages (It must be less than
>>>> vec_len if specified)
>>>
>>> Why is it required to be less than vec_len? vec_len effectively
>>> specifies max number of ranges to find, and this new additional field
>>> counts pages, I suppose?
>>> BTW, if we count pages, then what size of them? Maybe using bytes
>>> (matching start/len fields) would be more consistent?
>> Yes, it if for counting pages. As the regions can have multiple pages,
>> user cannot specify through the number of regions that how many pages
>> does he need. Page size is used here as well like the start and len.
>> This is optional argument as this is only needed to emulate the Windows
>> syscall getWriteWatch.
> 
> I'm wondering about the condition that max_out_page < vec_len. Since
> both count different things (pages vs ranges) I would expect there is
> no strict relation between them and information returned is as much as
> fits both (IOW: at most vec_len ranges spanning not more than
> max_out_page pages). The field's name and description I'd suggest
> improving: maybe 'max_pages' with a comment that 0 = unlimited?
Correct, max_pages with this comment is what I want. I'll update. 
vec_len represents the total number of the page_range array elements. If 
the pages which we want to return are sparse or the consective pages 
have different flags, we'll only return one page in one page_range 
struct. In this case if someone has specified max_pages to be 10, 
vec_len must be at least 10 to keep store the 10 pages. So max_pages <= 
vec_len.

> 
> [...]
>>>> /* Special flags */
>>>> #define PAGEMAP_NO_REUSED_REGIONS       0x1
>>>
>>> What does this flag do?
>> Some non-dirty pages get marked as dirty because of the kernel's
>> internal activity. The dirty bit of the pages is stored in the VMA flags
>> and in the per page flags. If any of these two bits are set, the page is
>> considered to be dirty. Suppose you have cleared the dirty bit of half
>> of VMA which will be done by splitting the VMA and clearing dirty flag
>> in the half VMA and the pages in it. Now kernel may decide to merge the
>> VMAs again as dirty bit of VMAs isn't considered if the VMAs should be
>> merged. So the half VMA becomes dirty again. This splitting/merging
>> costs performance. The application receives a lot of pages which aren't
>> dirty in reality but marked as dirty. Performance is lost again here.
>>
>> This PAGEMAP_NO_REUSED_REGIONS flag is used to don't depend on the dirty
>> flag in the VMA flags. It only depends on the individual page dirty bit.
>> With doing this, the new memory regions which are just created, doesn't
>> look like dirty when seen with the IOCTL, but look dirty when seen from
>> pagemap. This seems okay as the user of this flag know the implication
>> of using it.
> 
> Thanks for explaining! Could you include this as a comment in the patch?
Will do.

> 
> Best Regards
> Michał Mirosław


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-10-19  7:18 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-26  6:45 [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
2022-08-26  6:45 ` [PATCH v3 1/4] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
2022-08-26  6:45 ` [PATCH v3 2/4] fs/proc/task_mmu: Implement IOCTL to get and clear soft dirty " Muhammad Usama Anjum
2022-08-26  6:45 ` [PATCH v3 3/4] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
2022-08-26  6:45 ` [PATCH v3 4/4] mm: add documentation of the new ioctl on pagemap Muhammad Usama Anjum
2022-08-26  8:22   ` Bagas Sanjaya
2022-09-07  9:40 ` [PATCH v3 0/4] Implement IOCTL to get and clear soft dirty PTE Muhammad Usama Anjum
2022-09-19 14:58 ` Andrei Vagin
2022-09-21 18:26   ` Muhammad Usama Anjum
2022-09-28 17:24     ` Andrei Vagin
2022-10-03 11:21       ` Muhammad Usama Anjum
2022-10-11  4:52         ` Andrei Vagin
2022-10-14 13:48           ` Danylo Mocherniuk
2022-10-18 10:36             ` Muhammad Usama Anjum
2022-10-18 10:48               ` Greg KH
2022-10-18 12:30                 ` Muhammad Usama Anjum
2022-10-18 11:11               ` Michał Mirosław
2022-10-18 13:22                 ` Muhammad Usama Anjum
2022-10-18 17:17                   ` Michał Mirosław
2022-10-19  7:18                     ` Muhammad Usama Anjum
2022-10-18 13:32               ` Muhammad Usama Anjum
2022-10-18 18:20                 ` Greg KH
2022-09-21 18:30 ` Muhammad Usama Anjum
2022-09-28  6:03 ` Muhammad Usama Anjum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).