linux-kselftest.vger.kernel.org archive mirror
* [PATCH v6 0/3] Implement IOCTL to get and/or clear the info about PTEs
@ 2022-11-09 10:23 Muhammad Usama Anjum
  2022-11-09 10:23 ` [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-09 10:23 UTC (permalink / raw)
  To: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, Muhammad Usama Anjum, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

Changes in v6:
- Updated the interface and made cosmetic changes

Original Cover Letter in v5:
Hello,

This patch series implements an IOCTL on the pagemap procfs file to get
information about the page table entries (PTEs). The following operations
are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
  or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.

The soft-dirty PTE bit of memory pages can be read through the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of a
process can be cleared by writing to the clear_refs file. This information
can also be mimicked entirely in userspace, but only with poor
performance, by using:
- the mprotect syscall and a SIGSEGV handler for bookkeeping
- the userfaultfd syscall with a handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds capabilities that
weren't available earlier:
- Atomically getting the soft-dirty PTE bit status and clearing it in one
  operation.
- Clearing the soft-dirty PTE bit of only a part of the memory range.

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and for clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit of only
specific pages on demand, and to track and clear a region of memory while
the process is running, in order to emulate the getWriteWatch() syscall of
Windows. That syscall is used by games to keep track of dirty pages and to
process only the dirty pages.

The information about whether a page is file mapped, present or swapped is
required for the CRIU project[2][3]. Adding the required mask, any mask,
excluded mask and return mask is also needed for the CRIU project[2].

The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. max_pages supports the use case where the user only wants a specific
number of pages: when max_pages is specified, there is no need to find all
the pages of interest in the range, and the IOCTL returns as soon as the
maximum number of pages has been found. max_pages is optional. If
max_pages is specified, it must be equal to or greater than vec_size. This
restriction is needed to handle the worst case in which each page_region
contains information about only one page and cannot be compacted. This is
needed to emulate the Windows getWriteWatch() syscall.
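
As a rough illustration, a userspace caller could look like the sketch
below. It is only a sketch: it assumes the uapi definitions added by patch
2 of this series (PAGEMAP_SCAN, struct pagemap_scan_arg, struct
page_region and the PAGE_IS_*/PAGEMAP_* flags) are available through the
installed headers and that CONFIG_MEM_SOFT_DIRTY is enabled; the helper
name and buffer size are made up and error handling is trimmed:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* uapi additions from patch 2 of this series */

/* Get and atomically clear the soft-dirty pages of a page-aligned range. */
static int scan_and_clear_soft_dirty(void *addr, unsigned long len)
{
	struct page_region regions[32];
	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)addr,
		.len = len,
		.vec = (uintptr_t)regions,
		.vec_len = 32,
		.max_pages = 32,		/* stop after 32 matching pages */
		.flags = PAGEMAP_SOFTDIRTY_CLEAR,
		.required_mask = PAGE_IS_SOFTDIRTY,
		.return_mask = PAGE_IS_SOFTDIRTY,
	};
	int fd, n, i;

	fd = open("/proc/self/pagemap", O_RDWR);
	if (fd < 0)
		return -1;

	/* On success the IOCTL returns the number of filled page_region entries. */
	n = ioctl(fd, PAGEMAP_SCAN, &arg);
	for (i = 0; i < n; i++)
		printf("dirty region: start 0x%llx, %llu page(s)\n",
		       (unsigned long long)regions[i].start,
		       (unsigned long long)regions[i].len);

	close(fd);
	return n;
}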

Some non-dirty pages get reported as dirty because of the kernel's
internal activity (such as VMA merging, as the soft-dirty bit difference
isn't considered while deciding to merge VMAs). The dirty status is
stored both in the VMA flags and in the per-page flags. If either of
these two bits is set, the page is considered to be soft dirty. Suppose
the soft-dirty bit of half of a VMA is cleared; this is done by splitting
the VMA and clearing the soft-dirty flag in the new half VMA and in its
pages. The kernel may later decide to merge the VMAs again, so the half
VMA becomes dirty again. This splitting/merging costs performance. The
application also receives a lot of pages which are reported dirty but
aren't dirty in reality, and performance is lost again there. In addition,
sometimes the user doesn't want newly allocated memory to be reported as
dirty. The PAGEMAP_NO_REUSED_REGIONS flag solves both problems: it makes
the scan not depend on the soft-dirty flag in the VMA flags, so no VMA
splitting or merging happens, and only the soft-dirty bit of the
individual pages is used. By using this flag, there may be a scenario
where newly created memory regions don't look dirty when seen with the
IOCTL, but do look dirty when seen from procfs. This seems okay as the
users of this flag know the implication of using it.
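
To make the difference concrete, here is another hedged sketch (again
assuming only the uapi from patch 2 and CONFIG_MEM_SOFT_DIRTY; the mapping
size is arbitrary and error handling is omitted). With the default
behaviour the fresh mapping is reported dirty because of VM_SOFTDIRTY on
the VMA; with PAGEMAP_NO_REUSED_REGIONS only pages whose PTE soft-dirty
bit is set should be reported:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *mem = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct page_region vec[10];
	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)mem,
		.len = 10 * page,
		.vec = (uintptr_t)vec,
		.vec_len = 10,
		.required_mask = PAGE_IS_SOFTDIRTY,
		.return_mask = PAGE_IS_SOFTDIRTY,
	};
	int fd = open("/proc/self/pagemap", O_RDWR);
	int n;

	mem[0] = 1;	/* write only the first page */

	/* VMA-based soft-dirty: the whole fresh mapping is expected to be
	 * reported dirty (one region covering all 10 pages). */
	n = ioctl(fd, PAGEMAP_SCAN, &arg);
	printf("default: %d region(s), first is %llu page(s)\n",
	       n, n > 0 ? (unsigned long long)vec[0].len : 0);

	/* Per-page soft-dirty bits only: just the written page should show up. */
	arg.flags = PAGEMAP_NO_REUSED_REGIONS;
	n = ioctl(fd, PAGEMAP_SCAN, &arg);
	printf("PAGEMAP_NO_REUSED_REGIONS: %d region(s), first is %llu page(s)\n",
	       n, n > 0 ? (unsigned long long)vec[0].len : 0);

	munmap(mem, 10 * page);
	close(fd);
	return 0;
}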

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (3):
  fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
  fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about PTEs
  selftests: vm: add pagemap ioctl tests

 fs/proc/task_mmu.c                         | 410 +++++++++++-
 include/uapi/linux/fs.h                    |  56 ++
 tools/include/uapi/linux/fs.h              |  56 ++
 tools/testing/selftests/vm/.gitignore      |   1 +
 tools/testing/selftests/vm/Makefile        |   5 +-
 tools/testing/selftests/vm/pagemap_ioctl.c | 698 +++++++++++++++++++++
 6 files changed, 1193 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

-- 
2.30.2



* [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit
  2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or clear the info about PTEs Muhammad Usama Anjum
@ 2022-11-09 10:23 ` Muhammad Usama Anjum
  2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about PTEs Muhammad Usama Anjum
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-09 10:23 UTC (permalink / raw)
  To: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, Muhammad Usama Anjum, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

Update clear_soft_dirty() and clear_soft_dirty_pmd() to optionally clear
the soft-dirty bit and to return whether the page is dirty.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v2:
- Move the functions back to their original file
---
 fs/proc/task_mmu.c | 82 ++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a74cdcc9af0..8235c536ac70 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1095,8 +1095,8 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 	return page_maybe_dma_pinned(page);
 }
 
-static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+static inline bool check_soft_dirty(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte, bool clear)
 {
 	/*
 	 * The soft-dirty tracker uses #PF-s to catch writes
@@ -1105,55 +1105,75 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 	 * of how soft-dirty works.
 	 */
 	pte_t ptent = *pte;
+	int dirty = 0;
 
 	if (pte_present(ptent)) {
 		pte_t old_pte;
 
-		if (pte_is_pinned(vma, addr, ptent))
-			return;
-		old_pte = ptep_modify_prot_start(vma, addr, pte);
-		ptent = pte_wrprotect(old_pte);
-		ptent = pte_clear_soft_dirty(ptent);
-		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+		dirty = pte_soft_dirty(ptent);
+
+		if (dirty && clear && !pte_is_pinned(vma, addr, ptent)) {
+			old_pte = ptep_modify_prot_start(vma, addr, pte);
+			ptent = pte_wrprotect(old_pte);
+			ptent = pte_clear_soft_dirty(ptent);
+			ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+		}
 	} else if (is_swap_pte(ptent)) {
-		ptent = pte_swp_clear_soft_dirty(ptent);
-		set_pte_at(vma->vm_mm, addr, pte, ptent);
+		dirty = pte_swp_soft_dirty(ptent);
+
+		if (dirty && clear) {
+			ptent = pte_swp_clear_soft_dirty(ptent);
+			set_pte_at(vma->vm_mm, addr, pte, ptent);
+		}
 	}
+
+	return !!dirty;
 }
 #else
-static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+static inline bool check_soft_dirty(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte, bool clear)
 {
+	return false;
 }
 #endif
 
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+static inline bool check_soft_dirty_pmd(struct vm_area_struct *vma,
+					unsigned long addr, pmd_t *pmdp, bool clear)
 {
 	pmd_t old, pmd = *pmdp;
+	int dirty = 0;
 
 	if (pmd_present(pmd)) {
-		/* See comment in change_huge_pmd() */
-		old = pmdp_invalidate(vma, addr, pmdp);
-		if (pmd_dirty(old))
-			pmd = pmd_mkdirty(pmd);
-		if (pmd_young(old))
-			pmd = pmd_mkyoung(pmd);
-
-		pmd = pmd_wrprotect(pmd);
-		pmd = pmd_clear_soft_dirty(pmd);
-
-		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		dirty = pmd_soft_dirty(pmd);
+		if (dirty && clear) {
+			/* See comment in change_huge_pmd() */
+			old = pmdp_invalidate(vma, addr, pmdp);
+			if (pmd_dirty(old))
+				pmd = pmd_mkdirty(pmd);
+			if (pmd_young(old))
+				pmd = pmd_mkyoung(pmd);
+
+			pmd = pmd_wrprotect(pmd);
+			pmd = pmd_clear_soft_dirty(pmd);
+
+			set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		}
 	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
-		pmd = pmd_swp_clear_soft_dirty(pmd);
-		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		dirty = pmd_swp_soft_dirty(pmd);
+
+		if (dirty && clear) {
+			pmd = pmd_swp_clear_soft_dirty(pmd);
+			set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+		}
 	}
+	return !!dirty;
 }
 #else
-static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+static inline bool check_soft_dirty_pmd(struct vm_area_struct *vma,
+					unsigned long addr, pmd_t *pmdp, bool clear)
 {
+	return false;
 }
 #endif
 
@@ -1169,7 +1189,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty_pmd(vma, addr, pmd);
+			check_soft_dirty_pmd(vma, addr, pmd, true);
 			goto out;
 		}
 
@@ -1195,7 +1215,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = *pte;
 
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty(vma, addr, pte);
+			check_soft_dirty(vma, addr, pte, true);
 			continue;
 		}
 
-- 
2.30.2



* [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about PTEs
  2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or clear the info about PTEs Muhammad Usama Anjum
  2022-11-09 10:23 ` [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
@ 2022-11-09 10:23 ` Muhammad Usama Anjum
  2022-11-09 23:54   ` Andrei Vagin
                     ` (2 more replies)
  2022-11-09 10:23 ` [PATCH v6 3/3] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-09 10:23 UTC (permalink / raw)
  To: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, Muhammad Usama Anjum, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

This IOCTL, PAGEMAP_SCAN, can be used to get and/or clear the info about
page table entries. The following operations are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
  or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages.

Only the soft-dirty bit can be read and cleared atomically. struct
pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer and its size are specified as vec and vec_len.
- The optional maximum number of requested pages is specified in max_pages.
- The flags can be specified in the flags field. PAGEMAP_SOFTDIRTY_CLEAR
  and PAGEMAP_NO_REUSED_REGIONS are supported.
- The masks are specified in required_mask, anyof_mask, excluded_mask and
  return_mask.

This IOCTL can be extended to get information about more PTE bits.
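
For illustration only, a hedged userspace sketch of combining the masks
(it assumes the definitions added by this patch are available through the
installed headers; the helper name is made up and error handling is
omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Report present or swapped pages that are not file backed in
 * [addr, addr + len), without clearing anything. Returns the number of
 * filled page_region entries; each entry carries start, len (in pages)
 * and a bitmap limited to return_mask.
 */
static int scan_anon_pages(int pagemap_fd, void *addr, unsigned long len,
			   struct page_region *vec, unsigned long vec_len)
{
	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)addr,	/* must be page aligned */
		.len = len,
		.vec = (uintptr_t)vec,
		.vec_len = vec_len,
		.anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED,
		.excluded_mask = PAGE_IS_FILE,
		.return_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED,
	};

	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}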

This is based on a patch from Gabriel Krisman Bertazi.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c            | 328 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h       |  56 ++++++
 tools/include/uapi/linux/fs.h |  56 ++++++
 3 files changed, 440 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8235c536ac70..8d6a84ec5ef7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,9 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <uapi/linux/fs.h>
+#include <linux/vmalloc.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1775,11 +1778,336 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PAGEMAP_OP_MASK		(PAGE_IS_SOFTDIRTY | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NONSD_OP_MASK	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_SD_FLAGS	(PAGEMAP_SOFTDIRTY_CLEAR | PAGEMAP_NO_REUSED_REGIONS)
+#define IS_CLEAR_OP(a)		(a->flags & PAGEMAP_SOFTDIRTY_CLEAR)
+#define IS_GET_OP(a)		(a->vec)
+#define IS_SD_OP(a)		(a->flags & PAGEMAP_SD_FLAGS)
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	unsigned long vec_len;
+	unsigned long vec_index;
+	unsigned int max_pages;
+	unsigned int found_pages;
+	unsigned int flags;
+	unsigned long required_mask;
+	unsigned long anyof_mask;
+	unsigned long excluded_mask;
+	unsigned long return_mask;
+};
+
+static int pagemap_scan_pmd_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (IS_GET_OP(p) && p->max_pages && (p->found_pages == p->max_pages))
+		return -1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int add_to_out(bool sd, bool file, bool pres, bool swap, struct pagemap_scan_private *p,
+		      unsigned long addr, unsigned int len)
+{
+	unsigned long bitmap, cur = sd | file << 1 | pres << 2 | swap << 3;
+	bool cpy = true;
+
+	if (p->required_mask)
+		cpy = ((p->required_mask & cur) == p->required_mask);
+	if (cpy && p->anyof_mask)
+		cpy = (p->anyof_mask & cur);
+	if (cpy && p->excluded_mask)
+		cpy = !(p->excluded_mask & cur);
+
+	bitmap = cur & p->return_mask;
+
+	if (cpy && bitmap) {
+		if ((p->vec_index) && (p->vec[p->vec_index - 1].bitmap == bitmap) &&
+		    (p->vec[p->vec_index - 1].start + p->vec[p->vec_index - 1].len * PAGE_SIZE ==
+		     addr)) {
+			p->vec[p->vec_index - 1].len += len;
+			p->found_pages += len;
+		} else if (p->vec_index < p->vec_len) {
+			p->vec[p->vec_index].start = addr;
+			p->vec[p->vec_index].len = len;
+			p->found_pages += len;
+			p->vec[p->vec_index].bitmap = bitmap;
+			p->vec_index++;
+		} else {
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long addr,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned int len;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+	bool dirty_vma = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ?
+			 (false) : (vma->vm_flags & VM_SOFTDIRTY);
+
+	if ((walk->vma->vm_end < addr) || (p->max_pages && p->found_pages == p->max_pages))
+		return 0;
+
+	end = min(end, walk->vma->vm_end);
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
+			/*
+			 * Break huge page into small pages if operation needs to be performed is
+			 * on a portion of the huge page or the return buffer cannot store complete
+			 * data.
+			 */
+			if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
+				spin_unlock(ptl);
+				split_huge_pmd(vma, pmd, addr);
+				goto process_smaller_pages;
+			}
+
+			if (IS_GET_OP(p)) {
+				len = (end - addr)/PAGE_SIZE;
+				if (p->max_pages && p->found_pages + len > p->max_pages)
+					len = p->max_pages - p->found_pages;
+
+				ret = add_to_out(dirty_vma ||
+						 check_soft_dirty_pmd(vma, addr, pmd, false),
+						 vma->vm_file, pmd_present(*pmd), is_swap_pmd(*pmd),
+						 p, addr, len);
+			}
+			if (!ret && IS_CLEAR_OP(p))
+				check_soft_dirty_pmd(vma, addr, pmd, true);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr < end && !ret && (!p->max_pages || (p->found_pages < p->max_pages))
+	     ; pte++, addr += PAGE_SIZE) {
+		if (IS_GET_OP(p))
+			ret = add_to_out(dirty_vma || check_soft_dirty(vma, addr, pte, false),
+					 vma->vm_file, pte_present(*pte),
+					 is_swap_pte(*pte), p, addr, 1);
+		if (!ret && IS_CLEAR_OP(p))
+			check_soft_dirty(vma, addr, pte, true);
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
+				 struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned int len;
+	bool sd;
+
+	if (vma) {
+		/* Individual pages haven't been allocated and written */
+		sd = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ? (false) :
+		     (vma->vm_flags & VM_SOFTDIRTY);
+
+		len = (end - addr)/PAGE_SIZE;
+		if (p->max_pages && p->found_pages + len > p->max_pages)
+			len = p->max_pages - p->found_pages;
+
+		add_to_out(sd, vma->vm_file, false, false, p, addr, len);
+	}
+
+	return 0;
+}
+
+#ifdef CONFIG_MEM_SOFT_DIRTY
+static int pagemap_scan_pre_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long end_cut = end;
+	int ret;
+
+	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
+	    (vma->vm_flags & VM_SOFTDIRTY)) {
+		if (vma->vm_start < start) {
+			ret = split_vma(vma->vm_mm, vma, start, 1);
+			if (ret)
+				return ret;
+		}
+		/* Calculate end_cut because of max_pages */
+		if (IS_GET_OP(p) && p->max_pages)
+			end_cut = min(start + (p->max_pages - p->found_pages) * PAGE_SIZE, end);
+
+		if (vma->vm_end > end_cut) {
+			ret = split_vma(vma->vm_mm, vma, end_cut, 0);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void pagemap_scan_post_vma(struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
+	    (vma->vm_flags & VM_SOFTDIRTY)) {
+		vma->vm_flags &= ~VM_SOFTDIRTY;
+		vma_set_page_prot(vma);
+	}
+}
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_pmd_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	/* Only for clearing SD bit over VMAs */
+	.pre_vma = pagemap_scan_pre_vma,
+	.post_vma = pagemap_scan_post_vma,
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+};
+
+static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
+{
+	struct mmu_notifier_range range;
+	unsigned long __user start, end;
+	struct pagemap_scan_private p;
+	int ret;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+		return -EINVAL;
+
+	if (IS_GET_OP(arg) &&
+	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
+		return -ENOMEM;
+
+#ifndef CONFIG_MEM_SOFT_DIRTY
+	if (IS_SD_OP(arg) || (arg->required_mask & PAGE_IS_SOFTDIRTY) ||
+	    (arg->anyof_mask & PAGE_IS_SOFTDIRTY))
+		return -EINVAL;
+#endif
+
+	if ((arg->flags & ~PAGEMAP_SD_FLAGS) || (arg->required_mask & ~PAGEMAP_OP_MASK) ||
+	    (arg->anyof_mask & ~PAGEMAP_OP_MASK) || (arg->excluded_mask & ~PAGEMAP_OP_MASK) ||
+	    (arg->return_mask & ~PAGEMAP_OP_MASK))
+		return -EINVAL;
+
+	if ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || !arg->return_mask)
+		return -EINVAL;
+
+	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
+	     (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
+		return -EINVAL;
+
+	end = start + arg->len;
+	p.max_pages = arg->max_pages;
+	p.found_pages = 0;
+	p.flags = arg->flags;
+	p.required_mask = arg->required_mask;
+	p.anyof_mask = arg->anyof_mask;
+	p.excluded_mask = arg->excluded_mask;
+	p.return_mask = arg->return_mask;
+	p.vec_index = 0;
+	p.vec_len = arg->vec_len;
+
+	if (IS_GET_OP(arg)) {
+		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));
+		if (!p.vec)
+			return -ENOMEM;
+	} else {
+		p.vec = NULL;
+	}
+
+	if (IS_CLEAR_OP(arg)) {
+		mmap_write_lock(mm);
+
+		mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, 0, NULL, mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+		inc_tlb_flush_pending(mm);
+	} else {
+		mmap_read_lock(mm);
+	}
+
+	ret = walk_page_range(mm, start, end, &pagemap_scan_ops, &p);
+
+	if (IS_CLEAR_OP(arg)) {
+		mmu_notifier_invalidate_range_end(&range);
+		dec_tlb_flush_pending(mm);
+
+		mmap_write_unlock(mm);
+	} else {
+		mmap_read_unlock(mm);
+	}
+
+	if (ret < 0)
+		goto free_data;
+
+	if (IS_GET_OP(arg) && p.vec_index) {
+		if (copy_to_user((struct page_region *)arg->vec, p.vec,
+				 p.vec_index * sizeof(struct page_region))) {
+			ret = -EFAULT;
+			goto free_data;
+		}
+		ret = p.vec_index;
+	} else {
+		ret = 0;
+	}
+
+free_data:
+	if (IS_GET_OP(arg))
+		vfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_sd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pagemap_scan_arg argument;
+
+	if (cmd == PAGEMAP_SCAN) {
+		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
+			return -EFAULT;
+		return do_pagemap_sd_cmd(mm, &argument);
+	}
+	return -EINVAL;
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_sd_ioctl,
+	.compat_ioctl	= pagemap_sd_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..11d232cfc9b3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,60 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_arg */
+#define PAGE_IS_SOFTDIRTY	(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * @bitmap:	Bits set for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_SOFTDIRTY_CLEAR		(1 << 0)
+/*
+ * Depend only on the soft dirty PTE bit of individual pages and don't check the soft dirty bit
+ * of the VMA to decide if the region is dirty or not. By using this flag, the newly created
+ * memory doesn't appear to be soft dirty through the IOCTL until the region is written.
+ */
+#define PAGEMAP_NO_REUSED_REGIONS	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index b7b56871029c..11d232cfc9b3 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -305,4 +305,60 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_arg */
+#define PAGE_IS_SOFTDIRTY	(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * @bitmap:	Bits set for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_SOFTDIRTY_CLEAR		(1 << 0)
+/*
+ * Depend only on the soft dirty PTE bit of individual pages and don't check the soft dirty bit
+ * of the VMA to decide if the region is dirty or not. By using this flag, the newly created
+ * memory doesn't appear to be soft dirty through the IOCTL until the region is written.
+ */
+#define PAGEMAP_NO_REUSED_REGIONS	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.30.2



* [PATCH v6 3/3] selftests: vm: add pagemap ioctl tests
  2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or clear the info about PTEs Muhammad Usama Anjum
  2022-11-09 10:23 ` [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
  2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about PTEs Muhammad Usama Anjum
@ 2022-11-09 10:23 ` Muhammad Usama Anjum
  2022-11-09 10:34 ` [PATCH v6 0/3] Implement IOCTL to get and/or clear the info about PTEs David Hildenbrand
  2022-11-23 14:11 ` Peter Xu
  4 siblings, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-09 10:23 UTC (permalink / raw)
  To: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, Muhammad Usama Anjum, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

Add pagemap ioctl tests. Add several different types of tests to judge
the correctness of the interface.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v6:
- Rename variables

Changes in v4:
- Updated all the tests to conform to new IOCTL

Changes in v3:
- Add another test to do sanity of flags

Changes in v2:
- Update the tests to use the ioctl interface instead of syscall

TAP version 13
1..59
ok 1 sanity_tests_sd wrong flag specified
ok 2 sanity_tests_sd wrong mask specified
ok 3 sanity_tests_sd wrong return mask specified
ok 4 sanity_tests_sd mixture of correct and wrong flag
ok 5 sanity_tests_sd Clear area with larger vec size
ok 6 sanity_tests_sd Repeated pattern of dirty and non-dirty pages
ok 7 sanity_tests_sd Repeated pattern of dirty and non-dirty pages in parts
ok 8 sanity_tests_sd Two regions
ok 9 Page testing: all new pages must be soft dirty
ok 10 Page testing: all pages must not be soft dirty
ok 11 Page testing: all pages dirty other than first and the last one
ok 12 Page testing: only middle page dirty
ok 13 Page testing: only two middle pages dirty
ok 14 Page testing: only get 2 dirty pages and clear them as well
ok 15 Page testing: Range clear only
ok 16 Large Page testing: all new pages must be soft dirty
ok 17 Large Page testing: all pages must not be soft dirty
ok 18 Large Page testing: all pages dirty other than first and the last one
ok 19 Large Page testing: only middle page dirty
ok 20 Large Page testing: only two middle pages dirty
ok 21 Large Page testing: only get 2 dirty pages and clear them as well
ok 22 Large Page testing: Range clear only
ok 23 Huge page testing: all new pages must be soft dirty
ok 24 Huge page testing: all pages must not be soft dirty
ok 25 Huge page testing: all pages dirty other than first and the last one
ok 26 Huge page testing: only middle page dirty
ok 27 Huge page testing: only two middle pages dirty
ok 28 Huge page testing: only get 2 dirty pages and clear them as well
ok 29 Huge page testing: Range clear only
ok 30 Performance Page testing: all new pages must be soft dirty
ok 31 Performance Page testing: all pages must not be soft dirty
ok 32 Performance Page testing: all pages dirty other than first and the last one
ok 33 Performance Page testing: only middle page dirty
ok 34 Performance Page testing: only two middle pages dirty
ok 35 Performance Page testing: only get 2 dirty pages and clear them as well
ok 36 Performance Page testing: Range clear only
ok 37 hpage_unit_tests all new huge page must be dirty
ok 38 hpage_unit_tests all the huge page must not be dirty
ok 39 hpage_unit_tests all the huge page must be dirty and clear
ok 40 hpage_unit_tests only middle page dirty
ok 41 hpage_unit_tests clear first half of huge page
ok 42 hpage_unit_tests clear first half of huge page with limited buffer
ok 43 hpage_unit_tests clear second half huge page
ok 44 unmapped_region_tests Get dirty pages
ok 45 unmapped_region_tests Get dirty pages
ok 46 Test test_simple
ok 47 sanity_tests clear op can only be specified with PAGE_IS_DIRTY
ok 48 sanity_tests rmask specified
ok 49 sanity_tests amask specified
ok 50 sanity_tests emask specified
ok 51 sanity_tests rmask and amask specified
ok 52 sanity_tests rmask and amask specified
ok 53 sanity_tests Get sd and present pages with amask
ok 54 sanity_tests Get all the pages with rmask
ok 55 sanity_tests Get sd and present pages with rmask and amask
ok 56 sanity_tests Don't get sd pages
ok 57 sanity_tests Don't get present pages
ok 58 sanity_tests Find dirty present pages with return mask
ok 59 sanity_tests Memory mapped file
 # Totals: pass:59 fail:0 xfail:0 xpass:0 skip:0 error:0
---
 tools/testing/selftests/vm/.gitignore      |   1 +
 tools/testing/selftests/vm/Makefile        |   5 +-
 tools/testing/selftests/vm/pagemap_ioctl.c | 698 +++++++++++++++++++++
 3 files changed, 702 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 8a536c731e3c..4a73983e3e58 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -17,6 +17,7 @@ mremap_dontunmap
 mremap_test
 on-fault-limit
 transhuge-stress
+pagemap_ioctl
 protection_keys
 protection_keys_32
 protection_keys_64
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 0986bd60c19f..2325bcdb9fae 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -24,9 +24,8 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/p
 # things despite using incorrect values such as an *occasionally* incomplete
 # LDLIBS.
 MAKEFLAGS += --no-builtin-rules
-
 CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/usr/include $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
-LDLIBS = -lrt -lpthread
+LDLIBS = -lrt -lpthread -lm
 TEST_GEN_FILES = anon_cow
 TEST_GEN_FILES += compaction_test
 TEST_GEN_FILES += gup_test
@@ -52,6 +51,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
+TEST_GEN_PROGS += pagemap_ioctl
 TEST_GEN_PROGS += soft-dirty
 TEST_GEN_PROGS += split_huge_page_test
 TEST_GEN_FILES += ksm_tests
@@ -103,6 +103,7 @@ $(OUTPUT)/anon_cow: vm_util.c
 $(OUTPUT)/khugepaged: vm_util.c
 $(OUTPUT)/ksm_functional_tests: vm_util.c
 $(OUTPUT)/madv_populate: vm_util.c
+$(OUTPUT)/pagemap_ioctl: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
 $(OUTPUT)/userfaultfd: vm_util.c
diff --git a/tools/testing/selftests/vm/pagemap_ioctl.c b/tools/testing/selftests/vm/pagemap_ioctl.c
new file mode 100644
index 000000000000..1a55b2c3b7fc
--- /dev/null
+++ b/tools/testing/selftests/vm/pagemap_ioctl.c
@@ -0,0 +1,698 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <malloc.h>
+#include <asm-generic/unistd.h>
+#include "vm_util.h"
+#include "../kselftest.h"
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <math.h>
+
+#define PAGEMAP_OP_MASK		(PAGE_IS_SOFTDIRTY | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define TEST_ITERATIONS 10
+#define PAGEMAP "/proc/self/pagemap"
+int pagemap_fd;
+
+static long pagemap_ioctl(void *start, int len, void *vec, int vec_len, int flag,
+			  int max_pages, long required_mask, long anyof_mask, long excluded_mask,
+			  long return_mask)
+{
+	struct pagemap_scan_arg arg;
+	int ret;
+
+	arg.start = (uintptr_t)start;
+	arg.len = len;
+	arg.vec = (uintptr_t)vec;
+	arg.vec_len = vec_len;
+	arg.flags = flag;
+	arg.max_pages = max_pages;
+	arg.required_mask = required_mask;
+	arg.anyof_mask = anyof_mask;
+	arg.excluded_mask = excluded_mask;
+	arg.return_mask = return_mask;
+
+	ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+	return ret;
+}
+
+int sanity_tests_sd(int page_size)
+{
+	char *mem, *m[2];
+	int mem_size, vec_size, ret, ret2, ret3, i, num_pages = 10;
+	struct page_region *vec;
+
+	vec_size = 100;
+	mem_size = num_pages * page_size;
+
+	vec = malloc(sizeof(struct page_region) * vec_size);
+	if (!vec)
+		ksft_exit_fail_msg("error nomem\n");
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+
+	/* 1. wrong operation */
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, -1,
+				       0, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY) < 0,
+			 "%s wrong flag specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 8,
+				       0, 0x1111, 0, 0, PAGE_IS_SOFTDIRTY) < 0,
+			 "%s wrong mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0,
+				       0, PAGE_IS_SOFTDIRTY, 0, 0, 0x1000) < 0,
+			 "%s wrong return mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size,
+				       PAGEMAP_SOFTDIRTY_CLEAR | 0x32,
+				       0, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY) < 0,
+			 "%s mixture of correct and wrong flag\n", __func__);
+
+	/* 2. Clear area with larger vec size */
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+			    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	ksft_test_result(ret >= 0, "%s Clear area with larger vec size\n", __func__);
+
+	/* 3. Repeated pattern of dirty and non-dirty pages */
+	for (i = 0; i < mem_size; i += 2 * page_size)
+		mem[i]++;
+
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			    PAGE_IS_SOFTDIRTY);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	ksft_test_result(ret == mem_size/(page_size * 2),
+			 "%s Repeated pattern of dirty and non-dirty pages\n", __func__);
+
+	/* 4. Repeated pattern of dirty and non-dirty pages in parts */
+	ret = pagemap_ioctl(mem, mem_size, vec, num_pages/5, PAGEMAP_SOFTDIRTY_CLEAR,
+			    num_pages/2 - 2, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	ret2 = pagemap_ioctl(mem, mem_size, vec, 2, 0, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			     PAGE_IS_SOFTDIRTY);
+	if (ret2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret2, errno, strerror(errno));
+
+	ret3 = pagemap_ioctl(mem, mem_size, vec, num_pages/2, 0, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			     PAGE_IS_SOFTDIRTY);
+	if (ret3 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret3, errno, strerror(errno));
+
+	ksft_test_result((ret + ret3) == num_pages/2 && ret2 == 2,
+			 "%s Repeated pattern of dirty and non-dirty pages in parts\n", __func__);
+
+	munmap(mem, mem_size);
+
+	/* 5. Two regions */
+	m[0] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!m[0])
+		ksft_exit_fail_msg("error nomem\n");
+	m[1] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!m[1])
+		ksft_exit_fail_msg("error nomem\n");
+
+	ret = pagemap_ioctl(m[0], mem_size, NULL, 0, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+			    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	ret = pagemap_ioctl(m[1], mem_size, vec, 1, 0, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			    PAGE_IS_SOFTDIRTY);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	ksft_test_result(ret == 1 && vec[0].len == mem_size/page_size,
+			 "%s Two regions\n", __func__);
+
+	munmap(m[0], mem_size);
+	munmap(m[1], mem_size);
+
+	free(vec);
+	return 0;
+}
+
+int base_tests(char *prefix, char *mem, int mem_size, int page_size, int skip, int flags)
+{
+	int vec_size, ret, dirty, dirty2;
+	struct page_region *vec, *vec2;
+
+	if (skip) {
+		ksft_test_result_skip("%s all new pages must be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages must not be soft dirty\n", prefix);
+		ksft_test_result_skip("%s all pages dirty other than first and the last one\n",
+				      prefix);
+		ksft_test_result_skip("%s only middle page dirty\n", prefix);
+		ksft_test_result_skip("%s only two middle pages dirty\n", prefix);
+		ksft_test_result_skip("%s only get 2 dirty pages and clear them as well\n", prefix);
+		ksft_test_result_skip("%s Range clear only\n", prefix);
+		return 0;
+	}
+
+	vec_size = mem_size/page_size;
+	vec = malloc(sizeof(struct page_region) * vec_size);
+	vec2 = malloc(sizeof(struct page_region) * vec_size);
+
+	/* 1. all new pages must be soft dirty if PAGEMAP_NO_REUSED_REGIONS isn't used */
+	dirty = pagemap_ioctl(mem, mem_size, vec, 1, flags | PAGEMAP_SOFTDIRTY_CLEAR, vec_size - 2,
+			      PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	dirty2 = pagemap_ioctl(mem, mem_size, vec2, 1, flags | PAGEMAP_SOFTDIRTY_CLEAR, 0,
+			       PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno));
+
+	if (flags != PAGEMAP_NO_REUSED_REGIONS)
+		ksft_test_result(dirty == 1 && vec[0].start == (unsigned long)mem &&
+				 vec[0].len == vec_size - 2 && vec[0].bitmap == PAGE_IS_SOFTDIRTY &&
+				 dirty2 == 1 &&
+				 vec2[0].start == (unsigned long)(mem + mem_size - (2 * page_size))
+				 && vec2[0].len == 2 && vec[0].bitmap == PAGE_IS_SOFTDIRTY,
+				 "%s all new pages must be soft dirty\n", prefix);
+	else
+		ksft_test_result(dirty == 0 && dirty2 == 0,
+				 "%s all new pages must be soft dirty\n", prefix);
+
+	// 2. all pages must not be soft dirty
+	dirty = pagemap_ioctl(mem, mem_size, vec, 1, flags, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			      PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(dirty == 0, "%s all pages must not be soft dirty\n", prefix);
+
+	// 3. all pages dirty other than first and the last one
+	memset(mem + page_size, -1, mem_size - (2 * page_size));
+
+	dirty = pagemap_ioctl(mem, mem_size, vec, 1, flags, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			      PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(dirty == 1 && vec[0].len >= vec_size - 2 && vec[0].len <= vec_size,
+			 "%s all pages dirty other than first and the last one\n", prefix);
+
+	// 4. only middle page dirty
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+
+	dirty = pagemap_ioctl(mem, mem_size, vec, vec_size, flags, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			      PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(vec[0].start == (uintptr_t)(mem + vec_size/2 * page_size),
+			 "%s only middle page dirty\n", prefix);
+
+	// 5. only two middle pages dirty and walk over only middle pages
+	clear_softdirty();
+	mem[vec_size/2 * page_size]++;
+	mem[(vec_size/2 + 1) * page_size]++;
+
+	dirty = pagemap_ioctl(&mem[vec_size/2 * page_size], 2 * page_size, vec, 1, flags, 0,
+			      PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(dirty == 1 && vec[0].start == (uintptr_t)(&mem[vec_size/2 * page_size]) &&
+			 vec[0].len == 2,
+			 "%s only two middle pages dirty\n", prefix);
+
+	/* 6. only get 2 dirty pages and clear them as well */
+	memset(mem, -1, mem_size);
+
+	/* get and clear second and third pages */
+	ret = pagemap_ioctl(mem + page_size, 2 * page_size, vec, 1, flags | PAGEMAP_SOFTDIRTY_CLEAR,
+			    2, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	dirty = pagemap_ioctl(mem, mem_size, vec2, vec_size, flags, 0,
+			      PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(ret == 1 && vec[0].len == 2 &&
+			 vec[0].start == (uintptr_t)(mem + page_size) &&
+			 dirty == 2 && vec2[0].len == 1 && vec2[0].start == (uintptr_t)mem &&
+			 vec2[1].len == vec_size - 3 &&
+			 vec2[1].start == (uintptr_t)(mem + 3 * page_size),
+			 "%s only get 2 dirty pages and clear them as well\n", prefix);
+
+	/* 7. Range clear only */
+	memset(mem, -1, mem_size);
+
+	dirty = pagemap_ioctl(mem, mem_size, NULL, 0, flags | PAGEMAP_SOFTDIRTY_CLEAR, 0,
+			      PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	dirty2 = pagemap_ioctl(mem, mem_size, vec, vec_size, flags, 0,
+			       PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty2 < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno));
+
+	ksft_test_result(dirty == 0 && dirty2 == 0, "%s Range clear only\n",
+			 prefix);
+
+	free(vec);
+	free(vec2);
+	return 0;
+}
+
+void *gethugepage(int map_size)
+{
+	int ret;
+	char *map;
+	size_t hpage_len = read_pmd_pagesize();
+
+	map = memalign(hpage_len, map_size);
+	if (!map)
+		ksft_exit_fail_msg("memalign failed %d %s\n", errno, strerror(errno));
+
+	ret = madvise(map, map_size, MADV_HUGEPAGE);
+	if (ret)
+		ksft_exit_fail_msg("madvise failed %d %d %s\n", ret, errno, strerror(errno));
+
+	memset(map, 0, map_size);
+
+	if (check_huge_anon(map, map_size/hpage_len, hpage_len))
+		return map;
+
+	free(map);
+	return NULL;
+
+}
+
+int hpage_unit_tests(int page_size)
+{
+	char *map;
+	int ret;
+	size_t hpage_len = read_pmd_pagesize();
+	size_t num_pages = 10;
+	int map_size = hpage_len * num_pages;
+	int vec_size = map_size/page_size;
+	struct page_region *vec, *vec2;
+
+	vec = malloc(sizeof(struct page_region) * vec_size);
+	vec2 = malloc(sizeof(struct page_region) * vec_size);
+	if (!vec || !vec2)
+		ksft_exit_fail_msg("malloc failed\n");
+
+	map = gethugepage(map_size);
+	if (map) {
+		// 1. all new huge page must be dirty
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].start == (uintptr_t)map &&
+				 vec[0].len == vec_size && vec[0].bitmap == PAGE_IS_SOFTDIRTY,
+				 "%s all new huge page must be dirty\n", __func__);
+
+		// 2. all the huge page must not be dirty
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 0, "%s all the huge page must not be dirty\n", __func__);
+
+		// 3. all the huge page must be dirty and clear dirty as well
+		memset(map, -1, map_size);
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].start == (uintptr_t)map &&
+				vec[0].len == vec_size && vec[0].bitmap == PAGE_IS_SOFTDIRTY,
+				 "%s all the huge page must be dirty and clear\n", __func__);
+
+		// 4. only middle page dirty
+		free(map);
+		map = gethugepage(map_size);
+		clear_softdirty();
+		map[vec_size/2 * page_size]++;
+
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].len > 0,
+				 "%s only middle page dirty\n", __func__);
+
+		free(map);
+	} else {
+		ksft_test_result_skip("all new huge page must be dirty\n");
+		ksft_test_result_skip("all the huge page must not be dirty\n");
+		ksft_test_result_skip("all the huge page must be dirty and clear\n");
+		ksft_test_result_skip("only middle page dirty\n");
+	}
+
+	// 5. clear first half of huge page
+	map = gethugepage(map_size);
+	if (map) {
+		ret = pagemap_ioctl(map, map_size/2, NULL, 0, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].len == vec_size/2 &&
+				 vec[0].start == (uintptr_t)(map + map_size/2),
+				 "%s clear first half of huge page\n", __func__);
+		free(map);
+	} else {
+		ksft_test_result_skip("clear first half of huge page\n");
+	}
+
+	// 6. clear first half of huge page with limited buffer
+	map = gethugepage(map_size);
+	if (map) {
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR,
+				    vec_size/2, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].len == vec_size/2 &&
+				 vec[0].start == (uintptr_t)(map + map_size/2),
+				 "%s clear first half of huge page with limited buffer\n",
+				 __func__);
+
+		free(map);
+	} else {
+		ksft_test_result_skip("clear first half of huge page with limited buffer\n");
+	}
+
+	// 7. clear second half of huge page
+	map = gethugepage(map_size);
+	if (map) {
+		memset(map, -1, map_size);
+		ret = pagemap_ioctl(map + map_size/2, map_size, NULL, 0, PAGEMAP_SOFTDIRTY_CLEAR,
+				    0, PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+				    PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+		if (ret < 0)
+			ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+		ksft_test_result(ret == 1 && vec[0].len == vec_size/2,
+				 "%s clear second half huge page\n", __func__);
+		free(map);
+	} else {
+		ksft_test_result_skip("clear second half huge page\n");
+	}
+
+	free(vec);
+	free(vec2);
+	return 0;
+}
+
+int unmapped_region_tests(int page_size)
+{
+	void *start = (void *)0x10000000;
+	int dirty, len = 0x00040000;
+	int vec_size = len / page_size;
+	struct page_region *vec = malloc(sizeof(struct page_region) * vec_size);
+
+	/* 1. Get dirty pages */
+	dirty = pagemap_ioctl(start, len, vec, vec_size, 0, 0, PAGE_IS_SOFTDIRTY, 0, 0,
+			      PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(dirty >= 0, "%s Get dirty pages\n", __func__);
+
+	/* 2. Clear dirty bit of whole address space */
+	dirty = pagemap_ioctl(0, 0x7FFFFFFF, NULL, 0,  PAGEMAP_SOFTDIRTY_CLEAR, 0,
+			PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY);
+	if (dirty < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+	ksft_test_result(dirty == 0, "%s Get dirty pages\n", __func__);
+
+	free(vec);
+	return 0;
+}
+
+static void test_simple(int page_size)
+{
+	int i;
+	char *map;
+	struct page_region vec;
+
+	map = aligned_alloc(page_size, page_size);
+	if (!map)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	clear_softdirty();
+
+	for (i = 0 ; i < TEST_ITERATIONS; i++) {
+		if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0,
+				  PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY) == 1) {
+			ksft_print_msg("dirty bit was 1, but should be 0 (i=%d)\n", i);
+			break;
+		}
+
+		clear_softdirty();
+		// Write something to the page to get the dirty bit enabled on the page
+		map[0]++;
+
+		if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0,
+				  PAGE_IS_SOFTDIRTY, 0, 0, PAGE_IS_SOFTDIRTY) == 0) {
+			ksft_print_msg("dirty bit was 0, but should be 1 (i=%d)\n", i);
+			break;
+		}
+
+		clear_softdirty();
+	}
+	free(map);
+
+	ksft_test_result(i == TEST_ITERATIONS, "Test %s\n", __func__);
+}
+
+int sanity_tests(int page_size)
+{
+	char *mem, *fmem;
+	int mem_size, vec_size, ret;
+	struct page_region *vec;
+
+	/* 1. wrong operation */
+	mem_size = 10 * page_size;
+	vec_size = mem_size / page_size;
+
+	vec = malloc(sizeof(struct page_region) * vec_size);
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem || !vec)
+		ksft_exit_fail_msg("error nomem\n");
+
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size,
+				       PAGEMAP_SOFTDIRTY_CLEAR | PAGEMAP_NO_REUSED_REGIONS, 0,
+				       PAGEMAP_OP_MASK, 0, 0, PAGEMAP_OP_MASK) < 0,
+			 "%s clear op can only be specified with PAGE_IS_SOFTDIRTY\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+				       PAGEMAP_OP_MASK, 0, 0, PAGEMAP_OP_MASK) >= 0,
+			 "%s required_mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+				       0, PAGEMAP_OP_MASK, 0, PAGEMAP_OP_MASK) >= 0,
+			 "%s anyof_mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+				       0, 0, PAGEMAP_OP_MASK, PAGEMAP_OP_MASK) >= 0,
+			 "%s excluded_mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+				       PAGEMAP_OP_MASK, PAGEMAP_OP_MASK, 0, PAGEMAP_OP_MASK) >= 0,
+			 "%s required_mask and anyof_mask specified\n", __func__);
+	ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_SOFTDIRTY_CLEAR, 0,
+				       0, 0, PAGEMAP_OP_MASK, PAGEMAP_OP_MASK) >= 0,
+			 "%s required_mask and anyof_mask specified\n", __func__);
+	munmap(mem, mem_size);
+
+	/* 2. Get sd and present pages with anyof_mask */
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+	memset(mem, 0, mem_size);
+
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    0, PAGEMAP_OP_MASK, 0, PAGEMAP_OP_MASK);
+	ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+			 vec[0].bitmap == (PAGE_IS_SOFTDIRTY | PAGE_IS_PRESENT),
+			 "%s Get sd and present pages with anyof_mask\n", __func__);
+
+	/* 3. Get sd and present pages with required_mask */
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    PAGEMAP_OP_MASK, 0, 0, PAGEMAP_OP_MASK);
+	ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+			 vec[0].bitmap == (PAGE_IS_SOFTDIRTY | PAGE_IS_PRESENT),
+			 "%s Get all the pages with required_mask\n", __func__);
+
+	/* 4. Get sd and present pages with required_mask and anyof_mask */
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    PAGE_IS_SOFTDIRTY, PAGE_IS_PRESENT, 0, PAGEMAP_OP_MASK);
+	ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+			 vec[0].bitmap == (PAGE_IS_SOFTDIRTY | PAGE_IS_PRESENT),
+			 "%s Get sd and present pages with required_mask and anyof_mask\n",
+			 __func__);
+
+	/* 5. Don't get sd pages */
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    0, 0, PAGE_IS_SOFTDIRTY, PAGEMAP_OP_MASK);
+	ksft_test_result(ret == 0, "%s Don't get sd pages\n", __func__);
+
+	/* 6. Don't get present pages */
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    0, 0, PAGE_IS_PRESENT, PAGEMAP_OP_MASK);
+	ksft_test_result(ret == 0, "%s Don't get present pages\n", __func__);
+
+	munmap(mem, mem_size);
+
+	/* 8. Find dirty present pages with return mask */
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (!mem)
+		ksft_exit_fail_msg("error nomem\n");
+	memset(mem, 0, mem_size);
+
+	ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+			    0, PAGEMAP_OP_MASK, 0, PAGE_IS_SOFTDIRTY);
+	ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+			 vec[0].bitmap == PAGE_IS_SOFTDIRTY,
+			 "%s Find dirty present pages with return mask\n", __func__);
+
+	/* 8. Memory mapped file */
+	int fd;
+	struct stat sbuf;
+
+	fd = open("run_vmtests.sh", O_RDONLY);
+	if (fd < 0) {
+		ksft_test_result_skip("%s Memory mapped file\n", __func__);
+		goto free_vec_and_return;
+	}
+
+	ret = stat("run_vmtests.sh", &sbuf);
+	if (ret < 0)
+		ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+	fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (fmem == MAP_FAILED)
+		ksft_exit_fail_msg("error nomem\n");
+
+	ret = pagemap_ioctl(fmem, sbuf.st_size, vec, vec_size, 0, 0,
+			    0, PAGEMAP_OP_MASK, 0, PAGEMAP_OP_MASK);
+
+	ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)fmem &&
+			 vec[0].len == ceilf((float)sbuf.st_size/page_size) &&
+			 vec[0].bitmap == (PAGE_IS_SOFTDIRTY | PAGE_IS_FILE),
+			 "%s Memory mapped file\n", __func__);
+
+	munmap(fmem, sbuf.st_size);
+
+free_vec_and_return:
+	free(vec);
+	return 0;
+}
+
+int main(void)
+{
+	int page_size = getpagesize();
+	size_t hpage_len = read_pmd_pagesize();
+	char *mem, *map;
+	int mem_size;
+
+	ksft_print_header();
+	ksft_set_plan(59);
+
+	pagemap_fd = open(PAGEMAP, O_RDWR);
+	if (pagemap_fd < 0)
+		return -EINVAL;
+
+	/*
+	 * Soft-dirty PTE bit tests
+	 */
+
+	/* 1. Sanity testing */
+	sanity_tests_sd(page_size);
+
+	/* 2. Normal page testing */
+	mem_size = 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (mem == MAP_FAILED)
+		ksft_exit_fail_msg("error nomem\n");
+
+	base_tests("Page testing:", mem, mem_size, page_size, 0, 0);
+
+	munmap(mem, mem_size);
+
+	/* 3. Large page testing */
+	mem_size = 512 * 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (mem == MAP_FAILED)
+		ksft_exit_fail_msg("error nomem\n");
+
+	base_tests("Large Page testing:", mem, mem_size, page_size, 0, 0);
+
+	munmap(mem, mem_size);
+
+	/* 4. Huge page testing */
+	map = gethugepage(hpage_len);
+	if (map)
+		base_tests("Huge page testing:", map, hpage_len, page_size, 0, 0);
+	else
+		base_tests("Huge page testing:", NULL, 0, 0, 1, 0);
+
+	free(map);
+
+	/* 5. Performance page testing */
+	mem_size = 10 * page_size;
+	mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (mem == MAP_FAILED)
+		ksft_exit_fail_msg("error nomem\n");
+
+	base_tests("Performance Page testing:", mem, mem_size, page_size, 0,
+		   PAGEMAP_NO_REUSED_REGIONS);
+
+	munmap(mem, mem_size);
+
+	/* 6. Huge page tests */
+	hpage_unit_tests(page_size);
+
+	/* 7. Unmapped address test */
+	unmapped_region_tests(page_size);
+
+	/* 8. Iterative test */
+	test_simple(page_size);
+
+	/*
+	 * Other PTE bit tests
+	 */
+
+	/* 1. Sanity testing */
+	sanity_tests(page_size);
+
+	close(pagemap_fd);
+	return ksft_exit_pass();
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
                   ` (2 preceding siblings ...)
  2022-11-09 10:23 ` [PATCH v6 3/3] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
@ 2022-11-09 10:34 ` David Hildenbrand
  2022-11-11  7:08   ` Muhammad Usama Anjum
  2022-11-23 14:11 ` Peter Xu
  4 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-11-09 10:34 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton,
	Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

On 09.11.22 11:23, Muhammad Usama Anjum wrote:
> Changes in v6:
> - Updated the interface and made cosmetic changes
> 
> Original Cover Letter in v5:
> Hello,
> 
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>    or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
> 
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
>    possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
> 
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
> 
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
> 
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.

Please separate that part out from the other changes; I am still not 
convinced that we want this and what the semantical implications are.

Let's take a look at an example: can_change_pte_writable()

	/* Do we need write faults for softdirty tracking? */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;

We care about PTE softdirty tracking, if it is enabled for the VMA. 
Tracking is enabled if: vma_soft_dirty_enabled()

	/*
	 * Soft-dirty is kind of special: its tracking is enabled when
	 * the vma flags not set.
	 */
	return !(vma->vm_flags & VM_SOFTDIRTY);

Consequently, if VM_SOFTDIRTY is set, we are not considering the 
soft_dirty PTE bits accordingly.


I'd suggest moving forward without this controversial 
PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a 
clear add-on we can discuss separately.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
@ 2022-11-09 23:54   ` Andrei Vagin
  2022-11-11 10:10     ` Muhammad Usama Anjum
       [not found]   ` <202211120107.cYLiq2cH-lkp@intel.com>
  2022-12-12 20:42   ` Cyrill Gorcunov
  2 siblings, 1 reply; 20+ messages in thread
From: Andrei Vagin @ 2022-11-09 23:54 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Michał Mirosław, Danylo Mocherniuk, Alexander Viro,
	Andrew Morton, Suren Baghdasaryan, Greg KH, Christian Brauner,
	Peter Xu, Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

Hi Muhammad,

Here are a few inline comments.

On Wed, Nov 09, 2022 at 03:23:02PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN can be used to get and/or clear the info about
> page table entries. The following operations are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>   or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages.
> 
> Only the soft-dirty bit can be read and cleared atomically. struct
> pagemap_sd_args is used as the argument of the IOCTL. In this struct:
> - The range is specified through start and len.
> - The output buffer and size is specified as vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_SD_CLEAR
>   and PAGEMAP_SD_NO_REUSED_REGIONS are supported.
> - The masks are specified in rmask, amask, emask and return_mask.
> 
> This IOCTL can be extended to get information about more PTE bits.
> 
> This is based on a patch from Gabriel Krisman Bertazi.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
> 
> Changes in v5:
> - Remove tlb flushing even for clear operation
> 
> Changes in v4:
> - Update the interface and implementation
> 
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
>   error checking
> 
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
>  fs/proc/task_mmu.c            | 328 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h       |  56 ++++++
>  tools/include/uapi/linux/fs.h |  56 ++++++
>  3 files changed, 440 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8235c536ac70..8d6a84ec5ef7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,9 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <uapi/linux/fs.h>
> +#include <linux/vmalloc.h>
> +#include <linux/minmax.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1775,11 +1778,336 @@ static int pagemap_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> +#define PAGEMAP_OP_MASK		(PAGE_IS_SOFTDIRTY | PAGE_IS_FILE |	\
> +				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NONSD_OP_MASK	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_SD_FLAGS	(PAGEMAP_SOFTDIRTY_CLEAR | PAGEMAP_NO_REUSED_REGIONS)
> +#define IS_CLEAR_OP(a)		(a->flags & PAGEMAP_SOFTDIRTY_CLEAR)
> +#define IS_GET_OP(a)		(a->vec)
> +#define IS_SD_OP(a)		(a->flags & PAGEMAP_SD_FLAGS)
> +
> +struct pagemap_scan_private {
> +	struct page_region *vec;
> +	unsigned long vec_len;
> +	unsigned long vec_index;
> +	unsigned int max_pages;
> +	unsigned int found_pages;
> +	unsigned int flags;
> +	unsigned long required_mask;
> +	unsigned long anyof_mask;
> +	unsigned long excluded_mask;
> +	unsigned long return_mask;
> +};
> +
> +static int pagemap_scan_pmd_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (IS_GET_OP(p) && p->max_pages && (p->found_pages == p->max_pages))
> +		return -1;
> +
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int add_to_out(bool sd, bool file, bool pres, bool swap, struct pagemap_scan_private *p,
> +		      unsigned long addr, unsigned int len)
> +{
> +	unsigned long bitmap, cur = sd | file << 1 | pres << 2 | swap << 3;

Should we define constants for each of these bits?

> +	bool cpy = true;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +
> +	bitmap = cur & p->return_mask;
> +
> +	if (cpy && bitmap) {
> +		if ((p->vec_index) && (p->vec[p->vec_index - 1].bitmap == bitmap) &&
> +		    (p->vec[p->vec_index - 1].start + p->vec[p->vec_index - 1].len * PAGE_SIZE ==
> +		     addr)) {

I think it is better to define a variable for p->vec_index - 1.
nit: len can be in bytes rather than pages.

> +			p->vec[p->vec_index - 1].len += len;
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			p->vec[p->vec_index].start = addr;
> +			p->vec[p->vec_index].len = len;
> +			p->found_pages += len;
> +			p->vec[p->vec_index].bitmap = bitmap;
> +			p->vec_index++;
> +		} else {
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long addr,
> +				  unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned int len;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t *pte;
> +	bool dirty_vma = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ?
> +			 (false) : (vma->vm_flags & VM_SOFTDIRTY);
> +
> +	if ((walk->vma->vm_end < addr) || (p->max_pages && p->found_pages == p->max_pages))
> +		return 0;
> +
> +	end = min(end, walk->vma->vm_end);
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
> +			/*
> +			 * Break huge page into small pages if operation needs to be performed is
> +			 * on a portion of the huge page or the return buffer cannot store complete
> +			 * data.
> +			 */
> +			if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
> +				spin_unlock(ptl);
> +				split_huge_pmd(vma, pmd, addr);
> +				goto process_smaller_pages;
> +			}
> +
> +			if (IS_GET_OP(p)) {
> +				len = (end - addr)/PAGE_SIZE;
> +				if (p->max_pages && p->found_pages + len > p->max_pages)
> +					len = p->max_pages - p->found_pages;
> +
> +				ret = add_to_out(dirty_vma ||
> +						 check_soft_dirty_pmd(vma, addr, pmd, false),

can we reuse a return code of the previous call of check_soft_dirty_pmd?

> +						 vma->vm_file, pmd_present(*pmd), is_swap_pmd(*pmd),
> +						 p, addr, len);
> +			}
> +			if (!ret && IS_CLEAR_OP(p))
> +				check_soft_dirty_pmd(vma, addr, pmd, true);

should we return an error in this case? We need to be sure that:
* we stop walking page tables after this point.
* return this error to user-space if we are not able to add anything
  in the vector.

> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr < end && !ret && (!p->max_pages || (p->found_pages < p->max_pages))
> +	     ; pte++, addr += PAGE_SIZE) {
> +		if (IS_GET_OP(p))
> +			ret = add_to_out(dirty_vma || check_soft_dirty(vma, addr, pte, false),
> +					 vma->vm_file, pte_present(*pte),
> +					 is_swap_pte(*pte), p, addr, 1);
> +		if (!ret && IS_CLEAR_OP(p))
> +			check_soft_dirty(vma, addr, pte, true);
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> +				 struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned int len;
> +	bool sd;
> +
> +	if (vma) {
> +		/* Individual pages haven't been allocated and written */
> +		sd = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ? (false) :
> +		     (vma->vm_flags & VM_SOFTDIRTY);
> +
> +		len = (end - addr)/PAGE_SIZE;
> +		if (p->max_pages && p->found_pages + len > p->max_pages)
> +			len = p->max_pages - p->found_pages;
> +
> +		add_to_out(sd, vma->vm_file, false, false, p, addr, len);
> +	}
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MEM_SOFT_DIRTY
> +static int pagemap_scan_pre_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long end_cut = end;
> +	int ret;
> +
> +	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
> +	    (vma->vm_flags & VM_SOFTDIRTY)) {
> +		if (vma->vm_start < start) {
> +			ret = split_vma(vma->vm_mm, vma, start, 1);
> +			if (ret)
> +				return ret;
> +		}
> +		/* Calculate end_cut because of max_pages */
> +		if (IS_GET_OP(p) && p->max_pages)
> +			end_cut = min(start + (p->max_pages - p->found_pages) * PAGE_SIZE, end);
> +
> +		if (vma->vm_end > end_cut) {
> +			ret = split_vma(vma->vm_mm, vma, end_cut, 0);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void pagemap_scan_post_vma(struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (!(p->flags & PAGEMAP_NO_REUSED_REGIONS) && IS_CLEAR_OP(p) &&
> +	    (vma->vm_flags & VM_SOFTDIRTY)) {
> +		vma->vm_flags &= ~VM_SOFTDIRTY;
> +		vma_set_page_prot(vma);
> +	}
> +}
> +#endif /* CONFIG_MEM_SOFT_DIRTY */
> +
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +	.test_walk = pagemap_scan_pmd_test_walk,
> +	.pmd_entry = pagemap_scan_pmd_entry,
> +	.pte_hole = pagemap_scan_pte_hole,
> +
> +#ifdef CONFIG_MEM_SOFT_DIRTY
> +	/* Only for clearing SD bit over VMAs */
> +	.pre_vma = pagemap_scan_pre_vma,
> +	.post_vma = pagemap_scan_post_vma,
> +#endif /* CONFIG_MEM_SOFT_DIRTY */
> +};
> +
> +static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	struct mmu_notifier_range range;
> +	unsigned long __user start, end;
> +	struct pagemap_scan_private p;
> +	int ret;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +
> +	if (IS_GET_OP(arg) &&
> +	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
> +		return -ENOMEM;
> +
> +#ifndef CONFIG_MEM_SOFT_DIRTY
> +	if (IS_SD_OP(arg) || (arg->required_mask & PAGE_IS_SOFTDIRTY) ||
> +	    (arg->anyof_mask & PAGE_IS_SOFTDIRTY))
> +		return -EINVAL;
> +#endif
> +
> +	if ((arg->flags & ~PAGEMAP_SD_FLAGS) || (arg->required_mask & ~PAGEMAP_OP_MASK) ||
> +	    (arg->anyof_mask & ~PAGEMAP_OP_MASK) || (arg->excluded_mask & ~PAGEMAP_OP_MASK) ||
> +	    (arg->return_mask & ~PAGEMAP_OP_MASK))
> +		return -EINVAL;
> +
> +	if ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || !arg->return_mask)
> +		return -EINVAL;
> +
> +	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
> +	     (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
> +		return -EINVAL;
> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.vec_index = 0;
> +	p.vec_len = arg->vec_len;
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));

I think we need to set a reasonable limit for vec_len to avoid large
allocations in the kernel. We can consider using kmalloc or kvmalloc
here.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 10:34 ` [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs David Hildenbrand
@ 2022-11-11  7:08   ` Muhammad Usama Anjum
  2022-11-14 15:46     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-11  7:08 UTC (permalink / raw)
  To: David Hildenbrand, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton, Paul Gofman
  Cc: Muhammad Usama Anjum, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT

On 11/9/22 3:34 PM, David Hildenbrand wrote:
> On 09.11.22 11:23, Muhammad Usama Anjum wrote:
[...]
>> Some non-dirty pages get marked as dirty because of the kernel's
>> internal activity (such as VMA merging as soft-dirty bit difference isn't
>> considered while deciding to merge VMAs). The dirty bit of the pages is
>> stored in the VMA flags and in the per page flags. If any of these two bits
>> are set, the page is considered to be soft dirty. Suppose you have cleared
>> the soft dirty bit of half of VMA which will be done by splitting the VMA
>> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
>> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
>> again. This splitting/merging costs performance. The application receives
>> a lot of pages which aren't dirty in reality but marked as dirty.
>> Performance is lost again here. Also sometimes user doesn't want the newly
>> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
>> solves both the problems. It is used to not depend on the soft dirty flag
>> in the VMA flags. So VMA splitting and merging doesn't happen. It only
>> depends on the soft dirty bit of the individual pages. Thus by using this
>> flag, there may be a scenerio such that the new memory regions which are
>> just created, doesn't look dirty when seen with the IOCTL, but look dirty
>> when seen from procfs. This seems okay as the user of this flag know the
>> implication of using it.
The soft-dirtiness is stored in the PTE. The VMA is marked dirty to store
the dirtiness for reused regions. Clearing the soft-dirty status of the
whole process is straightforward. But when we want to clear/monitor the
soft-dirtiness of only a part of the virtual memory, there is a lot of
internal noise. We don't want non-dirty pages to become dirty because of
how the soft-dirty feature has been working. The soft-dirty feature wasn't
being used the way we want to use it now. While monitoring a part of
memory, it is not acceptable to get non-dirty pages reported as dirty.
Non-dirty pages become dirty when two VMAs are merged without considering
whether they both are dirty or not (34228d473efe). To monitor changes over
the memory, sometimes VMAs are split to clear the soft-dirty bit in the
VMA flags. But sometimes the kernel decides to merge them back. This is a
waste of resources.

To keep things consistent, the default behavior of the IOCTL is to report
even the extra non-dirty pages caused by this kernel noise as dirty. An
optional PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which
aren't tolerant of extra non-dirty pages. This flag can be considered as
something which bypasses the already present, somewhat buggy behavior in
the kernel. It is not buggy per se, as the issue could be solved by not
allowing two VMAs which have different soft-dirty bits to get merged. But
we are allowing that so that the total number of VMAs doesn't increase.
This was acceptable at the time, but the new use case of monitoring a part
of memory for soft-dirtiness doesn't want this merging. So either we need
to revert 34228d473efe, in which case the PAGEMAP_NO_REUSED_REGIONS flag
will not be needed, or we should allow PAGEMAP_NO_REUSED_REGIONS or a
similar mechanism to ignore the extra pages which aren't dirty in reality.

When the PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked
to find if the pages are dirty, so re-used regions cannot be detected. The
only side-effect is that the VMAs are not checked. This is a limitation of
using this flag which should be acceptable in the current state of the
code. It is okay for users because they can clear the soft-dirty bit of
the VMA before starting to monitor a range of memory for soft-dirtiness.
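
To make the intended usage concrete, here is a rough userspace sketch.
This is not from the posted patches; the struct, flag and mask names
follow the v6 interface described in this series, while the exact field
types and the PAGEMAP_SCAN request definition are assumptions:

	/* Assumes the VMA soft-dirty state was cleared once via clear_refs. */
	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)mem,	/* page-aligned range to scan */
		.len = mem_size,
		.vec = (uintptr_t)vec,		/* struct page_region buffer */
		.vec_len = vec_size,
		.flags = PAGEMAP_NO_REUSED_REGIONS,
		.required_mask = PAGE_IS_SOFTDIRTY,
		.return_mask = PAGE_IS_SOFTDIRTY,
	};
	int n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
	/* n >= 0: number of filled page_region entries, n < 0: error */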


> Please separate that part out from the other changes; I am still not
> convinced that we want this and what the semantical implications are.
> 
> Let's take a look at an example: can_change_pte_writable()
> 
>     /* Do we need write faults for softdirty tracking? */
>     if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>         return false;
> 
> We care about PTE softdirty tracking, if it is enabled for the VMA.
> Tracking is enabled if: vma_soft_dirty_enabled()
> 
>     /*
>      * Soft-dirty is kind of special: its tracking is enabled when
>      * the vma flags not set.
>      */
>     return !(vma->vm_flags & VM_SOFTDIRTY);
> 
> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty
> PTE bits accordingly.
Sorry, I'm unable to completely grasp the meaning of the example. We have
followed clear_refs_write() to write the soft-dirty bit clearing code in
the current patch. Dirtiness of the VMA and the PTE may be set
independently. Newly allocated memory has the dirty bit set in the VMA.
When something is written to the memory, the soft-dirty bit is set in the
PTEs as well, regardless of whether the soft-dirty bit is set in the VMA
or not.

> 
> 
> I'd suggest moving forward without this controversial
> PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
> clear add-on we can discuss separately.
Like I've described above, I've only added this flag to not get the
non-dirty pages as dirty. Can there be some alternative to adding this
flag? Please suggest.


--
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 23:54   ` Andrei Vagin
@ 2022-11-11 10:10     ` Muhammad Usama Anjum
  0 siblings, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-11 10:10 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Muhammad Usama Anjum, Michał Mirosław,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton,
	Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

Hello Andrei,

Thank you for reviewing.

On 11/10/22 4:54 AM, Andrei Vagin wrote:
[...]
>> +static int add_to_out(bool sd, bool file, bool pres, bool swap, struct pagemap_scan_private *p,
>> +		      unsigned long addr, unsigned int len)
>> +{
>> +	unsigned long bitmap, cur = sd | file << 1 | pres << 2 | swap << 3;
> 
> Should we define contants for each of these bits?
I think I can define a macro to hide this dirty bit shifting in the function.
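
Something along these lines is what I have in mind (illustrative only;
the helper name below is made up):

	/* Hide the ad-hoc shifting behind one definition of the bit layout. */
	#define PAGEMAP_BITS(sd, file, pres, swap)	\
		((sd) | (file) << 1 | (pres) << 2 | (swap) << 3)

	unsigned long bitmap, cur = PAGEMAP_BITS(sd, file, pres, swap);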

> 
>> +	bool cpy = true;
>> +
>> +	if (p->required_mask)
>> +		cpy = ((p->required_mask & cur) == p->required_mask);
>> +	if (cpy && p->anyof_mask)
>> +		cpy = (p->anyof_mask & cur);
>> +	if (cpy && p->excluded_mask)
>> +		cpy = !(p->excluded_mask & cur);
>> +
>> +	bitmap = cur & p->return_mask;
>> +
>> +	if (cpy && bitmap) {
>> +		if ((p->vec_index) && (p->vec[p->vec_index - 1].bitmap == bitmap) &&
>> +		    (p->vec[p->vec_index - 1].start + p->vec[p->vec_index - 1].len * PAGE_SIZE ==
>> +		     addr)) {
> 
> I think it is better to define a variable for p->vec_index - 1.
Will do in the next revision.
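
For reference, a sketch of that cleanup (illustrative, not the exact code
I'll post):

	if (cpy && bitmap) {
		struct page_region *prev = p->vec_index ?
					   &p->vec[p->vec_index - 1] : NULL;

		/* Extend the previous region if this page continues it. */
		if (prev && prev->bitmap == bitmap &&
		    prev->start + prev->len * PAGE_SIZE == addr) {
			prev->len += len;
			p->found_pages += len;
		} else if (p->vec_index < p->vec_len) {
			...
		}
	}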

> nit: len can be in bytes rather than pages.
We are considering memory in page units. The memory given to this IOCTL
must have PAGE_SIZE alignment. Otherwise we error out (picked this from
mincore()).

>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long addr,
>> +				  unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	unsigned int len;
>> +	spinlock_t *ptl;
>> +	int ret = 0;
>> +	pte_t *pte;
>> +	bool dirty_vma = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ?
>> +			 (false) : (vma->vm_flags & VM_SOFTDIRTY);
>> +
>> +	if ((walk->vma->vm_end < addr) || (p->max_pages && p->found_pages == p->max_pages))
>> +		return 0;
>> +
>> +	end = min(end, walk->vma->vm_end);
>> +
>> +	ptl = pmd_trans_huge_lock(pmd, vma);
>> +	if (ptl) {
>> +		if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
>> +			/*
>> +			 * Break huge page into small pages if operation needs to be performed is
>> +			 * on a portion of the huge page or the return buffer cannot store complete
>> +			 * data.
>> +			 */
>> +			if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
>> +				spin_unlock(ptl);
>> +				split_huge_pmd(vma, pmd, addr);
>> +				goto process_smaller_pages;
>> +			}
>> +
>> +			if (IS_GET_OP(p)) {
>> +				len = (end - addr)/PAGE_SIZE;
>> +				if (p->max_pages && p->found_pages + len > p->max_pages)
>> +					len = p->max_pages - p->found_pages;
>> +
>> +				ret = add_to_out(dirty_vma ||
>> +						 check_soft_dirty_pmd(vma, addr, pmd, false),
> 
> can we reuse a return code of the previous call of check_soft_dirty_pmd?
Yes, will do.

> 
>> +						 vma->vm_file, pmd_present(*pmd), is_swap_pmd(*pmd),
>> +						 p, addr, len);
>> +			}
>> +			if (!ret && IS_CLEAR_OP(p))
>> +				check_soft_dirty_pmd(vma, addr, pmd, true);
> 
> should we return a error in this case? We need to be sure that:
> * we stop waking page tables after this point.
I'll update the implementation to return error. It immediately terminates
the walk as well.
> * return this error to the user-space if we are not able to add anything
>   in the vector.
I'm not returning an error to userspace if no page matching the masks was
found. The total number of filled page_region structs is returned from the
IOCTL. If the IOCTL returns 0, it means no matching page was found, but
the IOCTL executed successfully.
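
Roughly, the plan is to propagate the error so the walk stops (a sketch
only, not the final code):

			if (!ret && IS_CLEAR_OP(p))
				check_soft_dirty_pmd(vma, addr, pmd, true);
		}
		spin_unlock(ptl);
		return ret;	/* was "return 0"; negative ret ends the walk */

and likewise return ret instead of 0 at the end of
pagemap_scan_pmd_entry() after the PTE loop.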

[...]
>> +static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> +	struct mmu_notifier_range range;
>> +	unsigned long __user start, end;
>> +	struct pagemap_scan_private p;
>> +	int ret;
>> +
>> +	start = (unsigned long)untagged_addr(arg->start);
>> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> +		return -EINVAL;
>> +
>> +	if (IS_GET_OP(arg) &&
>> +	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
>> +		return -ENOMEM;
>> +
>> +#ifndef CONFIG_MEM_SOFT_DIRTY
>> +	if (IS_SD_OP(arg) || (arg->required_mask & PAGE_IS_SOFTDIRTY) ||
>> +	    (arg->anyof_mask & PAGE_IS_SOFTDIRTY))
>> +		return -EINVAL;
>> +#endif
>> +
>> +	if ((arg->flags & ~PAGEMAP_SD_FLAGS) || (arg->required_mask & ~PAGEMAP_OP_MASK) ||
>> +	    (arg->anyof_mask & ~PAGEMAP_OP_MASK) || (arg->excluded_mask & ~PAGEMAP_OP_MASK) ||
>> +	    (arg->return_mask & ~PAGEMAP_OP_MASK))
>> +		return -EINVAL;
>> +
>> +	if ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || !arg->return_mask)
>> +		return -EINVAL;
>> +
>> +	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
>> +	     (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
>> +		return -EINVAL;
>> +
>> +	end = start + arg->len;
>> +	p.max_pages = arg->max_pages;
>> +	p.found_pages = 0;
>> +	p.flags = arg->flags;
>> +	p.required_mask = arg->required_mask;
>> +	p.anyof_mask = arg->anyof_mask;
>> +	p.excluded_mask = arg->excluded_mask;
>> +	p.return_mask = arg->return_mask;
>> +	p.vec_index = 0;
>> +	p.vec_len = arg->vec_len;
>> +
>> +	if (IS_GET_OP(arg)) {
>> +		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));
> 
> I think we need to set a reasonable limit for vec_len to avoid large
> allocations on the kernel. We can consider to use kmalloc or kvmalloc
> here.
I'll update it to kvzalloc, which falls back to vmalloc if kmalloc fails.
It'll use kmalloc for smaller allocations. Thanks for suggesting it. But
on its own it won't limit the allocation size.
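
For illustration, this is how I read the suggestion (the limit and its
name are made up here, just to show the shape):

	#define PAGEMAP_SCAN_MAX_VEC_LEN	512

	if (IS_GET_OP(arg)) {
		if (!arg->vec_len || arg->vec_len > PAGEMAP_SCAN_MAX_VEC_LEN)
			return -EINVAL;
		p.vec = kvzalloc(arg->vec_len * sizeof(struct page_region),
				 GFP_KERNEL);
		if (!p.vec)
			return -ENOMEM;
	}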

> 
> Thanks,
> Andrei

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
       [not found]   ` <202211120107.cYLiq2cH-lkp@intel.com>
@ 2022-11-11 17:53     ` Muhammad Usama Anjum
  0 siblings, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-11 17:53 UTC (permalink / raw)
  To: kernel test robot, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton,
	Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan,
	open list : PROC FILESYSTEM, Paul Gofman
  Cc: Muhammad Usama Anjum, llvm, oe-kbuild-all, Linux Memory Management List

I've fixed these build failures for the next patch iteration. Please review
and comment on the patch. I'll wait for a few days before sending the next
version.

On 11/11/22 10:13 PM, kernel test robot wrote:
> Hi Muhammad,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on next-20221109]
> [also build test ERROR on v6.1-rc4]
> [cannot apply to shuah-kselftest/next shuah-kselftest/fixes linus/master v6.1-rc4 v6.1-rc3 v6.1-rc2]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/Implement-IOCTL-to-get-and-or-the-clear-info-about-PTEs/20221109-182618
> patch link:    https://lore.kernel.org/r/20221109102303.851281-3-usama.anjum%40collabora.com
> patch subject: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
> config: arm-buildonly-randconfig-r006-20221111
> compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 463da45892e2d2a262277b91b96f5f8c05dc25d0)
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # install arm cross compiling tool for clang build
>         # apt-get install binutils-arm-linux-gnueabi
>         # https://github.com/intel-lab-lkp/linux/commit/b329378abd03a741ff7250ec1b60292c893476da
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/Implement-IOCTL-to-get-and-or-the-clear-info-about-PTEs/20221109-182618
>         git checkout b329378abd03a741ff7250ec1b60292c893476da
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>>> fs/proc/task_mmu.c:1882:41: error: use of undeclared identifier 'HPAGE_SIZE'
>                            if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
>                                                                 ^
>    1 error generated.
> 
> 
> vim +/HPAGE_SIZE +1882 fs/proc/task_mmu.c
> 
>   1856	
>   1857	static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long addr,
>   1858					  unsigned long end, struct mm_walk *walk)
>   1859	{
>   1860		struct pagemap_scan_private *p = walk->private;
>   1861		struct vm_area_struct *vma = walk->vma;
>   1862		unsigned int len;
>   1863		spinlock_t *ptl;
>   1864		int ret = 0;
>   1865		pte_t *pte;
>   1866		bool dirty_vma = (p->flags & PAGEMAP_NO_REUSED_REGIONS) ?
>   1867				 (false) : (vma->vm_flags & VM_SOFTDIRTY);
>   1868	
>   1869		if ((walk->vma->vm_end < addr) || (p->max_pages && p->found_pages == p->max_pages))
>   1870			return 0;
>   1871	
>   1872		end = min(end, walk->vma->vm_end);
>   1873	
>   1874		ptl = pmd_trans_huge_lock(pmd, vma);
>   1875		if (ptl) {
>   1876			if (dirty_vma || check_soft_dirty_pmd(vma, addr, pmd, false)) {
>   1877				/*
>   1878				 * Break huge page into small pages if operation needs to be performed is
>   1879				 * on a portion of the huge page or the return buffer cannot store complete
>   1880				 * data.
>   1881				 */
>> 1882				if ((IS_CLEAR_OP(p) && (end - addr < HPAGE_SIZE))) {
>   1883					spin_unlock(ptl);
>   1884					split_huge_pmd(vma, pmd, addr);
>   1885					goto process_smaller_pages;
>   1886				}
>   1887	
>   1888				if (IS_GET_OP(p)) {
>   1889					len = (end - addr)/PAGE_SIZE;
>   1890					if (p->max_pages && p->found_pages + len > p->max_pages)
>   1891						len = p->max_pages - p->found_pages;
>   1892	
>   1893					ret = add_to_out(dirty_vma ||
>   1894							 check_soft_dirty_pmd(vma, addr, pmd, false),
>   1895							 vma->vm_file, pmd_present(*pmd), is_swap_pmd(*pmd),
>   1896							 p, addr, len);
>   1897				}
>   1898				if (!ret && IS_CLEAR_OP(p))
>   1899					check_soft_dirty_pmd(vma, addr, pmd, true);
>   1900			}
>   1901			spin_unlock(ptl);
>   1902			return 0;
>   1903		}
>   1904	
>   1905	process_smaller_pages:
>   1906		if (pmd_trans_unstable(pmd))
>   1907			return 0;
>   1908	
>   1909		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>   1910		for (; addr < end && !ret && (!p->max_pages || (p->found_pages < p->max_pages))
>   1911		     ; pte++, addr += PAGE_SIZE) {
>   1912			if (IS_GET_OP(p))
>   1913				ret = add_to_out(dirty_vma || check_soft_dirty(vma, addr, pte, false),
>   1914						 vma->vm_file, pte_present(*pte),
>   1915						 is_swap_pte(*pte), p, addr, 1);
>   1916			if (!ret && IS_CLEAR_OP(p))
>   1917				check_soft_dirty(vma, addr, pte, true);
>   1918		}
>   1919		pte_unmap_unlock(pte - 1, ptl);
>   1920		cond_resched();
>   1921	
>   1922		return 0;
>   1923	}
>   1924	
> 

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-11  7:08   ` Muhammad Usama Anjum
@ 2022-11-14 15:46     ` David Hildenbrand
  2022-11-21 15:00       ` Muhammad Usama Anjum
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-11-14 15:46 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton, Paul Gofman
  Cc: Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT

> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the
> dirtiness for reused regions. Clearing the soft-dirty status of whole
> process is straight forward. When we want to clear/monitor the
> soft-dirtiness of a part of the virtual memory, there is a lot of internal
> noise. We don't want the non-dirty pages to become dirty because of how the
> soft-dirty feature has been working. Soft-dirty feature wasn't being used
> the way we want to use now. While monitoring a part of memory, it is not
> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
> when the two VMAs are merged without considering if they both are dirty or
> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are
> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
> decide to merge them backup. It is so waste of resources.

Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY 
property differs. But that might be just one alternative for handling 
this case.
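
Very rough sketch of what I mean (not a proposal; MMF_SOFTDIRTY_STRICT is
a made-up name), based on the VM_SOFTDIRTY exclusion in is_mergeable_vma()
that 34228d473efe introduced:

	unsigned long ignore = test_bit(MMF_SOFTDIRTY_STRICT,
					&vma->vm_mm->flags) ? 0 : VM_SOFTDIRTY;

	/* Only ignore VM_SOFTDIRTY for merging if the process allows it. */
	if ((vma->vm_flags ^ vm_flags) & ~ignore)
		return 0;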

> 
> To keep things consistent, the default behavior of the IOCTL is to output
> even the extra non-dirty pages as dirty from the kernel noise. A optional
> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
> tolerant of extra non-dirty pages. This flag can be considered as something
> which is by-passing the already present buggy implementation in the kernel.
> It is not buggy per say as the issue can be solved if we don't allow the
> two VMA which have different soft-dirty bits to get merged. But we are
> allowing that so that the total number of VMAs doesn't increase. This was
> acceptable at the time, but now with the use case of monitoring a part of
> memory for soft-dirty doesn't want this merging. So either we need to
> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed
> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore
> the extra dirty pages which aren't dirty in reality.
> 
> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to
> find if the pages are dirty. So re-used regions cannot be detected. This
> has the only side-effect of not checking the VMAs. So this is limitation of
> using this flag which should be acceptable in the current state of code.
> This limitation is okay for the users as they can clear the soft-dirty bit
> of the VMA before starting to monitor a range of memory for soft-dirtiness.
> 
> 
>> Please separate that part out from the other changes; I am still not
>> convinced that we want this and what the semantical implications are.
>>
>> Let's take a look at an example: can_change_pte_writable()
>>
>>      /* Do we need write faults for softdirty tracking? */
>>      if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>          return false;
>>
>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>> Tracking is enabled if: vma_soft_dirty_enabled()
>>
>>      /*
>>       * Soft-dirty is kind of special: its tracking is enabled when
>>       * the vma flags not set.
>>       */
>>      return !(vma->vm_flags & VM_SOFTDIRTY);
>>
>> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty
>> PTE bits accordingly.
> Sorry, I'm unable to completely grasp the meaning of the example. We have
> followed clear_refs_write() to write the soft-dirty bit clearing code in
> the current patch. Dirtiness of the VMA and the PTE may be set
> independently. Newer allocated memory has dirty bit set in the VMA. When
> something is written the memory, the soft dirty bit is set in the PTEs as
> well regardless if the soft dirty bit is set in the VMA or not.
> 

Let me try to find a simple explanation:

After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY
set, there are ways that PTE could get written to and the page could
become dirty, without the PTE becoming softdirty.

Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty
values might be stale: there might be entries whose page was written
(and should count as softdirty) even though the PTE is *not* marked
softdirty.

These are, AFAIU, the current semantics, and I am not sure if we want 
user space to explicitly work around that.

>>
>>
>> I'd suggest moving forward without this controversial
>> PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
>> clear add-on we can discuss separately.Like I've described above, I've only added this flag to not get the
> non-dirty pages as dirty. Can there be some alternative to adding this
> flag? Please suggest.

Please split it out into a separate patch for now. We can discuss 
further what the semantics are and if there are better alternatives for 
that. In the meantime, you could move forward without 
PAGEMAP_NO_REUSED_REGIONS while we are discussing them further.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-14 15:46     ` David Hildenbrand
@ 2022-11-21 15:00       ` Muhammad Usama Anjum
  2022-11-21 15:55         ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-21 15:00 UTC (permalink / raw)
  To: David Hildenbrand, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton, Paul Gofman
  Cc: Muhammad Usama Anjum, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT

Hello,

Thank you for replying.

On 11/14/22 8:46 PM, David Hildenbrand wrote:
>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the
>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>> process is straight forward. When we want to clear/monitor the
>> soft-dirtiness of a part of the virtual memory, there is a lot of internal
>> noise. We don't want the non-dirty pages to become dirty because of how the
>> soft-dirty feature has been working. Soft-dirty feature wasn't being used
>> the way we want to use now. While monitoring a part of memory, it is not
>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>> when the two VMAs are merged without considering if they both are dirty or
>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are
>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>> decide to merge them backup. It is so waste of resources.
> 
> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
> property differs. But that might be just one alternative for handling this
> case.
> 
>>
>> To keep things consistent, the default behavior of the IOCTL is to output
>> even the extra non-dirty pages as dirty from the kernel noise. A optional
>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>> tolerant of extra non-dirty pages. This flag can be considered as something
>> which is by-passing the already present buggy implementation in the kernel.
>> It is not buggy per say as the issue can be solved if we don't allow the
>> two VMA which have different soft-dirty bits to get merged. But we are
>> allowing that so that the total number of VMAs doesn't increase. This was
>> acceptable at the time, but now with the use case of monitoring a part of
>> memory for soft-dirty doesn't want this merging. So either we need to
>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed
>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore
>> the extra dirty pages which aren't dirty in reality.
>>
>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to
>> find if the pages are dirty. So re-used regions cannot be detected. This
>> has the only side-effect of not checking the VMAs. So this is limitation of
>> using this flag which should be acceptable in the current state of code.
>> This limitation is okay for the users as they can clear the soft-dirty bit
>> of the VMA before starting to monitor a range of memory for soft-dirtiness.
>>
>>
>>> Please separate that part out from the other changes; I am still not
>>> convinced that we want this and what the semantical implications are.
>>>
>>> Let's take a look at an example: can_change_pte_writable()
>>>
>>>      /* Do we need write faults for softdirty tracking? */
>>>      if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>          return false;
>>>
>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>
>>>      /*
>>>       * Soft-dirty is kind of special: its tracking is enabled when
>>>       * the vma flags not set.
>>>       */
>>>      return !(vma->vm_flags & VM_SOFTDIRTY);
>>>
>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty
>>> PTE bits accordingly.
>> Sorry, I'm unable to completely grasp the meaning of the example. We have
>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>> the current patch. Dirtiness of the VMA and the PTE may be set
>> independently. Newer allocated memory has dirty bit set in the VMA. When
>> something is written the memory, the soft dirty bit is set in the PTEs as
>> well regardless if the soft dirty bit is set in the VMA or not.
>>
> 
> Let me try to find a simple explanation:
> 
> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
> there are ways that PTE could get written to and it could become dirty,
> without the PTE becoming softdirty.
> 
> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
> might be stale: there might be entries that are softdirty even though the
> PTE is *not* marked softdirty.
Can someone please share an example to reproduce this? In all of my
testing, even if I ignore VM_SOFTDIRTY and only base my decision about
soft-dirtiness on the individual pages, it always passes.

> 
> These are, AFAIU, the current semantics, and I am not sure if we want user
> space to explicitly work around that.
> 
>>>
>>>
>>> I'd suggest moving forward without this controversial
>>> PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
>>> clear add-on we can discuss separately.Like I've described above, I've
>>> only added this flag to not get the
>> non-dirty pages as dirty. Can there be some alternative to adding this
>> flag? Please suggest.
> 
> Please split it out into a separate patch for now. We can discuss further
> what the semantics are and if there are better alternatives for that. In
> the meantime, you could move forward without PAGEMAP_NO_REUSED_REGIONS
> while we are discussing them further.
> 

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-21 15:00       ` Muhammad Usama Anjum
@ 2022-11-21 15:55         ` David Hildenbrand
  2022-11-30 11:42           ` Muhammad Usama Anjum
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-11-21 15:55 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton, Paul Gofman
  Cc: Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT

[-- Attachment #1: Type: text/plain, Size: 5526 bytes --]

On 21.11.22 16:00, Muhammad Usama Anjum wrote:
> Hello,
> 
> Thank you for replying.
> 
> On 11/14/22 8:46 PM, David Hildenbrand wrote:
>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the
>>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>>> process is straight forward. When we want to clear/monitor the
>>> soft-dirtiness of a part of the virtual memory, there is a lot of internal
>>> noise. We don't want the non-dirty pages to become dirty because of how the
>>> soft-dirty feature has been working. Soft-dirty feature wasn't being used
>>> the way we want to use now. While monitoring a part of memory, it is not
>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>>> when the two VMAs are merged without considering if they both are dirty or
>>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are
>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>>> decide to merge them backup. It is so waste of resources.
>>
>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
>> property differs. But that might be just one alternative for handling this
>> case.
>>
>>>
>>> To keep things consistent, the default behavior of the IOCTL is to output
>>> even the extra non-dirty pages as dirty from the kernel noise. A optional
>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>>> tolerant of extra non-dirty pages. This flag can be considered as something
>>> which is by-passing the already present buggy implementation in the kernel.
>>> It is not buggy per say as the issue can be solved if we don't allow the
>>> two VMA which have different soft-dirty bits to get merged. But we are
>>> allowing that so that the total number of VMAs doesn't increase. This was
>>> acceptable at the time, but now with the use case of monitoring a part of
>>> memory for soft-dirty doesn't want this merging. So either we need to
>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed
>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to ignore
>>> the extra dirty pages which aren't dirty in reality.
>>>
>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to
>>> find if the pages are dirty. So re-used regions cannot be detected. This
>>> has the only side-effect of not checking the VMAs. So this is limitation of
>>> using this flag which should be acceptable in the current state of code.
>>> This limitation is okay for the users as they can clear the soft-dirty bit
>>> of the VMA before starting to monitor a range of memory for soft-dirtiness.
>>>
>>>
>>>> Please separate that part out from the other changes; I am still not
>>>> convinced that we want this and what the semantical implications are.
>>>>
>>>> Let's take a look at an example: can_change_pte_writable()
>>>>
>>>>       /* Do we need write faults for softdirty tracking? */
>>>>       if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>>           return false;
>>>>
>>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>>
>>>>       /*
>>>>        * Soft-dirty is kind of special: its tracking is enabled when
>>>>        * the vma flags not set.
>>>>        */
>>>>       return !(vma->vm_flags & VM_SOFTDIRTY);
>>>>
>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the soft_dirty
>>>> PTE bits accordingly.
>>> Sorry, I'm unable to completely grasp the meaning of the example. We have
>>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>>> the current patch. Dirtiness of the VMA and the PTE may be set
>>> independently. Newer allocated memory has dirty bit set in the VMA. When
>>> something is written the memory, the soft dirty bit is set in the PTEs as
>>> well regardless if the soft dirty bit is set in the VMA or not.
>>>
>>
>> Let me try to find a simple explanation:
>>
>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
>> there are ways that PTE could get written to and it could become dirty,
>> without the PTE becoming softdirty.
>>
>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
>> might be stale: there might be entries that are softdirty even though the
>> PTE is *not* marked softdirty.
> Can someone please share the example to reproduce this? In all of my
> testing, even if I ignore VM_SOFTDIRTY and only base my decision of
> soft-dirtiness on individual pages, it always passes.

Quick reproducer (the first and easiest one that triggered :) )
attached.

With no kernel changes, it works as expected.

# ./softdirty_mprotect


With the following kernel change to simulate what you propose it fails:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d22687d2e81e..f2c682bf7f64 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
                 flags |= PM_FILE;
         if (page && !migration && page_mapcount(page) == 1)
                 flags |= PM_MMAP_EXCLUSIVE;
-       if (vma->vm_flags & VM_SOFTDIRTY)
-               flags |= PM_SOFT_DIRTY;
+       //if (vma->vm_flags & VM_SOFTDIRTY)
+       //      flags |= PM_SOFT_DIRTY;
  
         return make_pme(frame, flags);
  }


# ./softdirty_mprotect
Page #1 should be softdirty

-- 
Thanks,

David / dhildenb

[-- Attachment #2: softdirty_mprotect.c --]
[-- Type: text/x-csrc, Size: 2677 bytes --]

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/mman.h>

static size_t pagesize;
static int pagemap_fd;

static void clear_softdirty(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
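	/* Writing "4" to clear_refs clears the soft-dirty bits of all mappings. */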
	const char *ctrl = "4";
	int ret;

	if (fd < 0) {
		fprintf(stderr, "open() failed\n");
		exit(1);
	}
	ret = write(fd, ctrl, strlen(ctrl));
	close(fd);
	if (ret != strlen(ctrl)) {
		fprintf(stderr, "write() failed\n");
		exit(1);
	}
}

static uint64_t pagemap_get_entry(int fd, char *start)
{
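	/* /proc/self/pagemap holds one 64-bit entry per virtual page. */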
	const unsigned long pfn = (unsigned long)start / pagesize;
	uint64_t entry;
	int ret;

	ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry));
	if (ret != sizeof(entry)) {
		fprintf(stderr, "pread() failed\n");
		exit(1);
	}

	return entry;
}

static bool pagemap_is_softdirty(int fd, char *start)
{
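	/* Bit 55 of a pagemap entry is the soft-dirty flag. */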
	uint64_t entry = pagemap_get_entry(fd, start);

	return entry & 0x0080000000000000ull;
}

int main(void)
{
	char *mem, *mem2;

	pagesize = getpagesize();
	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
	if (pagemap_fd < 0) {
		fprintf(stderr, "open() failed\n");
		exit(1);
	}

	/* Map 2 pages. */
	mem = mmap(0, 2 * pagesize, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANON, -1, 0);
	if (mem == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

	/* Populate both pages. */
	memset(mem, 1, 2 * pagesize);

	if (!pagemap_is_softdirty(pagemap_fd, mem))
		fprintf(stderr, "Page #1 should be softdirty\n");
	if (!pagemap_is_softdirty(pagemap_fd, mem + pagesize))
		fprintf(stderr, "Page #2 should be softdirty\n");

	/*
	 * Start softdirty tracking. Clear VM_SOFTDIRTY and clear the softdirty
	 * PTE bit.
	 */
	clear_softdirty();

	if (pagemap_is_softdirty(pagemap_fd, mem))
		fprintf(stderr, "Page #1 should not be softdirty\n");
	if (pagemap_is_softdirty(pagemap_fd, mem + pagesize))
		fprintf(stderr, "Page #2 should not be softdirty\n");

	/*
	 * Remap the second page. The VMA gets VM_SOFTDIRTY set. Both VMAs
	 * get merged such that the resulting VMA has VM_SOFTDIRTY set.
	 */
	mem2 = mmap(mem + pagesize, pagesize, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE|MAP_ANON|MAP_FIXED, -1, 0);
	if (mem2 == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		exit(1);
	}

	/* Protect + unprotect. */
	mprotect(mem, 2 * pagesize, PROT_READ);
	mprotect(mem, 2 * pagesize, PROT_READ|PROT_WRITE);

	/* Modify both pages. */
	memset(mem, 2, 2 * pagesize);

	if (!pagemap_is_softdirty(pagemap_fd, mem))
		fprintf(stderr, "Page #1 should be softdirty\n");
	if (!pagemap_is_softdirty(pagemap_fd, mem + pagesize))
		fprintf(stderr, "Page #2 should be softdirty\n");
}

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
                   ` (3 preceding siblings ...)
  2022-11-09 10:34 ` [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs David Hildenbrand
@ 2022-11-23 14:11 ` Peter Xu
  4 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2022-11-23 14:11 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman, Andrea Arcangeli

On Wed, Nov 09, 2022 at 03:23:00PM +0500, Muhammad Usama Anjum wrote:
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping

Userfaultfd is definitely slow in this case because it needs the messaging
round trip that happens between two different threads synchronously, so it
takes even more scheduling effort than mprotect.

I saw the other patch on vma merging with SOFTDIRTY; I didn't look deeper
there, but IIUC it won't really help much: if the other commit (34228d47)
can't be reverted then it seems to help nothing.  And it does look risky
to revert that, because the same commit mentions the case where one can
clear refs right before a vma merge, so it is definitely worth more thought
and testing, which I agree with you on.

I'm thinking whether the vma issue can be totally avoided.  For example by
providing an async version of uffd-wp.

Currently uffd-wp must be synchronous and it'll be slow, but it serves
specific purposes.  And this is definitely not the 1st time any of us has
thought about uffd-wp being async; it's just that we need to solve the
problem of where to store the dirty information.

Actually we could also use some other storage form, but so far I haven't
thought of anything that's easy and clean.  The current soft-dirty bit also
has its defects (e.g. the need to take the mmap lock and walk the pgtables),
but that part will be the same as soft-dirty for now.

Now I'm wildly thinking whether we can just reuse the soft-dirty bit already
defined in the ptes.  The GET interface could be similar to the one proposed
here, or treated as a separate issue.

So _maybe_ we can have a feature (bound to the uffd context) for uffd that
enables async uffd-wp, in which case the wr-protect fault does not send any
message anymore (nor enqueue one) but instead sets the soft-dirty bit,
quickly resolves the write bit immediately, and continues the fault.

Clearing of the soft-dirty bit needs to be done in UFFDIO_WRITEPROTECT
alongside clearing the uffd-wp bit.  So on that part the current GET+CLEAR
interface for pagemap may need to be replaced.  And frankly, it feels weird
to me to allow changing the mm layout in pagemap ioctls.  With this we can
keep the pagemap interface to only fetch information, like before.

A major benefit of using uffd is that uffd is by nature pte-based, so no
fiddling with vmas is needed at all.  Firstly, there is no need to worry
about merging vmas with tons of false positives.  Meanwhile, one can easily
wr-protect at page-size granularity.  The wr-protect is not governed by a
vma flag anymore but is based on the uffd-wp flag, so there is no extra
overhead on any page the monitor is not interested in.  There's already
infrastructure code for persisting the uffd-wp bit, so it would naturally
work similarly for an async mode if it ever comes into the world.
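
For reference, a minimal userspace sketch (not part of the series) of the
existing synchronous uffd-wp interface that such an async mode would
presumably build on: registering a range and write-protecting it.  In the
current synchronous mode a separate monitor thread still has to read the
uffd and resolve every write fault, which is exactly the round-trip cost
mentioned above.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Sketch only: error handling trimmed; anon uffd-wp needs a 5.7+ kernel. */
static int uffd_wp_range(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { 0 };
	struct uffdio_writeprotect wp = { 0 };
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return -1;

	reg.range.start = (unsigned long)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;

	wp.range.start = (unsigned long)addr;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;	/* pass mode 0 later to resolve */
	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}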

It's just that we'll also need to consider exclusive use of the bit, so
we'll need to fail clear_refs on vmas where we have VM_UFFD_WP and the
async feature enabled.  I would hope that's very rare, but it is worth
thinking about the side effects.  The same will need to apply to
UFFDIO_REGISTER in async wp mode when soft-dirty is enabled: we'll need to
bail out there too.

That said, this is not a suggestion of a new design, but just something I
thought about when reading this and quickly wrote down.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-21 15:55         ` David Hildenbrand
@ 2022-11-30 11:42           ` Muhammad Usama Anjum
  2022-11-30 12:10             ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-11-30 11:42 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Muhammad Usama Anjum, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Paul Gofman, Cyrill Gorcunov

On 11/21/22 8:55 PM, David Hildenbrand wrote:
> On 21.11.22 16:00, Muhammad Usama Anjum wrote:
>> Hello,
>>
>> Thank you for replying.
>>
>> On 11/14/22 8:46 PM, David Hildenbrand wrote:
>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the
>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>>>> process is straight forward. When we want to clear/monitor the
>>>> soft-dirtiness of a part of the virtual memory, there is a lot of internal
>>>> noise. We don't want the non-dirty pages to become dirty because of how
>>>> the
>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being used
>>>> the way we want to use now. While monitoring a part of memory, it is not
>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>>>> when the two VMAs are merged without considering if they both are dirty or
>>>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are
>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>>>> decide to merge them backup. It is so waste of resources.
>>>
>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
>>> property differs. But that might be just one alternative for handling this
>>> case.
>>>
>>>>
>>>> To keep things consistent, the default behavior of the IOCTL is to output
>>>> even the extra non-dirty pages as dirty from the kernel noise. A optional
>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>>>> tolerant of extra non-dirty pages. This flag can be considered as
>>>> something
>>>> which is by-passing the already present buggy implementation in the
>>>> kernel.
>>>> It is not buggy per say as the issue can be solved if we don't allow the
>>>> two VMA which have different soft-dirty bits to get merged. But we are
>>>> allowing that so that the total number of VMAs doesn't increase. This was
>>>> acceptable at the time, but now with the use case of monitoring a part of
>>>> memory for soft-dirty doesn't want this merging. So either we need to
>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed
>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to
>>>> ignore
>>>> the extra dirty pages which aren't dirty in reality.
>>>>
>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to
>>>> find if the pages are dirty. So re-used regions cannot be detected. This
>>>> has the only side-effect of not checking the VMAs. So this is
>>>> limitation of
>>>> using this flag which should be acceptable in the current state of code.
>>>> This limitation is okay for the users as they can clear the soft-dirty bit
>>>> of the VMA before starting to monitor a range of memory for
>>>> soft-dirtiness.
>>>>
>>>>
>>>>> Please separate that part out from the other changes; I am still not
>>>>> convinced that we want this and what the semantical implications are.
>>>>>
>>>>> Let's take a look at an example: can_change_pte_writable()
>>>>>
>>>>>       /* Do we need write faults for softdirty tracking? */
>>>>>       if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>>>           return false;
>>>>>
>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>>>
>>>>>       /*
>>>>>        * Soft-dirty is kind of special: its tracking is enabled when
>>>>>        * the vma flags not set.
>>>>>        */
>>>>>       return !(vma->vm_flags & VM_SOFTDIRTY);
>>>>>
>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the
>>>>> soft_dirty
>>>>> PTE bits accordingly.
>>>> Sorry, I'm unable to completely grasp the meaning of the example. We have
>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>>>> the current patch. Dirtiness of the VMA and the PTE may be set
>>>> independently. Newer allocated memory has dirty bit set in the VMA. When
>>>> something is written the memory, the soft dirty bit is set in the PTEs as
>>>> well regardless if the soft dirty bit is set in the VMA or not.
>>>>
>>>
>>> Let me try to find a simple explanation:
>>>
>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
>>> there are ways that PTE could get written to and it could become dirty,
>>> without the PTE becoming softdirty.
>>>
>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
>>> might be stale: there might be entries that are softdirty even though the
>>> PTE is *not* marked softdirty.
>> Can someone please share the example to reproduce this? In all of my
>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of
>> soft-dirtiness on individual pages, it always passes.
> 
> Quick reproducer (the first and easiest one that triggered :) )
> attached.
> 
> With no kernel changes, it works as expected.
> 
> # ./softdirty_mprotect
> 
> 
> With the following kernel change to simulate what you propose it fails:
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d22687d2e81e..f2c682bf7f64 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct
> pagemapread *pm,
>                 flags |= PM_FILE;
>         if (page && !migration && page_mapcount(page) == 1)
>                 flags |= PM_MMAP_EXCLUSIVE;
> -       if (vma->vm_flags & VM_SOFTDIRTY)
> -               flags |= PM_SOFT_DIRTY;
> +       //if (vma->vm_flags & VM_SOFTDIRTY)
> +       //      flags |= PM_SOFT_DIRTY;
>  
>         return make_pme(frame, flags);
>  }
> 
> 
> # ./softdirty_mprotect
> Page #1 should be softdirty
> 
Thank you so much for sharing the issue and reproducer.

After remapping the second part of the memory and mprotect()-protecting and
then unprotecting the whole range, the PTE of the first half of the memory
doesn't get marked as soft-dirty even after writing to it multiple times.
Even if soft-dirtiness is cleared on the whole process, the PTE of the
first-half memory doesn't get dirty. This seems like more of a bug in
mprotect: mprotect should not mess with the soft-dirty flag in the PTEs.

I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in
the PTEs should work correctly irrespective of whether VM_SOFTDIRTY is set
on the VMA or not.

Cyrill has said in [1]:
> ioctl might be an option indeed
It brings some hope to this patch-set.

[1] https://lore.kernel.org/all/Y4W0axw0ZgORtfkt@grain/

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-30 11:42           ` Muhammad Usama Anjum
@ 2022-11-30 12:10             ` David Hildenbrand
  2022-12-05 15:29               ` Muhammad Usama Anjum
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-11-30 12:10 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Peter Xu
  Cc: Suren Baghdasaryan, Greg KH, Christian Brauner, Yang Shi,
	Vlastimil Babka, Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Paul Gofman, Cyrill Gorcunov

On 30.11.22 12:42, Muhammad Usama Anjum wrote:
> On 11/21/22 8:55 PM, David Hildenbrand wrote:
>> On 21.11.22 16:00, Muhammad Usama Anjum wrote:
>>> Hello,
>>>
>>> Thank you for replying.
>>>
>>> On 11/14/22 8:46 PM, David Hildenbrand wrote:
>>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store the
>>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>>>>> process is straight forward. When we want to clear/monitor the
>>>>> soft-dirtiness of a part of the virtual memory, there is a lot of internal
>>>>> noise. We don't want the non-dirty pages to become dirty because of how
>>>>> the
>>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being used
>>>>> the way we want to use now. While monitoring a part of memory, it is not
>>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>>>>> when the two VMAs are merged without considering if they both are dirty or
>>>>> not (34228d473efe). To monitor changes over the memory, sometimes VMAs are
>>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>>>>> decide to merge them backup. It is so waste of resources.
>>>>
>>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
>>>> property differs. But that might be just one alternative for handling this
>>>> case.
>>>>
>>>>>
>>>>> To keep things consistent, the default behavior of the IOCTL is to output
>>>>> even the extra non-dirty pages as dirty from the kernel noise. A optional
>>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>>>>> tolerant of extra non-dirty pages. This flag can be considered as
>>>>> something
>>>>> which is by-passing the already present buggy implementation in the
>>>>> kernel.
>>>>> It is not buggy per say as the issue can be solved if we don't allow the
>>>>> two VMA which have different soft-dirty bits to get merged. But we are
>>>>> allowing that so that the total number of VMAs doesn't increase. This was
>>>>> acceptable at the time, but now with the use case of monitoring a part of
>>>>> memory for soft-dirty doesn't want this merging. So either we need to
>>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be needed
>>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to
>>>>> ignore
>>>>> the extra dirty pages which aren't dirty in reality.
>>>>>
>>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are checked to
>>>>> find if the pages are dirty. So re-used regions cannot be detected. This
>>>>> has the only side-effect of not checking the VMAs. So this is
>>>>> limitation of
>>>>> using this flag which should be acceptable in the current state of code.
>>>>> This limitation is okay for the users as they can clear the soft-dirty bit
>>>>> of the VMA before starting to monitor a range of memory for
>>>>> soft-dirtiness.
>>>>>
>>>>>
>>>>>> Please separate that part out from the other changes; I am still not
>>>>>> convinced that we want this and what the semantical implications are.
>>>>>>
>>>>>> Let's take a look at an example: can_change_pte_writable()
>>>>>>
>>>>>>        /* Do we need write faults for softdirty tracking? */
>>>>>>        if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>>>>            return false;
>>>>>>
>>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>>>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>>>>
>>>>>>        /*
>>>>>>         * Soft-dirty is kind of special: its tracking is enabled when
>>>>>>         * the vma flags not set.
>>>>>>         */
>>>>>>        return !(vma->vm_flags & VM_SOFTDIRTY);
>>>>>>
>>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the
>>>>>> soft_dirty
>>>>>> PTE bits accordingly.
>>>>> Sorry, I'm unable to completely grasp the meaning of the example. We have
>>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>>>>> the current patch. Dirtiness of the VMA and the PTE may be set
>>>>> independently. Newer allocated memory has dirty bit set in the VMA. When
>>>>> something is written the memory, the soft dirty bit is set in the PTEs as
>>>>> well regardless if the soft dirty bit is set in the VMA or not.
>>>>>
>>>>
>>>> Let me try to find a simple explanation:
>>>>
>>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
>>>> there are ways that PTE could get written to and it could become dirty,
>>>> without the PTE becoming softdirty.
>>>>
>>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
>>>> might be stale: there might be entries that are softdirty even though the
>>>> PTE is *not* marked softdirty.
>>> Can someone please share the example to reproduce this? In all of my
>>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of
>>> soft-dirtiness on individual pages, it always passes.
>>
>> Quick reproducer (the first and easiest one that triggered :) )
>> attached.
>>
>> With no kernel changes, it works as expected.
>>
>> # ./softdirty_mprotect
>>
>>
>> With the following kernel change to simulate what you propose it fails:
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index d22687d2e81e..f2c682bf7f64 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct
>> pagemapread *pm,
>>                  flags |= PM_FILE;
>>          if (page && !migration && page_mapcount(page) == 1)
>>                  flags |= PM_MMAP_EXCLUSIVE;
>> -       if (vma->vm_flags & VM_SOFTDIRTY)
>> -               flags |= PM_SOFT_DIRTY;
>> +       //if (vma->vm_flags & VM_SOFTDIRTY)
>> +       //      flags |= PM_SOFT_DIRTY;
>>   
>>          return make_pme(frame, flags);
>>   }
>>
>>
>> # ./softdirty_mprotect
>> Page #1 should be softdirty
>>
> Thank you so much for sharing the issue and reproducer.
> 
> After remapping the second part of the memory and m-protecting +
> m-unprotecting the whole memory, the PTE of the first half of the memory
> doesn't get marked as soft dirty even after writing multiple times to it.
> Even if soft-dirtiness is cleared on the whole process, the PTE of the
> first half memory doesn't get dirty. This seems like more of a bug in
> mprotect. The mprotect should not mess up with the soft-dirty flag in the PTEs.
> 
> I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in
> PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or
> not on the VMA.

No, it's not a bug, and these are not the VM_SOFTDIRTY semantics -- you just
think they should work like this. As people explained, VM_SOFTDIRTY has
implied *until now* that any PTE is considered softdirty. And there are
other scenarios besides mprotect() that can trigger something similar.

Sorry if I sound annoyed, but please

1) factor out that from your patch set for now
2) find a way to handle this cleanly, for example, not merging VMAs that
    differ in VM_SOFTDIRTY

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-11-30 12:10             ` David Hildenbrand
@ 2022-12-05 15:29               ` Muhammad Usama Anjum
  2022-12-05 15:39                 ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-12-05 15:29 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu, Cyrill Gorcunov
  Cc: Muhammad Usama Anjum, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Paul Gofman

On 11/30/22 5:10 PM, David Hildenbrand wrote:
> On 30.11.22 12:42, Muhammad Usama Anjum wrote:
>> On 11/21/22 8:55 PM, David Hildenbrand wrote:
>>> On 21.11.22 16:00, Muhammad Usama Anjum wrote:
>>>> Hello,
>>>>
>>>> Thank you for replying.
>>>>
>>>> On 11/14/22 8:46 PM, David Hildenbrand wrote:
>>>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store
>>>>>> the
>>>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>>>>>> process is straight forward. When we want to clear/monitor the
>>>>>> soft-dirtiness of a part of the virtual memory, there is a lot of
>>>>>> internal
>>>>>> noise. We don't want the non-dirty pages to become dirty because of how
>>>>>> the
>>>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being
>>>>>> used
>>>>>> the way we want to use now. While monitoring a part of memory, it is not
>>>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>>>>>> when the two VMAs are merged without considering if they both are
>>>>>> dirty or
>>>>>> not (34228d473efe). To monitor changes over the memory, sometimes
>>>>>> VMAs are
>>>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>>>>>> decide to merge them backup. It is so waste of resources.
>>>>>
>>>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
>>>>> property differs. But that might be just one alternative for handling
>>>>> this
>>>>> case.
>>>>>
>>>>>>
>>>>>> To keep things consistent, the default behavior of the IOCTL is to
>>>>>> output
>>>>>> even the extra non-dirty pages as dirty from the kernel noise. A
>>>>>> optional
>>>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>>>>>> tolerant of extra non-dirty pages. This flag can be considered as
>>>>>> something
>>>>>> which is by-passing the already present buggy implementation in the
>>>>>> kernel.
>>>>>> It is not buggy per say as the issue can be solved if we don't allow the
>>>>>> two VMA which have different soft-dirty bits to get merged. But we are
>>>>>> allowing that so that the total number of VMAs doesn't increase. This
>>>>>> was
>>>>>> acceptable at the time, but now with the use case of monitoring a
>>>>>> part of
>>>>>> memory for soft-dirty doesn't want this merging. So either we need to
>>>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be
>>>>>> needed
>>>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to
>>>>>> ignore
>>>>>> the extra dirty pages which aren't dirty in reality.
>>>>>>
>>>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are
>>>>>> checked to
>>>>>> find if the pages are dirty. So re-used regions cannot be detected. This
>>>>>> has the only side-effect of not checking the VMAs. So this is
>>>>>> limitation of
>>>>>> using this flag which should be acceptable in the current state of code.
>>>>>> This limitation is okay for the users as they can clear the
>>>>>> soft-dirty bit
>>>>>> of the VMA before starting to monitor a range of memory for
>>>>>> soft-dirtiness.
>>>>>>
>>>>>>
>>>>>>> Please separate that part out from the other changes; I am still not
>>>>>>> convinced that we want this and what the semantical implications are.
>>>>>>>
>>>>>>> Let's take a look at an example: can_change_pte_writable()
>>>>>>>
>>>>>>>        /* Do we need write faults for softdirty tracking? */
>>>>>>>        if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>>>>>            return false;
>>>>>>>
>>>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>>>>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>>>>>
>>>>>>>        /*
>>>>>>>         * Soft-dirty is kind of special: its tracking is enabled when
>>>>>>>         * the vma flags not set.
>>>>>>>         */
>>>>>>>        return !(vma->vm_flags & VM_SOFTDIRTY);
>>>>>>>
>>>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the
>>>>>>> soft_dirty
>>>>>>> PTE bits accordingly.
>>>>>> Sorry, I'm unable to completely grasp the meaning of the example. We
>>>>>> have
>>>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>>>>>> the current patch. Dirtiness of the VMA and the PTE may be set
>>>>>> independently. Newer allocated memory has dirty bit set in the VMA. When
>>>>>> something is written the memory, the soft dirty bit is set in the
>>>>>> PTEs as
>>>>>> well regardless if the soft dirty bit is set in the VMA or not.
>>>>>>
>>>>>
>>>>> Let me try to find a simple explanation:
>>>>>
>>>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
>>>>> there are ways that PTE could get written to and it could become dirty,
>>>>> without the PTE becoming softdirty.
>>>>>
>>>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
>>>>> might be stale: there might be entries that are softdirty even though the
>>>>> PTE is *not* marked softdirty.
>>>> Can someone please share the example to reproduce this? In all of my
>>>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of
>>>> soft-dirtiness on individual pages, it always passes.
>>>
>>> Quick reproducer (the first and easiest one that triggered :) )
>>> attached.
>>>
>>> With no kernel changes, it works as expected.
>>>
>>> # ./softdirty_mprotect
>>>
>>>
>>> With the following kernel change to simulate what you propose it fails:
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index d22687d2e81e..f2c682bf7f64 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct
>>> pagemapread *pm,
>>>                  flags |= PM_FILE;
>>>          if (page && !migration && page_mapcount(page) == 1)
>>>                  flags |= PM_MMAP_EXCLUSIVE;
>>> -       if (vma->vm_flags & VM_SOFTDIRTY)
>>> -               flags |= PM_SOFT_DIRTY;
>>> +       //if (vma->vm_flags & VM_SOFTDIRTY)
>>> +       //      flags |= PM_SOFT_DIRTY;
>>>            return make_pme(frame, flags);
>>>   }
>>>
>>>
>>> # ./softdirty_mprotect
>>> Page #1 should be softdirty
>>>
>> Thank you so much for sharing the issue and reproducer.
>>
>> After remapping the second part of the memory and m-protecting +
>> m-unprotecting the whole memory, the PTE of the first half of the memory
>> doesn't get marked as soft dirty even after writing multiple times to it.
>> Even if soft-dirtiness is cleared on the whole process, the PTE of the
>> first half memory doesn't get dirty. This seems like more of a bug in
>> mprotect. The mprotect should not mess up with the soft-dirty flag in the
>> PTEs.
>>
>> I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in
>> PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or
>> not on the VMA.
> 
> No, it's not a bug and these are not the VM_SOFTDIRTY semantics -- just
> because you think they should be like this. As people explained,
> VM_SOFTDIRTY implies *until now* that any PTE is consideres softdirty. And
> there are other scenarios that can similarly trigger something like that,
> besides mprotect().
> 
> Sorry if I sound annoyed, but please
> 
> 1) factor out that from your patch set for now
> 2) find a way to handle this cleanly, for example, not merging VMAs that
>    differ in VM_SOFTDIRTY
> 

I'm extremely sorry for the annoyance. I absolutely understand your point.
The problem is that half of this IOCTL wouldn't be useful without solving
the extra soft-dirty pages issue. We don't want to upstream something we
wouldn't be using until (2) is solved. This is why we are trying to solve
point (2) before upstreaming (1). I'm working on ideas on how this can be
resolved or redesigned entirely. Maybe Cyrill will share his ideas soon,
once he has some time. He was involved in the soft-dirty feature
development.

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
  2022-12-05 15:29               ` Muhammad Usama Anjum
@ 2022-12-05 15:39                 ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-12-05 15:39 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Peter Xu, Cyrill Gorcunov
  Cc: Suren Baghdasaryan, Greg KH, Christian Brauner, Yang Shi,
	Vlastimil Babka, Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Paul Gofman

On 05.12.22 16:29, Muhammad Usama Anjum wrote:
> On 11/30/22 5:10 PM, David Hildenbrand wrote:
>> On 30.11.22 12:42, Muhammad Usama Anjum wrote:
>>> On 11/21/22 8:55 PM, David Hildenbrand wrote:
>>>> On 21.11.22 16:00, Muhammad Usama Anjum wrote:
>>>>> Hello,
>>>>>
>>>>> Thank you for replying.
>>>>>
>>>>> On 11/14/22 8:46 PM, David Hildenbrand wrote:
>>>>>>> The soft-dirtiness is stored in the PTE. VMA is marked dirty to store
>>>>>>> the
>>>>>>> dirtiness for reused regions. Clearing the soft-dirty status of whole
>>>>>>> process is straight forward. When we want to clear/monitor the
>>>>>>> soft-dirtiness of a part of the virtual memory, there is a lot of
>>>>>>> internal
>>>>>>> noise. We don't want the non-dirty pages to become dirty because of how
>>>>>>> the
>>>>>>> soft-dirty feature has been working. Soft-dirty feature wasn't being
>>>>>>> used
>>>>>>> the way we want to use now. While monitoring a part of memory, it is not
>>>>>>> acceptable to get non-dirty pages as dirty. Non-dirty pages become dirty
>>>>>>> when the two VMAs are merged without considering if they both are
>>>>>>> dirty or
>>>>>>> not (34228d473efe). To monitor changes over the memory, sometimes
>>>>>>> VMAs are
>>>>>>> split to clear the soft-dirty bit in the VMA flags. But sometimes kernel
>>>>>>> decide to merge them backup. It is so waste of resources.
>>>>>>
>>>>>> Maybe you'd want a per-process option to not merge if the VM_SOFTDIRTY
>>>>>> property differs. But that might be just one alternative for handling
>>>>>> this
>>>>>> case.
>>>>>>
>>>>>>>
>>>>>>> To keep things consistent, the default behavior of the IOCTL is to
>>>>>>> output
>>>>>>> even the extra non-dirty pages as dirty from the kernel noise. A
>>>>>>> optional
>>>>>>> PAGEMAP_NO_REUSED_REGIONS flag is added for those use cases which aren't
>>>>>>> tolerant of extra non-dirty pages. This flag can be considered as
>>>>>>> something
>>>>>>> which is by-passing the already present buggy implementation in the
>>>>>>> kernel.
>>>>>>> It is not buggy per say as the issue can be solved if we don't allow the
>>>>>>> two VMA which have different soft-dirty bits to get merged. But we are
>>>>>>> allowing that so that the total number of VMAs doesn't increase. This
>>>>>>> was
>>>>>>> acceptable at the time, but now with the use case of monitoring a
>>>>>>> part of
>>>>>>> memory for soft-dirty doesn't want this merging. So either we need to
>>>>>>> revert 34228d473efe and PAGEMAP_NO_REUSED_REGIONS flag will not be
>>>>>>> needed
>>>>>>> or we should allow PAGEMAP_NO_REUSED_REGIONS or similar mechanism to
>>>>>>> ignore
>>>>>>> the extra dirty pages which aren't dirty in reality.
>>>>>>>
>>>>>>> When PAGEMAP_NO_REUSED_REGIONS flag is used, only the PTEs are
>>>>>>> checked to
>>>>>>> find if the pages are dirty. So re-used regions cannot be detected. This
>>>>>>> has the only side-effect of not checking the VMAs. So this is
>>>>>>> limitation of
>>>>>>> using this flag which should be acceptable in the current state of code.
>>>>>>> This limitation is okay for the users as they can clear the
>>>>>>> soft-dirty bit
>>>>>>> of the VMA before starting to monitor a range of memory for
>>>>>>> soft-dirtiness.
>>>>>>>
>>>>>>>
>>>>>>>> Please separate that part out from the other changes; I am still not
>>>>>>>> convinced that we want this and what the semantical implications are.
>>>>>>>>
>>>>>>>> Let's take a look at an example: can_change_pte_writable()
>>>>>>>>
>>>>>>>>         /* Do we need write faults for softdirty tracking? */
>>>>>>>>         if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
>>>>>>>>             return false;
>>>>>>>>
>>>>>>>> We care about PTE softdirty tracking, if it is enabled for the VMA.
>>>>>>>> Tracking is enabled if: vma_soft_dirty_enabled()
>>>>>>>>
>>>>>>>>         /*
>>>>>>>>          * Soft-dirty is kind of special: its tracking is enabled when
>>>>>>>>          * the vma flags not set.
>>>>>>>>          */
>>>>>>>>         return !(vma->vm_flags & VM_SOFTDIRTY);
>>>>>>>>
>>>>>>>> Consequently, if VM_SOFTDIRTY is set, we are not considering the
>>>>>>>> soft_dirty
>>>>>>>> PTE bits accordingly.
>>>>>>> Sorry, I'm unable to completely grasp the meaning of the example. We
>>>>>>> have
>>>>>>> followed clear_refs_write() to write the soft-dirty bit clearing code in
>>>>>>> the current patch. Dirtiness of the VMA and the PTE may be set
>>>>>>> independently. Newer allocated memory has dirty bit set in the VMA. When
>>>>>>> something is written the memory, the soft dirty bit is set in the
>>>>>>> PTEs as
>>>>>>> well regardless if the soft dirty bit is set in the VMA or not.
>>>>>>>
>>>>>>
>>>>>> Let me try to find a simple explanation:
>>>>>>
>>>>>> After clearing a SOFTDIRTY PTE flag inside an area with VM_SOFTDIRTY set,
>>>>>> there are ways that PTE could get written to and it could become dirty,
>>>>>> without the PTE becoming softdirty.
>>>>>>
>>>>>> Essentially, inside a VMA with VM_SOFTDIRTY set, the PTE softdirty values
>>>>>> might be stale: there might be entries that are softdirty even though the
>>>>>> PTE is *not* marked softdirty.
>>>>> Can someone please share the example to reproduce this? In all of my
>>>>> testing, even if I ignore VM_SOFTDIRTY and only base my decision of
>>>>> soft-dirtiness on individual pages, it always passes.
>>>>
>>>> Quick reproducer (the first and easiest one that triggered :) )
>>>> attached.
>>>>
>>>> With no kernel changes, it works as expected.
>>>>
>>>> # ./softdirty_mprotect
>>>>
>>>>
>>>> With the following kernel change to simulate what you propose it fails:
>>>>
>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>>> index d22687d2e81e..f2c682bf7f64 100644
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
>>>> @@ -1457,8 +1457,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct
>>>> pagemapread *pm,
>>>>                   flags |= PM_FILE;
>>>>           if (page && !migration && page_mapcount(page) == 1)
>>>>                   flags |= PM_MMAP_EXCLUSIVE;
>>>> -       if (vma->vm_flags & VM_SOFTDIRTY)
>>>> -               flags |= PM_SOFT_DIRTY;
>>>> +       //if (vma->vm_flags & VM_SOFTDIRTY)
>>>> +       //      flags |= PM_SOFT_DIRTY;
>>>>             return make_pme(frame, flags);
>>>>    }
>>>>
>>>>
>>>> # ./softdirty_mprotect
>>>> Page #1 should be softdirty
>>>>
>>> Thank you so much for sharing the issue and reproducer.
>>>
>>> After remapping the second part of the memory and m-protecting +
>>> m-unprotecting the whole memory, the PTE of the first half of the memory
>>> doesn't get marked as soft dirty even after writing multiple times to it.
>>> Even if soft-dirtiness is cleared on the whole process, the PTE of the
>>> first half memory doesn't get dirty. This seems like more of a bug in
>>> mprotect. The mprotect should not mess up with the soft-dirty flag in the
>>> PTEs.
>>>
>>> I'm debugging this. I hope to find the issue soon. Soft-dirty tracking in
>>> PTEs should be working correctly irrespective of the VM_SOFTDIRTY is set or
>>> not on the VMA.
>>
>> No, it's not a bug and these are not the VM_SOFTDIRTY semantics -- just
>> because you think they should be like this. As people explained,
>> VM_SOFTDIRTY implies *until now* that any PTE is consideres softdirty. And
>> there are other scenarios that can similarly trigger something like that,
>> besides mprotect().
>>
>> Sorry if I sound annoyed, but please
>>
>> 1) factor out that from your patch set for now
>> 2) find a way to handle this cleanly, for example, not merging VMAs that
>>     differ in VM_SOFTDIRTY
>>
> 
> I'm extremely sorry for the annoyance. I absolutely understand your point.

No need to be sorry :)

> The problem is that the half of this IOCTL wouldn't be useful without
> solving the extra soft-dirty pages issue. We don't want to upstream
> something which we wouldn't be using until 2 is solved. This is why we are
> trying to solve the point 2 before upstreaming the 1. I'm working on ideas
> on how this can be resolved or redesigned entirely. Maybe Cyril will share
> the ideas soon once he has some time. He was involved in the soft-dirty
> feature development.

Got it, thanks for the info on usability without this feature. Let me
make my point clearer: exposing VM_SOFTDIRTY details to user space and
providing different kinds of "soft dirty" really is sub-optimal from an
ABI POV.

It would be strongly preferred if we could just find a way to improve the
soft-dirty implementation such that no such ABI hacks are required and the
existing interface provides the semantics you want. For example, if we
could rework the VMA merging case, that would be really preferable.
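
For illustration only (a simplified sketch of mm/mmap.c around this era,
not code from the series): the merge-eligibility check is where VM_SOFTDIRTY
is currently masked out; comparing the full flag word instead would keep
VMAs with differing soft-dirty state separate.

/* Simplified sketch of is_mergeable_vma(); details vary by kernel version. */
static bool is_mergeable_vma(struct vm_area_struct *vma, struct file *file,
			     unsigned long vm_flags)
{
	/*
	 * Today VM_SOFTDIRTY is ignored in the comparison (34228d473efe),
	 * so a soft-dirty VMA and a clean one can merge.  Dropping the
	 * ~VM_SOFTDIRTY mask would prevent exactly that.
	 */
	if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY)
		return false;
	if (vma->vm_file != file)
		return false;
	if (vma->vm_ops && vma->vm_ops->close)
		return false;
	return true;	/* further checks (uffd context, anon name) elided */
}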

I understand that we might want more fine-grained soft-dirty clearing 
IOCTLs. My primary concern is regarding the VM_SOFTDIRTY special-casing 
just when observing whether a PTE is softdirty.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
  2022-11-09 23:54   ` Andrei Vagin
       [not found]   ` <202211120107.cYLiq2cH-lkp@intel.com>
@ 2022-12-12 20:42   ` Cyrill Gorcunov
  2022-12-13 13:04     ` Muhammad Usama Anjum
  2 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2022-12-12 20:42 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

On Wed, Nov 09, 2022 at 03:23:02PM +0500, Muhammad Usama Anjum wrote:
...
> +
> +static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	struct mmu_notifier_range range;
> +	unsigned long __user start, end;
> +	struct pagemap_scan_private p;
> +	int ret;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +
> +	if (IS_GET_OP(arg) &&
> +	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
> +		return -ENOMEM;
> +
> +	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
> +	     (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
> +		return -EINVAL;
> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.vec_index = 0;
> +	p.vec_len = arg->vec_len;
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));
> +		if (!p.vec)
> +			return -ENOMEM;
> +	} else {
> +		p.vec = NULL;
> +	}

Hi Muhammad! I'm really sorry for diving in so late (unfortunately too busy to
step in until now). Anyway, while in general such an interface looks reasonable,
here are a few moments which really bother me: as far as I understand you don't
need vzalloc here, plain vmalloc should work as well since you copy only filled
results back to userspace. Next -- there is no restriction on the vec_len
parameter; isn't that a door for DoS from userspace? Say I could start a number
of ioctls on the same pagemap and try to allocate a very big amount of vec_len
in summary, causing big pressure on the kernel's memory. Or am I missing
something obvious here?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  2022-12-12 20:42   ` Cyrill Gorcunov
@ 2022-12-13 13:04     ` Muhammad Usama Anjum
  2022-12-13 22:22       ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2022-12-13 13:04 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Alexander Viro, Andrew Morton,
	Suren Baghdasaryan, Greg KH, Christian Brauner, Peter Xu,
	Yang Shi, Vlastimil Babka, Zach O'Keefe,
	Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

On 12/13/22 1:42 AM, Cyrill Gorcunov wrote:
> On Wed, Nov 09, 2022 at 03:23:02PM +0500, Muhammad Usama Anjum wrote:
> ...
>> +
>> +static long do_pagemap_sd_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> +	struct mmu_notifier_range range;
>> +	unsigned long __user start, end;
>> +	struct pagemap_scan_private p;
>> +	int ret;
>> +
>> +	start = (unsigned long)untagged_addr(arg->start);
>> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> +		return -EINVAL;
>> +
>> +	if (IS_GET_OP(arg) &&
>> +	    ((arg->vec_len == 0) || (!access_ok((struct page_region *)arg->vec, arg->vec_len))))
>> +		return -ENOMEM;
>> +
>> +	if (IS_SD_OP(arg) && ((arg->required_mask & PAGEMAP_NONSD_OP_MASK) ||
>> +	     (arg->anyof_mask & PAGEMAP_NONSD_OP_MASK)))
>> +		return -EINVAL;
>> +
>> +	end = start + arg->len;
>> +	p.max_pages = arg->max_pages;
>> +	p.found_pages = 0;
>> +	p.flags = arg->flags;
>> +	p.required_mask = arg->required_mask;
>> +	p.anyof_mask = arg->anyof_mask;
>> +	p.excluded_mask = arg->excluded_mask;
>> +	p.return_mask = arg->return_mask;
>> +	p.vec_index = 0;
>> +	p.vec_len = arg->vec_len;
>> +
>> +	if (IS_GET_OP(arg)) {
>> +		p.vec = vzalloc(arg->vec_len * sizeof(struct page_region));
>> +		if (!p.vec)
>> +			return -ENOMEM;
>> +	} else {
>> +		p.vec = NULL;
>> +	}
> 
> Hi Muhammad! I'm really sorry for diving in such late (unfortunatelly too busy to
> step in yet). Anyway, while in general such interface looks reasonable here are
> few moments which really bothers me: as far as I undertstand you don't need
> vzalloc here, plain vmalloc should works as well since you copy only filled
> results back to userspace.

Thank you for reviewing. Correct, I'll update to use vmalloc.

> Next -- there is no restriction on vec_len parameter,
> is not here a door for DoS from userspace? Say I could start a number of ioctl
> on same pagemap and try to allocate very big amount of vec_len in summay causing
> big pressure on kernel's memory. Or I miss something obvious here?
Yes, there is a chance that a large chunk of kernel memory can get
allocated here as vec_len can be very large. We need to think about
limiting this buffer in the current implementation. Any reasonable limit
should work; I'm not sure what that limit should be. Maybe a couple of
hundred MBs? I'll think about it. Or I should update the implementation
such that a smaller intermediate buffer is used, like mincore does.
But this can complicate the implementation further, as we are already
returning page ranges instead of keeping just the flags. I'll see what can
be done.
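
One possible shape, sketched with hypothetical names (walk_one_chunk() and
the cap are placeholders, not code from the series): bound the kernel-side
buffer and flush it to userspace in chunks while walking, similar in spirit
to how pagemap_read() reuses a small buffer.

/* Hypothetical sketch only: cap the intermediate buffer, flush per chunk. */
#define PAGEMAP_SCAN_MAX_REGIONS	512	/* illustrative cap, not from the patch */

static long pagemap_scan_chunked(struct mm_struct *mm, struct pagemap_scan_arg *arg)
{
	struct page_region __user *uvec = (void __user *)(unsigned long)arg->vec;
	size_t n = min_t(size_t, arg->vec_len, PAGEMAP_SCAN_MAX_REGIONS);
	unsigned long addr = arg->start, end = arg->start + arg->len;
	struct page_region *vec;
	long copied = 0;

	vec = vmalloc(n * sizeof(*vec));
	if (!vec)
		return -ENOMEM;

	while (addr < end && copied < arg->vec_len) {
		/* Placeholder for the page walk that fills at most n entries
		 * and advances addr past the range it covered. */
		long filled = walk_one_chunk(mm, &addr, end, vec, n);

		if (filled <= 0)
			break;
		if (copy_to_user(uvec + copied, vec, filled * sizeof(*vec))) {
			copied = -EFAULT;
			break;
		}
		copied += filled;
	}

	vfree(vec);
	return copied;
}

Whether the mmap lock can be dropped and re-taken between chunks is the
interesting trade-off; either way the result is only a snapshot.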

-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  2022-12-13 13:04     ` Muhammad Usama Anjum
@ 2022-12-13 22:22       ` Cyrill Gorcunov
  0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2022-12-13 22:22 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Michał Mirosław, Andrei Vagin, Danylo Mocherniuk,
	Alexander Viro, Andrew Morton, Suren Baghdasaryan, Greg KH,
	Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
	Zach O'Keefe, Matthew Wilcox (Oracle),
	Gustavo A. R. Silva, Dan Williams, kernel,
	Gabriel Krisman Bertazi, David Hildenbrand, Peter Enderborg,
	open list : KERNEL SELFTEST FRAMEWORK, Shuah Khan, open list,
	open list : PROC FILESYSTEM, open list : MEMORY MANAGEMENT,
	Paul Gofman

On Tue, Dec 13, 2022 at 06:04:04PM +0500, Muhammad Usama Anjum wrote:
> > Hi Muhammad! I'm really sorry for diving in such late (unfortunatelly too busy to
> > step in yet). Anyway, while in general such interface looks reasonable here are
> > few moments which really bothers me: as far as I undertstand you don't need
> > vzalloc here, plain vmalloc should works as well since you copy only filled
> > results back to userspace.
>
> Thank you for reviewing. Correct, I'll update to use vmalloc.
> 
> > Next -- there is no restriction on vec_len parameter,
> > is not here a door for DoS from userspace? Say I could start a number of ioctl
> > on same pagemap and try to allocate very big amount of vec_len in summay causing
> > big pressure on kernel's memory. Or I miss something obvious here?
>
> Yes, there is a chance that a large chunk of kernel memory can get
> allocated here as vec_len can be very large. We need to think of limiting
> this buffer in the current implementation. Any reasonable limit should
> work. I'm not sure what would be the reasonable limit. Maybe couple of
> hundred MBs? I'll think about it. Or I should update the implementation
> such that less amount of intermediate buffer can be used like mincore does.
> But this can complicate the implementation further as we are already using
> page ranges instead of keeping just the flags. I'll see what can be done.

You know, I'm not yet convinced about the overall design. This is new uapi,
which should be reviewed very, very carefully; once merged in we can't step
back and will have to live with it forever. As to the buffer size: look at how
pagemap_read is implemented, it allocates a PAGEMAP_WALK_SIZE buffer array to
gather results and then copies it back to userspace. If the main idea is to be
able to walk over the memory of a process with the mm context locked, it still
doesn't bring much benefit, because once the ioctl is complete the state of the
mm can change, so precise results are only possible if the target process is
not running.

Maybe all of these aspects have been discussed already; I probably need to read
all the previous conversations first :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-12-13 22:22 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-09 10:23 [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
2022-11-09 10:23 ` [PATCH v6 1/3] fs/proc/task_mmu: update functions to clear the soft-dirty PTE bit Muhammad Usama Anjum
2022-11-09 10:23 ` [PATCH v6 2/3] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs Muhammad Usama Anjum
2022-11-09 23:54   ` Andrei Vagin
2022-11-11 10:10     ` Muhammad Usama Anjum
     [not found]   ` <202211120107.cYLiq2cH-lkp@intel.com>
2022-11-11 17:53     ` Muhammad Usama Anjum
2022-12-12 20:42   ` Cyrill Gorcunov
2022-12-13 13:04     ` Muhammad Usama Anjum
2022-12-13 22:22       ` Cyrill Gorcunov
2022-11-09 10:23 ` [PATCH v6 3/3] selftests: vm: add pagemap ioctl tests Muhammad Usama Anjum
2022-11-09 10:34 ` [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs David Hildenbrand
2022-11-11  7:08   ` Muhammad Usama Anjum
2022-11-14 15:46     ` David Hildenbrand
2022-11-21 15:00       ` Muhammad Usama Anjum
2022-11-21 15:55         ` David Hildenbrand
2022-11-30 11:42           ` Muhammad Usama Anjum
2022-11-30 12:10             ` David Hildenbrand
2022-12-05 15:29               ` Muhammad Usama Anjum
2022-12-05 15:39                 ` David Hildenbrand
2022-11-23 14:11 ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).