linux-mm.kvack.org archive mirror
* [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd
@ 2021-09-15 18:14 Peter Xu
  2021-09-15 18:14 ` [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Peter Xu @ 2021-09-15 18:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: peterx, Andrea Arcangeli, Liam Howlett, Hugh Dickins,
	Mike Rapoport, Yang Shi, David Hildenbrand, Kirill A . Shutemov,
	Jerome Glisse, Alistair Popple, Miaohe Lin, Matthew Wilcox

[Based on v5.14-rc1]

Hi, Andrew,

I dropped patch 5 and will come back to it later when it justifies itself
better.  Each patch of this series now carries at least one R-b; would you
consider merging it?

Thanks,

v4:
- Patch "mm: Drop first_index/last_index in zap_details"
  - Put first_index and last_index on two separate lines [Liam]
- Pick up r-bs
- Drop patch 5 for future

v3:
- Patch "mm: Add zap_skip_check_mapping() helper"
  - In zap_skip_check_mapping() check zap_mapping first [Alistair]
- Patch "mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags":
  - Fix English errors in commit message [David]
  - Drop paragraph mentioning commit 22061a1ffabd in commit msg
  - Set ZAP_FLAG_SKIP_SWAP for unmap_mapping_page() too
- Pick up r-bs

v2:
- Patch "mm: Clear vmf->pte after pte_unmap_same() returns"
  - Remove one comment [David]
- Collect r-b for patch 2/3
- Rewrite the last two patches to drop ZAP_FLAG_CHECK_MAPPING, dropping
  Alistair's r-b on patch 5 because it changed [David, Matthew]

===== v1 cover letter =====

I picked up these patches from uffd-wp v5 series here:

https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

IMHO all of them are very nice cleanups to the existing code already; they're
all small and self-contained.  They'll be needed by the upcoming uffd-wp
series.  I would appreciate it if they could be accepted earlier, so that I
don't have to keep carrying them along with the uffd-wp series.

I removed some CCs from the uffd-wp v5 series to reduce the noise, and added a
few new ones.

Reviews are greatly welcomed, thanks.

Peter Xu (4):
  mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm: Drop first_index/last_index in zap_details
  mm: Add zap_skip_check_mapping() helper

 include/linux/mm.h | 18 ++++++++++--
 mm/memory.c        | 72 +++++++++++++++++++---------------------------
 mm/shmem.c         |  1 -
 mm/userfaultfd.c   |  3 +-
 4 files changed, 46 insertions(+), 48 deletions(-)

-- 
2.31.1




* [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-15 18:14 [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
@ 2021-09-15 18:14 ` Peter Xu
  2021-09-24  3:56   ` Hugh Dickins
  2021-09-15 18:15 ` [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-15 18:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: peterx, Andrea Arcangeli, Liam Howlett, Hugh Dickins,
	Mike Rapoport, Yang Shi, David Hildenbrand, Kirill A . Shutemov,
	Jerome Glisse, Alistair Popple, Miaohe Lin, Matthew Wilcox,
	Axel Rasmussen

Previously this was done conditionally, because there is one shmem special
case where SetPageDirty() is used instead.  However that's not necessary, and
it should be easier and cleaner to set the pte dirty unconditionally in
mfill_atomic_install_pte().

The most recent discussion about this is here, where Hugh explained the history
of SetPageDirty() and why it's possible that it's not required at all:

https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/

Currently mfill_atomic_install_pte() has three callers:

        1. shmem_mfill_atomic_pte
        2. mcopy_atomic_pte
        3. mcontinue_atomic_pte

After the change: case (1) should have its SetPageDirty replaced by the dirty
bit on the pte (so we finally unify them), case (2) should have no functional
change at all as it has page_in_cache==false, and case (3) may add a dirty bit
to the pte.  However, since case (3) is UFFDIO_CONTINUE for shmem, the page is
nearly 100% sure to be dirty anyway, because UFFDIO_CONTINUE normally requires
another process to modify the page cache and kick the faulted thread, so it
should not make a real difference either.

This should make it much easier to follow which cases set dirty for uffd, as
we now simply set it for all uffd-related ioctls.  Meanwhile, there is no
special handling of SetPageDirty() where it isn't needed.

Cc: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c       | 1 -
 mm/userfaultfd.c | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 88742953532c..96ccf6e941aa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2424,7 +2424,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	SetPageDirty(page);
 	unlock_page(page);
 	return 0;
 out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7a9008415534..caf6dfff2a60 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable || !page_in_cache)
-		_dst_pte = pte_mkdirty(_dst_pte);
 	if (writable) {
 		if (wp_copy)
 			_dst_pte = pte_mkuffd_wp(_dst_pte);
-- 
2.31.1




* [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns
  2021-09-15 18:14 [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  2021-09-15 18:14 ` [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
@ 2021-09-15 18:15 ` Peter Xu
  2021-09-24  3:59   ` Hugh Dickins
  2021-09-15 18:15 ` [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details Peter Xu
  2021-09-15 18:15 ` [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper Peter Xu
  3 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-15 18:15 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, linux-mm
  Cc: Hugh Dickins, David Hildenbrand, Liam Howlett, Mike Rapoport,
	Alistair Popple, Matthew Wilcox, peterx, Yang Shi,
	Kirill A . Shutemov, Jerome Glisse, Miaohe Lin, Andrea Arcangeli

pte_unmap_same() will always unmap the pte pointer.  After the unmap, vmf->pte
will not be valid any more, so we should clear it.

It was safe only because no one accesses vmf->pte after pte_unmap_same()
returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
where vmf->pte will in most cases be overwritten very soon.

Pass vmf directly into pte_unmap_same() so that we can also avoid the long
parameter list, which should be a nice cleanup.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..7b095f07c4ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2724,19 +2724,19 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
 	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
+		spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
+		same = pte_same(*vmf->pte, vmf->orig_pte);
 		spin_unlock(ptl);
 	}
 #endif
-	pte_unmap(page_table);
+	pte_unmap(vmf->pte);
+	vmf->pte = NULL;
 	return same;
 }
 
@@ -3487,7 +3487,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	if (!pte_unmap_same(vmf))
 		goto out;
 
 	entry = pte_to_swp_entry(vmf->orig_pte);
-- 
2.31.1




* [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details
  2021-09-15 18:14 [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  2021-09-15 18:14 ` [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
  2021-09-15 18:15 ` [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
@ 2021-09-15 18:15 ` Peter Xu
  2021-09-24  4:14   ` Hugh Dickins
  2021-09-15 18:15 ` [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper Peter Xu
  3 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-15 18:15 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, linux-mm
  Cc: Hugh Dickins, David Hildenbrand, Liam Howlett, Mike Rapoport,
	Alistair Popple, Matthew Wilcox, peterx, Yang Shi,
	Kirill A . Shutemov, Jerome Glisse, Miaohe Lin, Andrea Arcangeli

The first_index/last_index parameters in zap_details are actually only used in
unmap_mapping_range_tree().  Meanwhile, this function is only called by
unmap_mapping_pages() once.  Instead of passing these two variables through the
whole stack of page zapping code, remove them from zap_details and make them
simply parameters of unmap_mapping_range_tree(), which is inlined.

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  2 --
 mm/memory.c        | 31 ++++++++++++++++++-------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..d1126f731221 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1722,8 +1722,6 @@ extern void user_shm_unlock(size_t, struct ucounts *);
  */
 struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
-	pgoff_t	first_index;			/* Lowest page->index to unmap */
-	pgoff_t last_index;			/* Highest page->index to unmap */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 7b095f07c4ef..a7e427177817 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3321,20 +3321,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 }
 
 static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+					    pgoff_t first_index,
+					    pgoff_t last_index,
 					    struct zap_details *details)
 {
 	struct vm_area_struct *vma;
 	pgoff_t vba, vea, zba, zea;
 
-	vma_interval_tree_foreach(vma, root,
-			details->first_index, details->last_index) {
-
+	vma_interval_tree_foreach(vma, root, first_index, last_index) {
 		vba = vma->vm_pgoff;
 		vea = vba + vma_pages(vma) - 1;
-		zba = details->first_index;
+		zba = first_index;
 		if (zba < vba)
 			zba = vba;
-		zea = details->last_index;
+		zea = last_index;
 		if (zea > vea)
 			zea = vea;
 
@@ -3360,18 +3360,22 @@ void unmap_mapping_page(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	struct zap_details details = { };
+	pgoff_t	first_index;
+	pgoff_t	last_index;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageTail(page));
 
+	first_index = page->index;
+	last_index = page->index + thp_nr_pages(page) - 1;
+
 	details.check_mapping = mapping;
-	details.first_index = page->index;
-	details.last_index = page->index + thp_nr_pages(page) - 1;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 
@@ -3391,16 +3395,17 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 		pgoff_t nr, bool even_cows)
 {
 	struct zap_details details = { };
+	pgoff_t	first_index = start;
+	pgoff_t	last_index = start + nr - 1;
 
 	details.check_mapping = even_cows ? NULL : mapping;
-	details.first_index = start;
-	details.last_index = start + nr - 1;
-	if (details.last_index < details.first_index)
-		details.last_index = ULONG_MAX;
+	if (last_index < first_index)
+		last_index = ULONG_MAX;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 
-- 
2.31.1




* [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper
  2021-09-15 18:14 [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
                   ` (2 preceding siblings ...)
  2021-09-15 18:15 ` [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details Peter Xu
@ 2021-09-15 18:15 ` Peter Xu
  2021-09-24  4:44   ` Hugh Dickins
  2021-09-24 10:11   ` David Hildenbrand
  3 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2021-09-15 18:15 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, linux-mm
  Cc: Hugh Dickins, David Hildenbrand, Liam Howlett, Mike Rapoport,
	Alistair Popple, Matthew Wilcox, peterx, Yang Shi,
	Kirill A . Shutemov, Jerome Glisse, Miaohe Lin, Andrea Arcangeli

Add the zap_skip_check_mapping() helper and use it for the checks.  Rename
"check_mapping" to "zap_mapping", because "check_mapping" looks like a bool
while in fact it stores the mapping itself.  When it's set, we check the
mapping (it must be non-NULL); when it's cleared we skip the check, which
works like the old way.

Move the duplicated comments to the helper too.

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 16 +++++++++++++++-
 mm/memory.c        | 29 ++++++-----------------------
 2 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d1126f731221..ed44f31615d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1721,10 +1721,24 @@ extern void user_shm_unlock(size_t, struct ucounts *);
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
-	struct address_space *check_mapping;	/* Check page->mapping if set */
+	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
+/*
+ * We set details->zap_mapping when we want to unmap shared but keep private
+ * pages. Return true if we should skip zapping this page, false otherwise.
+ */
+static inline bool
+zap_skip_check_mapping(struct zap_details *details, struct page *page)
+{
+	if (!details || !page)
+		return false;
+
+	return details->zap_mapping &&
+	    (details->zap_mapping != page_rmapping(page));
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index a7e427177817..8db8ce0ca6ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			struct page *page;
 
 			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
-			if (unlikely(details && details->check_mapping)) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping !=
-				    page_rmapping(page))
-					continue;
-			}
-
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
 
@@ -3369,7 +3352,7 @@ void unmap_mapping_page(struct page *page)
 	first_index = page->index;
 	last_index = page->index + thp_nr_pages(page) - 1;
 
-	details.check_mapping = mapping;
+	details.zap_mapping = mapping;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
@@ -3398,7 +3381,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 	pgoff_t	first_index = start;
 	pgoff_t	last_index = start + nr - 1;
 
-	details.check_mapping = even_cows ? NULL : mapping;
+	details.zap_mapping = even_cows ? NULL : mapping;
 	if (last_index < first_index)
 		last_index = ULONG_MAX;
 
-- 
2.31.1




* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-15 18:14 ` [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
@ 2021-09-24  3:56   ` Hugh Dickins
  2021-09-27 21:21     ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2021-09-24  3:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Liam Howlett, Hugh Dickins, Mike Rapoport, Yang Shi,
	David Hildenbrand, Kirill A . Shutemov, Jerome Glisse,
	Alistair Popple, Miaohe Lin, Matthew Wilcox, Axel Rasmussen

On Wed, 15 Sep 2021, Peter Xu wrote:

> It was conditionally done previously, as there's one shmem special case that we
> use SetPageDirty() instead.  However that's not necessary and it should be
> easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> 
> The most recent discussion about this is here, where Hugh explained the history
> of SetPageDirty() and why it's possible that it's not required at all:
> 
> https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
> 
> Currently mfill_atomic_install_pte() has three callers:
> 
>         1. shmem_mfill_atomic_pte
>         2. mcopy_atomic_pte
>         3. mcontinue_atomic_pte
> 
> After the change: case (1) should have its SetPageDirty replaced by the dirty
> bit on pte (so we unify them together, finally), case (2) should have no
> functional change at all as it has page_in_cache==false, case (3) may add a
> dirty bit to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem,
> it's nearly 100% sure the page is dirty after all because UFFDIO_CONTINUE
> normally requires another process to modify the page cache and kick the faulted
> thread, so should not make a real difference either.
> 
> This should make it much easier to follow on which case will set dirty for
> uffd, as we'll simply set it all now for all uffd related ioctls.  Meanwhile,
> no special handling of SetPageDirty() if there's no need.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

I'm not going to NAK this, but you and I have different ideas of
"very nice cleanups".  Generally, you appear (understandably) to be
trying to offload pieces of work from your larger series, but often
I don't see the sense of them, here in isolation anyway.

Is this a safe transformation of the existing code? Yes, I believe so
(at least until someone adds some PTESAN checker which looks to see
if any ptes are dirty in vmas to which user never had write access).
But it took quite a lot of lawyering to arrive at that conclusion.

Is this a cleanup? No, it's a dirtyup.

shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
because that's where the page contents are made dirty.  You could
criticise it for doing SetPageDirty even in the zeropage case:
yes, we've been lazy there; but that's a different argument.

If someone is faulting this page into a read-only vma, it's
surprising to make the pte dirty there.  What would be most correct
would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
(with or without zeropage optimization), and probably SetPageDirty
in some other places in mm/userfaultfd.c (I didn't look where) when
the page is filled with supplied data, and mfill_atomic_install_pte()
only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.

I haven't looked again (I have a pile of mails to respond to!),
but when I looked before I think I found that the vmf flags are
not available to the userfaultfd ioctler.  If so, then it would
be more appropriate to just leave the mkdirty to the hardware on
return from fault (except - and again I cannot spend time researching
this - perhaps I'm too x86-centric, and there are other architectures
on which the software *must* do the mkdirty fixup to avoid refaulting
forever - though probably userfaultfd state would itself prevent that).

But you seem to think that doing the dirtying in an unnatural place
helps somehow; and for all I know, that may be so in your larger
series, though this change certainly raises suspicions of that.

I'm sorry to be so discouraging, but you have asked for my opinion,
and here at last you have it.  Not a NAK, but no enthusiasm at all.

Hugh

> ---
>  mm/shmem.c       | 1 -
>  mm/userfaultfd.c | 3 +--
>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 88742953532c..96ccf6e941aa 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2424,7 +2424,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
>  	shmem_recalc_inode(inode);
>  	spin_unlock_irq(&info->lock);
>  
> -	SetPageDirty(page);
>  	unlock_page(page);
>  	return 0;
>  out_delete_from_cache:
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 7a9008415534..caf6dfff2a60 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	pgoff_t offset, max_off;
>  
>  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> +	_dst_pte = pte_mkdirty(_dst_pte);
>  	if (page_in_cache && !vm_shared)
>  		writable = false;
> -	if (writable || !page_in_cache)
> -		_dst_pte = pte_mkdirty(_dst_pte);
>  	if (writable) {
>  		if (wp_copy)
>  			_dst_pte = pte_mkuffd_wp(_dst_pte);
> -- 
> 2.31.1



* Re: [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns
  2021-09-15 18:15 ` [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
@ 2021-09-24  3:59   ` Hugh Dickins
  0 siblings, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2021-09-24  3:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Andrew Morton, linux-mm, Hugh Dickins,
	David Hildenbrand, Liam Howlett, Mike Rapoport, Alistair Popple,
	Matthew Wilcox, Yang Shi, Kirill A . Shutemov, Jerome Glisse,
	Miaohe Lin, Andrea Arcangeli

On Wed, 15 Sep 2021, Peter Xu wrote:

> pte_unmap_same() will always unmap the pte pointer.  After the unmap, vmf->pte
> will not be valid any more, we should clear it.
> 
> It was safe only because no one is accessing vmf->pte after pte_unmap_same()
> returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
> where vmf->pte will in most cases be overwritten very soon.
> 
> Directly pass in vmf into pte_unmap_same() and then we can also avoid the long
> parameter list too, which should be a nice cleanup.
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This one seems fine, thanks.
Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/memory.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 25fc46e87214..7b095f07c4ef 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2724,19 +2724,19 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
>   * proceeding (but do_wp_page is only called after already making such a check;
>   * and do_anonymous_page can safely check later on).
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> -				pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
>  	int same = 1;
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
>  	if (sizeof(pte_t) > sizeof(unsigned long)) {
> -		spinlock_t *ptl = pte_lockptr(mm, pmd);
> +		spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>  		spin_lock(ptl);
> -		same = pte_same(*page_table, orig_pte);
> +		same = pte_same(*vmf->pte, vmf->orig_pte);
>  		spin_unlock(ptl);
>  	}
>  #endif
> -	pte_unmap(page_table);
> +	pte_unmap(vmf->pte);
> +	vmf->pte = NULL;
>  	return same;
>  }
>  
> @@ -3487,7 +3487,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
>  
> -	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> +	if (!pte_unmap_same(vmf))
>  		goto out;
>  
>  	entry = pte_to_swp_entry(vmf->orig_pte);
> -- 
> 2.31.1



* Re: [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details
  2021-09-15 18:15 ` [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details Peter Xu
@ 2021-09-24  4:14   ` Hugh Dickins
  0 siblings, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2021-09-24  4:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Andrew Morton, linux-mm, Hugh Dickins,
	David Hildenbrand, Liam Howlett, Mike Rapoport, Alistair Popple,
	Matthew Wilcox, Yang Shi, Kirill A . Shutemov, Jerome Glisse,
	Miaohe Lin, Andrea Arcangeli

On Wed, 15 Sep 2021, Peter Xu wrote:

> The first_index/last_index parameters in zap_details are actually only used in
> unmap_mapping_range_tree().  Meanwhile, this function is only called by
> unmap_mapping_pages() once.  Instead of passing these two variables through the
> whole stack of page zapping code, remove them from zap_details and let them
> simply be parameters of unmap_mapping_range_tree(), which is inlined.
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

And this one is fine too, thanks.  I don't know whether it saves anything
(ah yes, with args in registers not on the stack, should save a little),
but it's helpful to limit the scope of those indices.

You may wonder how they came to be in zap_details: that dates from the
days of remap_file_pages(): nonlinear vmas, in which the zapper needed
to check each pte_file()'s offset against first and last index, to
decide whether to zap or not.  They should have been removed in 4.0.

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  include/linux/mm.h |  2 --
>  mm/memory.c        | 31 ++++++++++++++++++-------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..d1126f731221 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1722,8 +1722,6 @@ extern void user_shm_unlock(size_t, struct ucounts *);
>   */
>  struct zap_details {
>  	struct address_space *check_mapping;	/* Check page->mapping if set */
> -	pgoff_t	first_index;			/* Lowest page->index to unmap */
> -	pgoff_t last_index;			/* Highest page->index to unmap */
>  	struct page *single_page;		/* Locked page to be unmapped */
>  };
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 7b095f07c4ef..a7e427177817 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3321,20 +3321,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
>  }
>  
>  static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> +					    pgoff_t first_index,
> +					    pgoff_t last_index,
>  					    struct zap_details *details)
>  {
>  	struct vm_area_struct *vma;
>  	pgoff_t vba, vea, zba, zea;
>  
> -	vma_interval_tree_foreach(vma, root,
> -			details->first_index, details->last_index) {
> -
> +	vma_interval_tree_foreach(vma, root, first_index, last_index) {
>  		vba = vma->vm_pgoff;
>  		vea = vba + vma_pages(vma) - 1;
> -		zba = details->first_index;
> +		zba = first_index;
>  		if (zba < vba)
>  			zba = vba;
> -		zea = details->last_index;
> +		zea = last_index;
>  		if (zea > vea)
>  			zea = vea;
>  
> @@ -3360,18 +3360,22 @@ void unmap_mapping_page(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
>  	struct zap_details details = { };
> +	pgoff_t	first_index;
> +	pgoff_t	last_index;
>  
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageTail(page));
>  
> +	first_index = page->index;
> +	last_index = page->index + thp_nr_pages(page) - 1;
> +
>  	details.check_mapping = mapping;
> -	details.first_index = page->index;
> -	details.last_index = page->index + thp_nr_pages(page) - 1;
>  	details.single_page = page;
>  
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> -		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> +		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
> +					 last_index, &details);
>  	i_mmap_unlock_write(mapping);
>  }
>  
> @@ -3391,16 +3395,17 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>  		pgoff_t nr, bool even_cows)
>  {
>  	struct zap_details details = { };
> +	pgoff_t	first_index = start;
> +	pgoff_t	last_index = start + nr - 1;
>  
>  	details.check_mapping = even_cows ? NULL : mapping;
> -	details.first_index = start;
> -	details.last_index = start + nr - 1;
> -	if (details.last_index < details.first_index)
> -		details.last_index = ULONG_MAX;
> +	if (last_index < first_index)
> +		last_index = ULONG_MAX;
>  
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> -		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> +		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
> +					 last_index, &details);
>  	i_mmap_unlock_write(mapping);
>  }
>  
> -- 
> 2.31.1



* Re: [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper
  2021-09-15 18:15 ` [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper Peter Xu
@ 2021-09-24  4:44   ` Hugh Dickins
  2021-09-24 10:11   ` David Hildenbrand
  1 sibling, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2021-09-24  4:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Andrew Morton, linux-mm, Hugh Dickins,
	David Hildenbrand, Liam Howlett, Mike Rapoport, Alistair Popple,
	Matthew Wilcox, Yang Shi, Kirill A . Shutemov, Jerome Glisse,
	Miaohe Lin, Andrea Arcangeli

On Wed, 15 Sep 2021, Peter Xu wrote:

> Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
> because "check_mapping" looks like a bool but in fact it stores the mapping
> itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
> cleared we skip the check, which works like the old way.
> 
> Move the duplicated comments to the helper too.
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Again, I won't NAK, but I have no enthusiasm for this at all: our tastes
clearly differ.  I don't find the new name helpful, I don't find the
separated "helper" helpful, and you have hidden the helpful comment
(but I'd be on firmer ground if the unmap_shared_mapping_pages() it
referred to had ever existed! perhaps it was in an intermediate tree).

But then I would feel this way, wouldn't I?
See dd9fd0e03de ("[PATCH] rmap: nonlinear truncation") in
//git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

I'm glad to see that you have dropped 5/5 for now:
I was not keen on that one either.

Hugh

> ---
>  include/linux/mm.h | 16 +++++++++++++++-
>  mm/memory.c        | 29 ++++++-----------------------
>  2 files changed, 21 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d1126f731221..ed44f31615d9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1721,10 +1721,24 @@ extern void user_shm_unlock(size_t, struct ucounts *);
>   * Parameter block passed down to zap_pte_range in exceptional cases.
>   */
>  struct zap_details {
> -	struct address_space *check_mapping;	/* Check page->mapping if set */
> +	struct address_space *zap_mapping;	/* Check page->mapping if set */
>  	struct page *single_page;		/* Locked page to be unmapped */
>  };
>  
> +/*
> + * We set details->zap_mapping when we want to unmap shared but keep private
> + * pages. Return true if we should skip zapping this page, false otherwise.
> + */
> +static inline bool
> +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> +{
> +	if (!details || !page)
> +		return false;
> +
> +	return details->zap_mapping &&
> +	    (details->zap_mapping != page_rmapping(page));
> +}
> +
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index a7e427177817..8db8ce0ca6ce 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			struct page *page;
>  
>  			page = vm_normal_page(vma, addr, ptent);
> -			if (unlikely(details) && page) {
> -				/*
> -				 * unmap_shared_mapping_pages() wants to
> -				 * invalidate cache without truncating:
> -				 * unmap shared but keep private pages.
> -				 */
> -				if (details->check_mapping &&
> -				    details->check_mapping != page_rmapping(page))
> -					continue;
> -			}
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;
>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>  							tlb->fullmm);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
> @@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		    is_device_exclusive_entry(entry)) {
>  			struct page *page = pfn_swap_entry_to_page(entry);
>  
> -			if (unlikely(details && details->check_mapping)) {
> -				/*
> -				 * unmap_shared_mapping_pages() wants to
> -				 * invalidate cache without truncating:
> -				 * unmap shared but keep private pages.
> -				 */
> -				if (details->check_mapping !=
> -				    page_rmapping(page))
> -					continue;
> -			}
> -
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;
>  			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			rss[mm_counter(page)]--;
>  
> @@ -3369,7 +3352,7 @@ void unmap_mapping_page(struct page *page)
>  	first_index = page->index;
>  	last_index = page->index + thp_nr_pages(page) - 1;
>  
> -	details.check_mapping = mapping;
> +	details.zap_mapping = mapping;
>  	details.single_page = page;
>  
>  	i_mmap_lock_write(mapping);
> @@ -3398,7 +3381,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>  	pgoff_t	first_index = start;
>  	pgoff_t	last_index = start + nr - 1;
>  
> -	details.check_mapping = even_cows ? NULL : mapping;
> +	details.zap_mapping = even_cows ? NULL : mapping;
>  	if (last_index < first_index)
>  		last_index = ULONG_MAX;
>  
> -- 
> 2.31.1



* Re: [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper
  2021-09-15 18:15 ` [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper Peter Xu
  2021-09-24  4:44   ` Hugh Dickins
@ 2021-09-24 10:11   ` David Hildenbrand
  1 sibling, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-24 10:11 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, Andrew Morton, linux-mm
  Cc: Hugh Dickins, Liam Howlett, Mike Rapoport, Alistair Popple,
	Matthew Wilcox, Yang Shi, Kirill A . Shutemov, Jerome Glisse,
	Miaohe Lin, Andrea Arcangeli

On 15.09.21 20:15, Peter Xu wrote:
> Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
> because "check_mapping" looks like a bool but in fact it stores the mapping
> itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
> cleared we skip the check, which works like the old way.
> 
> Move the duplicated comments to the helper too.
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/linux/mm.h | 16 +++++++++++++++-
>   mm/memory.c        | 29 ++++++-----------------------
>   2 files changed, 21 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d1126f731221..ed44f31615d9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1721,10 +1721,24 @@ extern void user_shm_unlock(size_t, struct ucounts *);
>    * Parameter block passed down to zap_pte_range in exceptional cases.
>    */
>   struct zap_details {
> -	struct address_space *check_mapping;	/* Check page->mapping if set */
> +	struct address_space *zap_mapping;	/* Check page->mapping if set */
>   	struct page *single_page;		/* Locked page to be unmapped */
>   };
>   
> +/*
> + * We set details->zap_mapping when we want to unmap shared but keep private
> + * pages. Return true if we should skip zapping this page, false otherwise.
> + */
> +static inline bool
> +zap_skip_check_mapping(struct zap_details *details, struct page *page)

I agree with Hugh that the name of this helper is suboptimal.

What about inverting the conditions and getting

static inline bool should_zap_page()
{
...
}

The calling code is then

if (unlikely(!should_zap_page(details, page)))
	continue;


I don't really like the rename to "zap_mapping", again, because it's
contained within "struct zap_details" already.

Factoring this out into a helper sounds like a good idea to me. Clear 
case of code de-duplication.
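
For what it's worth, filling in that skeleton from the helper in this patch
(just an untested sketch of the inversion, with the same semantics as your
zap_skip_check_mapping()):

static inline bool should_zap_page(struct zap_details *details,
				   struct page *page)
{
	/* No zap_details at all means: zap everything */
	if (!details)
		return true;

	/* Entries without a struct page behind them always get zapped here */
	if (!page)
		return true;

	/* Otherwise zap unless a mapping filter is set and doesn't match */
	return !details->zap_mapping ||
	       details->zap_mapping == page_rmapping(page);
}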

-- 
Thanks,

David / dhildenb




* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-24  3:56   ` Hugh Dickins
@ 2021-09-27 21:21     ` Peter Xu
  2021-09-28 19:28       ` Hugh Dickins
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-27 21:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Liam Howlett, Mike Rapoport, Yang Shi, David Hildenbrand,
	Kirill A . Shutemov, Jerome Glisse, Alistair Popple, Miaohe Lin,
	Matthew Wilcox, Axel Rasmussen

Hi, Hugh,

On Thu, Sep 23, 2021 at 08:56:33PM -0700, Hugh Dickins wrote:
> I'm not going to NAK this, but you and I have different ideas of
> "very nice cleanups".  Generally, you appear (understandably) to be
> trying to offload pieces of work from your larger series, but often
> I don't see the sense of them, here in isolation anyway.
> 
> Is this a safe transformation of the existing code? Yes, I believe so
> (at least until someone adds some PTESAN checker which looks to see
> if any ptes are dirty in vmas to which user never had write access).
> But it took quite a lot of lawyering to arrive at that conclusion.

I get your point there, but I remain skeptical that there will ever be a tool
called PTESAN that asserts VM_WRITE for pte_dirty.

After we noticed the arm64 implementation of pte_mkdirty() last time, I already
stopped tying the idea of pte dirty to VM_WRITE or pte_write().

As I said before, that's quite natural when I think "the uffd way", because
uffd can easily arm a page read-only while the page is dirty.  I think you'll
answer that with "we should mark the page dirty instead" in this case, as you
stated below, and I agree.  However, if we see pte_dirty as a major way to
track dirty data, which in the end converges into PageDirty anyway, I don't
think it matters a huge lot to us whether we set the pte or the page dirty,
does it?

> 
> Is this a cleanup? No, it's a dirtyup.
> 
> shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
> because that's where the page contents are made dirty.  You could
> criticise it for doing SetPageDirty even in the zeropage case:
> yes, we've been lazy there; but that's a different argument.
> 
> If someone is faulting this page into a read-only vma, it's
> surprising to make the pte dirty there.  What would be most correct
> would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
> (with or without zeropage optimization), and probably SetPageDirty
> in some other places in mm/userfaultfd.c (I didn't look where) when
> the page is filled with supplied data, and mfill_atomic_install_pte()
> only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.

That's a good point, and yeah, if we can unconditionally mark PageDirty that
would be great too; I think what bothered me most in the past was that the
condition for setting dirty is too complicated: I have myself debugged two
cases where we should have applied the dirty bit but forgot, and each of those
debugging sessions took me a few days or more to figure out, thanks to my
awkward debugging skills.

Then I wondered, why not do it the other way around, given that in 99% of the
cases the pages are dirty on real systems?  Say, let's set dirty
unconditionally, and if there ever is a negative effect of having some pages
dirty (which I still doubt), we would notice it as "degraded" performance
results.  That converts some hard-to-debug data corruption issues into "oh,
previously this program ran at speed 100x, now it runs at 99x, why did I lose
1% of performance?"  I even highly doubt it will ever come to that: for the
uffd case (which is the only case I modified in this patch), I can hardly tell
how many people would want to use the mappings read-only, or how much they
would suffer from that extra dirty bit or PageDirty.

That's why I would really like this patch to happen: I want to save time for
myself, and for anyone who would otherwise be fighting yet another lost-dirty
issue.

> 
> I haven't looked again (I have a pile of mails to respond to!),
> but when I looked before I think I found that the vmf flags are
> not available to the userfaultfd ioctler.  If so, then it would
> be more appropriate to just leave the mkdirty to the hardware on
> return from fault (except - and again I cannot spend time researching
> this - perhaps I'm too x86-centric, and there are other architectures
> on which the software *must* do the mkdirty fixup to avoid refaulting
> forever - though probably userfaultfd state would itself prevent that).

If that's based on the fact that we'll set PageDirty for file-backed pages,
then it looks okay, but I'm not sure.

One thing to mention is that pte_mkdirty() also counts as soft-dirty by
nature.  I'm imagining a program being soft-dirty tracked that somehow uses
UFFDIO_COPY as the major data filler (so the task itself may never write to
the page directly, hence the HW won't set the dirty bit there).  With
pte_mkdirty the userspace soft-dirty tracker can still detect this, while with
PageDirty I believe it can't.  From that POV, I wonder whether proactively
doing pte_mkdirty is the safer approach, just in case such a use case exists;
I can't say it would be illegal, and pte_dirty is a superset of PageDirty, not
vice versa.
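
(For reference, and assuming I'm reading the x86 bits right: pte_mkdirty()
there sets the soft-dirty bit together with the hardware dirty bit, roughly

	static inline pte_t pte_mkdirty(pte_t pte)
	{
		return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
	}

where _PAGE_SOFT_DIRTY is defined to 0 without CONFIG_MEM_SOFT_DIRTY.  So a
soft-dirty tracker would still see the UFFDIO_COPY through the pte, while a
bare SetPageDirty() would not show up there.)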

> 
> But you seem to think that doing the dirtying in an unnatural place
> helps somehow; and for all I know, that may be so in your larger
> series, though this change certainly raises suspicions of that.
> 
> I'm sorry to be so discouraging, but you have asked for my opinion,
> and here at last you have it.  Not a NAK, but no enthusiasm at all.

Thanks a lot for still looking at these patches; even if most of the comments
are negative and they come a bit late for sure, I still appreciate your time.

As you mentioned, you're busy with all kinds of things, and I'm aware of it.
That's really what this patch wants to achieve too - to save time for
everyone - where my point stands at "maintaining a 100% accurate dirty bit is
not worth it here".

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-27 21:21     ` Peter Xu
@ 2021-09-28 19:28       ` Hugh Dickins
  2021-09-28 21:37         ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2021-09-28 19:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Liam Howlett, Mike Rapoport, Yang Shi,
	David Hildenbrand, Kirill A . Shutemov, Jerome Glisse,
	Alistair Popple, Miaohe Lin, Matthew Wilcox, Axel Rasmussen

On Mon, 27 Sep 2021, Peter Xu wrote:
> On Thu, Sep 23, 2021 at 08:56:33PM -0700, Hugh Dickins wrote:
> > I'm not going to NAK this, but you and I have different ideas of
> > "very nice cleanups".  Generally, you appear (understandably) to be
> > trying to offload pieces of work from your larger series, but often
> > I don't see the sense of them, here in isolation anyway.
> > 
> > Is this a safe transformation of the existing code? Yes, I believe so
> > (at least until someone adds some PTESAN checker which looks to see
> > if any ptes are dirty in vmas to which user never had write access).
> > But it took quite a lot of lawyering to arrive at that conclusion.
> 
> I can get your point there, but I keep a skeptical view if there'll be a tool
> called PTESAN that asserts VM_WRITE for pte_dirty.
> 
> After we've noticed the arm64 implementation of pte_mkdirty() last time, I've
> already started to not bind the ideas on VM_WRITE or pte_write() for pte dirty.

Yes, I know that there are good cases of pte_dirty() without pte_write():
that's why I said "never had write access" - never.

> 
> As I said before, that's quite natural when I think "the uffd-way", because
> uffd can easily arm a page with read-only but while the page is dirty.  I think
> you'll answer that with "we should mark the page dirty instead" in this case,
> as you stated below.  I also agree.  However if we see pte_dirty a major way to
> track data dirty information, and at last when it'll be converged into the
> PageDirty, I think it doesn't really matter a huge lot to us if we set pte or
> page dirty, or is it?

Please, imagine the faulting process has a PROT_READ,MAP_SHARED mmap of
/etc/passwd, or any of a million files you would not want it to write to.

The process serving that fault by doing the ioctl has (in my perhaps
mistaken mental model of userfaultfd) greater privilege, and is able
to fill in the contents of what /etc/passwd should contain: it fills the
right data into the page, which is preserved by being marked PageDirty.

The faulting process can never write to that page, and that pte ought
never to be marked dirty.

Marking the pte dirty is not a security problem: doing so does not
grant write access (though it's easy to imagine incorrect code elsewhere
that "deduces" pte_write() from pte_dirty() for some reason - I'm pretty
sure we have had such instances in the past, if not now).  But it is an
anomaly that would be better avoided.

Which is precisely why there is the "writable" check before doing it at
present ("writable" being a stand-in for the FAULT_FLAG_WRITE not visible
at this end).  (There is also "|| !page_in_cache" - that's to allow for a
read-only mmap of a page supplied by mcopy_atomic_pte(): which I argue
would be better with a SetPageDirty() in the caller than a pte_mkdirty()
here; but anonymous pages are less of a worry than shared file pages.)

So, as I said before, I believe what you're doing in this patch happens
to be a safe transformation of existing code; but not a nice cleanup.

> 
> > 
> > Is this a cleanup? No, it's a dirtyup.
> > 
> > shmem_mfill_atomic_pte() does SetPageDirty (before unlocking page)
> > because that's where the page contents are made dirty.  You could
> > criticise it for doing SetPageDirty even in the zeropage case:
> > yes, we've been lazy there; but that's a different argument.
> > 
> > If someone is faulting this page into a read-only vma, it's
> > surprising to make the pte dirty there.  What would be most correct
> > would be to keep the SetPageDirty in shmem_mfill_atomic_pte()
> > (with or without zeropage optimization), and probably SetPageDirty
> > in some other places in mm/userfaultfd.c (I didn't look where) when
> > the page is filled with supplied data, and mfill_atomic_install_pte()
> > only do that pte_mkdirty() when it's serving a FAULT_FLAG_WRITE.
> 
> That's a good point, and yeah if we can unconditionally mark PageDirty it'll be
> great too; I think what bothered me most in the past was that the condition to
> check dirty is too complicated, for which myself has been debugging for two
> cases where we should apply the dirty bit but we forgot; each of the debugging
> process took me a few days or more to figure out, thanks to my awkward
> debugging skills.
> 
> Then I noticed, why not we do the way around if for 99% of the cases they're
> dirty in real systems?  Say, let's set dirty unconditionally and see when there
> (could have, which I still doubt) is a negative effect on having some page
> dirty, we track that from a "degraded" performance results.  Then we convert
> some hard-to-debug data corrupt issues into "oh previously this programs runs
> at speed 100x, now it runs 99x, why I got 1% performance lost?"  I even highly
> doubt whether it'll come true: for the uffd case (which is the only case I
> modified in this patch), I can hardly tell how many people would like to use
> the mappings read-only, and how much they'll suffer from that extra dirty bit
> or PageDirty.
> 
> That's why I really like this patch to happen, I want to save time for myself,
> and for anyone who will be fighting for another dirty lost issues.

I've lost time on missed dirties too, so I ought to be more sympathetic
to your argument than I am: I'm afraid I read it as saying that you don't
really understand "dirty", so want to do it more often to be "safe".
Not a persuasive argument.

> 
> > 
> > I haven't looked again (I have a pile of mails to respond to!),
> > but when I looked before I think I found that the vmf flags are
> > not available to the userfaultfd ioctler.  If so, then it would
> > be more appropriate to just leave the mkdirty to the hardware on
> > return from fault (except - and again I cannot spend time researching
> > this - perhaps I'm too x86-centric, and there are other architectures
> > on which the software *must* do the mkdirty fixup to avoid refaulting
> > forever - though probably userfaultfd state would itself prevent that).
> 
> If it's based on the fact that we'll set PageDirty for file-backed, then it
> looks okay, but not sure.
> 
> One thing to mention is pte_mkdirty() also counts in soft dirty by nature.  I'm
> imagining a program that was soft-dirty tracked and somehow using UFFDIO_COPY
> as the major data filler (so the task itself may not write to the page directly
> hence HW won't set dirty bit there).  If with pte_mkdirty the other userspace
> tracker with soft-dirty can still detect this, while with PageDirty I believe
> it can't.  From that POV I'm not sure whether I can say that as proactively
> doing pte_mkdirty is a safer approach just in case such an use case exist, as
> myself can't say they're illegal, so pte_dirty is a superset of PageDirty not
> vice versa.

And this is not persuasive either: without much deeper analysis
(which I'll decline to do!), it's impossible to tell whether an excess of
pte_mkdirty()s is good or bad for the hypothetical uffd+softdirty tracker:
you're guessing good, I'm guessing bad.

How about a compromise (if you really want to continue with this patch):
you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
feel a responsibility for it; but you do whatever works for you with
pte_mkdirty() at the mm/userfaultfd.c end?
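
(Concretely - this is an assumption on my part about what that compromise
means, going by your v4 diff - it would just be dropping the mm/shmem.c hunk,
i.e. keeping

	SetPageDirty(page);
	unlock_page(page);
	return 0;

in shmem_mfill_atomic_pte(), while mm/userfaultfd.c still gets your
unconditional

	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
	_dst_pte = pte_mkdirty(_dst_pte);

in mfill_atomic_install_pte().)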

(In the course of writing this, it has occurred to me that a much nicer
solution might be to delete mfill_atomic_install_pte() altogether, and
change the userfaultfd protocol so that handle_userfault() returns the
page supplied by the ioctler, for the process to map into its own userspace
in its usual way. But that's a big change, that neither of us would be
keen to make; and it would not be surprising if it turned out actually
to be a very bad change - perhaps tried and abandoned before the "atomic"
functions were decided on. I wouldn't even dare mention it, unless that
direction might happen to fit in with something else you're planning.)

Hugh



* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-28 19:28       ` Hugh Dickins
@ 2021-09-28 21:37         ` Peter Xu
  2021-11-04 21:34           ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-28 21:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Liam Howlett, Mike Rapoport, Yang Shi, David Hildenbrand,
	Kirill A . Shutemov, Jerome Glisse, Alistair Popple, Miaohe Lin,
	Matthew Wilcox, Axel Rasmussen

On Tue, Sep 28, 2021 at 12:28:40PM -0700, Hugh Dickins wrote:
> > That's why I really like this patch to happen, I want to save time for myself,
> > and for anyone who will be fighting for another dirty lost issues.
> 
> I've lost time on missed dirties too, so I ought to be more sympathetic
> to your argument than I am: I'm afraid I read it as saying that you don't
> really understand "dirty", so want to do it more often to be "safe".
> Not a persuasive argument.

I hope I haven't misunderstood the dirty bit; please let me know if I have,
because then I must have been making a very serious mistake, and I'll be more
than glad to learn how to make things right, as long as you can help me get
there.

FWICT, I was trying to argue that it's not worth worrying about the corner
case, but obviously I failed. :) That's fine and understandable, because it
happens.

> 
> > 
> > > 
> > > I haven't looked again (I have a pile of mails to respond to!),
> > > but when I looked before I think I found that the vmf flags are
> > > not available to the userfaultfd ioctler.  If so, then it would
> > > be more appropriate to just leave the mkdirty to the hardware on
> > > return from fault (except - and again I cannot spend time researching
> > > this - perhaps I'm too x86-centric, and there are other architectures
> > > on which the software *must* do the mkdirty fixup to avoid refaulting
> > > forever - though probably userfaultfd state would itself prevent that).
> > 
> > If it's based on the fact that we'll set PageDirty for file-backed, then it
> > looks okay, but not usre.
> > 
> > One thing to mention is pte_mkdirty() also counts in soft dirty by nature.  I'm
> > imagining a program that was soft-dirty tracked and somehow using UFFDIO_COPY
> > as the major data filler (so the task itself may not write to the page directly
> > hence HW won't set dirty bit there).  If with pte_mkdirty the other userspace
> > tracker with soft-dirty can still detect this, while with PageDirty I believe
> > it can't.  From that POV I'm not sure whether I can say that as proactively
> > doing pte_mkdirty is a safer approach just in case such an use case exist, as
> > myself can't say they're illegal, so pte_dirty is a superset of PageDirty not
> > vice versa.
> 
> And this is not persuasive either: without much deeper analysis
> (which I'll decline to do!), it's impossible to tell whether an excess of
> pte_mkdirty()s is good or bad for the hypothetical uffd+softdirty tracker:
> you're guessing good, I'm guessing bad.
> 
> How about a compromise (if you really want to continue with this patch):
> you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> feel a responsibility for it; but you do whatever works for you with
> pte_mkdirty() at the mm/userfaultfd.c end?

Sure.  Duplicating the dirty bit is definitely fine with me, as it achieves
the same goal I hoped for - we're still 100% sure we won't free a uffd page
without it being noticed, and that's enough for the goal of this patch.  I
won't initiate that NACK myself, since I still think the duplication is
unnecessary no matter whether it resides in shmem or uffd code, but please go
ahead with it and I'll be fine, just in case Andrew didn't follow the details.
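
Just so the thread is self-contained, here is a minimal sketch of what that
compromise could look like, assuming the current mfill_atomic_install_pte()
structure; locking, the cache/anon distinctions and error handling are all
dropped, so this is an illustration rather than the actual diff:

    /* mm/shmem.c, shmem_mfill_atomic_pte(): shmem side of the compromise */
    SetPageDirty(page);                /* keep marking the cache page dirty */

    /* mm/userfaultfd.c, mfill_atomic_install_pte(): pte side */
    _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
    _dst_pte = pte_mkdirty(_dst_pte);  /* unconditionally dirty the pte too */
    if (writable)
            _dst_pte = pte_mkwrite(_dst_pte);
    set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

That way the information lives in both places: writeback/reclaim sees
PageDirty on the cache page, while anything walking the ptes (including the
soft-dirty case mentioned above) still sees the dirty pte.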

Actually, I can feel how strong your opinion is at this point on keeping the
old way in this patch, and yes, I definitely touched the shmem code (your
preference has its reasoning; I have mine too, on the other side, but I failed
to persuade you..).  So also feel free to NACK the whole patch, keeping both
the SetPageDirty and the conditional pte_mkdirty(); I'm fine with that too.  I
can keep the conditions in my future series as well.  Unlike the other
zap_details patch, which I believe I will need at some point, this patch is
something I very much hoped to have, but it's not a must.

I know this "hole" already so I won't fall into it so hard the 3rd time
(actually on the 2nd time I debugged faster and quickly noticed I just fell
into this same hole I've used to fall).  Let's wish the same to the others.

> 
> (In the course of writing this, it has occurred to me that a much nicer
> solution might be to delete mfill_atomic_install_pte() altogether, and
> change the userfaultfd protocol so that handle_userfault() returns the
> page supplied by the ioctler, for the process to map into its own userspace
> in its usual way. But that's a big change, one that neither of us would be
> keen to make; and it would not be surprising if it turned out actually
> to be a very bad change - perhaps tried and abandoned before the "atomic"
> functions were decided on. I wouldn't even dare mention it, unless that
> direction might happen to fit in with something else you're planning.)

Dangling pages will be hard to make right, imho.  E.g., consider one thread
faulting with uffd-missing: we request UFFDIO_COPY and deliver the page to the
faulting thread, then a 2nd thread faults on the same address before the 1st
thread has resolved the page fault in the pgtables.  Things start to get
challenging from that point, imho.

From that POV I think the current uffdio-copy does well by using the pgtable
and the pgtable lock to keep all of this serialized.  We may get false
positives, as the 2nd faulting thread will generate a duplicate message, but
that can easily be resolved by the user app.
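
As a rough sketch of the "easily resolved" part (purely illustrative: uffd,
page_size and src_buf are assumed to be set up by the surrounding fault
handler loop, which is omitted; the usual <linux/userfaultfd.h>,
<sys/ioctl.h>, <errno.h> and <stdio.h> headers are needed):

    /* msg is a UFFD_EVENT_PAGEFAULT message just read from the uffd fd */
    struct uffdio_copy copy = {
            .dst  = msg.arg.pagefault.address & ~(page_size - 1),
            .src  = (unsigned long)src_buf,
            .len  = page_size,
            .mode = 0,
    };

    if (ioctl(uffd, UFFDIO_COPY, &copy) == -1) {
            if (errno == EEXIST) {
                    /*
                     * The duplicate-fault case described above: an earlier
                     * UFFDIO_COPY already installed the pte for this range
                     * under the pgtable lock and woke the waiters, so there
                     * is nothing left to do for this message.
                     */
            } else {
                    perror("UFFDIO_COPY");
            }
    }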

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-28 21:37         ` Peter Xu
@ 2021-11-04 21:34           ` Andrew Morton
  2021-11-05  1:01             ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2021-11-04 21:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrea Arcangeli,
	Liam Howlett, Mike Rapoport, Yang Shi, David Hildenbrand,
	Kirill A . Shutemov, Jerome Glisse, Alistair Popple, Miaohe Lin,
	Matthew Wilcox, Axel Rasmussen

On Tue, 28 Sep 2021 17:37:31 -0400 Peter Xu <peterx@redhat.com> wrote:

> > How about a compromise (if you really want to continue with this patch):
> > you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> > feel a responsibility for it; but you do whatever works for you with
> > pte_mkdirty() at the mm/userfaultfd.c end?
> 
> Sure.  Duplicating the dirty bit is definitely fine with me, as it achieves
> the same goal I hoped for - we're still 100% sure we won't free a uffd page
> without it being noticed, and that's enough for the goal of this patch.  I
> won't initiate that NACK myself, since I still think the duplication is
> unnecessary no matter whether it resides in shmem or uffd code, but please go
> ahead with it and I'll be fine, just in case Andrew didn't follow the details.

I think Hugh was asking you to implement this...

I guess I'll send this patch upstream.  But it does sound like Hugh
would prefer a follow-on patch for this kernel release which makes the
above change, please.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-11-04 21:34           ` Andrew Morton
@ 2021-11-05  1:01             ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-11-05  1:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrea Arcangeli,
	Liam Howlett, Mike Rapoport, Yang Shi, David Hildenbrand,
	Kirill A . Shutemov, Jerome Glisse, Alistair Popple, Miaohe Lin,
	Matthew Wilcox, Axel Rasmussen

On Thu, Nov 04, 2021 at 02:34:40PM -0700, Andrew Morton wrote:
> On Tue, 28 Sep 2021 17:37:31 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > How about a compromise (if you really want to continue with this patch):
> > > you leave the SetPageDirty(page) in shmem_mfill_atomic_pte(), where I
> > > feel a responsibility for it; but you do whatever works for you with
> > > pte_mkdirty() at the mm/userfaultfd.c end?
> > 
> > Sure.  Duplicating the dirty bit is definitely fine with me, as it achieves
> > the same goal I hoped for - we're still 100% sure we won't free a uffd page
> > without it being noticed, and that's enough for the goal of this patch.  I
> > won't initiate that NACK myself, since I still think the duplication is
> > unnecessary no matter whether it resides in shmem or uffd code, but please
> > go ahead with it and I'll be fine, just in case Andrew didn't follow the
> > details.
> 
> I think Hugh was asking you to implement this...
> 
> I guess I'll send this patch upstream.  But it does sound like Hugh
> would prefer a follow-on patch for this kernel release which makes the
> above change, please.

Thanks Andrew for helping.

But as I mentioned, I still think it's odd to set dirty in both places.
That's why I don't want to draft the patch; I am not very willing to sign it
off..

If Hugh agrees, I can post the patch with Hugh's sign-off, adding the
PageDirty back too.  I am on holiday so I cannot follow up on the whole thing
today, but if it's easier for you to drop that patch or even the whole series,
please feel free to do so.  I can also rework everything and then try to get
Hugh's ack again on every single patch, as long as Hugh has time to look at it
in the future.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-11-05  1:01 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-15 18:14 [PATCH v4 0/4] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
2021-09-15 18:14 ` [PATCH v4 1/4] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
2021-09-24  3:56   ` Hugh Dickins
2021-09-27 21:21     ` Peter Xu
2021-09-28 19:28       ` Hugh Dickins
2021-09-28 21:37         ` Peter Xu
2021-11-04 21:34           ` Andrew Morton
2021-11-05  1:01             ` Peter Xu
2021-09-15 18:15 ` [PATCH v4 2/4] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
2021-09-24  3:59   ` Hugh Dickins
2021-09-15 18:15 ` [PATCH v4 3/4] mm: Drop first_index/last_index in zap_details Peter Xu
2021-09-24  4:14   ` Hugh Dickins
2021-09-15 18:15 ` [PATCH v4 4/4] mm: Add zap_skip_check_mapping() helper Peter Xu
2021-09-24  4:44   ` Hugh Dickins
2021-09-24 10:11   ` David Hildenbrand
