* [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
@ 2023-06-30 21:19 Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

When per-VMA locks were introduced in [1], several types of page faults
would still fall back to mmap_lock to keep the patchset simple. Among them
are swap and userfault pages. The main reason for skipping those cases was
the fact that mmap_lock could be dropped while handling these faults and
that required additional logic to be implemented.
Implement the mechanism to allow per-VMA locks to be dropped for these
cases.
First, change handle_mm_fault to drop per-VMA locks when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
mmap_lock is handled. Then change folio_lock_or_retry to accept a vm_fault
and return vm_fault_t, which simplifies later patches. Finally, allow swap
and uffd page faults to be handled under per-VMA locks by dropping the
per-VMA lock and retrying, the same way it's done under mmap_lock.
Naturally, once the VMA lock is dropped, that VMA should be considered
unstable and must not be used.
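
For illustration only (a sketch based on the arch fault handler hunks in
patch 3/6 later in this thread, not new text from the series), the
caller-side contract after this change looks roughly like:

	/*
	 * Minimal sketch of a per-VMA-lock fault path after patch 3/6.
	 * VM_FAULT_RETRY and VM_FAULT_COMPLETED now mean the per-VMA lock
	 * was already dropped inside handle_mm_fault, so the caller must
	 * not call vma_end_read() again in those cases.
	 */
	fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		vma_end_read(vma);	/* lock still held on other results */
	if (fault & VM_FAULT_RETRY) {
		/* per-VMA lock already dropped: fall back to mmap_lock path */
	}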

Changes since v6 posted at [2]
- 4/6 replaced the ternary operation in folio_lock_or_retry,
per Matthew Wilcox
- 4/6 changed return code description for __folio_lock_or_retry
per Matthew Wilcox

Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
when applied over mm-unstable branch due to a patch from ARM64 tree [3]
which is missing in mm-unstable.

[1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
[2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
[3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/

Suren Baghdasaryan (6):
  swap: remove remnants of polling from read_swap_cache_async
  mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
  mm: drop per-VMA lock when returning VM_FAULT_RETRY or
    VM_FAULT_COMPLETED
  mm: change folio_lock_or_retry to use vm_fault directly
  mm: handle swap page faults under per-VMA lock
  mm: handle userfaults under VMA lock

 arch/arm64/mm/fault.c    |  3 ++-
 arch/powerpc/mm/fault.c  |  3 ++-
 arch/s390/mm/fault.c     |  3 ++-
 arch/x86/mm/fault.c      |  3 ++-
 fs/userfaultfd.c         | 34 ++++++++++++----------------
 include/linux/mm.h       | 37 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |  3 ++-
 include/linux/pagemap.h  | 11 +++++----
 mm/filemap.c             | 37 +++++++++++++++---------------
 mm/madvise.c             |  4 ++--
 mm/memory.c              | 49 ++++++++++++++++++++++------------------
 mm/swap.h                |  1 -
 mm/swap_state.c          | 12 ++++------
 13 files changed, 120 insertions(+), 80 deletions(-)

-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 1/6] swap: remove remnants of polling from read_swap_cache_async
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team,
	Christoph Hellwig

Commit [1] introduced IO polling support during swapin to reduce
swap read latency for block devices that can be polled. However, a later
commit [2] removed polling support. Therefore it seems safe to remove
the do_poll parameter from read_swap_cache_async and always call
swap_readpage with synchronous=false, waiting for IO completion in
folio_lock_or_retry.

[1] commit 23955622ff8d ("swap: add block io poll in swapin path")
[2] commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 mm/madvise.c    |  4 ++--
 mm/swap.h       |  1 -
 mm/swap_state.c | 12 +++++-------
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 886f06066622..ac6d92f74f6d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -218,7 +218,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		ptep = NULL;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-					     vma, addr, false, &splug);
+					     vma, addr, &splug);
 		if (page)
 			put_page(page);
 	}
@@ -262,7 +262,7 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
 		rcu_read_unlock();
 
 		page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
-					     vma, addr, false, &splug);
+					     vma, addr, &splug);
 		if (page)
 			put_page(page);
 
diff --git a/mm/swap.h b/mm/swap.h
index 7c033d793f15..8a3c7a0ace4f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -46,7 +46,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				   struct vm_area_struct *vma,
 				   unsigned long addr,
-				   bool do_poll,
 				   struct swap_iocb **plug);
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct vm_area_struct *vma,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f8ea7015bad4..5a690c79cc13 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -527,15 +527,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				   struct vm_area_struct *vma,
-				   unsigned long addr, bool do_poll,
-				   struct swap_iocb **plug)
+				   unsigned long addr, struct swap_iocb **plug)
 {
 	bool page_was_allocated;
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll, plug);
+		swap_readpage(retpage, false, plug);
 
 	return retpage;
 }
@@ -630,7 +629,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
-	bool do_poll = true, page_allocated;
+	bool page_allocated;
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
 
@@ -638,7 +637,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!mask)
 		goto skip;
 
-	do_poll = false;
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -670,7 +668,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
 	/* The page was likely read above, so no need for plugging here */
-	return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll, NULL);
+	return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
 }
 
 int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -838,7 +836,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
-				     ra_info.win == 1, NULL);
+				     NULL);
 }
 
 /**
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason
to be used as flag_array inside trace_print_flags_seq(). The element
for VM_FAULT_COMPLETED is missing; add it.
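
As a hedged illustration (the tracepoint fragment below is an example of
existing usage, e.g. the DAX fault tracepoints consume the macro this way;
nothing here is added by this patch), the flag array is passed to
__print_flags(), which expands to trace_print_flags_seq():

	TP_printk("... result %s",
		  __print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE))

Without the COMPLETED entry, a VM_FAULT_COMPLETED result would be printed
as a raw hex value instead of "COMPLETED".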

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm_types.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..39cd34b4dbaa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1077,7 +1077,8 @@ enum vm_fault_reason {
 	{ VM_FAULT_RETRY,               "RETRY" },	\
 	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
 	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
-	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
+	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
+	{ VM_FAULT_COMPLETED,           "COMPLETED" }
 
 struct vm_special_mapping {
 	const char *name;	/* The name, e.g. "[vdso]". */
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released. However, with per-VMA locks the behavior is
different and the caller should still release the lock. To make the
rules consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page, and no path returns
VM_FAULT_COMPLETED for now.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/fault.c   |  3 ++-
 arch/powerpc/mm/fault.c |  3 ++-
 arch/s390/mm/fault.c    |  3 ++-
 arch/x86/mm/fault.c     |  3 ++-
 mm/memory.c             | 12 ++++++++++++
 5 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 935f0a8911f9..9d78ff78b0e3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -602,7 +602,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5bfdf6ecfa96..82954d0e6906 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -489,7 +489,8 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index dbe8394234e2..40a71063949b 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -418,7 +418,8 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
 		goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e8711b2cafaf..56b4f9faf8c4 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1341,7 +1341,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/mm/memory.c b/mm/memory.c
index 0ae594703021..5f26c56ce979 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3730,6 +3730,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
 		ret = VM_FAULT_RETRY;
+		vma_end_read(vma);
 		goto out;
 	}
 
@@ -5182,6 +5183,17 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
 				 !is_cow_mapping(vma->vm_flags)))
 			return VM_FAULT_SIGSEGV;
 	}
+#ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * Per-VMA locks can't be used with FAULT_FLAG_RETRY_NOWAIT because of
+	 * the assumption that lock is dropped on VM_FAULT_RETRY.
+	 */
+	if (WARN_ON_ONCE((*flags &
+			(FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT)) ==
+			(FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT)))
+		return VM_FAULT_SIGSEGV;
+#endif
+
 	return 0;
 }
 
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 4/6] mm: change folio_lock_or_retry to use vm_fault directly
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2023-06-30 21:19 ` [PATCH v7 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

Change folio_lock_or_retry to accept a vm_fault struct and return
vm_fault_t directly.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/linux/pagemap.h | 11 ++++++-----
 mm/filemap.c            | 22 ++++++++++++----------
 mm/memory.c             | 14 ++++++--------
 3 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716953ee1ebd..de16f740b755 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -900,8 +900,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -1005,11 +1004,13 @@ static inline int folio_lock_killable(struct folio *folio)
  * Return value and mmap_lock implications depend on flags; see
  * __folio_lock_or_retry().
  */
-static inline bool folio_lock_or_retry(struct folio *folio,
-		struct mm_struct *mm, unsigned int flags)
+static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
+					     struct vm_fault *vmf)
 {
 	might_sleep();
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	if (!folio_trylock(folio))
+		return __folio_lock_or_retry(folio, vmf);
+	return 0;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 9e44a49bbd74..5da5ad6f7f4c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1669,32 +1669,34 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 
 /*
  * Return values:
- * true - folio is locked; mmap_lock is still held.
- * false - folio is not locked.
+ * 0 - folio is locked.
+ * non-zero - folio is not locked.
  *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
  *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  *     which case mmap_lock is still held.
  *
- * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
+ * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
  * with the folio locked and the mmap_lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	unsigned int flags = vmf->flags;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
-		 * even though return 0.
+		 * even though return VM_FAULT_RETRY.
 		 */
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
-			return false;
+			return VM_FAULT_RETRY;
 
 		mmap_read_unlock(mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
 		else
 			folio_wait_locked(folio);
-		return false;
+		return VM_FAULT_RETRY;
 	}
 	if (flags & FAULT_FLAG_KILLABLE) {
 		bool ret;
@@ -1702,13 +1704,13 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 		ret = __folio_lock_killable(folio);
 		if (ret) {
 			mmap_read_unlock(mm);
-			return false;
+			return VM_FAULT_RETRY;
 		}
 	} else {
 		__folio_lock(folio);
 	}
 
-	return true;
+	return 0;
 }
 
 /**
diff --git a/mm/memory.c b/mm/memory.c
index 5f26c56ce979..4ae3f046f593 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3582,6 +3582,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct folio *folio = page_folio(vmf->page);
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
+	vm_fault_t ret;
 
 	/*
 	 * We need a reference to lock the folio because we don't hold
@@ -3594,9 +3595,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	if (!folio_try_get(folio))
 		return 0;
 
-	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+	ret = folio_lock_or_retry(folio, vmf);
+	if (ret) {
 		folio_put(folio);
-		return VM_FAULT_RETRY;
+		return ret;
 	}
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
 				vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3721,7 +3723,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
-	int locked;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
@@ -3844,12 +3845,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
-
-	if (!locked) {
-		ret |= VM_FAULT_RETRY;
+	ret |= folio_lock_or_retry(folio, vmf);
+	if (ret & VM_FAULT_RETRY)
 		goto out_release;
-	}
 
 	if (swapcache) {
 		/*
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 5/6] mm: handle swap page faults under per-VMA lock
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2023-06-30 21:19 ` [PATCH v7 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-06-30 21:19 ` [PATCH v7 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

When a page fault is handled under per-VMA lock protection, all swap page
faults are retried with mmap_lock because folio_lock_or_retry has to drop
and reacquire mmap_lock if the folio could not be immediately locked.
Follow the same pattern as with mmap_lock: drop the per-VMA lock when
waiting for the folio and retry once the folio is available.
With this obstacle removed, enable do_swap_page to operate under
per-VMA lock protection. Drivers implementing ops->migrate_to_ram might
still rely on mmap_lock, so we have to fall back to mmap_lock in that
particular case.
Note that the only time do_swap_page calls synchronous swap_readpage
is when SWP_SYNCHRONOUS_IO is set, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
pmem). Therefore we don't sleep in this path, and there's no need to
drop the mmap or per-VMA lock.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 13 +++++++++++++
 mm/filemap.c       | 17 ++++++++---------
 mm/memory.c        | 16 ++++++++++------
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 39aa409e84d5..54ab11214f4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -720,6 +720,14 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
 	vma->detached = detached;
 }
 
+static inline void release_fault_lock(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		vma_end_read(vmf->vma);
+	else
+		mmap_read_unlock(vmf->vma->vm_mm);
+}
+
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 					  unsigned long address);
 
@@ -735,6 +743,11 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
 
+static inline void release_fault_lock(struct vm_fault *vmf)
+{
+	mmap_read_unlock(vmf->vma->vm_mm);
+}
+
 #endif /* CONFIG_PER_VMA_LOCK */
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 5da5ad6f7f4c..5ac1b7beea2a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1671,27 +1671,26 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * Return values:
  * 0 - folio is locked.
  * non-zero - folio is not locked.
- *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
- *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
- *     which case mmap_lock is still held.
+ *     mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
+ *     vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
+ *     FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
  */
 vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
-	struct mm_struct *mm = vmf->vma->vm_mm;
 	unsigned int flags = vmf->flags;
 
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
-		 * CAUTION! In this case, mmap_lock is not released
-		 * even though return VM_FAULT_RETRY.
+		 * CAUTION! In this case, mmap_lock/per-VMA lock is not
+		 * released even though returning VM_FAULT_RETRY.
 		 */
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return VM_FAULT_RETRY;
 
-		mmap_read_unlock(mm);
+		release_fault_lock(vmf);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
 		else
@@ -1703,7 +1702,7 @@ vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 
 		ret = __folio_lock_killable(folio);
 		if (ret) {
-			mmap_read_unlock(mm);
+			release_fault_lock(vmf);
 			return VM_FAULT_RETRY;
 		}
 	} else {
diff --git a/mm/memory.c b/mm/memory.c
index 4ae3f046f593..bb0f68a73b0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3729,12 +3729,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		vma_end_read(vma);
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -3744,6 +3738,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * migrate_to_ram is not yet ready to operate
+				 * under VMA lock.
+				 */
+				vma_end_read(vma);
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
+
 			vmf->page = pfn_swap_entry_to_page(entry);
 			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v7 6/6] mm: handle userfaults under VMA lock
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2023-06-30 21:19 ` [PATCH v7 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
@ 2023-06-30 21:19 ` Suren Baghdasaryan
  2023-07-02 17:50 ` [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Andrew Morton
  2023-08-09  7:48 ` David Hildenbrand
  7 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-06-30 21:19 UTC (permalink / raw)
  To: akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team

Enable handle_userfault to operate under the VMA lock by releasing the
VMA lock instead of mmap_lock and retrying. Note that
FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under
per-VMA lock protection because that would break the assumption that the
lock is dropped on retry.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c   | 34 ++++++++++++++--------------------
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 mm/memory.c        |  9 ---------
 3 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..21a546eaf9f7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -277,17 +277,16 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
  * hugepmd ranges.
  */
 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
-					 struct vm_area_struct *vma,
-					 unsigned long address,
-					 unsigned long flags,
-					 unsigned long reason)
+					      struct vm_fault *vmf,
+					      unsigned long reason)
 {
+	struct vm_area_struct *vma = vmf->vma;
 	pte_t *ptep, pte;
 	bool ret = true;
 
-	mmap_assert_locked(ctx->mm);
+	assert_fault_locked(vmf);
 
-	ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
+	ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
 	if (!ptep)
 		goto out;
 
@@ -308,10 +307,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 }
 #else
 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
-					 struct vm_area_struct *vma,
-					 unsigned long address,
-					 unsigned long flags,
-					 unsigned long reason)
+					      struct vm_fault *vmf,
+					      unsigned long reason)
 {
 	return false;	/* should never get here */
 }
@@ -325,11 +322,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
  * threads.
  */
 static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
-					 unsigned long address,
-					 unsigned long flags,
+					 struct vm_fault *vmf,
 					 unsigned long reason)
 {
 	struct mm_struct *mm = ctx->mm;
+	unsigned long address = vmf->address;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -338,7 +335,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	pte_t ptent;
 	bool ret = true;
 
-	mmap_assert_locked(mm);
+	assert_fault_locked(vmf);
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -440,7 +437,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * Coredumping runs without mmap_lock so we can only check that
 	 * the mmap_lock is held, if PF_DUMPCORE was not set.
 	 */
-	mmap_assert_locked(mm);
+	assert_fault_locked(vmf);
 
 	ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
@@ -556,15 +553,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
 	if (!is_vm_hugetlb_page(vma))
-		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
-						  reason);
+		must_wait = userfaultfd_must_wait(ctx, vmf, reason);
 	else
-		must_wait = userfaultfd_huge_must_wait(ctx, vma,
-						       vmf->address,
-						       vmf->flags, reason);
+		must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_vma_unlock_read(vma);
-	mmap_read_unlock(mm);
+	release_fault_lock(vmf);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
 		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54ab11214f4f..2794225b2d42 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -705,6 +705,17 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
 	return true;
 }
 
+static inline void vma_assert_locked(struct vm_area_struct *vma)
+{
+	int mm_lock_seq;
+
+	if (__is_vma_write_locked(vma, &mm_lock_seq))
+		return;
+
+	lockdep_assert_held(&vma->vm_lock->lock);
+	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_lock->lock), vma);
+}
+
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -728,6 +739,14 @@ static inline void release_fault_lock(struct vm_fault *vmf)
 		mmap_read_unlock(vmf->vma->vm_mm);
 }
 
+static inline void assert_fault_locked(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		vma_assert_locked(vmf->vma);
+	else
+		mmap_assert_locked(vmf->vma->vm_mm);
+}
+
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 					  unsigned long address);
 
@@ -748,6 +767,11 @@ static inline void release_fault_lock(struct vm_fault *vmf)
 	mmap_read_unlock(vmf->vma->vm_mm);
 }
 
+static inline void assert_fault_locked(struct vm_fault *vmf)
+{
+	mmap_assert_locked(vmf->vma->vm_mm);
+}
+
 #endif /* CONFIG_PER_VMA_LOCK */
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index bb0f68a73b0c..d9f36f9392a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5407,15 +5407,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_start_read(vma))
 		goto inval;
 
-	/*
-	 * Due to the possibility of userfault handler dropping mmap_lock, avoid
-	 * it for now and fall back to page fault handling under mmap_lock.
-	 */
-	if (userfaultfd_armed(vma)) {
-		vma_end_read(vma);
-		goto inval;
-	}
-
 	/* Check since vm_start/vm_end might change before we lock the VMA */
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		vma_end_read(vma);
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2023-06-30 21:19 ` [PATCH v7 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
@ 2023-07-02 17:50 ` Andrew Morton
  2023-07-03 15:27   ` Suren Baghdasaryan
  2023-08-09  7:48 ` David Hildenbrand
  7 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2023-07-02 17:50 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Fri, 30 Jun 2023 14:19:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> When per-VMA locks were introduced in [1] several types of page faults
> would still fall back to mmap_lock to keep the patchset simple. Among them
> are swap and userfault pages. The main reason for skipping those cases was
> the fact that mmap_lock could be dropped while handling these faults and
> that required additional logic to be implemented.
> Implement the mechanism to allow per-VMA locks to be dropped for these
> cases.
> First, change handle_mm_fault to drop per-VMA locks when returning
> VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> and return vm_fault_t which simplifies later patches. Finally allow swap
> and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> and retrying, the same way it's done under mmap_lock.
> Naturally, once VMA lock is dropped that VMA should be assumed unstable
> and can't be used.

Is there any measurable performance benefit from this?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-07-02 17:50 ` [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Andrew Morton
@ 2023-07-03 15:27   ` Suren Baghdasaryan
  0 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-07-03 15:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Sun, Jul 2, 2023 at 5:50 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 30 Jun 2023 14:19:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > When per-VMA locks were introduced in [1] several types of page faults
> > would still fall back to mmap_lock to keep the patchset simple. Among them
> > are swap and userfault pages. The main reason for skipping those cases was
> > the fact that mmap_lock could be dropped while handling these faults and
> > that required additional logic to be implemented.
> > Implement the mechanism to allow per-VMA locks to be dropped for these
> > cases.
> > First, change handle_mm_fault to drop per-VMA locks when returning
> > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > and return vm_fault_t which simplifies later patches. Finally allow swap
> > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > and retrying, the same way it's done under mmap_lock.
> > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > and can't be used.
>
> Is there any measurable performance benefit from this?

Good point. I haven't measured it but assume it will have the same
effect as for other page fault cases handled under per-VMA locks
(mmap_lock contention reduction). I'll try to create a test to measure
the effects.
Thanks,
Suren.
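
Purely as an illustration of the kind of test mentioned above (not code
from this thread; all names and sizes are made up, it needs swap enabled
and a glibc that exposes MADV_PAGEOUT), a toy probe could time swap-in
faults while another thread hammers mmap_lock:

	/* fault_bench.c: toy mmap_lock-contention probe (sketch only). */
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define NPAGES 65536L

	static long page_sz;
	static volatile int stop;

	/* Generate mmap_lock writer traffic in the background. */
	static void *mapper(void *arg)
	{
		(void)arg;
		while (!stop) {
			void *p = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				munmap(p, page_sz);
		}
		return NULL;
	}

	/* Populate a buffer, push it to swap, then time the swap-in faults. */
	static void *faulter(void *arg)
	{
		char *buf = mmap(NULL, NPAGES * page_sz, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timespec t0, t1;
		long i;

		(void)arg;
		if (buf == MAP_FAILED)
			return NULL;
		for (i = 0; i < NPAGES; i++)
			buf[i * page_sz] = 1;			/* populate */
		madvise(buf, NPAGES * page_sz, MADV_PAGEOUT);	/* push to swap */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < NPAGES; i++)
			buf[i * page_sz] = 2;			/* swap-in faults */
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("swap-in of %ld pages took %ld us\n", NPAGES,
		       (t1.tv_sec - t0.tv_sec) * 1000000L +
		       (t1.tv_nsec - t0.tv_nsec) / 1000L);
		munmap(buf, NPAGES * page_sz);
		return NULL;
	}

	int main(void)
	{
		pthread_t m, f[4];
		int i;

		page_sz = sysconf(_SC_PAGESIZE);
		pthread_create(&m, NULL, mapper, NULL);
		for (i = 0; i < 4; i++)
			pthread_create(&f[i], NULL, faulter, NULL);
		for (i = 0; i < 4; i++)
			pthread_join(f[i], NULL);
		stop = 1;
		pthread_join(m, NULL);
		return 0;
	}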

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2023-07-02 17:50 ` [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Andrew Morton
@ 2023-08-09  7:48 ` David Hildenbrand
  2023-08-09 14:28   ` Suren Baghdasaryan
  7 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2023-08-09  7:48 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On 30.06.23 23:19, Suren Baghdasaryan wrote:
> When per-VMA locks were introduced in [1] several types of page faults
> would still fall back to mmap_lock to keep the patchset simple. Among them
> are swap and userfault pages. The main reason for skipping those cases was
> the fact that mmap_lock could be dropped while handling these faults and
> that required additional logic to be implemented.
> Implement the mechanism to allow per-VMA locks to be dropped for these
> cases.
> First, change handle_mm_fault to drop per-VMA locks when returning
> VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> and return vm_fault_t which simplifies later patches. Finally allow swap
> and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> and retrying, the same way it's done under mmap_lock.
> Naturally, once VMA lock is dropped that VMA should be assumed unstable
> and can't be used.
> 
> Changes since v6 posted at [2]
> - 4/6 replaced the ternary operation in folio_lock_or_retry,
> per Matthew Wilcox
> - 4/6 changed return code description for __folio_lock_or_retry
> per Matthew Wilcox
> 
> Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> which is missing in mm-unstable.
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> 
> Suren Baghdasaryan (6):
>    swap: remove remnants of polling from read_swap_cache_async
>    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
>    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
>      VM_FAULT_COMPLETED
>    mm: change folio_lock_or_retry to use vm_fault directly
>    mm: handle swap page faults under per-VMA lock
>    mm: handle userfaults under VMA lock

On mm/mm-unstable I get the following when running the selftests:

Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
[  383.215804] get_unmapped_area ffffffffad03b980
[  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
[  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
[  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
[  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
[  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
[  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
[  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
[  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
[  383.215804] ioctx_table 0000000000000000
[  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
[  383.215804] notifier_subscriptions 0000000000000000
[  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
[  383.215804] tlb_flush_pending 0
[  383.215804] def_flags: 0x0()
[  383.236255] ------------[ cut here ]------------
[  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
[  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
[  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[  383.244936] RIP: 0010:find_vma+0x3a/0x40
[  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
[  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
[  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
[  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
[  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
[  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
[  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
[  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
[  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
[  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  383.271905] PKRU: 55555554
[  383.272593] Call Trace:
[  383.273215]  <TASK>
[  383.273774]  ? die+0x32/0x80
[  383.274510]  ? do_trap+0xd6/0x100
[  383.275326]  ? find_vma+0x3a/0x40
[  383.276152]  ? do_error_trap+0x6a/0x90
[  383.277072]  ? find_vma+0x3a/0x40
[  383.277899]  ? exc_invalid_op+0x4c/0x60
[  383.278846]  ? find_vma+0x3a/0x40
[  383.279675]  ? asm_exc_invalid_op+0x16/0x20
[  383.280698]  ? find_vma+0x3a/0x40
[  383.281527]  lock_mm_and_find_vma+0x3f/0x270
[  383.282570]  do_user_addr_fault+0x1e4/0x660
[  383.283591]  exc_page_fault+0x73/0x170
[  383.284509]  asm_exc_page_fault+0x22/0x30
[  383.285486] RIP: 0033:0x404428
[  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
[  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
[  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
[  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
[  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
[  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
[  383.300203]  </TASK>
[  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
[  383.309661] ---[ end trace 0000000000000000 ]---
[  383.310795] RIP: 0010:find_vma+0x3a/0x40
[  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
[  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
[  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
[  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
[  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
[  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
[  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
[  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
[  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
[  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  383.334287] PKRU: 55555554


Which ends up being

VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);

I did not check whether this is also the case on mainline, or whether this series is responsible.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09  7:48 ` David Hildenbrand
@ 2023-08-09 14:28   ` Suren Baghdasaryan
  2023-08-09 15:22     ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-09 14:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > When per-VMA locks were introduced in [1] several types of page faults
> > would still fall back to mmap_lock to keep the patchset simple. Among them
> > are swap and userfault pages. The main reason for skipping those cases was
> > the fact that mmap_lock could be dropped while handling these faults and
> > that required additional logic to be implemented.
> > Implement the mechanism to allow per-VMA locks to be dropped for these
> > cases.
> > First, change handle_mm_fault to drop per-VMA locks when returning
> > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > and return vm_fault_t which simplifies later patches. Finally allow swap
> > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > and retrying, the same way it's done under mmap_lock.
> > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > and can't be used.
> >
> > Changes since v6 posted at [2]
> > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > per Matthew Wilcox
> > - 4/6 changed return code description for __folio_lock_or_retry
> > per Matthew Wilcox
> >
> > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > which is missing in mm-unstable.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> >
> > Suren Baghdasaryan (6):
> >    swap: remove remnants of polling from read_swap_cache_async
> >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> >      VM_FAULT_COMPLETED
> >    mm: change folio_lock_or_retry to use vm_fault directly
> >    mm: handle swap page faults under per-VMA lock
> >    mm: handle userfaults under VMA lock
>
> On mm/mm-unstable I get running the selftests:
>
> Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> [  383.215804] get_unmapped_area ffffffffad03b980
> [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> [  383.215804] ioctx_table 0000000000000000
> [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> [  383.215804] notifier_subscriptions 0000000000000000
> [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> [  383.215804] tlb_flush_pending 0
> [  383.215804] def_flags: 0x0()
> [  383.236255] ------------[ cut here ]------------
> [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  383.271905] PKRU: 55555554
> [  383.272593] Call Trace:
> [  383.273215]  <TASK>
> [  383.273774]  ? die+0x32/0x80
> [  383.274510]  ? do_trap+0xd6/0x100
> [  383.275326]  ? find_vma+0x3a/0x40
> [  383.276152]  ? do_error_trap+0x6a/0x90
> [  383.277072]  ? find_vma+0x3a/0x40
> [  383.277899]  ? exc_invalid_op+0x4c/0x60
> [  383.278846]  ? find_vma+0x3a/0x40
> [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> [  383.280698]  ? find_vma+0x3a/0x40
> [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> [  383.282570]  do_user_addr_fault+0x1e4/0x660
> [  383.283591]  exc_page_fault+0x73/0x170
> [  383.284509]  asm_exc_page_fault+0x22/0x30
> [  383.285486] RIP: 0033:0x404428
> [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> [  383.300203]  </TASK>
> [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> [  383.309661] ---[ end trace 0000000000000000 ]---
> [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  383.334287] PKRU: 55555554
>
>
> Which ends up being
>
> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>
> I did not check if this is also the case on mainline, and if this series is responsible.

Thanks for reporting! I'm checking it now.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 14:28   ` Suren Baghdasaryan
@ 2023-08-09 15:22     ` Suren Baghdasaryan
  2023-08-09 17:17       ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-09 15:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 7:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > > When per-VMA locks were introduced in [1] several types of page faults
> > > would still fall back to mmap_lock to keep the patchset simple. Among them
> > > are swap and userfault pages. The main reason for skipping those cases was
> > > the fact that mmap_lock could be dropped while handling these faults and
> > > that required additional logic to be implemented.
> > > Implement the mechanism to allow per-VMA locks to be dropped for these
> > > cases.
> > > First, change handle_mm_fault to drop per-VMA locks when returning
> > > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > > and return vm_fault_t which simplifies later patches. Finally allow swap
> > > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > > and retrying, the same way it's done under mmap_lock.
> > > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > > and can't be used.
> > >
> > > Changes since v6 posted at [2]
> > > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > > per Matthew Wilcox
> > > - 4/6 changed return code description for __folio_lock_or_retry
> > > per Matthew Wilcox
> > >
> > > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > > which is missing in mm-unstable.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> > >
> > > Suren Baghdasaryan (6):
> > >    swap: remove remnants of polling from read_swap_cache_async
> > >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> > >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> > >      VM_FAULT_COMPLETED
> > >    mm: change folio_lock_or_retry to use vm_fault directly
> > >    mm: handle swap page faults under per-VMA lock
> > >    mm: handle userfaults under VMA lock
> >
> > On mm/mm-unstable I get running the selftests:
> >
> > Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> > [  383.215804] get_unmapped_area ffffffffad03b980
> > [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> > [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> > [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> > [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> > [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> > [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> > [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> > [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> > [  383.215804] ioctx_table 0000000000000000
> > [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> > [  383.215804] notifier_subscriptions 0000000000000000
> > [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> > [  383.215804] tlb_flush_pending 0
> > [  383.215804] def_flags: 0x0()
> > [  383.236255] ------------[ cut here ]------------
> > [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> > [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> > [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> > [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> > [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  383.271905] PKRU: 55555554
> > [  383.272593] Call Trace:
> > [  383.273215]  <TASK>
> > [  383.273774]  ? die+0x32/0x80
> > [  383.274510]  ? do_trap+0xd6/0x100
> > [  383.275326]  ? find_vma+0x3a/0x40
> > [  383.276152]  ? do_error_trap+0x6a/0x90
> > [  383.277072]  ? find_vma+0x3a/0x40
> > [  383.277899]  ? exc_invalid_op+0x4c/0x60
> > [  383.278846]  ? find_vma+0x3a/0x40
> > [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> > [  383.280698]  ? find_vma+0x3a/0x40
> > [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> > [  383.282570]  do_user_addr_fault+0x1e4/0x660
> > [  383.283591]  exc_page_fault+0x73/0x170
> > [  383.284509]  asm_exc_page_fault+0x22/0x30
> > [  383.285486] RIP: 0033:0x404428
> > [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> > [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> > [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> > [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> > [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> > [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> > [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> > [  383.300203]  </TASK>
> > [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> > [  383.309661] ---[ end trace 0000000000000000 ]---
> > [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> > [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  383.334287] PKRU: 55555554
> >
> >
> > Which ends up being
> >
> > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >
> > I did not check if this is also the case on mainline, and if this series is responsible.
>
> Thanks for reporting! I'm checking it now.

Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
calling find_vma() without mmap_lock after successfully completing
get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
the first invocation of find_vma(), so this is not even the lock
upgrade path... I'll try to reproduce this issue and dig up more but
from the information I have so far this issue does not seem to be
related to this series.
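
For anyone following along, find_vma() itself contains the
mmap_assert_locked() check (that is where the trace lands), and
lock_mm_and_find_vma() only reaches it after get_mmap_lock_carefully()
has returned with mmap_lock held for read. A heavily simplified
paraphrase (the function name below is made up and this is not the
exact mainline code):

static struct vm_area_struct *simplified_lock_mm_and_find_vma(struct mm_struct *mm,
                unsigned long addr, struct pt_regs *regs)
{
        /* Takes mmap_lock for read, or fails without holding it. */
        if (!get_mmap_lock_carefully(mm, regs))
                return NULL;

        /* First find_vma() call, which is where the trace points. */
        return find_vma(mm, addr);
}

So if the assertion fires on that first call, either
get_mmap_lock_carefully() mis-reported success or something dropped the
lock from under us in between.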

>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 15:22     ` Suren Baghdasaryan
@ 2023-08-09 17:17       ` Suren Baghdasaryan
  2023-08-09 18:04         ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-09 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 7:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > > > When per-VMA locks were introduced in [1] several types of page faults
> > > > would still fall back to mmap_lock to keep the patchset simple. Among them
> > > > are swap and userfault pages. The main reason for skipping those cases was
> > > > the fact that mmap_lock could be dropped while handling these faults and
> > > > that required additional logic to be implemented.
> > > > Implement the mechanism to allow per-VMA locks to be dropped for these
> > > > cases.
> > > > First, change handle_mm_fault to drop per-VMA locks when returning
> > > > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > > > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > > > and return vm_fault_t which simplifies later patches. Finally allow swap
> > > > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > > > and retrying, the same way it's done under mmap_lock.
> > > > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > > > and can't be used.
> > > >
> > > > Changes since v6 posted at [2]
> > > > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > > > per Matthew Wilcox
> > > > - 4/6 changed return code description for __folio_lock_or_retry
> > > > per Matthew Wilcox
> > > >
> > > > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > > > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > > > which is missing in mm-unstable.
> > > >
> > > > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > > > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > > > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> > > >
> > > > Suren Baghdasaryan (6):
> > > >    swap: remove remnants of polling from read_swap_cache_async
> > > >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> > > >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> > > >      VM_FAULT_COMPLETED
> > > >    mm: change folio_lock_or_retry to use vm_fault directly
> > > >    mm: handle swap page faults under per-VMA lock
> > > >    mm: handle userfaults under VMA lock
> > >
> > > On mm/mm-unstable I get running the selftests:
> > >
> > > Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> > > [  383.215804] get_unmapped_area ffffffffad03b980
> > > [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> > > [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> > > [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> > > [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> > > [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> > > [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> > > [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> > > [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> > > [  383.215804] ioctx_table 0000000000000000
> > > [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> > > [  383.215804] notifier_subscriptions 0000000000000000
> > > [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> > > [  383.215804] tlb_flush_pending 0
> > > [  383.215804] def_flags: 0x0()
> > > [  383.236255] ------------[ cut here ]------------
> > > [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> > > [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> > > [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> > > [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> > > [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > > [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > > [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > > [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > > [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > > [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > > [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > > [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > > [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > > [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  383.271905] PKRU: 55555554
> > > [  383.272593] Call Trace:
> > > [  383.273215]  <TASK>
> > > [  383.273774]  ? die+0x32/0x80
> > > [  383.274510]  ? do_trap+0xd6/0x100
> > > [  383.275326]  ? find_vma+0x3a/0x40
> > > [  383.276152]  ? do_error_trap+0x6a/0x90
> > > [  383.277072]  ? find_vma+0x3a/0x40
> > > [  383.277899]  ? exc_invalid_op+0x4c/0x60
> > > [  383.278846]  ? find_vma+0x3a/0x40
> > > [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> > > [  383.280698]  ? find_vma+0x3a/0x40
> > > [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> > > [  383.282570]  do_user_addr_fault+0x1e4/0x660
> > > [  383.283591]  exc_page_fault+0x73/0x170
> > > [  383.284509]  asm_exc_page_fault+0x22/0x30
> > > [  383.285486] RIP: 0033:0x404428
> > > [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> > > [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> > > [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> > > [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> > > [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> > > [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> > > [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> > > [  383.300203]  </TASK>
> > > [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> > > [  383.309661] ---[ end trace 0000000000000000 ]---
> > > [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> > > [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > > [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > > [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > > [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > > [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > > [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > > [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > > [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > > [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > > [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  383.334287] PKRU: 55555554
> > >
> > >
> > > Which ends up being
> > >
> > > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >
> > > I did not check if this is also the case on mainline, and if this series is responsible.
> >
> > Thanks for reporting! I'm checking it now.
>
> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> calling find_vma() without mmap_lock after successfully completing
> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> the first invocation of find_vma(), so this is not even the lock
> upgrade path... I'll try to reproduce this issue and dig up more but
> from the information I have so far this issue does not seem to be
> related to this series.

This is really weird. I added mmap_assert_locked(mm) calls into
get_mmap_lock_carefully() right after we acquire mmap_lock read lock
and one of them triggers right after successful
mmap_read_lock_killable(). Here is my modified version of
get_mmap_lock_carefully():

static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
struct pt_regs *regs) {
     /* Even if this succeeds, make it clear we might have slept */
     if (likely(mmap_read_trylock(mm))) {
         might_sleep();
         mmap_assert_locked(mm);
         return true;
     }
     if (regs && !user_mode(regs)) {
         unsigned long ip = instruction_pointer(regs);
         if (!search_exception_tables(ip))
             return false;
     }
     if (!mmap_read_lock_killable(mm)) {
         mmap_assert_locked(mm);                     <---- generates a BUG
         return true;
     }
     return false;
}

AFAICT the conditions for mmap_read_trylock() and
mmap_read_lock_killable() are checked correctly. Am I missing
something?
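
For reference, unless I'm misreading include/linux/mmap_lock.h, the
assertion that fires boils down to:

static inline void mmap_assert_locked(struct mm_struct *mm)
{
        lockdep_assert_held(&mm->mmap_lock);
        VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}

so hitting it right after a successful mmap_read_lock_killable() means
the rwsem reads back as unlocked even though we just took it for read.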

>
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 17:17       ` Suren Baghdasaryan
@ 2023-08-09 18:04         ` David Hildenbrand
  2023-08-09 18:08           ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2023-08-09 18:04 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

>>>> Which ends up being
>>>>
>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>>>
>>>> I did not check if this is also the case on mainline, and if this series is responsible.
>>>
>>> Thanks for reporting! I'm checking it now.
>>
>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
>> calling find_vma() without mmap_lock after successfully completing
>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
>> the first invocation of find_vma(), so this is not even the lock
>> upgrade path... I'll try to reproduce this issue and dig up more but
>> from the information I have so far this issue does not seem to be
>> related to this series.

I just checked on mainline and it does not fail there.

> 
> This is really weird. I added mmap_assert_locked(mm) calls into
> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> and one of them triggers right after successful
> mmap_read_lock_killable(). Here is my modified version of
> get_mmap_lock_carefully():
> 
> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> struct pt_regs *regs) {
>       /* Even if this succeeds, make it clear we might have slept */
>       if (likely(mmap_read_trylock(mm))) {
>           might_sleep();
>           mmap_assert_locked(mm);
>           return true;
>       }
>       if (regs && !user_mode(regs)) {
>           unsigned long ip = instruction_pointer(regs);
>           if (!search_exception_tables(ip))
>               return false;
>       }
>       if (!mmap_read_lock_killable(mm)) {
>           mmap_assert_locked(mm);                     <---- generates a BUG
>           return true;
>       }
>       return false;
> }

Ehm, that's indeed weird.

> 
> AFAIKT conditions for mmap_read_trylock() and
> mmap_read_lock_killable() are checked correctly. Am I missing
> something?

Weirdly enough, it only triggers during that specific uffd test, right?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 18:04         ` David Hildenbrand
@ 2023-08-09 18:08           ` Suren Baghdasaryan
  2023-08-09 18:31             ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-09 18:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>> Which ends up being
> >>>>
> >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >>>>
> >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> >>>
> >>> Thanks for reporting! I'm checking it now.
> >>
> >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> >> calling find_vma() without mmap_lock after successfully completing
> >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> >> the first invocation of find_vma(), so this is not even the lock
> >> upgrade path... I'll try to reproduce this issue and dig up more but
> >> from the information I have so far this issue does not seem to be
> >> related to this series.
>
> I just checked on mainline and it does not fail there.
>
> >
> > This is really weird. I added mmap_assert_locked(mm) calls into
> > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > and one of them triggers right after successful
> > mmap_read_lock_killable(). Here is my modified version of
> > get_mmap_lock_carefully():
> >
> > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > struct pt_regs *regs) {
> >       /* Even if this succeeds, make it clear we might have slept */
> >       if (likely(mmap_read_trylock(mm))) {
> >           might_sleep();
> >           mmap_assert_locked(mm);
> >           return true;
> >       }
> >       if (regs && !user_mode(regs)) {
> >           unsigned long ip = instruction_pointer(regs);
> >           if (!search_exception_tables(ip))
> >               return false;
> >       }
> >       if (!mmap_read_lock_killable(mm)) {
> >           mmap_assert_locked(mm);                     <---- generates a BUG
> >           return true;
> >       }
> >       return false;
> > }
>
> Ehm, that's indeed weird.
>
> >
> > AFAIKT conditions for mmap_read_trylock() and
> > mmap_read_lock_killable() are checked correctly. Am I missing
> > something?
>
> Weirdly enough, it only triggers during that specific uffd test, right?

Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
fallback from a previous test and I'm able to reproduce this
consistently.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 18:08           ` Suren Baghdasaryan
@ 2023-08-09 18:31             ` Suren Baghdasaryan
  2023-08-10  5:29               ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-09 18:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > >>>> Which ends up being
> > >>>>
> > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >>>>
> > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > >>>
> > >>> Thanks for reporting! I'm checking it now.
> > >>
> > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > >> calling find_vma() without mmap_lock after successfully completing
> > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > >> the first invocation of find_vma(), so this is not even the lock
> > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > >> from the information I have so far this issue does not seem to be
> > >> related to this series.
> >
> > I just checked on mainline and it does not fail there.

Thanks. Just to eliminate the possibility, I'll try reverting my
patchset in mm-unstable and will try the test again. Will do that in
the evening once I'm home.

> >
> > >
> > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > and one of them triggers right after successful
> > > mmap_read_lock_killable(). Here is my modified version of
> > > get_mmap_lock_carefully():
> > >
> > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > struct pt_regs *regs) {
> > >       /* Even if this succeeds, make it clear we might have slept */
> > >       if (likely(mmap_read_trylock(mm))) {
> > >           might_sleep();
> > >           mmap_assert_locked(mm);
> > >           return true;
> > >       }
> > >       if (regs && !user_mode(regs)) {
> > >           unsigned long ip = instruction_pointer(regs);
> > >           if (!search_exception_tables(ip))
> > >               return false;
> > >       }
> > >       if (!mmap_read_lock_killable(mm)) {
> > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > >           return true;
> > >       }
> > >       return false;
> > > }
> >
> > Ehm, that's indeed weird.
> >
> > >
> > > AFAIKT conditions for mmap_read_trylock() and
> > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > something?
> >
> > Weirdly enough, it only triggers during that specific uffd test, right?
>
> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> fallback from a previous test and I'm able to reproduce this
> consistently.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-09 18:31             ` Suren Baghdasaryan
@ 2023-08-10  5:29               ` Suren Baghdasaryan
  2023-08-10  6:24                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-10  5:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > >>>> Which ends up being
> > > >>>>
> > > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > > >>>>
> > > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > > >>>
> > > >>> Thanks for reporting! I'm checking it now.
> > > >>
> > > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > > >> calling find_vma() without mmap_lock after successfully completing
> > > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > > >> the first invocation of find_vma(), so this is not even the lock
> > > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > > >> from the information I have so far this issue does not seem to be
> > > >> related to this series.
> > >
> > > I just checked on mainline and it does not fail there.
>
> Thanks. Just to eliminate the possibility, I'll try reverting my
> patchset in mm-unstable and will try the test again. Will do that in
> the evening once I'm home.
>
> > >
> > > >
> > > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > > and one of them triggers right after successful
> > > > mmap_read_lock_killable(). Here is my modified version of
> > > > get_mmap_lock_carefully():
> > > >
> > > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > > struct pt_regs *regs) {
> > > >       /* Even if this succeeds, make it clear we might have slept */
> > > >       if (likely(mmap_read_trylock(mm))) {
> > > >           might_sleep();
> > > >           mmap_assert_locked(mm);
> > > >           return true;
> > > >       }
> > > >       if (regs && !user_mode(regs)) {
> > > >           unsigned long ip = instruction_pointer(regs);
> > > >           if (!search_exception_tables(ip))
> > > >               return false;
> > > >       }
> > > >       if (!mmap_read_lock_killable(mm)) {
> > > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > > >           return true;
> > > >       }
> > > >       return false;
> > > > }
> > >
> > > Ehm, that's indeed weird.
> > >
> > > >
> > > > AFAIKT conditions for mmap_read_trylock() and
> > > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > > something?
> > >
> > > Weirdly enough, it only triggers during that specific uffd test, right?
> >
> > Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> > fallback from a previous test and I'm able to reproduce this
> > consistently.

Yeah, it is somehow related to per-vma locking. Unfortunately I can't
reproduce the issue on my VM, so I have to use my host and bisection
is slow. I think I'll get to the bottom of this tomorrow.

> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10  5:29               ` Suren Baghdasaryan
@ 2023-08-10  6:24                 ` Suren Baghdasaryan
  2023-08-10  7:40                   ` David Hildenbrand
  2023-08-10 22:16                   ` Matthew Wilcox
  0 siblings, 2 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-10  6:24 UTC (permalink / raw)
  To: David Hildenbrand, willy
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > >>>> Which ends up being
> > > > >>>>
> > > > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > > > >>>>
> > > > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > > > >>>
> > > > >>> Thanks for reporting! I'm checking it now.
> > > > >>
> > > > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > > > >> calling find_vma() without mmap_lock after successfully completing
> > > > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > > > >> the first invocation of find_vma(), so this is not even the lock
> > > > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > > > >> from the information I have so far this issue does not seem to be
> > > > >> related to this series.
> > > >
> > > > I just checked on mainline and it does not fail there.
> >
> > Thanks. Just to eliminate the possibility, I'll try reverting my
> > patchset in mm-unstable and will try the test again. Will do that in
> > the evening once I'm home.
> >
> > > >
> > > > >
> > > > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > > > and one of them triggers right after successful
> > > > > mmap_read_lock_killable(). Here is my modified version of
> > > > > get_mmap_lock_carefully():
> > > > >
> > > > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > > > struct pt_regs *regs) {
> > > > >       /* Even if this succeeds, make it clear we might have slept */
> > > > >       if (likely(mmap_read_trylock(mm))) {
> > > > >           might_sleep();
> > > > >           mmap_assert_locked(mm);
> > > > >           return true;
> > > > >       }
> > > > >       if (regs && !user_mode(regs)) {
> > > > >           unsigned long ip = instruction_pointer(regs);
> > > > >           if (!search_exception_tables(ip))
> > > > >               return false;
> > > > >       }
> > > > >       if (!mmap_read_lock_killable(mm)) {
> > > > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > > > >           return true;
> > > > >       }
> > > > >       return false;
> > > > > }
> > > >
> > > > Ehm, that's indeed weird.
> > > >
> > > > >
> > > > > AFAIKT conditions for mmap_read_trylock() and
> > > > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > > > something?
> > > >
> > > > Weirdly enough, it only triggers during that specific uffd test, right?
> > >
> > > Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> > > fallback from a previous test and I'm able to reproduce this
> > > consistently.
>
> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
> reproduce the issue on my VM, so I have to use my host and bisection
> is slow. I think I'll get to the bottom of this tomorrow.

Ok, I think I found the issue.  wp_page_shared() ->
fault_dirty_shared_page() can drop mmap_lock (see the comment saying
"Drop the mmap_lock before waiting on IO, if we can..."), therefore we
have to ensure we are not doing this under the per-VMA lock.
I think what happens is that this path is racing with another page
fault which took mmap_lock for read: fault_dirty_shared_page()
releases the lock that was taken by the other page-faulting thread, and
that thread triggers the assertion when it finds that the lock it just
took was released from under it.
The following crude change fixed the issue for me, but there might be a
more granular way to deal with this:

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
vm_fault *vmf, struct folio *folio)
         struct vm_area_struct *vma = vmf->vma;
         vm_fault_t ret = 0;

+        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+                pte_unmap_unlock(vmf->pte, vmf->ptl);
+                vma_end_read(vmf->vma);
+                return VM_FAULT_RETRY;
+        }
+
         folio_get(folio);

         if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
                 vm_fault_t tmp;

                 pte_unmap_unlock(vmf->pte, vmf->ptl);
-                if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-                        folio_put(folio);
-                        vma_end_read(vmf->vma);
-                        return VM_FAULT_RETRY;
-                }
-
                 tmp = do_page_mkwrite(vmf, folio);
                 if (unlikely(!tmp || (tmp &
                                       (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {


Matthew, please check whether this fix is valid and whether there might
be a better way. I think the issue was introduced by 88e2667632d4 ("mm:
handle faults that merely update the accessed bit under the VMA lock").
Thanks,
Suren.



>
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10  6:24                 ` Suren Baghdasaryan
@ 2023-08-10  7:40                   ` David Hildenbrand
  2023-08-10 20:20                     ` Suren Baghdasaryan
  2023-08-10 22:00                     ` Matthew Wilcox
  2023-08-10 22:16                   ` Matthew Wilcox
  1 sibling, 2 replies; 25+ messages in thread
From: David Hildenbrand @ 2023-08-10  7:40 UTC (permalink / raw)
  To: Suren Baghdasaryan, willy
  Cc: akpm, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
	michel, liam.howlett, jglisse, vbabka, minchan, dave,
	punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
	yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin, linux-mm,
	linux-fsdevel, linux-kernel, kernel-team

On 10.08.23 08:24, Suren Baghdasaryan wrote:
> On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>
>>> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>
>>>> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>>>>> Which ends up being
>>>>>>>>>
>>>>>>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>>>>>>>>
>>>>>>>>> I did not check if this is also the case on mainline, and if this series is responsible.
>>>>>>>>
>>>>>>>> Thanks for reporting! I'm checking it now.
>>>>>>>
>>>>>>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
>>>>>>> calling find_vma() without mmap_lock after successfully completing
>>>>>>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
>>>>>>> the first invocation of find_vma(), so this is not even the lock
>>>>>>> upgrade path... I'll try to reproduce this issue and dig up more but
>>>>>>> from the information I have so far this issue does not seem to be
>>>>>>> related to this series.
>>>>>
>>>>> I just checked on mainline and it does not fail there.
>>>
>>> Thanks. Just to eliminate the possibility, I'll try reverting my
>>> patchset in mm-unstable and will try the test again. Will do that in
>>> the evening once I'm home.
>>>
>>>>>
>>>>>>
>>>>>> This is really weird. I added mmap_assert_locked(mm) calls into
>>>>>> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
>>>>>> and one of them triggers right after successful
>>>>>> mmap_read_lock_killable(). Here is my modified version of
>>>>>> get_mmap_lock_carefully():
>>>>>>
>>>>>> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
>>>>>> struct pt_regs *regs) {
>>>>>>        /* Even if this succeeds, make it clear we might have slept */
>>>>>>        if (likely(mmap_read_trylock(mm))) {
>>>>>>            might_sleep();
>>>>>>            mmap_assert_locked(mm);
>>>>>>            return true;
>>>>>>        }
>>>>>>        if (regs && !user_mode(regs)) {
>>>>>>            unsigned long ip = instruction_pointer(regs);
>>>>>>            if (!search_exception_tables(ip))
>>>>>>                return false;
>>>>>>        }
>>>>>>        if (!mmap_read_lock_killable(mm)) {
>>>>>>            mmap_assert_locked(mm);                     <---- generates a BUG
>>>>>>            return true;
>>>>>>        }
>>>>>>        return false;
>>>>>> }
>>>>>
>>>>> Ehm, that's indeed weird.
>>>>>
>>>>>>
>>>>>> AFAIKT conditions for mmap_read_trylock() and
>>>>>> mmap_read_lock_killable() are checked correctly. Am I missing
>>>>>> something?
>>>>>
>>>>> Weirdly enough, it only triggers during that specific uffd test, right?
>>>>
>>>> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
>>>> fallback from a previous test and I'm able to reproduce this
>>>> consistently.
>>
>> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
>> reproduce the issue on my VM, so I have to use my host and bisection
>> is slow. I think I'll get to the bottom of this tomorrow.
> 
> Ok, I think I found the issue. 

Nice!

> wp_page_shared() ->
> fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> have to ensure we are not doing this under per-VMA lock.
> I think what happens is that this path is racing with another page
> fault which took mmap_lock for read. fault_dirty_shared_page()
> releases this lock which was taken by another page faulting thread and
> that thread generates an assertion when it finds out the lock it just
> took got released from under it.

I wonder if we could detect that someone releases an mmap lock they did
not take themselves, so we can bail out early at the right place when
debugging such issues. Only with certain config knobs enabled, of course.

> The following crude change fixed the issue for me but there might be a
> more granular way to deal with this:
> 
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
> vm_fault *vmf, struct folio *folio)
>           struct vm_area_struct *vma = vmf->vma;
>           vm_fault_t ret = 0;
> 
> +        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +                pte_unmap_unlock(vmf->pte, vmf->ptl);
> +                vma_end_read(vmf->vma);
> +                return VM_FAULT_RETRY;
> +        }
> +

I won't lie: all of these locking checks are a bit hard to follow and 
possibly even harder to maintain.

Better mmap unlock sanity checks, as spelled out above, might help 
improve part of the situation.


And maybe some comments regarding the placement might help as well ;)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10  7:40                   ` David Hildenbrand
@ 2023-08-10 20:20                     ` Suren Baghdasaryan
  2023-08-10 22:00                     ` Matthew Wilcox
  1 sibling, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-10 20:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: willy, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 12:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.08.23 08:24, Suren Baghdasaryan wrote:
> > On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
> >>> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>
> >>>> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>>>>>> Which ends up being
> >>>>>>>>>
> >>>>>>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >>>>>>>>>
> >>>>>>>>> I did not check if this is also the case on mainline, and if this series is responsible.
> >>>>>>>>
> >>>>>>>> Thanks for reporting! I'm checking it now.
> >>>>>>>
> >>>>>>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> >>>>>>> calling find_vma() without mmap_lock after successfully completing
> >>>>>>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> >>>>>>> the first invocation of find_vma(), so this is not even the lock
> >>>>>>> upgrade path... I'll try to reproduce this issue and dig up more but
> >>>>>>> from the information I have so far this issue does not seem to be
> >>>>>>> related to this series.
> >>>>>
> >>>>> I just checked on mainline and it does not fail there.
> >>>
> >>> Thanks. Just to eliminate the possibility, I'll try reverting my
> >>> patchset in mm-unstable and will try the test again. Will do that in
> >>> the evening once I'm home.
> >>>
> >>>>>
> >>>>>>
> >>>>>> This is really weird. I added mmap_assert_locked(mm) calls into
> >>>>>> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> >>>>>> and one of them triggers right after successful
> >>>>>> mmap_read_lock_killable(). Here is my modified version of
> >>>>>> get_mmap_lock_carefully():
> >>>>>>
> >>>>>> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> >>>>>> struct pt_regs *regs) {
> >>>>>>        /* Even if this succeeds, make it clear we might have slept */
> >>>>>>        if (likely(mmap_read_trylock(mm))) {
> >>>>>>            might_sleep();
> >>>>>>            mmap_assert_locked(mm);
> >>>>>>            return true;
> >>>>>>        }
> >>>>>>        if (regs && !user_mode(regs)) {
> >>>>>>            unsigned long ip = instruction_pointer(regs);
> >>>>>>            if (!search_exception_tables(ip))
> >>>>>>                return false;
> >>>>>>        }
> >>>>>>        if (!mmap_read_lock_killable(mm)) {
> >>>>>>            mmap_assert_locked(mm);                     <---- generates a BUG
> >>>>>>            return true;
> >>>>>>        }
> >>>>>>        return false;
> >>>>>> }
> >>>>>
> >>>>> Ehm, that's indeed weird.
> >>>>>
> >>>>>>
> >>>>>> AFAIKT conditions for mmap_read_trylock() and
> >>>>>> mmap_read_lock_killable() are checked correctly. Am I missing
> >>>>>> something?
> >>>>>
> >>>>> Weirdly enough, it only triggers during that specific uffd test, right?
> >>>>
> >>>> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> >>>> fallback from a previous test and I'm able to reproduce this
> >>>> consistently.
> >>
> >> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
> >> reproduce the issue on my VM, so I have to use my host and bisection
> >> is slow. I think I'll get to the bottom of this tomorrow.
> >
> > Ok, I think I found the issue.
>
> Nice!
>
> > wp_page_shared() ->
> > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > have to ensure we are not doing this under per-VMA lock.
> > I think what happens is that this path is racing with another page
> > fault which took mmap_lock for read. fault_dirty_shared_page()
> > releases this lock which was taken by another page faulting thread and
> > that thread generates an assertion when it finds out the lock it just
> > took got released from under it.
>
> I wonder if we could detect that someone releases the mmap lock that was
> not taken by that person, to bail out early at the right place when
> debugging such issues. Only with certain config knobs enabled, of course.

I think that's doable. If we add a task_struct.mmap_locked = RDLOCK |
WRLOCK | NONE field, it could be set when a task takes the mmap_lock
and reset and checked when the lock is released. Lockdep would also
catch this if the release code did not race with another page-faulting
task (it would be seen as releasing a lock which was never locked).
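
A rough sketch of what I have in mind, purely illustrative (the enum,
the task_struct field and the wrappers below are all made up):

enum mmap_lock_state {
        MMAP_LOCK_NONE,
        MMAP_LOCK_RDLOCK,
        MMAP_LOCK_WRLOCK,
};

/* hypothetical new member in struct task_struct: */
/*         enum mmap_lock_state    mmap_locked; */

static inline void mmap_read_lock_tracked(struct mm_struct *mm)
{
        mmap_read_lock(mm);
        current->mmap_locked = MMAP_LOCK_RDLOCK;
}

static inline void mmap_read_unlock_tracked(struct mm_struct *mm)
{
        /* Catch a task releasing a lock it never took itself. */
        WARN_ON_ONCE(current->mmap_locked != MMAP_LOCK_RDLOCK);
        current->mmap_locked = MMAP_LOCK_NONE;
        mmap_read_unlock(mm);
}

The write-side wrappers would look the same with MMAP_LOCK_WRLOCK, and
all of it would be compiled in only behind a debug config option.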

>
> > The following crude change fixed the issue for me but there might be a
> > more granular way to deal with this:
> >
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
> > vm_fault *vmf, struct folio *folio)
> >           struct vm_area_struct *vma = vmf->vma;
> >           vm_fault_t ret = 0;
> >
> > +        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +                pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +                vma_end_read(vmf->vma);
> > +                return VM_FAULT_RETRY;
> > +        }
> > +
>
> I won't lie: all of these locking checks are a bit hard to get and
> possibly even harder to maintain.
>
> Maybe better mmap unlock sanity checks as spelled out above might help
> improve part of the situation.
>
>
> And maybe some comments regarding the placement might help as well ;)

I think comments explaining why we bail out would help. I had them in
some places but probably not all. Once the code stabilizes I'll review
the result and add more comments with explanations.
Thanks,
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10  7:40                   ` David Hildenbrand
  2023-08-10 20:20                     ` Suren Baghdasaryan
@ 2023-08-10 22:00                     ` Matthew Wilcox
  1 sibling, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2023-08-10 22:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 09:40:57AM +0200, David Hildenbrand wrote:
> I won't lie: all of these locking checks are a bit hard to get and possibly
> even harder to maintain.
> 
> Maybe better mmap unlock sanity checks as spelled out above might help
> improve part of the situation.
> 
> 
> And maybe some comments regarding the placement might help as well ;)

The placement was obvious; we can't call into drivers under the vma
lock.  Not until we've audited all of them.  I haven't yet had the
chance to figure out exactly what is being fixed with this patch ...
give me a few minutes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10  6:24                 ` Suren Baghdasaryan
  2023-08-10  7:40                   ` David Hildenbrand
@ 2023-08-10 22:16                   ` Matthew Wilcox
  2023-08-10 23:29                     ` Suren Baghdasaryan
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2023-08-10 22:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Hildenbrand, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> Ok, I think I found the issue.  wp_page_shared() ->
> fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> have to ensure we are not doing this under per-VMA lock.

... or we could change maybe_unlock_mmap_for_io() the same way
that we changed folio_lock_or_retry():

+++ b/mm/internal.h
@@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
        if (fault_flag_allow_retry_first(flags) &&
            !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
                fpin = get_file(vmf->vma->vm_file);
-               mmap_read_unlock(vmf->vma->vm_mm);
+               release_fault_lock(vmf);
        }
        return fpin;
 }

What do you think?
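
(For context, release_fault_lock(), added earlier in this series, is, if
I remember the helper right, just:

static inline void release_fault_lock(struct vm_fault *vmf)
{
        if (vmf->flags & FAULT_FLAG_VMA_LOCK)
                vma_end_read(vmf->vma);
        else
                mmap_read_unlock(vmf->vma->vm_mm);
}

so maybe_unlock_mmap_for_io() would drop whichever lock the fault
actually holds instead of unconditionally unlocking mmap_lock.)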

> I think what happens is that this path is racing with another page
> fault which took mmap_lock for read. fault_dirty_shared_page()
> releases this lock which was taken by another page faulting thread and
> that thread generates an assertion when it finds out the lock it just
> took got released from under it.

I'm confused that our debugging didn't catch this earlier.  lockdep
should always catch this.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10 22:16                   ` Matthew Wilcox
@ 2023-08-10 23:29                     ` Suren Baghdasaryan
  2023-08-10 23:43                       ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-10 23:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > Ok, I think I found the issue.  wp_page_shared() ->
> > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > have to ensure we are not doing this under per-VMA lock.
>
> ... or we could change maybe_unlock_mmap_for_io() the same way
> that we changed folio_lock_or_retry():
>
> +++ b/mm/internal.h
> @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
>         if (fault_flag_allow_retry_first(flags) &&
>             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
>                 fpin = get_file(vmf->vma->vm_file);
> -               mmap_read_unlock(vmf->vma->vm_mm);
> +               release_fault_lock(vmf);
>         }
>         return fpin;
>  }
>
> What do you think?

This is very tempting... Let me try that and see if anything explodes,
but yes, this would be ideal.


>
> > I think what happens is that this path is racing with another page
> > fault which took mmap_lock for read. fault_dirty_shared_page()
> > releases this lock which was taken by another page faulting thread and
> > that thread generates an assertion when it finds out the lock it just
> > took got released from under it.
>
> I'm confused that our debugging didn't catch this earlier.  lockdep
> should always catch this.

Maybe this condition is rare enough?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10 23:29                     ` Suren Baghdasaryan
@ 2023-08-10 23:43                       ` Suren Baghdasaryan
  2023-08-11  6:13                         ` Suren Baghdasaryan
  0 siblings, 1 reply; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-10 23:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 4:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > > Ok, I think I found the issue.  wp_page_shared() ->
> > > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > > have to ensure we are not doing this under per-VMA lock.
> >
> > ... or we could change maybe_unlock_mmap_for_io() the same way
> > that we changed folio_lock_or_retry():
> >
> > +++ b/mm/internal.h
> > @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
> >         if (fault_flag_allow_retry_first(flags) &&
> >             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> >                 fpin = get_file(vmf->vma->vm_file);
> > -               mmap_read_unlock(vmf->vma->vm_mm);
> > +               release_fault_lock(vmf);
> >         }
> >         return fpin;
> >  }
> >
> > What do you think?
>
> This is very tempting... Let me try that and see if anything explodes,
> but yes, this would be ideal.

Ok, so far it looks good; the problem is no longer reproducible. I'll
run some more exhaustive testing today.

>
>
> >
> > > I think what happens is that this path is racing with another page
> > > fault which took mmap_lock for read. fault_dirty_shared_page()
> > > releases this lock which was taken by another page faulting thread and
> > > that thread generates an assertion when it finds out the lock it just
> > > took got released from under it.
> >
> > I'm confused that our debugging didn't catch this earlier.  lockdep
> > should always catch this.
>
> Maybe this condition is rare enough?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Per-VMA lock support for swap and userfaults
  2023-08-10 23:43                       ` Suren Baghdasaryan
@ 2023-08-11  6:13                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 25+ messages in thread
From: Suren Baghdasaryan @ 2023-08-11  6:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, akpm, hannes, mhocko, josef, jack, ldufour,
	laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
	dave, punit.agrawal, lstoakes, hdanton, apopple, peterx,
	ying.huang, yuzhao, dhowells, hughd, viro, brauner,
	pasha.tatashin, linux-mm, linux-fsdevel, linux-kernel,
	kernel-team

On Thu, Aug 10, 2023 at 4:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 10, 2023 at 4:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > > > Ok, I think I found the issue.  wp_page_shared() ->
> > > > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > > > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > > > have to ensure we are not doing this under per-VMA lock.
> > >
> > > ... or we could change maybe_unlock_mmap_for_io() the same way
> > > that we changed folio_lock_or_retry():
> > >
> > > +++ b/mm/internal.h
> > > @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
> > >         if (fault_flag_allow_retry_first(flags) &&
> > >             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> > >                 fpin = get_file(vmf->vma->vm_file);
> > > -               mmap_read_unlock(vmf->vma->vm_mm);
> > > +               release_fault_lock(vmf);
> > >         }
> > >         return fpin;
> > >  }
> > >
> > > What do you think?
> >
> > This is very tempting... Let me try that and see if anything explodes,
> > but yes, this would be ideal.
>
> Ok, so far it looks good; the problem is not reproducible. I'll run
> some more exhaustive testing today.

So far it works without a glitch; Matthew, I think your suggested
change is fine. If you post a fixup, please add my Tested-by.
Thanks,
Suren.

>
> >
> >
> > >
> > > > I think what happens is that this path races with another page
> > > > fault which took mmap_lock for read. fault_dirty_shared_page()
> > > > releases that lock, which was taken by the other page-faulting
> > > > thread, and that thread then triggers an assertion failure when
> > > > it finds that the lock it just took was released from under it.
> > >
> > > I'm confused that our debugging didn't catch this earlier.  lockdep
> > > should always catch this.
> >
> > Maybe this condition is rare enough?

^ permalink raw reply	[flat|nested] 25+ messages in thread
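
For completeness, here is a simplified reconstruction of the calling
pattern that the quoted "Drop the mmap_lock before waiting on IO"
comment belongs to; the real fault_dirty_shared_page() also handles
->page_mkwrite() and file timestamps, so details may differ from this
sketch:

/* Simplified reconstruction, not the exact kernel code. */
static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
{
        struct address_space *mapping = page_mapping(vmf->page);
        struct file *fpin;

        if (!mapping)
                return 0;

        /*
         * Pin the file and drop the fault lock before throttling on
         * writeback; with the change above this releases the VMA lock
         * under FAULT_FLAG_VMA_LOCK and mmap_lock otherwise.
         */
        fpin = maybe_unlock_mmap_for_io(vmf, NULL);
        balance_dirty_pages_ratelimited(mapping);
        if (fpin) {
                fput(fpin);
                return VM_FAULT_RETRY;
        }

        return 0;
}

The get_file() pin taken in maybe_unlock_mmap_for_io() keeps the
mapping stable while no lock is held, and the VM_FAULT_RETRY return
tells the caller that the fault needs to be redone.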

end of thread, other threads:[~2023-08-11  6:13 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-30 21:19 [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
2023-06-30 21:19 ` [PATCH v7 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
2023-07-02 17:50 ` [PATCH v7 0/6] Per-VMA lock support for swap and userfaults Andrew Morton
2023-07-03 15:27   ` Suren Baghdasaryan
2023-08-09  7:48 ` David Hildenbrand
2023-08-09 14:28   ` Suren Baghdasaryan
2023-08-09 15:22     ` Suren Baghdasaryan
2023-08-09 17:17       ` Suren Baghdasaryan
2023-08-09 18:04         ` David Hildenbrand
2023-08-09 18:08           ` Suren Baghdasaryan
2023-08-09 18:31             ` Suren Baghdasaryan
2023-08-10  5:29               ` Suren Baghdasaryan
2023-08-10  6:24                 ` Suren Baghdasaryan
2023-08-10  7:40                   ` David Hildenbrand
2023-08-10 20:20                     ` Suren Baghdasaryan
2023-08-10 22:00                     ` Matthew Wilcox
2023-08-10 22:16                   ` Matthew Wilcox
2023-08-10 23:29                     ` Suren Baghdasaryan
2023-08-10 23:43                       ` Suren Baghdasaryan
2023-08-11  6:13                         ` Suren Baghdasaryan
