* [PATCH v2 0/9] Avoid the mmap lock for fault-around
@ 2023-07-11 20:20 Matthew Wilcox (Oracle)
  2023-07-11 20:20 ` [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy" Matthew Wilcox (Oracle)
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.

I've taken a very gradual approach to pushing the lock down.  I'm not 100%
confident in my ability to grasp all the finer aspects of VMA handling,
so some reviewers may well feel that I could have combined some of
these patches.  I did try to skip one of these steps and it had a bug,
so I feel justified in proceeding cautiously.

Several people have volunteered to run benchmarks on this, so I haven't.
I have run it through xfstests and it doesn't appear to introduce any
regressions.

This patchset is against next-20230711.  There is a patch from Arjun Roy
in there which has terrible conflicts with this work.  At Eric Dumazet's
suggestion I have started out by reverting it, then doing my patches
and redoing Arjun's patch on top.  This has the benefit of halving the
size of Arjun's patch.  Merging this is going to be a nightmare unless
the networking tree reverts Arjun's patch (the mm tree can't revert
a patch which isn't in the mm tree!).

Arjun's patch did point out that using lock_vma_under_rcu() is currently
very awkward, so that inspired patch 8 which makes it always available.

Arjun Roy (1):
  tcp: Use per-vma locking for receive zerocopy

Matthew Wilcox (Oracle) (8):
  Revert "tcp: Use per-vma locking for receive zerocopy"
  mm: Allow per-VMA locks on file-backed VMAs
  mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
  mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault()
  mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
  mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault()
  mm: Run the fault-around code under the VMA lock
  mm: Remove CONFIG_PER_VMA_LOCK ifdefs

 MAINTAINERS             |  1 -
 arch/arm64/mm/fault.c   |  2 --
 arch/powerpc/mm/fault.c |  4 ----
 arch/riscv/mm/fault.c   |  4 ----
 arch/s390/mm/fault.c    |  2 --
 arch/x86/mm/fault.c     |  4 ----
 include/linux/mm.h      |  6 ++++++
 include/linux/net_mm.h  | 17 -----------------
 include/net/tcp.h       |  1 -
 mm/hugetlb.c            |  6 ++++++
 mm/memory.c             | 35 +++++++++++++++++++++++++----------
 net/ipv4/tcp.c          | 14 +++++---------
 12 files changed, 42 insertions(+), 54 deletions(-)
 delete mode 100644 include/linux/net_mm.h

-- 
2.39.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:02   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

This reverts commit 7a7f094635349a7d0314364ad50bdeb770b6df4f.
---
 MAINTAINERS            |  1 -
 include/linux/net_mm.h | 17 ----------------
 include/net/tcp.h      |  1 -
 mm/memory.c            |  7 +++----
 net/ipv4/tcp.c         | 45 ++++++++----------------------------------
 5 files changed, 11 insertions(+), 60 deletions(-)
 delete mode 100644 include/linux/net_mm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 18cd0ce2c7d2..00047800cff1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14816,7 +14816,6 @@ NETWORKING [TCP]
 M:	Eric Dumazet <edumazet@google.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
-F:	include/linux/net_mm.h
 F:	include/linux/tcp.h
 F:	include/net/tcp.h
 F:	include/trace/events/tcp.h
diff --git a/include/linux/net_mm.h b/include/linux/net_mm.h
deleted file mode 100644
index b298998bd5a0..000000000000
--- a/include/linux/net_mm.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifdef CONFIG_MMU
-
-#ifdef CONFIG_INET
-extern const struct vm_operations_struct tcp_vm_ops;
-static inline bool vma_is_tcp(const struct vm_area_struct *vma)
-{
-	return vma->vm_ops == &tcp_vm_ops;
-}
-#else
-static inline bool vma_is_tcp(const struct vm_area_struct *vma)
-{
-	return false;
-}
-#endif /* CONFIG_INET*/
-
-#endif /* CONFIG_MMU */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 226bce6d1e8c..95e4507febed 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -45,7 +45,6 @@
 #include <linux/memcontrol.h>
 #include <linux/bpf-cgroup.h>
 #include <linux/siphash.h>
-#include <linux/net_mm.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
diff --git a/mm/memory.c b/mm/memory.c
index 0a265ac6246e..2c7967632866 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,7 +77,6 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
-#include <linux/net_mm.h>
 
 #include <trace/events/kmem.h>
 
@@ -5419,12 +5418,12 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	/* Only anonymous and tcp vmas are supported for now */
-	if (!vma_is_anonymous(vma) && !vma_is_tcp(vma))
+	/* Only anonymous vmas are supported for now */
+	if (!vma_is_anonymous(vma))
 		goto inval;
 
 	/* find_mergeable_anon_vma uses adjacent vmas which are not locked */
-	if (!vma->anon_vma && !vma_is_tcp(vma))
+	if (!vma->anon_vma)
 		goto inval;
 
 	if (!vma_start_read(vma))
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03e08745308..1542de3f66f7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1739,7 +1739,7 @@ void tcp_update_recv_tstamps(struct sk_buff *skb,
 }
 
 #ifdef CONFIG_MMU
-const struct vm_operations_struct tcp_vm_ops = {
+static const struct vm_operations_struct tcp_vm_ops = {
 };
 
 int tcp_mmap(struct file *file, struct socket *sock,
@@ -2038,34 +2038,6 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
 	}
 }
 
-static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
-					   unsigned long address,
-					   bool *mmap_locked)
-{
-	struct vm_area_struct *vma = NULL;
-
-#ifdef CONFIG_PER_VMA_LOCK
-	vma = lock_vma_under_rcu(mm, address);
-#endif
-	if (vma) {
-		if (!vma_is_tcp(vma)) {
-			vma_end_read(vma);
-			return NULL;
-		}
-		*mmap_locked = false;
-		return vma;
-	}
-
-	mmap_read_lock(mm);
-	vma = vma_lookup(mm, address);
-	if (!vma || !vma_is_tcp(vma)) {
-		mmap_read_unlock(mm);
-		return NULL;
-	}
-	*mmap_locked = true;
-	return vma;
-}
-
 #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc,
@@ -2083,7 +2055,6 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	u32 seq = tp->copied_seq;
 	u32 total_bytes_to_map;
 	int inq = tcp_inq(sk);
-	bool mmap_locked;
 	int ret;
 
 	zc->copybuf_len = 0;
@@ -2108,10 +2079,13 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		return 0;
 	}
 
-	vma = find_tcp_vma(current->mm, address, &mmap_locked);
-	if (!vma)
-		return -EINVAL;
+	mmap_read_lock(current->mm);
 
+	vma = vma_lookup(current->mm, address);
+	if (!vma || vma->vm_ops != &tcp_vm_ops) {
+		mmap_read_unlock(current->mm);
+		return -EINVAL;
+	}
 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
 	avail_len = min_t(u32, vma_len, inq);
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
@@ -2185,10 +2159,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 						   zc, total_bytes_to_map);
 	}
 out:
-	if (mmap_locked)
-		mmap_read_unlock(current->mm);
-	else
-		vma_end_read(vma);
+	mmap_read_unlock(current->mm);
 	/* Try to copy straggler data. */
 	if (!ret)
 		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
  2023-07-11 20:20 ` [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy" Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:03   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault() Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

The fault path will immediately fail in handle_mm_fault(), so this
is the minimal step which allows the per-VMA lock to be taken on
file-backed VMAs.  There may be a small performance reduction as a
little unnecessary work will be done on each page fault.  See later
patches for the improvement.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c7967632866..f2dcc695f54e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5247,6 +5247,11 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		goto out;
 	}
 
+	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
@@ -5418,12 +5423,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	/* Only anonymous vmas are supported for now */
-	if (!vma_is_anonymous(vma))
-		goto inval;
-
 	/* find_mergeable_anon_vma uses adjacent vmas which are not locked */
-	if (!vma->anon_vma)
+	if (vma_is_anonymous(vma) && !vma->anon_vma)
 		goto inval;
 
 	if (!vma_start_read(vma))
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
  2023-07-11 20:20 ` [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy" Matthew Wilcox (Oracle)
  2023-07-11 20:20 ` [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:04   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault() Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

Handle a little more of the page fault path outside the mmap sem.
The hugetlb path doesn't need to check whether the VMA is anonymous;
the VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
performance change from the previous commit; this is simply a step to
ease bisection of any problems.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/hugetlb.c |  6 ++++++
 mm/memory.c  | 18 +++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4a28ce0667f..109e1ff92bc8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6063,6 +6063,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
+	/* TODO: Handle faults under the VMA lock */
+	if (flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
diff --git a/mm/memory.c b/mm/memory.c
index f2dcc695f54e..6eda5c5f2069 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4998,10 +4998,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 }
 
 /*
- * By the time we get here, we already hold the mm semaphore
- *
- * The mmap_lock may have been released depending on flags and our
- * return value.  See filemap_fault() and __folio_lock_or_retry().
+ * On entry, we hold either the VMA lock or the mmap_lock
+ * (FAULT_FLAG_VMA_LOCK tells you which).  If VM_FAULT_RETRY is set in
+ * the result, the mmap_lock is not held on exit.  See filemap_fault()
+ * and __folio_lock_or_retry().
  */
 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		unsigned long address, unsigned int flags)
@@ -5020,6 +5020,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	p4d_t *p4d;
 	vm_fault_t ret;
 
+	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
@@ -5247,11 +5252,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		goto out;
 	}
 
-	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
-		vma_end_read(vma);
-		return VM_FAULT_RETRY;
-	}
-
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault()
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault() Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:17   ` Suren Baghdasaryan
  2023-07-24 15:46   ` Jann Horn
  2023-07-11 20:20 ` [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault() Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

Push the check down from __handle_mm_fault().  There's a mild upside to
this patch in that we'll allocate the page tables while under the VMA
lock rather than the mmap lock, reducing the hold time on the mmap lock,
since the retry will find the page tables already populated.
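
To illustrate (a paraphrase of the existing __handle_mm_fault() flow,
not part of this diff), the page table allocation below now runs before
the VMA-lock bail-out, so the retry under the mmap lock finds the
tables already in place:

	pgd = pgd_offset(mm, address);
	p4d = p4d_alloc(mm, pgd, address);
	vmf.pud = pud_alloc(mm, p4d, address);
	/* ... huge pud/pmd handling elided ... */
	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
	/* ... */
	return handle_pte_fault(&vmf);	/* now does the VMA-lock check */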

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6eda5c5f2069..52f7fdd78380 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4924,6 +4924,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
 
+	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
+		vma_end_read(vmf->vma);
+		return VM_FAULT_RETRY;
+	}
+
 	if (unlikely(pmd_none(*vmf->pmd))) {
 		/*
 		 * Leave __pte_alloc() until later: because vm_ops->fault may
@@ -5020,11 +5025,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	p4d_t *p4d;
 	vm_fault_t ret;
 
-	if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
-		vma_end_read(vma);
-		return VM_FAULT_RETRY;
-	}
-
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault() Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:26   ` Suren Baghdasaryan
  2023-07-24 15:46   ` Jann Horn
  2023-07-11 20:20 ` [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault() Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

Call do_pte_missing() under the VMA lock ... then immediately retry
in do_fault().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 52f7fdd78380..88cf9860f17e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4661,6 +4661,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	struct mm_struct *vm_mm = vma->vm_mm;
 	vm_fault_t ret;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK){
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	/*
 	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 	 */
@@ -4924,11 +4929,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
 
-	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
-		vma_end_read(vmf->vma);
-		return VM_FAULT_RETRY;
-	}
-
 	if (unlikely(pmd_none(*vmf->pmd))) {
 		/*
 		 * Leave __pte_alloc() until later: because vm_ops->fault may
@@ -4961,6 +4961,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!vmf->pte)
 		return do_pte_missing(vmf);
 
+	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
+		vma_end_read(vmf->vma);
+		return VM_FAULT_RETRY;
+	}
+
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault()
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault() Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:27   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

Perform the check at the start of do_read_fault(), do_cow_fault()
and do_shared_fault() instead.  Should be no performance change from
the last commit.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 88cf9860f17e..709bffee8aa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4547,6 +4547,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	struct folio *folio;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vmf->vma);
+		return VM_FAULT_RETRY;
+	}
+
 	/*
 	 * Let's call ->map_pages() first and use ->fault() as fallback
 	 * if page by the offset is not ready to be mapped (cold cache or
@@ -4575,6 +4580,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 
@@ -4615,6 +4625,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	vm_fault_t ret, tmp;
 	struct folio *folio;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vma);
+		return VM_FAULT_RETRY;
+	}
+
 	ret = __do_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
@@ -4661,11 +4676,6 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	struct mm_struct *vm_mm = vma->vm_mm;
 	vm_fault_t ret;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK){
-		vma_end_read(vma);
-		return VM_FAULT_RETRY;
-	}
-
 	/*
 	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 	 */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault() Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:32   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs Matthew Wilcox (Oracle)
  2023-07-11 20:20 ` [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy Matthew Wilcox (Oracle)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

The map_pages fs method should be safe to run under the VMA lock instead
of the mmap lock.  This should have a measurable reduction in contention
on the mmap lock.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 709bffee8aa2..0a4e363b0605 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4547,11 +4547,6 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	struct folio *folio;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		vma_end_read(vmf->vma);
-		return VM_FAULT_RETRY;
-	}
-
 	/*
 	 * Let's call ->map_pages() first and use ->fault() as fallback
 	 * if page by the offset is not ready to be mapped (cold cache or
@@ -4563,6 +4558,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		vma_end_read(vmf->vma);
+		return VM_FAULT_RETRY;
+	}
+
 	ret = __do_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:34   ` Suren Baghdasaryan
  2023-07-11 20:20 ` [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy Matthew Wilcox (Oracle)
  8 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle),
	Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal

Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined
to eliminate ifdefs in the users.
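
With the !CONFIG_PER_VMA_LOCK stub simply returning NULL, callers no
longer need any ifdef and fall back to the mmap lock naturally.  A
rough sketch of the resulting pattern (patch 9's find_tcp_vma() is the
real example):

	vma = lock_vma_under_rcu(mm, address);
	if (!vma) {	/* stub returns NULL without CONFIG_PER_VMA_LOCK */
		mmap_read_lock(mm);
		vma = vma_lookup(mm, address);
		/* ... */
	}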

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 arch/arm64/mm/fault.c   | 2 --
 arch/powerpc/mm/fault.c | 4 ----
 arch/riscv/mm/fault.c   | 4 ----
 arch/s390/mm/fault.c    | 2 --
 arch/x86/mm/fault.c     | 4 ----
 include/linux/mm.h      | 6 ++++++
 6 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b8c80f7b8a5f..2e5d1e238af9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -587,7 +587,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
@@ -616,7 +615,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		return 0;
 	}
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 
 retry:
 	vma = lock_mm_and_find_vma(mm, addr, regs);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 82954d0e6906..b1723094d464 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,7 +469,6 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (is_exec)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
@@ -502,7 +501,6 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 		return user_mode(regs) ? 0 : SIGBUS;
 
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
@@ -552,9 +550,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	mmap_read_unlock(current->mm);
 
-#ifdef CONFIG_PER_VMA_LOCK
 done:
-#endif
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return mm_fault_error(regs, address, fault);
 
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 6ea2cce4cc17..046732fcb48c 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -283,7 +283,6 @@ void handle_page_fault(struct pt_regs *regs)
 		flags |= FAULT_FLAG_WRITE;
 	else if (cause == EXC_INST_PAGE_FAULT)
 		flags |= FAULT_FLAG_INSTRUCTION;
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
@@ -311,7 +310,6 @@ void handle_page_fault(struct pt_regs *regs)
 		return;
 	}
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 
 retry:
 	vma = lock_mm_and_find_vma(mm, addr, regs);
@@ -368,9 +366,7 @@ void handle_page_fault(struct pt_regs *regs)
 
 	mmap_read_unlock(mm);
 
-#ifdef CONFIG_PER_VMA_LOCK
 done:
-#endif
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		tsk->thread.bad_cause = cause;
 		mm_fault_error(regs, addr, fault);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 40a71063949b..ac8351f172bb 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -407,7 +407,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 		access = VM_WRITE;
 	if (access == VM_WRITE)
 		flags |= FAULT_FLAG_WRITE;
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 	vma = lock_vma_under_rcu(mm, address);
@@ -431,7 +430,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 		goto out;
 	}
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 	mmap_read_lock(mm);
 
 	gmap = NULL;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b0f7add07aa5..ab778eac1952 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1350,7 +1350,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
@@ -1381,7 +1380,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 
 retry:
 	vma = lock_mm_and_find_vma(mm, address, regs);
@@ -1441,9 +1439,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	mmap_read_unlock(mm);
-#ifdef CONFIG_PER_VMA_LOCK
 done:
-#endif
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46c442855df7..3c923a4bf213 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -813,6 +813,12 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 	mmap_assert_locked(vmf->vma->vm_mm);
 }
 
+static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+		unsigned long address)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_PER_VMA_LOCK */
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy
  2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2023-07-11 20:20 ` [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs Matthew Wilcox (Oracle)
@ 2023-07-11 20:20 ` Matthew Wilcox (Oracle)
  2023-07-14  3:40   ` Suren Baghdasaryan
  2023-07-21 18:48   ` Matthew Wilcox
  8 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-11 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Arjun Roy, Eric Dumazet, Suren Baghdasaryan, linux-fsdevel,
	Punit Agrawal, David S . Miller, Matthew Wilcox

From: Arjun Roy <arjunroy@google.com>

Per-VMA locking allows us to lock a struct vm_area_struct without
taking the process-wide mmap lock in read mode.

Consider a process workload where the mmap lock is taken constantly in
write mode. In this scenario, all zerocopy receives are blocked for as
long as the write lock is held - though in principle, the memory
ranges being used by TCP are not touched by the operations that need
the mmap write lock. This results in performance degradation.

Now consider another workload where the mmap lock is never taken in
write mode, but there are many TCP connections using receive zerocopy
that are concurrently receiving. These connections all take the mmap
lock in read mode, but this does induce a lot of contention and atomic
ops for this process-wide lock. This results in additional CPU
overhead caused by contending on the cache line for this lock.

However, with per-vma locking, both of these problems can be avoided.

As a test, I ran an RPC-style request/response workload with 4KB
payloads and receive zerocopy enabled, with 100 simultaneous TCP
connections. I measured perf cycles within the
find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
without per-vma locking enabled.

When using process-wide mmap semaphore read locking, about 1% of
measured perf cycles were within this path. With per-VMA locking, this
value dropped to about 0.45%.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 net/ipv4/tcp.c | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1542de3f66f7..7118ec6cf886 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2038,6 +2038,30 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
 	}
 }
 
+static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
+		unsigned long address, bool *mmap_locked)
+{
+	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
+
+	if (vma) {
+		if (vma->vm_ops != &tcp_vm_ops) {
+			vma_end_read(vma);
+			return NULL;
+		}
+		*mmap_locked = false;
+		return vma;
+	}
+
+	mmap_read_lock(mm);
+	vma = vma_lookup(mm, address);
+	if (!vma || vma->vm_ops != &tcp_vm_ops) {
+		mmap_read_unlock(mm);
+		return NULL;
+	}
+	*mmap_locked = true;
+	return vma;
+}
+
 #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc,
@@ -2055,6 +2079,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	u32 seq = tp->copied_seq;
 	u32 total_bytes_to_map;
 	int inq = tcp_inq(sk);
+	bool mmap_locked;
 	int ret;
 
 	zc->copybuf_len = 0;
@@ -2079,13 +2104,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		return 0;
 	}
 
-	mmap_read_lock(current->mm);
-
-	vma = vma_lookup(current->mm, address);
-	if (!vma || vma->vm_ops != &tcp_vm_ops) {
-		mmap_read_unlock(current->mm);
+	vma = find_tcp_vma(current->mm, address, &mmap_locked);
+	if (!vma)
 		return -EINVAL;
-	}
+
 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
 	avail_len = min_t(u32, vma_len, inq);
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
@@ -2159,7 +2181,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
 						   zc, total_bytes_to_map);
 	}
 out:
-	mmap_read_unlock(current->mm);
+	if (mmap_locked)
+		mmap_read_unlock(current->mm);
+	else
+		vma_end_read(vma);
 	/* Try to copy straggler data. */
 	if (!ret)
 		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-11 20:20 ` [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy" Matthew Wilcox (Oracle)
@ 2023-07-14  3:02   ` Suren Baghdasaryan
  2023-07-14  3:34     ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:02 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> This reverts commit 7a7f094635349a7d0314364ad50bdeb770b6df4f.

nit: some explanation and SOB would be nice.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>


> ---
>  MAINTAINERS            |  1 -
>  include/linux/net_mm.h | 17 ----------------
>  include/net/tcp.h      |  1 -
>  mm/memory.c            |  7 +++----
>  net/ipv4/tcp.c         | 45 ++++++++----------------------------------
>  5 files changed, 11 insertions(+), 60 deletions(-)
>  delete mode 100644 include/linux/net_mm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 18cd0ce2c7d2..00047800cff1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14816,7 +14816,6 @@ NETWORKING [TCP]
>  M:     Eric Dumazet <edumazet@google.com>
>  L:     netdev@vger.kernel.org
>  S:     Maintained
> -F:     include/linux/net_mm.h
>  F:     include/linux/tcp.h
>  F:     include/net/tcp.h
>  F:     include/trace/events/tcp.h
> diff --git a/include/linux/net_mm.h b/include/linux/net_mm.h
> deleted file mode 100644
> index b298998bd5a0..000000000000
> --- a/include/linux/net_mm.h
> +++ /dev/null
> @@ -1,17 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-or-later */
> -#ifdef CONFIG_MMU
> -
> -#ifdef CONFIG_INET
> -extern const struct vm_operations_struct tcp_vm_ops;
> -static inline bool vma_is_tcp(const struct vm_area_struct *vma)
> -{
> -       return vma->vm_ops == &tcp_vm_ops;
> -}
> -#else
> -static inline bool vma_is_tcp(const struct vm_area_struct *vma)
> -{
> -       return false;
> -}
> -#endif /* CONFIG_INET*/
> -
> -#endif /* CONFIG_MMU */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 226bce6d1e8c..95e4507febed 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -45,7 +45,6 @@
>  #include <linux/memcontrol.h>
>  #include <linux/bpf-cgroup.h>
>  #include <linux/siphash.h>
> -#include <linux/net_mm.h>
>
>  extern struct inet_hashinfo tcp_hashinfo;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0a265ac6246e..2c7967632866 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -77,7 +77,6 @@
>  #include <linux/ptrace.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sched/sysctl.h>
> -#include <linux/net_mm.h>
>
>  #include <trace/events/kmem.h>
>
> @@ -5419,12 +5418,12 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma)
>                 goto inval;
>
> -       /* Only anonymous and tcp vmas are supported for now */
> -       if (!vma_is_anonymous(vma) && !vma_is_tcp(vma))
> +       /* Only anonymous vmas are supported for now */
> +       if (!vma_is_anonymous(vma))
>                 goto inval;
>
>         /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> -       if (!vma->anon_vma && !vma_is_tcp(vma))
> +       if (!vma->anon_vma)
>                 goto inval;
>
>         if (!vma_start_read(vma))
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03e08745308..1542de3f66f7 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1739,7 +1739,7 @@ void tcp_update_recv_tstamps(struct sk_buff *skb,
>  }
>
>  #ifdef CONFIG_MMU
> -const struct vm_operations_struct tcp_vm_ops = {
> +static const struct vm_operations_struct tcp_vm_ops = {
>  };
>
>  int tcp_mmap(struct file *file, struct socket *sock,
> @@ -2038,34 +2038,6 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
>         }
>  }
>
> -static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> -                                          unsigned long address,
> -                                          bool *mmap_locked)
> -{
> -       struct vm_area_struct *vma = NULL;
> -
> -#ifdef CONFIG_PER_VMA_LOCK
> -       vma = lock_vma_under_rcu(mm, address);
> -#endif
> -       if (vma) {
> -               if (!vma_is_tcp(vma)) {
> -                       vma_end_read(vma);
> -                       return NULL;
> -               }
> -               *mmap_locked = false;
> -               return vma;
> -       }
> -
> -       mmap_read_lock(mm);
> -       vma = vma_lookup(mm, address);
> -       if (!vma || !vma_is_tcp(vma)) {
> -               mmap_read_unlock(mm);
> -               return NULL;
> -       }
> -       *mmap_locked = true;
> -       return vma;
> -}
> -
>  #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
>  static int tcp_zerocopy_receive(struct sock *sk,
>                                 struct tcp_zerocopy_receive *zc,
> @@ -2083,7 +2055,6 @@ static int tcp_zerocopy_receive(struct sock *sk,
>         u32 seq = tp->copied_seq;
>         u32 total_bytes_to_map;
>         int inq = tcp_inq(sk);
> -       bool mmap_locked;
>         int ret;
>
>         zc->copybuf_len = 0;
> @@ -2108,10 +2079,13 @@ static int tcp_zerocopy_receive(struct sock *sk,
>                 return 0;
>         }
>
> -       vma = find_tcp_vma(current->mm, address, &mmap_locked);
> -       if (!vma)
> -               return -EINVAL;
> +       mmap_read_lock(current->mm);
>
> +       vma = vma_lookup(current->mm, address);
> +       if (!vma || vma->vm_ops != &tcp_vm_ops) {
> +               mmap_read_unlock(current->mm);
> +               return -EINVAL;
> +       }
>         vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
>         avail_len = min_t(u32, vma_len, inq);
>         total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> @@ -2185,10 +2159,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
>                                                    zc, total_bytes_to_map);
>         }
>  out:
> -       if (mmap_locked)
> -               mmap_read_unlock(current->mm);
> -       else
> -               vma_end_read(vma);
> +       mmap_read_unlock(current->mm);
>         /* Try to copy straggler data. */
>         if (!ret)
>                 copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs
  2023-07-11 20:20 ` [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs Matthew Wilcox (Oracle)
@ 2023-07-14  3:03   ` Suren Baghdasaryan
  0 siblings, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:03 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> The fault path will immediately fail in handle_mm_fault(), so this
> is the minimal step which allows the per-VMA lock to be taken on
> file-backed VMAs.  There may be a small performance reduction as a
> little unnecessary work will be done on each page fault.  See later
> patches for the improvement.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2c7967632866..f2dcc695f54e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5247,6 +5247,11 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                 goto out;
>         }
>
> +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         /*
>          * Enable the memcg OOM handling for faults triggered in user
>          * space.  Kernel faults are handled more gracefully.
> @@ -5418,12 +5423,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma)
>                 goto inval;
>
> -       /* Only anonymous vmas are supported for now */
> -       if (!vma_is_anonymous(vma))
> -               goto inval;
> -
>         /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> -       if (!vma->anon_vma)
> +       if (vma_is_anonymous(vma) && !vma->anon_vma)
>                 goto inval;
>
>         if (!vma_start_read(vma))
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
  2023-07-11 20:20 ` [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault() Matthew Wilcox (Oracle)
@ 2023-07-14  3:04   ` Suren Baghdasaryan
  0 siblings, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:04 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:20 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> Handle a little more of the page fault path outside the mmap sem.
> The hugetlb path doesn't need to check whether the VMA is anonymous;
> the VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
> performance change from the previous commit; this is simply a step to
> ease bisection of any problems.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/hugetlb.c |  6 ++++++
>  mm/memory.c  | 18 +++++++++---------
>  2 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e4a28ce0667f..109e1ff92bc8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6063,6 +6063,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>         int need_wait_lock = 0;
>         unsigned long haddr = address & huge_page_mask(h);
>
> +       /* TODO: Handle faults under the VMA lock */
> +       if (flags & FAULT_FLAG_VMA_LOCK) {
> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         /*
>          * Serialize hugepage allocation and instantiation, so that we don't
>          * get spurious allocation failures if two CPUs race to instantiate
> diff --git a/mm/memory.c b/mm/memory.c
> index f2dcc695f54e..6eda5c5f2069 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4998,10 +4998,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  }
>
>  /*
> - * By the time we get here, we already hold the mm semaphore
> - *
> - * The mmap_lock may have been released depending on flags and our
> - * return value.  See filemap_fault() and __folio_lock_or_retry().
> + * On entry, we hold either the VMA lock or the mmap_lock
> + * (FAULT_FLAG_VMA_LOCK tells you which).  If VM_FAULT_RETRY is set in
> + * the result, the mmap_lock is not held on exit.  See filemap_fault()
> + * and __folio_lock_or_retry().
>   */
>  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 unsigned long address, unsigned int flags)
> @@ -5020,6 +5020,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>         p4d_t *p4d;
>         vm_fault_t ret;
>
> +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         pgd = pgd_offset(mm, address);
>         p4d = p4d_alloc(mm, pgd, address);
>         if (!p4d)
> @@ -5247,11 +5252,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                 goto out;
>         }
>
> -       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
> -               vma_end_read(vma);
> -               return VM_FAULT_RETRY;
> -       }
> -
>         /*
>          * Enable the memcg OOM handling for faults triggered in user
>          * space.  Kernel faults are handled more gracefully.
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault()
  2023-07-11 20:20 ` [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault() Matthew Wilcox (Oracle)
@ 2023-07-14  3:17   ` Suren Baghdasaryan
  2023-07-24 15:46   ` Jann Horn
  1 sibling, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:17 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:20 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> Push the check down from __handle_mm_fault().  There's a mild upside to
> this patch in that we'll allocate the page tables while under the VMA
> lock rather than the mmap lock, reducing the hold time on the mmap lock,
> since the retry will find the page tables already populated.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Took some time but seems safe to me...

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6eda5c5f2069..52f7fdd78380 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4924,6 +4924,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  {
>         pte_t entry;
>
> +       if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
> +               vma_end_read(vmf->vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         if (unlikely(pmd_none(*vmf->pmd))) {
>                 /*
>                  * Leave __pte_alloc() until later: because vm_ops->fault may
> @@ -5020,11 +5025,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>         p4d_t *p4d;
>         vm_fault_t ret;
>
> -       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
> -               vma_end_read(vma);
> -               return VM_FAULT_RETRY;
> -       }
> -
>         pgd = pgd_offset(mm, address);
>         p4d = p4d_alloc(mm, pgd, address);
>         if (!p4d)
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
  2023-07-11 20:20 ` [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault() Matthew Wilcox (Oracle)
@ 2023-07-14  3:26   ` Suren Baghdasaryan
  2023-07-24 15:46   ` Jann Horn
  1 sibling, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:26 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> Call do_pte_missing() under the VMA lock ... then immediately retry
> in do_fault().
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 52f7fdd78380..88cf9860f17e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4661,6 +4661,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>         struct mm_struct *vm_mm = vma->vm_mm;
>         vm_fault_t ret;
>
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK){

nit: space before {

> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         /*
>          * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
>          */
> @@ -4924,11 +4929,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  {
>         pte_t entry;
>
> -       if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
> -               vma_end_read(vmf->vma);
> -               return VM_FAULT_RETRY;
> -       }
> -

A comment a bit further down talks about "A regular pmd is established
and it can't morph into a huge pmd by anon khugepaged, since that takes
mmap_lock in write mode".
I assume this is about collapse_pte_mapped_thp() and it does call
vma_start_write(vma), so I think we are ok.


>         if (unlikely(pmd_none(*vmf->pmd))) {
>                 /*
>                  * Leave __pte_alloc() until later: because vm_ops->fault may
> @@ -4961,6 +4961,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>         if (!vmf->pte)
>                 return do_pte_missing(vmf);
>
> +       if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
> +               vma_end_read(vmf->vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         if (!pte_present(vmf->orig_pte))
>                 return do_swap_page(vmf);
>
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault()
  2023-07-11 20:20 ` [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault() Matthew Wilcox (Oracle)
@ 2023-07-14  3:27   ` Suren Baghdasaryan
  0 siblings, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:27 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> Perform the check at the start of do_read_fault(), do_cow_fault()
> and do_shared_fault() instead.  Should be no performance change from
> the last commit.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 88cf9860f17e..709bffee8aa2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4547,6 +4547,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>         vm_fault_t ret = 0;
>         struct folio *folio;
>
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +               vma_end_read(vmf->vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         /*
>          * Let's call ->map_pages() first and use ->fault() as fallback
>          * if page by the offset is not ready to be mapped (cold cache or
> @@ -4575,6 +4580,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
>         struct vm_area_struct *vma = vmf->vma;
>         vm_fault_t ret;
>
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         if (unlikely(anon_vma_prepare(vma)))
>                 return VM_FAULT_OOM;
>
> @@ -4615,6 +4625,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>         vm_fault_t ret, tmp;
>         struct folio *folio;
>
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +               vma_end_read(vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         ret = __do_fault(vmf);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 return ret;
> @@ -4661,11 +4676,6 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>         struct mm_struct *vm_mm = vma->vm_mm;
>         vm_fault_t ret;
>
> -       if (vmf->flags & FAULT_FLAG_VMA_LOCK){
> -               vma_end_read(vma);
> -               return VM_FAULT_RETRY;
> -       }
> -
>         /*
>          * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
>          */
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock
  2023-07-11 20:20 ` [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock Matthew Wilcox (Oracle)
@ 2023-07-14  3:32   ` Suren Baghdasaryan
  2023-07-24 17:38     ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:32 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:20 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> The map_pages fs method should be safe to run under the VMA lock instead
> of the mmap lock.  This should have a measurable reduction in contention
> on the mmap lock.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I'll trust your claim that vmf->vma->vm_ops->map_pages() never relies
on mmap_lock. I think it makes sense but I did not check every case :)

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 709bffee8aa2..0a4e363b0605 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4547,11 +4547,6 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>         vm_fault_t ret = 0;
>         struct folio *folio;
>
> -       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> -               vma_end_read(vmf->vma);
> -               return VM_FAULT_RETRY;
> -       }
> -
>         /*
>          * Let's call ->map_pages() first and use ->fault() as fallback
>          * if page by the offset is not ready to be mapped (cold cache or
> @@ -4563,6 +4558,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>                         return ret;
>         }
>
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +               vma_end_read(vmf->vma);
> +               return VM_FAULT_RETRY;
> +       }
> +
>         ret = __do_fault(vmf);
>         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>                 return ret;
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-14  3:02   ` Suren Baghdasaryan
@ 2023-07-14  3:34     ` Matthew Wilcox
  2023-07-24 14:49       ` Jann Horn
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-14  3:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Thu, Jul 13, 2023 at 08:02:12PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> >
> > This reverts commit 7a7f094635349a7d0314364ad50bdeb770b6df4f.
> 
> nit: some explanation and SOB would be nice.

Well, it can't be actually applied.  What needs to happen is that the
networking people need to drop the commit from their tree.  Some review
from the networking people would be helpful to be sure that I didn't
break anything in my reworking of this patch to apply after my patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs
  2023-07-11 20:20 ` [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs Matthew Wilcox (Oracle)
@ 2023-07-14  3:34   ` Suren Baghdasaryan
  0 siblings, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:34 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined
> to eliminate ifdefs in the users.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Very nice. Thanks!

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  arch/arm64/mm/fault.c   | 2 --
>  arch/powerpc/mm/fault.c | 4 ----
>  arch/riscv/mm/fault.c   | 4 ----
>  arch/s390/mm/fault.c    | 2 --
>  arch/x86/mm/fault.c     | 4 ----
>  include/linux/mm.h      | 6 ++++++
>  6 files changed, 6 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index b8c80f7b8a5f..2e5d1e238af9 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -587,7 +587,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>
>         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> -#ifdef CONFIG_PER_VMA_LOCK
>         if (!(mm_flags & FAULT_FLAG_USER))
>                 goto lock_mmap;
>
> @@ -616,7 +615,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>                 return 0;
>         }
>  lock_mmap:
> -#endif /* CONFIG_PER_VMA_LOCK */
>
>  retry:
>         vma = lock_mm_and_find_vma(mm, addr, regs);
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 82954d0e6906..b1723094d464 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -469,7 +469,6 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>         if (is_exec)
>                 flags |= FAULT_FLAG_INSTRUCTION;
>
> -#ifdef CONFIG_PER_VMA_LOCK
>         if (!(flags & FAULT_FLAG_USER))
>                 goto lock_mmap;
>
> @@ -502,7 +501,6 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>                 return user_mode(regs) ? 0 : SIGBUS;
>
>  lock_mmap:
> -#endif /* CONFIG_PER_VMA_LOCK */
>
>         /* When running in the kernel we expect faults to occur only to
>          * addresses in user space.  All other faults represent errors in the
> @@ -552,9 +550,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>
>         mmap_read_unlock(current->mm);
>
> -#ifdef CONFIG_PER_VMA_LOCK
>  done:
> -#endif
>         if (unlikely(fault & VM_FAULT_ERROR))
>                 return mm_fault_error(regs, address, fault);
>
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index 6ea2cce4cc17..046732fcb48c 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -283,7 +283,6 @@ void handle_page_fault(struct pt_regs *regs)
>                 flags |= FAULT_FLAG_WRITE;
>         else if (cause == EXC_INST_PAGE_FAULT)
>                 flags |= FAULT_FLAG_INSTRUCTION;
> -#ifdef CONFIG_PER_VMA_LOCK
>         if (!(flags & FAULT_FLAG_USER))
>                 goto lock_mmap;
>
> @@ -311,7 +310,6 @@ void handle_page_fault(struct pt_regs *regs)
>                 return;
>         }
>  lock_mmap:
> -#endif /* CONFIG_PER_VMA_LOCK */
>
>  retry:
>         vma = lock_mm_and_find_vma(mm, addr, regs);
> @@ -368,9 +366,7 @@ void handle_page_fault(struct pt_regs *regs)
>
>         mmap_read_unlock(mm);
>
> -#ifdef CONFIG_PER_VMA_LOCK
>  done:
> -#endif
>         if (unlikely(fault & VM_FAULT_ERROR)) {
>                 tsk->thread.bad_cause = cause;
>                 mm_fault_error(regs, addr, fault);
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 40a71063949b..ac8351f172bb 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -407,7 +407,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>                 access = VM_WRITE;
>         if (access == VM_WRITE)
>                 flags |= FAULT_FLAG_WRITE;
> -#ifdef CONFIG_PER_VMA_LOCK
>         if (!(flags & FAULT_FLAG_USER))
>                 goto lock_mmap;
>         vma = lock_vma_under_rcu(mm, address);
> @@ -431,7 +430,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>                 goto out;
>         }
>  lock_mmap:
> -#endif /* CONFIG_PER_VMA_LOCK */
>         mmap_read_lock(mm);
>
>         gmap = NULL;
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index b0f7add07aa5..ab778eac1952 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1350,7 +1350,6 @@ void do_user_addr_fault(struct pt_regs *regs,
>         }
>  #endif
>
> -#ifdef CONFIG_PER_VMA_LOCK
>         if (!(flags & FAULT_FLAG_USER))
>                 goto lock_mmap;
>
> @@ -1381,7 +1380,6 @@ void do_user_addr_fault(struct pt_regs *regs,
>                 return;
>         }
>  lock_mmap:
> -#endif /* CONFIG_PER_VMA_LOCK */
>
>  retry:
>         vma = lock_mm_and_find_vma(mm, address, regs);
> @@ -1441,9 +1439,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>         }
>
>         mmap_read_unlock(mm);
> -#ifdef CONFIG_PER_VMA_LOCK
>  done:
> -#endif
>         if (likely(!(fault & VM_FAULT_ERROR)))
>                 return;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46c442855df7..3c923a4bf213 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -813,6 +813,12 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>         mmap_assert_locked(vmf->vma->vm_mm);
>  }
>
> +static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +               unsigned long address)
> +{
> +       return NULL;
> +}
> +
>  #endif /* CONFIG_PER_VMA_LOCK */
>
>  /*
> --
> 2.39.2
>
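
With the stub in place, every architecture's fault handler can use the same
unconditional shape.  Roughly (illustrative only, not any one architecture
verbatim; the identifiers are taken from the hunks above):

	if (!(flags & FAULT_FLAG_USER))
		goto lock_mmap;

	/* Without CONFIG_PER_VMA_LOCK this now compiles to the NULL stub above. */
	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;

	/* ... handle the fault under the VMA read lock, then return ... */

lock_mmap:
	/* ... fall back to the mmap-lock path ... */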

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy
  2023-07-11 20:20 ` [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy Matthew Wilcox (Oracle)
@ 2023-07-14  3:40   ` Suren Baghdasaryan
  2023-07-21 18:48   ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Suren Baghdasaryan @ 2023-07-14  3:40 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal,
	David S . Miller

On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
>
> From: Arjun Roy <arjunroy@google.com>
>
> Per-VMA locking allows us to lock a struct vm_area_struct without
> taking the process-wide mmap lock in read mode.
>
> Consider a process workload where the mmap lock is taken constantly in
> write mode. In this scenario, all zerocopy receives are periodically
> blocked during that period of time - though in principle, the memory
> ranges being used by TCP are not touched by the operations that need
> the mmap write lock. This results in performance degradation.
>
> Now consider another workload where the mmap lock is never taken in
> write mode, but there are many TCP connections using receive zerocopy
> that are concurrently receiving. These connections all take the mmap
> lock in read mode, but this does induce a lot of contention and atomic
> ops for this process-wide lock. This results in additional CPU
> overhead caused by contending on the cache line for this lock.
>
> However, with per-vma locking, both of these problems can be avoided.
>
> As a test, I ran an RPC-style request/response workload with 4KB
> payloads and receive zerocopy enabled, with 100 simultaneous TCP
> connections. I measured perf cycles within the
> find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
> without per-vma locking enabled.
>
> When using process-wide mmap semaphore read locking, about 1% of
> measured perf cycles were within this path. With per-VMA locking, this
> value dropped to about 0.45%.
>
> Signed-off-by: Arjun Roy <arjunroy@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Seems to match the original version with less churn.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  net/ipv4/tcp.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1542de3f66f7..7118ec6cf886 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2038,6 +2038,30 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
>         }
>  }
>

nit: Maybe add a comment that the mmap_locked value is
undefined/meaningless if the function returns NULL?

> +static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> +               unsigned long address, bool *mmap_locked)
> +{
> +       struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> +
> +       if (vma) {
> +               if (vma->vm_ops != &tcp_vm_ops) {
> +                       vma_end_read(vma);
> +                       return NULL;
> +               }
> +               *mmap_locked = false;
> +               return vma;
> +       }
> +
> +       mmap_read_lock(mm);
> +       vma = vma_lookup(mm, address);
> +       if (!vma || vma->vm_ops != &tcp_vm_ops) {
> +               mmap_read_unlock(mm);
> +               return NULL;
> +       }
> +       *mmap_locked = true;
> +       return vma;
> +}
> +
>  #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
>  static int tcp_zerocopy_receive(struct sock *sk,
>                                 struct tcp_zerocopy_receive *zc,
> @@ -2055,6 +2079,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
>         u32 seq = tp->copied_seq;
>         u32 total_bytes_to_map;
>         int inq = tcp_inq(sk);
> +       bool mmap_locked;
>         int ret;
>
>         zc->copybuf_len = 0;
> @@ -2079,13 +2104,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>                 return 0;
>         }
>
> -       mmap_read_lock(current->mm);
> -
> -       vma = vma_lookup(current->mm, address);
> -       if (!vma || vma->vm_ops != &tcp_vm_ops) {
> -               mmap_read_unlock(current->mm);
> +       vma = find_tcp_vma(current->mm, address, &mmap_locked);
> +       if (!vma)
>                 return -EINVAL;
> -       }
> +
>         vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
>         avail_len = min_t(u32, vma_len, inq);
>         total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> @@ -2159,7 +2181,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>                                                    zc, total_bytes_to_map);
>         }
>  out:
> -       mmap_read_unlock(current->mm);
> +       if (mmap_locked)
> +               mmap_read_unlock(current->mm);
> +       else
> +               vma_end_read(vma);
>         /* Try to copy straggler data. */
>         if (!ret)
>                 copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> --
> 2.39.2
>
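
One way the comment Suren suggests could read, shown against the
find_tcp_vma() helper added by the patch (the wording is only a sketch):

/*
 * On success, *mmap_locked tells the caller which lock is held: false means
 * only the VMA read lock, true means the mmap read lock.  If NULL is
 * returned, *mmap_locked is left untouched and must not be relied upon.
 */
static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
		unsigned long address, bool *mmap_locked)
{
	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);

	if (vma) {
		if (vma->vm_ops != &tcp_vm_ops) {
			vma_end_read(vma);
			return NULL;
		}
		*mmap_locked = false;
		return vma;
	}

	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	if (!vma || vma->vm_ops != &tcp_vm_ops) {
		mmap_read_unlock(mm);
		return NULL;
	}
	*mmap_locked = true;
	return vma;
}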

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy
  2023-07-11 20:20 ` [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy Matthew Wilcox (Oracle)
  2023-07-14  3:40   ` Suren Baghdasaryan
@ 2023-07-21 18:48   ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-21 18:48 UTC (permalink / raw)
  To: Arjun Roy, Eric Dumazet
  Cc: linux-mm, Suren Baghdasaryan, linux-fsdevel, Punit Agrawal,
	David S . Miller


Eric?  Arjun?  Any comments?

On Tue, Jul 11, 2023 at 09:20:47PM +0100, Matthew Wilcox (Oracle) wrote:
> From: Arjun Roy <arjunroy@google.com>
> 
> Per-VMA locking allows us to lock a struct vm_area_struct without
> taking the process-wide mmap lock in read mode.
> 
> Consider a process workload where the mmap lock is taken constantly in
> write mode. In this scenario, all zerocopy receives are periodically
> blocked during that period of time - though in principle, the memory
> ranges being used by TCP are not touched by the operations that need
> the mmap write lock. This results in performance degradation.
> 
> Now consider another workload where the mmap lock is never taken in
> write mode, but there are many TCP connections using receive zerocopy
> that are concurrently receiving. These connections all take the mmap
> lock in read mode, but this does induce a lot of contention and atomic
> ops for this process-wide lock. This results in additional CPU
> overhead caused by contending on the cache line for this lock.
> 
> However, with per-vma locking, both of these problems can be avoided.
> 
> As a test, I ran an RPC-style request/response workload with 4KB
> payloads and receive zerocopy enabled, with 100 simultaneous TCP
> connections. I measured perf cycles within the
> find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
> without per-vma locking enabled.
> 
> When using process-wide mmap semaphore read locking, about 1% of
> measured perf cycles were within this path. With per-VMA locking, this
> value dropped to about 0.45%.
> 
> Signed-off-by: Arjun Roy <arjunroy@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  net/ipv4/tcp.c | 39 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1542de3f66f7..7118ec6cf886 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2038,6 +2038,30 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk,
>  	}
>  }
>  
> +static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> +		unsigned long address, bool *mmap_locked)
> +{
> +	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> +
> +	if (vma) {
> +		if (vma->vm_ops != &tcp_vm_ops) {
> +			vma_end_read(vma);
> +			return NULL;
> +		}
> +		*mmap_locked = false;
> +		return vma;
> +	}
> +
> +	mmap_read_lock(mm);
> +	vma = vma_lookup(mm, address);
> +	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> +		mmap_read_unlock(mm);
> +		return NULL;
> +	}
> +	*mmap_locked = true;
> +	return vma;
> +}
> +
>  #define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
>  static int tcp_zerocopy_receive(struct sock *sk,
>  				struct tcp_zerocopy_receive *zc,
> @@ -2055,6 +2079,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  	u32 seq = tp->copied_seq;
>  	u32 total_bytes_to_map;
>  	int inq = tcp_inq(sk);
> +	bool mmap_locked;
>  	int ret;
>  
>  	zc->copybuf_len = 0;
> @@ -2079,13 +2104,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  		return 0;
>  	}
>  
> -	mmap_read_lock(current->mm);
> -
> -	vma = vma_lookup(current->mm, address);
> -	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> -		mmap_read_unlock(current->mm);
> +	vma = find_tcp_vma(current->mm, address, &mmap_locked);
> +	if (!vma)
>  		return -EINVAL;
> -	}
> +
>  	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
>  	avail_len = min_t(u32, vma_len, inq);
>  	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> @@ -2159,7 +2181,10 @@ static int tcp_zerocopy_receive(struct sock *sk,
>  						   zc, total_bytes_to_map);
>  	}
>  out:
> -	mmap_read_unlock(current->mm);
> +	if (mmap_locked)
> +		mmap_read_unlock(current->mm);
> +	else
> +		vma_end_read(vma);
>  	/* Try to copy straggler data. */
>  	if (!ret)
>  		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-14  3:34     ` Matthew Wilcox
@ 2023-07-24 14:49       ` Jann Horn
  2023-07-24 15:06         ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Jann Horn @ 2023-07-24 14:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, linux-mm, Arjun Roy, Eric Dumazet,
	linux-fsdevel, Punit Agrawal

On Fri, Jul 14, 2023 at 5:34 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Jul 13, 2023 at 08:02:12PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
> > <willy@infradead.org> wrote:
> > >
> > > This reverts commit 7a7f094635349a7d0314364ad50bdeb770b6df4f.
> >
> > nit: some explanation and SOB would be nice.
>
> Well, it can't actually be applied.  What needs to happen is for the
> networking people to drop the commit from their tree.  Some review from
> the networking people would also be helpful, to be sure I didn't break
> anything in reworking this patch to apply after my patches.

Are you saying you want them to revert it before it reaches mainline?
That commit landed in v6.5-rc1.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-24 14:49       ` Jann Horn
@ 2023-07-24 15:06         ` Matthew Wilcox
  2023-07-24 21:42           ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-24 15:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, linux-mm, Arjun Roy, Eric Dumazet,
	linux-fsdevel, Punit Agrawal, David S. Miller, Jakub Kicinski,
	Paolo Abeni, netdev

On Mon, Jul 24, 2023 at 04:49:16PM +0200, Jann Horn wrote:
> On Fri, Jul 14, 2023 at 5:34 AM Matthew Wilcox <willy@infradead.org> wrote:
> > On Thu, Jul 13, 2023 at 08:02:12PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Jul 11, 2023 at 1:21 PM Matthew Wilcox (Oracle)
> > > <willy@infradead.org> wrote:
> > > >
> > > > This reverts commit 7a7f094635349a7d0314364ad50bdeb770b6df4f.
> > >
> > > nit: some explanation and SOB would be nice.
> >
> > Well, it can't actually be applied.  What needs to happen is for the
> > networking people to drop the commit from their tree.  Some review from
> > the networking people would also be helpful, to be sure I didn't break
> > anything in reworking this patch to apply after my patches.
> 
> Are you saying you want them to revert it before it reaches mainline?
> That commit landed in v6.5-rc1.

... what?  It was posted on June 16th.  How does it end up in rc1 on
July 9th?  6.4 was June 25th.  9 days is long enough for something
that's not an urgent fix to land in rc1?  Networking doesn't close
development at rc5/6 like most subsystem trees?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
  2023-07-11 20:20 ` [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault() Matthew Wilcox (Oracle)
  2023-07-14  3:26   ` Suren Baghdasaryan
@ 2023-07-24 15:46   ` Jann Horn
  2023-07-24 17:45     ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Jann Horn @ 2023-07-24 15:46 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, Suren Baghdasaryan,
	linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 10:20 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
> Call do_pte_missing() under the VMA lock ... then immediately retry
> in do_fault().
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
[...]
> @@ -4961,6 +4961,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>         if (!vmf->pte)
>                 return do_pte_missing(vmf);
>
> +       if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
> +               vma_end_read(vmf->vma);
> +               return VM_FAULT_RETRY;
> +       }

At this point we can have vmf->pte mapped, right? Does this mean this
bailout leaks a kmap_local() on CONFIG_HIGHPTE?

>         if (!pte_present(vmf->orig_pte))
>                 return do_swap_page(vmf);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault()
  2023-07-11 20:20 ` [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault() Matthew Wilcox (Oracle)
  2023-07-14  3:17   ` Suren Baghdasaryan
@ 2023-07-24 15:46   ` Jann Horn
  2023-07-24 16:37     ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Jann Horn @ 2023-07-24 15:46 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, Arjun Roy, Eric Dumazet, Suren Baghdasaryan,
	linux-fsdevel, Punit Agrawal

On Tue, Jul 11, 2023 at 10:20 PM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
> Push the check down from __handle_mm_fault().  There's a mild upside to
> this patch in that we'll allocate the page tables while under the VMA
> lock rather than the mmap lock, reducing the hold time on the mmap lock,
> since the retry will find the page tables already populated.

This commit, by moving the check from __handle_mm_fault() to
handle_pte_fault(), also makes the non-anonymous THP paths (including
the DAX huge fault handling) reachable for VMA-locked faults, right?
Is that intentional?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault()
  2023-07-24 15:46   ` Jann Horn
@ 2023-07-24 16:37     ` Matthew Wilcox
  0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-24 16:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-mm, Arjun Roy, Eric Dumazet, Suren Baghdasaryan,
	linux-fsdevel, Punit Agrawal

On Mon, Jul 24, 2023 at 05:46:57PM +0200, Jann Horn wrote:
> On Tue, Jul 11, 2023 at 10:20 PM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> > Push the check down from __handle_mm_fault().  There's a mild upside to
> > this patch in that we'll allocate the page tables while under the VMA
> > lock rather than the mmap lock, reducing the hold time on the mmap lock,
> > since the retry will find the page tables already populated.
> 
> This commit, by moving the check from __handle_mm_fault() to
> handle_pte_fault(), also makes the non-anonymous THP paths (including
> the DAX huge fault handling) reachable for VMA-locked faults, right?
> Is that intentional?

Oof, this patch is all kinds of buggy.  Will split this into several
pieces.  Thanks!
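
Purely for illustration, the kind of fence Jann is asking about would mirror
the bailout already used elsewhere in the series, placed before dispatching a
non-anonymous huge fault.  Something like the following (a sketch of the idea,
not the hunk that was actually applied):

	/* Hypothetical guard: keep file-backed/DAX huge faults off the VMA-lock path. */
	if ((vmf.flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma)) {
		vma_end_read(vma);
		return VM_FAULT_RETRY;
	}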

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock
  2023-07-14  3:32   ` Suren Baghdasaryan
@ 2023-07-24 17:38     ` Matthew Wilcox
  0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-24 17:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, Arjun Roy, Eric Dumazet, linux-fsdevel, Punit Agrawal

On Thu, Jul 13, 2023 at 08:32:27PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 11, 2023 at 1:20 PM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> >
> > The map_pages fs method should be safe to run under the VMA lock instead
> > of the mmap lock.  This should have a measurable reduction in contention
> > on the mmap lock.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> I'll trust your claim that vmf->vma->vm_ops->map_pages() never relies on
> the mmap_lock. I think it makes sense, but I did not check every case :)

Fortunately, there's really only one implementation of ->map_pages()
and it's filemap_map_pages().  afs_vm_map_pages() is a thin wrapper
around it.
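
For reference, that single implementation is wired up through
generic_file_vm_ops in mm/filemap.c, roughly:

const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= filemap_page_mkwrite,
};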

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()
  2023-07-24 15:46   ` Jann Horn
@ 2023-07-24 17:45     ` Matthew Wilcox
  0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-07-24 17:45 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-mm, Arjun Roy, Eric Dumazet, Suren Baghdasaryan,
	linux-fsdevel, Punit Agrawal

On Mon, Jul 24, 2023 at 05:46:21PM +0200, Jann Horn wrote:
> > +       if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
> > +               vma_end_read(vmf->vma);
> > +               return VM_FAULT_RETRY;
> > +       }
> 
> At this point we can have vmf->pte mapped, right? Does this mean this
> bailout leaks a kmap_local() on CONFIG_HIGHPTE?

Yup.  Guess nobody's testing on 32-bit machines.  Thanks, fixed.
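
A minimal sketch of the kind of fix implied here (not necessarily the exact
hunk folded into the next version) is to drop the PTE mapping before
releasing the VMA lock:

	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vmf->vma)) {
		/* pte_unmap() undoes the kmap_local() taken for vmf->pte on CONFIG_HIGHPTE. */
		pte_unmap(vmf->pte);
		vmf->pte = NULL;
		vma_end_read(vmf->vma);
		return VM_FAULT_RETRY;
	}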

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy"
  2023-07-24 15:06         ` Matthew Wilcox
@ 2023-07-24 21:42           ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-07-24 21:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jann Horn, Suren Baghdasaryan, linux-mm, Arjun Roy, Eric Dumazet,
	linux-fsdevel, Punit Agrawal, David S. Miller, Paolo Abeni,
	netdev

On Mon, 24 Jul 2023 16:06:00 +0100 Matthew Wilcox wrote:
> > Are you saying you want them to revert it before it reaches mainline?
> > That commit landed in v6.5-rc1.  
> 
> ... what?  It was posted on June 16th.  How does it end up in rc1 on
> July 9th?  6.4 was June 25th.  9 days is long enough for something
> that's not an urgent fix to land in rc1?  Networking doesn't close
> development at rc5/6 like most subsystem trees?

We don't, and yeah, this one was a bit risky. We close for the merge
window (the two weeks), and we could definitely push back on risky changes
starting a week or two before the window... but we don't know how long
the release will last :( If we stop taking large changes at rc6 and the
release goes until rc8, that's 5 out of 11 weeks of the cycle when we
can't apply substantial patches. That's way too long. The weeks after
the merge window are already super stressful; if we shut down for longer
it'll only get worse. I'm typing all this because I was hoping we could
bring up making the release schedule even more predictable with Linus,
and I'm curious whether others feel the same way.

On the matter at hand - I thought the patches were just conflicting
with your upcoming work. Are they already broken in v6.5? No problem
with queuing the revert for v6.5 here if they are.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-07-24 21:42 UTC | newest]

Thread overview: 29+ messages
2023-07-11 20:20 [PATCH v2 0/9] Avoid the mmap lock for fault-around Matthew Wilcox (Oracle)
2023-07-11 20:20 ` [PATCH v2 1/9] Revert "tcp: Use per-vma locking for receive zerocopy" Matthew Wilcox (Oracle)
2023-07-14  3:02   ` Suren Baghdasaryan
2023-07-14  3:34     ` Matthew Wilcox
2023-07-24 14:49       ` Jann Horn
2023-07-24 15:06         ` Matthew Wilcox
2023-07-24 21:42           ` Jakub Kicinski
2023-07-11 20:20 ` [PATCH v2 2/9] mm: Allow per-VMA locks on file-backed VMAs Matthew Wilcox (Oracle)
2023-07-14  3:03   ` Suren Baghdasaryan
2023-07-11 20:20 ` [PATCH v2 3/9] mm: Move FAULT_FLAG_VMA_LOCK check from handle_mm_fault() Matthew Wilcox (Oracle)
2023-07-14  3:04   ` Suren Baghdasaryan
2023-07-11 20:20 ` [PATCH v2 4/9] mm: Move FAULT_FLAG_VMA_LOCK check into handle_pte_fault() Matthew Wilcox (Oracle)
2023-07-14  3:17   ` Suren Baghdasaryan
2023-07-24 15:46   ` Jann Horn
2023-07-24 16:37     ` Matthew Wilcox
2023-07-11 20:20 ` [PATCH v2 5/9] mm: Move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault() Matthew Wilcox (Oracle)
2023-07-14  3:26   ` Suren Baghdasaryan
2023-07-24 15:46   ` Jann Horn
2023-07-24 17:45     ` Matthew Wilcox
2023-07-11 20:20 ` [PATCH v2 6/9] mm: Move the FAULT_FLAG_VMA_LOCK check down from do_fault() Matthew Wilcox (Oracle)
2023-07-14  3:27   ` Suren Baghdasaryan
2023-07-11 20:20 ` [PATCH v2 7/9] mm: Run the fault-around code under the VMA lock Matthew Wilcox (Oracle)
2023-07-14  3:32   ` Suren Baghdasaryan
2023-07-24 17:38     ` Matthew Wilcox
2023-07-11 20:20 ` [PATCH v2 8/9] mm: Remove CONFIG_PER_VMA_LOCK ifdefs Matthew Wilcox (Oracle)
2023-07-14  3:34   ` Suren Baghdasaryan
2023-07-11 20:20 ` [PATCH v2 9/9] tcp: Use per-vma locking for receive zerocopy Matthew Wilcox (Oracle)
2023-07-14  3:40   ` Suren Baghdasaryan
2023-07-21 18:48   ` Matthew Wilcox
