linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH RESEND 00/28] per-VMA locks proposal
@ 2022-09-01 17:34 Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 01/28] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
                   ` (30 more replies)
  0 siblings, 31 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Resending to fix the issue with the In-Reply-To tag in the original
submission at [4].

This is a proof of concept for the per-VMA locks idea that was discussed
during the SPF [1] discussion at LSF/MM this year [2], which concluded with
the suggestion that “a reader/writer semaphore could be put into the VMA
itself; that would have the effect of using the VMA as a sort of range
lock. There would still be contention at the VMA level, but it would be an
improvement.” This patchset implements the suggested approach.

When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
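
Schematically, the arch fault handler fast path then looks roughly like
the fragment below. This is illustrative only: find_and_lock_anon_vma(),
vma_read_unlock() and FAULT_FLAG_VMA_LOCK are introduced by later patches
in this series, and the real (and more careful) versions live in the
x86/arm64/powerpc patches below.

        struct vm_area_struct *vma;
        vm_fault_t fault;

        /* lockless attempt: RCU lookup + per-VMA read trylock */
        vma = find_and_lock_anon_vma(mm, address);
        if (!vma)
                goto lock_mmap;         /* locked, missing or unsupported VMA */

        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
        vma_read_unlock(vma);
        if (!(fault & VM_FAULT_RETRY))
                goto done;              /* handled without taking mmap_lock */

lock_mmap:
        /* fall back to the usual mmap_lock-protected fault handling */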

One notable way the implementation deviates from the proposal is the way
VMAs are marked as locked. During some mm updates, multiple VMAs need to
be locked until the end of the update (e.g. vma_merge, split_vma, etc.),
and tracking all the locked VMAs, avoiding recursive locks and handling
other complications would make the code more complex. Therefore we
provide a way to "mark" VMAs as locked and then unmark all locked VMAs
all at once. This is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. A VMA is considered locked when
these sequence numbers are equal. To mark a VMA as locked we set the
sequence number in vm_area_struct to be equal to the sequence number in
mm_struct. To unlock all VMAs we increment mm_struct's sequence number.
This provides an efficient way to track locked VMAs and to drop the locks
on all VMAs at the end of the update.
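
In code, the scheme boils down to roughly the condensed sketch below.
The helper names match patch 05/28 later in this series, except
vma_is_marked_locked() which is made up here purely for illustration;
the real vma_mark_locked() also takes the per-VMA rw_semaphore, and the
real helpers use READ_ONCE()/WRITE_ONCE(), all omitted for brevity.

        /* writer side, called with mmap_lock held for write */
        static inline void vma_mark_locked(struct vm_area_struct *vma)
        {
                vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
        }

        /* called when the writer drops (or downgrades) mmap_lock */
        static inline void vma_mark_unlocked_all(struct mm_struct *mm)
        {
                mm->mm_lock_seq++;      /* no VMA matches the new value anymore */
        }

        /* a VMA counts as locked while the sequence numbers match */
        static inline bool vma_is_marked_locked(struct vm_area_struct *vma)
        {
                return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
        }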

The patchset implements per-VMA locking only for anonymous pages which
are not in swap. If the initial proposal is considered acceptable, then
support for swapped and file-backed page faults will be added.

Performance benchmarks show similar, although slightly smaller, benefits
than with the SPF patchset (~75% of the SPF benefits). Still, given its
lower complexity, this approach might be more desirable.

The patchset applies cleanly over 6.0-rc3.
The tree for testing is posted at [3].

[1] https://lore.kernel.org/all/20220128131006.67712-1-michel@lespinasse.org/
[2] https://lwn.net/Articles/893906/
[3] https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock_rfc
[4] https://lore.kernel.org/all/20220829212531.3184856-1-surenb@google.com/

Laurent Dufour (2):
  powerpc/mm: try VMA lock-based page fault handling first
  powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK

Michel Lespinasse (1):
  mm: rcu safe VMA freeing

Suren Baghdasaryan (25):
  mm: introduce CONFIG_PER_VMA_LOCK
  mm: introduce __find_vma to be used without mmap_lock protection
  mm: move mmap_lock assert function definitions
  mm: add per-VMA lock and helper functions to control it
  mm: mark VMA as locked whenever vma->vm_flags are modified
  kernel/fork: mark VMAs as locked before copying pages during fork
  mm/khugepaged: mark VMA as locked while collapsing a hugepage
  mm/mempolicy: mark VMA as locked when changing protection policy
  mm/mmap: mark VMAs as locked in vma_adjust
  mm/mmap: mark VMAs as locked before merging or splitting them
  mm/mremap: mark VMA as locked while remapping it to a new address
    range
  mm: conditionally mark VMA as locked in free_pgtables and
    unmap_page_range
  mm: mark VMAs as locked before isolating them
  mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped
    area
  kernel/fork: assert no VMA readers during its destruction
  mm/mmap: prevent pagefault handler from racing with mmu_notifier
    registration
  mm: add FAULT_FLAG_VMA_LOCK flag
  mm: disallow do_swap_page to handle page faults under VMA lock
  mm: introduce per-VMA lock statistics
  mm: introduce find_and_lock_anon_vma to be used from arch-specific
    code
  x86/mm: try VMA lock-based page fault handling first
  x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  arm64/mm: try VMA lock-based page fault handling first
  arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  kernel/fork: throttle call_rcu() calls in vm_area_free

 arch/arm64/Kconfig                     |   1 +
 arch/arm64/mm/fault.c                  |  36 +++++++++
 arch/powerpc/mm/fault.c                |  41 ++++++++++
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 arch/powerpc/platforms/pseries/Kconfig |   1 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/mm/fault.c                    |  36 +++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c  |   4 +-
 fs/proc/task_mmu.c                     |   1 +
 fs/userfaultfd.c                       |   6 ++
 include/linux/mm.h                     | 104 ++++++++++++++++++++++++-
 include/linux/mm_types.h               |  33 ++++++--
 include/linux/mmap_lock.h              |  37 ++++++---
 include/linux/vm_event_item.h          |   6 ++
 include/linux/vmstat.h                 |   6 ++
 kernel/fork.c                          |  75 +++++++++++++++++-
 mm/Kconfig                             |  13 ++++
 mm/Kconfig.debug                       |   8 ++
 mm/init-mm.c                           |   6 ++
 mm/internal.h                          |   4 +-
 mm/khugepaged.c                        |   1 +
 mm/madvise.c                           |   1 +
 mm/memory.c                            |  82 ++++++++++++++++---
 mm/mempolicy.c                         |   6 +-
 mm/mlock.c                             |   2 +
 mm/mmap.c                              |  60 ++++++++++----
 mm/mprotect.c                          |   1 +
 mm/mremap.c                            |   1 +
 mm/nommu.c                             |   2 +
 mm/oom_kill.c                          |   3 +-
 mm/vmstat.c                            |   6 ++
 31 files changed, 531 insertions(+), 54 deletions(-)

-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 01/28] mm: introduce CONFIG_PER_VMA_LOCK
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 02/28] mm: rcu safe VMA freeing Suren Baghdasaryan
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled by default on supported architectures with SMP and MMU
set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code, which needs modifications to
handle faults under a VMA lock.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..58c20fad9cf9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1124,6 +1124,19 @@ config PTE_MARKER_UFFD_WP
 	  purposes.  It is required to enable userfaultfd write protection on
 	  file-backed memory types like shmem and hugetlbfs.
 
+config ARCH_SUPPORTS_PER_VMA_LOCK
+       def_bool n
+
+config PER_VMA_LOCK
+	bool "Per-vma locking support"
+	default y
+	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+	help
+	  Allow per-vma locking during page fault handling.
+
+	  This feature allows locking each virtual memory area separately when
+	  handling page faults instead of taking mmap_lock.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 02/28] mm: rcu safe VMA freeing
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 01/28] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection Suren Baghdasaryan
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

From: Michel Lespinasse <michel@lespinasse.org>

This prepares for page fault handling under the VMA lock by looking up
VMAs under the protection of an RCU read lock instead of the usual mmap
read lock.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_types.h | 16 +++++++++++-----
 kernel/fork.c            | 13 +++++++++++++
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf97f3884fda..bed25ef7c994 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,12 +403,18 @@ struct anon_vma_name {
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
-	unsigned long vm_start;		/* Our start address within vm_mm. */
-	unsigned long vm_end;		/* The first byte after our end address
-					   within vm_mm. */
+	union {
+		struct {
+			/* VMA covers [vm_start; vm_end) addresses within mm */
+			unsigned long vm_start, vm_end;
 
-	/* linked list of VM areas per task, sorted by address */
-	struct vm_area_struct *vm_next, *vm_prev;
+			/* linked list of VMAs per task, sorted by address */
+			struct vm_area_struct *vm_next, *vm_prev;
+		};
+#ifdef CONFIG_PER_VMA_LOCK
+		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
+#endif
+	};
 
 	struct rb_node vm_rb;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 90c85b17bf69..614872438393 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,10 +481,23 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	return new;
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static void __vm_area_free(struct rcu_head *head)
+{
+	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+						  vm_rcu);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+#endif
+
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
+#ifdef CONFIG_PER_VMA_LOCK
+	call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
 	kmem_cache_free(vm_area_cachep, vma);
+#endif
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 01/28] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
  2022-09-01 17:34 ` [RFC PATCH RESEND 02/28] mm: rcu safe VMA freeing Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-01 20:22   ` Kent Overstreet
  2022-09-01 17:34 ` [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions Suren Baghdasaryan
                   ` (27 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Add the __find_vma() function to be used for VMA lookups under RCU
protection.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++--
 include/linux/mm.h                    | 9 ++++++++-
 mm/mmap.c                             | 6 ++----
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 32e92651ef7c..fc94985c95c8 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -507,7 +507,7 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 }
 
 static struct i915_vma_coredump *
-__find_vma(struct i915_vma_coredump *vma, const char *name)
+__i915_find_vma(struct i915_vma_coredump *vma, const char *name)
 {
 	while (vma) {
 		if (strcmp(vma->name, name) == 0)
@@ -521,7 +521,7 @@ __find_vma(struct i915_vma_coredump *vma, const char *name)
 struct i915_vma_coredump *
 intel_gpu_error_find_batch(const struct intel_engine_coredump *ee)
 {
-	return __find_vma(ee->vma, "batch");
+	return __i915_find_vma(ee->vma, "batch");
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..7d322a979455 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2712,7 +2712,14 @@ extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
+extern struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr);
+static inline
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	mmap_assert_locked(mm);
+	return __find_vma(mm, addr);
+}
+
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3..693e6776be39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2250,12 +2250,11 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	mmap_assert_locked(mm);
 	/* Check the cache first. */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))
@@ -2281,8 +2280,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
-EXPORT_SYMBOL(find_vma);
+EXPORT_SYMBOL(__find_vma);
 
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-01 20:24   ` Kent Overstreet
  2022-09-01 17:34 ` [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
                   ` (26 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mmap_lock.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
 
 #endif /* CONFIG_TRACING */
 
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+	lockdep_assert_held(&mm->mmap_lock);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+	lockdep_assert_held_write(&mm->mmap_lock);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
 	up_read_non_owner(&mm->mmap_lock);
 }
 
-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
-	lockdep_assert_held(&mm->mmap_lock);
-	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
-	lockdep_assert_held_write(&mm->mmap_lock);
-	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
 static inline int mmap_lock_is_contended(struct mm_struct *mm)
 {
 	return rwsem_is_contended(&mm->mmap_lock);
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 13:46   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified Suren Baghdasaryan
                   ` (25 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Introduce a per-VMA rw_semaphore to be used during page fault handling
instead of mmap_lock. Because there are cases when multiple VMAs need
to be exclusively locked during VMA tree modifications, instead of the
usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA
lock exclusively and setting vma->vm_lock_seq to the current
mm->mm_lock_seq. When the mmap_write_lock holder is done with all
modifications and drops mmap_lock, it will increment mm->mm_lock_seq,
effectively unlocking all VMAs marked as locked.
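
A minimal sketch of how a reader (the page fault path added later in
this series) is expected to use these helpers, assuming the VMA is found
via __find_vma() from the earlier patch while under rcu_read_lock():

	rcu_read_lock();
	vma = __find_vma(mm, address);
	if (vma && !vma_read_trylock(vma))
		vma = NULL;		/* marked locked by a writer */
	rcu_read_unlock();

	if (!vma)
		return NULL;		/* caller falls back to mmap_lock */

	/* ... handle the fault under the per-VMA read lock ... */
	vma_read_unlock(vma);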

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h        | 78 +++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h  |  7 ++++
 include/linux/mmap_lock.h | 13 +++++++
 kernel/fork.c             |  4 ++
 mm/init-mm.c              |  3 ++
 5 files changed, 105 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d322a979455..476bf936c5f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -611,6 +611,83 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->lock);
+	vma->vm_lock_seq = -1;
+}
+
+static inline void vma_mark_locked(struct vm_area_struct *vma)
+{
+	int mm_lock_seq;
+
+	mmap_assert_write_locked(vma->vm_mm);
+
+	/*
+	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+	 * mm->mm_lock_seq can't be concurrently modified.
+	 */
+	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+	if (vma->vm_lock_seq == mm_lock_seq)
+		return;
+
+	down_write(&vma->lock);
+	vma->vm_lock_seq = mm_lock_seq;
+	up_write(&vma->lock);
+}
+
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+{
+	if (unlikely(down_read_trylock(&vma->lock) == 0))
+		return false;
+
+	/*
+	 * Overflow might produce false locked result but it's not critical.
+	 * False unlocked result is critical but is impossible because we
+	 * modify and check vma->vm_lock_seq under vma->lock protection and
+	 * mm->mm_lock_seq modification invalidates all existing locks.
+	 */
+	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
+		up_read(&vma->lock);
+		return false;
+	}
+	return true;
+}
+
+static inline void vma_read_unlock(struct vm_area_struct *vma)
+{
+	up_read(&vma->lock);
+}
+
+static inline void vma_assert_locked(struct vm_area_struct *vma)
+{
+	lockdep_assert_held(&vma->lock);
+	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->lock), vma);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
+{
+	mmap_assert_write_locked(vma->vm_mm);
+	/*
+	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+	 * mm->mm_lock_seq can't be concurrently modified.
+	 */
+	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline void vma_mark_locked(struct vm_area_struct *vma) {}
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+		{ return false; }
+static inline void vma_read_unlock(struct vm_area_struct *vma) {}
+static inline void vma_assert_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -619,6 +696,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_init_lock(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bed25ef7c994..6a03f59c1e78 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -486,6 +486,10 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	struct rw_semaphore lock;
+	int vm_lock_seq;
+#endif
 } __randomize_layout;
 
 struct kioctx_table;
@@ -567,6 +571,9 @@ struct mm_struct {
 					  * init_mm.mmlist, and are protected
 					  * by mmlist_lock
 					  */
+#ifdef CONFIG_PER_VMA_LOCK
+		int mm_lock_seq;
+#endif
 
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e49ba91bb1f0..a391ae226564 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_mark_unlocked_all(struct mm_struct *mm)
+{
+	mmap_assert_write_locked(mm);
+	/* No races during update due to exclusive mmap_lock being held */
+	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+}
+#else
+static inline void vma_mark_unlocked_all(struct mm_struct *mm) {}
+#endif
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
+	vma_mark_unlocked_all(mm);
 	up_write(&mm->mmap_lock);
 }
 
 static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
 	__mmap_lock_trace_acquire_returned(mm, false, true);
+	vma_mark_unlocked_all(mm);
 	downgrade_write(&mm->mmap_lock);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 614872438393..bfab31ecd11e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -475,6 +475,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_init_lock(new);
 		new->vm_next = new->vm_prev = NULL;
 		dup_anon_vma_name(orig, new);
 	}
@@ -1130,6 +1131,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+	WRITE_ONCE(mm->mm_lock_seq, 0);
+#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index fbe7844d0912..8399f90d631c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -37,6 +37,9 @@ struct mm_struct init_mm = {
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_PER_VMA_LOCK
+	.mm_lock_seq	= 0,
+#endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
 #ifdef CONFIG_IOMMU_SVA
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 14:26   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork Suren Baghdasaryan
                   ` (24 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

VMA flag modifications should be done under VMA lock to prevent concurrent
page fault handling in that area.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/task_mmu.c | 1 +
 fs/userfaultfd.c   | 6 ++++++
 mm/madvise.c       | 1 +
 mm/mlock.c         | 2 ++
 mm/mmap.c          | 1 +
 mm/mprotect.c      | 1 +
 6 files changed, 12 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4e0023643f8b..ceffa5c2c650 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1285,6 +1285,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
+				vma_mark_locked(vma);
 				vma->vm_flags &= ~VM_SOFTDIRTY;
 				vma_set_page_prot(vma);
 			}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 175de70e3adf..fe557b3d1c07 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -620,6 +620,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		mmap_write_lock(mm);
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
+				vma_mark_locked(vma);
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 				vma->vm_flags &= ~__VM_UFFD_FLAGS;
 			}
@@ -653,6 +654,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vma_mark_locked(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		vma->vm_flags &= ~__VM_UFFD_FLAGS;
 		return 0;
@@ -734,6 +736,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 		atomic_inc(&ctx->mmap_changing);
 	} else {
 		/* Drop uffd context if remap feature not enabled */
+		vma_mark_locked(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		vma->vm_flags &= ~__VM_UFFD_FLAGS;
 	}
@@ -891,6 +894,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 			vma = prev;
 		else
 			prev = vma;
+		vma_mark_locked(vma);
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 	}
@@ -1449,6 +1453,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
+		vma_mark_locked(vma);
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx.ctx = ctx;
 
@@ -1630,6 +1635,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
+		vma_mark_locked(vma);
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 5f0f0948a50e..a173f0025abd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -181,6 +181,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 */
+	vma_mark_locked(vma);
 	vma->vm_flags = new_flags;
 	if (!vma->vm_file) {
 		error = replace_anon_vma_name(vma, anon_name);
diff --git a/mm/mlock.c b/mm/mlock.c
index b14e929084cc..f62e1a4d05f2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -380,6 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 	 */
 	if (newflags & VM_LOCKED)
 		newflags |= VM_IO;
+	vma_mark_locked(vma);
 	WRITE_ONCE(vma->vm_flags, newflags);
 
 	lru_add_drain();
@@ -456,6 +457,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
 		/* No work to do, and mlocking twice would be wrong */
+		vma_mark_locked(vma);
 		vma->vm_flags = newflags;
 	} else {
 		mlock_vma_pages_range(vma, start, end, newflags);
diff --git a/mm/mmap.c b/mm/mmap.c
index 693e6776be39..f89c9b058105 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1818,6 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	vma_mark_locked(vma);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bc6bddd156ca..df47fc21b0e4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -621,6 +621,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * vm_flags and vm_page_prot are protected by the mmap_lock
 	 * held in write mode.
 	 */
+	vma_mark_locked(vma);
 	vma->vm_flags = newflags;
 	/*
 	 * We want to check manually if we can change individual PTEs writable
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 14:37   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage Suren Baghdasaryan
                   ` (23 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Protect VMAs from the concurrent page fault handler while performing
copy_page_range for VMAs that do not have the VM_WIPEONFORK flag set.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/fork.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index bfab31ecd11e..1872ad549fed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -709,8 +709,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+			vma_mark_locked(mpnt);
 			retval = copy_page_range(tmp, mpnt);
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 14:43   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy Suren Baghdasaryan
                   ` (22 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Protect VMA from concurrent page fault handler while modifying it in
collapse_huge_page.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/khugepaged.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 01f71786d530..030680633989 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1072,6 +1072,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (mm_find_pmd(mm, address) != pmd)
 		goto out_up_write;
 
+	vma_mark_locked(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (7 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 14:47   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust Suren Baghdasaryan
                   ` (21 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Protect VMA from concurrent page fault handler while performing VMA
protection policy changes.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mempolicy.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b73d3248d976..6be1e5c75556 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -383,8 +383,10 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	mmap_write_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		vma_mark_locked(vma);
 		mpol_rebind_policy(vma->vm_policy, new);
+	}
 	mmap_write_unlock(mm);
 }
 
@@ -632,6 +634,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 	struct mmu_gather tlb;
 	int nr_updated;
 
+	vma_mark_locked(vma);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 
 	nr_updated = change_protection(&tlb, vma, addr, end, PAGE_NONE,
@@ -765,6 +768,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	vma_mark_locked(vma);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (8 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 15:35   ` Laurent Dufour
  2022-09-01 17:34 ` [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them Suren Baghdasaryan
                   ` (20 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

vma_adjust modifies a VMA and possibly its neighbors. Mark them as locked
before making the modifications.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f89c9b058105..ed58cf0689b2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -710,6 +710,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	vma_mark_locked(vma);
+	if (next)
+		vma_mark_locked(next);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -754,8 +758,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * If next doesn't have anon_vma, import from vma after
 			 * next, if the vma overlaps with it.
 			 */
-			if (remove_next == 2 && !next->anon_vma)
+			if (remove_next == 2 && !next->anon_vma) {
 				exporter = next->vm_next;
+				if (exporter)
+					vma_mark_locked(exporter);
+			}
 
 		} else if (end > next->vm_start) {
 			/*
@@ -931,6 +938,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
+			if (next)
+				vma_mark_locked(next);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (9 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust Suren Baghdasaryan
@ 2022-09-01 17:34 ` Suren Baghdasaryan
  2022-09-06 15:44   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range Suren Baghdasaryan
                   ` (19 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:34 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Decisions about whether VMAs can be merged or split must be made while
the VMAs are protected from changes which can affect that decision.
For example, vma_merge uses vma->anon_vma when deciding whether the VMA
can be merged, while the page fault handler changes vma->anon_vma during
a COW operation.
Mark all VMAs which might be affected by a merge or split operation as
locked before deciding how such operations should be performed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ed58cf0689b2..ade3909c89b4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1147,10 +1147,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
+	if (prev)
+		vma_mark_locked(prev);
 	next = vma_next(mm, prev);
 	area = next;
-	if (area && area->vm_end == end)		/* cases 6, 7, 8 */
+	if (area)
+		vma_mark_locked(area);
+	if (area && area->vm_end == end) {		/* cases 6, 7, 8 */
 		next = next->vm_next;
+		if (next)
+			vma_mark_locked(next);
+	}
 
 	/* verify some invariant that must be enforced by the caller */
 	VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2687,6 +2694,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct vm_area_struct *new;
 	int err;
 
+	vma_mark_locked(vma);
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		err = vma->vm_ops->may_split(vma, addr);
 		if (err)
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (10 preceding siblings ...)
  2022-09-01 17:34 ` [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-06 16:09   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range Suren Baghdasaryan
                   ` (18 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Mark VMA as locked before copying it and when copy_vma produces a new VMA.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c   | 1 +
 mm/mremap.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index ade3909c89b4..121544fd90de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3248,6 +3248,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		vma_mark_locked(new_vma);
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
diff --git a/mm/mremap.c b/mm/mremap.c
index b522cd0259a0..bdbf96254e43 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -620,6 +620,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 			return -ENOMEM;
 	}
 
+	vma_mark_locked(vma);
 	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
 	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
 			   &need_rmap_locks);
-- 
2.37.2.789.g6183377224-goog


* [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (11 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 10:33   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them Suren Baghdasaryan
                   ` (17 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

The free_pgtables and unmap_page_range functions can be called with
mmap_lock held for write (e.g. in mmap_region), held for read (e.g. in
madvise_pageout) or not held at all (e.g. madvise_remove might drop
mmap_lock before calling vfs_fallocate, which ends up calling
unmap_page_range).
Provide free_pgtables and unmap_page_range with an additional argument
indicating whether to mark the VMA as locked or not, based on the usage.
The parameter is set based on whether mmap_lock is held in write mode
during the call. This ensures no change in behavior between mmap_lock
and per-VMA locks.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h |  2 +-
 mm/internal.h      |  4 ++--
 mm/memory.c        | 32 +++++++++++++++++++++-----------
 mm/mmap.c          | 17 +++++++++--------
 mm/oom_kill.c      |  3 ++-
 5 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 476bf936c5f0..dc72be923e5b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1874,7 +1874,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
 void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
-		unsigned long start, unsigned long end);
+		unsigned long start, unsigned long end, bool lock_vma);
 
 struct mmu_notifier_range;
 
diff --git a/mm/internal.h b/mm/internal.h
index 785409805ed7..e6c0f999e0cb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -85,14 +85,14 @@ bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+		unsigned long floor, unsigned long ceiling, bool lock_vma);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details);
+			     struct zap_details *details, bool lock_vma);
 
 void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
 		unsigned int order);
diff --git a/mm/memory.c b/mm/memory.c
index 4ba73f5aa8bb..9ac9944e8c62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -403,7 +403,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 }
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+		unsigned long floor, unsigned long ceiling, bool lock_vma)
 {
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
@@ -413,6 +413,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 * Hide vma from rmap and truncate_pagecache before freeing
 		 * pgtables
 		 */
+		if (lock_vma)
+			vma_mark_locked(vma);
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
@@ -427,6 +429,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
+				if (lock_vma)
+					vma_mark_locked(vma);
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
@@ -1631,12 +1635,16 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details)
+			     struct zap_details *details,
+			     bool lock_vma)
 {
 	pgd_t *pgd;
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	if (lock_vma)
+		vma_mark_locked(vma);
+
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1652,7 +1660,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr,
-		struct zap_details *details)
+		struct zap_details *details, bool lock_vma)
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
@@ -1691,7 +1699,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
-			unmap_page_range(tlb, vma, start, end, details);
+			unmap_page_range(tlb, vma, start, end, details, lock_vma);
 	}
 }
 
@@ -1715,7 +1723,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  */
 void unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr)
+		unsigned long end_addr, bool lock_vma)
 {
 	struct mmu_notifier_range range;
 	struct zap_details details = {
@@ -1728,7 +1736,8 @@ void unmap_vmas(struct mmu_gather *tlb,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+				 lock_vma);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -1753,7 +1762,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, NULL, false);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
@@ -1768,7 +1777,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
  * The range must fit into one VMA.
  */
 static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
-		unsigned long size, struct zap_details *details)
+		unsigned long size, struct zap_details *details, bool lock_vma)
 {
 	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
@@ -1779,7 +1788,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
-	unmap_single_vma(&tlb, vma, address, range.end, details);
+	unmap_single_vma(&tlb, vma, address, range.end, details, lock_vma);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
@@ -1802,7 +1811,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 	    		!(vma->vm_flags & VM_PFNMAP))
 		return;
 
-	zap_page_range_single(vma, address, size, NULL);
+	zap_page_range_single(vma, address, size, NULL, true);
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
@@ -3483,7 +3492,8 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr,
 		struct zap_details *details)
 {
-	zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
+	zap_page_range_single(vma, start_addr, end_addr - start_addr, details,
+			      false);
 }
 
 static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
diff --git a/mm/mmap.c b/mm/mmap.c
index 121544fd90de..094678b4434b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -79,7 +79,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end);
+		unsigned long start, unsigned long end, bool lock_vma);
 
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
@@ -1866,7 +1866,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_file = NULL;
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end, true);
 	if (vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
 free_vma:
@@ -2626,7 +2626,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
  */
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end)
+		unsigned long start, unsigned long end, bool lock_vma)
 {
 	struct vm_area_struct *next = vma_next(mm, prev);
 	struct mmu_gather tlb;
@@ -2634,9 +2634,10 @@ static void unmap_region(struct mm_struct *mm,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end);
+	unmap_vmas(&tlb, vma, start, end, lock_vma);
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-				 next ? next->vm_start : USER_PGTABLES_CEILING);
+				 next ? next->vm_start : USER_PGTABLES_CEILING,
+				 lock_vma);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -2849,7 +2850,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	if (downgrade)
 		mmap_write_downgrade(mm);
 
-	unmap_region(mm, vma, prev, start, end);
+	unmap_region(mm, vma, prev, start, end, !downgrade);
 
 	/* Fix up all other VM information */
 	remove_vma_list(mm, vma);
@@ -3129,8 +3130,8 @@ void exit_mmap(struct mm_struct *mm)
 	tlb_gather_mmu_fullmm(&tlb, mm);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, vma, 0, -1);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	unmap_vmas(&tlb, vma, 0, -1, true);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING, true);
 	tlb_finish_mmu(&tlb);
 
 	/* Walk the list again, actually closing and freeing it. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c6cf9e3cd66..6ffa7c511aa3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 				ret = false;
 				continue;
 			}
-			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
+			unmap_page_range(&tlb, vma, range.start, range.end,
+					 NULL, false);
 			mmu_notifier_invalidate_range_end(&range);
 			tlb_finish_mmu(&tlb);
 		}
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (12 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 13:35   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area Suren Baghdasaryan
                   ` (16 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Mark VMAs as locked before isolating them and clear their rbtree node so
that isolated VMAs are easily identifiable. In later patches, page fault
handlers will try to lock the VMA they find and will check whether it has
been isolated. Locking VMAs before isolating them ensures that page fault
handlers do not operate on isolated VMAs.
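
For context, the isolation check that the later fault-handling patches rely
on looks roughly like this (simplified sketch based on find_vma_under_rcu()
added in patch 21 of this series; not part of this patch):

	/* fault path, after vma_read_trylock(vma) succeeded */
	if (RB_EMPTY_NODE(&vma->vm_rb)) {
		/* the VMA was isolated after we found it */
		vma_read_unlock(vma);
		return NULL;
	}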

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c  | 2 ++
 mm/nommu.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 094678b4434b..b0d78bdc0de0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -421,12 +421,14 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
 
 static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
 {
+	vma_mark_locked(vma);
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
diff --git a/mm/nommu.c b/mm/nommu.c
index e819cbc21b39..ff9933e57501 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -622,6 +622,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 	struct mm_struct *mm = vma->vm_mm;
 	struct task_struct *curr = current;
 
+	vma_mark_locked(vma);
 	mm->map_count--;
 	for (i = 0; i < VMACACHE_SIZE; i++) {
 		/* if the vma is cached, invalidate the entire cache */
@@ -644,6 +645,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 
 	/* remove from the MM's tree and list */
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
+	RB_CLEAR_NODE(&vma->vm_rb);
 
 	__vma_unlink_list(mm, vma);
 }
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (13 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 13:43   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

While unmapping VMAs, adjacent VMAs might be able to grow into the area
being unmapped. In such cases mark adjacent VMAs as locked to prevent
this growth.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index b0d78bdc0de0..b31cc97c2803 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2680,10 +2680,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * VM_GROWSUP VMA. Such VMAs can change their size under
 	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
 	 */
-	if (vma && (vma->vm_flags & VM_GROWSDOWN))
+	if (vma && (vma->vm_flags & VM_GROWSDOWN)) {
+		vma_mark_locked(vma);
 		return false;
-	if (prev && (prev->vm_flags & VM_GROWSUP))
+	}
+	if (prev && (prev->vm_flags & VM_GROWSUP)) {
+		vma_mark_locked(prev);
 		return false;
+	}
 	return true;
 }
 
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (14 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 13:56   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Assert that there are no holders of the VMA lock for reading when the VMA is
about to be destroyed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 8 ++++++++
 kernel/fork.c      | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc72be923e5b..0d9c1563c354 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -676,6 +676,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
 	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
 }
 
+static inline void vma_assert_no_reader(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+		      vma);
+}
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -685,6 +692,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
 static inline void vma_read_unlock(struct vm_area_struct *vma) {}
 static inline void vma_assert_locked(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1872ad549fed..b443ba3a247a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -487,6 +487,8 @@ static void __vm_area_free(struct rcu_head *head)
 {
 	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
 						  vm_rcu);
+	/* The vma should either have no lock holders or be write-locked. */
+	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 #endif
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (15 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 14:20   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to mark all VMAs as
locked, so that notifier registration cannot race with page fault handlers
that hold only VMA locks.
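
A rough sketch of why marking is sufficient, assuming the vma_mark_locked()
and vma_read_trylock() helpers added earlier in the series (the "fallback"
label below is purely illustrative):

	/* mm_take_all_locks(), called with mmap_lock held for write */
	vma_mark_locked(vma);	/* write-locks vma->lock, so it waits for any
				 * fault handler still holding the read lock */

	/* page fault path */
	if (!vma_read_trylock(vma))
		goto fallback;	/* VMA is marked locked; use mmap_lock instead */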

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index b31cc97c2803..1edfcd384f5e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3538,6 +3538,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
  *     hugetlb mapping);
  *   - all i_mmap_rwsem locks;
  *   - all anon_vma->rwseml
+ *   - all vmas marked locked
  *
  * We can take all locks within these types randomly because the VM code
  * doesn't nest them and we protected from parallel mm_take_all_locks() by
@@ -3579,6 +3580,7 @@ int mm_take_all_locks(struct mm_struct *mm)
 		if (vma->anon_vma)
 			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
 				vm_lock_anon_vma(mm, avc->anon_vma);
+		vma_mark_locked(vma);
 	}
 
 	return 0;
@@ -3636,6 +3638,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
 	mmap_assert_write_locked(mm);
 	BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
 
+	vma_mark_unlocked_all(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma)
 			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (16 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 14:26   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock Suren Baghdasaryan
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Add a new flag to distinguish page faults handled under the protection of a
per-VMA lock.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 3 ++-
 include/linux/mm_types.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d9c1563c354..7c3190eaabd7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -466,7 +466,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_VMA_LOCK,		"VMA_LOCK" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a03f59c1e78..36562e702baf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -886,6 +886,7 @@ enum fault_flag {
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
+	FAULT_FLAG_VMA_LOCK =		1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (17 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-06 19:39   ` Peter Xu
  2022-09-09 14:26   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics Suren Baghdasaryan
                   ` (11 subsequent siblings)
  30 siblings, 2 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Because do_swap_page can drop mmap_lock, abort fault handling under the VMA
lock and retry while holding mmap_lock. This can be handled more gracefully
in the future.
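
On the arch side this shows up as the VM_FAULT_RETRY fallback; the x86
conversion later in this series handles it roughly as:

	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);
	/* ... fall through to the existing mmap_lock-based handling ... */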

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/memory.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9ac9944e8c62..29d2f49f922a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		ret = VM_FAULT_RETRY;
+		goto out;
+	}
+
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (18 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 14:28   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code Suren Baghdasaryan
                   ` (10 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Add a new CONFIG_PER_VMA_LOCK_STATS config option to expose extra
statistics about handling page faults under VMA locks.
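
Since these are regular vm_event counters wired into vmstat_text, they
should show up in /proc/vmstat once the option is enabled; illustrative
output (counter values made up):

	$ grep vma_lock /proc/vmstat
	vma_lock_success 0
	vma_lock_abort 0
	vma_lock_retry 0
	vma_lock_miss 0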

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/vm_event_item.h | 6 ++++++
 include/linux/vmstat.h        | 6 ++++++
 mm/Kconfig.debug              | 8 ++++++++
 mm/vmstat.c                   | 6 ++++++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f3fc36cd2276..a325783ed05d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -150,6 +150,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+		VMA_LOCK_SUCCESS,
+		VMA_LOCK_ABORT,
+		VMA_LOCK_RETRY,
+		VMA_LOCK_MISS,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bfe38869498d..0c2611899cfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -131,6 +131,12 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_vmacache_event(x) do {} while (0)
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
 #define __count_zid_vm_events(item, zid, delta) \
 	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index ce8dded36de9..075642763a03 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
 	  kernel.
 
 	  If in doubt, say N.
+
+
+config PER_VMA_LOCK_STATS
+	bool "Statistics for per-vma locks"
+	depends on PER_VMA_LOCK
+	help
+	  Statistics for per-vma locks.
+	  If in doubt, say N.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90af9a8572f5..3f3804c846a6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1411,6 +1411,12 @@ const char * const vmstat_text[] = {
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
 #endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+	"vma_lock_success",
+	"vma_lock_abort",
+	"vma_lock_retry",
+	"vma_lock_miss",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (19 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 14:38   ` Laurent Dufour
  2022-09-01 17:35 ` [RFC PATCH RESEND 22/28] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Introduce the find_and_lock_anon_vma function to look up and lock an
anonymous VMA during page fault handling. When the VMA is not found, cannot
be locked, or changes after being locked, the function returns NULL. The
lookup is performed under RCU protection to prevent the found VMA from being
destroyed before the VMA lock is acquired. VMA lock statistics are updated
according to the result.
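
The intended usage from arch fault-handling code mirrors the x86 conversion
later in this series, roughly:

	vma = find_and_lock_anon_vma(mm, address);
	if (!vma)
		goto lock_mmap;		/* fall back to the mmap_lock path */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_read_unlock(vma);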

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c3190eaabd7..a3cbaa7b9119 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 		      vma);
 }
 
+struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
+					      unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 29d2f49f922a..bf557f7056de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5183,6 +5183,51 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline struct vm_area_struct *find_vma_under_rcu(struct mm_struct *mm,
+							unsigned long address)
+{
+	struct vm_area_struct *vma = __find_vma(mm, address);
+
+	if (!vma || vma->vm_start > address)
+		return NULL;
+
+	if (!vma_is_anonymous(vma))
+		return NULL;
+
+	if (!vma_read_trylock(vma)) {
+		count_vm_vma_lock_event(VMA_LOCK_ABORT);
+		return NULL;
+	}
+
+	/* Check if the VMA got isolated after we found it */
+	if (RB_EMPTY_NODE(&vma->vm_rb)) {
+		vma_read_unlock(vma);
+		count_vm_vma_lock_event(VMA_LOCK_MISS);
+		return NULL;
+	}
+
+	return vma;
+}
+
+/*
+ * Look up and lock an anonymous VMA. The returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified, the
+ * function returns NULL.
+ */
+struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
+					      unsigned long address)
+{
+	struct vm_area_struct *vma;
+
+	rcu_read_lock();
+	vma = find_vma_under_rcu(mm, address);
+	rcu_read_unlock();
+
+	return vma;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 22/28] x86/mm: try VMA lock-based page fault handling first
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (20 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 17:35 ` [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa71a5d12e87..35e74e3dc2c1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/mm.h>			/* find_and_lock_anon_vma() */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1323,6 +1324,38 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = find_and_lock_anon_vma(mm, address);
+	if (!vma)
+		goto lock_mmap;
+
+	if (unlikely(access_error(error_code, vma))) {
+		vma_read_unlock(vma);
+		goto lock_mmap;
+	}
+	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR,
+						 ARCH_DEFAULT_PKEY);
+		return;
+	}
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
 	/*
 	 * Kernel-mode access to the user address space should only occur
 	 * on well-defined single instructions listed in the exception
@@ -1423,6 +1456,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (21 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 22/28] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 20:20   ` Kent Overstreet
  2022-09-01 17:35 ` [RFC PATCH RESEND 24/28] arm64/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
                   ` (7 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Set ARCH_SUPPORTS_PER_VMA_LOCK so that the per-VMA lock support can be
compiled on this architecture.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..ee19de020b27 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 24/28] arm64/mm: try VMA lock-based page fault handling first
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (22 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 17:35 ` [RFC PATCH RESEND 25/28] arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/arm64/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c33f1fad2745..f05ce40ff32b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -525,6 +525,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	unsigned long vm_flags;
 	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
 	unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+	struct vm_area_struct *vma;
+#endif
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
@@ -575,6 +578,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = find_and_lock_anon_vma(mm, addr);
+	if (!vma)
+		goto lock_mmap;
+
+	if (!(vma->vm_flags & vm_flags)) {
+		vma_read_unlock(vma);
+		goto lock_mmap;
+	}
+	fault = handle_mm_fault(vma, addr & PAGE_MASK,
+				mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			goto no_context;
+		return 0;
+	}
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
 	/*
 	 * As per x86, we may deadlock here. However, since the kernel only
 	 * validly references user space from well defined areas of the code,
@@ -618,6 +651,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	}
 	mmap_read_unlock(mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	/*
 	 * Handle the "normal" (no error) case first.
 	 */
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 25/28] arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (23 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 24/28] arm64/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 17:35 ` [RFC PATCH RESEND 26/28] powerc/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

Set ARCH_SUPPORTS_PER_VMA_LOCK so that the per-VMA lock support can be
compiled on this architecture.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9fb9fff08c94..0747ae1f3b39 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 26/28] powerc/mm: try VMA lock-based page fault handling first
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (24 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 25/28] arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 17:35 ` [RFC PATCH RESEND 27/28] powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

From: Laurent Dufour <ldufour@linux.ibm.com>

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/mm/fault.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 014005428687..c92bdfcd1796 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -450,6 +450,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (is_exec)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = find_and_lock_anon_vma(mm, address);
+	if (!vma)
+		goto lock_mmap;
+
+	if (unlikely(access_pkey_error(is_write, is_exec,
+				       (error_code & DSISR_KEYFAULT), vma))) {
+		int rc = bad_access_pkey(regs, address, vma);
+
+		vma_read_unlock(vma);
+		return rc;
+	}
+
+	if (unlikely(access_error(is_write, is_exec, vma))) {
+		int rc = bad_access(regs, address);
+
+		vma_read_unlock(vma);
+		return rc;
+	}
+
+	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	if (fault_signal_pending(fault, regs))
+		return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -526,6 +564,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	mmap_read_unlock(current->mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return mm_fault_error(regs, address, fault);
 
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 27/28] powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (25 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 26/28] powerc/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-01 17:35 ` [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

From: Laurent Dufour <ldufour@linux.ibm.com>

Set ARCH_SUPPORTS_PER_VMA_LOCK so that the per-VMA lock support can be
compiled on powernv and pseries.
It may be usable on the other platforms, but I can't test that seriously.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/platforms/powernv/Kconfig | 1 +
 arch/powerpc/platforms/pseries/Kconfig | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
 	select PPC_DOORBELL
 	select MMU_NOTIFIER
 	select FORCE_SMP
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	default y
 
 config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index fb6499977f99..7d13a2de3475 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
 	select HOTPLUG_CPU
 	select FORCE_SMP
 	select SWIOTLB
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	default y
 
 config PARAVIRT_SPINLOCKS
-- 
2.37.2.789.g6183377224-goog



* [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (26 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 27/28] powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
@ 2022-09-01 17:35 ` Suren Baghdasaryan
  2022-09-09 15:19   ` Laurent Dufour
  2022-09-01 20:58 ` [RFC PATCH RESEND 00/28] per-VMA locks proposal Kent Overstreet
                   ` (2 subsequent siblings)
  30 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 17:35 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	surenb, kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev,
	x86, linux-kernel

call_rcu() can take a long time when callback offloading is enabled.
Its use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 11 ++++++-
 kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
 mm/init-mm.c             |  3 ++
 mm/mmap.c                |  1 +
 5 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a3cbaa7b9119..81dff694ac14 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -249,6 +249,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36562e702baf..6f3effc493b1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -412,7 +412,11 @@ struct vm_area_struct {
 			struct vm_area_struct *vm_next, *vm_prev;
 		};
 #ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
+		struct {
+			struct list_head vm_free_list;
+			/* Used for deferred freeing. */
+			struct rcu_head vm_rcu;
+		};
 #endif
 	};
 
@@ -573,6 +577,11 @@ struct mm_struct {
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
 		int mm_lock_seq;
+		struct {
+			struct list_head head;
+			spinlock_t lock;
+			int size;
+		} vma_free_list;
 #endif
 
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b443ba3a247a..7c88710aed72 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -483,26 +483,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+	struct vm_area_struct *first_vma;
+	struct vm_area_struct *vma, *vma2;
+
+	first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+	list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+		__vm_area_free(vma);
+	__vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+	struct vm_area_struct *first_vma;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_splice_init(&mm->vma_free_list.head, &to_destroy);
+	mm->vma_free_list.size = 0;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (list_empty(&to_destroy))
+		return;
+
+	first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+	/* Remove the head which is allocated on the stack */
+	list_del(&to_destroy);
+
+	call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX	32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool drain;
+
+	free_anon_vma_name(vma);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+	mm->vma_free_list.size++;
+	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (drain)
+		drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
 	kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1137,6 +1186,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
 	WRITE_ONCE(mm->mm_lock_seq, 0);
+	INIT_LIST_HEAD(&mm->vma_free_list.head);
+	spin_lock_init(&mm->vma_free_list.lock);
+	mm->vma_free_list.size = 0;
 #endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 8399f90d631c..7b6d2460545f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -39,6 +39,9 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
 	.mm_lock_seq	= 0,
+	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
+	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
+	.vma_free_list.size = 0,
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edfcd384f5e..d61b7ef84ba6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3149,6 +3149,7 @@ void exit_mmap(struct mm_struct *mm)
 	}
 	mm->mmap = NULL;
 	mmap_write_unlock(mm);
+	drain_free_vmas(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
-- 
2.37.2.789.g6183377224-goog



* Re: [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  2022-09-01 17:35 ` [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
@ 2022-09-01 20:20   ` Kent Overstreet
  2022-09-01 23:17     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Kent Overstreet @ 2022-09-01 20:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	rientjes, axelrasmussen, joelaf, minchan, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel

On Thu, Sep 01, 2022 at 10:35:11AM -0700, Suren Baghdasaryan wrote:
> Set ARCH_SUPPORTS_PER_VMA_LOCK so that the per-VMA lock support can be
> compiled on this architecture.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  arch/x86/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f9920f1341c8..ee19de020b27 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -27,6 +27,7 @@ config X86_64
>  	# Options that are inherently 64-bit kernel only:
>  	select ARCH_HAS_GIGANTIC_PAGE
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> +	select ARCH_SUPPORTS_PER_VMA_LOCK
>  	select ARCH_USE_CMPXCHG_LOCKREF
>  	select HAVE_ARCH_SOFT_DIRTY
>  	select MODULES_USE_ELF_RELA

I think you could combine this with the previous patch (and similarly on other
architectures) - they logically go together.


* Re: [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection
  2022-09-01 17:34 ` [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection Suren Baghdasaryan
@ 2022-09-01 20:22   ` Kent Overstreet
  2022-09-01 23:18     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Kent Overstreet @ 2022-09-01 20:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	rientjes, axelrasmussen, joelaf, minchan, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel

On Thu, Sep 01, 2022 at 10:34:51AM -0700, Suren Baghdasaryan wrote:
> Add __find_vma function to be used for VMA lookup under rcu protection.

So it was news to me that the rb tree code can be used for lockless lookups -
not having looked at lib/rbtree.c in over 10 years :) - I still think it should
be mentioned in the commit message that that's what you're doing and why it's
safe, because it's not exactly common knowledge and lockless stuff deserves
extra scrutiny.

Probably worth a comment, too.

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>


* Re: [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-01 17:34 ` [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions Suren Baghdasaryan
@ 2022-09-01 20:24   ` Kent Overstreet
  2022-09-01 20:51     ` Liam Howlett
  2022-09-02  6:23     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 91+ messages in thread
From: Kent Overstreet @ 2022-09-01 20:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	rientjes, axelrasmussen, joelaf, minchan, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel

On Thu, Sep 01, 2022 at 10:34:52AM -0700, Suren Baghdasaryan wrote:
> Move mmap_lock assert function definitions up so that they can be used
> by other mmap_lock routines.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mmap_lock.h | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 96e113e23d04..e49ba91bb1f0 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
>  
>  #endif /* CONFIG_TRACING */
>  
> +static inline void mmap_assert_locked(struct mm_struct *mm)
> +{
> +	lockdep_assert_held(&mm->mmap_lock);
> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);

These look redundant to me - maybe there's a reason the VM developers want both,
but I would drop the VM_BUG_ON() and just keep the lockdep_assert_held(), since
that's the standard way to write that assertion.


* Re: [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-01 20:24   ` Kent Overstreet
@ 2022-09-01 20:51     ` Liam Howlett
  2022-09-01 23:21       ` Suren Baghdasaryan
  2022-09-02  6:23     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 91+ messages in thread
From: Liam Howlett @ 2022-09-01 20:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, dave, willy, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, rientjes, axelrasmussen, joelaf, minchan, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, linux-kernel

* Kent Overstreet <kent.overstreet@linux.dev> [220901 16:24]:
> On Thu, Sep 01, 2022 at 10:34:52AM -0700, Suren Baghdasaryan wrote:
> > Move mmap_lock assert function definitions up so that they can be used
> > by other mmap_lock routines.
> > 
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mmap_lock.h | 24 ++++++++++++------------
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index 96e113e23d04..e49ba91bb1f0 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> >  
> >  #endif /* CONFIG_TRACING */
> >  
> > +static inline void mmap_assert_locked(struct mm_struct *mm)
> > +{
> > +	lockdep_assert_held(&mm->mmap_lock);
> > +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> 
> These look redundant to me - maybe there's a reason the VM developers want both,
> but I would drop the VM_BUG_ON() and just keep the lockdep_assert_held(), since
> that's the standard way to write that assertion.

I think this is because the VM_BUG_ON_MM() will give you a lot more
information and is a BUG_ON().

lockdep_assert_held() does not return a value and is a WARN_ON().

So they are partially redundant.


* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (27 preceding siblings ...)
  2022-09-01 17:35 ` [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
@ 2022-09-01 20:58 ` Kent Overstreet
  2022-09-01 23:26   ` Suren Baghdasaryan
  2022-09-02  7:42 ` Peter Zijlstra
  2022-09-05 12:32 ` Michal Hocko
  30 siblings, 1 reply; 91+ messages in thread
From: Kent Overstreet @ 2022-09-01 20:58 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	rientjes, axelrasmussen, joelaf, minchan, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel

On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> Resending to fix the issue with the In-Reply-To tag in the original
> submission at [4].
> 
> This is a proof of concept for per-vma locks idea that was discussed
> during SPF [1] discussion at LSF/MM this year [2], which concluded with
> suggestion that “a reader/writer semaphore could be put into the VMA
> itself; that would have the effect of using the VMA as a sort of range
> lock. There would still be contention at the VMA level, but it would be an
> improvement.” This patchset implements this suggested approach.
> 
> When handling page faults we lookup the VMA that contains the faulting
> page under RCU protection and try to acquire its lock. If that fails we
> fall back to using mmap_lock, similar to how SPF handled this situation.
> 
> One notable way the implementation deviates from the proposal is the way
> VMAs are marked as locked. Because during some of mm updates multiple
> VMAs need to be locked until the end of the update (e.g. vma_merge,
> split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> and other complications would make the code more complex. Therefore we
> provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> all at once. This is done using two sequence numbers - one in the
> vm_area_struct and one in the mm_struct. VMA is considered locked when
> these sequence numbers are equal. To mark a VMA as locked we set the
> sequence number in vm_area_struct to be equal to the sequence number
> in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> This allows for an efficient way to track locked VMAs and to drop the
> locks on all VMAs at the end of the update.

I like it - the sequence numbers are a stroke of genius. For what it's doing
the patchset seems almost small.

Two complaints so far:
 - I don't like the vma_mark_locked() name. To me it says that the caller
   already took or is taking the lock and this function is just marking that
   we're holding the lock, but it's really taking a different type of lock. But
   this function can block, it really is taking a lock, so it should say that.
   
   This is AFAIK a new concept, not sure I'm going to have anything good either,
   but perhaps vma_lock_multiple()?

 - I don't like the #ifdef and the separate fallback path in the fault handlers.

   Can we make find_and_lock_anon_vma() do the right thing, and not fail unless
   e.g. there isn't a vma at that address? Just have it wait for vm_lock_seq to
   change and then retry if needed.
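
   Roughly what I have in mind for that last point - everything except
   find_and_lock_anon_vma() is a made-up name, this is only to illustrate the
   control flow I'm suggesting:

	struct vm_area_struct *lock_anon_vma_or_retry(struct mm_struct *mm,
						      unsigned long address)
	{
		struct vm_area_struct *vma;

		while (!(vma = find_and_lock_anon_vma(mm, address))) {
			if (!anon_vma_exists_at(mm, address))	/* made-up check */
				return NULL;			/* genuine failure */
			/* A writer has the VMA marked locked: wait for
			 * mm->mm_lock_seq to move on, then redo the lookup.
			 */
			wait_for_mm_lock_seq_change(mm);	/* made-up helper */
		}
		return vma;
	}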

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  2022-09-01 20:20   ` Kent Overstreet
@ 2022-09-01 23:17     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 23:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins, bigeasy,
	David Rientjes, Axel Rasmussen, Joel Fernandes, Minchan Kim,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Thu, Sep 1, 2022 at 1:21 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Sep 01, 2022 at 10:35:11AM -0700, Suren Baghdasaryan wrote:
> > Set ARCH_SUPPORTS_PER_VMA_LOCK so that the per-VMA lock support can be
> > compiled on this architecture.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  arch/x86/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index f9920f1341c8..ee19de020b27 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -27,6 +27,7 @@ config X86_64
> >       # Options that are inherently 64-bit kernel only:
> >       select ARCH_HAS_GIGANTIC_PAGE
> >       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > +     select ARCH_SUPPORTS_PER_VMA_LOCK
> >       select ARCH_USE_CMPXCHG_LOCKREF
> >       select HAVE_ARCH_SOFT_DIRTY
> >       select MODULES_USE_ELF_RELA
>
> I think you could combine this with the previous patch (and similarly on other
> architectures) - they logically go together.

Thanks for the feedback! I see no downside to that, so unless there
are objections I will combine them in the next version.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection
  2022-09-01 20:22   ` Kent Overstreet
@ 2022-09-01 23:18     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 23:18 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins, bigeasy,
	David Rientjes, Axel Rasmussen, Joel Fernandes, Minchan Kim,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Thu, Sep 1, 2022 at 1:22 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Sep 01, 2022 at 10:34:51AM -0700, Suren Baghdasaryan wrote:
> > Add __find_vma function to be used for VMA lookup under rcu protection.
>
> So it was news to me that the rb tree code can be used for lockless lookups -
> not having looked at lib/rbtree.c in over 10 years :) - I still think it should
> be mentioned in the commit message that that's what you're doing and why it's
> safe, because it's not exactly common knowledge and lockless stuff deserves
> extra scrutiny.
>
> Probably worth a comment, too.

Ack.

>
> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

Thanks!

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-01 20:51     ` Liam Howlett
@ 2022-09-01 23:21       ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 23:21 UTC (permalink / raw)
  To: Liam Howlett
  Cc: Kent Overstreet, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, willy, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	rientjes, axelrasmussen, joelaf, minchan, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel

On Thu, Sep 1, 2022 at 1:51 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Kent Overstreet <kent.overstreet@linux.dev> [220901 16:24]:
> > On Thu, Sep 01, 2022 at 10:34:52AM -0700, Suren Baghdasaryan wrote:
> > > Move mmap_lock assert function definitions up so that they can be used
> > > by other mmap_lock routines.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/mmap_lock.h | 24 ++++++++++++------------
> > >  1 file changed, 12 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > > index 96e113e23d04..e49ba91bb1f0 100644
> > > --- a/include/linux/mmap_lock.h
> > > +++ b/include/linux/mmap_lock.h
> > > @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> > >
> > >  #endif /* CONFIG_TRACING */
> > >
> > > +static inline void mmap_assert_locked(struct mm_struct *mm)
> > > +{
> > > +   lockdep_assert_held(&mm->mmap_lock);
> > > +   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >
> > These look redundant to me - maybe there's a reason the VM developers want both,
> > but I would drop the VM_BUG_ON() and just keep the lockdep_assert_held(), since
> > that's the standard way to write that assertion.
>
> I think this is because the VM_BUG_ON_MM() will give you a lot more
> > information and will BUG_ON().
>
> lockdep_assert_held() does not return a value and is a WARN_ON().
>
> So they are partially redundant.

Yeah and I do not intend to change the existing functionality in this
patchset. If needed we can post a separate patch removing the
redundancy but from my experience debugging this code, VM_BUG_ON_MM
reports were very useful.

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-01 20:58 ` [RFC PATCH RESEND 00/28] per-VMA locks proposal Kent Overstreet
@ 2022-09-01 23:26   ` Suren Baghdasaryan
  2022-09-11  9:35     ` Vlastimil Babka
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-01 23:26 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins, bigeasy,
	David Rientjes, Axel Rasmussen, Joel Fernandes, Minchan Kim,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> > Resending to fix the issue with the In-Reply-To tag in the original
> > submission at [4].
> >
> > This is a proof of concept for per-vma locks idea that was discussed
> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> > suggestion that “a reader/writer semaphore could be put into the VMA
> > itself; that would have the effect of using the VMA as a sort of range
> > lock. There would still be contention at the VMA level, but it would be an
> > improvement.” This patchset implements this suggested approach.
> >
> > When handling page faults we lookup the VMA that contains the faulting
> > page under RCU protection and try to acquire its lock. If that fails we
> > fall back to using mmap_lock, similar to how SPF handled this situation.
> >
> > One notable way the implementation deviates from the proposal is the way
> > VMAs are marked as locked. Because during some of mm updates multiple
> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> > and other complications would make the code more complex. Therefore we
> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> > all at once. This is done using two sequence numbers - one in the
> > vm_area_struct and one in the mm_struct. VMA is considered locked when
> > these sequence numbers are equal. To mark a VMA as locked we set the
> > sequence number in vm_area_struct to be equal to the sequence number
> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> > This allows for an efficient way to track locked VMAs and to drop the
> > locks on all VMAs at the end of the update.
>
> I like it - the sequence numbers are a stroke of genius. For what it's doing
> the patchset seems almost small.

Thanks for reviewing it!

>
> Two complaints so far:
>  - I don't like the vma_mark_locked() name. To me it says that the caller
>    already took or is taking the lock and this function is just marking that
>    we're holding the lock, but it's really taking a different type of lock. But
>    this function can block, it really is taking a lock, so it should say that.
>
>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>    but perhaps vma_lock_multiple()?

I'm open to name suggestions but vma_lock_multiple() is a bit
confusing to me. Will wait for more suggestions.

>
>  - I don't like the #ifdef and the separate fallback path in the fault handlers.
>
>    Can we make find_and_lock_anon_vma() do the right thing, and not fail unless
>    e.g. there isn't a vma at that address? Just have it wait for vm_lock_seq to
>    change and then retry if needed.

I think it can be done but would come with additional complexity. I
was really trying to keep things as simple as possible after SPF got
shot down on the grounds of complexity. I hope to start simple and
improve only when necessary.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-01 20:24   ` Kent Overstreet
  2022-09-01 20:51     ` Liam Howlett
@ 2022-09-02  6:23     ` Sebastian Andrzej Siewior
  2022-09-02 17:46       ` Suren Baghdasaryan
  1 sibling, 1 reply; 91+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-09-02  6:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 2022-09-01 16:24:09 [-0400], Kent Overstreet wrote:
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> >  
> >  #endif /* CONFIG_TRACING */
> >  
> > +static inline void mmap_assert_locked(struct mm_struct *mm)
> > +{
> > +	lockdep_assert_held(&mm->mmap_lock);
> > +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> 
> These look redundant to me - maybe there's a reason the VM developers want both,
> but I would drop the VM_BUG_ON() and just keep the lockdep_assert_held(), since
> that's the standard way to write that assertion.

Exactly. rwsem_is_locked() returns true only if the lock is "locked", not
necessarily by the caller. lockdep_assert_held() checks that the lock is
locked by the caller - this is the important part.

Sebastian

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (28 preceding siblings ...)
  2022-09-01 20:58 ` [RFC PATCH RESEND 00/28] per-VMA locks proposal Kent Overstreet
@ 2022-09-02  7:42 ` Peter Zijlstra
  2022-09-02 14:45   ` Suren Baghdasaryan
  2022-09-05 12:32 ` Michal Hocko
  30 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2022-09-02  7:42 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> This is a proof of concept for per-vma locks idea that was discussed
> during SPF [1] discussion at LSF/MM this year [2], which concluded with
> suggestion that “a reader/writer semaphore could be put into the VMA
> itself; that would have the effect of using the VMA as a sort of range
> lock. There would still be contention at the VMA level, but it would be an
> improvement.” This patchset implements this suggested approach.

The whole reason I started the SPF thing waay back when was because one
of the primary reporters at the time had very large VMAs and a per-vma
lock wouldn't actually help anything at all.

IIRC it was either scientific code initializing a huge matrix or a
database with a giant table; I'm sure the archives have better memory
than me.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-02  7:42 ` Peter Zijlstra
@ 2022-09-02 14:45   ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-02 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Laurent Dufour, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins, bigeasy,
	Kent Overstreet, David Rientjes, Axel Rasmussen, Joel Fernandes,
	Minchan Kim, kernel-team, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, LKML

On Fri, Sep 2, 2022 at 12:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> > This is a proof of concept for per-vma locks idea that was discussed
> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> > suggestion that “a reader/writer semaphore could be put into the VMA
> > itself; that would have the effect of using the VMA as a sort of range
> > lock. There would still be contention at the VMA level, but it would be an
> > improvement.” This patchset implements this suggested approach.
>
> The whole reason I started the SPF thing waay back when was because one
> of the primary reporters at the time had very large VMAs and a per-vma
> lock wouldn't actually help anything at all.
>
> IIRC it was either scientific code initializing a huge matrix or a
> database with a giant table; I'm sure the archives have better memory
> than me.

Regardless of the initial intent, SPF happens to be very useful for
cases when we have multiple threads establishing some mappings
concurrently with page faults (see details at [1]). Android vendors,
independently from each other, have been backporting your and Laurent's
patchset for years. I found internal reports of similar mmap_lock
contention issues in Google Fibers [2] and I suspect there are more
places where this happens if people look closer.

[1] https://lore.kernel.org/all/CAJuCfpE10y78SNPQ+LRY5EonDFhOG=1XjZ9FUUDiyhfhjZ54NA@mail.gmail.com/
[2] https://www.phoronix.com/scan.php?page=news_item&px=Google-Fibers-Toward-Open

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions
  2022-09-02  6:23     ` Sebastian Andrzej Siewior
@ 2022-09-02 17:46       ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-02 17:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Kent Overstreet, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Michal Hocko, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Davidlohr Bueso, Matthew Wilcox, Liam R. Howlett, Peter Zijlstra,
	Laurent Dufour, Laurent Dufour, Paul E . McKenney,
	Andy Lutomirski, Song Liu, Peter Xu, David Hildenbrand, dhowells,
	Hugh Dickins, David Rientjes, Axel Rasmussen, Joel Fernandes,
	Minchan Kim, kernel-team, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, LKML

On Thu, Sep 1, 2022 at 11:23 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2022-09-01 16:24:09 [-0400], Kent Overstreet wrote:
> > > --- a/include/linux/mmap_lock.h
> > > +++ b/include/linux/mmap_lock.h
> > > @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> > >
> > >  #endif /* CONFIG_TRACING */
> > >
> > > +static inline void mmap_assert_locked(struct mm_struct *mm)
> > > +{
> > > +   lockdep_assert_held(&mm->mmap_lock);
> > > +   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >
> > These look redundant to me - maybe there's a reason the VM developers want both,
> > but I would drop the VM_BUG_ON() and just keep the lockdep_assert_held(), since
> > that's the standard way to write that assertion.
>
> Exactly. rwsem_is_locked() returns true only if the lock is "locked", not
> necessarily by the caller. lockdep_assert_held() checks that the lock is
> locked by the caller - this is the important part.

Ok, if at the end of the day there is a consensus that this redundancy
should be removed then I'll do that in a patch separate from this
series. Please note that in this patch I'm not changing these
functions in any way, just moving them.

>
> Sebastian
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
                   ` (29 preceding siblings ...)
  2022-09-02  7:42 ` Peter Zijlstra
@ 2022-09-05 12:32 ` Michal Hocko
  2022-09-05 18:32   ` Suren Baghdasaryan
  30 siblings, 1 reply; 91+ messages in thread
From: Michal Hocko @ 2022-09-05 12:32 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Unless I am missing something, this is not based on the Maple tree
rewrite, right? Does the change in the data structure make any
difference to the approach? I remember discussions at LSFMM where it has
been pointed out that some issues with the vma tree are considerably
simpler to handle with the maple tree.

On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
[...]
> One notable way the implementation deviates from the proposal is the way
> VMAs are marked as locked. Because during some of mm updates multiple
> VMAs need to be locked until the end of the update (e.g. vma_merge,
> split_vma, etc).

I think it would be really helpful to spell out those issues in a greater
detail. Not everybody is aware of those vma related subtleties.

Thanks for working on this Suren!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-05 12:32 ` Michal Hocko
@ 2022-09-05 18:32   ` Suren Baghdasaryan
  2022-09-05 20:35     ` Kent Overstreet
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-05 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Vlastimil Babka,
	Johannes Weiner, Mel Gorman, Davidlohr Bueso, Matthew Wilcox,
	Liam R. Howlett, Peter Zijlstra, Laurent Dufour, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> Unless I am missing something, this is not based on the Maple tree
> rewrite, right? Does the change in the data structure make any
> difference to the approach? I remember discussions at LSFMM where it has
> been pointed out that some issues with the vma tree are considerably
> simpler to handle with the maple tree.

Correct, this does not use the Maple tree yet but once Maple tree
transition happens and it supports RCU-safe lookups, my code in
find_vma_under_rcu() becomes really simple.
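
Conceptually that lookup boils down to something like the sketch below
(simplified, not the actual code from the series, and skipping some of the
revalidation the real function has to do):

	struct vm_area_struct *find_vma_under_rcu(struct mm_struct *mm,
						  unsigned long address)
	{
		struct vm_area_struct *vma;

		rcu_read_lock();
		vma = __find_vma(mm, address);		/* lockless tree lookup */
		if (vma && (address < vma->vm_start ||	/* address is in a gap */
			    !vma_read_trylock(vma)))	/* marked locked by a writer */
			vma = NULL;
		rcu_read_unlock();
		/* When non-NULL is returned, the per-VMA read lock keeps the VMA
		 * stable until vma_read_unlock().
		 */
		return vma;
	}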

>
> On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> [...]
> > One notable way the implementation deviates from the proposal is the way
> > VMAs are marked as locked. Because during some of mm updates multiple
> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > split_vma, etc).
>
> I think it would be really helpful to spell out those issues in a greater
> detail. Not everybody is aware of those vma related subtleties.

Ack. I'll expand the description of the cases when multiple VMAs need
to be locked in the same update. The main difficulties are:
1. Multiple VMAs might need to be locked within one
mmap_write_lock/mmap_write_unlock session (will call it an update
transaction).
2. Figuring out when it's safe to unlock a previously locked VMA is
tricky because that might be happening in different functions and at
different call levels.

So, instead of the usual lock/unlock pattern, the proposed solution
marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.

We also postpone unlocking the locked VMAs until the end of the update
transaction, when we do mmap_write_unlock. Potentially this keeps a
VMA locked for longer than is absolutely necessary but it results in a
big reduction of code complexity.
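
Condensed into a sketch, the pattern looks like this (using the helpers from
patch 05, trimmed down here just to show the idea):

	/* Update ("writer") side: mark as many VMAs as the operation needs. */
	mmap_write_lock(mm);
	vma_mark_locked(vma1);	/* briefly takes vma->lock, sets vm_lock_seq = mm_lock_seq */
	vma_mark_locked(vma2);
	/* ... modify vma1 and vma2, possibly in different functions ... */
	mmap_write_unlock(mm);	/* bumps mm->mm_lock_seq: all marked VMAs become unlocked */

	/* Page fault ("reader") side. */
	if (vma_read_trylock(vma)) {	/* fails while vm_lock_seq == mm_lock_seq */
		/* ... handle the fault under the per-VMA lock ... */
		vma_read_unlock(vma);
	} else {
		/* fall back to the mmap_lock path */
	}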

>
> Thanks for working on this Suren!

Thanks for reviewing!
Suren.

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-05 18:32   ` Suren Baghdasaryan
@ 2022-09-05 20:35     ` Kent Overstreet
  2022-09-06 15:46       ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Kent Overstreet @ 2022-09-05 20:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, David Rientjes, Axel Rasmussen,
	Joel Fernandes, Minchan Kim, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, LKML

On Mon, Sep 05, 2022 at 11:32:48AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > Unless I am missing something, this is not based on the Maple tree
> > rewrite, right? Does the change in the data structure make any
> > difference to the approach? I remember discussions at LSFMM where it has
> > been pointed out that some issues with the vma tree are considerably
> > simpler to handle with the maple tree.
> 
> Correct, this does not use the Maple tree yet but once Maple tree
> transition happens and it supports RCU-safe lookups, my code in
> find_vma_under_rcu() becomes really simple.
> 
> >
> > On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> > [...]
> > > One notable way the implementation deviates from the proposal is the way
> > > VMAs are marked as locked. Because during some of mm updates multiple
> > > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > > split_vma, etc).
> >
> > I think it would be really helpful to spell out those issues in a greater
> > detail. Not everybody is aware of those vma related subtleties.
> 
> Ack. I'll expand the description of the cases when multiple VMAs need
> to be locked in the same update. The main difficulties are:
> 1. Multiple VMAs might need to be locked within one
> mmap_write_lock/mmap_write_unlock session (will call it an update
> transaction).
> 2. Figuring out when it's safe to unlock a previously locked VMA is
> tricky because that might be happening in different functions and at
> different call levels.
> 
> So, instead of the usual lock/unlock pattern, the proposed solution
> marks a VMA as locked and provides an efficient way to:
> 1. Identify locked VMAs.
> 2. Unlock all locked VMAs in bulk.
> 
> We also postpone unlocking the locked VMAs until the end of the update
> transaction, when we do mmap_write_unlock. Potentially this keeps a
> VMA locked for longer than is absolutely necessary but it results in a
> big reduction of code complexity.

Correct me if I'm wrong, but it looks like any time multiple VMAs need to be
locked we need mmap_lock anyways, which is what makes your approach so sweet.

If however we ever want to lock multiple VMAs without taking mmap_lock, then
deadlock avoidance algorithms aren't that bad - there's the ww_mutex approach,
which is simple and works well when there isn't much expected contention (the
advantage of the ww_mutex approach is that it doesn't have to track all held
locks). I've also written full cycle detection; that approach gets you fewer
restarts, at the cost of needing a list of all currently held locks.
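
For reference, the ww_mutex flavour looks roughly like this for two locks (a
generic sketch, the class name and wrapper function are made up; the real rules
are spelled out in Documentation/locking/ww-mutex-design.rst):

	#include <linux/ww_mutex.h>

	static DEFINE_WW_CLASS(vma_ww_class);		/* made-up class */

	/* Lock m1 and m2 in either order without deadlocking; on success the
	 * caller unlocks both and then calls ww_acquire_fini(ctx).
	 */
	static int lock_two(struct ww_mutex *m1, struct ww_mutex *m2,
			    struct ww_acquire_ctx *ctx)
	{
		int err;

		ww_acquire_init(ctx, &vma_ww_class);

		err = ww_mutex_lock(m1, ctx);
		if (err)
			goto out_fini;

		err = ww_mutex_lock(m2, ctx);
		while (err == -EDEADLK) {
			/* An older transaction owns m2: back off, wait for it,
			 * then retry with the locking order swapped.
			 */
			ww_mutex_unlock(m1);
			ww_mutex_lock_slow(m2, ctx);
			swap(m1, m2);
			err = ww_mutex_lock(m2, ctx);
		}
		if (err) {
			ww_mutex_unlock(m1);
			goto out_fini;
		}

		ww_acquire_done(ctx);
		return 0;

	out_fini:
		ww_acquire_fini(ctx);
		return err;
	}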

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it
  2022-09-01 17:34 ` [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
@ 2022-09-06 13:46   ` Laurent Dufour
  2022-09-06 17:24     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 13:46 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Apart from a minor comment below,

Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

> ---
>  include/linux/mm.h        | 78 +++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h  |  7 ++++
>  include/linux/mmap_lock.h | 13 +++++++
>  kernel/fork.c             |  4 ++
>  mm/init-mm.c              |  3 ++
>  5 files changed, 105 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7d322a979455..476bf936c5f0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -611,6 +611,83 @@ struct vm_operations_struct {
>  					  unsigned long addr);
>  };
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_init_lock(struct vm_area_struct *vma)
> +{
> +	init_rwsem(&vma->lock);
> +	vma->vm_lock_seq = -1;
> +}
> +
> +static inline void vma_mark_locked(struct vm_area_struct *vma)
> +{
> +	int mm_lock_seq;
> +
> +	mmap_assert_write_locked(vma->vm_mm);
> +
> +	/*
> +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +	 * mm->mm_lock_seq can't be concurrently modified.
> +	 */
> +	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> +	if (vma->vm_lock_seq == mm_lock_seq)
> +		return;
> +
> +	down_write(&vma->lock);
> +	vma->vm_lock_seq = mm_lock_seq;
> +	up_write(&vma->lock);
> +}
> +
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +{
> +	if (unlikely(down_read_trylock(&vma->lock) == 0))
> +		return false;
> +
> +	/*
> +	 * Overflow might produce false locked result but it's not critical.

It might be good to make it explicit here that in the false-locked case the
caller is assumed to fall back to read-locking the mm entirely before doing
its change relative to that VMA.

> +	 * False unlocked result is critical but is impossible because we
> +	 * modify and check vma->vm_lock_seq under vma->lock protection and
> +	 * mm->mm_lock_seq modification invalidates all existing locks.
> +	 */
> +	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> +		up_read(&vma->lock);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> +	up_read(&vma->lock);
> +}
> +
> +static inline void vma_assert_locked(struct vm_area_struct *vma)
> +{
> +	lockdep_assert_held(&vma->lock);
> +	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->lock), vma);
> +}
> +
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
> +{
> +	mmap_assert_write_locked(vma->vm_mm);
> +	/*
> +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +	 * mm->mm_lock_seq can't be concurrently modified.
> +	 */
> +	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> +static inline void vma_mark_locked(struct vm_area_struct *vma) {}
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +		{ return false; }
> +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> +static inline void vma_assert_locked(struct vm_area_struct *vma) {}
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
>  	static const struct vm_operations_struct dummy_vm_ops = {};
> @@ -619,6 +696,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +	vma_init_lock(vma);
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index bed25ef7c994..6a03f59c1e78 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -486,6 +486,10 @@ struct vm_area_struct {
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	struct rw_semaphore lock;
> +	int vm_lock_seq;
> +#endif
>  } __randomize_layout;
>  
>  struct kioctx_table;
> @@ -567,6 +571,9 @@ struct mm_struct {
>  					  * init_mm.mmlist, and are protected
>  					  * by mmlist_lock
>  					  */
> +#ifdef CONFIG_PER_VMA_LOCK
> +		int mm_lock_seq;
> +#endif
>  
>  
>  		unsigned long hiwater_rss; /* High-watermark of RSS usage */
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index e49ba91bb1f0..a391ae226564 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
>  	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>  }
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_mark_unlocked_all(struct mm_struct *mm)
> +{
> +	mmap_assert_write_locked(mm);
> +	/* No races during update due to exclusive mmap_lock being held */
> +	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +}
> +#else
> +static inline void vma_mark_unlocked_all(struct mm_struct *mm) {}
> +#endif
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>  	init_rwsem(&mm->mmap_lock);
> @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
>  static inline void mmap_write_unlock(struct mm_struct *mm)
>  {
>  	__mmap_lock_trace_released(mm, true);
> +	vma_mark_unlocked_all(mm);
>  	up_write(&mm->mmap_lock);
>  }
>  
>  static inline void mmap_write_downgrade(struct mm_struct *mm)
>  {
>  	__mmap_lock_trace_acquire_returned(mm, false, true);
> +	vma_mark_unlocked_all(mm);
>  	downgrade_write(&mm->mmap_lock);
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 614872438393..bfab31ecd11e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -475,6 +475,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  		 */
>  		*new = data_race(*orig);
>  		INIT_LIST_HEAD(&new->anon_vma_chain);
> +		vma_init_lock(new);
>  		new->vm_next = new->vm_prev = NULL;
>  		dup_anon_vma_name(orig, new);
>  	}
> @@ -1130,6 +1131,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	seqcount_init(&mm->write_protect_seq);
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
>  	mm->locked_vm = 0;
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index fbe7844d0912..8399f90d631c 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>  	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
> +#ifdef CONFIG_PER_VMA_LOCK
> +	.mm_lock_seq	= 0,
> +#endif
>  	.user_ns	= &init_user_ns,
>  	.cpu_bitmap	= CPU_BITS_NONE,
>  #ifdef CONFIG_IOMMU_SVA


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified
  2022-09-01 17:34 ` [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified Suren Baghdasaryan
@ 2022-09-06 14:26   ` Laurent Dufour
  2022-09-06 19:00     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 14:26 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> VMA flag modifications should be done under VMA lock to prevent concurrent
> page fault handling in that area.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  fs/proc/task_mmu.c | 1 +
>  fs/userfaultfd.c   | 6 ++++++
>  mm/madvise.c       | 1 +
>  mm/mlock.c         | 2 ++
>  mm/mmap.c          | 1 +
>  mm/mprotect.c      | 1 +
>  6 files changed, 12 insertions(+)

There are also a few vm_flags changes done in driver space, for instance:

*** arch/x86/kernel/cpu/sgx/driver.c:
sgx_mmap[98]		vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
*** arch/x86/kernel/cpu/sgx/virt.c:
sgx_vepc_mmap[108]	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
*** drivers/dax/device.c:
dax_mmap[311]		vma->vm_flags |= VM_HUGEPAGE;

I guess these changes to vm_flags should be protected as well, or at least
checked one by one.

> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4e0023643f8b..ceffa5c2c650 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1285,6 +1285,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  				if (!(vma->vm_flags & VM_SOFTDIRTY))
>  					continue;
> +				vma_mark_locked(vma);
>  				vma->vm_flags &= ~VM_SOFTDIRTY;
>  				vma_set_page_prot(vma);
>  			}
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 175de70e3adf..fe557b3d1c07 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -620,6 +620,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>  		mmap_write_lock(mm);
>  		for (vma = mm->mmap; vma; vma = vma->vm_next)
>  			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> +				vma_mark_locked(vma);
>  				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  				vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  			}
> @@ -653,6 +654,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>  
>  	octx = vma->vm_userfaultfd_ctx.ctx;
>  	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> +		vma_mark_locked(vma);
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  		return 0;
> @@ -734,6 +736,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  		atomic_inc(&ctx->mmap_changing);
>  	} else {
>  		/* Drop uffd context if remap feature not enabled */
> +		vma_mark_locked(vma);
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  	}
> @@ -891,6 +894,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>  			vma = prev;
>  		else
>  			prev = vma;
> +		vma_mark_locked(vma);
>  		vma->vm_flags = new_flags;
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  	}
> @@ -1449,6 +1453,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  		 * the next vma was merged into the current one and
>  		 * the current one has not been updated yet.
>  		 */
> +		vma_mark_locked(vma);
>  		vma->vm_flags = new_flags;
>  		vma->vm_userfaultfd_ctx.ctx = ctx;
>  
> @@ -1630,6 +1635,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  		 * the next vma was merged into the current one and
>  		 * the current one has not been updated yet.
>  		 */
> +		vma_mark_locked(vma);
>  		vma->vm_flags = new_flags;
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5f0f0948a50e..a173f0025abd 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -181,6 +181,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
>  	/*
>  	 * vm_flags is protected by the mmap_lock held in write mode.
>  	 */
> +	vma_mark_locked(vma);
>  	vma->vm_flags = new_flags;
>  	if (!vma->vm_file) {
>  		error = replace_anon_vma_name(vma, anon_name);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b14e929084cc..f62e1a4d05f2 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -380,6 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
>  	 */
>  	if (newflags & VM_LOCKED)
>  		newflags |= VM_IO;
> +	vma_mark_locked(vma);
>  	WRITE_ONCE(vma->vm_flags, newflags);
>  
>  	lru_add_drain();
> @@ -456,6 +457,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  
>  	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
>  		/* No work to do, and mlocking twice would be wrong */
> +		vma_mark_locked(vma);
>  		vma->vm_flags = newflags;
>  	} else {
>  		mlock_vma_pages_range(vma, start, end, newflags);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 693e6776be39..f89c9b058105 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1818,6 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  out:
>  	perf_event_mmap(vma);
>  
> +	vma_mark_locked(vma);
>  	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
>  	if (vm_flags & VM_LOCKED) {
>  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||

I guess this doesn't really have an impact, but the call to vma_mark_locked(vma)
could be done only in the case where the vm_flags field is actually touched.
Something like this:

	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
	if (vm_flags & VM_LOCKED) {
		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm))
+					vma == get_gate_vma(current->mm)) {
+			vma_mark_locked(vma);
			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-		else
+		} else
			mm->locked_vm += (len >> PAGE_SHIFT);
	}


> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index bc6bddd156ca..df47fc21b0e4 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -621,6 +621,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	 * vm_flags and vm_page_prot are protected by the mmap_lock
>  	 * held in write mode.
>  	 */
> +	vma_mark_locked(vma);
>  	vma->vm_flags = newflags;
>  	/*
>  	 * We want to check manually if we can change individual PTEs writable


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork
  2022-09-01 17:34 ` [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork Suren Baghdasaryan
@ 2022-09-06 14:37   ` Laurent Dufour
  2022-09-08 23:57     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 14:37 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> Protect VMAs from concurrent page fault handler while performing
> copy_page_range for VMAs having VM_WIPEONFORK flag set.

I'm wondering why that is necessary.
The copied mm is write locked, and the destination one is not reachable yet.
If any other readers are using the VMA, it is only for page fault handling.
I must have missed something, because I can't see any need to mark the VMA
as locked here.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  kernel/fork.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index bfab31ecd11e..1872ad549fed 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -709,8 +709,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> +			vma_mark_locked(mpnt);
>  			retval = copy_page_range(tmp, mpnt);
> +		}
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage
  2022-09-01 17:34 ` [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage Suren Baghdasaryan
@ 2022-09-06 14:43   ` Laurent Dufour
  2022-09-09  0:15     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 14:43 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> Protect VMA from concurrent page fault handler while modifying it in
> collapse_huge_page.

Is the goal to protect changes in the anon_vma structure?

AFAICS, the vma itself is not impacted here, only the anon_vma and the
PMD/PTE are touched, and they have their own protection mechanisms, don't they?

> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/khugepaged.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 01f71786d530..030680633989 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1072,6 +1072,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	if (mm_find_pmd(mm, address) != pmd)
>  		goto out_up_write;
>  
> +	vma_mark_locked(vma);
>  	anon_vma_lock_write(vma->anon_vma);
>  
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy
  2022-09-01 17:34 ` [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy Suren Baghdasaryan
@ 2022-09-06 14:47   ` Laurent Dufour
  2022-09-09  0:27     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 14:47 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> Protect VMA from concurrent page fault handler while performing VMA
> protection policy changes.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mempolicy.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b73d3248d976..6be1e5c75556 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -383,8 +383,10 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
>  	struct vm_area_struct *vma;
>  
>  	mmap_write_lock(mm);
> -	for (vma = mm->mmap; vma; vma = vma->vm_next)
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		vma_mark_locked(vma);
>  		mpol_rebind_policy(vma->vm_policy, new);
> +	}
>  	mmap_write_unlock(mm);
>  }
>  
> @@ -632,6 +634,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
>  	struct mmu_gather tlb;
>  	int nr_updated;
>  
> +	vma_mark_locked(vma);

If I understand that correctly, the VMA itself is not impacted, only the
PMDs/PTEs, and they are protected using the page table locks.

Am I missing something?

>  	tlb_gather_mmu(&tlb, vma->vm_mm);
>  
>  	nr_updated = change_protection(&tlb, vma, addr, end, PAGE_NONE,
> @@ -765,6 +768,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
>  	if (IS_ERR(new))
>  		return PTR_ERR(new);
>  
> +	vma_mark_locked(vma);
>  	if (vma->vm_ops && vma->vm_ops->set_policy) {
>  		err = vma->vm_ops->set_policy(vma, new);
>  		if (err)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust
  2022-09-01 17:34 ` [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust Suren Baghdasaryan
@ 2022-09-06 15:35   ` Laurent Dufour
  2022-09-09  0:51     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 15:35 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> vma_adjust modifies a VMA and possibly its neighbors. Mark them as locked
> before making the modifications.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f89c9b058105..ed58cf0689b2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -710,6 +710,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	long adjust_next = 0;
>  	int remove_next = 0;
>  
> +	vma_mark_locked(vma);
> +	if (next)
> +		vma_mark_locked(next);
> +

I was wondering if the insert and expand VMAs should be locked too.

For expand, I can't see any valid reason, but for insert, I'm puzzled.
I would think that it is better to lock the VMA to be inserted but I can't
really justify that.

It may be nice to detail here why insert and expand do not need to be locked.

>  	if (next && !insert) {
>  		struct vm_area_struct *exporter = NULL, *importer = NULL;
>  
> @@ -754,8 +758,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  			 * If next doesn't have anon_vma, import from vma after
>  			 * next, if the vma overlaps with it.
>  			 */
> -			if (remove_next == 2 && !next->anon_vma)
> +			if (remove_next == 2 && !next->anon_vma) {
>  				exporter = next->vm_next;
> +				if (exporter)
> +					vma_mark_locked(exporter);
> +			}
>  
>  		} else if (end > next->vm_start) {
>  			/*
> @@ -931,6 +938,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  			 * "vma->vm_next" gap must be updated.
>  			 */
>  			next = vma->vm_next;
> +			if (next)
> +				vma_mark_locked(next);
>  		} else {
>  			/*
>  			 * For the scope of the comment "next" and


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them
  2022-09-01 17:34 ` [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them Suren Baghdasaryan
@ 2022-09-06 15:44   ` Laurent Dufour
  0 siblings, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 15:44 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

Le 01/09/2022 à 19:34, Suren Baghdasaryan a écrit :
> Decisions about whether VMAs can be merged or split must be made while
> VMAs are protected from the changes which can affect that decision.
> For example, vma_merge uses vma->anon_vma in its decision whether the
> VMA can be merged. Meanwhile, the page fault handler changes vma->anon_vma
> during a COW operation.
> Mark all VMAs which might be affected by a merge or split operation as
> locked before deciding how such operations should be performed.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ed58cf0689b2..ade3909c89b4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1147,10 +1147,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  	if (vm_flags & VM_SPECIAL)
>  		return NULL;
>  
> +	if (prev)
> +		vma_mark_locked(prev);
>  	next = vma_next(mm, prev);
>  	area = next;
> -	if (area && area->vm_end == end)		/* cases 6, 7, 8 */
> +	if (area)
> +		vma_mark_locked(area);
> +	if (area && area->vm_end == end) {		/* cases 6, 7, 8 */
>  		next = next->vm_next;
> +		if (next)
> +			vma_mark_locked(next);
> +	}
>  
>  	/* verify some invariant that must be enforced by the caller */
>  	VM_WARN_ON(prev && addr <= prev->vm_start);
> @@ -2687,6 +2694,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct vm_area_struct *new;
>  	int err;
>  
> +	vma_mark_locked(vma);
>  	if (vma->vm_ops && vma->vm_ops->may_split) {
>  		err = vma->vm_ops->may_split(vma, addr);
>  		if (err)

That looks good to me: the new VMA allocated by vm_area_dup(vma)
inherits the locked state from vma.

Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-05 20:35     ` Kent Overstreet
@ 2022-09-06 15:46       ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-06 15:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Michal Hocko, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, David Rientjes, Axel Rasmussen,
	Joel Fernandes, Minchan Kim, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, LKML

On Mon, Sep 5, 2022 at 1:35 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Mon, Sep 05, 2022 at 11:32:48AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > Unless I am missing something, this is not based on the Maple tree
> > > rewrite, right? Does the change in the data structure makes any
> > > difference to the approach? I remember discussions at LSFMM where it has
> > > been pointed out that some issues with the vma tree are considerably
> > > simpler to handle with the maple tree.
> >
> > Correct, this does not use the Maple tree yet, but once the Maple tree
> > transition happens and it supports RCU-safe lookups, my code in
> > find_vma_under_rcu() becomes really simple.
> >
> > >
> > > On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> > > [...]
> > > > One notable way the implementation deviates from the proposal is the way
> > > > VMAs are marked as locked. Because during some of mm updates multiple
> > > > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > > > split_vma, etc).
> > >
> > > I think it would be really helpful to spell out those issues in a greater
> > > detail. Not everybody is aware of those vma related subtleties.
> >
> > Ack. I'll expand the description of the cases when multiple VMAs need
> > to be locked in the same update. The main difficulties are:
> > 1. Multiple VMAs might need to be locked within one
> > mmap_write_lock/mmap_write_unlock session (we'll call it an update
> > transaction).
> > 2. Figuring out when it's safe to unlock a previously locked VMA is
> > tricky because that might be happening in different functions and at
> > different call levels.
> >
> > So, instead of the usual lock/unlock pattern, the proposed solution
> > marks a VMA as locked and provides an efficient way to:
> > 1. Identify locked VMAs.
> > 2. Unlock all locked VMAs in bulk.
> >
> > We also postpone unlocking the locked VMAs until the end of the update
> > transaction, when we do mmap_write_unlock. Potentially this keeps a
> > VMA locked for longer than is absolutely necessary but it results in a
> > big reduction of code complexity.
>
> Correct me if I'm wrong, but it looks like any time multiple VMAs need to be
> locked we need mmap_lock anyways, which is what makes your approach so sweet.

That is correct. Anytime we need to take a VMA's write lock we have to
be holding the write side of the mmap_lock as well. That's what allows
me to skip locking in cases like checking whether the VMA is already
locked.
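
A minimal sketch of that update-transaction pattern, using the helpers from
patch 05/28 (vma_mark_locked() plus the mm_lock_seq bump that
mmap_write_unlock() does via vma_mark_unlocked_all()); the wrapper function
and the two-VMA example are purely illustrative:

static void example_update(struct mm_struct *mm, struct vm_area_struct *a,
                           struct vm_area_struct *b)
{
        mmap_write_lock(mm);

        /* Mark every VMA the update will touch; re-marking an already
         * marked VMA is a cheap no-op thanks to the vm_lock_seq check. */
        vma_mark_locked(a);
        vma_mark_locked(b);

        /* ... modify a and b ... */

        /* vma_mark_unlocked_all() bumps mm->mm_lock_seq here, releasing
         * every VMA marked during this transaction in one go. */
        mmap_write_unlock(mm);
}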

>
> If however we ever want to lock multiple VMAs without taking mmap_lock, then
> deadlock avoidance algorithms aren't that bad - there's the ww_mutex approach,
> which is simple and works well when there isn't much expected contention (the
> advantage of the ww_mutex approach is that it doesn't have to track all held
locks). I've also written full cycle detection; that approach gets you fewer
> restarts, at the cost of needing a list of all currently held locks.

Thanks for the tip! I'll take a closer look at ww_mutex.
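
For reference, the generic ww_mutex acquire/backoff pattern being described
looks roughly like this (nothing below is part of this patchset; the
ww_class and where the ww_mutex would live are hypothetical):

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(sketch_ww_class);

static void lock_both(struct ww_mutex *a, struct ww_mutex *b)
{
        struct ww_acquire_ctx ctx;

        ww_acquire_init(&ctx, &sketch_ww_class);
        ww_mutex_lock(a, &ctx);
        if (ww_mutex_lock(b, &ctx) == -EDEADLK) {
                /* We lost the tie-break: back off, sleep on the contended
                 * lock, then retake the one we dropped. */
                ww_mutex_unlock(a);
                ww_mutex_lock_slow(b, &ctx);
                ww_mutex_lock(a, &ctx); /* cannot fail with -EDEADLK now */
        }
        ww_acquire_done(&ctx);

        /* ... modify both objects ... */

        ww_mutex_unlock(a);
        ww_mutex_unlock(b);
        ww_acquire_fini(&ctx);
}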


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range
  2022-09-01 17:35 ` [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range Suren Baghdasaryan
@ 2022-09-06 16:09   ` Laurent Dufour
  0 siblings, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-06 16:09 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Mark VMA as locked before copying it and when copy_vma produces a new VMA.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c   | 1 +
>  mm/mremap.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ade3909c89b4..121544fd90de 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3248,6 +3248,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  			get_file(new_vma->vm_file);
>  		if (new_vma->vm_ops && new_vma->vm_ops->open)
>  			new_vma->vm_ops->open(new_vma);
> +		vma_mark_locked(new_vma);
>  		vma_link(mm, new_vma, prev, rb_link, rb_parent);
>  		*need_rmap_locks = false;
>  	}

Sounds good. In both cases the returned new_vma is locked, either in
copy_vma() or in vma_merge().

> diff --git a/mm/mremap.c b/mm/mremap.c
> index b522cd0259a0..bdbf96254e43 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -620,6 +620,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  			return -ENOMEM;
>  	}
>  
> +	vma_mark_locked(vma);
>  	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
>  	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
>  			   &need_rmap_locks);

Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it
  2022-09-06 13:46   ` Laurent Dufour
@ 2022-09-06 17:24     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-06 17:24 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 6, 2022 at 6:47 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Despite a minor comment below,
>
> Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

Thanks for the reviews Laurent! I'll need some time to double-check
all the VMA locking locations that you spotted as potentially
unnecessary. Admittedly I was a bit paranoid when writing this
patchset and trying not to miss any potential race, so some of them
might indeed be unnecessary. Will reply to each of your comments once
I confirm the need for locking in each case.
Thanks,
Suren.

>
> > ---
> >  include/linux/mm.h        | 78 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/mm_types.h  |  7 ++++
> >  include/linux/mmap_lock.h | 13 +++++++
> >  kernel/fork.c             |  4 ++
> >  mm/init-mm.c              |  3 ++
> >  5 files changed, 105 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7d322a979455..476bf936c5f0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -611,6 +611,83 @@ struct vm_operations_struct {
> >                                         unsigned long addr);
> >  };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > +     init_rwsem(&vma->lock);
> > +     vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_mark_locked(struct vm_area_struct *vma)
> > +{
> > +     int mm_lock_seq;
> > +
> > +     mmap_assert_write_locked(vma->vm_mm);
> > +
> > +     /*
> > +      * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +      * mm->mm_lock_seq can't be concurrently modified.
> > +      */
> > +     mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > +     if (vma->vm_lock_seq == mm_lock_seq)
> > +             return;
> > +
> > +     down_write(&vma->lock);
> > +     vma->vm_lock_seq = mm_lock_seq;
> > +     up_write(&vma->lock);
> > +}
> > +
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > +     if (unlikely(down_read_trylock(&vma->lock) == 0))
> > +             return false;
> > +
> > +     /*
> > +      * Overflow might produce false locked result but it's not critical.
>
> It might be good to spell out here that in the case of a false locked
> result, the caller is expected to fall back to read-locking the mm entirely
> before making its change to that VMA.

Ack.
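
The fallback in question is expected to look roughly like this from the
fault path's point of view (illustrative only: find_vma_under_rcu() is the
lookup helper mentioned elsewhere in this thread, and the real wiring in
the patchset may differ):

static vm_fault_t fault_sketch(struct mm_struct *mm, unsigned long addr,
                               unsigned int flags, struct pt_regs *regs)
{
        struct vm_area_struct *vma;
        vm_fault_t ret;

        rcu_read_lock();
        vma = find_vma_under_rcu(mm, addr);
        if (!vma || !vma_read_trylock(vma)) {
                /* Covers the "false locked" case: just take the slow path. */
                rcu_read_unlock();
                goto fallback;
        }
        rcu_read_unlock();

        ret = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
        vma_read_unlock(vma);
        if (!(ret & VM_FAULT_RETRY))
                return ret;

fallback:
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        ret = vma ? handle_mm_fault(vma, addr, flags, regs) : VM_FAULT_SIGSEGV;
        mmap_read_unlock(mm);
        return ret;
}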

>
> > +      * False unlocked result is critical but is impossible because we
> > +      * modify and check vma->vm_lock_seq under vma->lock protection and
> > +      * mm->mm_lock_seq modification invalidates all existing locks.
> > +      */
> > +     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> > +             up_read(&vma->lock);
> > +             return false;
> > +     }
> > +     return true;
> > +}
> > +
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +     up_read(&vma->lock);
> > +}
> > +
> > +static inline void vma_assert_locked(struct vm_area_struct *vma)
> > +{
> > +     lockdep_assert_held(&vma->lock);
> > +     VM_BUG_ON_VMA(!rwsem_is_locked(&vma->lock), vma);
> > +}
> > +
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
> > +{
> > +     mmap_assert_write_locked(vma->vm_mm);
> > +     /*
> > +      * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +      * mm->mm_lock_seq can't be concurrently modified.
> > +      */
> > +     VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > +static inline void vma_mark_locked(struct vm_area_struct *vma) {}
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +             { return false; }
> > +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_locked(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  {
> >       static const struct vm_operations_struct dummy_vm_ops = {};
> > @@ -619,6 +696,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +     vma_init_lock(vma);
> >  }
> >
> >  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index bed25ef7c994..6a03f59c1e78 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -486,6 +486,10 @@ struct vm_area_struct {
> >       struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
> >  #endif
> >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     struct rw_semaphore lock;
> > +     int vm_lock_seq;
> > +#endif
> >  } __randomize_layout;
> >
> >  struct kioctx_table;
> > @@ -567,6 +571,9 @@ struct mm_struct {
> >                                         * init_mm.mmlist, and are protected
> >                                         * by mmlist_lock
> >                                         */
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +             int mm_lock_seq;
> > +#endif
> >
> >
> >               unsigned long hiwater_rss; /* High-watermark of RSS usage */
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index e49ba91bb1f0..a391ae226564 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
> >       VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >  }
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_mark_unlocked_all(struct mm_struct *mm)
> > +{
> > +     mmap_assert_write_locked(mm);
> > +     /* No races during update due to exclusive mmap_lock being held */
> > +     WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +}
> > +#else
> > +static inline void vma_mark_unlocked_all(struct mm_struct *mm) {}
> > +#endif
> > +
> >  static inline void mmap_init_lock(struct mm_struct *mm)
> >  {
> >       init_rwsem(&mm->mmap_lock);
> > @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
> >  static inline void mmap_write_unlock(struct mm_struct *mm)
> >  {
> >       __mmap_lock_trace_released(mm, true);
> > +     vma_mark_unlocked_all(mm);
> >       up_write(&mm->mmap_lock);
> >  }
> >
> >  static inline void mmap_write_downgrade(struct mm_struct *mm)
> >  {
> >       __mmap_lock_trace_acquire_returned(mm, false, true);
> > +     vma_mark_unlocked_all(mm);
> >       downgrade_write(&mm->mmap_lock);
> >  }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 614872438393..bfab31ecd11e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -475,6 +475,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >                */
> >               *new = data_race(*orig);
> >               INIT_LIST_HEAD(&new->anon_vma_chain);
> > +             vma_init_lock(new);
> >               new->vm_next = new->vm_prev = NULL;
> >               dup_anon_vma_name(orig, new);
> >       }
> > @@ -1130,6 +1131,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       seqcount_init(&mm->write_protect_seq);
> >       mmap_init_lock(mm);
> >       INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
> >       mm_pgtables_bytes_init(mm);
> >       mm->map_count = 0;
> >       mm->locked_vm = 0;
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index fbe7844d0912..8399f90d631c 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
> >       .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> >       .arg_lock       =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> >       .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     .mm_lock_seq    = 0,
> > +#endif
> >       .user_ns        = &init_user_ns,
> >       .cpu_bitmap     = CPU_BITS_NONE,
> >  #ifdef CONFIG_IOMMU_SVA
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified
  2022-09-06 14:26   ` Laurent Dufour
@ 2022-09-06 19:00     ` Suren Baghdasaryan
  2022-09-06 20:00       ` Liam Howlett
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-06 19:00 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 6, 2022 at 7:27 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > VMA flag modifications should be done under VMA lock to prevent concurrent
> > page fault handling in that area.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  fs/proc/task_mmu.c | 1 +
> >  fs/userfaultfd.c   | 6 ++++++
> >  mm/madvise.c       | 1 +
> >  mm/mlock.c         | 2 ++
> >  mm/mmap.c          | 1 +
> >  mm/mprotect.c      | 1 +
> >  6 files changed, 12 insertions(+)
>
> There are also a few changes done in driver space, for instance:
>
> *** arch/x86/kernel/cpu/sgx/driver.c:
> sgx_mmap[98]                   vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND |
> VM_DONTDUMP | VM_IO;
> *** arch/x86/kernel/cpu/sgx/virt.c:
> sgx_vepc_mmap[108]             vma->vm_flags |= VM_PFNMAP | VM_IO |
> VM_DONTDUMP | VM_DONTCOPY;
> *** drivers/dax/device.c:
> dax_mmap[311]                  vma->vm_flags |= VM_HUGEPAGE;
>
> I guess these changes to vm_flags should be protected as well, or to be
> checked one by one.

Thanks for noting these! I'll add necessary locking here and will look
for other places I might have missed.

>
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 4e0023643f8b..ceffa5c2c650 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1285,6 +1285,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >                       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >                               if (!(vma->vm_flags & VM_SOFTDIRTY))
> >                                       continue;
> > +                             vma_mark_locked(vma);
> >                               vma->vm_flags &= ~VM_SOFTDIRTY;
> >                               vma_set_page_prot(vma);
> >                       }
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 175de70e3adf..fe557b3d1c07 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -620,6 +620,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> >               mmap_write_lock(mm);
> >               for (vma = mm->mmap; vma; vma = vma->vm_next)
> >                       if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> > +                             vma_mark_locked(vma);
> >                               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> >                               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> >                       }
> > @@ -653,6 +654,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> >
> >       octx = vma->vm_userfaultfd_ctx.ctx;
> >       if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> > +             vma_mark_locked(vma);
> >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> >               return 0;
> > @@ -734,6 +736,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> >               atomic_inc(&ctx->mmap_changing);
> >       } else {
> >               /* Drop uffd context if remap feature not enabled */
> > +             vma_mark_locked(vma);
> >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> >       }
> > @@ -891,6 +894,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> >                       vma = prev;
> >               else
> >                       prev = vma;
> > +             vma_mark_locked(vma);
> >               vma->vm_flags = new_flags;
> >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> >       }
> > @@ -1449,6 +1453,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >                * the next vma was merged into the current one and
> >                * the current one has not been updated yet.
> >                */
> > +             vma_mark_locked(vma);
> >               vma->vm_flags = new_flags;
> >               vma->vm_userfaultfd_ctx.ctx = ctx;
> >
> > @@ -1630,6 +1635,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >                * the next vma was merged into the current one and
> >                * the current one has not been updated yet.
> >                */
> > +             vma_mark_locked(vma);
> >               vma->vm_flags = new_flags;
> >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 5f0f0948a50e..a173f0025abd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -181,6 +181,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> >       /*
> >        * vm_flags is protected by the mmap_lock held in write mode.
> >        */
> > +     vma_mark_locked(vma);
> >       vma->vm_flags = new_flags;
> >       if (!vma->vm_file) {
> >               error = replace_anon_vma_name(vma, anon_name);
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index b14e929084cc..f62e1a4d05f2 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -380,6 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> >        */
> >       if (newflags & VM_LOCKED)
> >               newflags |= VM_IO;
> > +     vma_mark_locked(vma);
> >       WRITE_ONCE(vma->vm_flags, newflags);
> >
> >       lru_add_drain();
> > @@ -456,6 +457,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >
> >       if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
> >               /* No work to do, and mlocking twice would be wrong */
> > +             vma_mark_locked(vma);
> >               vma->vm_flags = newflags;
> >       } else {
> >               mlock_vma_pages_range(vma, start, end, newflags);
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 693e6776be39..f89c9b058105 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1818,6 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  out:
> >       perf_event_mmap(vma);
> >
> > +     vma_mark_locked(vma);
> >       vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> >       if (vm_flags & VM_LOCKED) {
> >               if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>
> I guess this doesn't really have an impact, but the call to vma_mark_locked(vma)
> could be made only in the case where the vm_flags field is touched.
> Something like this:
>
>         vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
>         if (vm_flags & VM_LOCKED) {
>                 if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>                                         is_vm_hugetlb_page(vma) ||
> -                                       vma == get_gate_vma(current->mm))
> +                                       vma == get_gate_vma(current->mm)) {
> +                       vma_mark_locked(vma);
>                         vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> -               else
> +               } else
>                         mm->locked_vm += (len >> PAGE_SHIFT);
>         }
>
>
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index bc6bddd156ca..df47fc21b0e4 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -621,6 +621,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >        * vm_flags and vm_page_prot are protected by the mmap_lock
> >        * held in write mode.
> >        */
> > +     vma_mark_locked(vma);
> >       vma->vm_flags = newflags;
> >       /*
> >        * We want to check manually if we can change individual PTEs writable
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-01 17:35 ` [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock Suren Baghdasaryan
@ 2022-09-06 19:39   ` Peter Xu
  2022-09-06 20:08     ` Suren Baghdasaryan
  2022-09-09 14:26   ` Laurent Dufour
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Xu @ 2022-09-06 19:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/memory.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ac9944e8c62..29d2f49f922a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
>  
> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +		ret = VM_FAULT_RETRY;
> +		goto out;
> +	}
> +

May want to fail early similarly for handle_userfault() too for similar
reason.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified
  2022-09-06 19:00     ` Suren Baghdasaryan
@ 2022-09-06 20:00       ` Liam Howlett
  2022-09-06 20:13         ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Liam Howlett @ 2022-09-06 20:00 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Laurent Dufour, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Michal Hocko, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Davidlohr Bueso, Matthew Wilcox, Peter Zijlstra, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

* Suren Baghdasaryan <surenb@google.com> [220906 15:01]:
> On Tue, Sep 6, 2022 at 7:27 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
> >
> > On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > > VMA flag modifications should be done under VMA lock to prevent concurrent
> > > page fault handling in that area.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  fs/proc/task_mmu.c | 1 +
> > >  fs/userfaultfd.c   | 6 ++++++
> > >  mm/madvise.c       | 1 +
> > >  mm/mlock.c         | 2 ++
> > >  mm/mmap.c          | 1 +
> > >  mm/mprotect.c      | 1 +
> > >  6 files changed, 12 insertions(+)
> >
> > There are few changes also done in the driver's space, for instance:
> >
> > *** arch/x86/kernel/cpu/sgx/driver.c:
> > sgx_mmap[98]                   vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND |
> > VM_DONTDUMP | VM_IO;
> > *** arch/x86/kernel/cpu/sgx/virt.c:
> > sgx_vepc_mmap[108]             vma->vm_flags |= VM_PFNMAP | VM_IO |
> > VM_DONTDUMP | VM_DONTCOPY;
> > *** drivers/dax/device.c:
> > dax_mmap[311]                  vma->vm_flags |= VM_HUGEPAGE;
> >
> > I guess these changes to vm_flags should be protected as well, or to be
> > checked one by one.
> 
> Thanks for noting these! I'll add necessary locking here and will look
> for other places I might have missed.

Would an inline set/clear bit function be worthwhile for vm_flags?  If
it is, then a name change to vm_flags may get the compiler to catch any
missed cases.  There don't seem to be many cases (12 inserts), so maybe
not.

> 
> >
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 4e0023643f8b..ceffa5c2c650 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1285,6 +1285,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > >                       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > >                               if (!(vma->vm_flags & VM_SOFTDIRTY))
> > >                                       continue;
> > > +                             vma_mark_locked(vma);
> > >                               vma->vm_flags &= ~VM_SOFTDIRTY;
> > >                               vma_set_page_prot(vma);
> > >                       }
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 175de70e3adf..fe557b3d1c07 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -620,6 +620,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> > >               mmap_write_lock(mm);
> > >               for (vma = mm->mmap; vma; vma = vma->vm_next)
> > >                       if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> > > +                             vma_mark_locked(vma);
> > >                               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > >                               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > >                       }
> > > @@ -653,6 +654,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> > >
> > >       octx = vma->vm_userfaultfd_ctx.ctx;
> > >       if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> > > +             vma_mark_locked(vma);
> > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > >               return 0;
> > > @@ -734,6 +736,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> > >               atomic_inc(&ctx->mmap_changing);
> > >       } else {
> > >               /* Drop uffd context if remap feature not enabled */
> > > +             vma_mark_locked(vma);
> > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > >       }
> > > @@ -891,6 +894,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> > >                       vma = prev;
> > >               else
> > >                       prev = vma;
> > > +             vma_mark_locked(vma);
> > >               vma->vm_flags = new_flags;
> > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > >       }
> > > @@ -1449,6 +1453,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > >                * the next vma was merged into the current one and
> > >                * the current one has not been updated yet.
> > >                */
> > > +             vma_mark_locked(vma);
> > >               vma->vm_flags = new_flags;
> > >               vma->vm_userfaultfd_ctx.ctx = ctx;
> > >
> > > @@ -1630,6 +1635,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > >                * the next vma was merged into the current one and
> > >                * the current one has not been updated yet.
> > >                */
> > > +             vma_mark_locked(vma);
> > >               vma->vm_flags = new_flags;
> > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 5f0f0948a50e..a173f0025abd 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -181,6 +181,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> > >       /*
> > >        * vm_flags is protected by the mmap_lock held in write mode.
> > >        */
> > > +     vma_mark_locked(vma);
> > >       vma->vm_flags = new_flags;
> > >       if (!vma->vm_file) {
> > >               error = replace_anon_vma_name(vma, anon_name);
> > > diff --git a/mm/mlock.c b/mm/mlock.c
> > > index b14e929084cc..f62e1a4d05f2 100644
> > > --- a/mm/mlock.c
> > > +++ b/mm/mlock.c
> > > @@ -380,6 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > >        */
> > >       if (newflags & VM_LOCKED)
> > >               newflags |= VM_IO;
> > > +     vma_mark_locked(vma);
> > >       WRITE_ONCE(vma->vm_flags, newflags);
> > >
> > >       lru_add_drain();
> > > @@ -456,6 +457,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >
> > >       if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
> > >               /* No work to do, and mlocking twice would be wrong */
> > > +             vma_mark_locked(vma);
> > >               vma->vm_flags = newflags;
> > >       } else {
> > >               mlock_vma_pages_range(vma, start, end, newflags);
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 693e6776be39..f89c9b058105 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1818,6 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  out:
> > >       perf_event_mmap(vma);
> > >
> > > +     vma_mark_locked(vma);
> > >       vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > >       if (vm_flags & VM_LOCKED) {
> > >               if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> >
> > I guess, this doesn't really impact, but the call to vma_mark_locked(vma)
> > may be done only in the case the vm_flags field is touched.
> > Something like this:
> >
> >         vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> >         if (vm_flags & VM_LOCKED) {
> >                 if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> >                                         is_vm_hugetlb_page(vma) ||
> > -                                       vma == get_gate_vma(current->mm))
> > +                                       vma == get_gate_vma(current->mm)) {
> > +                       vma_mark_locked(vma);
> >                         vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> > -               else
> > +               } else
> >                         mm->locked_vm += (len >> PAGE_SHIFT);
> >         }
> >
> >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index bc6bddd156ca..df47fc21b0e4 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -621,6 +621,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > >        * vm_flags and vm_page_prot are protected by the mmap_lock
> > >        * held in write mode.
> > >        */
> > > +     vma_mark_locked(vma);
> > >       vma->vm_flags = newflags;
> > >       /*
> > >        * We want to check manually if we can change individual PTEs writable
> >

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-06 19:39   ` Peter Xu
@ 2022-09-06 20:08     ` Suren Baghdasaryan
  2022-09-06 20:22       ` Peter Xu
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-06 20:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 6, 2022 at 12:39 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> > Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> > handling under VMA lock and retry holding mmap_lock. This can be handled
> > more gracefully in the future.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/memory.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ac9944e8c62..29d2f49f922a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> >
> > +     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +             ret = VM_FAULT_RETRY;
> > +             goto out;
> > +     }
> > +
>
> May want to fail early similarly for handle_userfault() too for similar
> reason.  Thanks,

I wasn't aware of a similar issue there. Will have a closer look. Thanks!

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified
  2022-09-06 20:00       ` Liam Howlett
@ 2022-09-06 20:13         ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-06 20:13 UTC (permalink / raw)
  To: Liam Howlett
  Cc: Laurent Dufour, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Michal Hocko, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	Davidlohr Bueso, Matthew Wilcox, Peter Zijlstra, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 6, 2022 at 1:00 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [220906 15:01]:
> > On Tue, Sep 6, 2022 at 7:27 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
> > >
> > > On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > > > VMA flag modifications should be done under VMA lock to prevent concurrent
> > > > page fault handling in that area.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  fs/proc/task_mmu.c | 1 +
> > > >  fs/userfaultfd.c   | 6 ++++++
> > > >  mm/madvise.c       | 1 +
> > > >  mm/mlock.c         | 2 ++
> > > >  mm/mmap.c          | 1 +
> > > >  mm/mprotect.c      | 1 +
> > > >  6 files changed, 12 insertions(+)
> > >
> > > There are few changes also done in the driver's space, for instance:
> > >
> > > *** arch/x86/kernel/cpu/sgx/driver.c:
> > > sgx_mmap[98]                   vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND |
> > > VM_DONTDUMP | VM_IO;
> > > *** arch/x86/kernel/cpu/sgx/virt.c:
> > > sgx_vepc_mmap[108]             vma->vm_flags |= VM_PFNMAP | VM_IO |
> > > VM_DONTDUMP | VM_DONTCOPY;
> > > *** drivers/dax/device.c:
> > > dax_mmap[311]                  vma->vm_flags |= VM_HUGEPAGE;
> > >
> > > I guess these changes to vm_flags should be protected as well, or to be
> > > checked one by one.
> >
> > Thanks for noting these! I'll add necessary locking here and will look
> > for other places I might have missed.
>
> Would an inline set/clear bit function be worth while for vm_flags?  If
> it is then a name change to vm_flags may get the compiler to catch any
> missed cases.  There doesn't seem to be many cases (12 inserts) so maybe
> not.

That would probably simplify maintaining the flags in the future, and
we can add vma_mark_locked() directly in the set/clear functions.
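
Something along these lines, for example (the helper names are made up
here; vma_mark_locked() is the helper from patch 05/28):

static inline void vm_flags_set(struct vm_area_struct *vma,
                                unsigned long flags)
{
        vma_mark_locked(vma);
        vma->vm_flags |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma,
                                  unsigned long flags)
{
        vma_mark_locked(vma);
        vma->vm_flags &= ~flags;
}

Renaming the field itself (say to __vm_flags) would then turn any direct
"vma->vm_flags |= ..." left behind in drivers into a build error, which is
the compiler check mentioned above.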

>
> >
> > >
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 4e0023643f8b..ceffa5c2c650 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -1285,6 +1285,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > >                       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > > >                               if (!(vma->vm_flags & VM_SOFTDIRTY))
> > > >                                       continue;
> > > > +                             vma_mark_locked(vma);
> > > >                               vma->vm_flags &= ~VM_SOFTDIRTY;
> > > >                               vma_set_page_prot(vma);
> > > >                       }
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index 175de70e3adf..fe557b3d1c07 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -620,6 +620,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> > > >               mmap_write_lock(mm);
> > > >               for (vma = mm->mmap; vma; vma = vma->vm_next)
> > > >                       if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> > > > +                             vma_mark_locked(vma);
> > > >                               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > >                               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > >                       }
> > > > @@ -653,6 +654,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> > > >
> > > >       octx = vma->vm_userfaultfd_ctx.ctx;
> > > >       if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > >               return 0;
> > > > @@ -734,6 +736,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> > > >               atomic_inc(&ctx->mmap_changing);
> > > >       } else {
> > > >               /* Drop uffd context if remap feature not enabled */
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > >               vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > >       }
> > > > @@ -891,6 +894,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> > > >                       vma = prev;
> > > >               else
> > > >                       prev = vma;
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_flags = new_flags;
> > > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > >       }
> > > > @@ -1449,6 +1453,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > >                * the next vma was merged into the current one and
> > > >                * the current one has not been updated yet.
> > > >                */
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_flags = new_flags;
> > > >               vma->vm_userfaultfd_ctx.ctx = ctx;
> > > >
> > > > @@ -1630,6 +1635,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > >                * the next vma was merged into the current one and
> > > >                * the current one has not been updated yet.
> > > >                */
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_flags = new_flags;
> > > >               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 5f0f0948a50e..a173f0025abd 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -181,6 +181,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> > > >       /*
> > > >        * vm_flags is protected by the mmap_lock held in write mode.
> > > >        */
> > > > +     vma_mark_locked(vma);
> > > >       vma->vm_flags = new_flags;
> > > >       if (!vma->vm_file) {
> > > >               error = replace_anon_vma_name(vma, anon_name);
> > > > diff --git a/mm/mlock.c b/mm/mlock.c
> > > > index b14e929084cc..f62e1a4d05f2 100644
> > > > --- a/mm/mlock.c
> > > > +++ b/mm/mlock.c
> > > > @@ -380,6 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > > >        */
> > > >       if (newflags & VM_LOCKED)
> > > >               newflags |= VM_IO;
> > > > +     vma_mark_locked(vma);
> > > >       WRITE_ONCE(vma->vm_flags, newflags);
> > > >
> > > >       lru_add_drain();
> > > > @@ -456,6 +457,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > >
> > > >       if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
> > > >               /* No work to do, and mlocking twice would be wrong */
> > > > +             vma_mark_locked(vma);
> > > >               vma->vm_flags = newflags;
> > > >       } else {
> > > >               mlock_vma_pages_range(vma, start, end, newflags);
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 693e6776be39..f89c9b058105 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1818,6 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >  out:
> > > >       perf_event_mmap(vma);
> > > >
> > > > +     vma_mark_locked(vma);
> > > >       vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > > >       if (vm_flags & VM_LOCKED) {
> > > >               if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > >
> > > I guess, this doesn't really impact, but the call to vma_mark_locked(vma)
> > > may be done only in the case the vm_flags field is touched.
> > > Something like this:
> > >
> > >         vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > >         if (vm_flags & VM_LOCKED) {
> > >                 if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > >                                         is_vm_hugetlb_page(vma) ||
> > > -                                       vma == get_gate_vma(current->mm))
> > > +                                       vma == get_gate_vma(current->mm)) {
> > > +                       vma_mark_locked(vma);
> > >                         vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> > > -               else
> > > +               } else
> > >                         mm->locked_vm += (len >> PAGE_SHIFT);
> > >         }
> > >
> > >
> > > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > > index bc6bddd156ca..df47fc21b0e4 100644
> > > > --- a/mm/mprotect.c
> > > > +++ b/mm/mprotect.c
> > > > @@ -621,6 +621,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > >        * vm_flags and vm_page_prot are protected by the mmap_lock
> > > >        * held in write mode.
> > > >        */
> > > > +     vma_mark_locked(vma);
> > > >       vma->vm_flags = newflags;
> > > >       /*
> > > >        * We want to check manually if we can change individual PTEs writable
> > >
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-06 20:08     ` Suren Baghdasaryan
@ 2022-09-06 20:22       ` Peter Xu
  2022-09-07  0:58         ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Xu @ 2022-09-06 20:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 06, 2022 at 01:08:10PM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 6, 2022 at 12:39 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> > > Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/memory.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9ac9944e8c62..29d2f49f922a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >       vm_fault_t ret = 0;
> > >       void *shadow = NULL;
> > >
> > > +     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > +             ret = VM_FAULT_RETRY;
> > > +             goto out;
> > > +     }
> > > +
> >
> > May want to fail early similarly for handle_userfault() too for similar
> > reason.  Thanks,
> 
> I wasn't aware of a similar issue there. Will have a closer look. Thanks!

Sure.

Just in case this is helpful - handle_userfault() will both
assert at entry (mmap_assert_locked) and will in most cases release the
read lock along the way when waiting for page fault resolution.

And userfaultfd should work on anonymous memory for either missing mode or
write protect mode.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-06 20:22       ` Peter Xu
@ 2022-09-07  0:58         ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-07  0:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, Kent Overstreet, David Rientjes,
	Axel Rasmussen, Joel Fernandes, Minchan Kim, kernel-team,
	linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On Tue, Sep 6, 2022 at 1:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Sep 06, 2022 at 01:08:10PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Sep 6, 2022 at 12:39 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> > > > Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> > > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > > more gracefully in the future.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  mm/memory.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 9ac9944e8c62..29d2f49f922a 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >       vm_fault_t ret = 0;
> > > >       void *shadow = NULL;
> > > >
> > > > +     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > +             ret = VM_FAULT_RETRY;
> > > > +             goto out;
> > > > +     }
> > > > +
> > >
> > > May want to fail early similarly for handle_userfault() too for similar
> > > reason.  Thanks,
> >
> > I wasn't aware of a similar issue there. Will have a closer look. Thanks!
>
> Sure.
>
> Just in case this would be anything helpful - handle_userfault() will both
> assert at the entry (mmap_assert_locked) and will in most cases release
> read lock along the way when waiting for page fault resolutions.
>
> And userfaultfd should work on anonymous memory for either missing mode or
> write protect mode.

Got it. Thanks for the explanation. It definitely helps!
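
For the record, the early bail-out being suggested would presumably mirror
the do_swap_page() change in this patch, i.e. something like the following
at the very top of handle_userfault(), before it asserts or drops mmap_lock
(untested, not part of the posted series):

        if (vmf->flags & FAULT_FLAG_VMA_LOCK)
                return VM_FAULT_RETRY;

The arch fault handler would then retry the fault under mmap_lock, exactly
as with the do_swap_page() bail-out above.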

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork
  2022-09-06 14:37   ` Laurent Dufour
@ 2022-09-08 23:57     ` Suren Baghdasaryan
  2022-09-09 13:27       ` Laurent Dufour
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-08 23:57 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Tue, Sep 6, 2022 at 7:38 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > Protect VMAs from concurrent page fault handler while performing
> > copy_page_range for VMAs not having the VM_WIPEONFORK flag set.
>
> I'm wondering why that is necessary.
> The copied mm is write-locked, and the destination one is not reachable.
> If any other readers are using the VMA, it is only for page fault handling.

Correct, this is done to prevent page faulting in the VMA being
duplicated. I assume we want to prevent the pages in that VMA from
changing when we are calling copy_page_range(). Am I wrong?

> I must have missed something because I can't see any need to mark the
> VMA locked here.
>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  kernel/fork.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index bfab31ecd11e..1872ad549fed 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -709,8 +709,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >               rb_parent = &tmp->vm_rb;
> >
> >               mm->map_count++;
> > -             if (!(tmp->vm_flags & VM_WIPEONFORK))
> > +             if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> > +                     vma_mark_locked(mpnt);
> >                       retval = copy_page_range(tmp, mpnt);
> > +             }
> >
> >               if (tmp->vm_ops && tmp->vm_ops->open)
> >                       tmp->vm_ops->open(tmp);
>


* Re: [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage
  2022-09-06 14:43   ` Laurent Dufour
@ 2022-09-09  0:15     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09  0:15 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Tue, Sep 6, 2022 at 7:43 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while modifying it in
> > collapse_huge_page.
>
> Is the goal to protect changes in the anon_vma structure?
>
> AFAICS, the vma itself is not impacted here, only the anon_vma and the
> PMD/PTE are touched, and they have their own protection mechanisms, don't they?

Yes, I think you are right about not needing to lock the VMA here as all
modified components are already protected. Thanks!

>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/khugepaged.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 01f71786d530..030680633989 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1072,6 +1072,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >       if (mm_find_pmd(mm, address) != pmd)
> >               goto out_up_write;
> >
> > +     vma_mark_locked(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> >       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
>


* Re: [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy
  2022-09-06 14:47   ` Laurent Dufour
@ 2022-09-09  0:27     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09  0:27 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Tue, Sep 6, 2022 at 7:48 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while performing VMA
> > protection policy changes.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mempolicy.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index b73d3248d976..6be1e5c75556 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -383,8 +383,10 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
> >       struct vm_area_struct *vma;
> >
> >       mmap_write_lock(mm);
> > -     for (vma = mm->mmap; vma; vma = vma->vm_next)
> > +     for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > +             vma_mark_locked(vma);
> >               mpol_rebind_policy(vma->vm_policy, new);
> > +     }
> >       mmap_write_unlock(mm);
> >  }
> >
> > @@ -632,6 +634,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
> >       struct mmu_gather tlb;
> >       int nr_updated;
> >
> > +     vma_mark_locked(vma);
>
> If I understand that correctly, the VMA itself is not impacted, only the
> PMDs/PTEs, and they are protected using the page table locks.
>
> Am I missing something?

I thought we would not want page faults in the VMA for which we are
changing the protection. However, I think what you are saying is that
page table locks would already provide a more granular synchronization
with page fault handlers, which makes sense to me. Sounds like we can
skip locking the VMA here as well. Nice!

>
> >       tlb_gather_mmu(&tlb, vma->vm_mm);
> >
> >       nr_updated = change_protection(&tlb, vma, addr, end, PAGE_NONE,
> > @@ -765,6 +768,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
> >       if (IS_ERR(new))
> >               return PTR_ERR(new);
> >
> > +     vma_mark_locked(vma);
> >       if (vma->vm_ops && vma->vm_ops->set_policy) {
> >               err = vma->vm_ops->set_policy(vma, new);
> >               if (err)
>


* Re: [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust
  2022-09-06 15:35   ` Laurent Dufour
@ 2022-09-09  0:51     ` Suren Baghdasaryan
  2022-09-09 15:52       ` Laurent Dufour
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09  0:51 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Tue, Sep 6, 2022 at 8:35 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> > vma_adjust modifies a VMA and possibly its neighbors. Mark them as locked
> > before making the modifications.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f89c9b058105..ed58cf0689b2 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -710,6 +710,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >       long adjust_next = 0;
> >       int remove_next = 0;
> >
> > +     vma_mark_locked(vma);
> > +     if (next)
> > +             vma_mark_locked(next);
> > +
>
> I was wondering if the VMAs insert and expand should be locked too.
>
> For expand, I can't see any valid reason, but for insert, I'm puzzled.
> I would think that it is better to lock the VMA to be inserted but I can't
> really justify that.
>
> It may be nice to detail why there is no need to lock insert and expand here.

'expand' is always locked before it's passed to __vma_adjust() by
vma_merge(). It has to be locked before we decide "Can it merge with
the predecessor?" here
https://elixir.bootlin.com/linux/latest/source/mm/mmap.c#L1201 because
a change in VMA can affect that decision. I spent many hours tracking
the issue caused by not locking the VMA before making this decision.
It might be good to add a comment about this...
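
To illustrate the ordering (simplified sketch; can_merge_with_prev() is
just a stand-in for the real can_vma_merge_after() check in vma_merge()):

	vma_mark_locked(prev);			/* lock the candidate first */
	if (can_merge_with_prev(prev, vma))
		expand = prev;			/* __vma_adjust() then always
						 * sees an already-locked
						 * 'expand' */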

AFAICT 'insert' is only used by __split_vma() and it's always a brand
new VMA which is not yet linked into mm->mmap. Any reason
__vma_adjust() should lock it?

>
> >       if (next && !insert) {
> >               struct vm_area_struct *exporter = NULL, *importer = NULL;
> >
> > @@ -754,8 +758,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >                        * If next doesn't have anon_vma, import from vma after
> >                        * next, if the vma overlaps with it.
> >                        */
> > -                     if (remove_next == 2 && !next->anon_vma)
> > +                     if (remove_next == 2 && !next->anon_vma) {
> >                               exporter = next->vm_next;
> > +                             if (exporter)
> > +                                     vma_mark_locked(exporter);
> > +                     }
> >
> >               } else if (end > next->vm_start) {
> >                       /*
> > @@ -931,6 +938,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >                        * "vma->vm_next" gap must be updated.
> >                        */
> >                       next = vma->vm_next;
> > +                     if (next)
> > +                             vma_mark_locked(next);
> >               } else {
> >                       /*
> >                        * For the scope of the comment "next" and
>


* Re: [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range
  2022-09-01 17:35 ` [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range Suren Baghdasaryan
@ 2022-09-09 10:33   ` Laurent Dufour
  2022-09-09 16:43     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 10:33 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> free_pgtables and unmap_page_range functions can be called with mmap_lock
> held for write (e.g. in mmap_region), held for read (e.g. in
> madvise_pageout) or not held at all (e.g. madvise_remove might
> drop mmap_lock before calling vfs_fallocate, which ends up calling
> unmap_page_range).
> Provide free_pgtables and unmap_page_range with additional argument
> indicating whether to mark the VMA as locked or not based on the usage.
> The parameter is set based on whether mmap_lock is held in write mode
> during the call. This ensures no change in behavior between mmap_lock
> and per-vma locks.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h |  2 +-
>  mm/internal.h      |  4 ++--
>  mm/memory.c        | 32 +++++++++++++++++++++-----------
>  mm/mmap.c          | 17 +++++++++--------
>  mm/oom_kill.c      |  3 ++-
>  5 files changed, 35 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 476bf936c5f0..dc72be923e5b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1874,7 +1874,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>  		    unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> -		unsigned long start, unsigned long end);
> +		unsigned long start, unsigned long end, bool lock_vma);
>  
>  struct mmu_notifier_range;
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 785409805ed7..e6c0f999e0cb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -85,14 +85,14 @@ bool __folio_end_writeback(struct folio *folio);
>  void deactivate_file_folio(struct folio *folio);
>  
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> -		unsigned long floor, unsigned long ceiling);
> +		unsigned long floor, unsigned long ceiling, bool lock_vma);
>  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
>  
>  struct zap_details;
>  void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
> -			     struct zap_details *details);
> +			     struct zap_details *details, bool lock_vma);
>  
>  void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
>  		unsigned int order);
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ba73f5aa8bb..9ac9944e8c62 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -403,7 +403,7 @@ void free_pgd_range(struct mmu_gather *tlb,
>  }
>  
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> -		unsigned long floor, unsigned long ceiling)
> +		unsigned long floor, unsigned long ceiling, bool lock_vma)
>  {
>  	while (vma) {
>  		struct vm_area_struct *next = vma->vm_next;
> @@ -413,6 +413,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 * Hide vma from rmap and truncate_pagecache before freeing
>  		 * pgtables
>  		 */
> +		if (lock_vma)
> +			vma_mark_locked(vma);
>  		unlink_anon_vmas(vma);
>  		unlink_file_vma(vma);
>  
> @@ -427,6 +429,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			       && !is_vm_hugetlb_page(next)) {
>  				vma = next;
>  				next = vma->vm_next;
> +				if (lock_vma)
> +					vma_mark_locked(vma);
>  				unlink_anon_vmas(vma);
>  				unlink_file_vma(vma);
>  			}
> @@ -1631,12 +1635,16 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
>  void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
> -			     struct zap_details *details)
> +			     struct zap_details *details,
> +			     bool lock_vma)
>  {
>  	pgd_t *pgd;
>  	unsigned long next;
>  
>  	BUG_ON(addr >= end);
> +	if (lock_vma)
> +		vma_mark_locked(vma);

I'm wondering if that is really needed here.
The following processing only deals with the page table entries.
Today, if this can be called without holding the mmap_lock, it should be
safe not to mark the VMA locked (indeed the VMA itself is not impacted).

Thus unmap_single_vma() below would not need to be touched, nor would its callers.

If locking is required here, I think there is a real potential issue
in the current kernel.

> +
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
> @@ -1652,7 +1660,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  static void unmap_single_vma(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		unsigned long end_addr,
> -		struct zap_details *details)
> +		struct zap_details *details, bool lock_vma)
>  {
>  	unsigned long start = max(vma->vm_start, start_addr);
>  	unsigned long end;
> @@ -1691,7 +1699,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  				i_mmap_unlock_write(vma->vm_file->f_mapping);
>  			}
>  		} else
> -			unmap_page_range(tlb, vma, start, end, details);
> +			unmap_page_range(tlb, vma, start, end, details, lock_vma);
>  	}
>  }
>  
> @@ -1715,7 +1723,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   */
>  void unmap_vmas(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long start_addr,
> -		unsigned long end_addr)
> +		unsigned long end_addr, bool lock_vma)
>  {
>  	struct mmu_notifier_range range;
>  	struct zap_details details = {
> @@ -1728,7 +1736,8 @@ void unmap_vmas(struct mmu_gather *tlb,
>  				start_addr, end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> -		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> +		unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
> +				 lock_vma);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>  
> @@ -1753,7 +1762,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
>  	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
> -		unmap_single_vma(&tlb, vma, start, range.end, NULL);
> +		unmap_single_vma(&tlb, vma, start, range.end, NULL, false);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_finish_mmu(&tlb);
>  }
> @@ -1768,7 +1777,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>   * The range must fit into one VMA.
>   */
>  static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> -		unsigned long size, struct zap_details *details)
> +		unsigned long size, struct zap_details *details, bool lock_vma)
>  {
>  	struct mmu_notifier_range range;
>  	struct mmu_gather tlb;
> @@ -1779,7 +1788,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
> -	unmap_single_vma(&tlb, vma, address, range.end, details);
> +	unmap_single_vma(&tlb, vma, address, range.end, details, lock_vma);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_finish_mmu(&tlb);
>  }
> @@ -1802,7 +1811,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  	    		!(vma->vm_flags & VM_PFNMAP))
>  		return;
>  
> -	zap_page_range_single(vma, address, size, NULL);
> +	zap_page_range_single(vma, address, size, NULL, true);
>  }
>  EXPORT_SYMBOL_GPL(zap_vma_ptes);
>  
> @@ -3483,7 +3492,8 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
>  		unsigned long start_addr, unsigned long end_addr,
>  		struct zap_details *details)
>  {
> -	zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
> +	zap_page_range_single(vma, start_addr, end_addr - start_addr, details,
> +			      false);
>  }
>  
>  static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 121544fd90de..094678b4434b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -79,7 +79,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
>  
>  static void unmap_region(struct mm_struct *mm,
>  		struct vm_area_struct *vma, struct vm_area_struct *prev,
> -		unsigned long start, unsigned long end);
> +		unsigned long start, unsigned long end, bool lock_vma);
>  
>  static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
>  {
> @@ -1866,7 +1866,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vma->vm_file = NULL;
>  
>  	/* Undo any partial mapping done by a device driver. */
> -	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
> +	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end, true);
>  	if (vm_flags & VM_SHARED)
>  		mapping_unmap_writable(file->f_mapping);
>  free_vma: 
> @@ -2626,7 +2626,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
>   */
>  static void unmap_region(struct mm_struct *mm,
>  		struct vm_area_struct *vma, struct vm_area_struct *prev,
> -		unsigned long start, unsigned long end)
> +		unsigned long start, unsigned long end, bool lock_vma)
>  {
>  	struct vm_area_struct *next = vma_next(mm, prev);
>  	struct mmu_gather tlb;
> @@ -2634,9 +2634,10 @@ static void unmap_region(struct mm_struct *mm,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm);
>  	update_hiwater_rss(mm);
> -	unmap_vmas(&tlb, vma, start, end);
> +	unmap_vmas(&tlb, vma, start, end, lock_vma);
>  	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> -				 next ? next->vm_start : USER_PGTABLES_CEILING);
> +				 next ? next->vm_start : USER_PGTABLES_CEILING,
> +				 lock_vma);
>  	tlb_finish_mmu(&tlb);
>  }
>  
> @@ -2849,7 +2850,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	if (downgrade)
>  		mmap_write_downgrade(mm);
>  
> -	unmap_region(mm, vma, prev, start, end);
> +	unmap_region(mm, vma, prev, start, end, !downgrade);
>  
>  	/* Fix up all other VM information */
>  	remove_vma_list(mm, vma);
> @@ -3129,8 +3130,8 @@ void exit_mmap(struct mm_struct *mm)
>  	tlb_gather_mmu_fullmm(&tlb, mm);
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
> -	unmap_vmas(&tlb, vma, 0, -1);
> -	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> +	unmap_vmas(&tlb, vma, 0, -1, true);
> +	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING, true);
>  	tlb_finish_mmu(&tlb);
>  
>  	/* Walk the list again, actually closing and freeing it. */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3c6cf9e3cd66..6ffa7c511aa3 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  				ret = false;
>  				continue;
>  			}
> -			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
> +			unmap_page_range(&tlb, vma, range.start, range.end,
> +					 NULL, false);
>  			mmu_notifier_invalidate_range_end(&range);
>  			tlb_finish_mmu(&tlb);
>  		}

I'm wondering if the VMA locking should be done here instead of inside
unmap_page_range(), which does not really touch the VMA's fields.

Here it would be needed because the page fault handler may check the
MMF_UNSTABLE flag and the VMA's lock before this loop is entered by
another thread.
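
i.e. something like this in __oom_reap_task_mm()'s VMA loop, keeping
unmap_page_range() itself untouched (untested sketch):

	/* Mark at the call site instead of inside unmap_page_range(). */
	vma_mark_locked(vma);
	unmap_page_range(&tlb, vma, range.start, range.end, NULL);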



* Re: [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork
  2022-09-08 23:57     ` Suren Baghdasaryan
@ 2022-09-09 13:27       ` Laurent Dufour
  2022-09-09 16:29         ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 13:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 09/09/2022 at 01:57, Suren Baghdasaryan wrote:
> On Tue, Sep 6, 2022 at 7:38 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>>
>> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
>>> Protect VMAs from concurrent page fault handler while performing
>>> copy_page_range for VMAs not having the VM_WIPEONFORK flag set.
>>
>> I'm wondering why that is necessary.
>> The copied mm is write-locked, and the destination one is not reachable.
>> If any other readers are using the VMA, this is only for page fault handling.
> 
> Correct, this is done to prevent page faulting in the VMA being
> duplicated. I assume we want to prevent the pages in that VMA from
> changing when we are calling copy_page_range(). Am I wrong?

If a page is faulted while copy_page_range() is in progress, the page may
not be backed on the child side (the PTE lock should protect the copy, shouldn't it?).
Is that a real problem? It will be backed later if accessed on the child side.
Maybe the per-process page accounting could be incorrect...

> 
>> I must have missed something because I can't see any need to mark the
>> VMA locked here.
>>
>>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>> ---
>>>  kernel/fork.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index bfab31ecd11e..1872ad549fed 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -709,8 +709,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>               rb_parent = &tmp->vm_rb;
>>>
>>>               mm->map_count++;
>>> -             if (!(tmp->vm_flags & VM_WIPEONFORK))
>>> +             if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>>> +                     vma_mark_locked(mpnt);
>>>                       retval = copy_page_range(tmp, mpnt);
>>> +             }
>>>
>>>               if (tmp->vm_ops && tmp->vm_ops->open)
>>>                       tmp->vm_ops->open(tmp);
>>



* Re: [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them
  2022-09-01 17:35 ` [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them Suren Baghdasaryan
@ 2022-09-09 13:35   ` Laurent Dufour
  2022-09-09 16:28     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 13:35 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Mark VMAs as locked before isolating them and clear their tree node so
> that isolated VMAs are easily identifiable. In the later patches page
> fault handlers will try locking the found VMA and will check whether
> the VMA was isolated. Locking VMAs before isolating them ensures that
> page fault handlers don't operate on isolated VMAs.

Found another place where the VMA should probably be marked locked:
*** drivers/gpu/drm/drm_vma_manager.c:
drm_vma_node_revoke[338]       rb_erase(&entry->vm_rb, &node->vm_files);

There are 2 other entries in nommu.c, but I guess this is not supported,
is it?


> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c  | 2 ++
>  mm/nommu.c | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 094678b4434b..b0d78bdc0de0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -421,12 +421,14 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
>  
>  static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
>  {
> +	vma_mark_locked(vma);
>  	/*
>  	 * Note rb_erase_augmented is a fairly large inline function,
>  	 * so make sure we instantiate it only once with our desired
>  	 * augmented rbtree callbacks.
>  	 */
>  	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> +	RB_CLEAR_NODE(&vma->vm_rb);
>  }
>  
>  static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> diff --git a/mm/nommu.c b/mm/nommu.c
> index e819cbc21b39..ff9933e57501 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -622,6 +622,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct task_struct *curr = current;
>  
> +	vma_mark_locked(vma);
>  	mm->map_count--;
>  	for (i = 0; i < VMACACHE_SIZE; i++) {
>  		/* if the vma is cached, invalidate the entire cache */
> @@ -644,6 +645,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
>  
>  	/* remove from the MM's tree and list */
>  	rb_erase(&vma->vm_rb, &mm->mm_rb);
> +	RB_CLEAR_NODE(&vma->vm_rb);
>  
>  	__vma_unlink_list(mm, vma);
>  }



* Re: [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area
  2022-09-01 17:35 ` [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area Suren Baghdasaryan
@ 2022-09-09 13:43   ` Laurent Dufour
  2022-09-09 16:25     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 13:43 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> While unmapping VMAs, adjacent VMAs might be able to grow into the area
> being unmapped. In such cases mark adjacent VMAs as locked to prevent
> this growth.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b0d78bdc0de0..b31cc97c2803 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2680,10 +2680,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * VM_GROWSUP VMA. Such VMAs can change their size under
>  	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
>  	 */
> -	if (vma && (vma->vm_flags & VM_GROWSDOWN))
> +	if (vma && (vma->vm_flags & VM_GROWSDOWN)) {
> +		vma_mark_locked(vma);
>  		return false;
> -	if (prev && (prev->vm_flags & VM_GROWSUP))
> +	}
> +	if (prev && (prev->vm_flags & VM_GROWSUP)) {
> +		vma_mark_locked(prev);
>  		return false;
> +	}
>  	return true;
>  }
>

That looks right to me.

But in addition to that, as in the previous patch, all the VMAs to be
detached from the tree in the loop above should be marked locked just
before calling vm_rb_erase().


* Re: [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction
  2022-09-01 17:35 ` [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
@ 2022-09-09 13:56   ` Laurent Dufour
  2022-09-09 16:19     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 13:56 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Assert there are no holders of VMA lock for reading when it is about to be
> destroyed.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h | 8 ++++++++
>  kernel/fork.c      | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dc72be923e5b..0d9c1563c354 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -676,6 +676,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
>  	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
>  }
>  
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> +{
> +	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> +		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> +		      vma);
> +}
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> @@ -685,6 +692,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
>  static inline void vma_read_unlock(struct vm_area_struct *vma) {}
>  static inline void vma_assert_locked(struct vm_area_struct *vma) {}
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
>  
>  #endif /* CONFIG_PER_VMA_LOCK */
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 1872ad549fed..b443ba3a247a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -487,6 +487,8 @@ static void __vm_area_free(struct rcu_head *head)
>  {
>  	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
>  						  vm_rcu);
> +	/* The vma should either have no lock holders or be write-locked. */
> +	vma_assert_no_reader(vma);

I'm wondering if this can be hit in the case where the thread freeing a VMA
is preempted before incrementing the mm seq count, like this:

VMA is about to be freed
write lock VMA
free vma -> call_rcu
..
<--- thread preempted
	rcu handler runs
	rcu calls __vm_area_free() <<<<<<
unlock mmap_lock and increase the mm seq count


>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>  #endif



* Re: [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2022-09-01 17:35 ` [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
@ 2022-09-09 14:20   ` Laurent Dufour
  2022-09-09 16:12     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 14:20 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Pagefault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to mark all VMAs
> as locked and prevent this race with fault handlers that would hold VMA
> locks.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b31cc97c2803..1edfcd384f5e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3538,6 +3538,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>   *     hugetlb mapping);
>   *   - all i_mmap_rwsem locks;
>   *   - all anon_vma->rwseml
> + *   - all vmas marked locked

IIRC, the anon_vma may be locked during page fault handling, and this
happens after the VMA is read-locked. I think the same applies to the
i_mmap_rwsem lock.

Thus, the VMA should be marked locked first.
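
For instance with a dedicated marking pass at the top of
mm_take_all_locks(), before any i_mmap_rwsem or anon_vma lock is taken
(untested sketch):

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (signal_pending(current))
			goto out_unlock;
		vma_mark_locked(vma);
	}
	/* ... then the existing loops taking i_mmap_rwsem and
	 * anon_vma locks ... */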

>   *
>   * We can take all locks within these types randomly because the VM code
>   * doesn't nest them and we protected from parallel mm_take_all_locks() by
> @@ -3579,6 +3580,7 @@ int mm_take_all_locks(struct mm_struct *mm)
>  		if (vma->anon_vma)
>  			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>  				vm_lock_anon_vma(mm, avc->anon_vma);
> +		vma_mark_locked(vma);
>  	}
>  
>  	return 0;
> @@ -3636,6 +3638,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
>  	mmap_assert_write_locked(mm);
>  	BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
>  
> +	vma_mark_unlocked_all(mm);
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  		if (vma->anon_vma)
>  			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)



* Re: [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag
  2022-09-01 17:35 ` [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
@ 2022-09-09 14:26   ` Laurent Dufour
  0 siblings, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 14:26 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Add a new flag to distinguish page faults handled under protection of
> per-vma lock.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

FWIW,

Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

> ---
>  include/linux/mm.h       | 3 ++-
>  include/linux/mm_types.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0d9c1563c354..7c3190eaabd7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -466,7 +466,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
> +	{ FAULT_FLAG_VMA_LOCK,		"VMA_LOCK" }
>  
>  /*
>   * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a03f59c1e78..36562e702baf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -886,6 +886,7 @@ enum fault_flag {
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
>  	FAULT_FLAG_UNSHARE =		1 << 10,
>  	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
> +	FAULT_FLAG_VMA_LOCK =		1 << 12,
>  };
>  
>  typedef unsigned int __bitwise zap_flags_t;



* Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock
  2022-09-01 17:35 ` [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock Suren Baghdasaryan
  2022-09-06 19:39   ` Peter Xu
@ 2022-09-09 14:26   ` Laurent Dufour
  1 sibling, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 14:26 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>

> ---
>  mm/memory.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ac9944e8c62..29d2f49f922a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
>  
> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +		ret = VM_FAULT_RETRY;
> +		goto out;
> +	}
> +
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>  



* Re: [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics
  2022-09-01 17:35 ` [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics Suren Baghdasaryan
@ 2022-09-09 14:28   ` Laurent Dufour
  2022-09-09 16:11     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 14:28 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
> statistics about handling page faults under VMA lock.
> 

Why not make this the default when per-VMA locks are enabled?

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/vm_event_item.h | 6 ++++++
>  include/linux/vmstat.h        | 6 ++++++
>  mm/Kconfig.debug              | 8 ++++++++
>  mm/vmstat.c                   | 6 ++++++
>  4 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f3fc36cd2276..a325783ed05d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -150,6 +150,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_X86
>  		DIRECT_MAP_LEVEL2_SPLIT,
>  		DIRECT_MAP_LEVEL3_SPLIT,
> +#endif
> +#ifdef CONFIG_PER_VMA_LOCK_STATS
> +		VMA_LOCK_SUCCESS,
> +		VMA_LOCK_ABORT,
> +		VMA_LOCK_RETRY,
> +		VMA_LOCK_MISS,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index bfe38869498d..0c2611899cfc 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -131,6 +131,12 @@ static inline void vm_events_fold_cpu(int cpu)
>  #define count_vm_vmacache_event(x) do {} while (0)
>  #endif
>  
> +#ifdef CONFIG_PER_VMA_LOCK_STATS
> +#define count_vm_vma_lock_event(x) count_vm_event(x)
> +#else
> +#define count_vm_vma_lock_event(x) do {} while (0)
> +#endif
> +
>  #define __count_zid_vm_events(item, zid, delta) \
>  	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
>  
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index ce8dded36de9..075642763a03 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
>  	  kernel.
>  
>  	  If in doubt, say N.
> +
> +
> +config PER_VMA_LOCK_STATS
> +	bool "Statistics for per-vma locks"
> +	depends on PER_VMA_LOCK
> +	help
> +	  Statistics for per-vma locks.
> +	  If in doubt, say N.
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 90af9a8572f5..3f3804c846a6 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1411,6 +1411,12 @@ const char * const vmstat_text[] = {
>  	"direct_map_level2_splits",
>  	"direct_map_level3_splits",
>  #endif
> +#ifdef CONFIG_PER_VMA_LOCK_STATS
> +	"vma_lock_success",
> +	"vma_lock_abort",
> +	"vma_lock_retry",
> +	"vma_lock_miss",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */



* Re: [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code
  2022-09-01 17:35 ` [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code Suren Baghdasaryan
@ 2022-09-09 14:38   ` Laurent Dufour
  2022-09-09 16:10     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 14:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> Introduce find_and_lock_anon_vma function to lookup and lock an anonymous
> VMA during page fault handling. When VMA is not found, can't be locked
> or changes after being locked, the function returns NULL. The lookup is
> performed under RCU protection to prevent the found VMA from being
> destroyed before the VMA lock is acquired. VMA lock statistics are
> updated according to the results.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7c3190eaabd7..a3cbaa7b9119 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,6 +684,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
>  		      vma);
>  }
>  
> +struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
> +					      unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 29d2f49f922a..bf557f7056de 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5183,6 +5183,51 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline struct vm_area_struct *find_vma_under_rcu(struct mm_struct *mm,
> +							unsigned long address)
> +{
> +	struct vm_area_struct *vma = __find_vma(mm, address);
> +
> +	if (!vma || vma->vm_start > address)
> +		return NULL;
> +
> +	if (!vma_is_anonymous(vma))
> +		return NULL;
> +

It looks more natural to me to first check that the VMA is part of the RB
tree before trying to read-lock it.
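
Something like this (untested, just to illustrate the ordering; the
re-check after locking is still needed to catch an isolation racing with
the lock):

	if (RB_EMPTY_NODE(&vma->vm_rb))
		return NULL;

	if (!vma_read_trylock(vma)) {
		count_vm_vma_lock_event(VMA_LOCK_ABORT);
		return NULL;
	}

	/* Check if the VMA got isolated after we locked it */
	if (RB_EMPTY_NODE(&vma->vm_rb)) {
		vma_read_unlock(vma);
		count_vm_vma_lock_event(VMA_LOCK_MISS);
		return NULL;
	}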

> +	if (!vma_read_trylock(vma)) {
> +		count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +		return NULL;
> +	}
> +
> +	/* Check if the VMA got isolated after we found it */
> +	if (RB_EMPTY_NODE(&vma->vm_rb)) {
> +		vma_read_unlock(vma);
> +		count_vm_vma_lock_event(VMA_LOCK_MISS);
> +		return NULL;
> +	}
> +
> +	return vma;
> +}
> +
> +/*
> + * Lookup and lock an anonymous VMA. Returned VMA is guaranteed to be stable
> + * and not isolated. If the VMA is not found or is being modified the function
> + * returns NULL.
> + */
> +struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
> +					      unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +
> +	rcu_read_lock();
> +	vma = find_vma_under_rcu(mm, address);
> +	rcu_read_unlock();
> +
> +	return vma;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  #ifndef __PAGETABLE_P4D_FOLDED
>  /*
>   * Allocate p4d page table.



* Re: [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free
  2022-09-01 17:35 ` [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
@ 2022-09-09 15:19   ` Laurent Dufour
  2022-09-09 16:02     ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 15:19 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h       |  1 +
>  include/linux/mm_types.h | 11 ++++++-
>  kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
>  mm/init-mm.c             |  3 ++
>  mm/mmap.c                |  1 +
>  5 files changed, 75 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a3cbaa7b9119..81dff694ac14 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -249,6 +249,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
>  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
>  void vm_area_free(struct vm_area_struct *);
> +void drain_free_vmas(struct mm_struct *mm);
>  
>  #ifndef CONFIG_MMU
>  extern struct rb_root nommu_region_tree;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 36562e702baf..6f3effc493b1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -412,7 +412,11 @@ struct vm_area_struct {
>  			struct vm_area_struct *vm_next, *vm_prev;
>  		};
>  #ifdef CONFIG_PER_VMA_LOCK
> -		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
> +		struct {
> +			struct list_head vm_free_list;
> +			/* Used for deferred freeing. */
> +			struct rcu_head vm_rcu;
> +		};
>  #endif
>  	};
>  
> @@ -573,6 +577,11 @@ struct mm_struct {
>  					  */
>  #ifdef CONFIG_PER_VMA_LOCK
>  		int mm_lock_seq;
> +		struct {
> +			struct list_head head;
> +			spinlock_t lock;
> +			int size;
> +		} vma_free_list;
>  #endif
>  
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b443ba3a247a..7c88710aed72 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -483,26 +483,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  }
>  
>  #ifdef CONFIG_PER_VMA_LOCK
> -static void __vm_area_free(struct rcu_head *head)
> +static inline void __vm_area_free(struct vm_area_struct *vma)
>  {
> -	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> -						  vm_rcu);
>  	/* The vma should either have no lock holders or be write-locked. */
>  	vma_assert_no_reader(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
> -#endif
> +
> +static void vma_free_rcu_callback(struct rcu_head *head)
> +{
> +	struct vm_area_struct *first_vma;
> +	struct vm_area_struct *vma, *vma2;
> +
> +	first_vma = container_of(head, struct vm_area_struct, vm_rcu);
> +	list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)

Is it safe to walk the list against concurrent calls to
list_splice_init() or list_add()?

> +		__vm_area_free(vma);
> +	__vm_area_free(first_vma);
> +}
> +
> +void drain_free_vmas(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *first_vma;
> +	LIST_HEAD(to_destroy);
> +
> +	spin_lock(&mm->vma_free_list.lock);
> +	list_splice_init(&mm->vma_free_list.head, &to_destroy);
> +	mm->vma_free_list.size = 0;
> +	spin_unlock(&mm->vma_free_list.lock);
> +
> +	if (list_empty(&to_destroy))
> +		return;
> +
> +	first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
> +	/* Remove the head which is allocated on the stack */
> +	list_del(&to_destroy);
> +
> +	call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
> +}
> +
> +#define VM_AREA_FREE_LIST_MAX	32
> +
> +void vm_area_free(struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	bool drain;
> +
> +	free_anon_vma_name(vma);
> +
> +	spin_lock(&mm->vma_free_list.lock);
> +	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> +	mm->vma_free_list.size++;
> +	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> +	spin_unlock(&mm->vma_free_list.lock);
> +
> +	if (drain)
> +		drain_free_vmas(mm);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +void drain_free_vmas(struct mm_struct *mm) {}
>  
>  void vm_area_free(struct vm_area_struct *vma)
>  {
>  	free_anon_vma_name(vma);
> -#ifdef CONFIG_PER_VMA_LOCK
> -	call_rcu(&vma->vm_rcu, __vm_area_free);
> -#else
>  	kmem_cache_free(vm_area_cachep, vma);
> -#endif
>  }
>  
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
>  	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> @@ -1137,6 +1186,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	INIT_LIST_HEAD(&mm->mmlist);
>  #ifdef CONFIG_PER_VMA_LOCK
>  	WRITE_ONCE(mm->mm_lock_seq, 0);
> +	INIT_LIST_HEAD(&mm->vma_free_list.head);
> +	spin_lock_init(&mm->vma_free_list.lock);
> +	mm->vma_free_list.size = 0;
>  #endif
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 8399f90d631c..7b6d2460545f 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -39,6 +39,9 @@ struct mm_struct init_mm = {
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
>  #ifdef CONFIG_PER_VMA_LOCK
>  	.mm_lock_seq	= 0,
> +	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
> +	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
> +	.vma_free_list.size = 0,
>  #endif
>  	.user_ns	= &init_user_ns,
>  	.cpu_bitmap	= CPU_BITS_NONE,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1edfcd384f5e..d61b7ef84ba6 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3149,6 +3149,7 @@ void exit_mmap(struct mm_struct *mm)
>  	}
>  	mm->mmap = NULL;
>  	mmap_write_unlock(mm);
> +	drain_free_vmas(mm);
>  	vm_unacct_memory(nr_accounted);
>  }
>  



* Re: [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust
  2022-09-09  0:51     ` Suren Baghdasaryan
@ 2022-09-09 15:52       ` Laurent Dufour
  0 siblings, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 15:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 09/09/2022 at 02:51, Suren Baghdasaryan wrote:
> On Tue, Sep 6, 2022 at 8:35 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>>
>> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
>>> vma_adjust modifies a VMA and possibly its neighbors. Mark them as locked
>>> before making the modifications.
>>>
>>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>> ---
>>>  mm/mmap.c | 11 ++++++++++-
>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index f89c9b058105..ed58cf0689b2 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -710,6 +710,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>>       long adjust_next = 0;
>>>       int remove_next = 0;
>>>
>>> +     vma_mark_locked(vma);
>>> +     if (next)
>>> +             vma_mark_locked(next);
>>> +
>>
>> I was wondering if the VMAs insert and expand should be locked too.
>>
>> For expand, I can't see any valid reason, but for insert, I'm puzzled.
>> I would think that it is better to lock the VMA to be inserted but I can't
>> really justify that.
>>
>> It may be nice to detail why there is no need to lock insert and expand here.
> 
> 'expand' is always locked before it's passed to __vma_adjust() by
> vma_merge(). It has to be locked before we decide "Can it merge with
> the predecessor?" here
> https://elixir.bootlin.com/linux/latest/source/mm/mmap.c#L1201 because
> a change in VMA can affect that decision. I spent many hours tracking
> the issue caused by not locking the VMA before making this decision.
> It might be good to add a comment about this...
> 
> AFAICT 'insert' is only used by __split_vma() and it's always a brand
> new VMA which is not yet linked into mm->mmap. Any reason
> __vma_adjust() should lock it?

No, I think that's good this way.
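
To sketch the ordering described above (illustrative only; arguments are
elided, and the actual call sites are vma_merge() and __split_vma() in the
posted series):

	/* vma_merge(): 'expand' (here prev) is locked before the merge
	 * decision so that the fields the decision depends on stay stable */
	vma_mark_locked(prev);
	if (can_vma_merge_after(prev, ...))
		err = __vma_adjust(prev, ..., NULL /* insert */, prev /* expand */);

	/* __split_vma(): 'insert' is a freshly duplicated VMA that is not
	 * yet linked into mm->mmap, so no reader can reach it and it does
	 * not need to be locked */
	new = vm_area_dup(vma);
	err = __vma_adjust(vma, ..., new /* insert */, NULL /* expand */);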

> 
>>
>>>       if (next && !insert) {
>>>               struct vm_area_struct *exporter = NULL, *importer = NULL;
>>>
>>> @@ -754,8 +758,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>>                        * If next doesn't have anon_vma, import from vma after
>>>                        * next, if the vma overlaps with it.
>>>                        */
>>> -                     if (remove_next == 2 && !next->anon_vma)
>>> +                     if (remove_next == 2 && !next->anon_vma) {
>>>                               exporter = next->vm_next;
>>> +                             if (exporter)
>>> +                                     vma_mark_locked(exporter);
>>> +                     }
>>>
>>>               } else if (end > next->vm_start) {
>>>                       /*
>>> @@ -931,6 +938,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>>                        * "vma->vm_next" gap must be updated.
>>>                        */
>>>                       next = vma->vm_next;
>>> +                     if (next)
>>> +                             vma_mark_locked(next);
>>>               } else {
>>>                       /*
>>>                        * For the scope of the comment "next" and
>>



* Re: [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free
  2022-09-09 15:19   ` Laurent Dufour
@ 2022-09-09 16:02     ` Suren Baghdasaryan
  2022-09-09 16:14       ` Laurent Dufour
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:02 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 8:19 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h       |  1 +
> >  include/linux/mm_types.h | 11 ++++++-
> >  kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
> >  mm/init-mm.c             |  3 ++
> >  mm/mmap.c                |  1 +
> >  5 files changed, 75 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a3cbaa7b9119..81dff694ac14 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -249,6 +249,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
> >  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> >  void vm_area_free(struct vm_area_struct *);
> > +void drain_free_vmas(struct mm_struct *mm);
> >
> >  #ifndef CONFIG_MMU
> >  extern struct rb_root nommu_region_tree;
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 36562e702baf..6f3effc493b1 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -412,7 +412,11 @@ struct vm_area_struct {
> >                       struct vm_area_struct *vm_next, *vm_prev;
> >               };
> >  #ifdef CONFIG_PER_VMA_LOCK
> > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > +             struct {
> > +                     struct list_head vm_free_list;
> > +                     /* Used for deferred freeing. */
> > +                     struct rcu_head vm_rcu;
> > +             };
> >  #endif
> >       };
> >
> > @@ -573,6 +577,11 @@ struct mm_struct {
> >                                         */
> >  #ifdef CONFIG_PER_VMA_LOCK
> >               int mm_lock_seq;
> > +             struct {
> > +                     struct list_head head;
> > +                     spinlock_t lock;
> > +                     int size;
> > +             } vma_free_list;
> >  #endif
> >
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b443ba3a247a..7c88710aed72 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -483,26 +483,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >  }
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > -static void __vm_area_free(struct rcu_head *head)
> > +static inline void __vm_area_free(struct vm_area_struct *vma)
> >  {
> > -     struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > -                                               vm_rcu);
> >       /* The vma should either have no lock holders or be write-locked. */
> >       vma_assert_no_reader(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> > -#endif
> > +
> > +static void vma_free_rcu_callback(struct rcu_head *head)
> > +{
> > +     struct vm_area_struct *first_vma;
> > +     struct vm_area_struct *vma, *vma2;
> > +
> > +     first_vma = container_of(head, struct vm_area_struct, vm_rcu);
> > +     list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
>
> Is it safe to walk the list against concurrent calls to
> list_splice_init() or list_add()?

I think it is. drain_free_vmas() moves the to-be-destroyed and already
isolated VMAs from mm->vma_free_list into to_destroy list and then
passes that list to vma_free_rcu_callback(). At this point the list of
VMAs passed to vma_free_rcu_callback() is not accessible either from
mm (VMAs were isolated before vm_area_free() was called) or from
drain_free_vmas() since they were already removed from
mm->vma_free_list. Does that make sense?
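
The hand-off can be pictured as the following timeline (illustrative only,
based on the code quoted below):

	/*
	 *   vm_area_free()            drain_free_vmas()         RCU callback
	 *   --------------            -----------------         ------------
	 *   lock vma_free_list.lock
	 *   list_add(vma, head)
	 *   unlock
	 *                             lock vma_free_list.lock
	 *                             splice head -> to_destroy
	 *                             size = 0; unlock
	 *                             call_rcu(first_vma)
	 *                                                       walks the spliced
	 *                                                       list; nothing else
	 *                                                       references it
	 */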

>
> > +             __vm_area_free(vma);
> > +     __vm_area_free(first_vma);
> > +}
> > +
> > +void drain_free_vmas(struct mm_struct *mm)
> > +{
> > +     struct vm_area_struct *first_vma;
> > +     LIST_HEAD(to_destroy);
> > +
> > +     spin_lock(&mm->vma_free_list.lock);
> > +     list_splice_init(&mm->vma_free_list.head, &to_destroy);
> > +     mm->vma_free_list.size = 0;
> > +     spin_unlock(&mm->vma_free_list.lock);
> > +
> > +     if (list_empty(&to_destroy))
> > +             return;
> > +
> > +     first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
> > +     /* Remove the head which is allocated on the stack */
> > +     list_del(&to_destroy);
> > +
> > +     call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
> > +}
> > +
> > +#define VM_AREA_FREE_LIST_MAX        32
> > +
> > +void vm_area_free(struct vm_area_struct *vma)
> > +{
> > +     struct mm_struct *mm = vma->vm_mm;
> > +     bool drain;
> > +
> > +     free_anon_vma_name(vma);
> > +
> > +     spin_lock(&mm->vma_free_list.lock);
> > +     list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> > +     mm->vma_free_list.size++;
> > +     drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > +     spin_unlock(&mm->vma_free_list.lock);
> > +
> > +     if (drain)
> > +             drain_free_vmas(mm);
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +void drain_free_vmas(struct mm_struct *mm) {}
> >
> >  void vm_area_free(struct vm_area_struct *vma)
> >  {
> >       free_anon_vma_name(vma);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -     call_rcu(&vma->vm_rcu, __vm_area_free);
> > -#else
> >       kmem_cache_free(vm_area_cachep, vma);
> > -#endif
> >  }
> >
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  static void account_kernel_stack(struct task_struct *tsk, int account)
> >  {
> >       if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> > @@ -1137,6 +1186,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       INIT_LIST_HEAD(&mm->mmlist);
> >  #ifdef CONFIG_PER_VMA_LOCK
> >       WRITE_ONCE(mm->mm_lock_seq, 0);
> > +     INIT_LIST_HEAD(&mm->vma_free_list.head);
> > +     spin_lock_init(&mm->vma_free_list.lock);
> > +     mm->vma_free_list.size = 0;
> >  #endif
> >       mm_pgtables_bytes_init(mm);
> >       mm->map_count = 0;
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index 8399f90d631c..7b6d2460545f 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -39,6 +39,9 @@ struct mm_struct init_mm = {
> >       .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
> >  #ifdef CONFIG_PER_VMA_LOCK
> >       .mm_lock_seq    = 0,
> > +     .vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
> > +     .vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
> > +     .vma_free_list.size = 0,
> >  #endif
> >       .user_ns        = &init_user_ns,
> >       .cpu_bitmap     = CPU_BITS_NONE,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 1edfcd384f5e..d61b7ef84ba6 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3149,6 +3149,7 @@ void exit_mmap(struct mm_struct *mm)
> >       }
> >       mm->mmap = NULL;
> >       mmap_write_unlock(mm);
> > +     drain_free_vmas(mm);
> >       vm_unacct_memory(nr_accounted);
> >  }
> >
>


* Re: [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code
  2022-09-09 14:38   ` Laurent Dufour
@ 2022-09-09 16:10     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:10 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 7:38 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > Introduce find_and_lock_anon_vma function to lookup and lock an anonymous
> > VMA during page fault handling. When VMA is not found, can't be locked
> > or changes after being locked, the function returns NULL. The lookup is
> > performed under RCU protection to prevent the found VMA from being
> > destroyed before the VMA lock is acquired. VMA lock statistics are
> > updated according to the results.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h |  3 +++
> >  mm/memory.c        | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 48 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7c3190eaabd7..a3cbaa7b9119 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -684,6 +684,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> >                     vma);
> >  }
> >
> > +struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
> > +                                           unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 29d2f49f922a..bf557f7056de 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5183,6 +5183,51 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline struct vm_area_struct *find_vma_under_rcu(struct mm_struct *mm,
> > +                                                     unsigned long address)
> > +{
> > +     struct vm_area_struct *vma = __find_vma(mm, address);
> > +
> > +     if (!vma || vma->vm_start > address)
> > +             return NULL;
> > +
> > +     if (!vma_is_anonymous(vma))
> > +             return NULL;
> > +
>
> It looks to me more natural to first check that the VMA is part of the RB
> tree before try read locking it.

I think we want to check that the VMA is still part of the mm _after_
we locked it. Otherwise we might pass the check, then some other
thread does (lock->isolate->unlock) and then we lock the VMA. We would
end up with a VMA that is not part of mm anymore but we assume it is.

>
> > +     if (!vma_read_trylock(vma)) {
> > +             count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +             return NULL;
> > +     }
> > +
> > +     /* Check if the VMA got isolated after we found it */
> > +     if (RB_EMPTY_NODE(&vma->vm_rb)) {
> > +             vma_read_unlock(vma);
> > +             count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +             return NULL;
> > +     }
> > +
> > +     return vma;
> > +}
> > +
> > +/*
> > + * Lookup and lock an anonymous VMA. Returned VMA is guaranteed to be stable
> > + * and not isolated. If the VMA is not found or is being modified the function
> > + * returns NULL.
> > + */
> > +struct vm_area_struct *find_and_lock_anon_vma(struct mm_struct *mm,
> > +                                           unsigned long address)
> > +{
> > +     struct vm_area_struct *vma;
> > +
> > +     rcu_read_lock();
> > +     vma = find_vma_under_rcu(mm, address);
> > +     rcu_read_unlock();
> > +
> > +     return vma;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  #ifndef __PAGETABLE_P4D_FOLDED
> >  /*
> >   * Allocate p4d page table.
>
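
The ordering requirement explained above can be summarized as a timeline
(illustrative only):

	/*
	 *   page fault path                      munmap/update path
	 *   ---------------                      ------------------
	 *   __find_vma() -> vma
	 *   (tree check here?)                   vma_mark_locked(vma)
	 *                                        __vma_rb_erase(vma)   <- isolated
	 *                                        ... update ...
	 *                                        mm_lock_seq++         <- unlocked
	 *   vma_read_trylock(vma) succeeds
	 *   -> the fault would run on an isolated VMA if the tree check
	 *      were done before taking the lock, which is why the
	 *      RB_EMPTY_NODE() check comes after vma_read_trylock()
	 */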


* Re: [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics
  2022-09-09 14:28   ` Laurent Dufour
@ 2022-09-09 16:11     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:11 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 7:29 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
> > statistics about handling page fault under VMA lock.
> >
>
> Why not make this the default when per-VMA locks are enabled?

Good idea. If no objections I'll make that change.

>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/vm_event_item.h | 6 ++++++
> >  include/linux/vmstat.h        | 6 ++++++
> >  mm/Kconfig.debug              | 8 ++++++++
> >  mm/vmstat.c                   | 6 ++++++
> >  4 files changed, 26 insertions(+)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index f3fc36cd2276..a325783ed05d 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -150,6 +150,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_X86
> >               DIRECT_MAP_LEVEL2_SPLIT,
> >               DIRECT_MAP_LEVEL3_SPLIT,
> > +#endif
> > +#ifdef CONFIG_PER_VMA_LOCK_STATS
> > +             VMA_LOCK_SUCCESS,
> > +             VMA_LOCK_ABORT,
> > +             VMA_LOCK_RETRY,
> > +             VMA_LOCK_MISS,
> >  #endif
> >               NR_VM_EVENT_ITEMS
> >  };
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index bfe38869498d..0c2611899cfc 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -131,6 +131,12 @@ static inline void vm_events_fold_cpu(int cpu)
> >  #define count_vm_vmacache_event(x) do {} while (0)
> >  #endif
> >
> > +#ifdef CONFIG_PER_VMA_LOCK_STATS
> > +#define count_vm_vma_lock_event(x) count_vm_event(x)
> > +#else
> > +#define count_vm_vma_lock_event(x) do {} while (0)
> > +#endif
> > +
> >  #define __count_zid_vm_events(item, zid, delta) \
> >       __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
> >
> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index ce8dded36de9..075642763a03 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
> >         kernel.
> >
> >         If in doubt, say N.
> > +
> > +
> > +config PER_VMA_LOCK_STATS
> > +     bool "Statistics for per-vma locks"
> > +     depends on PER_VMA_LOCK
> > +     help
> > +       Statistics for per-vma locks.
> > +       If in doubt, say N.
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 90af9a8572f5..3f3804c846a6 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1411,6 +1411,12 @@ const char * const vmstat_text[] = {
> >       "direct_map_level2_splits",
> >       "direct_map_level3_splits",
> >  #endif
> > +#ifdef CONFIG_PER_VMA_LOCK_STATS
> > +     "vma_lock_success",
> > +     "vma_lock_abort",
> > +     "vma_lock_retry",
> > +     "vma_lock_miss",
> > +#endif
> >  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
> >  };
> >  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
>
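
If that change is made, the Kconfig entry quoted above would presumably
become something like this (a sketch, not a posted patch):

	config PER_VMA_LOCK_STATS
		bool "Statistics for per-vma locks"
		depends on PER_VMA_LOCK
		default y
		help
		  Statistics for per-vma locks.
		  If in doubt, say Y.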


* Re: [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2022-09-09 14:20   ` Laurent Dufour
@ 2022-09-09 16:12     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:12 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 7:20 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > Pagefault handlers might need to fire MMU notifications while a new
> > notifier is being registered. Modify mm_take_all_locks to mark all VMAs
> > as locked and prevent this race with fault handlers that would hold VMA
> > locks.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b31cc97c2803..1edfcd384f5e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3538,6 +3538,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> >   *     hugetlb mapping);
> >   *   - all i_mmap_rwsem locks;
> >   *   - all anon_vma->rwseml
> > + *   - all vmas marked locked
>
> IIRC, the anon_vma may be locked during the page fault handling, and this
> happens after the VMA is read locked. I think the same applies to the
> i_mmap_rwsem lock.
>
> Thus, the VMA should be marked locked first.

I see. I'll double check and move the locking order. Thanks!

>
> >   *
> >   * We can take all locks within these types randomly because the VM code
> >   * doesn't nest them and we protected from parallel mm_take_all_locks() by
> > @@ -3579,6 +3580,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> >               if (vma->anon_vma)
> >                       list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
> >                               vm_lock_anon_vma(mm, avc->anon_vma);
> > +             vma_mark_locked(vma);
> >       }
> >
> >       return 0;
> > @@ -3636,6 +3638,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> >       mmap_assert_write_locked(mm);
> >       BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
> >
> > +     vma_mark_unlocked_all(mm);
> >       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >               if (vma->anon_vma)
> >                       list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
>
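
The reordering agreed on above would presumably look roughly like this in
mm_take_all_locks() (a sketch, not the posted patch):

	/*
	 * Mark the VMA before taking its anon_vma locks. The i_mmap loops
	 * would need the same treatment, e.g. by marking all VMAs in a
	 * dedicated loop before any of the other locks are taken.
	 */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (signal_pending(current))
			goto out_unlock;
		vma_mark_locked(vma);
		if (vma->anon_vma)
			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
				vm_lock_anon_vma(mm, avc->anon_vma);
	}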


* Re: [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free
  2022-09-09 16:02     ` Suren Baghdasaryan
@ 2022-09-09 16:14       ` Laurent Dufour
  0 siblings, 0 replies; 91+ messages in thread
From: Laurent Dufour @ 2022-09-09 16:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On 09/09/2022 at 18:02, Suren Baghdasaryan wrote:
> On Fri, Sep 9, 2022 at 8:19 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>>
>> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
>>> call_rcu() can take a long time when callback offloading is enabled.
>>> Its use in the vm_area_free can cause regressions in the exit path when
>>> multiple VMAs are being freed. To minimize that impact, place VMAs into
>>> a list and free them in groups using one call_rcu() call per group.
>>>
>>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>> ---
>>>  include/linux/mm.h       |  1 +
>>>  include/linux/mm_types.h | 11 ++++++-
>>>  kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
>>>  mm/init-mm.c             |  3 ++
>>>  mm/mmap.c                |  1 +
>>>  5 files changed, 75 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index a3cbaa7b9119..81dff694ac14 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -249,6 +249,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
>>>  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
>>>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
>>>  void vm_area_free(struct vm_area_struct *);
>>> +void drain_free_vmas(struct mm_struct *mm);
>>>
>>>  #ifndef CONFIG_MMU
>>>  extern struct rb_root nommu_region_tree;
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 36562e702baf..6f3effc493b1 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -412,7 +412,11 @@ struct vm_area_struct {
>>>                       struct vm_area_struct *vm_next, *vm_prev;
>>>               };
>>>  #ifdef CONFIG_PER_VMA_LOCK
>>> -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
>>> +             struct {
>>> +                     struct list_head vm_free_list;
>>> +                     /* Used for deferred freeing. */
>>> +                     struct rcu_head vm_rcu;
>>> +             };
>>>  #endif
>>>       };
>>>
>>> @@ -573,6 +577,11 @@ struct mm_struct {
>>>                                         */
>>>  #ifdef CONFIG_PER_VMA_LOCK
>>>               int mm_lock_seq;
>>> +             struct {
>>> +                     struct list_head head;
>>> +                     spinlock_t lock;
>>> +                     int size;
>>> +             } vma_free_list;
>>>  #endif
>>>
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index b443ba3a247a..7c88710aed72 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -483,26 +483,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>>>  }
>>>
>>>  #ifdef CONFIG_PER_VMA_LOCK
>>> -static void __vm_area_free(struct rcu_head *head)
>>> +static inline void __vm_area_free(struct vm_area_struct *vma)
>>>  {
>>> -     struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
>>> -                                               vm_rcu);
>>>       /* The vma should either have no lock holders or be write-locked. */
>>>       vma_assert_no_reader(vma);
>>>       kmem_cache_free(vm_area_cachep, vma);
>>>  }
>>> -#endif
>>> +
>>> +static void vma_free_rcu_callback(struct rcu_head *head)
>>> +{
>>> +     struct vm_area_struct *first_vma;
>>> +     struct vm_area_struct *vma, *vma2;
>>> +
>>> +     first_vma = container_of(head, struct vm_area_struct, vm_rcu);
>>> +     list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
>>
>> Is it safe to walk the list against concurrent calls to
>> list_splice_init() or list_add()?
> 
> I think it is. drain_free_vmas() moves the to-be-destroyed and already
> isolated VMAs from mm->vma_free_list into to_destroy list and then
> passes that list to vma_free_rcu_callback(). At this point the list of
> VMAs passed to vma_free_rcu_callback() is not accessible either from
> mm (VMAs were isolated before vm_area_free() was called) or from
> drain_free_vmas() since they were already removed from
> mm->vma_free_list. Does that make sense?

Got it!
Thanks for the explanation.

> 
>>
>>> +             __vm_area_free(vma);
>>> +     __vm_area_free(first_vma);
>>> +}
>>> +
>>> +void drain_free_vmas(struct mm_struct *mm)
>>> +{
>>> +     struct vm_area_struct *first_vma;
>>> +     LIST_HEAD(to_destroy);
>>> +
>>> +     spin_lock(&mm->vma_free_list.lock);
>>> +     list_splice_init(&mm->vma_free_list.head, &to_destroy);
>>> +     mm->vma_free_list.size = 0;
>>> +     spin_unlock(&mm->vma_free_list.lock);
>>> +
>>> +     if (list_empty(&to_destroy))
>>> +             return;
>>> +
>>> +     first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
>>> +     /* Remove the head which is allocated on the stack */
>>> +     list_del(&to_destroy);
>>> +
>>> +     call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
>>> +}
>>> +
>>> +#define VM_AREA_FREE_LIST_MAX        32
>>> +
>>> +void vm_area_free(struct vm_area_struct *vma)
>>> +{
>>> +     struct mm_struct *mm = vma->vm_mm;
>>> +     bool drain;
>>> +
>>> +     free_anon_vma_name(vma);
>>> +
>>> +     spin_lock(&mm->vma_free_list.lock);
>>> +     list_add(&vma->vm_free_list, &mm->vma_free_list.head);
>>> +     mm->vma_free_list.size++;
>>> +     drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
>>> +     spin_unlock(&mm->vma_free_list.lock);
>>> +
>>> +     if (drain)
>>> +             drain_free_vmas(mm);
>>> +}
>>> +
>>> +#else /* CONFIG_PER_VMA_LOCK */
>>> +
>>> +void drain_free_vmas(struct mm_struct *mm) {}
>>>
>>>  void vm_area_free(struct vm_area_struct *vma)
>>>  {
>>>       free_anon_vma_name(vma);
>>> -#ifdef CONFIG_PER_VMA_LOCK
>>> -     call_rcu(&vma->vm_rcu, __vm_area_free);
>>> -#else
>>>       kmem_cache_free(vm_area_cachep, vma);
>>> -#endif
>>>  }
>>>
>>> +#endif /* CONFIG_PER_VMA_LOCK */
>>> +
>>>  static void account_kernel_stack(struct task_struct *tsk, int account)
>>>  {
>>>       if (IS_ENABLED(CONFIG_VMAP_STACK)) {
>>> @@ -1137,6 +1186,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>       INIT_LIST_HEAD(&mm->mmlist);
>>>  #ifdef CONFIG_PER_VMA_LOCK
>>>       WRITE_ONCE(mm->mm_lock_seq, 0);
>>> +     INIT_LIST_HEAD(&mm->vma_free_list.head);
>>> +     spin_lock_init(&mm->vma_free_list.lock);
>>> +     mm->vma_free_list.size = 0;
>>>  #endif
>>>       mm_pgtables_bytes_init(mm);
>>>       mm->map_count = 0;
>>> diff --git a/mm/init-mm.c b/mm/init-mm.c
>>> index 8399f90d631c..7b6d2460545f 100644
>>> --- a/mm/init-mm.c
>>> +++ b/mm/init-mm.c
>>> @@ -39,6 +39,9 @@ struct mm_struct init_mm = {
>>>       .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
>>>  #ifdef CONFIG_PER_VMA_LOCK
>>>       .mm_lock_seq    = 0,
>>> +     .vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
>>> +     .vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
>>> +     .vma_free_list.size = 0,
>>>  #endif
>>>       .user_ns        = &init_user_ns,
>>>       .cpu_bitmap     = CPU_BITS_NONE,
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 1edfcd384f5e..d61b7ef84ba6 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3149,6 +3149,7 @@ void exit_mmap(struct mm_struct *mm)
>>>       }
>>>       mm->mmap = NULL;
>>>       mmap_write_unlock(mm);
>>> +     drain_free_vmas(mm);
>>>       vm_unacct_memory(nr_accounted);
>>>  }
>>>
>>



* Re: [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction
  2022-09-09 13:56   ` Laurent Dufour
@ 2022-09-09 16:19     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:19 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 6:56 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > Assert there are no holders of VMA lock for reading when it is about to be
> > destroyed.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h | 8 ++++++++
> >  kernel/fork.c      | 2 ++
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index dc72be923e5b..0d9c1563c354 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -676,6 +676,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos)
> >       VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> >  }
> >
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > +{
> > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > +                   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > +                   vma);
> > +}
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > @@ -685,6 +692,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> >  static inline void vma_assert_locked(struct vm_area_struct *vma) {}
> >  static inline void vma_assert_write_locked(struct vm_area_struct *vma, int pos) {}
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
> >
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 1872ad549fed..b443ba3a247a 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -487,6 +487,8 @@ static void __vm_area_free(struct rcu_head *head)
> >  {
> >       struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> >                                                 vm_rcu);
> > +     /* The vma should either have no lock holders or be write-locked. */
> > +     vma_assert_no_reader(vma);
>
> I'm wondering if this can be hit in the case where the thread freeing a VMA is
> preempted before incrementing the mm seq count, like this:
>
> VMA is about to be freed
> write lock VMA
> free vma -> call_rcu
> ..
> <--- thread preempted
>         rcu handler runs
>         rcu calls __vm_area_free() <<<<<<

At this point the VMA is still write-locked (mm seq count hasn't been
incremented yet), correct? If so then vma_assert_no_reader() will not
assert because the second condition of VMA being write-locked is
satisfied. Did I miss anything?

> unlock mmap_lock and increase the mm seq count
>
>
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> >  #endif
>
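
The scenario, and why the assertion holds, can be summarized as the
following timeline (illustrative only):

	/*
	 *   exit/unmap path                      RCU callback
	 *   ---------------                      ------------
	 *   vma_mark_locked(vma)
	 *     -> vm_lock_seq == mm_lock_seq      (write-locked)
	 *   vm_area_free(vma) -> call_rcu()
	 *   (preempted)                          __vm_area_free(vma)
	 *                                        vma_assert_no_reader(vma)
	 *                                          -> does not trigger: the
	 *                                             seq numbers still match,
	 *                                             i.e. vma is write-locked
	 *   mmap_write_unlock(mm)
	 *   mm_lock_seq++                        (all marked VMAs unlocked)
	 */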


* Re: [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area
  2022-09-09 13:43   ` Laurent Dufour
@ 2022-09-09 16:25     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:25 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 6:43 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > While unmapping VMAs, adjacent VMAs might be able to grow into the area
> > being unmapped. In such cases mark adjacent VMAs as locked to prevent
> > this growth.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b0d78bdc0de0..b31cc97c2803 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2680,10 +2680,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> >        * VM_GROWSUP VMA. Such VMAs can change their size under
> >        * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> >        */
> > -     if (vma && (vma->vm_flags & VM_GROWSDOWN))
> > +     if (vma && (vma->vm_flags & VM_GROWSDOWN)) {
> > +             vma_mark_locked(vma);
> >               return false;
> > -     if (prev && (prev->vm_flags & VM_GROWSUP))
> > +     }
> > +     if (prev && (prev->vm_flags & VM_GROWSUP)) {
> > +             vma_mark_locked(prev);
> >               return false;
> > +     }
> >       return true;
> >  }
> >
>
> That looks right to be.
>
> But, in addition to that, as in the previous patch, all the VMAs to be
> detached from the tree in the loop above should be marked locked just
> before calling vma_rb_erase().

The following call chain already locks the VMA being isolated:
vma_rb_erase->vma_rb_erase_ignore->__vma_rb_erase->vma_mark_locked


* Re: [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them
  2022-09-09 13:35   ` Laurent Dufour
@ 2022-09-09 16:28     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:28 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 6:35 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > Mark VMAs as locked before isolating them and clear their tree node so
> > that isolated VMAs are easily identifiable. In the later patches page
> > fault handlers will try locking the found VMA and will check whether
> > the VMA was isolated. Locking VMAs before isolating them ensures that
> > page fault handlers don't operate on isolated VMAs.
>
> Found another place where the VMA should probably be marked locked:
> *** drivers/gpu/drm/drm_vma_manager.c:
> drm_vma_node_revoke[338]       rb_erase(&entry->vm_rb, &node->vm_files);

Thanks! I'll add the necessary locking.

>
> There are 2 other entries in nommu.c but I guess this is not supported,
> is it?

Yes, PER_VMA_LOCK config depends on MMU but for completeness we could
add locking there as well (it will be compiled out).

>
>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c  | 2 ++
> >  mm/nommu.c | 2 ++
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 094678b4434b..b0d78bdc0de0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -421,12 +421,14 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
> >
> >  static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> >  {
> > +     vma_mark_locked(vma);
> >       /*
> >        * Note rb_erase_augmented is a fairly large inline function,
> >        * so make sure we instantiate it only once with our desired
> >        * augmented rbtree callbacks.
> >        */
> >       rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> > +     RB_CLEAR_NODE(&vma->vm_rb);
> >  }
> >
> >  static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index e819cbc21b39..ff9933e57501 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -622,6 +622,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> >       struct mm_struct *mm = vma->vm_mm;
> >       struct task_struct *curr = current;
> >
> > +     vma_mark_locked(vma);
> >       mm->map_count--;
> >       for (i = 0; i < VMACACHE_SIZE; i++) {
> >               /* if the vma is cached, invalidate the entire cache */
> > @@ -644,6 +645,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> >
> >       /* remove from the MM's tree and list */
> >       rb_erase(&vma->vm_rb, &mm->mm_rb);
> > +     RB_CLEAR_NODE(&vma->vm_rb);
> >
> >       __vma_unlink_list(mm, vma);
> >  }
>


* Re: [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork
  2022-09-09 13:27       ` Laurent Dufour
@ 2022-09-09 16:29         ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:29 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 6:27 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 09/09/2022 at 01:57, Suren Baghdasaryan wrote:
> > On Tue, Sep 6, 2022 at 7:38 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
> >>
> >> On 01/09/2022 at 19:34, Suren Baghdasaryan wrote:
> >>> Protect VMAs from concurrent page fault handler while performing
> >>> copy_page_range for VMAs having VM_WIPEONFORK flag set.
> >>
> >> I'm wondering why that is necessary.
> >> The copied mm is write locked, and the destination one is not reachable.
> >> If any other readers are using the VMA, this is only for page fault handling.
> >
> > Correct, this is done to prevent page faulting in the VMA being
> > duplicated. I assume we want to prevent the pages in that VMA from
> > changing when we are calling copy_page_range(). Am I wrong?
>
> If a page is faulted while copy_page_range() is in progress, the page may
> not be backed on the child side (the PTE lock should protect the copy, shouldn't it?).
> Is that a real problem? It will be backed later if accessed on the child side.
> Maybe the per-process page accounting could be incorrect...

This feels to me like walking on the edge. Maybe we can discuss this
with more people at LPC before trying it?

>
> >
> >> I must have missed something because I can't see any need to mark the
> >> VMA locked here.
> >>
> >>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >>> ---
> >>>  kernel/fork.c | 4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/fork.c b/kernel/fork.c
> >>> index bfab31ecd11e..1872ad549fed 100644
> >>> --- a/kernel/fork.c
> >>> +++ b/kernel/fork.c
> >>> @@ -709,8 +709,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>>               rb_parent = &tmp->vm_rb;
> >>>
> >>>               mm->map_count++;
> >>> -             if (!(tmp->vm_flags & VM_WIPEONFORK))
> >>> +             if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> >>> +                     vma_mark_locked(mpnt);
> >>>                       retval = copy_page_range(tmp, mpnt);
> >>> +             }
> >>>
> >>>               if (tmp->vm_ops && tmp->vm_ops->open)
> >>>                       tmp->vm_ops->open(tmp);
> >>
>


* Re: [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range
  2022-09-09 10:33   ` Laurent Dufour
@ 2022-09-09 16:43     ` Suren Baghdasaryan
  0 siblings, 0 replies; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-09 16:43 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, rientjes, axelrasmussen, joelaf, minchan,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel

On Fri, Sep 9, 2022 at 3:33 AM Laurent Dufour <ldufour@linux.ibm.com> wrote:
>
> On 01/09/2022 at 19:35, Suren Baghdasaryan wrote:
> > free_pgtables and unmap_page_range functions can be called with mmap_lock
> > held for write (e.g. in mmap_region), held for read (e.g. in
> > madvise_pageout) or not held at all (e.g. in madvise_remove might
> > drop mmap_lock before calling vfs_fallocate, which ends up calling
> > unmap_page_range).
> > Provide free_pgtables and unmap_page_range with additional argument
> > indicating whether to mark the VMA as locked or not based on the usage.
> > The parameter is set based on whether mmap_lock is held in write mode
> > during the call. This ensures no change in behavior between mmap_lock
> > and per-vma locks.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h |  2 +-
> >  mm/internal.h      |  4 ++--
> >  mm/memory.c        | 32 +++++++++++++++++++++-----------
> >  mm/mmap.c          | 17 +++++++++--------
> >  mm/oom_kill.c      |  3 ++-
> >  5 files changed, 35 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 476bf936c5f0..dc72be923e5b 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1874,7 +1874,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> >  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
> >                   unsigned long size);
> >  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > -             unsigned long start, unsigned long end);
> > +             unsigned long start, unsigned long end, bool lock_vma);
> >
> >  struct mmu_notifier_range;
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 785409805ed7..e6c0f999e0cb 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -85,14 +85,14 @@ bool __folio_end_writeback(struct folio *folio);
> >  void deactivate_file_folio(struct folio *folio);
> >
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > -             unsigned long floor, unsigned long ceiling);
> > +             unsigned long floor, unsigned long ceiling, bool lock_vma);
> >  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> >
> >  struct zap_details;
> >  void unmap_page_range(struct mmu_gather *tlb,
> >                            struct vm_area_struct *vma,
> >                            unsigned long addr, unsigned long end,
> > -                          struct zap_details *details);
> > +                          struct zap_details *details, bool lock_vma);
> >
> >  void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
> >               unsigned int order);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4ba73f5aa8bb..9ac9944e8c62 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -403,7 +403,7 @@ void free_pgd_range(struct mmu_gather *tlb,
> >  }
> >
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > -             unsigned long floor, unsigned long ceiling)
> > +             unsigned long floor, unsigned long ceiling, bool lock_vma)
> >  {
> >       while (vma) {
> >               struct vm_area_struct *next = vma->vm_next;
> > @@ -413,6 +413,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                * Hide vma from rmap and truncate_pagecache before freeing
> >                * pgtables
> >                */
> > +             if (lock_vma)
> > +                     vma_mark_locked(vma);
> >               unlink_anon_vmas(vma);
> >               unlink_file_vma(vma);
> >
> > @@ -427,6 +429,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                              && !is_vm_hugetlb_page(next)) {
> >                               vma = next;
> >                               next = vma->vm_next;
> > +                             if (lock_vma)
> > +                                     vma_mark_locked(vma);
> >                               unlink_anon_vmas(vma);
> >                               unlink_file_vma(vma);
> >                       }
> > @@ -1631,12 +1635,16 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> >  void unmap_page_range(struct mmu_gather *tlb,
> >                            struct vm_area_struct *vma,
> >                            unsigned long addr, unsigned long end,
> > -                          struct zap_details *details)
> > +                          struct zap_details *details,
> > +                          bool lock_vma)
> >  {
> >       pgd_t *pgd;
> >       unsigned long next;
> >
> >       BUG_ON(addr >= end);
> > +     if (lock_vma)
> > +             vma_mark_locked(vma);
>
> I'm wondering if that is really needed here.
> The following processing is only dealing with the page table entries.
> Today, even if this could be called without holding the mmap_lock, it should
> be safe not to mark the VMA locked (indeed the VMA itself is not impacted).
>
> Thus unmap_single_vma() below does not need to be touched, nor do its callers.
>
> In case locking is required, I think there is a real potential issue
> in the current kernel.

IIUC you are suggesting doing the locking at the callers that need it?
If so, I'll need to carefully review the callers before changing this
because the timing when we lock might make a difference here.
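
A sketch of that alternative (illustrative only, details elided): keep
unmap_page_range() untouched and let its caller decide, for example in
unmap_single_vma():

	static void unmap_single_vma(struct mmu_gather *tlb,
			struct vm_area_struct *vma, unsigned long start_addr,
			unsigned long end_addr,
			struct zap_details *details, bool lock_vma)
	{
		...
		if (lock_vma)
			vma_mark_locked(vma);	/* mark here, not in unmap_page_range() */
		unmap_page_range(tlb, vma, start, end, details);
		...
	}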

>
> > +
> >       tlb_start_vma(tlb, vma);
> >       pgd = pgd_offset(vma->vm_mm, addr);
> >       do {
> > @@ -1652,7 +1660,7 @@ void unmap_page_range(struct mmu_gather *tlb,
> >  static void unmap_single_vma(struct mmu_gather *tlb,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> >               unsigned long end_addr,
> > -             struct zap_details *details)
> > +             struct zap_details *details, bool lock_vma)
> >  {
> >       unsigned long start = max(vma->vm_start, start_addr);
> >       unsigned long end;
> > @@ -1691,7 +1699,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >                               i_mmap_unlock_write(vma->vm_file->f_mapping);
> >                       }
> >               } else
> > -                     unmap_page_range(tlb, vma, start, end, details);
> > +                     unmap_page_range(tlb, vma, start, end, details, lock_vma);
> >       }
> >  }
> >
> > @@ -1715,7 +1723,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >   */
> >  void unmap_vmas(struct mmu_gather *tlb,
> >               struct vm_area_struct *vma, unsigned long start_addr,
> > -             unsigned long end_addr)
> > +             unsigned long end_addr, bool lock_vma)
> >  {
> >       struct mmu_notifier_range range;
> >       struct zap_details details = {
> > @@ -1728,7 +1736,8 @@ void unmap_vmas(struct mmu_gather *tlb,
> >                               start_addr, end_addr);
> >       mmu_notifier_invalidate_range_start(&range);
> >       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > -             unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> > +             unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
> > +                              lock_vma);
> >       mmu_notifier_invalidate_range_end(&range);
> >  }
> >
> > @@ -1753,7 +1762,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> >       update_hiwater_rss(vma->vm_mm);
> >       mmu_notifier_invalidate_range_start(&range);
> >       for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
> > -             unmap_single_vma(&tlb, vma, start, range.end, NULL);
> > +             unmap_single_vma(&tlb, vma, start, range.end, NULL, false);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_finish_mmu(&tlb);
> >  }
> > @@ -1768,7 +1777,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> >   * The range must fit into one VMA.
> >   */
> >  static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > -             unsigned long size, struct zap_details *details)
> > +             unsigned long size, struct zap_details *details, bool lock_vma)
> >  {
> >       struct mmu_notifier_range range;
> >       struct mmu_gather tlb;
> > @@ -1779,7 +1788,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> >       tlb_gather_mmu(&tlb, vma->vm_mm);
> >       update_hiwater_rss(vma->vm_mm);
> >       mmu_notifier_invalidate_range_start(&range);
> > -     unmap_single_vma(&tlb, vma, address, range.end, details);
> > +     unmap_single_vma(&tlb, vma, address, range.end, details, lock_vma);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_finish_mmu(&tlb);
> >  }
> > @@ -1802,7 +1811,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> >                       !(vma->vm_flags & VM_PFNMAP))
> >               return;
> >
> > -     zap_page_range_single(vma, address, size, NULL);
> > +     zap_page_range_single(vma, address, size, NULL, true);
> >  }
> >  EXPORT_SYMBOL_GPL(zap_vma_ptes);
> >
> > @@ -3483,7 +3492,8 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
> >               unsigned long start_addr, unsigned long end_addr,
> >               struct zap_details *details)
> >  {
> > -     zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
> > +     zap_page_range_single(vma, start_addr, end_addr - start_addr, details,
> > +                           false);
> >  }
> >
> >  static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 121544fd90de..094678b4434b 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -79,7 +79,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
> >
> >  static void unmap_region(struct mm_struct *mm,
> >               struct vm_area_struct *vma, struct vm_area_struct *prev,
> > -             unsigned long start, unsigned long end);
> > +             unsigned long start, unsigned long end, bool lock_vma);
> >
> >  static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
> >  {
> > @@ -1866,7 +1866,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >       vma->vm_file = NULL;
> >
> >       /* Undo any partial mapping done by a device driver. */
> > -     unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
> > +     unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end, true);
> >       if (vm_flags & VM_SHARED)
> >               mapping_unmap_writable(file->f_mapping);
> >  free_vma:
> > @@ -2626,7 +2626,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
> >   */
> >  static void unmap_region(struct mm_struct *mm,
> >               struct vm_area_struct *vma, struct vm_area_struct *prev,
> > -             unsigned long start, unsigned long end)
> > +             unsigned long start, unsigned long end, bool lock_vma)
> >  {
> >       struct vm_area_struct *next = vma_next(mm, prev);
> >       struct mmu_gather tlb;
> > @@ -2634,9 +2634,10 @@ static void unmap_region(struct mm_struct *mm,
> >       lru_add_drain();
> >       tlb_gather_mmu(&tlb, mm);
> >       update_hiwater_rss(mm);
> > -     unmap_vmas(&tlb, vma, start, end);
> > +     unmap_vmas(&tlb, vma, start, end, lock_vma);
> >       free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > -                              next ? next->vm_start : USER_PGTABLES_CEILING);
> > +                              next ? next->vm_start : USER_PGTABLES_CEILING,
> > +                              lock_vma);
> >       tlb_finish_mmu(&tlb);
> >  }
> >
> > @@ -2849,7 +2850,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> >       if (downgrade)
> >               mmap_write_downgrade(mm);
> >
> > -     unmap_region(mm, vma, prev, start, end);
> > +     unmap_region(mm, vma, prev, start, end, !downgrade);
> >
> >       /* Fix up all other VM information */
> >       remove_vma_list(mm, vma);
> > @@ -3129,8 +3130,8 @@ void exit_mmap(struct mm_struct *mm)
> >       tlb_gather_mmu_fullmm(&tlb, mm);
> >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > -     unmap_vmas(&tlb, vma, 0, -1);
> > -     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > +     unmap_vmas(&tlb, vma, 0, -1, true);
> > +     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING, true);
> >       tlb_finish_mmu(&tlb);
> >
> >       /* Walk the list again, actually closing and freeing it. */
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 3c6cf9e3cd66..6ffa7c511aa3 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
> >                               ret = false;
> >                               continue;
> >                       }
> > -                     unmap_page_range(&tlb, vma, range.start, range.end, NULL);
> > +                     unmap_page_range(&tlb, vma, range.start, range.end,
> > +                                      NULL, false);
> >                       mmu_notifier_invalidate_range_end(&range);
> >                       tlb_finish_mmu(&tlb);
> >               }
>
> I'm wondering if the VMA locking should be done here instead of inside
> unmap_page_range() which is not really touching the VMA's fields.
>
> Here this would be needed because the page fault handler may check the
> MMF_UNSTABLE flag and the VMA's locking before this loop is entered by
> another thread.

Hmm. I'll double-check. Before my patchset __oom_reap_task_mm is
called with mmap_lock held for read, therefore technically it can race
with page fault handlers. There must be something that makes it safe.
Will try to find out what that something is...

>


* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-01 23:26   ` Suren Baghdasaryan
@ 2022-09-11  9:35     ` Vlastimil Babka
  2022-09-28  2:28       ` Suren Baghdasaryan
  0 siblings, 1 reply; 91+ messages in thread
From: Vlastimil Babka @ 2022-09-11  9:35 UTC (permalink / raw)
  To: Suren Baghdasaryan, Kent Overstreet
  Cc: Andrew Morton, Michel Lespinasse, Jerome Glisse, Michal Hocko,
	Johannes Weiner, Mel Gorman, Davidlohr Bueso, Matthew Wilcox,
	Liam R. Howlett, Peter Zijlstra, Laurent Dufour, Laurent Dufour,
	Paul E . McKenney, Andy Lutomirski, Song Liu, Peter Xu,
	David Hildenbrand, dhowells, Hugh Dickins, bigeasy,
	David Rientjes, Axel Rasmussen, Joel Fernandes, Minchan Kim,
	kernel-team, linux-mm, linux-arm-kernel, linuxppc-dev, x86, LKML

On 9/2/22 01:26, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
>>
>> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
>> > Resending to fix the issue with the In-Reply-To tag in the original
>> > submission at [4].
>> >
>> > This is a proof of concept for per-vma locks idea that was discussed
>> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
>> > suggestion that “a reader/writer semaphore could be put into the VMA
>> > itself; that would have the effect of using the VMA as a sort of range
>> > lock. There would still be contention at the VMA level, but it would be an
>> > improvement.” This patchset implements this suggested approach.
>> >
>> > When handling page faults we lookup the VMA that contains the faulting
>> > page under RCU protection and try to acquire its lock. If that fails we
>> > fall back to using mmap_lock, similar to how SPF handled this situation.
>> >
>> > One notable way the implementation deviates from the proposal is the way
>> > VMAs are marked as locked. Because during some of mm updates multiple
>> > VMAs need to be locked until the end of the update (e.g. vma_merge,
>> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
>> > and other complications would make the code more complex. Therefore we
>> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
>> > all at once. This is done using two sequence numbers - one in the
>> > vm_area_struct and one in the mm_struct. VMA is considered locked when
>> > these sequence numbers are equal. To mark a VMA as locked we set the
>> > sequence number in vm_area_struct to be equal to the sequence number
>> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
>> > This allows for an efficient way to track locked VMAs and to drop the
>> > locks on all VMAs at the end of the update.
>>
>> I like it - the sequence numbers are a stroke of genius. For what it's doing
>> the patchset seems almost small.
> 
> Thanks for reviewing it!
> 
>>
>> Two complaints so far:
>>  - I don't like the vma_mark_locked() name. To me it says that the caller
>>    already took or is taking the lock and this function is just marking that
>>    we're holding the lock, but it's really taking a different type of lock. But
>>    this function can block, it really is taking a lock, so it should say that.
>>
>>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>>    but perhaps vma_lock_multiple()?
> 
> I'm open to name suggestions but vma_lock_multiple() is a bit
> confusing to me. Will wait for more suggestions.

Well, it does act like a vma_write_lock(), no? So why not that name? The
checking function for it is even called vma_assert_write_locked().

We just don't provide a single vma_write_unlock(), but a
vma_mark_unlocked_all(), which could instead be named e.g.
vma_write_unlock_all().
But it's called on an mm, so maybe e.g. mm_vma_write_unlock_all()?
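
For reference, a simplified sketch of the sequence-number scheme under
discussion, written with the vma_write_lock() / vma_write_unlock_all() names
floated above. The field names (vm_lock, vm_lock_seq, mm_lock_seq) and the
exact fencing are assumptions made only for illustration, not the code from
the patchset:

	/* assumed fields: vma->vm_lock (rw_semaphore), vma->vm_lock_seq,
	 * mm->mm_lock_seq */
	static inline void vma_write_lock(struct vm_area_struct *vma)
	{
		/* callers modify VMAs with the mmap_lock held for write */
		mmap_assert_write_locked(vma->vm_mm);

		/* already marked during this mm update? nothing to do */
		if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
			return;

		/* wait for lock-holding readers (page faults) to drain,
		 * then mark the VMA as write-locked for this update */
		down_write(&vma->vm_lock);
		vma->vm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
		up_write(&vma->vm_lock);
	}

	static inline void vma_write_unlock_all(struct mm_struct *mm)
	{
		mmap_assert_write_locked(mm);
		/* bumping the mm sequence number unmarks every VMA at once */
		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}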




* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-11  9:35     ` Vlastimil Babka
@ 2022-09-28  2:28       ` Suren Baghdasaryan
  2022-09-29 11:18         ` Vlastimil Babka
  0 siblings, 1 reply; 91+ messages in thread
From: Suren Baghdasaryan @ 2022-09-28  2:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kent Overstreet, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Michal Hocko, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, David Rientjes, Axel Rasmussen,
	Joel Fernandes, Minchan Kim, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, LKML

On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/2/22 01:26, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> >>
> >> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> >> > Resending to fix the issue with the In-Reply-To tag in the original
> >> > submission at [4].
> >> >
> >> > This is a proof of concept for per-vma locks idea that was discussed
> >> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> >> > suggestion that “a reader/writer semaphore could be put into the VMA
> >> > itself; that would have the effect of using the VMA as a sort of range
> >> > lock. There would still be contention at the VMA level, but it would be an
> >> > improvement.” This patchset implements this suggested approach.
> >> >
> >> > When handling page faults we lookup the VMA that contains the faulting
> >> > page under RCU protection and try to acquire its lock. If that fails we
> >> > fall back to using mmap_lock, similar to how SPF handled this situation.
> >> >
> >> > One notable way the implementation deviates from the proposal is the way
> >> > VMAs are marked as locked. Because during some of mm updates multiple
> >> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> >> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> >> > and other complications would make the code more complex. Therefore we
> >> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> >> > all at once. This is done using two sequence numbers - one in the
> >> > vm_area_struct and one in the mm_struct. VMA is considered locked when
> >> > these sequence numbers are equal. To mark a VMA as locked we set the
> >> > sequence number in vm_area_struct to be equal to the sequence number
> >> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> >> > This allows for an efficient way to track locked VMAs and to drop the
> >> > locks on all VMAs at the end of the update.
> >>
> >> I like it - the sequence numbers are a stroke of genius. For what it's doing
> >> the patchset seems almost small.
> >
> > Thanks for reviewing it!
> >
> >>
> >> Two complaints so far:
> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
> >>    already took or is taking the lock and this function is just marking that
> >>    we're holding the lock, but it's really taking a different type of lock. But
> >>    this function can block, it really is taking a lock, so it should say that.
> >>
> >>    This is AFAIK a new concept, not sure I'm going to have anything good either,
> >>    but perhaps vma_lock_multiple()?
> >
> > I'm open to name suggestions but vma_lock_multiple() is a bit
> > confusing to me. Will wait for more suggestions.
>
> Well, it does act like a vma_write_lock(), no? So why not that name? The
> checking function for it is even called vma_assert_write_locked().
>
> We just don't provide a single vma_write_unlock(), but a
> vma_mark_unlocked_all(), which could instead be named e.g.
> vma_write_unlock_all().
> But it's called on an mm, so maybe e.g. mm_vma_write_unlock_all()?

Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
good to me. As a replacement for vma_mark_unlocked_all(), I would prefer
vma_write_unlock_all(), which keeps the vma_write_XXX naming pattern and
indicates that these functions operate on the same locks. If the fact that
it accepts an mm_struct as a parameter is an issue, then maybe
vma_write_unlock_mm()?

>
>


* Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal
  2022-09-28  2:28       ` Suren Baghdasaryan
@ 2022-09-29 11:18         ` Vlastimil Babka
  0 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2022-09-29 11:18 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Kent Overstreet, Andrew Morton, Michel Lespinasse, Jerome Glisse,
	Michal Hocko, Johannes Weiner, Mel Gorman, Davidlohr Bueso,
	Matthew Wilcox, Liam R. Howlett, Peter Zijlstra, Laurent Dufour,
	Laurent Dufour, Paul E . McKenney, Andy Lutomirski, Song Liu,
	Peter Xu, David Hildenbrand, dhowells, Hugh Dickins,
	Sebastian Andrzej Siewior, David Rientjes, Axel Rasmussen,
	Joel Fernandes, Minchan Kim, kernel-team, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, LKML

On 9/28/22 04:28, Suren Baghdasaryan wrote:
> On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 9/2/22 01:26, Suren Baghdasaryan wrote:
>> >
>> >>
>> >> Two complaints so far:
>> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
>> >>    already took or is taking the lock and this function is just marking that
>> >>    we're holding the lock, but it's really taking a different type of lock. But
>> >>    this function can block, it really is taking a lock, so it should say that.
>> >>
>> >>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>> >>    but perhaps vma_lock_multiple()?
>> >
>> > I'm open to name suggestions but vma_lock_multiple() is a bit
>> > confusing to me. Will wait for more suggestions.
>>
>> Well, it does act like a vma_write_lock(), no? So why not that name? The
>> checking function for it is even called vma_assert_write_locked().
>>
>> We just don't provide a single vma_write_unlock(), but a
>> vma_mark_unlocked_all(), which could instead be named e.g.
>> vma_write_unlock_all().
>> But it's called on an mm, so maybe e.g. mm_vma_write_unlock_all()?
> 
> Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
> good to me. As a replacement for vma_mark_unlocked_all(), I would prefer
> vma_write_unlock_all(), which keeps the vma_write_XXX naming pattern and

OK.

> indicates that these functions operate on the same locks. If the fact that
> it accepts an mm_struct as a parameter is an issue, then maybe
> vma_write_unlock_mm()?

Sounds good!
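
For reference, a sketch of the helper prototypes as they would look with the
naming agreed on in this subthread; these are illustrative declarations only,
not the final interface of any later revision of the series:

	void vma_write_lock(struct vm_area_struct *vma);      /* was vma_mark_locked() */
	void vma_write_unlock_mm(struct mm_struct *mm);       /* was vma_mark_unlocked_all() */
	void vma_assert_write_locked(struct vm_area_struct *vma);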

>>
>>



end of thread

Thread overview: 91+ messages
2022-09-01 17:34 [RFC PATCH RESEND 00/28] per-VMA locks proposal Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 01/28] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 02/28] mm: rcu safe VMA freeing Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 03/28] mm: introduce __find_vma to be used without mmap_lock protection Suren Baghdasaryan
2022-09-01 20:22   ` Kent Overstreet
2022-09-01 23:18     ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 04/28] mm: move mmap_lock assert function definitions Suren Baghdasaryan
2022-09-01 20:24   ` Kent Overstreet
2022-09-01 20:51     ` Liam Howlett
2022-09-01 23:21       ` Suren Baghdasaryan
2022-09-02  6:23     ` Sebastian Andrzej Siewior
2022-09-02 17:46       ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 05/28] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
2022-09-06 13:46   ` Laurent Dufour
2022-09-06 17:24     ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 06/28] mm: mark VMA as locked whenever vma->vm_flags are modified Suren Baghdasaryan
2022-09-06 14:26   ` Laurent Dufour
2022-09-06 19:00     ` Suren Baghdasaryan
2022-09-06 20:00       ` Liam Howlett
2022-09-06 20:13         ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 07/28] kernel/fork: mark VMAs as locked before copying pages during fork Suren Baghdasaryan
2022-09-06 14:37   ` Laurent Dufour
2022-09-08 23:57     ` Suren Baghdasaryan
2022-09-09 13:27       ` Laurent Dufour
2022-09-09 16:29         ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 08/28] mm/khugepaged: mark VMA as locked while collapsing a hugepage Suren Baghdasaryan
2022-09-06 14:43   ` Laurent Dufour
2022-09-09  0:15     ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 09/28] mm/mempolicy: mark VMA as locked when changing protection policy Suren Baghdasaryan
2022-09-06 14:47   ` Laurent Dufour
2022-09-09  0:27     ` Suren Baghdasaryan
2022-09-01 17:34 ` [RFC PATCH RESEND 10/28] mm/mmap: mark VMAs as locked in vma_adjust Suren Baghdasaryan
2022-09-06 15:35   ` Laurent Dufour
2022-09-09  0:51     ` Suren Baghdasaryan
2022-09-09 15:52       ` Laurent Dufour
2022-09-01 17:34 ` [RFC PATCH RESEND 11/28] mm/mmap: mark VMAs as locked before merging or splitting them Suren Baghdasaryan
2022-09-06 15:44   ` Laurent Dufour
2022-09-01 17:35 ` [RFC PATCH RESEND 12/28] mm/mremap: mark VMA as locked while remapping it to a new address range Suren Baghdasaryan
2022-09-06 16:09   ` Laurent Dufour
2022-09-01 17:35 ` [RFC PATCH RESEND 13/28] mm: conditionally mark VMA as locked in free_pgtables and unmap_page_range Suren Baghdasaryan
2022-09-09 10:33   ` Laurent Dufour
2022-09-09 16:43     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 14/28] mm: mark VMAs as locked before isolating them Suren Baghdasaryan
2022-09-09 13:35   ` Laurent Dufour
2022-09-09 16:28     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 15/28] mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped area Suren Baghdasaryan
2022-09-09 13:43   ` Laurent Dufour
2022-09-09 16:25     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 16/28] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
2022-09-09 13:56   ` Laurent Dufour
2022-09-09 16:19     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 17/28] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
2022-09-09 14:20   ` Laurent Dufour
2022-09-09 16:12     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 18/28] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
2022-09-09 14:26   ` Laurent Dufour
2022-09-01 17:35 ` [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock Suren Baghdasaryan
2022-09-06 19:39   ` Peter Xu
2022-09-06 20:08     ` Suren Baghdasaryan
2022-09-06 20:22       ` Peter Xu
2022-09-07  0:58         ` Suren Baghdasaryan
2022-09-09 14:26   ` Laurent Dufour
2022-09-01 17:35 ` [RFC PATCH RESEND 20/28] mm: introduce per-VMA lock statistics Suren Baghdasaryan
2022-09-09 14:28   ` Laurent Dufour
2022-09-09 16:11     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 21/28] mm: introduce find_and_lock_anon_vma to be used from arch-specific code Suren Baghdasaryan
2022-09-09 14:38   ` Laurent Dufour
2022-09-09 16:10     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 22/28] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 23/28] x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
2022-09-01 20:20   ` Kent Overstreet
2022-09-01 23:17     ` Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 24/28] arm64/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 25/28] arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 26/28] powerc/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 27/28] powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK Suren Baghdasaryan
2022-09-01 17:35 ` [RFC PATCH RESEND 28/28] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
2022-09-09 15:19   ` Laurent Dufour
2022-09-09 16:02     ` Suren Baghdasaryan
2022-09-09 16:14       ` Laurent Dufour
2022-09-01 20:58 ` [RFC PATCH RESEND 00/28] per-VMA locks proposal Kent Overstreet
2022-09-01 23:26   ` Suren Baghdasaryan
2022-09-11  9:35     ` Vlastimil Babka
2022-09-28  2:28       ` Suren Baghdasaryan
2022-09-29 11:18         ` Vlastimil Babka
2022-09-02  7:42 ` Peter Zijlstra
2022-09-02 14:45   ` Suren Baghdasaryan
2022-09-05 12:32 ` Michal Hocko
2022-09-05 18:32   ` Suren Baghdasaryan
2022-09-05 20:35     ` Kent Overstreet
2022-09-06 15:46       ` Suren Baghdasaryan
