* [PATCH 00/41] Per-VMA locks
@ 2023-01-09 20:52 Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 01/41] maple_tree: Be more cautious about dead nodes Suren Baghdasaryan
                   ` (40 more replies)
  0 siblings, 41 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:52 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

This is v1 of the per-VMA locks patchset, originally posted as an RFC at
[1] and described in an LWN article at [2]. The per-VMA locks idea was
discussed during the SPF [3] session at LSF/MM this year [4], which
concluded with the suggestion that “a reader/writer semaphore could be put
into the VMA itself; that would have the effect of using the VMA as a
sort of range lock. There would still be contention at the VMA level, but
it would be an improvement.” This patchset implements that suggested
approach.

When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
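
To make that flow concrete, here is a rough sketch of what an arch page
fault handler ends up doing with this scheme. It is illustrative only, not
code from this series: lock_vma_under_rcu() and FAULT_FLAG_VMA_LOCK are
introduced later in the series, vma_read_unlock() is a placeholder name,
and stack expansion, retry and error handling are omitted.

static vm_fault_t fault_try_vma_lock(struct mm_struct *mm, unsigned long addr,
				     unsigned int flags, struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t ret;

	/* Fast path: find and read-lock the VMA without touching mmap_lock. */
	vma = lock_vma_under_rcu(mm, addr);
	if (!vma)
		goto fallback;

	ret = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_read_unlock(vma);	/* drop the per-VMA lock */
	return ret;

fallback:
	/* Slow path: classic fault handling under mmap_lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, addr);
	ret = vma ? handle_mm_fault(vma, addr, flags, regs) : VM_FAULT_SIGSEGV;
	mmap_read_unlock(mm);
	return ret;
}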

One notable way the implementation deviates from the proposal is the way
VMAs are write-locked. During some mm updates, multiple VMAs need to stay
locked until the end of the update (e.g. vma_merge, split_vma, etc).
Tracking all the locked VMAs, avoiding recursive locks and figuring out
when it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is strictly necessary, but it results in a big reduction in
code complexity.
Write-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. A VMA is considered locked when
these sequence numbers are equal. To write-lock a VMA we set its sequence
number in vm_area_struct to be equal to the sequence number in mm_struct.
To unlock all VMAs we increment mm_struct's sequence number. This provides
an efficient way to track locked VMAs and to drop the locks on all of them
at the end of the update, as sketched below.
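
A minimal sketch of this sequence-number bookkeeping follows. The field
and helper names are illustrative; the real implementation (patch 12/41)
additionally pairs this with a per-VMA lock and the memory barriers that
readers need.

/* Mark one VMA as locked; the caller must hold mmap_lock for write. */
static inline void vma_mark_locked(struct vm_area_struct *vma)
{
	mmap_assert_write_locked(vma->vm_mm);
	WRITE_ONCE(vma->vm_lock_seq, READ_ONCE(vma->vm_mm->mm_lock_seq));
}

/* A VMA is considered locked iff its sequence number matches the mm's. */
static inline bool vma_is_locked(struct vm_area_struct *vma)
{
	return READ_ONCE(vma->vm_lock_seq) ==
	       READ_ONCE(vma->vm_mm->mm_lock_seq);
}

/* Drop the locks on all VMAs at once; called from mmap_write_unlock(). */
static inline void vma_unlock_all(struct mm_struct *mm)
{
	/* Bumping the mm-wide sequence number unmatches every locked VMA. */
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}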

The patchset implements per-VMA locking only for anonymous pages which
are not in swap, and it avoids userfaultfds, as their implementation is
more complex. Additional support for file-backed page faults, swapped-out
pages and userfaultfds can be added incrementally.
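
The sketch below shows the kind of fallback checks this implies in the
fault path. It is illustrative only; the actual checks are spread over
several later patches in the series (FAULT_FLAG_VMA_LOCK, the anon_vma,
do_swap_page and userfaultfd bail-outs listed further down), and the
helper name is made up.

/* Return true if this fault must fall back to mmap_lock handling. */
static bool fault_needs_mmap_lock(struct vm_area_struct *vma,
				  unsigned int flags, pte_t orig_pte)
{
	if (!(flags & FAULT_FLAG_VMA_LOCK))
		return false;	/* already running under mmap_lock */
	if (!vma_is_anonymous(vma))
		return true;	/* file-backed faults: not supported yet */
	if (!vma->anon_vma)
		return true;	/* anon_vma not set up yet */
	if (userfaultfd_armed(vma))
		return true;	/* userfaultfd handling is more complex */
	if (!pte_none(orig_pte) && !pte_present(orig_pte))
		return true;	/* swap (non-present) entry: not supported yet */
	return false;
}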

Performance benchmarks show similar, although slightly smaller, benefits
than the SPF patchset (~75% of the SPF benefits). Still, with its lower
complexity this approach may be more desirable.

Since the RFC was posted in September 2022, two separate Google teams
outside of Android evaluated the patchset and confirmed positive results.
Here are the known use cases where per-VMA locks show benefits:

Android:
Launch times of apps with a high number of threads (~100) improve by up to
20%. Each thread mmaps several areas upon startup (stack and thread-local
storage (TLS), thread signal stack, indirect ref table), which requires
taking mmap_lock in write mode. Page faults take mmap_lock in read mode.
During app launch, thread creation and the page faults establishing the
active working set happen in parallel, and that causes lock contention
between mm writers and readers even if the updates and page faults are
happening in different VMAs. Per-VMA locks prevent this contention by
providing a more granular lock.

Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrives, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section on its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that it can be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup. The per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation was no longer a bottleneck.

TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-VMA lock patch is
massively beneficial.
In today's implementation, in a process with N threads where N - 1 are
performing zerocopy receive and 1 thread is performing madvise() with the
write lock taken (e.g. because it needs to change vm_flags), all N - 1
receive threads block until the madvise is done. Conversely, in a busy
process receiving a lot of data, a madvise operation that does need to
take the mmap lock in write mode will need to wait for all of the receives
to be done - a lose:lose proposition. Per-VMA locking removes this source
of contention entirely, by definition.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC-style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the per-vma
lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-VMA patchset and a trivial
patch written to take advantage of it in TCP, is about 0.4% of cycles
(down from 0.7% above).
2b) mmap_read_lock + mmap_read_unlock in the per-VMA patchset is < 0.1% of
cycles and is out of the TCP read hotpath entirely (down from 0.5% before;
the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention, it
also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.
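
For reference, the "trivial patch" mentioned in 2a above is not included in
this series. The sketch that follows only illustrates the general shape
such a change could take; lock_vma_under_rcu() is the helper introduced
later in the series, while tcp_zc_lock_vma(), its signature and the
caller-side unlock convention are assumptions made purely for illustration.

/* Find the VMA for a zerocopy receive, preferring the per-VMA lock. */
static struct vm_area_struct *tcp_zc_lock_vma(struct mm_struct *mm,
					      unsigned long address,
					      bool *per_vma_locked)
{
	struct vm_area_struct *vma;

	/* Fast path: per-VMA read lock, mmap_lock cacheline untouched. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		*per_vma_locked = true;
		return vma;
	}

	/* Slow path: fall back to mmap_lock, as the code does today. */
	*per_vma_locked = false;
	mmap_read_lock(mm);
	return vma_lookup(mm, address);
}

The caller would then drop either the per-VMA lock or mmap_lock depending
on *per_vma_locked.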

The patchset structure is:
0001-0007: Enable maple-tree RCU mode
0008-0038: Main per-vma locks patchset
0039-0040: Performance optimizations
0041: Memory overhead optimization

Branch for testing is posted at:
https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock

The patchset applies cleanly over Linus' tree at:
commit b7bfaa761d76 ("Linux 6.2-rc3")

[1] https://lore.kernel.org/all/20220901173516.702122-1-surenb@google.com/
[2] https://lwn.net/Articles/906852/
[3] https://lore.kernel.org/all/20220128131006.67712-1-michel@lespinasse.org/
[4] https://lwn.net/Articles/893906/

Laurent Dufour (1):
  powerpc/mm: try VMA lock-based page fault handling first

Liam Howlett (4):
  maple_tree: Be more cautious about dead nodes
  maple_tree: Detect dead nodes in mas_start()
  maple_tree: Fix freeing of nodes in rcu mode
  maple_tree: remove extra smp_wmb() from mas_dead_leaves()

Liam R. Howlett (3):
  maple_tree: Fix write memory barrier of nodes once dead for RCU mode
  maple_tree: Add smp_rmb() to dead node detection
  mm: Enable maple tree RCU mode by default.

Michel Lespinasse (1):
  mm: rcu safe VMA freeing

Suren Baghdasaryan (32):
  mm: introduce CONFIG_PER_VMA_LOCK
  mm: move mmap_lock assert function definitions
  mm: export dump_mm()
  mm: add per-VMA lock and helper functions to control it
  mm: introduce vma->vm_flags modifier functions
  mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
  mm: replace vma->vm_flags direct modifications with modifier calls
  mm: replace vma->vm_flags indirect modification in ksm_madvise
  mm/mmap: move VMA locking before anon_vma_lock_write call
  mm/khugepaged: write-lock VMA while collapsing a huge page
  mm/mmap: write-lock VMAs before merging, splitting or expanding them
  mm/mmap: write-lock VMAs in vma_adjust
  mm/mmap: write-lock VMAs affected by VMA expansion
  mm/mremap: write-lock VMA while remapping it to a new address range
  mm: write-lock VMAs before removing them from VMA tree
  mm: conditionally write-lock VMA in free_pgtables
  mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
  kernel/fork: assert no VMA readers during its destruction
  mm/mmap: prevent pagefault handler from racing with mmu_notifier
    registration
  mm: introduce lock_vma_under_rcu to be used from arch-specific code
  mm: fall back to mmap_lock if vma->anon_vma is not yet set
  mm: add FAULT_FLAG_VMA_LOCK flag
  mm: prevent do_swap_page from handling page faults under VMA lock
  mm: prevent userfaults to be handled under per-vma lock
  mm: introduce per-VMA lock statistics
  x86/mm: try VMA lock-based page fault handling first
  arm64/mm: try VMA lock-based page fault handling first
  mm: introduce mod_vm_flags_nolock
  mm: avoid assertion in untrack_pfn
  kernel/fork: throttle call_rcu() calls in vm_area_free
  mm: separate vma->lock from vm_area_struct
  mm: replace rw_semaphore with atomic_t in vma_lock

 arch/arm/kernel/process.c                     |   2 +-
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/fault.c                         |  36 ++++
 arch/ia64/mm/init.c                           |   8 +-
 arch/loongarch/include/asm/tlb.h              |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c            |   5 +-
 arch/powerpc/kvm/book3s_xive_native.c         |   2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |   2 +-
 arch/powerpc/mm/fault.c                       |  41 +++++
 arch/powerpc/platforms/book3s/vas-api.c       |   2 +-
 arch/powerpc/platforms/cell/spufs/file.c      |  14 +-
 arch/powerpc/platforms/powernv/Kconfig        |   1 +
 arch/powerpc/platforms/pseries/Kconfig        |   1 +
 arch/s390/mm/gmap.c                           |   8 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/vsyscall/vsyscall_64.c         |   2 +-
 arch/x86/kernel/cpu/sgx/driver.c              |   2 +-
 arch/x86/kernel/cpu/sgx/virt.c                |   2 +-
 arch/x86/mm/fault.c                           |  36 ++++
 arch/x86/mm/pat/memtype.c                     |  14 +-
 arch/x86/um/mem_32.c                          |   2 +-
 drivers/acpi/pfr_telemetry.c                  |   2 +-
 drivers/android/binder.c                      |   3 +-
 drivers/char/mspec.c                          |   2 +-
 drivers/crypto/hisilicon/qm.c                 |   2 +-
 drivers/dax/device.c                          |   2 +-
 drivers/dma/idxd/cdev.c                       |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |   4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c     |   4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |   4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   4 +-
 drivers/gpu/drm/drm_gem.c                     |   2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c          |   3 +-
 drivers/gpu/drm/drm_gem_shmem_helper.c        |   2 +-
 drivers/gpu/drm/drm_vm.c                      |   8 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c         |   2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c       |   4 +-
 drivers/gpu/drm/gma500/framebuffer.c          |   2 +-
 drivers/gpu/drm/i810/i810_dma.c               |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c      |   4 +-
 drivers/gpu/drm/mediatek/mtk_drm_gem.c        |   2 +-
 drivers/gpu/drm/msm/msm_gem.c                 |   2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c            |   3 +-
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c   |   3 +-
 drivers/gpu/drm/tegra/gem.c                   |   5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c               |   3 +-
 drivers/gpu/drm/virtio/virtgpu_vram.c         |   2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |   2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c       |   3 +-
 drivers/hsi/clients/cmt_speech.c              |   2 +-
 drivers/hwtracing/intel_th/msu.c              |   2 +-
 drivers/hwtracing/stm/core.c                  |   2 +-
 drivers/infiniband/hw/hfi1/file_ops.c         |   4 +-
 drivers/infiniband/hw/mlx5/main.c             |   4 +-
 drivers/infiniband/hw/qib/qib_file_ops.c      |  13 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |   2 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |   2 +-
 .../common/videobuf2/videobuf2-dma-contig.c   |   2 +-
 .../common/videobuf2/videobuf2-vmalloc.c      |   2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |   2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c     |   4 +-
 drivers/media/v4l2-core/videobuf-vmalloc.c    |   2 +-
 drivers/misc/cxl/context.c                    |   2 +-
 drivers/misc/habanalabs/common/memory.c       |   2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c         |   4 +-
 drivers/misc/habanalabs/gaudi2/gaudi2.c       |   8 +-
 drivers/misc/habanalabs/goya/goya.c           |   4 +-
 drivers/misc/ocxl/context.c                   |   4 +-
 drivers/misc/ocxl/sysfs.c                     |   2 +-
 drivers/misc/open-dice.c                      |   6 +-
 drivers/misc/sgi-gru/grufile.c                |   4 +-
 drivers/misc/uacce/uacce.c                    |   2 +-
 drivers/sbus/char/oradax.c                    |   2 +-
 drivers/scsi/cxlflash/ocxl_hw.c               |   2 +-
 drivers/scsi/sg.c                             |   2 +-
 .../staging/media/atomisp/pci/hmm/hmm_bo.c    |   2 +-
 drivers/staging/media/deprecated/meye/meye.c  |   4 +-
 .../media/deprecated/stkwebcam/stk-webcam.c   |   2 +-
 drivers/target/target_core_user.c             |   2 +-
 drivers/uio/uio.c                             |   2 +-
 drivers/usb/core/devio.c                      |   3 +-
 drivers/usb/mon/mon_bin.c                     |   3 +-
 drivers/vdpa/vdpa_user/iova_domain.c          |   2 +-
 drivers/vfio/pci/vfio_pci_core.c              |   2 +-
 drivers/vhost/vdpa.c                          |   2 +-
 drivers/video/fbdev/68328fb.c                 |   2 +-
 drivers/video/fbdev/core/fb_defio.c           |   4 +-
 drivers/xen/gntalloc.c                        |   2 +-
 drivers/xen/gntdev.c                          |   4 +-
 drivers/xen/privcmd-buf.c                     |   2 +-
 drivers/xen/privcmd.c                         |   4 +-
 fs/aio.c                                      |   2 +-
 fs/cramfs/inode.c                             |   2 +-
 fs/erofs/data.c                               |   2 +-
 fs/exec.c                                     |   4 +-
 fs/ext4/file.c                                |   2 +-
 fs/fuse/dax.c                                 |   2 +-
 fs/hugetlbfs/inode.c                          |   4 +-
 fs/orangefs/file.c                            |   3 +-
 fs/proc/task_mmu.c                            |   2 +-
 fs/proc/vmcore.c                              |   3 +-
 fs/userfaultfd.c                              |  12 +-
 fs/xfs/xfs_file.c                             |   2 +-
 include/linux/mm.h                            | 159 +++++++++++++++++-
 include/linux/mm_types.h                      |  62 ++++++-
 include/linux/mmap_lock.h                     |  37 ++--
 include/linux/pgtable.h                       |   5 +-
 include/linux/vm_event_item.h                 |   6 +
 include/linux/vmstat.h                        |   6 +
 kernel/bpf/ringbuf.c                          |   4 +-
 kernel/bpf/syscall.c                          |   4 +-
 kernel/events/core.c                          |   2 +-
 kernel/fork.c                                 | 148 +++++++++++++---
 kernel/kcov.c                                 |   2 +-
 kernel/relay.c                                |   2 +-
 lib/maple_tree.c                              | 146 +++++++++++++---
 mm/Kconfig                                    |  13 ++
 mm/Kconfig.debug                              |   8 +
 mm/debug.c                                    |   1 +
 mm/hugetlb.c                                  |   4 +-
 mm/init-mm.c                                  |   8 +
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   7 +
 mm/ksm.c                                      |   2 +
 mm/madvise.c                                  |   2 +-
 mm/memory.c                                   |  94 +++++++++--
 mm/memremap.c                                 |   4 +-
 mm/mlock.c                                    |  12 +-
 mm/mmap.c                                     |  99 ++++++++---
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   9 +-
 mm/nommu.c                                    |  16 +-
 mm/secretmem.c                                |   2 +-
 mm/shmem.c                                    |   2 +-
 mm/vmalloc.c                                  |   2 +-
 mm/vmstat.c                                   |   6 +
 net/ipv4/tcp.c                                |   4 +-
 security/selinux/selinuxfs.c                  |   6 +-
 sound/core/oss/pcm_oss.c                      |   2 +-
 sound/core/pcm_native.c                       |   9 +-
 sound/soc/pxa/mmp-sspa.c                      |   2 +-
 sound/usb/usx2y/us122l.c                      |   4 +-
 sound/usb/usx2y/usX2Yhwdep.c                  |   2 +-
 sound/usb/usx2y/usx2yhwdeppcm.c               |   2 +-
 tools/testing/radix-tree/maple.c              |  16 ++
 146 files changed, 1047 insertions(+), 316 deletions(-)

-- 
2.39.0



* [PATCH 01/41] maple_tree: Be more cautious about dead nodes
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
@ 2023-01-09 20:52 ` Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 02/41] maple_tree: Detect dead nodes in mas_start() Suren Baghdasaryan
                   ` (39 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:52 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam Howlett

From: Liam Howlett <Liam.Howlett@oracle.com>

ma_pivots() and ma_data_end() may be called with a dead node.  Ensure
that the node isn't dead before using the returned values.

This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 53 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 26e2045d3cda..ff9f04e0150d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -540,6 +540,7 @@ static inline bool ma_dead_node(const struct maple_node *node)
 
 	return (parent == node);
 }
+
 /*
  * mte_dead_node() - check if the @enode is dead.
  * @enode: The encoded maple node
@@ -621,6 +622,8 @@ static inline unsigned int mas_alloc_req(const struct ma_state *mas)
  * @node - the maple node
  * @type - the node type
  *
+ * In the event of a dead node, this array may be %NULL
+ *
  * Return: A pointer to the maple node pivots
  */
 static inline unsigned long *ma_pivots(struct maple_node *node,
@@ -1091,8 +1094,11 @@ static int mas_ascend(struct ma_state *mas)
 		a_type = mas_parent_enum(mas, p_enode);
 		a_node = mte_parent(p_enode);
 		a_slot = mte_parent_slot(p_enode);
-		pivots = ma_pivots(a_node, a_type);
 		a_enode = mt_mk_node(a_node, a_type);
+		pivots = ma_pivots(a_node, a_type);
+
+		if (unlikely(ma_dead_node(a_node)))
+			return 1;
 
 		if (!set_min && a_slot) {
 			set_min = true;
@@ -1398,6 +1404,9 @@ static inline unsigned char ma_data_end(struct maple_node *node,
 {
 	unsigned char offset;
 
+	if (!pivots)
+		return 0;
+
 	if (type == maple_arange_64)
 		return ma_meta_end(node, type);
 
@@ -1433,6 +1442,9 @@ static inline unsigned char mas_data_end(struct ma_state *mas)
 		return ma_meta_end(node, type);
 
 	pivots = ma_pivots(node, type);
+	if (unlikely(ma_dead_node(node)))
+		return 0;
+
 	offset = mt_pivots[type] - 1;
 	if (likely(!pivots[offset]))
 		return ma_meta_end(node, type);
@@ -4504,6 +4516,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
 	node = mas_mn(mas);
 	slots = ma_slots(node, mt);
 	pivots = ma_pivots(node, mt);
+	if (unlikely(ma_dead_node(node)))
+		return 1;
+
 	mas->max = pivots[offset];
 	if (offset)
 		mas->min = pivots[offset - 1] + 1;
@@ -4525,6 +4540,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
 		slots = ma_slots(node, mt);
 		pivots = ma_pivots(node, mt);
 		offset = ma_data_end(node, mt, pivots, mas->max);
+		if (unlikely(ma_dead_node(node)))
+			return 1;
+
 		if (offset)
 			mas->min = pivots[offset - 1] + 1;
 
@@ -4573,6 +4591,7 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
 	struct maple_enode *enode;
 	int level = 0;
 	unsigned char offset;
+	unsigned char node_end;
 	enum maple_type mt;
 	void __rcu **slots;
 
@@ -4596,7 +4615,11 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
 		node = mas_mn(mas);
 		mt = mte_node_type(mas->node);
 		pivots = ma_pivots(node, mt);
-	} while (unlikely(offset == ma_data_end(node, mt, pivots, mas->max)));
+		node_end = ma_data_end(node, mt, pivots, mas->max);
+		if (unlikely(ma_dead_node(node)))
+			return 1;
+
+	} while (unlikely(offset == node_end));
 
 	slots = ma_slots(node, mt);
 	pivot = mas_safe_pivot(mas, pivots, ++offset, mt);
@@ -4612,6 +4635,9 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
 		mt = mte_node_type(mas->node);
 		slots = ma_slots(node, mt);
 		pivots = ma_pivots(node, mt);
+		if (unlikely(ma_dead_node(node)))
+			return 1;
+
 		offset = 0;
 		pivot = pivots[0];
 	}
@@ -4658,16 +4684,18 @@ static inline void *mas_next_nentry(struct ma_state *mas,
 		return NULL;
 	}
 
-	pivots = ma_pivots(node, type);
 	slots = ma_slots(node, type);
-	mas->index = mas_safe_min(mas, pivots, mas->offset);
-	if (ma_dead_node(node))
+	pivots = ma_pivots(node, type);
+	count = ma_data_end(node, type, pivots, mas->max);
+	if (unlikely(ma_dead_node(node)))
 		return NULL;
 
+	mas->index = mas_safe_min(mas, pivots, mas->offset);
+	if (unlikely(ma_dead_node(node)))
+		return NULL;
 	if (mas->index > max)
 		return NULL;
 
-	count = ma_data_end(node, type, pivots, mas->max);
 	if (mas->offset > count)
 		return NULL;
 
@@ -4815,6 +4843,11 @@ static inline void *mas_prev_nentry(struct ma_state *mas, unsigned long limit,
 
 	slots = ma_slots(mn, mt);
 	pivots = ma_pivots(mn, mt);
+	if (unlikely(ma_dead_node(mn))) {
+		mas_rewalk(mas, index);
+		goto retry;
+	}
+
 	if (offset == mt_pivots[mt])
 		pivot = mas->max;
 	else
@@ -6613,11 +6646,11 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
 	while (likely(!ma_is_leaf(mt))) {
 		MT_BUG_ON(mas->tree, mte_dead_node(mas->node));
 		slots = ma_slots(mn, mt);
-		pivots = ma_pivots(mn, mt);
-		max = pivots[0];
 		entry = mas_slot(mas, slots, 0);
+		pivots = ma_pivots(mn, mt);
 		if (unlikely(ma_dead_node(mn)))
 			return NULL;
+		max = pivots[0];
 		mas->node = entry;
 		mn = mas_mn(mas);
 		mt = mte_node_type(mas->node);
@@ -6637,13 +6670,13 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
 	if (likely(entry))
 		return entry;
 
-	pivots = ma_pivots(mn, mt);
-	mas->index = pivots[0] + 1;
 	mas->offset = 1;
 	entry = mas_slot(mas, slots, 1);
+	pivots = ma_pivots(mn, mt);
 	if (unlikely(ma_dead_node(mn)))
 		return NULL;
 
+	mas->index = pivots[0] + 1;
 	if (mas->index > limit)
 		goto none;
 
-- 
2.39.0



* [PATCH 02/41] maple_tree: Detect dead nodes in mas_start()
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 01/41] maple_tree: Be more cautious about dead nodes Suren Baghdasaryan
@ 2023-01-09 20:52 ` Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode Suren Baghdasaryan
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:52 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam Howlett

From: Liam Howlett <Liam.Howlett@oracle.com>

When initially starting a search, the root node may already be in the
process of being replaced in RCU mode.  Detect and restart the walk if
this is the case.  This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ff9f04e0150d..a748938ad2e9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1359,11 +1359,15 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
 		mas->depth = 0;
 		mas->offset = 0;
 
+retry:
 		root = mas_root(mas);
 		/* Tree with nodes */
 		if (likely(xa_is_node(root))) {
 			mas->depth = 1;
 			mas->node = mte_safe_root(root);
+			if (mte_dead_node(mas->node))
+				goto retry;
+
 			return NULL;
 		}
 
-- 
2.39.0



* [PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 01/41] maple_tree: Be more cautious about dead nodes Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 02/41] maple_tree: Detect dead nodes in mas_start() Suren Baghdasaryan
@ 2023-01-09 20:52 ` Suren Baghdasaryan
  2023-01-09 20:52 ` [PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves() Suren Baghdasaryan
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:52 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam Howlett

From: Liam Howlett <Liam.Howlett@oracle.com>

The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types.  This is necessary for the
RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 73 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a748938ad2e9..a11eea943f8d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -897,6 +897,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
 	meta->end = end;
 }
 
+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+				  enum maple_type mt)
+{
+	struct maple_metadata *meta;
+	unsigned long *pivots;
+	void __rcu **slots;
+	void *next;
+
+	switch (mt) {
+	case maple_range_64:
+		pivots = mn->mr64.pivot;
+		if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+			slots = mn->mr64.slot;
+			next = mas_slot_locked(mas, slots,
+					       MAPLE_RANGE64_SLOTS - 1);
+			if (unlikely((mte_to_node(next) && mte_node_type(next))))
+				return; /* The last slot is a node, no metadata */
+		}
+		fallthrough;
+	case maple_arange_64:
+		meta = ma_meta(mn, mt);
+		break;
+	default:
+		return;
+	}
+
+	meta->gap = 0;
+	meta->end = 0;
+}
+
 /*
  * ma_meta_end() - Get the data end of a node from the metadata
  * @mn: The maple node
@@ -5448,20 +5486,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
  * mas_dead_leaves() - Mark all leaves of a node as dead.
  * @mas: The maple state
  * @slots: Pointer to the slot array
+ * @type: The maple node type
  *
  * Must hold the write lock.
  *
  * Return: The number of leaves marked as dead.
  */
 static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+			      enum maple_type mt)
 {
 	struct maple_node *node;
 	enum maple_type type;
 	void *entry;
 	int offset;
 
-	for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+	for (offset = 0; offset < mt_slots[mt]; offset++) {
 		entry = mas_slot_locked(mas, slots, offset);
 		type = mte_node_type(entry);
 		node = mte_to_node(entry);
@@ -5480,14 +5520,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
 
 static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
 {
-	struct maple_node *node, *next;
+	struct maple_node *next;
 	void __rcu **slots = NULL;
 
 	next = mas_mn(mas);
 	do {
-		mas->node = ma_enode_ptr(next);
-		node = mas_mn(mas);
-		slots = ma_slots(node, node->type);
+		mas->node = mt_mk_node(next, next->type);
+		slots = ma_slots(next, next->type);
 		next = mas_slot_locked(mas, slots, offset);
 		offset = 0;
 	} while (!ma_is_leaf(next->type));
@@ -5551,11 +5590,14 @@ static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
 		node = mas_mn(mas);
 		slots = ma_slots(node, mte_node_type(mas->node));
 		next = mas_slot_locked(mas, slots, 0);
-		if ((mte_dead_node(next)))
+		if ((mte_dead_node(next))) {
+			mte_to_node(next)->type = mte_node_type(next);
 			next = mas_slot_locked(mas, slots, 1);
+		}
 
 		mte_set_node_dead(mas->node);
 		node->type = mte_node_type(mas->node);
+		mas_clear_meta(mas, node, node->type);
 		node->piv_parent = prev;
 		node->parent_slot = offset;
 		offset = 0;
@@ -5575,13 +5617,18 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
 
 	MA_STATE(mas, &mt, 0, 0);
 
-	if (mte_is_leaf(enode))
+	mas.node = enode;
+	if (mte_is_leaf(enode)) {
+		node->type = mte_node_type(enode);
 		goto free_leaf;
+	}
 
+	ma_flags &= ~MT_FLAGS_LOCK_MASK;
 	mt_init_flags(&mt, ma_flags);
 	mas_lock(&mas);
 
-	mas.node = start = enode;
+	mte_to_node(enode)->ma_flags = ma_flags;
+	start = enode;
 	slots = mas_destroy_descend(&mas, start, 0);
 	node = mas_mn(&mas);
 	do {
@@ -5589,7 +5636,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
 		unsigned char offset;
 		struct maple_enode *parent, *tmp;
 
-		node->slot_len = mas_dead_leaves(&mas, slots);
+		node->type = mte_node_type(mas.node);
+		node->slot_len = mas_dead_leaves(&mas, slots, node->type);
 		if (free)
 			mt_free_bulk(node->slot_len, slots);
 		offset = node->parent_slot + 1;
@@ -5613,7 +5661,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
 	} while (start != mas.node);
 
 	node = mas_mn(&mas);
-	node->slot_len = mas_dead_leaves(&mas, slots);
+	node->type = mte_node_type(mas.node);
+	node->slot_len = mas_dead_leaves(&mas, slots, node->type);
 	if (free)
 		mt_free_bulk(node->slot_len, slots);
 
@@ -5623,6 +5672,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
 free_leaf:
 	if (free)
 		mt_free_rcu(&node->rcu);
+	else
+		mas_clear_meta(&mas, node, node->type);
 }
 
 /*
-- 
2.39.0



* [PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves()
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2023-01-09 20:52 ` [PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode Suren Baghdasaryan
@ 2023-01-09 20:52 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode Suren Baghdasaryan
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:52 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam Howlett

From: Liam Howlett <Liam.Howlett@oracle.com>

mte_set_node_dead(), which is called before the smp_wmb() being removed
here, already issues an smp_wmb() of its own, so this extra barrier is not
needed.  This is an optimization for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a11eea943f8d..d85291b19f86 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5510,7 +5510,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
 			break;
 
 		mte_set_node_dead(entry);
-		smp_wmb(); /* Needed for RCU */
 		node->type = type;
 		rcu_assign_pointer(slots[offset], node);
 	}
-- 
2.39.0



* [PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2023-01-09 20:52 ` [PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves() Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection Suren Baghdasaryan
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes.  To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.

There were two places where the old method of marking the node as dead
without smp_wmb() was being used, which resulted in RCU readers seeing
the wrong pivot value before seeing that the node was dead.  Fix this race
condition by using mte_set_node_dead(), which has the smp_wmb() call to
ensure the race is closed.

Add a WARN_ON() to ma_free_rcu() to check that all nodes being freed are
marked as dead, ensuring there are no other call paths besides the two
updated ones.

This is necessary for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c                 |  7 +++++--
 tools/testing/radix-tree/maple.c | 16 ++++++++++++++++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d85291b19f86..8066fb1e8ec9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -179,7 +179,7 @@ static void mt_free_rcu(struct rcu_head *head)
  */
 static void ma_free_rcu(struct maple_node *node)
 {
-	node->parent = ma_parent_ptr(node);
+	WARN_ON(node->parent != ma_parent_ptr(node));
 	call_rcu(&node->rcu, mt_free_rcu);
 }
 
@@ -1775,8 +1775,10 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
 		rcu_assign_pointer(slots[offset], mas->node);
 	}
 
-	if (!advanced)
+	if (!advanced) {
+		mte_set_node_dead(old_enode);
 		mas_free(mas, old_enode);
+	}
 }
 
 /*
@@ -4217,6 +4219,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
 done:
 	mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
 	if (in_rcu) {
+		mte_set_node_dead(mas->node);
 		mas->node = mt_mk_node(newnode, wr_mas->type);
 		mas_replace(mas, false);
 	} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 81fa7ec2e66a..2539ad6c4777 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 	MT_BUG_ON(mt, mn->slot[1] != NULL);
 	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
 
+	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 	mas.node = MAS_START;
 	mas_nomem(&mas, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 		MT_BUG_ON(mt, mas_allocated(&mas) != i);
 		MT_BUG_ON(mt, !mn);
 		MT_BUG_ON(mt, not_empty(mn));
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 	}
 
@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 		MT_BUG_ON(mt, not_empty(mn));
 		MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
 		MT_BUG_ON(mt, !mn);
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 	}
 
@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 			mn = mas_pop_node(&mas);
 			MT_BUG_ON(mt, not_empty(mn));
 			MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
+			mn->parent = ma_parent_ptr(mn);
 			ma_free_rcu(mn);
 		}
 		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
 			mn = mas_pop_node(&mas);
 			MT_BUG_ON(mt, not_empty(mn));
+			mn->parent = ma_parent_ptr(mn);
 			ma_free_rcu(mn);
 			MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
 		}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 			mn = mas_pop_node(&mas); /* get the next node. */
 			MT_BUG_ON(mt, mn == NULL);
 			MT_BUG_ON(mt, not_empty(mn));
+			mn->parent = ma_parent_ptr(mn);
 			ma_free_rcu(mn);
 		}
 		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 			mn = mas_pop_node(&mas2); /* get the next node. */
 			MT_BUG_ON(mt, mn == NULL);
 			MT_BUG_ON(mt, not_empty(mn));
+			mn->parent = ma_parent_ptr(mn);
 			ma_free_rcu(mn);
 		}
 		MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
@@ -334,10 +341,12 @@ static noinline void check_new_node(struct maple_tree *mt)
 	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
 	mn = mas_pop_node(&mas);
 	MT_BUG_ON(mt, not_empty(mn));
+	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 	for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
 		mn = mas_pop_node(&mas);
 		MT_BUG_ON(mt, not_empty(mn));
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 	}
 	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -375,6 +384,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 		mas_node_count(&mas, i); /* Request */
 		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
 		mn = mas_pop_node(&mas); /* get the next node. */
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 		mas_destroy(&mas);
 
@@ -382,10 +392,13 @@ static noinline void check_new_node(struct maple_tree *mt)
 		mas_node_count(&mas, i); /* Request */
 		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
 		mn = mas_pop_node(&mas); /* get the next node. */
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 		mn = mas_pop_node(&mas); /* get the next node. */
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 		mn = mas_pop_node(&mas); /* get the next node. */
+		mn->parent = ma_parent_ptr(mn);
 		ma_free_rcu(mn);
 		mas_destroy(&mas);
 	}
@@ -35369,6 +35382,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
 	MT_BUG_ON(mt, allocated != 1 + height * 3);
 	mn = mas_pop_node(&mas);
 	MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
+	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
 	mas_destroy(&mas);
@@ -35386,6 +35400,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
 	mas_destroy(&mas);
 	allocated = mas_allocated(&mas);
 	MT_BUG_ON(mt, allocated != 0);
+	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
@@ -35756,6 +35771,7 @@ void farmer_tests(void)
 	tree.ma_root = mt_mk_node(node, maple_leaf_64);
 	mt_dump(&tree);
 
+	node->parent = ma_parent_ptr(node);
 	ma_free_rcu(node);
 
 	/* Check things that will make lockdep angry */
-- 
2.39.0



* [PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 07/41] mm: Enable maple tree RCU mode by default Suren Baghdasaryan
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.

This is necessary for RCU mode.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8066fb1e8ec9..80ca28b656d3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -535,9 +535,11 @@ static inline struct maple_node *mte_parent(const struct maple_enode *enode)
  */
 static inline bool ma_dead_node(const struct maple_node *node)
 {
-	struct maple_node *parent = (void *)((unsigned long)
-					     node->parent & ~MAPLE_NODE_MASK);
+	struct maple_node *parent;
 
+	/* Do not reorder reads from the node prior to the parent check */
+	smp_rmb();
+	parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
 	return (parent == node);
 }
 
@@ -552,6 +554,8 @@ static inline bool mte_dead_node(const struct maple_enode *enode)
 	struct maple_node *parent, *node;
 
 	node = mte_to_node(enode);
+	/* Do not reorder reads from the node prior to the parent check */
+	smp_rmb();
 	parent = mte_parent(enode);
 	return (parent == node);
 }
-- 
2.39.0



* [PATCH 07/41] mm: Enable maple tree RCU mode by default.
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
                   ` (33 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Use the maple tree in RCU mode for VMA tracking.  This is necessary for
the use of per-VMA locking.  RCU mode is enabled by default but disabled
when exiting an mm and for the new tree during a fork.

Also enable RCU for the tree used in munmap operations to ensure the
nodes remain valid for readers.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_types.h | 3 ++-
 kernel/fork.c            | 3 +++
 mm/mmap.c                | 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..4b6bce73fbb4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -810,7 +810,8 @@ struct mm_struct {
 	unsigned long cpu_bitmap[];
 };
 
-#define MM_MT_FLAGS	(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
+#define MM_MT_FLAGS	(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
+			 MT_FLAGS_USE_RCU)
 extern struct mm_struct init_mm;
 
 /* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..58aab6c889a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	if (retval)
 		goto out;
 
+	mt_clear_in_rcu(mas.tree);
 	mas_for_each(&old_mas, mpnt, ULONG_MAX) {
 		struct file *file;
 
@@ -703,6 +704,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	retval = arch_dup_mmap(oldmm, mm);
 loop_out:
 	mas_destroy(&mas);
+	if (!retval)
+		mt_set_in_rcu(mas.tree);
 out:
 	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 87d929316d57..9db37adfc00a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2304,7 +2304,8 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	int count = 0;
 	int error = -ENOMEM;
 	MA_STATE(mas_detach, &mt_detach, 0, 0);
-	mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
+	mt_init_flags(&mt_detach, mas->tree->ma_flags &
+		      (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU));
 	mt_set_external_lock(&mt_detach, &mm->mmap_lock);
 
 	if (mas_preallocate(mas, vma, GFP_KERNEL))
@@ -3091,6 +3092,7 @@ void exit_mmap(struct mm_struct *mm)
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmap_write_lock(mm);
+	mt_clear_in_rcu(&mm->mm_mt);
 	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
 		      USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
-- 
2.39.0



* [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 07/41] mm: Enable maple tree RCU mode by default Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-11  0:13   ` Davidlohr Bueso
  2023-01-09 20:53 ` [PATCH 09/41] mm: rcu safe VMA freeing Suren Baghdasaryan
                   ` (32 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled by default on supported architectures with SMP and MMU
set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..0aeca3794972 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,19 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config ARCH_SUPPORTS_PER_VMA_LOCK
+       def_bool n
+
+config PER_VMA_LOCK
+	bool "Per-vma locking support"
+	default y
+	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+	help
+	  Allow per-vma locking during page fault handling.
+
+	  This feature allows locking each virtual memory area separately when
+	  handling page faults instead of taking mmap_lock.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.39.0



* [PATCH 09/41] mm: rcu safe VMA freeing
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (7 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 14:25   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 10/41] mm: move mmap_lock assert function definitions Suren Baghdasaryan
                   ` (31 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

From: Michel Lespinasse <michel@lespinasse.org>

This prepares for handling page faults under the VMA lock, looking up
VMAs under the protection of an RCU read lock instead of the usual mmap
read lock.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_types.h | 13 ++++++++++---
 kernel/fork.c            | 13 +++++++++++++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b6bce73fbb4..d5cdec1314fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -535,9 +535,16 @@ struct anon_vma_name {
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
-	unsigned long vm_start;		/* Our start address within vm_mm. */
-	unsigned long vm_end;		/* The first byte after our end address
-					   within vm_mm. */
+	union {
+		struct {
+			/* VMA covers [vm_start; vm_end) addresses within mm */
+			unsigned long vm_start;
+			unsigned long vm_end;
+		};
+#ifdef CONFIG_PER_VMA_LOCK
+		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
+#endif
+	};
 
 	struct mm_struct *vm_mm;	/* The address space we belong to. */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 58aab6c889a4..5986817f393c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,10 +479,23 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	return new;
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static void __vm_area_free(struct rcu_head *head)
+{
+	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+						  vm_rcu);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+#endif
+
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
+#ifdef CONFIG_PER_VMA_LOCK
+	call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
 	kmem_cache_free(vm_area_cachep, vma);
+#endif
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
-- 
2.39.0



* [PATCH 10/41] mm: move mmap_lock assert function definitions
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (8 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 09/41] mm: rcu safe VMA freeing Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 11/41] mm: export dump_mm() Suren Baghdasaryan
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mmap_lock.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
 
 #endif /* CONFIG_TRACING */
 
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+	lockdep_assert_held(&mm->mmap_lock);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+	lockdep_assert_held_write(&mm->mmap_lock);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
 	up_read_non_owner(&mm->mmap_lock);
 }
 
-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
-	lockdep_assert_held(&mm->mmap_lock);
-	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
-	lockdep_assert_held_write(&mm->mmap_lock);
-	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
 static inline int mmap_lock_is_contended(struct mm_struct *mm)
 {
 	return rwsem_is_contended(&mm->mmap_lock);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 11/41] mm: export dump_mm()
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (9 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 10/41] mm: move mmap_lock assert function definitions Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

mmap_assert_write_locked() will be used in the next patch to ensure the
vma write lock is taken only while mmap_lock is held exclusively. Because
mmap_assert_write_locked() uses dump_mm() (via VM_BUG_ON_MM()) and there
are cases when the vma write lock is taken from inside a module, it's
necessary to export the dump_mm() function.
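
For reference, with CONFIG_DEBUG_VM enabled VM_BUG_ON_MM(cond, mm) expands
roughly to the following, so the dump_mm() call gets inlined into whatever
module instantiates the assertion and therefore needs the symbol exported:

	if (unlikely(cond)) {
		dump_mm(mm);
		BUG();
	}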

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..b6e9e53469d1 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm->def_flags, &mm->def_flags
 	);
 }
+EXPORT_SYMBOL(dump_mm);
 
 static bool page_init_poisoning __read_mostly = true;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (10 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 11/41] mm: export dump_mm() Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:04   ` Michal Hocko
                     ` (2 more replies)
  2023-01-09 20:53 ` [PATCH 13/41] mm: introduce vma->vm_flags modifier functions Suren Baghdasaryan
                   ` (28 subsequent siblings)
  40 siblings, 3 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Introduce a per-VMA rw_semaphore to be used during page fault handling
instead of mmap_lock. Because there are cases when multiple VMAs need
to be exclusively locked during VMA tree modifications, instead of the
usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA
lock exclusively and setting vma->vm_lock_seq to the current
mm->mm_lock_seq. When the mmap_write_lock holder is done with all
modifications and drops mmap_lock, it increments mm->mm_lock_seq,
effectively unlocking all VMAs marked as locked.
The VMA lock is placed on a cache line boundary so that its 'count' field
falls into the first cache line while the rest of the fields fall into
the second cache line. This lets the 'count' field be cached with other
frequently accessed fields and used quickly in the uncontended case,
while 'owner' and the other fields used in the contended case will not
invalidate the first cache line while waiting on the lock.
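
To illustrate the intended usage (a sketch, not part of this patch), a
page fault path could try the per-VMA read lock first and fall back to
mmap_lock when that fails; find_vma_under_rcu() is a hypothetical helper
standing in for the RCU-protected lookup added later in the series:

static vm_fault_t fault_with_per_vma_lock(struct mm_struct *mm,
					  unsigned long addr,
					  unsigned int flags,
					  struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t ret;

	rcu_read_lock();
	vma = find_vma_under_rcu(mm, addr);	/* hypothetical lookup */
	if (!vma || !vma_read_trylock(vma)) {
		rcu_read_unlock();
		/* Caller retries the fault under mmap_lock. */
		return VM_FAULT_RETRY;
	}
	rcu_read_unlock();

	ret = handle_mm_fault(vma, addr, flags, regs);
	vma_read_unlock(vma);
	return ret;
}

Note that the write side needs no explicit per-VMA unlock: vma_write_lock()
only raises vma->vm_lock_seq to mm->mm_lock_seq, and the later
mmap_write_unlock() bumps mm->mm_lock_seq, dropping all such marks at once.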

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h        | 80 +++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h  |  8 ++++
 include/linux/mmap_lock.h | 13 +++++++
 kernel/fork.c             |  4 ++
 mm/init-mm.c              |  3 ++
 5 files changed, 108 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..ec2c4c227d51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -612,6 +612,85 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->lock);
+	vma->vm_lock_seq = -1;
+}
+
+static inline void vma_write_lock(struct vm_area_struct *vma)
+{
+	int mm_lock_seq;
+
+	mmap_assert_write_locked(vma->vm_mm);
+
+	/*
+	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+	 * mm->mm_lock_seq can't be concurrently modified.
+	 */
+	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+	if (vma->vm_lock_seq == mm_lock_seq)
+		return;
+
+	down_write(&vma->lock);
+	vma->vm_lock_seq = mm_lock_seq;
+	up_write(&vma->lock);
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+{
+	/* Check before locking. A race might cause false locked result. */
+	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+		return false;
+
+	if (unlikely(down_read_trylock(&vma->lock) == 0))
+		return false;
+
+	/*
+	 * Overflow might produce false locked result.
+	 * False unlocked result is impossible because we modify and check
+	 * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+	 * modification invalidates all existing locks.
+	 */
+	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+		up_read(&vma->lock);
+		return false;
+	}
+	return true;
+}
+
+static inline void vma_read_unlock(struct vm_area_struct *vma)
+{
+	up_read(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_write_locked(vma->vm_mm);
+	/*
+	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+	 * mm->mm_lock_seq can't be concurrently modified.
+	 */
+	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline void vma_write_lock(struct vm_area_struct *vma) {}
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+		{ return false; }
+static inline void vma_read_unlock(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_init_lock(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d5cdec1314fe..5f7c5ca89931 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -555,6 +555,11 @@ struct vm_area_struct {
 	pgprot_t vm_page_prot;
 	unsigned long vm_flags;		/* Flags, see mm.h. */
 
+#ifdef CONFIG_PER_VMA_LOCK
+	int vm_lock_seq;
+	struct rw_semaphore lock;
+#endif
+
 	/*
 	 * For areas with an address space and backing store,
 	 * linkage into the address_space->i_mmap interval tree.
@@ -680,6 +685,9 @@ struct mm_struct {
 					  * init_mm.mmlist, and are protected
 					  * by mmlist_lock
 					  */
+#ifdef CONFIG_PER_VMA_LOCK
+		int mm_lock_seq;
+#endif
 
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e49ba91bb1f0..40facd4c398b 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_write_unlock_mm(struct mm_struct *mm)
+{
+	mmap_assert_write_locked(mm);
+	/* No races during update due to exclusive mmap_lock being held */
+	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+}
+#else
+static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
+#endif
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
+	vma_write_unlock_mm(mm);
 	up_write(&mm->mmap_lock);
 }
 
 static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
 	__mmap_lock_trace_acquire_returned(mm, false, true);
+	vma_write_unlock_mm(mm);
 	downgrade_write(&mm->mmap_lock);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 5986817f393c..c026d75108b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_init_lock(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+	WRITE_ONCE(mm->mm_lock_seq, 0);
+#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c9327abb771c..33269314e060 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -37,6 +37,9 @@ struct mm_struct init_mm = {
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_PER_VMA_LOCK
+	.mm_lock_seq	= 0,
+#endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
 #ifdef CONFIG_IOMMU_SVA
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (11 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-11 15:47   ` Davidlohr Bueso
  2023-01-17 15:09   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK Suren Baghdasaryan
                   ` (27 subsequent siblings)
  40 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

To preserve VMA locking correctness when vm_flags are modified, add
modifier functions to be used whenever the flags are updated.
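
For example (an illustrative conversion, not taken from this patch), a
driver mmap handler that manipulates the flags directly would change from:

	vma->vm_flags |= VM_IO | VM_DONTEXPAND;
	vma->vm_flags &= ~VM_MAYWRITE;

to a single call that also write-locks the VMA:

	mod_vm_flags(vma, VM_IO | VM_DONTEXPAND, VM_MAYWRITE);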

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |  8 +++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec2c4c227d51..35cf0a6cbcc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma_init_lock(vma);
 }
 
+/* Use when VMA is not part of the VMA tree and needs no locking */
+static inline
+void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+	WRITE_ONCE(vma->vm_flags, flags);
+}
+
+/* Use when VMA is part of the VMA tree and needs appropriate locking */
+static inline
+void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+	vma_write_lock(vma);
+	init_vm_flags(vma, flags);
+}
+
+static inline
+void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+	vma_write_lock(vma);
+	vma->vm_flags |= flags;
+}
+
+static inline
+void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+	vma_write_lock(vma);
+	vma->vm_flags &= ~flags;
+}
+
+static inline
+void mod_vm_flags(struct vm_area_struct *vma,
+		  unsigned long set, unsigned long clear)
+{
+	vma_write_lock(vma);
+	vma->vm_flags |= set;
+	vma->vm_flags &= ~clear;
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
 	vma->vm_ops = NULL;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f7c5ca89931..0d27edd3e63a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -553,7 +553,13 @@ struct vm_area_struct {
 	 * See vmf_insert_mixed_prot() for discussion.
 	 */
 	pgprot_t vm_page_prot;
-	unsigned long vm_flags;		/* Flags, see mm.h. */
+
+	/*
+	 * Flags, see mm.h.
+	 * WARNING! Do not modify directly to keep correct VMA locking.
+	 * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
+	 */
+	unsigned long vm_flags;
 
 #ifdef CONFIG_PER_VMA_LOCK
 	int vm_lock_seq;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (12 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 13/41] mm: introduce vma->vm_flags modifier functions Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls Suren Baghdasaryan
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with the VM_LOCKED_MASK bitmask and convert all users.
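
The conversion is mechanical, since the old mask was simply the complement
of the new one; a typical user changes along these lines (illustrative):

	/* before */
	newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;

	/* after */
	newflags = vma->vm_flags & ~VM_LOCKED_MASK;

and callers that clear these bits on a VMA already in the tree now use
clear_vm_flags(vma, VM_LOCKED_MASK), which takes the VMA write lock.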

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 4 ++--
 kernel/fork.c      | 2 +-
 mm/hugetlb.c       | 4 ++--
 mm/mlock.c         | 6 +++---
 mm/mmap.c          | 6 +++---
 mm/mremap.c        | 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35cf0a6cbcc2..2b16d45b75a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,8 +416,8 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK	VM_NOHUGEPAGE
 
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK	(~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
 
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index c026d75108b3..1591dd8a0745 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,7 +674,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			tmp->anon_vma = NULL;
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+		clear_vm_flags(tmp, VM_LOCKED_MASK);
 		file = tmp->vm_file;
 		if (file) {
 			struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db895230ee7e..24861cbfa2b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6950,8 +6950,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	unsigned long s_end = sbase + PUD_SIZE;
 
 	/* Allow segments to share if only one is marked locked */
-	unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
-	unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+	unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
 
 	/*
 	 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6dd0ce1..06aa9e204fac 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -490,7 +490,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
 		prev = mas_prev(&mas, 0);
 
 	for (nstart = start ; ; ) {
-		vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+		vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED_MASK;
 
 		newflags |= flags;
 
@@ -662,7 +662,7 @@ static int apply_mlockall_flags(int flags)
 	struct vm_area_struct *vma, *prev = NULL;
 	vm_flags_t to_add = 0;
 
-	current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+	current->mm->def_flags &= ~VM_LOCKED_MASK;
 	if (flags & MCL_FUTURE) {
 		current->mm->def_flags |= VM_LOCKED;
 
@@ -682,7 +682,7 @@ static int apply_mlockall_flags(int flags)
 	mas_for_each(&mas, vma, ULONG_MAX) {
 		vm_flags_t newflags;
 
-		newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+		newflags = vma->vm_flags & ~VM_LOCKED_MASK;
 		newflags |= to_add;
 
 		/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9db37adfc00a..5c4b608edde9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2721,7 +2721,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
 					is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm))
-			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+			clear_vm_flags(vma, VM_LOCKED_MASK);
 		else
 			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
@@ -3392,8 +3392,8 @@ static struct vm_area_struct *__install_special_mapping(
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 
-	vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
-	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	init_vm_flags(vma, (vm_flags | mm->def_flags |
+		      VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
 	vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5d6591..5f6f9931bff1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -686,7 +686,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 
 	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
 		/* We always clear VM_LOCKED[ONFAULT] on the old vma */
-		vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+		clear_vm_flags(vma, VM_LOCKED_MASK);
 
 		/*
 		 * anon_vma links of the old vma is no longer needed after its page
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (13 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise Suren Baghdasaryan
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Replace direct modifications of vma->vm_flags with calls to the modifier
functions so that flag changes can be tracked and VMA locking correctness
is preserved.
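
Most conversions are one-to-one replacements of |=, &= ~ and plain
assignments; for example (illustrative), a wholesale overwrite such as

	vma->vm_flags = flags;

becomes

	reset_vm_flags(vma, flags);

which write-locks the VMA before rewriting the flags.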

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/arm/kernel/process.c                          |  2 +-
 arch/ia64/mm/init.c                                |  8 ++++----
 arch/loongarch/include/asm/tlb.h                   |  2 +-
 arch/powerpc/kvm/book3s_xive_native.c              |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c            |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c            |  2 +-
 arch/powerpc/platforms/cell/spufs/file.c           | 14 +++++++-------
 arch/s390/mm/gmap.c                                |  3 +--
 arch/x86/entry/vsyscall/vsyscall_64.c              |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c                   |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c                     |  2 +-
 arch/x86/mm/pat/memtype.c                          |  6 +++---
 arch/x86/um/mem_32.c                               |  2 +-
 drivers/acpi/pfr_telemetry.c                       |  2 +-
 drivers/android/binder.c                           |  3 +--
 drivers/char/mspec.c                               |  2 +-
 drivers/crypto/hisilicon/qm.c                      |  2 +-
 drivers/dax/device.c                               |  2 +-
 drivers/dma/idxd/cdev.c                            |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c            |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c           |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c          |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_events.c            |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c           |  4 ++--
 drivers/gpu/drm/drm_gem.c                          |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c               |  3 +--
 drivers/gpu/drm/drm_gem_shmem_helper.c             |  2 +-
 drivers/gpu/drm/drm_vm.c                           |  8 ++++----
 drivers/gpu/drm/etnaviv/etnaviv_gem.c              |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c            |  4 ++--
 drivers/gpu/drm/gma500/framebuffer.c               |  2 +-
 drivers/gpu/drm/i810/i810_dma.c                    |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c           |  4 ++--
 drivers/gpu/drm/mediatek/mtk_drm_gem.c             |  2 +-
 drivers/gpu/drm/msm/msm_gem.c                      |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c                 |  3 +--
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c        |  3 +--
 drivers/gpu/drm/tegra/gem.c                        |  5 ++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c                    |  3 +--
 drivers/gpu/drm/virtio/virtgpu_vram.c              |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c           |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c            |  3 +--
 drivers/hsi/clients/cmt_speech.c                   |  2 +-
 drivers/hwtracing/intel_th/msu.c                   |  2 +-
 drivers/hwtracing/stm/core.c                       |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c              |  4 ++--
 drivers/infiniband/hw/mlx5/main.c                  |  4 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c           | 13 ++++++-------
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c       |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c    |  2 +-
 .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
 drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c      |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c          |  4 ++--
 drivers/media/v4l2-core/videobuf-vmalloc.c         |  2 +-
 drivers/misc/cxl/context.c                         |  2 +-
 drivers/misc/habanalabs/common/memory.c            |  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c              |  4 ++--
 drivers/misc/habanalabs/gaudi2/gaudi2.c            |  8 ++++----
 drivers/misc/habanalabs/goya/goya.c                |  4 ++--
 drivers/misc/ocxl/context.c                        |  4 ++--
 drivers/misc/ocxl/sysfs.c                          |  2 +-
 drivers/misc/open-dice.c                           |  6 +++---
 drivers/misc/sgi-gru/grufile.c                     |  4 ++--
 drivers/misc/uacce/uacce.c                         |  2 +-
 drivers/sbus/char/oradax.c                         |  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c                    |  2 +-
 drivers/scsi/sg.c                                  |  2 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c     |  2 +-
 drivers/staging/media/deprecated/meye/meye.c       |  4 ++--
 .../media/deprecated/stkwebcam/stk-webcam.c        |  2 +-
 drivers/target/target_core_user.c                  |  2 +-
 drivers/uio/uio.c                                  |  2 +-
 drivers/usb/core/devio.c                           |  3 +--
 drivers/usb/mon/mon_bin.c                          |  3 +--
 drivers/vdpa/vdpa_user/iova_domain.c               |  2 +-
 drivers/vfio/pci/vfio_pci_core.c                   |  2 +-
 drivers/vhost/vdpa.c                               |  2 +-
 drivers/video/fbdev/68328fb.c                      |  2 +-
 drivers/video/fbdev/core/fb_defio.c                |  4 ++--
 drivers/xen/gntalloc.c                             |  2 +-
 drivers/xen/gntdev.c                               |  4 ++--
 drivers/xen/privcmd-buf.c                          |  2 +-
 drivers/xen/privcmd.c                              |  4 ++--
 fs/aio.c                                           |  2 +-
 fs/cramfs/inode.c                                  |  2 +-
 fs/erofs/data.c                                    |  2 +-
 fs/exec.c                                          |  4 ++--
 fs/ext4/file.c                                     |  2 +-
 fs/fuse/dax.c                                      |  2 +-
 fs/hugetlbfs/inode.c                               |  4 ++--
 fs/orangefs/file.c                                 |  3 +--
 fs/proc/task_mmu.c                                 |  2 +-
 fs/proc/vmcore.c                                   |  3 +--
 fs/userfaultfd.c                                   | 12 ++++++------
 fs/xfs/xfs_file.c                                  |  2 +-
 include/linux/mm.h                                 |  2 +-
 kernel/bpf/ringbuf.c                               |  4 ++--
 kernel/bpf/syscall.c                               |  4 ++--
 kernel/events/core.c                               |  2 +-
 kernel/kcov.c                                      |  2 +-
 kernel/relay.c                                     |  2 +-
 mm/madvise.c                                       |  2 +-
 mm/memory.c                                        |  6 +++---
 mm/mlock.c                                         |  6 +++---
 mm/mmap.c                                          | 10 +++++-----
 mm/mprotect.c                                      |  2 +-
 mm/mremap.c                                        |  6 +++---
 mm/nommu.c                                         | 11 ++++++-----
 mm/secretmem.c                                     |  2 +-
 mm/shmem.c                                         |  2 +-
 mm/vmalloc.c                                       |  2 +-
 net/ipv4/tcp.c                                     |  4 ++--
 security/selinux/selinuxfs.c                       |  6 +++---
 sound/core/oss/pcm_oss.c                           |  2 +-
 sound/core/pcm_native.c                            |  9 +++++----
 sound/soc/pxa/mmp-sspa.c                           |  2 +-
 sound/usb/usx2y/us122l.c                           |  4 ++--
 sound/usb/usx2y/usX2Yhwdep.c                       |  2 +-
 sound/usb/usx2y/usx2yhwdeppcm.c                    |  2 +-
 120 files changed, 194 insertions(+), 205 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index f811733a8fc5..ec65f3ea3150 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -316,7 +316,7 @@ static int __init gate_vma_init(void)
 	gate_vma.vm_page_prot = PAGE_READONLY_EXEC;
 	gate_vma.vm_start = 0xffff0000;
 	gate_vma.vm_end	= 0xffff0000 + PAGE_SIZE;
-	gate_vma.vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC;
+	init_vm_flags(&gate_vma, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC);
 	return 0;
 }
 arch_initcall(gate_vma_init);
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index fc4e4217e87f..d355e0ce28ab 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -109,7 +109,7 @@ ia64_init_addr_space (void)
 		vma_set_anonymous(vma);
 		vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
 		vma->vm_end = vma->vm_start + PAGE_SIZE;
-		vma->vm_flags = VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT;
+		init_vm_flags(vma, VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT);
 		vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 		mmap_write_lock(current->mm);
 		if (insert_vm_struct(current->mm, vma)) {
@@ -127,8 +127,8 @@ ia64_init_addr_space (void)
 			vma_set_anonymous(vma);
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
-			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO |
-					VM_DONTEXPAND | VM_DONTDUMP;
+			init_vm_flags(vma, VM_READ | VM_MAYREAD | VM_IO |
+				      VM_DONTEXPAND | VM_DONTDUMP);
 			mmap_write_lock(current->mm);
 			if (insert_vm_struct(current->mm, vma)) {
 				mmap_write_unlock(current->mm);
@@ -272,7 +272,7 @@ static int __init gate_vma_init(void)
 	vma_init(&gate_vma, NULL);
 	gate_vma.vm_start = FIXADDR_USER_START;
 	gate_vma.vm_end = FIXADDR_USER_END;
-	gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
+	init_vm_flags(&gate_vma, VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC);
 	gate_vma.vm_page_prot = __pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX);
 
 	return 0;
diff --git a/arch/loongarch/include/asm/tlb.h b/arch/loongarch/include/asm/tlb.h
index dd24f5898f65..51e35b44d105 100644
--- a/arch/loongarch/include/asm/tlb.h
+++ b/arch/loongarch/include/asm/tlb.h
@@ -149,7 +149,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	struct vm_area_struct vma;
 
 	vma.vm_mm = tlb->mm;
-	vma.vm_flags = 0;
+	init_vm_flags(&vma, 0);
 	if (tlb->fullmm) {
 		flush_tlb_mm(tlb->mm);
 		return;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 4f566bea5e10..7976af0f5ff8 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -324,7 +324,7 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
 		return -EINVAL;
 	}
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
 
 	/*
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index d73b3b4176e8..72948cdb1911 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -156,7 +156,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	 * VM_NOHUGEPAGE and split them.
 	 */
 	for_each_vma_range(vmi, vma, addr + len) {
-		vma->vm_flags |= VM_NOHUGEPAGE;
+		set_vm_flags(vma, VM_NOHUGEPAGE);
 		walk_page_vma(vma, &subpage_walk_ops, NULL);
 	}
 }
diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
index eb5bed333750..a81615768fff 100644
--- a/arch/powerpc/platforms/book3s/vas-api.c
+++ b/arch/powerpc/platforms/book3s/vas-api.c
@@ -525,7 +525,7 @@ static int coproc_mmap(struct file *fp, struct vm_area_struct *vma)
 	pfn = paste_addr >> PAGE_SHIFT;
 
 	/* flags, page_prot from cxl_mmap(), except we want cachable */
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
 
 	prot = __pgprot(pgprot_val(vma->vm_page_prot) | _PAGE_DIRTY);
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 62d90a5e23d1..784fa39a484a 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -291,7 +291,7 @@ static int spufs_mem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_mem_mmap_vmops;
@@ -381,7 +381,7 @@ static int spufs_cntl_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_cntl_mmap_vmops;
@@ -1043,7 +1043,7 @@ static int spufs_signal1_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_signal1_mmap_vmops;
@@ -1179,7 +1179,7 @@ static int spufs_signal2_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_signal2_mmap_vmops;
@@ -1302,7 +1302,7 @@ static int spufs_mss_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_mss_mmap_vmops;
@@ -1364,7 +1364,7 @@ static int spufs_psmap_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_psmap_mmap_vmops;
@@ -1424,7 +1424,7 @@ static int spufs_mfc_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 	vma->vm_ops = &spufs_mfc_mmap_vmops;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e1d873dce0..3811d6c86d09 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2522,8 +2522,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
 	VMA_ITERATOR(vmi, mm, 0);
 
 	for_each_vma(vmi, vma) {
-		vma->vm_flags &= ~VM_HUGEPAGE;
-		vma->vm_flags |= VM_NOHUGEPAGE;
+		mod_vm_flags(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);
 		walk_page_vma(vma, &thp_split_walk_ops, NULL);
 	}
 	mm->def_flags |= VM_NOHUGEPAGE;
diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 4af81df133ee..e2a1626d86d8 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -391,7 +391,7 @@ void __init map_vsyscall(void)
 	}
 
 	if (vsyscall_mode == XONLY)
-		gate_vma.vm_flags = VM_EXEC;
+		init_vm_flags(&gate_vma, VM_EXEC);
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
 		     (unsigned long)VSYSCALL_ADDR);
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index aa9b8b868867..42c0bded93b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -95,7 +95,7 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 		return ret;
 
 	vma->vm_ops = &sgx_vm_ops;
-	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	set_vm_flags(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
 	vma->vm_private_data = encl;
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 6a77a14eee38..0774a0bfeb28 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -105,7 +105,7 @@ static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
 
 	vma->vm_ops = &sgx_vepc_vm_ops;
 	/* Don't copy VMA in fork() */
-	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
+	set_vm_flags(vma, VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY);
 	vma->vm_private_data = vepc;
 
 	return 0;
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 46de9cf5c91d..9e490a372896 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -999,7 +999,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 
 		ret = reserve_pfn_range(paddr, size, prot, 0);
 		if (ret == 0 && vma)
-			vma->vm_flags |= VM_PAT;
+			set_vm_flags(vma, VM_PAT);
 		return ret;
 	}
 
@@ -1065,7 +1065,7 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	}
 	free_pfn_range(paddr, size);
 	if (vma)
-		vma->vm_flags &= ~VM_PAT;
+		clear_vm_flags(vma, VM_PAT);
 }
 
 /*
@@ -1075,7 +1075,7 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
  */
 void untrack_pfn_moved(struct vm_area_struct *vma)
 {
-	vma->vm_flags &= ~VM_PAT;
+	clear_vm_flags(vma, VM_PAT);
 }
 
 pgprot_t pgprot_writecombine(pgprot_t prot)
diff --git a/arch/x86/um/mem_32.c b/arch/x86/um/mem_32.c
index cafd01f730da..bfd2c320ad25 100644
--- a/arch/x86/um/mem_32.c
+++ b/arch/x86/um/mem_32.c
@@ -16,7 +16,7 @@ static int __init gate_vma_init(void)
 	vma_init(&gate_vma, NULL);
 	gate_vma.vm_start = FIXADDR_USER_START;
 	gate_vma.vm_end = FIXADDR_USER_END;
-	gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
+	init_vm_flags(&gate_vma, VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC);
 	gate_vma.vm_page_prot = PAGE_READONLY;
 
 	return 0;
diff --git a/drivers/acpi/pfr_telemetry.c b/drivers/acpi/pfr_telemetry.c
index 27fb6cdad75f..9e339c705b5b 100644
--- a/drivers/acpi/pfr_telemetry.c
+++ b/drivers/acpi/pfr_telemetry.c
@@ -310,7 +310,7 @@ pfrt_log_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EROFS;
 
 	/* changing from read to write with mprotect is not allowed */
-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);
 
 	pfrt_log_dev = to_pfrt_log_dev(file);
 
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 880224ec6abb..dd6c99223b8c 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -5572,8 +5572,7 @@ static int binder_mmap(struct file *filp, struct vm_area_struct *vma)
 		       proc->pid, vma->vm_start, vma->vm_end, "bad vm_flags", -EPERM);
 		return -EPERM;
 	}
-	vma->vm_flags |= VM_DONTCOPY | VM_MIXEDMAP;
-	vma->vm_flags &= ~VM_MAYWRITE;
+	mod_vm_flags(vma, VM_DONTCOPY | VM_MIXEDMAP, VM_MAYWRITE);
 
 	vma->vm_ops = &binder_vm_ops;
 	vma->vm_private_data = proc;
diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index f8231e2e84be..57bd36a28f95 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -206,7 +206,7 @@ mspec_mmap(struct file *file, struct vm_area_struct *vma,
 	refcount_set(&vdata->refcnt, 1);
 	vma->vm_private_data = vdata;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	if (vdata->type == MSPEC_UNCACHED)
 		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &mspec_vm_ops;
diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index 007ac7a69ce7..57ecdb5c97fb 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -2363,7 +2363,7 @@ static int hisi_qm_uacce_mmap(struct uacce_queue *q,
 				return -EINVAL;
 		}
 
-		vma->vm_flags |= VM_IO;
+		set_vm_flags(vma, VM_IO);
 
 		return remap_pfn_range(vma, vma->vm_start,
 				       phys_base >> PAGE_SHIFT,
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..6e9726dfaa7e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -308,7 +308,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
+	set_vm_flags(vma, VM_HUGEPAGE);
 	return 0;
 }
 
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index e13e92609943..51cf836cf329 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -201,7 +201,7 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
 	if (rc < 0)
 		return rc;
 
-	vma->vm_flags |= VM_DONTCOPY;
+	set_vm_flags(vma, VM_DONTCOPY);
 	pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
 				IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index bb7350ea1d75..70b08a0d13cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -257,7 +257,7 @@ static int amdgpu_gem_object_mmap(struct drm_gem_object *obj, struct vm_area_str
 	 */
 	if (is_cow_mapping(vma->vm_flags) &&
 	    !(vma->vm_flags & VM_ACCESS_FLAGS))
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 
 	return drm_gem_ttm_mmap(obj, vma);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 6d291aa6386b..7beb8dd6a5e6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -2879,8 +2879,8 @@ static int kfd_mmio_mmap(struct kfd_dev *dev, struct kfd_process *process,
 
 	address = dev->adev->rmmio_remap.bus_addr;
 
-	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE |
-				VM_DONTDUMP | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE |
+				VM_DONTDUMP | VM_PFNMAP);
 
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index cd4e61bf0493..6cbe47cf9be5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -159,8 +159,8 @@ int kfd_doorbell_mmap(struct kfd_dev *dev, struct kfd_process *process,
 	address = kfd_get_process_doorbells(pdd);
 	if (!address)
 		return -ENOMEM;
-	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE |
-				VM_DONTDUMP | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE |
+				VM_DONTDUMP | VM_PFNMAP);
 
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index 729d26d648af..95cd20056cea 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -1052,8 +1052,8 @@ int kfd_event_mmap(struct kfd_process *p, struct vm_area_struct *vma)
 	pfn = __pa(page->kernel_address);
 	pfn >>= PAGE_SHIFT;
 
-	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE
-		       | VM_DONTDUMP | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE
+		       | VM_DONTDUMP | VM_PFNMAP);
 
 	pr_debug("Mapping signal page\n");
 	pr_debug("     start user address  == 0x%08lx\n", vma->vm_start);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 51b1683ac5c1..b40f4b122918 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1978,8 +1978,8 @@ int kfd_reserved_mem_mmap(struct kfd_dev *dev, struct kfd_process *process,
 		return -ENOMEM;
 	}
 
-	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND
-		| VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_DONTCOPY | VM_DONTEXPAND
+		| VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
 	/* Mapping pages to user process */
 	return remap_pfn_range(vma, vma->vm_start,
 			       PFN_DOWN(__pa(qpd->cwsr_kaddr)),
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index b8db675e7fb5..6ea7bcaa592b 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1047,7 +1047,7 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 			goto err_drm_gem_object_put;
 		}
 
-		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+		set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	}
diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
index 1e658c448366..41f241b9a581 100644
--- a/drivers/gpu/drm/drm_gem_dma_helper.c
+++ b/drivers/gpu/drm/drm_gem_dma_helper.c
@@ -530,8 +530,7 @@ int drm_gem_dma_mmap(struct drm_gem_dma_object *dma_obj, struct vm_area_struct *
 	 * the whole buffer.
 	 */
 	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
-	vma->vm_flags &= ~VM_PFNMAP;
-	vma->vm_flags |= VM_DONTEXPAND;
+	mod_vm_flags(vma, VM_DONTEXPAND, VM_PFNMAP);
 
 	if (dma_obj->map_noncoherent) {
 		vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index b602cd72a120..a5032dfac492 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -633,7 +633,7 @@ int drm_gem_shmem_mmap(struct drm_gem_shmem_object *shmem, struct vm_area_struct
 	if (ret)
 		return ret;
 
-	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 	if (shmem->map_wc)
 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
index f024dc93939e..8867bb6c40e3 100644
--- a/drivers/gpu/drm/drm_vm.c
+++ b/drivers/gpu/drm/drm_vm.c
@@ -476,7 +476,7 @@ static int drm_mmap_dma(struct file *filp, struct vm_area_struct *vma)
 
 	if (!capable(CAP_SYS_ADMIN) &&
 	    (dma->flags & _DRM_DMA_USE_PCI_RO)) {
-		vma->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
+		clear_vm_flags(vma, VM_WRITE | VM_MAYWRITE);
 #if defined(__i386__) || defined(__x86_64__)
 		pgprot_val(vma->vm_page_prot) &= ~_PAGE_RW;
 #else
@@ -492,7 +492,7 @@ static int drm_mmap_dma(struct file *filp, struct vm_area_struct *vma)
 
 	vma->vm_ops = &drm_vm_dma_ops;
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 
 	drm_vm_open_locked(dev, vma);
 	return 0;
@@ -560,7 +560,7 @@ static int drm_mmap_locked(struct file *filp, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	if (!capable(CAP_SYS_ADMIN) && (map->flags & _DRM_READ_ONLY)) {
-		vma->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
+		clear_vm_flags(vma, VM_WRITE | VM_MAYWRITE);
 #if defined(__i386__) || defined(__x86_64__)
 		pgprot_val(vma->vm_page_prot) &= ~_PAGE_RW;
 #else
@@ -628,7 +628,7 @@ static int drm_mmap_locked(struct file *filp, struct vm_area_struct *vma)
 	default:
 		return -EINVAL;	/* This should never happen. */
 	}
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 
 	drm_vm_open_locked(dev, vma);
 	return 0;
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index c5ae5492e1af..9a5a317038a4 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -130,7 +130,7 @@ static int etnaviv_gem_mmap_obj(struct etnaviv_gem_object *etnaviv_obj,
 {
 	pgprot_t vm_page_prot;
 
-	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 
 	vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
diff --git a/drivers/gpu/drm/exynos/exynos_drm_gem.c b/drivers/gpu/drm/exynos/exynos_drm_gem.c
index 3e493f48e0d4..c330d415729c 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_gem.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_gem.c
@@ -274,7 +274,7 @@ static int exynos_drm_gem_mmap_buffer(struct exynos_drm_gem *exynos_gem,
 	unsigned long vm_size;
 	int ret;
 
-	vma->vm_flags &= ~VM_PFNMAP;
+	clear_vm_flags(vma, VM_PFNMAP);
 	vma->vm_pgoff = 0;
 
 	vm_size = vma->vm_end - vma->vm_start;
@@ -368,7 +368,7 @@ static int exynos_drm_gem_mmap(struct drm_gem_object *obj, struct vm_area_struct
 	if (obj->import_attach)
 		return dma_buf_mmap(obj->dma_buf, vma, 0);
 
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 
 	DRM_DEV_DEBUG_KMS(to_dma_dev(obj->dev), "flags = 0x%x\n",
 			  exynos_gem->flags);
diff --git a/drivers/gpu/drm/gma500/framebuffer.c b/drivers/gpu/drm/gma500/framebuffer.c
index 8d5a37b8f110..471d5b3c1535 100644
--- a/drivers/gpu/drm/gma500/framebuffer.c
+++ b/drivers/gpu/drm/gma500/framebuffer.c
@@ -139,7 +139,7 @@ static int psbfb_mmap(struct fb_info *info, struct vm_area_struct *vma)
 	 */
 	vma->vm_ops = &psbfb_vm_ops;
 	vma->vm_private_data = (void *)fb;
-	vma->vm_flags |= VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i810/i810_dma.c b/drivers/gpu/drm/i810/i810_dma.c
index 9fb4dd63342f..bced8c30709e 100644
--- a/drivers/gpu/drm/i810/i810_dma.c
+++ b/drivers/gpu/drm/i810/i810_dma.c
@@ -102,7 +102,7 @@ static int i810_mmap_buffers(struct file *filp, struct vm_area_struct *vma)
 	buf = dev_priv->mmap_buffer;
 	buf_priv = buf->dev_private;
 
-	vma->vm_flags |= VM_DONTCOPY;
+	set_vm_flags(vma, VM_DONTCOPY);
 
 	buf_priv->currently_mapped = I810_BUF_MAPPED;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 0ad44f3868de..71b9e0485cb9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -979,7 +979,7 @@ int i915_gem_mmap(struct file *filp, struct vm_area_struct *vma)
 			i915_gem_object_put(obj);
 			return -EINVAL;
 		}
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 	}
 
 	anon = mmap_singleton(to_i915(dev));
@@ -988,7 +988,7 @@ int i915_gem_mmap(struct file *filp, struct vm_area_struct *vma)
 		return PTR_ERR(anon);
 	}
 
-	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	set_vm_flags(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
 
 	/*
 	 * We keep the ref on mmo->obj, not vm_file, but we require
diff --git a/drivers/gpu/drm/mediatek/mtk_drm_gem.c b/drivers/gpu/drm/mediatek/mtk_drm_gem.c
index 47e96b0289f9..427089733b87 100644
--- a/drivers/gpu/drm/mediatek/mtk_drm_gem.c
+++ b/drivers/gpu/drm/mediatek/mtk_drm_gem.c
@@ -158,7 +158,7 @@ static int mtk_drm_gem_object_mmap(struct drm_gem_object *obj,
 	 * dma_alloc_attrs() allocated a struct page table for mtk_gem, so clear
 	 * VM_PFNMAP flag that was set by drm_gem_mmap_obj()/drm_gem_mmap().
 	 */
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 1dee0d18abbb..8aff3ae909af 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -1012,7 +1012,7 @@ static int msm_gem_object_mmap(struct drm_gem_object *obj, struct vm_area_struct
 {
 	struct msm_gem_object *msm_obj = to_msm_bo(obj);
 
-	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_page_prot = msm_gem_pgprot(msm_obj, vm_get_page_prot(vma->vm_flags));
 
 	return 0;
diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
index cf571796fd26..9c0e7d6a3784 100644
--- a/drivers/gpu/drm/omapdrm/omap_gem.c
+++ b/drivers/gpu/drm/omapdrm/omap_gem.c
@@ -543,8 +543,7 @@ int omap_gem_mmap_obj(struct drm_gem_object *obj,
 {
 	struct omap_gem_object *omap_obj = to_omap_bo(obj);
 
-	vma->vm_flags &= ~VM_PFNMAP;
-	vma->vm_flags |= VM_MIXEDMAP;
+	mod_vm_flags(vma, VM_MIXEDMAP, VM_PFNMAP);
 
 	if (omap_obj->flags & OMAP_BO_WC) {
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_gem.c b/drivers/gpu/drm/rockchip/rockchip_drm_gem.c
index 6edb7c52cb3d..735b64bbdcf2 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_gem.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_gem.c
@@ -251,8 +251,7 @@ static int rockchip_drm_gem_object_mmap(struct drm_gem_object *obj,
 	 * We allocated a struct page table for rk_obj, so clear
 	 * VM_PFNMAP flag that was set by drm_gem_mmap_obj()/drm_gem_mmap().
 	 */
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
-	vma->vm_flags &= ~VM_PFNMAP;
+	mod_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP, VM_PFNMAP);
 
 	vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 979e7bc902f6..6cdc6c45ef27 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -574,7 +574,7 @@ int __tegra_gem_mmap(struct drm_gem_object *gem, struct vm_area_struct *vma)
 		 * and set the vm_pgoff (used as a fake buffer offset by DRM)
 		 * to 0 as we want to map the whole buffer.
 		 */
-		vma->vm_flags &= ~VM_PFNMAP;
+		clear_vm_flags(vma, VM_PFNMAP);
 		vma->vm_pgoff = 0;
 
 		err = dma_mmap_wc(gem->dev->dev, vma, bo->vaddr, bo->iova,
@@ -588,8 +588,7 @@ int __tegra_gem_mmap(struct drm_gem_object *gem, struct vm_area_struct *vma)
 	} else {
 		pgprot_t prot = vm_get_page_prot(vma->vm_flags);
 
-		vma->vm_flags |= VM_MIXEDMAP;
-		vma->vm_flags &= ~VM_PFNMAP;
+		mod_vm_flags(vma, VM_MIXEDMAP, VM_PFNMAP);
 
 		vma->vm_page_prot = pgprot_writecombine(prot);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 5a3e4b891377..0861e6e33964 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -468,8 +468,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
 	vma->vm_private_data = bo;
 
-	vma->vm_flags |= VM_PFNMAP;
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 EXPORT_SYMBOL(ttm_bo_mmap_obj);
diff --git a/drivers/gpu/drm/virtio/virtgpu_vram.c b/drivers/gpu/drm/virtio/virtgpu_vram.c
index 6b45b0429fef..5498a1dbef63 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vram.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vram.c
@@ -46,7 +46,7 @@ static int virtio_gpu_vram_mmap(struct drm_gem_object *obj,
 		return -EINVAL;
 
 	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
-	vma->vm_flags |= VM_MIXEDMAP | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_MIXEDMAP | VM_DONTEXPAND);
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	vma->vm_ops = &virtio_gpu_vram_vm_ops;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
index 265f7c48d856..8c8015528b6f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
@@ -97,7 +97,7 @@ int vmw_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	/* Use VM_PFNMAP rather than VM_MIXEDMAP if not a COW mapping */
 	if (!is_cow_mapping(vma->vm_flags))
-		vma->vm_flags = (vma->vm_flags & ~VM_MIXEDMAP) | VM_PFNMAP;
+		mod_vm_flags(vma, VM_PFNMAP, VM_MIXEDMAP);
 
 	ttm_bo_put(bo); /* release extra ref taken by ttm_bo_mmap_obj() */
 
diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index 4c95ebcdcc2d..18a93ad4aa1f 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -69,8 +69,7 @@ static int xen_drm_front_gem_object_mmap(struct drm_gem_object *gem_obj,
 	 * vm_pgoff (used as a fake buffer offset by DRM) to 0 as we want to map
 	 * the whole buffer.
 	 */
-	vma->vm_flags &= ~VM_PFNMAP;
-	vma->vm_flags |= VM_MIXEDMAP | VM_DONTEXPAND;
+	mod_vm_flags(vma, VM_MIXEDMAP | VM_DONTEXPAND, VM_PFNMAP);
 	vma->vm_pgoff = 0;
 
 	/*
diff --git a/drivers/hsi/clients/cmt_speech.c b/drivers/hsi/clients/cmt_speech.c
index 8069f795c864..952a31e742a1 100644
--- a/drivers/hsi/clients/cmt_speech.c
+++ b/drivers/hsi/clients/cmt_speech.c
@@ -1264,7 +1264,7 @@ static int cs_char_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_pages(vma) != 1)
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_DONTDUMP | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_IO | VM_DONTDUMP | VM_DONTEXPAND);
 	vma->vm_ops = &cs_char_vm_ops;
 	vma->vm_private_data = file->private_data;
 
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c
index 6c8215a47a60..a6f178bf3ded 100644
--- a/drivers/hwtracing/intel_th/msu.c
+++ b/drivers/hwtracing/intel_th/msu.c
@@ -1659,7 +1659,7 @@ static int intel_th_msc_mmap(struct file *file, struct vm_area_struct *vma)
 		atomic_dec(&msc->user_count);
 
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTCOPY);
 	vma->vm_ops = &msc_mmap_ops;
 	return ret;
 }
diff --git a/drivers/hwtracing/stm/core.c b/drivers/hwtracing/stm/core.c
index 2712e699ba08..9a59e61c4194 100644
--- a/drivers/hwtracing/stm/core.c
+++ b/drivers/hwtracing/stm/core.c
@@ -715,7 +715,7 @@ static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
 	pm_runtime_get_sync(&stm->dev);
 
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &stm_mmap_vmops;
 	vm_iomap_memory(vma, phys, size);
 
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index f5f9269fdc16..7294f2d33bc6 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -403,7 +403,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
 			ret = -EPERM;
 			goto done;
 		}
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 		addr = vma->vm_start;
 		for (i = 0 ; i < uctxt->egrbufs.numbufs; i++) {
 			memlen = uctxt->egrbufs.buffers[i].len;
@@ -528,7 +528,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
 		goto done;
 	}
 
-	vma->vm_flags = flags;
+	reset_vm_flags(vma, flags);
 	hfi1_cdbg(PROC,
 		  "%u:%u type:%u io/vf:%d/%d, addr:0x%llx, len:%lu(%lu), flags:0x%lx\n",
 		    ctxt, subctxt, type, mapio, vmf, memaddr, memlen,
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index c669ef6e47e7..538318c809b3 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2087,7 +2087,7 @@ static int mlx5_ib_mmap_clock_info_page(struct mlx5_ib_dev *dev,
 
 	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
 		return -EPERM;
-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);
 
 	if (!dev->mdev->clock_info)
 		return -EOPNOTSUPP;
@@ -2311,7 +2311,7 @@ static int mlx5_ib_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vm
 
 		if (vma->vm_flags & VM_WRITE)
 			return -EPERM;
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 
 		/* Don't expose to user-space information it shouldn't have */
 		if (PAGE_SIZE > 4096)
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index 3937144b2ae5..16ef80df4b7f 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -733,7 +733,7 @@ static int qib_mmap_mem(struct vm_area_struct *vma, struct qib_ctxtdata *rcd,
 		}
 
 		/* don't allow them to later change with mprotect */
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 	}
 
 	pfn = virt_to_phys(kvaddr) >> PAGE_SHIFT;
@@ -769,7 +769,7 @@ static int mmap_ureg(struct vm_area_struct *vma, struct qib_devdata *dd,
 		phys = dd->physaddr + ureg;
 		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
-		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+		set_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND);
 		ret = io_remap_pfn_range(vma, vma->vm_start,
 					 phys >> PAGE_SHIFT,
 					 vma->vm_end - vma->vm_start,
@@ -810,8 +810,7 @@ static int mmap_piobufs(struct vm_area_struct *vma,
 	 * don't allow them to later change to readable with mprotect (for when
 	 * not initially mapped readable, as is normally the case)
 	 */
-	vma->vm_flags &= ~VM_MAYREAD;
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+	mod_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND, VM_MAYREAD);
 
 	/* We used PAT if wc_cookie == 0 */
 	if (!dd->wc_cookie)
@@ -852,7 +851,7 @@ static int mmap_rcvegrbufs(struct vm_area_struct *vma,
 		goto bail;
 	}
 	/* don't allow them to later change to writable with mprotect */
-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);
 
 	start = vma->vm_start;
 
@@ -944,7 +943,7 @@ static int mmap_kvaddr(struct vm_area_struct *vma, u64 pgaddr,
 		 * Don't allow permission to later change to writable
 		 * with mprotect.
 		 */
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 	} else
 		goto bail;
 	len = vma->vm_end - vma->vm_start;
@@ -955,7 +954,7 @@ static int mmap_kvaddr(struct vm_area_struct *vma, u64 pgaddr,
 
 	vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT;
 	vma->vm_ops = &qib_file_vm_ops;
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	ret = 1;
 
 bail:
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 6e8c4fbb8083..6f9237c2a26b 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -672,7 +672,7 @@ int usnic_ib_mmap(struct ib_ucontext *context,
 	usnic_dbg("\n");
 
 	us_ibdev = to_usdev(context->device);
-	vma->vm_flags |= VM_IO;
+	set_vm_flags(vma, VM_IO);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vfid = vma->vm_pgoff;
 	usnic_dbg("Page Offset %lu PAGE_SHIFT %u VFID %u\n",
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
index 19176583dbde..7f1b7b5dd3f4 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
@@ -408,7 +408,7 @@ int pvrdma_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma)
 	}
 
 	/* Map UAR to kernel space, VM_LOCKED? */
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	if (io_remap_pfn_range(vma, start, context->uar.pfn, size,
 			       vma->vm_page_prot))
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
index 5f1175f8b349..e66ae399749e 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
@@ -293,7 +293,7 @@ static int vb2_dc_mmap(void *buf_priv, struct vm_area_struct *vma)
 		return ret;
 	}
 
-	vma->vm_flags		|= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_private_data	= &buf->handler;
 	vma->vm_ops		= &vb2_common_vm_ops;
 
diff --git a/drivers/media/common/videobuf2/videobuf2-vmalloc.c b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
index 959b45beb1f3..edb47240ec17 100644
--- a/drivers/media/common/videobuf2/videobuf2-vmalloc.c
+++ b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
@@ -185,7 +185,7 @@ static int vb2_vmalloc_mmap(void *buf_priv, struct vm_area_struct *vma)
 	/*
 	 * Make sure that vm_areas for 2 buffers won't be merged together
 	 */
-	vma->vm_flags		|= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 
 	/*
 	 * Use common vm_area operations to track buffer refcount.
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index f2c439359557..c030823185ba 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -314,7 +314,7 @@ static int __videobuf_mmap_mapper(struct videobuf_queue *q,
 	}
 
 	vma->vm_ops = &videobuf_vm_ops;
-	vma->vm_flags |= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 	vma->vm_private_data = map;
 
 	dev_dbg(q->dev, "mmap %p: q=%p %08lx-%08lx (%lx) pgoff %08lx buf %d\n",
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 234e9f647c96..9adac4875f29 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -630,8 +630,8 @@ static int __videobuf_mmap_mapper(struct videobuf_queue *q,
 	map->count    = 1;
 	map->q        = q;
 	vma->vm_ops   = &videobuf_vm_ops;
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
-	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */
+	/* using shared anonymous pages */
+	mod_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP, VM_IO);
 	vma->vm_private_data = map;
 	dprintk(1, "mmap %p: q=%p %08lx-%08lx pgoff %08lx bufs %d-%d\n",
 		map, q, vma->vm_start, vma->vm_end, vma->vm_pgoff, first, last);
diff --git a/drivers/media/v4l2-core/videobuf-vmalloc.c b/drivers/media/v4l2-core/videobuf-vmalloc.c
index 9b2443720ab0..48d439ccd414 100644
--- a/drivers/media/v4l2-core/videobuf-vmalloc.c
+++ b/drivers/media/v4l2-core/videobuf-vmalloc.c
@@ -247,7 +247,7 @@ static int __videobuf_mmap_mapper(struct videobuf_queue *q,
 	}
 
 	vma->vm_ops          = &videobuf_vm_ops;
-	vma->vm_flags       |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_private_data = map;
 
 	dprintk(1, "mmap %p: q=%p %08lx-%08lx (%lx) pgoff %08lx buf %d\n",
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index acaa44809c58..17562e4efcb2 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -220,7 +220,7 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma)
 	pr_devel("%s: mmio physical: %llx pe: %i master:%i\n", __func__,
 		 ctx->psn_phys, ctx->pe , ctx->master);
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &cxl_mmap_vmops;
 	return 0;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index 5e9ae7600d75..ad8eae764b9b 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -2082,7 +2082,7 @@ static int hl_ts_mmap(struct hl_mmap_mem_buf *buf, struct vm_area_struct *vma, v
 {
 	struct hl_ts_buff *ts_buff = buf->private;
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP | VM_DONTCOPY | VM_NORESERVE;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_DONTCOPY | VM_NORESERVE);
 	return remap_vmalloc_range(vma, ts_buff->user_buff_address, 0);
 }
 
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 9f5e208701ba..4186f04da224 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -4236,8 +4236,8 @@ static int gaudi_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 {
 	int rc;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
-			VM_DONTCOPY | VM_NORESERVE;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE);
 
 	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr,
 				(dma_addr - HOST_PHYS_BASE), size);
diff --git a/drivers/misc/habanalabs/gaudi2/gaudi2.c b/drivers/misc/habanalabs/gaudi2/gaudi2.c
index e793fb2bdcbe..7311c3053944 100644
--- a/drivers/misc/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/misc/habanalabs/gaudi2/gaudi2.c
@@ -5538,8 +5538,8 @@ static int gaudi2_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 {
 	int rc;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
-			VM_DONTCOPY | VM_NORESERVE;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE);
 
 #ifdef _HAS_DMA_MMAP_COHERENT
 
@@ -10116,8 +10116,8 @@ static int gaudi2_block_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 
 	address = pci_resource_start(hdev->pdev, SRAM_CFG_BAR_ID) + offset_in_bar;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
-			VM_DONTCOPY | VM_NORESERVE;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE);
 
 	rc = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
 			block_size, vma->vm_page_prot);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 0f083fcf81a6..5e2aaa26ea29 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -2880,8 +2880,8 @@ static int goya_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 {
 	int rc;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
-			VM_DONTCOPY | VM_NORESERVE;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE);
 
 	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr,
 				(dma_addr - HOST_PHYS_BASE), size);
diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index 9eb0d93b01c6..e6f941248e93 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -180,7 +180,7 @@ static int check_mmap_afu_irq(struct ocxl_context *ctx,
 	if ((vma->vm_flags & VM_READ) || (vma->vm_flags & VM_EXEC) ||
 		!(vma->vm_flags & VM_WRITE))
 		return -EINVAL;
-	vma->vm_flags &= ~(VM_MAYREAD | VM_MAYEXEC);
+	clear_vm_flags(vma, VM_MAYREAD | VM_MAYEXEC);
 	return 0;
 }
 
@@ -204,7 +204,7 @@ int ocxl_context_mmap(struct ocxl_context *ctx, struct vm_area_struct *vma)
 	if (rc)
 		return rc;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &ocxl_vmops;
 	return 0;
diff --git a/drivers/misc/ocxl/sysfs.c b/drivers/misc/ocxl/sysfs.c
index 25c78df8055d..9398246cac79 100644
--- a/drivers/misc/ocxl/sysfs.c
+++ b/drivers/misc/ocxl/sysfs.c
@@ -134,7 +134,7 @@ static int global_mmio_mmap(struct file *filp, struct kobject *kobj,
 		(afu->config.global_mmio_size >> PAGE_SHIFT))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &global_mmio_vmops;
 	vma->vm_private_data = afu;
diff --git a/drivers/misc/open-dice.c b/drivers/misc/open-dice.c
index c61be3404c6f..9f9438b5b075 100644
--- a/drivers/misc/open-dice.c
+++ b/drivers/misc/open-dice.c
@@ -96,13 +96,13 @@ static int open_dice_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	/* Ensure userspace cannot acquire VM_WRITE + VM_SHARED later. */
 	if (vma->vm_flags & VM_WRITE)
-		vma->vm_flags &= ~VM_MAYSHARE;
+		clear_vm_flags(vma, VM_MAYSHARE);
 	else if (vma->vm_flags & VM_SHARED)
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 
 	/* Create write-combine mapping so all clients observe a wipe. */
 	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTCOPY | VM_DONTDUMP);
 	return vm_iomap_memory(vma, drvdata->rmem->base, drvdata->rmem->size);
 }
 
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 7ffcfc0bb587..8b777286d3b2 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -101,8 +101,8 @@ static int gru_file_mmap(struct file *file, struct vm_area_struct *vma)
 				vma->vm_end & (GRU_GSEG_PAGESIZE - 1))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_LOCKED |
-			 VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_LOCKED |
+			 VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_page_prot = PAGE_SHARED;
 	vma->vm_ops = &gru_vm_ops;
 
diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 905eff1f840e..f57e91cdb0f6 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -229,7 +229,7 @@ static int uacce_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	if (!qfr)
 		return -ENOMEM;
 
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_WIPEONFORK;
+	set_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_WIPEONFORK);
 	vma->vm_ops = &uacce_vm_ops;
 	vma->vm_private_data = q;
 	qfr->type = type;
diff --git a/drivers/sbus/char/oradax.c b/drivers/sbus/char/oradax.c
index 21b7cb6e7e70..a096734daad0 100644
--- a/drivers/sbus/char/oradax.c
+++ b/drivers/sbus/char/oradax.c
@@ -389,7 +389,7 @@ static int dax_devmap(struct file *f, struct vm_area_struct *vma)
 	/* completion area is mapped read-only for user */
 	if (vma->vm_flags & VM_WRITE)
 		return -EPERM;
-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);
 
 	if (remap_pfn_range(vma, vma->vm_start, ctx->ca_buf_ra >> PAGE_SHIFT,
 			    len, vma->vm_page_prot))
diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 631eda2d467e..d386c25c2699 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -1167,7 +1167,7 @@ static int afu_mmap(struct file *file, struct vm_area_struct *vma)
 	    (ctx->psn_size >> PAGE_SHIFT))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &ocxlflash_vmops;
 	return 0;
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index ff9854f59964..7438adfe3bdc 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1288,7 +1288,7 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma)
 	}
 
 	sfp->mmap_called = 1;
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_private_data = sfp;
 	vma->vm_ops = &sg_mmap_vm_ops;
 out:
diff --git a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
index 5e53eed8ae95..df1c944e5058 100644
--- a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
+++ b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
@@ -1072,7 +1072,7 @@ int hmm_bo_mmap(struct vm_area_struct *vma, struct hmm_buffer_object *bo)
 	vma->vm_private_data = bo;
 
 	vma->vm_ops = &hmm_bo_vm_ops;
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 
 	/*
 	 * call hmm_bo_vm_open explicitly.
diff --git a/drivers/staging/media/deprecated/meye/meye.c b/drivers/staging/media/deprecated/meye/meye.c
index 5d87efd9b95c..2505e64d7119 100644
--- a/drivers/staging/media/deprecated/meye/meye.c
+++ b/drivers/staging/media/deprecated/meye/meye.c
@@ -1476,8 +1476,8 @@ static int meye_mmap(struct file *file, struct vm_area_struct *vma)
 	}
 
 	vma->vm_ops = &meye_vm_ops;
-	vma->vm_flags &= ~VM_IO;	/* not I/O memory */
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	/* not I/O memory */
+	mod_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP, VM_IO);
 	vma->vm_private_data = (void *) (offset / gbufsize);
 	meye_vm_open(vma);
 
diff --git a/drivers/staging/media/deprecated/stkwebcam/stk-webcam.c b/drivers/staging/media/deprecated/stkwebcam/stk-webcam.c
index 787edb3d47c2..196d1034f104 100644
--- a/drivers/staging/media/deprecated/stkwebcam/stk-webcam.c
+++ b/drivers/staging/media/deprecated/stkwebcam/stk-webcam.c
@@ -779,7 +779,7 @@ static int v4l_stk_mmap(struct file *fp, struct vm_area_struct *vma)
 	ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
 	if (ret)
 		return ret;
-	vma->vm_flags |= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 	vma->vm_private_data = sbuf;
 	vma->vm_ops = &stk_v4l_vm_ops;
 	sbuf->v4lbuf.flags |= V4L2_BUF_FLAG_MAPPED;
diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 2940559c3086..9fd64259904c 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1928,7 +1928,7 @@ static int tcmu_mmap(struct uio_info *info, struct vm_area_struct *vma)
 {
 	struct tcmu_dev *udev = container_of(info, struct tcmu_dev, uio_info);
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &tcmu_vm_ops;
 
 	vma->vm_private_data = udev;
diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index 43afbb7c5ab9..08802744f3b7 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -713,7 +713,7 @@ static const struct vm_operations_struct uio_logical_vm_ops = {
 
 static int uio_mmap_logical(struct vm_area_struct *vma)
 {
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &uio_logical_vm_ops;
 	return 0;
 }
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 837f3e57f580..d9aefa259883 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -279,8 +279,7 @@ static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
 		}
 	}
 
-	vma->vm_flags |= VM_IO;
-	vma->vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP);
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &usbdev_vm_ops;
 	vma->vm_private_data = usbm;
 
diff --git a/drivers/usb/mon/mon_bin.c b/drivers/usb/mon/mon_bin.c
index 094e812e9e69..9b2d48a65fdf 100644
--- a/drivers/usb/mon/mon_bin.c
+++ b/drivers/usb/mon/mon_bin.c
@@ -1272,8 +1272,7 @@ static int mon_bin_mmap(struct file *filp, struct vm_area_struct *vma)
 	if (vma->vm_flags & VM_WRITE)
 		return -EPERM;
 
-	vma->vm_flags &= ~VM_MAYWRITE;
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	mod_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP, VM_MAYWRITE);
 	vma->vm_private_data = filp->private_data;
 	mon_bin_vma_open(vma);
 	return 0;
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index e682bc7ee6c9..39dcce2e455b 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -512,7 +512,7 @@ static int vduse_domain_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct vduse_iova_domain *domain = file->private_data;
 
-	vma->vm_flags |= VM_DONTDUMP | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTDUMP | VM_DONTEXPAND);
 	vma->vm_private_data = domain;
 	vma->vm_ops = &vduse_domain_mmap_ops;
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 26a541cc64d1..86eb3fc9ffb4 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1799,7 +1799,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
 	 * change vm_flags within the fault handler.  Set them now.
 	 */
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &vfio_pci_mmap_ops;
 
 	return 0;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec32f785dfde..7b81994a7d02 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1315,7 +1315,7 @@ static int vhost_vdpa_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_end - vma->vm_start != notify.size)
 		return -ENOTSUPP;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &vhost_vdpa_vm_ops;
 	return 0;
 }
diff --git a/drivers/video/fbdev/68328fb.c b/drivers/video/fbdev/68328fb.c
index 7db03ed77c76..a794a740af10 100644
--- a/drivers/video/fbdev/68328fb.c
+++ b/drivers/video/fbdev/68328fb.c
@@ -391,7 +391,7 @@ static int mc68x328fb_mmap(struct fb_info *info, struct vm_area_struct *vma)
 #ifndef MMU
 	/* this is uClinux (no MMU) specific code */
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_start = videomemory;
 
 	return 0;
diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index c730253ab85c..af0bfaa2d014 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -232,9 +232,9 @@ static const struct address_space_operations fb_deferred_io_aops = {
 int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma)
 {
 	vma->vm_ops = &fb_deferred_io_vm_ops;
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	if (!(info->flags & FBINFO_VIRTFB))
-		vma->vm_flags |= VM_IO;
+		set_vm_flags(vma, VM_IO);
 	vma->vm_private_data = info;
 	return 0;
 }
diff --git a/drivers/xen/gntalloc.c b/drivers/xen/gntalloc.c
index a15729beb9d1..ee4a8958dc68 100644
--- a/drivers/xen/gntalloc.c
+++ b/drivers/xen/gntalloc.c
@@ -525,7 +525,7 @@ static int gntalloc_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	vma->vm_private_data = vm_priv;
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 
 	vma->vm_ops = &gntalloc_vmops;
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 4d9a3050de6a..6d5bb1ebb661 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -1055,10 +1055,10 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 
 	vma->vm_ops = &gntdev_vmops;
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP | VM_MIXEDMAP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_MIXEDMAP);
 
 	if (use_ptemod)
-		vma->vm_flags |= VM_DONTCOPY;
+		set_vm_flags(vma, VM_DONTCOPY);
 
 	vma->vm_private_data = map;
 	if (map->flags) {
diff --git a/drivers/xen/privcmd-buf.c b/drivers/xen/privcmd-buf.c
index dd5bbb6e1b6b..037547918630 100644
--- a/drivers/xen/privcmd-buf.c
+++ b/drivers/xen/privcmd-buf.c
@@ -156,7 +156,7 @@ static int privcmd_buf_mmap(struct file *file, struct vm_area_struct *vma)
 	vma_priv->file_priv = file_priv;
 	vma_priv->users = 1;
 
-	vma->vm_flags |= VM_IO | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_IO | VM_DONTEXPAND);
 	vma->vm_ops = &privcmd_buf_vm_ops;
 	vma->vm_private_data = vma_priv;
 
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 1edf45ee9890..4c8cfc6f86d8 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -934,8 +934,8 @@ static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	/* DONTCOPY is essential for Xen because copy_page_range doesn't know
 	 * how to recreate these mappings */
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
-			 VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTCOPY |
+			 VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &privcmd_vm_ops;
 	vma->vm_private_data = NULL;
 
diff --git a/fs/aio.c b/fs/aio.c
index 562916d85cba..db821fb1e92d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -390,7 +390,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {
 
 static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	vma->vm_flags |= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 	vma->vm_ops = &aio_ring_vm_ops;
 	return 0;
 }
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 61ccf7722fc3..874a17a1b8d9 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -408,7 +408,7 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 		 * unpopulated ptes via cramfs_read_folio().
 		 */
 		int i;
-		vma->vm_flags |= VM_MIXEDMAP;
+		set_vm_flags(vma, VM_MIXEDMAP);
 		for (i = 0; i < pages && !ret; i++) {
 			vm_fault_t vmf;
 			unsigned long off = i * PAGE_SIZE;
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index f57f921683d7..e6413ced2bb1 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -429,7 +429,7 @@ static int erofs_file_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma->vm_ops = &erofs_dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
+	set_vm_flags(vma, VM_HUGEPAGE);
 	return 0;
 }
 #else
diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..5e1631e109a8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -270,7 +270,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
 	vma->vm_end = STACK_TOP_MAX;
 	vma->vm_start = vma->vm_end - PAGE_SIZE;
-	vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
+	init_vm_flags(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
 	err = insert_vm_struct(mm, vma);
@@ -834,7 +834,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	}
 
 	/* mprotect_fixup is overkill to remove the temporary stack flags */
-	vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
+	clear_vm_flags(vma, VM_STACK_INCOMPLETE_SETUP);
 
 	stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
 	stack_size = vma->vm_end - vma->vm_start;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7ac0a81bd371..baeb385b07c7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -801,7 +801,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_HUGEPAGE;
+		set_vm_flags(vma, VM_HUGEPAGE);
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..599969edc869 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -860,7 +860,7 @@ int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	file_accessed(file);
 	vma->vm_ops = &fuse_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	set_vm_flags(vma, VM_MIXEDMAP | VM_HUGEPAGE);
 	return 0;
 }
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 790d2727141a..d63a392985a7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -132,7 +132,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+	set_vm_flags(vma, VM_HUGETLB | VM_DONTEXPAND);
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	ret = seal_check_future_write(info->seals, vma);
@@ -813,7 +813,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	 * as input to create an allocation policy.
 	 */
 	vma_init(&pseudo_vma, mm);
-	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+	init_vm_flags(&pseudo_vma, VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pseudo_vma.vm_file = file;
 
 	for (index = start; index < end; index++) {
diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 167fa43b24f9..0f668db6bcf3 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -389,8 +389,7 @@ static int orangefs_file_mmap(struct file *file, struct vm_area_struct *vma)
 		     "orangefs_file_mmap: called on %pD\n", file);
 
 	/* set the sequential readahead hint */
-	vma->vm_flags |= VM_SEQ_READ;
-	vma->vm_flags &= ~VM_RAND_READ;
+	mod_vm_flags(vma, VM_SEQ_READ, VM_RAND_READ);
 
 	file_accessed(file);
 	vma->vm_ops = &orangefs_file_vm_ops;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..4d651777c8a5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1302,7 +1302,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			mas_for_each(&mas, vma, ULONG_MAX) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
-				vma->vm_flags &= ~VM_SOFTDIRTY;
+				clear_vm_flags(vma, VM_SOFTDIRTY);
 				vma_set_page_prot(vma);
 			}
 
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 09a81e4b1273..858e4e804f85 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -582,8 +582,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
 		return -EPERM;
 
-	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
-	vma->vm_flags |= VM_MIXEDMAP;
+	mod_vm_flags(vma, VM_MIXEDMAP, VM_MAYWRITE | VM_MAYEXEC);
 	vma->vm_ops = &vmcore_mmap_ops;
 
 	len = 0;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 98ac37e34e3d..f46252544924 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -618,7 +618,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		for_each_vma(vmi, vma) {
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-				vma->vm_flags &= ~__VM_UFFD_FLAGS;
+				clear_vm_flags(vma, __VM_UFFD_FLAGS);
 			}
 		}
 		mmap_write_unlock(mm);
@@ -652,7 +652,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~__VM_UFFD_FLAGS;
+		clear_vm_flags(vma, __VM_UFFD_FLAGS);
 		return 0;
 	}
 
@@ -733,7 +733,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 	} else {
 		/* Drop uffd context if remap feature not enabled */
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~__VM_UFFD_FLAGS;
+		clear_vm_flags(vma, __VM_UFFD_FLAGS);
 	}
 }
 
@@ -895,7 +895,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 			prev = vma;
 		}
 
-		vma->vm_flags = new_flags;
+		reset_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 	}
 	mmap_write_unlock(mm);
@@ -1463,7 +1463,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		reset_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx.ctx = ctx;
 
 		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
@@ -1651,7 +1651,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		reset_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 
 	skip:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..bf777fed0dd4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1429,7 +1429,7 @@ xfs_file_mmap(
 	file_accessed(file);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(inode))
-		vma->vm_flags |= VM_HUGEPAGE;
+		set_vm_flags(vma, VM_HUGEPAGE);
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2b16d45b75a6..594e835bad9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3756,7 +3756,7 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
 		 * VM_MAYWRITE as we still want them to be COW-writable.
 		 */
 		if (vma->vm_flags & VM_SHARED)
-			vma->vm_flags &= ~(VM_MAYWRITE);
+			clear_vm_flags(vma, VM_MAYWRITE);
 	}
 
 	return 0;
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 80f4b4d88aaf..d2c967cc2873 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -269,7 +269,7 @@ static int ringbuf_map_mmap_kern(struct bpf_map *map, struct vm_area_struct *vma
 		if (vma->vm_pgoff != 0 || vma->vm_end - vma->vm_start != PAGE_SIZE)
 			return -EPERM;
 	} else {
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 	}
 	/* remap_vmalloc_range() checks size and offset constraints */
 	return remap_vmalloc_range(vma, rb_map->rb,
@@ -290,7 +290,7 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma
 			 */
 			return -EPERM;
 	} else {
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 	}
 	/* remap_vmalloc_range() checks size and offset constraints */
 	return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 64131f88c553..db19094c7ac7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -882,10 +882,10 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	/* set default open/close callbacks */
 	vma->vm_ops = &bpf_map_default_vmops;
 	vma->vm_private_data = map;
-	vma->vm_flags &= ~VM_MAYEXEC;
+	clear_vm_flags(vma, VM_MAYEXEC);
 	if (!(vma->vm_flags & VM_WRITE))
 		/* disallow re-mapping with PROT_WRITE */
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 
 	err = map->ops->map_mmap(map, vma);
 	if (err)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d56328e5080e..6745460dcf49 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6573,7 +6573,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	 * Since pinned accounting is per vm we cannot allow fork() to copy our
 	 * vma.
 	 */
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &perf_mmap_vmops;
 
 	if (event->pmu->event_mapped)
diff --git a/kernel/kcov.c b/kernel/kcov.c
index e5cd09fd8a05..27fc1e26e1e1 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -489,7 +489,7 @@ static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
 		goto exit;
 	}
 	spin_unlock_irqrestore(&kcov->lock, flags);
-	vma->vm_flags |= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 	for (off = 0; off < size; off += PAGE_SIZE) {
 		page = vmalloc_to_page(kcov->area + off);
 		res = vm_insert_page(vma, vma->vm_start + off, page);
diff --git a/kernel/relay.c b/kernel/relay.c
index ef12532168d9..085aa8707bc2 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -91,7 +91,7 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma->vm_ops = &relay_file_mmap_ops;
-	vma->vm_flags |= VM_DONTEXPAND;
+	set_vm_flags(vma, VM_DONTEXPAND);
 	vma->vm_private_data = buf;
 
 	return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index a56a6d17e201..5b74321bcac9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -179,7 +179,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 */
-	vma->vm_flags = new_flags;
+	reset_vm_flags(vma, new_flags);
 	if (!vma->vm_file || vma_is_anon_shmem(vma)) {
 		error = replace_anon_vma_name(vma, anon_name);
 		if (error)
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..2fabf89b2be9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1951,7 +1951,7 @@ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
 	if (!(vma->vm_flags & VM_MIXEDMAP)) {
 		BUG_ON(mmap_read_trylock(vma->vm_mm));
 		BUG_ON(vma->vm_flags & VM_PFNMAP);
-		vma->vm_flags |= VM_MIXEDMAP;
+		set_vm_flags(vma, VM_MIXEDMAP);
 	}
 	/* Defer page refcount checking till we're about to map that page. */
 	return insert_pages(vma, addr, pages, num, vma->vm_page_prot);
@@ -2009,7 +2009,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (!(vma->vm_flags & VM_MIXEDMAP)) {
 		BUG_ON(mmap_read_trylock(vma->vm_mm));
 		BUG_ON(vma->vm_flags & VM_PFNMAP);
-		vma->vm_flags |= VM_MIXEDMAP;
+		set_vm_flags(vma, VM_MIXEDMAP);
 	}
 	return insert_page(vma, addr, page, vma->vm_page_prot);
 }
@@ -2475,7 +2475,7 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 		vma->vm_pgoff = pfn;
 	}
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
diff --git a/mm/mlock.c b/mm/mlock.c
index 06aa9e204fac..4807e91aaa8b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 	 */
 	if (newflags & VM_LOCKED)
 		newflags |= VM_IO;
-	WRITE_ONCE(vma->vm_flags, newflags);
+	reset_vm_flags(vma, newflags);
 
 	lru_add_drain();
 	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
@@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 
 	if (newflags & VM_IO) {
 		newflags &= ~VM_IO;
-		WRITE_ONCE(vma->vm_flags, newflags);
+		reset_vm_flags(vma, newflags);
 	}
 }
 
@@ -456,7 +456,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
 		/* No work to do, and mlocking twice would be wrong */
-		vma->vm_flags = newflags;
+		reset_vm_flags(vma, newflags);
 	} else {
 		mlock_vma_pages_range(vma, start, end, newflags);
 	}
diff --git a/mm/mmap.c b/mm/mmap.c
index 5c4b608edde9..fa994ae903d9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2607,7 +2607,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma->vm_start = addr;
 	vma->vm_end = end;
-	vma->vm_flags = vm_flags;
+	init_vm_flags(vma, vm_flags);
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
 
@@ -2736,7 +2736,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * then new mapped in-place (which must be aimed as
 	 * a completely new data area).
 	 */
-	vma->vm_flags |= VM_SOFTDIRTY;
+	set_vm_flags(vma, VM_SOFTDIRTY);
 
 	vma_set_page_prot(vma);
 
@@ -2959,7 +2959,7 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 			anon_vma_interval_tree_pre_update_vma(vma);
 		}
 		vma->vm_end = addr + len;
-		vma->vm_flags |= VM_SOFTDIRTY;
+		set_vm_flags(vma, VM_SOFTDIRTY);
 		mas_store_prealloc(mas, vma);
 
 		if (vma->anon_vma) {
@@ -2979,7 +2979,7 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = addr >> PAGE_SHIFT;
-	vma->vm_flags = flags;
+	init_vm_flags(vma, flags);
 	vma->vm_page_prot = vm_get_page_prot(flags);
 	mas_set_range(mas, vma->vm_start, addr + len - 1);
 	if (mas_store_gfp(mas, vma, GFP_KERNEL))
@@ -2992,7 +2992,7 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	mm->data_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
-	vma->vm_flags |= VM_SOFTDIRTY;
+	set_vm_flags(vma, VM_SOFTDIRTY);
 	validate_mm(mm);
 	return 0;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..79adae74c094 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -633,7 +633,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * vm_flags and vm_page_prot are protected by the mmap_lock
 	 * held in write mode.
 	 */
-	vma->vm_flags = newflags;
+	reset_vm_flags(vma, newflags);
 	if (vma_wants_manual_pte_write_upgrade(vma))
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 	vma_set_page_prot(vma);
diff --git a/mm/mremap.c b/mm/mremap.c
index 5f6f9931bff1..2ccdd1561f5b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -661,7 +661,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
-		vma->vm_flags &= ~VM_ACCOUNT;
+		clear_vm_flags(vma, VM_ACCOUNT);
 		excess = vma->vm_end - vma->vm_start - old_len;
 		if (old_addr > vma->vm_start &&
 		    old_addr + old_len < vma->vm_end)
@@ -716,9 +716,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 
 	/* Restore VM_ACCOUNT if one or two pieces of vma left */
 	if (excess) {
-		vma->vm_flags |= VM_ACCOUNT;
+		set_vm_flags(vma, VM_ACCOUNT);
 		if (split)
-			find_vma(mm, vma->vm_end)->vm_flags |= VM_ACCOUNT;
+			set_vm_flags(find_vma(mm, vma->vm_end), VM_ACCOUNT);
 	}
 
 	return new_addr;
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..b3154357ced5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -173,7 +173,7 @@ static void *__vmalloc_user_flags(unsigned long size, gfp_t flags)
 		mmap_write_lock(current->mm);
 		vma = find_vma(current->mm, (unsigned long)ret);
 		if (vma)
-			vma->vm_flags |= VM_USERMAP;
+			set_vm_flags(vma, VM_USERMAP);
 		mmap_write_unlock(current->mm);
 	}
 
@@ -991,7 +991,8 @@ static int do_mmap_private(struct vm_area_struct *vma,
 
 	atomic_long_add(total, &mmap_pages_allocated);
 
-	region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY;
+	set_vm_flags(vma, VM_MAPPED_COPY);
+	region->vm_flags = vma->vm_flags;
 	region->vm_start = (unsigned long) base;
 	region->vm_end   = region->vm_start + len;
 	region->vm_top   = region->vm_start + (total << PAGE_SHIFT);
@@ -1088,7 +1089,7 @@ unsigned long do_mmap(struct file *file,
 	region->vm_flags = vm_flags;
 	region->vm_pgoff = pgoff;
 
-	vma->vm_flags = vm_flags;
+	init_vm_flags(vma, vm_flags);
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1152,7 +1153,7 @@ unsigned long do_mmap(struct file *file,
 			vma->vm_end = start + len;
 
 			if (pregion->vm_flags & VM_MAPPED_COPY)
-				vma->vm_flags |= VM_MAPPED_COPY;
+				set_vm_flags(vma, VM_MAPPED_COPY);
 			else {
 				ret = do_mmap_shared_file(vma);
 				if (ret < 0) {
@@ -1632,7 +1633,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	if (addr != (pfn << PAGE_SHIFT))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 EXPORT_SYMBOL(remap_pfn_range);
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 04c3ac9448a1..334b85714bd7 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -128,7 +128,7 @@ static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
 		return -EAGAIN;
 
-	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	set_vm_flags(vma, VM_LOCKED | VM_DONTDUMP);
 	vma->vm_ops = &secretmem_vm_ops;
 
 	return 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index c301487be5fb..2096bbdc955f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2289,7 +2289,7 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 		return ret;
 
 	/* arm64 - allow memory tagging on RAM-based files */
-	vma->vm_flags |= VM_MTE_ALLOWED;
+	set_vm_flags(vma, VM_MTE_ALLOWED);
 
 	file_accessed(file);
 	/* This is anonymous shared memory if it is unlinked at the time of mmap */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7c9d77..da02ec9c650f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3657,7 +3657,7 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
 		size -= PAGE_SIZE;
 	} while (size > 0);
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 
 	return 0;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c567d5e8053e..30158585c688 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1890,10 +1890,10 @@ int tcp_mmap(struct file *file, struct socket *sock,
 {
 	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
 		return -EPERM;
-	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	clear_vm_flags(vma, VM_MAYWRITE | VM_MAYEXEC);
 
 	/* Instruct vm_insert_page() to not mmap_read_lock(mm) */
-	vma->vm_flags |= VM_MIXEDMAP;
+	set_vm_flags(vma, VM_MIXEDMAP);
 
 	vma->vm_ops = &tcp_vm_ops;
 	return 0;
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 0a6894cdc54d..9037deb5979e 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -262,7 +262,7 @@ static int sel_mmap_handle_status(struct file *filp,
 	if (vma->vm_flags & VM_WRITE)
 		return -EPERM;
 	/* disallow mprotect() turns it into writable */
-	vma->vm_flags &= ~VM_MAYWRITE;
+	clear_vm_flags(vma, VM_MAYWRITE);
 
 	return remap_pfn_range(vma, vma->vm_start,
 			       page_to_pfn(status),
@@ -506,13 +506,13 @@ static int sel_mmap_policy(struct file *filp, struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_SHARED) {
 		/* do not allow mprotect to make mapping writable */
-		vma->vm_flags &= ~VM_MAYWRITE;
+		clear_vm_flags(vma, VM_MAYWRITE);
 
 		if (vma->vm_flags & VM_WRITE)
 			return -EACCES;
 	}
 
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &sel_mmap_policy_ops;
 
 	return 0;
diff --git a/sound/core/oss/pcm_oss.c b/sound/core/oss/pcm_oss.c
index ac2efeb63a39..52473e2acd07 100644
--- a/sound/core/oss/pcm_oss.c
+++ b/sound/core/oss/pcm_oss.c
@@ -2910,7 +2910,7 @@ static int snd_pcm_oss_mmap(struct file *file, struct vm_area_struct *area)
 	}
 	/* set VM_READ access as well to fix memset() routines that do
 	   reads before writes (to improve performance) */
-	area->vm_flags |= VM_READ;
+	set_vm_flags(area, VM_READ);
 	if (substream == NULL)
 		return -ENXIO;
 	runtime = substream->runtime;
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index 9c122e757efe..f716bdb70afe 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -3675,8 +3675,9 @@ static int snd_pcm_mmap_status(struct snd_pcm_substream *substream, struct file
 		return -EINVAL;
 	area->vm_ops = &snd_pcm_vm_ops_status;
 	area->vm_private_data = substream;
-	area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
-	area->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
+	mod_vm_flags(area, VM_DONTEXPAND | VM_DONTDUMP,
+		     VM_WRITE | VM_MAYWRITE);
+
 	return 0;
 }
 
@@ -3712,7 +3713,7 @@ static int snd_pcm_mmap_control(struct snd_pcm_substream *substream, struct file
 		return -EINVAL;
 	area->vm_ops = &snd_pcm_vm_ops_control;
 	area->vm_private_data = substream;
-	area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(area, VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 
@@ -3828,7 +3829,7 @@ static const struct vm_operations_struct snd_pcm_vm_ops_data_fault = {
 int snd_pcm_lib_default_mmap(struct snd_pcm_substream *substream,
 			     struct vm_area_struct *area)
 {
-	area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(area, VM_DONTEXPAND | VM_DONTDUMP);
 	if (!substream->ops->page &&
 	    !snd_dma_buffer_mmap(snd_pcm_get_dma_buf(substream), area))
 		return 0;
diff --git a/sound/soc/pxa/mmp-sspa.c b/sound/soc/pxa/mmp-sspa.c
index fb5a4390443f..fdd72d9bb46c 100644
--- a/sound/soc/pxa/mmp-sspa.c
+++ b/sound/soc/pxa/mmp-sspa.c
@@ -404,7 +404,7 @@ static int mmp_pcm_mmap(struct snd_soc_component *component,
 			struct snd_pcm_substream *substream,
 			struct vm_area_struct *vma)
 {
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(vma, VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	return remap_pfn_range(vma, vma->vm_start,
 		substream->dma_buffer.addr >> PAGE_SHIFT,
diff --git a/sound/usb/usx2y/us122l.c b/sound/usb/usx2y/us122l.c
index e558931cce16..b51db622a69b 100644
--- a/sound/usb/usx2y/us122l.c
+++ b/sound/usb/usx2y/us122l.c
@@ -224,9 +224,9 @@ static int usb_stream_hwdep_mmap(struct snd_hwdep *hw,
 	}
 
 	area->vm_ops = &usb_stream_hwdep_vm_ops;
-	area->vm_flags |= VM_DONTDUMP;
+	set_vm_flags(area, VM_DONTDUMP);
 	if (!read)
-		area->vm_flags |= VM_DONTEXPAND;
+		set_vm_flags(area, VM_DONTEXPAND);
 	area->vm_private_data = us122l;
 	atomic_inc(&us122l->mmap_count);
 out:
diff --git a/sound/usb/usx2y/usX2Yhwdep.c b/sound/usb/usx2y/usX2Yhwdep.c
index c29da0341bc5..3abe6d891f98 100644
--- a/sound/usb/usx2y/usX2Yhwdep.c
+++ b/sound/usb/usx2y/usX2Yhwdep.c
@@ -61,7 +61,7 @@ static int snd_us428ctls_mmap(struct snd_hwdep *hw, struct file *filp, struct vm
 	}
 
 	area->vm_ops = &us428ctls_vm_ops;
-	area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(area, VM_DONTEXPAND | VM_DONTDUMP);
 	area->vm_private_data = hw->private_data;
 	return 0;
 }
diff --git a/sound/usb/usx2y/usx2yhwdeppcm.c b/sound/usb/usx2y/usx2yhwdeppcm.c
index 767a227d54da..22ce93b2fb24 100644
--- a/sound/usb/usx2y/usx2yhwdeppcm.c
+++ b/sound/usb/usx2y/usx2yhwdeppcm.c
@@ -706,7 +706,7 @@ static int snd_usx2y_hwdep_pcm_mmap(struct snd_hwdep *hw, struct file *filp, str
 		return -ENODEV;
 
 	area->vm_ops = &snd_usx2y_hwdep_pcm_vm_ops;
-	area->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	set_vm_flags(area, VM_DONTEXPAND | VM_DONTDUMP);
 	area->vm_private_data = hw->private_data;
 	return 0;
 }
-- 
2.39.0



* [PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (14 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call Suren Baghdasaryan
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Replace indirect modifications to vma->vm_flags with calls to modifier
functions so that flag changes can be tracked and VMA locking stays
correct. Add BUG_ON checks in hugepage_madvise() and ksm_madvise() to
catch indirect vm_flags modification attempts.
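
For readers of this excerpt: the modifier helpers are introduced earlier
in the series and are not shown here. Based on how they are used in these
diffs they are assumed to look roughly like the sketch below; the real
definitions also hook into the VMA locking and flag-change tracking that
the sketch elides.

  /* Illustrative sketch only, not part of this patch. */
  static inline void set_vm_flags(struct vm_area_struct *vma,
                                  unsigned long flags)
  {
          vma->vm_flags |= flags;         /* plus locking/tracking hooks */
  }

  static inline void clear_vm_flags(struct vm_area_struct *vma,
                                    unsigned long flags)
  {
          vma->vm_flags &= ~flags;
  }

  static inline void mod_vm_flags(struct vm_area_struct *vma,
                                  unsigned long set, unsigned long clear)
  {
          vma->vm_flags |= set;
          vma->vm_flags &= ~clear;
  }

  static inline void reset_vm_flags(struct vm_area_struct *vma,
                                    unsigned long flags)
  {
          vma->vm_flags = flags;          /* overwrite, like the old direct assignment */
  }

init_vm_flags() is assumed to behave like reset_vm_flags() for a freshly
created VMA.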

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 5 ++++-
 arch/s390/mm/gmap.c                | 5 ++++-
 mm/khugepaged.c                    | 2 ++
 mm/ksm.c                           | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1d67baa5557a..325a7a47d348 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 {
 	unsigned long gfn = memslot->base_gfn;
 	unsigned long end, start = gfn_to_hva(kvm, gfn);
+	unsigned long vm_flags;
 	int ret = 0;
 	struct vm_area_struct *vma;
 	int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
@@ -409,12 +410,14 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 			ret = H_STATE;
 			break;
 		}
+		vm_flags = vma->vm_flags;
 		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
-			  merge_flag, &vma->vm_flags);
+			  merge_flag, &vm_flags);
 		if (ret) {
 			ret = H_STATE;
 			break;
 		}
+		reset_vm_flags(vma, vm_flags);
 		start = vma->vm_end;
 	} while (end > vma->vm_end);
 
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3811d6c86d09..e47387f8be6d 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2587,14 +2587,17 @@ int gmap_mark_unmergeable(void)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long vm_flags;
 	int ret;
 	VMA_ITERATOR(vmi, mm, 0);
 
 	for_each_vma(vmi, vma) {
+		vm_flags = vma->vm_flags;
 		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
-				  MADV_UNMERGEABLE, &vma->vm_flags);
+				  MADV_UNMERGEABLE, &vm_flags);
 		if (ret)
 			return ret;
+		reset_vm_flags(vma, vm_flags);
 	}
 	mm->def_flags &= ~VM_MERGEABLE;
 	return 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..5376246a3052 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -352,6 +352,8 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
+	/* vma->vm_flags can be changed only using modifier functions */
+	BUG_ON(vm_flags == &vma->vm_flags);
 	switch (advice) {
 	case MADV_HUGEPAGE:
 #ifdef CONFIG_S390
diff --git a/mm/ksm.c b/mm/ksm.c
index dd02780c387f..d05c41b289db 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2471,6 +2471,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
+	/* vma->vm_flags can be changed only using modifier functions */
+	BUG_ON(vm_flags == &vma->vm_flags);
 	switch (advice) {
 	case MADV_MERGEABLE:
 		/*
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (15 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:16   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page Suren Baghdasaryan
                   ` (23 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Move the VMA flag modification (which now implies VMA locking) before
the anon_vma_lock_write() call to match the locking order of the page
fault handler.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index fa994ae903d9..53d885e70a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2953,13 +2953,13 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 		if (mas_preallocate(mas, vma, GFP_KERNEL))
 			goto unacct_fail;
 
+		set_vm_flags(vma, VM_SOFTDIRTY);
 		vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
 		if (vma->anon_vma) {
 			anon_vma_lock_write(vma->anon_vma);
 			anon_vma_interval_tree_pre_update_vma(vma);
 		}
 		vma->vm_end = addr + len;
-		set_vm_flags(vma, VM_SOFTDIRTY);
 		mas_store_prealloc(mas, vma);
 
 		if (vma->anon_vma) {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (16 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:25   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them Suren Baghdasaryan
                   ` (22 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Protect the VMA from concurrent page fault handlers while collapsing a
huge page. The page fault handler needs a stable PMD in order to use
the PTL and relies on the per-VMA lock to prevent concurrent PMD
changes. pmdp_collapse_flush(), set_huge_pmd() and
collapse_and_free_pmd() can modify a PMD, and without proper locking
such modifications would go undetected by a page fault handler.
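
In condensed form, the ordering this patch establishes before the
PMD-modifying operations (helper names come from earlier patches in
this series; the surrounding code is elided):

	vma_write_lock(vma);			/* in collapse_huge_page() */
	anon_vma_lock_write(vma->anon_vma);
	/*
	 * pmdp_collapse_flush() (and, in the other call sites below,
	 * set_huge_pmd()/collapse_and_free_pmd()) can now modify the PMD:
	 * an in-flight lockless fault completes before the write lock is
	 * granted, and new ones fall back to mmap_lock.
	 */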

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/khugepaged.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5376246a3052..d8d0647f0c2c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1032,6 +1032,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	vma_write_lock(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1503,6 +1504,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	/* Lock the vma before taking i_mmap and page table locks */
+	vma_write_lock(vma);
+
 	/*
 	 * We need to lock the mapping so that from here on, only GUP-fast and
 	 * hardware page walks can access the parts of the page tables that
@@ -1690,6 +1694,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 				result = SCAN_PTE_UFFD_WP;
 				goto unlock_next;
 			}
+			vma_write_lock(vma);
 			collapse_and_free_pmd(mm, vma, addr, pmd);
 			if (!cc->is_khugepaged && is_target)
 				result = set_huge_pmd(vma, addr, pmd, hpage);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (17 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust Suren Baghdasaryan
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Decisions about whether VMAs can be merged, split or expanded must be
made while the VMAs are protected from changes which can affect that
decision. For example, vma_merge() uses vma->anon_vma when deciding
whether the VMA can be merged, while the page fault handler can change
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split
operation before deciding how such operations should be performed.

Not sure if expansion really needs this, just being paranoid. Without
it, mmap_region() and vm_brk_flags() might be missing the necessary
locking.
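
A condensed sketch of the pattern applied to vma_merge() below (helpers
as introduced earlier in the series): every VMA that feeds into the
merge decision is write-locked before can_vma_merge_{before,after}()
is consulted.

	if (prev)
		vma_write_lock(prev);
	next = find_vma(mm, prev ? prev->vm_end : 0);
	if (next)
		vma_write_lock(next);	/* anon_vma is now stable for the checks */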

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 53d885e70a54..f6ca4a87f9e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -254,8 +254,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	 */
 	mas_set(&mas, oldbrk);
 	next = mas_find(&mas, newbrk - 1 + PAGE_SIZE + stack_guard_gap);
-	if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
-		goto out;
+	if (next) {
+		vma_write_lock(next);
+		if (newbrk + PAGE_SIZE > vm_start_gap(next))
+			goto out;
+	}
 
 	brkvma = mas_prev(&mas, mm->start_brk);
 	/* Ok, looks good - let it rip. */
@@ -1017,10 +1020,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
+	if (prev)
+		vma_write_lock(prev);
 	next = find_vma(mm, prev ? prev->vm_end : 0);
 	mid = next;
-	if (next && next->vm_end == end)		/* cases 6, 7, 8 */
+	if (next)
+		vma_write_lock(next);
+	if (next && next->vm_end == end) {		/* cases 6, 7, 8 */
 		next = find_vma(mm, next->vm_end);
+		if (next)
+			vma_write_lock(next);
+	}
 
 	/* verify some invariant that must be enforced by the caller */
 	VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2198,6 +2208,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	int err;
 	validate_mm_mt(mm);
 
+	vma_write_lock(vma);
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		err = vma->vm_ops->may_split(vma, addr);
 		if (err)
@@ -2564,6 +2575,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	/* Attempt to expand an old mapping */
 	/* Check next */
+	if (next)
+		vma_write_lock(next);
 	if (next && next->vm_start == end && !vma_policy(next) &&
 	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
 				 NULL_VM_UFFD_CTX, NULL)) {
@@ -2573,6 +2586,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Check prev */
+	if (prev)
+		vma_write_lock(prev);
 	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
 	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
 				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
@@ -2942,6 +2957,8 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
+	if (vma)
+		vma_write_lock(vma);
 	/*
 	 * Expand the existing vma if possible; Note that singular lists do not
 	 * occur after forking, so the expand will only happen on new VMAs.
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (18 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion Suren Baghdasaryan
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f6ca4a87f9e2..1e2154137631 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -614,6 +614,12 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
  * The following helper function should be used when such adjustments
  * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
+ * 'expand' vma is always locked before it's passed to __vma_adjust()
+ * from vma_merge() because vma should not change from the moment
+ * can_vma_merge_{before|after} decision is made.
+ * 'insert' vma is used only by __split_vma() and it's always a brand
+ * new vma which is not yet added into mm's vma tree, therefore no need
+ * to lock it.
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
@@ -633,6 +639,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct vm_area_struct *exporter = NULL, *importer = NULL;
 
+	vma_write_lock(vma);
+	if (next)
+		vma_write_lock(next);
+
 	if (next && !insert) {
 		if (end >= next->vm_end) {
 			/*
@@ -676,8 +686,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * If next doesn't have anon_vma, import from vma after
 			 * next, if the vma overlaps with it.
 			 */
-			if (remove_next == 2 && !next->anon_vma)
+			if (remove_next == 2 && !next->anon_vma) {
 				exporter = next_next;
+				if (exporter)
+					vma_write_lock(exporter);
+			}
 
 		} else if (end > next->vm_start) {
 			/*
@@ -850,6 +863,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (remove_next == 2) {
 			remove_next = 1;
 			next = next_next;
+			if (next)
+				vma_write_lock(next);
 			goto again;
 		}
 	}
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (19 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range Suren Baghdasaryan
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1e2154137631..ff02cb51e7e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -544,6 +544,7 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
 	if (mas_preallocate(mas, vma, GFP_KERNEL))
 		goto nomem;
 
+	vma_write_lock(vma);
 	vma_adjust_trans_huge(vma, start, end, 0);
 
 	if (file) {
@@ -590,6 +591,7 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
 	}
 
 	if (remove_next) {
+		vma_write_lock(next);
 		if (file) {
 			uprobe_munmap(next, next->vm_start, next->vm_end);
 			fput(file);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (20 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree Suren Baghdasaryan
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Write-lock the VMA before copying it and write-lock the new VMA
produced by copy_vma() before linking it into the VMA tree.
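
In condensed form (matching the two hunks below), both the source VMA
and the VMA created by copy_vma() are write-locked before the page
fault path can observe the move:

	vma_write_lock(vma);		/* source VMA is about to be remapped */
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
			   &need_rmap_locks);

	/* inside copy_vma(), before the copy becomes reachable: */
	vma_write_lock(new_vma);
	if (vma_link(mm, new_vma))
		goto out_vma_link;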

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
---
 mm/mmap.c   | 1 +
 mm/mremap.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index ff02cb51e7e7..da1908730828 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3261,6 +3261,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		vma_write_lock(new_vma);
 		if (vma_link(mm, new_vma))
 			goto out_vma_link;
 		*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2ccdd1561f5b..d24a79bcb1a1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -622,6 +622,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 			return -ENOMEM;
 	}
 
+	vma_write_lock(vma);
 	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
 	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
 			   &need_rmap_locks);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (21 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables Suren Baghdasaryan
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.
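
A condensed sketch of the removal path (as in vma_mas_remove() below):
the per-VMA write lock is taken before the VMA disappears from the
maple tree, so a racing lock_vma_under_rcu() either fails its trylock
or finds the VMA gone when it revalidates.

	vma_write_lock(vma);
	mas->index = vma->vm_start;
	mas->last = vma->vm_end - 1;
	mas_store_prealloc(mas, NULL);	/* VMA no longer reachable via the tree */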

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c  | 2 ++
 mm/nommu.c | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index da1908730828..be289e0b693b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -448,6 +448,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
  */
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+	vma_write_lock(vma);
 	trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1);
 	mas->index = vma->vm_start;
 	mas->last = vma->vm_end - 1;
@@ -2300,6 +2301,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 static inline int munmap_sidetree(struct vm_area_struct *vma,
 				   struct ma_state *mas_detach)
 {
+	vma_write_lock(vma);
 	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
 	if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
 		return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index b3154357ced5..7ae91337ef14 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -552,6 +552,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
 
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+	vma_write_lock(vma);
 	mas->index = vma->vm_start;
 	mas->last = vma->vm_end - 1;
 	mas_store_prealloc(mas, NULL);
@@ -1551,6 +1552,10 @@ void exit_mmap(struct mm_struct *mm)
 	mmap_write_lock(mm);
 	for_each_vma(vmi, vma) {
 		cleanup_vma_from_mm(vma);
+		/*
+		 * No need to lock VMA because this is the only mm user and no
+		 * page fault handler can race with it.
+		 */
 		delete_vma(mm, vma);
 		cond_resched();
 	}
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (22 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area Suren Baghdasaryan
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Normally free_pgtables() needs to lock the affected VMAs, except for
the case when the VMAs were isolated under a VMA write-lock. munmap()
does just that: it isolates the VMAs while holding the appropriate
locks, then downgrades mmap_lock and drops the per-VMA locks before
freeing the page tables.
Add a parameter to free_pgtables() and unmap_region() for this
scenario.
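
The two flavours of caller, in condensed form (matching the hunks
below):

	/*
	 * munmap(): VMAs were isolated under their write locks already and
	 * mmap_lock may have been downgraded, so skip the per-VMA locking.
	 */
	unmap_region(mm, &mt_detach, vma, prev, next, start, end,
		     !downgrade /* lock_vma */);

	/*
	 * exit_mmap() and the mmap_region() error path still ask
	 * free_pgtables() to lock each VMA before unlinking it.
	 */
	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
		      USER_PGTABLES_CEILING, true /* lock_vma */);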

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/internal.h |  2 +-
 mm/memory.c   |  6 +++++-
 mm/mmap.c     | 18 ++++++++++++------
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..5ea4ff1a70e7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -87,7 +87,7 @@ void folio_activate(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		   struct vm_area_struct *start_vma, unsigned long floor,
-		   unsigned long ceiling);
+		   unsigned long ceiling, bool lock_vma);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index 2fabf89b2be9..9ece18548db1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		   struct vm_area_struct *vma, unsigned long floor,
-		   unsigned long ceiling)
+		   unsigned long ceiling, bool lock_vma)
 {
 	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		 * Hide vma from rmap and truncate_pagecache before freeing
 		 * pgtables
 		 */
+		if (lock_vma)
+			vma_write_lock(vma);
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = mas_find(&mas, ceiling - 1);
+				if (lock_vma)
+					vma_write_lock(vma);
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
diff --git a/mm/mmap.c b/mm/mmap.c
index be289e0b693b..0d767ce043af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -78,7 +78,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		struct vm_area_struct *next, unsigned long start,
-		unsigned long end);
+		unsigned long end, bool lock_vma);
 
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
@@ -2202,7 +2202,7 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		struct vm_area_struct *next,
-		unsigned long start, unsigned long end)
+		unsigned long start, unsigned long end, bool lock_vma)
 {
 	struct mmu_gather tlb;
 
@@ -2211,7 +2211,8 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, mt, vma, start, end);
 	free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-				 next ? next->vm_start : USER_PGTABLES_CEILING);
+				 next ? next->vm_start : USER_PGTABLES_CEILING,
+				 lock_vma);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -2468,7 +2469,11 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			mmap_write_downgrade(mm);
 	}
 
-	unmap_region(mm, &mt_detach, vma, prev, next, start, end);
+	/*
+	 * We can free page tables without locking the vmas because they were
+	 * isolated before we downgraded mmap_lock and dropped per-vma locks.
+	 */
+	unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
 	/* Statistics and freeing VMAs */
 	mas_set(&mas_detach, start);
 	remove_mt(mm, &mas_detach);
@@ -2785,7 +2790,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_file = NULL;
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);
+	unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end,
+		     true);
 	if (file && (vm_flags & VM_SHARED))
 		mapping_unmap_writable(file->f_mapping);
 free_vma:
@@ -3130,7 +3136,7 @@ void exit_mmap(struct mm_struct *mm)
 	mmap_write_lock(mm);
 	mt_clear_in_rcu(&mm->mm_mt);
 	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
-		      USER_PGTABLES_CEILING);
+		      USER_PGTABLES_CEILING, true);
 	tlb_finish_mmu(&tlb);
 
 	/*
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (23 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

While unmapping VMAs, adjacent VMAs might be able to grow into the
area being unmapped. In such cases, write-lock the adjacent VMAs to
prevent this growth.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0d767ce043af..30c7d1c5206e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2461,11 +2461,13 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
 	 */
 	if (downgrade) {
-		if (next && (next->vm_flags & VM_GROWSDOWN))
+		if (next && (next->vm_flags & VM_GROWSDOWN)) {
+			vma_write_lock(next);
 			downgrade = false;
-		else if (prev && (prev->vm_flags & VM_GROWSUP))
+		} else if (prev && (prev->vm_flags & VM_GROWSUP)) {
+			vma_write_lock(prev);
 			downgrade = false;
-		else
+		} else
 			mmap_write_downgrade(mm);
 	}
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (24 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:42   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Assert that there are no readers holding the VMA lock when the VMA is
about to be destroyed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 8 ++++++++
 kernel/fork.c      | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 594e835bad9c..c464fc8a514c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
 }
 
+static inline void vma_assert_no_reader(struct vm_area_struct *vma)
+{
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+		      vma);
+}
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -688,6 +695,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_read_unlock(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1591dd8a0745..6d9f14e55ecf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -485,6 +485,8 @@ static void __vm_area_free(struct rcu_head *head)
 {
 	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
 						  vm_rcu);
+	/* The vma should either have no lock holders or be write-locked. */
+	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 #endif
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (25 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-18 12:50   ` Jann Horn
  2023-01-09 20:53 ` [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code Suren Baghdasaryan
                   ` (13 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks() to write-lock
all VMAs, preventing this race with fault handlers that would otherwise
hold per-VMA locks. VMAs are locked before i_mmap_rwsem and anon_vma to
keep the same locking order as in page fault handlers.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 30c7d1c5206e..a256deca0bc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
  * of mm/rmap.c:
  *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
  *     hugetlb mapping);
+ *   - all vmas marked locked
  *   - all i_mmap_rwsem locks;
  *   - all anon_vma->rwseml
  *
@@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
 	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (signal_pending(current))
 			goto out_unlock;
+		vma_write_lock(vma);
 		if (vma->vm_file && vma->vm_file->f_mapping &&
 				is_vm_hugetlb_page(vma))
 			vm_lock_mapping(mm, vma->vm_file->f_mapping);
@@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
 		if (vma->vm_file && vma->vm_file->f_mapping)
 			vm_unlock_mapping(vma->vm_file->f_mapping);
 	}
+	vma_write_unlock_mm(mm);
 
 	mutex_unlock(&mm_all_locks_mutex);
 }
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (26 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:47   ` Michal Hocko
  2023-01-17 21:03   ` Jann Horn
  2023-01-09 20:53 ` [PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set Suren Baghdasaryan
                   ` (12 subsequent siblings)
  40 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Introduce the lock_vma_under_rcu() function to look up and lock a VMA
during page fault handling. When the VMA is not found, cannot be locked
or changes after being locked, the function returns NULL. The lookup is
performed under RCU protection to prevent the found VMA from being
destroyed before the VMA lock is acquired. VMA lock statistics are
updated according to the result.
For now only anonymous VMAs can be searched this way; in all other
cases the function returns NULL.
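
A condensed sketch of the intended arch-side usage (this is the pattern
wired up by the x86/arm64/powerpc patches later in this series):

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;		/* fall back to the mmap_lock path */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_read_unlock(vma);

	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;		/* handled without mmap_lock */
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);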

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c464fc8a514c..d0fddf6a1de9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 		      vma);
 }
 
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+					  unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 9ece18548db1..a658e26d965d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+					  unsigned long address)
+{
+	MA_STATE(mas, &mm->mm_mt, address, address);
+	struct vm_area_struct *vma, *validate;
+
+	rcu_read_lock();
+	vma = mas_walk(&mas);
+retry:
+	if (!vma)
+		goto inval;
+
+	/* Only anonymous vmas are supported for now */
+	if (!vma_is_anonymous(vma))
+		goto inval;
+
+	if (!vma_read_trylock(vma))
+		goto inval;
+
+	/* Check since vm_start/vm_end might change before we lock the VMA */
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+		vma_read_unlock(vma);
+		goto inval;
+	}
+
+	/* Check if the VMA got isolated after we found it */
+	mas.index = address;
+	validate = mas_walk(&mas);
+	if (validate != vma) {
+		vma_read_unlock(vma);
+		count_vm_vma_lock_event(VMA_LOCK_MISS);
+		/* The area was replaced with another one. */
+		vma = validate;
+		goto retry;
+	}
+
+	rcu_read_unlock();
+	return vma;
+inval:
+	rcu_read_unlock();
+	count_vm_vma_lock_event(VMA_LOCK_ABORT);
+	return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (27 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

When vma->anon_vma is not set, the page fault handler will set it by
either reusing the anon_vma of an adjacent VMA, if the VMAs are
compatible, or by allocating a new one. find_mergeable_anon_vma() walks
the VMA tree to find a compatible adjacent VMA, and that requires not
only the faulting VMA to be stable but also the tree structure and the
other VMAs inside that tree. Therefore locking just the faulting VMA is
not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a658e26d965d..2560524ad7f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5264,6 +5264,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_is_anonymous(vma))
 		goto inval;
 
+	/* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+	if (!vma->anon_vma)
+		goto inval;
+
 	if (!vma_read_trylock(vma))
 		goto inval;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (28 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock Suren Baghdasaryan
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Add a new flag to distinguish page faults handled under the protection
of a per-VMA lock.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
---
 include/linux/mm.h       | 3 ++-
 include/linux/mm_types.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d0fddf6a1de9..2e3be1d45371 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -467,7 +467,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_VMA_LOCK,		"VMA_LOCK" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0d27edd3e63a..fce9113d979c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1103,6 +1103,7 @@ enum fault_flag {
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
+	FAULT_FLAG_VMA_LOCK =		1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (29 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock Suren Baghdasaryan
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Due to the possibility of do_swap_page() dropping mmap_lock, abort
fault handling under the VMA lock and retry while holding mmap_lock.
This can be handled more gracefully in the future.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
---
 mm/memory.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 2560524ad7f4..20806bc8b4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3707,6 +3707,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+		ret = VM_FAULT_RETRY;
+		goto out;
+	}
+
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (30 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 19:51   ` Jann Horn
  2023-01-09 20:53 ` [PATCH 33/41] mm: introduce per-VMA lock statistics Suren Baghdasaryan
                   ` (8 subsequent siblings)
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Due to the possibility of handle_userfault() dropping mmap_lock, avoid
fault handling under the VMA lock and retry while holding mmap_lock.
This can be handled more gracefully in the future.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 20806bc8b4eb..12508f4d845a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma->anon_vma)
 		goto inval;
 
+	/*
+	 * Due to the possibility of userfault handler dropping mmap_lock, avoid
+	 * it for now and fall back to page fault handling under mmap_lock.
+	 */
+	if (userfaultfd_armed(vma))
+		goto inval;
+
 	if (!vma_read_trylock(vma))
 		goto inval;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 33/41] mm: introduce per-VMA lock statistics
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (31 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 34/41] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about handling page faults under per-VMA locks.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/vm_event_item.h | 6 ++++++
 include/linux/vmstat.h        | 6 ++++++
 mm/Kconfig.debug              | 8 ++++++++
 mm/vmstat.c                   | 6 ++++++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+		VMA_LOCK_SUCCESS,
+		VMA_LOCK_ABORT,
+		VMA_LOCK_RETRY,
+		VMA_LOCK_MISS,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
 #define __count_zid_vm_events(item, zid, delta) \
 	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..32a93b064590 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
 	  kernel.
 
 	  If in doubt, say N.
+
+
+config PER_VMA_LOCK_STATS
+	bool "Statistics for per-vma locks"
+	depends on PER_VMA_LOCK
+	default y
+	help
+	  Statistics for per-vma locks.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
 #endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+	"vma_lock_success",
+	"vma_lock_abort",
+	"vma_lock_retry",
+	"vma_lock_miss",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 34/41] x86/mm: try VMA lock-based page fault handling first
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (32 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 33/41] mm: introduce per-VMA lock statistics Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 35/41] arm64/mm: " Suren Baghdasaryan
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/x86/Kconfig    |  1 +
 arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3647f7bdb110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
 	# Options that are inherently 64-bit kernel only:
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..983266e7c49b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/mm.h>			/* find_and_lock_vma() */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1354,6 +1355,38 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = lock_vma_under_rcu(mm, address);
+	if (!vma)
+		goto lock_mmap;
+
+	if (unlikely(access_error(error_code, vma))) {
+		vma_read_unlock(vma);
+		goto lock_mmap;
+	}
+	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR,
+						 ARCH_DEFAULT_PKEY);
+		return;
+	}
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
 	/*
 	 * Kernel-mode access to the user address space should only occur
 	 * on well-defined single instructions listed in the exception
@@ -1454,6 +1487,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 35/41] arm64/mm: try VMA lock-based page fault handling first
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (33 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 34/41] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 36/41] powerpc/mm: " Suren Baghdasaryan
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/arm64/Kconfig    |  1 +
 arch/arm64/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed..829fa6d14a36 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 596f46dabe4e..833fa8bab291 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	unsigned long vm_flags;
 	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
 	unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+	struct vm_area_struct *vma;
+#endif
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = lock_vma_under_rcu(mm, addr);
+	if (!vma)
+		goto lock_mmap;
+
+	if (!(vma->vm_flags & vm_flags)) {
+		vma_read_unlock(vma);
+		goto lock_mmap;
+	}
+	fault = handle_mm_fault(vma, addr & PAGE_MASK,
+				mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			goto no_context;
+		return 0;
+	}
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
 	/*
 	 * As per x86, we may deadlock here. However, since the kernel only
 	 * validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	}
 	mmap_read_unlock(mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	/*
 	 * Handle the "normal" (no error) case first.
 	 */
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 36/41] powerpc/mm: try VMA lock-based page fault handling first
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (34 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 35/41] arm64/mm: " Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 37/41] mm: introduce mod_vm_flags_nolock Suren Baghdasaryan
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

From: Laurent Dufour <ldufour@linux.ibm.com>

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/mm/fault.c                | 41 ++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/Kconfig |  1 +
 arch/powerpc/platforms/pseries/Kconfig |  1 +
 3 files changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..f92f8956d5f2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (is_exec)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_PER_VMA_LOCK
+	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+		goto lock_mmap;
+
+	vma = lock_vma_under_rcu(mm, address);
+	if (!vma)
+		goto lock_mmap;
+
+	if (unlikely(access_pkey_error(is_write, is_exec,
+				       (error_code & DSISR_KEYFAULT), vma))) {
+		int rc = bad_access_pkey(regs, address, vma);
+
+		vma_read_unlock(vma);
+		return rc;
+	}
+
+	if (unlikely(access_error(is_write, is_exec, vma))) {
+		int rc = bad_access(regs, address);
+
+		vma_read_unlock(vma);
+		return rc;
+	}
+
+	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+	vma_read_unlock(vma);
+
+	if (!(fault & VM_FAULT_RETRY)) {
+		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+		goto done;
+	}
+	count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+	if (fault_signal_pending(fault, regs))
+		return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	mmap_read_unlock(current->mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return mm_fault_error(regs, address, fault);
 
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
 	select PPC_DOORBELL
 	select MMU_NOTIFIER
 	select FORCE_SMP
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	default y
 
 config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..e036a04ff1ca 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
 	select HOTPLUG_CPU
 	select FORCE_SMP
 	select SWIOTLB
+	select ARCH_SUPPORTS_PER_VMA_LOCK
 	default y
 
 config PARAVIRT
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 37/41] mm: introduce mod_vm_flags_nolock
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (35 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 36/41] powerpc/mm: " Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 38/41] mm: avoid assertion in untrack_pfn Suren Baghdasaryan
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

In cases when VMA flags are modified after the VMA has been isolated and
mmap_lock downgraded, the flag modifications do not require per-VMA
locking, and an attempt to lock the VMA would trigger an assertion
because the mmap write lock is not held.
Introduce mod_vm_flags_nolock for use in such situations.
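
A minimal sketch of the intended split (hypothetical callers; the helper
names match the diff below):

    /* Common case: the VMA is still reachable, normal locking rules apply. */
    mod_vm_flags(vma, VM_LOCKED, 0);

    /* Teardown case: the VMA is already isolated and mmap_lock has been
     * downgraded, so skip the per-VMA lock (and its assertion). */
    mod_vm_flags_nolock(vma, 0, VM_PAT);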

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e3be1d45371..7d436a5027cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -743,6 +743,14 @@ void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
 	vma->vm_flags &= ~flags;
 }
 
+static inline
+void mod_vm_flags_nolock(struct vm_area_struct *vma,
+		  unsigned long set, unsigned long clear)
+{
+	vma->vm_flags |= set;
+	vma->vm_flags &= ~clear;
+}
+
 static inline
 void mod_vm_flags(struct vm_area_struct *vma,
 		  unsigned long set, unsigned long clear)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 38/41] mm: avoid assertion in untrack_pfn
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (36 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 37/41] mm: introduce mod_vm_flags_nolock Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

untrack_pfn can be called after the VMA has been isolated and mmap_lock
downgraded. An attempt to lock the affected VMA would trigger an
assertion; use mod_vm_flags_nolock in such situations instead.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/x86/mm/pat/memtype.c | 10 +++++++---
 include/linux/mm.h        |  2 +-
 include/linux/pgtable.h   |  5 +++--
 mm/memory.c               | 15 ++++++++-------
 mm/memremap.c             |  4 ++--
 mm/mmap.c                 |  4 ++--
 6 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 9e490a372896..f71c8381430b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1045,7 +1045,7 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-		 unsigned long size)
+		 unsigned long size, bool lock_vma)
 {
 	resource_size_t paddr;
 	unsigned long prot;
@@ -1064,8 +1064,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 		size = vma->vm_end - vma->vm_start;
 	}
 	free_pfn_range(paddr, size);
-	if (vma)
-		clear_vm_flags(vma, VM_PAT);
+	if (vma) {
+		if (lock_vma)
+			clear_vm_flags(vma, VM_PAT);
+		else
+			mod_vm_flags_nolock(vma, 0, VM_PAT);
+	}
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d436a5027cc..3158f33e268c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2135,7 +2135,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 			   unsigned long size, struct zap_details *details);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		struct vm_area_struct *start_vma, unsigned long start,
-		unsigned long end);
+		unsigned long end, bool lock_vma);
 
 struct mmu_notifier_range;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..eaa831bd675d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1214,7 +1214,8 @@ static inline int track_pfn_copy(struct vm_area_struct *vma)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 static inline void untrack_pfn(struct vm_area_struct *vma,
-			       unsigned long pfn, unsigned long size)
+			       unsigned long pfn, unsigned long size,
+			       bool lock_vma)
 {
 }
 
@@ -1232,7 +1233,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
 			     pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-			unsigned long size);
+			unsigned long size, bool lock_vma);
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index 12508f4d845a..5c7d5eaa60d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1610,7 +1610,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr,
-		struct zap_details *details)
+		struct zap_details *details, bool lock_vma)
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
@@ -1625,7 +1625,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 		uprobe_munmap(vma, start, end);
 
 	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0);
+		untrack_pfn(vma, 0, 0, lock_vma);
 
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
@@ -1672,7 +1672,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  */
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr)
+		unsigned long end_addr, bool lock_vma)
 {
 	struct mmu_notifier_range range;
 	struct zap_details details = {
@@ -1686,7 +1686,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+				 lock_vma);
 	} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 }
@@ -1715,7 +1716,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, NULL, false);
 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
@@ -1750,7 +1751,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	 * unmap 'address-end' not 'range.start-range.end' as range
 	 * could have been expanded for hugetlb pmd sharing.
 	 */
-	unmap_single_vma(&tlb, vma, address, end, details);
+	unmap_single_vma(&tlb, vma, address, end, details, false);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
@@ -2519,7 +2520,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 
 	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
 	if (err)
-		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
+		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
diff --git a/mm/memremap.c b/mm/memremap.c
index 08cbf54fe037..2f88f43d4a01 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -129,7 +129,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 	}
 	mem_hotplug_done();
 
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
 	pgmap_array_delete(range);
 }
 
@@ -276,7 +276,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	if (!is_private)
 		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 err_kasan:
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
 err_pfn_remap:
 	pgmap_array_delete(range);
 	return error;
diff --git a/mm/mmap.c b/mm/mmap.c
index a256deca0bc0..332af383f7cd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2209,7 +2209,7 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, mt, vma, start, end);
+	unmap_vmas(&tlb, mt, vma, start, end, lock_vma);
 	free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING,
 				 lock_vma);
@@ -3127,7 +3127,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb_gather_mmu_fullmm(&tlb, mm);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
+	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
 	mmap_read_unlock(mm);
 
 	/*
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (37 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 38/41] mm: avoid assertion in untrack_pfn Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 15:57   ` Michal Hocko
  2023-01-19 12:59   ` Michal Hocko
  2023-01-09 20:53 ` [PATCH 40/41] mm: separate vma->lock from vm_area_struct Suren Baghdasaryan
  2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
  40 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

call_rcu() can take a long time when callback offloading is enabled.
Its use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs on
a list and free them in groups, using one call_rcu() call per group.
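
For scale, a back-of-the-envelope illustration (not a measurement from this
series): with the VM_AREA_FREE_LIST_MAX threshold of 32 used below, an exit
path freeing ~40000 VMAs issues on the order of 40000 / 32 ≈ 1250 call_rcu()
invocations instead of 40000, with any remainder flushed by the final
drain_free_vmas() call in exit_mmap().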

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 19 +++++++++--
 kernel/fork.c            | 68 +++++++++++++++++++++++++++++++++++-----
 mm/init-mm.c             |  3 ++
 mm/mmap.c                |  1 +
 5 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@ struct vm_area_struct {
 	/* Information about our backing store: */
 	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
 					   units */
-	struct file * vm_file;		/* File we map to (can be NULL). */
-	void * vm_private_data;		/* was vm_pte (shared mem) */
+	union {
+		struct {
+			/* File we map to (can be NULL). */
+			struct file *vm_file;
+
+			/* was vm_pte (shared mem) */
+			void *vm_private_data;
+		};
+#ifdef CONFIG_PER_VMA_LOCK
+		struct list_head vm_free_list;
+#endif
+	};
 
 #ifdef CONFIG_ANON_VMA_NAME
 	/*
@@ -693,6 +703,11 @@ struct mm_struct {
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
 		int mm_lock_seq;
+		struct {
+			struct list_head head;
+			spinlock_t lock;
+			int size;
+		} vma_free_list;
 #endif
 
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+	struct vm_area_struct *first_vma;
+	struct vm_area_struct *vma, *vma2;
+
+	first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+	list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+		__vm_area_free(vma);
+	__vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+	struct vm_area_struct *first_vma;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_splice_init(&mm->vma_free_list.head, &to_destroy);
+	mm->vma_free_list.size = 0;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (list_empty(&to_destroy))
+		return;
+
+	first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+	/* Remove the head which is allocated on the stack */
+	list_del(&to_destroy);
+
+	call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX	32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool drain;
+
+	free_anon_vma_name(vma);
+
+	spin_lock(&mm->vma_free_list.lock);
+	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+	mm->vma_free_list.size++;
+	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+	spin_unlock(&mm->vma_free_list.lock);
+
+	if (drain)
+		drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
 	free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
 	kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1150,6 +1199,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
 	WRITE_ONCE(mm->mm_lock_seq, 0);
+	INIT_LIST_HEAD(&mm->vma_free_list.head);
+	spin_lock_init(&mm->vma_free_list.lock);
+	mm->vma_free_list.size = 0;
 #endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 33269314e060..b53d23c2d7a3 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -39,6 +39,9 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
 	.mm_lock_seq	= 0,
+	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
+	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
+	.vma_free_list.size = 0,
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
diff --git a/mm/mmap.c b/mm/mmap.c
index 332af383f7cd..a0d5d3af1d95 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
 	trace_exit_mmap(mm);
 	__mt_destroy(&mm->mm_mt);
 	mmap_write_unlock(mm);
+	drain_free_vmas(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 40/41] mm: separate vma->lock from vm_area_struct
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (38 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-17 18:33   ` Jann Horn
  2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
  40 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

vma->lock being part of vm_area_struct causes a performance regression
during page faults: under contention its count and owner fields are
constantly updated, and because other vm_area_struct fields used during
page fault handling sit next to them, this causes constant cache line
bouncing. Fix that by moving the lock outside of vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce a performance regression, especially on NUMA
machines. The smallest regression was achieved when the lock is placed
in the fourth cache line, but that bloats vm_area_struct to 256 bytes.
Considering the performance and memory impact, a separate lock looks
like the best option. It increases the memory footprint of each VMA,
but that will be addressed in the next patch.
Note that after this change vma_init() no longer allocates or
initializes vma->lock. A number of drivers allocate a pseudo VMA on
the stack, but they never use the VMA's lock, so it does not need to
be allocated. Future drivers that do need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate it.
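
An illustrative contrast of the two paths described above (hypothetical
driver code, not part of the patch):

    /* Pseudo VMA on the stack: vma->vm_lock is never allocated, which is
     * fine because such VMAs are never locked. */
    struct vm_area_struct pseudo_vma;

    vma_init(&pseudo_vma, mm);

    /* VMA that participates in locking: vm_area_alloc() allocates and
     * initializes vma->vm_lock, vm_area_free() releases it. */
    struct vm_area_struct *vma = vm_area_alloc(mm);

    if (!vma)
        return -ENOMEM;
    /* ... use the VMA ... */
    vm_area_free(vma);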

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 25 ++++++------
 include/linux/mm_types.h |  6 ++-
 kernel/fork.c            | 82 ++++++++++++++++++++++++++++------------
 3 files changed, 74 insertions(+), 39 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50c7a6dd9c7a..d40bf8a5e19e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,11 +615,6 @@ struct vm_operations_struct {
 };
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
-	init_rwsem(&vma->lock);
-	vma->vm_lock_seq = -1;
-}
 
 static inline void vma_write_lock(struct vm_area_struct *vma)
 {
@@ -635,9 +630,9 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
 	if (vma->vm_lock_seq == mm_lock_seq)
 		return;
 
-	down_write(&vma->lock);
+	down_write(&vma->vm_lock->lock);
 	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->lock);
+	up_write(&vma->vm_lock->lock);
 }
 
 /*
@@ -651,17 +646,17 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
 	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
 		return false;
 
 	/*
 	 * Overflow might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 */
 	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->lock);
+		up_read(&vma->vm_lock->lock);
 		return false;
 	}
 	return true;
@@ -669,7 +664,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-	up_read(&vma->lock);
+	up_read(&vma->vm_lock->lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -684,7 +679,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
 		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
 		      vma);
 }
@@ -694,7 +689,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
 static inline void vma_write_lock(struct vm_area_struct *vma) {}
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 		{ return false; }
@@ -704,6 +698,10 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -712,7 +710,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_init_lock(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c0e6c8e4700b..faa61b400f9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_lock {
+	struct rw_semaphore lock;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -563,7 +567,7 @@ struct vm_area_struct {
 
 #ifdef CONFIG_PER_VMA_LOCK
 	int vm_lock_seq;
-	struct rw_semaphore lock;
+	struct vma_lock *vm_lock;
 #endif
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 97f2b751f88d..95db6a521cf1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,40 +451,28 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
-	struct vm_area_struct *vma;
+#ifdef CONFIG_PER_VMA_LOCK
 
-	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-	if (vma)
-		vma_init(vma, mm);
-	return vma;
-}
+/* SLAB cache for vm_area_struct.lock */
+static struct kmem_cache *vma_lock_cachep;
 
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+static bool vma_init_lock(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+	if (!vma->vm_lock)
+		return false;
 
-	if (new) {
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-		/*
-		 * orig->shared.rb may be modified concurrently, but the clone
-		 * will be reinitialized.
-		 */
-		*new = data_race(*orig);
-		INIT_LIST_HEAD(&new->anon_vma_chain);
-		vma_init_lock(new);
-		dup_anon_vma_name(orig, new);
-	}
-	return new;
+	init_rwsem(&vma->vm_lock->lock);
+	vma->vm_lock_seq = -1;
+
+	return true;
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
 static inline void __vm_area_free(struct vm_area_struct *vma)
 {
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
+	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -540,6 +528,7 @@ void vm_area_free(struct vm_area_struct *vma)
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static bool vma_init_lock(struct vm_area_struct *vma) { return true; }
 void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
@@ -550,6 +539,48 @@ void vm_area_free(struct vm_area_struct *vma)
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!vma)
+		return NULL;
+
+	vma_init(vma, mm);
+	if (!vma_init_lock(vma)) {
+		kmem_cache_free(vm_area_cachep, vma);
+		return NULL;
+	}
+
+	return vma;
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+	struct vm_area_struct *new;
+
+	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+	/*
+	 * orig->shared.rb may be modified concurrently, but the clone
+	 * will be reinitialized.
+	 */
+	*new = data_race(*orig);
+	if (!vma_init_lock(new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&new->anon_vma_chain);
+	dup_anon_vma_name(orig, new);
+
+	return new;
+}
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3138,6 +3169,9 @@ void __init proc_caches_init(void)
 			NULL);
 
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
 	mmap_init();
 	nsproxy_cache_init();
 }
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
                   ` (39 preceding siblings ...)
  2023-01-09 20:53 ` [PATCH 40/41] mm: separate vma->lock from vm_area_struct Suren Baghdasaryan
@ 2023-01-09 20:53 ` Suren Baghdasaryan
  2023-01-10  8:04   ` Vlastimil Babka
                     ` (3 more replies)
  40 siblings, 4 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-09 20:53 UTC (permalink / raw)
  To: akpm
  Cc: michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team, surenb

rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space in each vm_area_struct. However, vma_lock has
two important specifics which allow rw_semaphore to be replaced
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock,
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes the read lock, it increments the atomic unless the
value is negative. If that fails, read-locking is aborted and mmap_lock
is used instead.
When the writer takes the write lock, it resets the atomic value to -1
if the current value is 0 (no readers). Since all writers take mmap_lock
in write mode first, there can be only one writer at a time. If there
are readers, the writer places itself on a wait queue using the new
mm_struct.vma_writer_wait waitqueue head. The last reader to release
the vma_lock signals the writer to wake up.
vm_lock_seq is also moved into vma_lock; together with the atomic_t they
pack nicely into 8 bytes, bringing the per-VMA overhead of the lock down
from 44 to 16 bytes (8 for the structure plus 8 for the pointer):

    slabinfo before the changes:
     <name>            ... <objsize> <objperslab> <pagesperslab> : ...
    vm_area_struct    ...    152   53    2 : ...

    slabinfo with vma_lock:
     <name>            ... <objsize> <objperslab> <pagesperslab> : ...
    rw_semaphore      ...      8  512    1 : ...
    vm_area_struct    ...    160   51    2 : ...

Assuming 40000 vm_area_structs, memory consumption would be:
baseline: 6040kB
vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
Total increase: 556kB

atomic_t might overflow if there are many competing readers, therefore
vma_read_trylock() implements an overflow check; if an overflow occurs,
it restores the previous value and exits with a failure to lock.
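
As a self-contained illustration of this counting scheme, here is a
simplified user-space sketch using C11 atomics (a CAS loop stands in for
atomic_inc_unless_negative() and a busy yield stands in for the
vma_writer_wait waitqueue; the actual kernel code is in the diff below):

    #include <limits.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    struct vma_lock_sketch {
        atomic_int count;   /* > 0: readers, 0: unlocked, -1: write-locked */
    };

    static bool read_trylock(struct vma_lock_sketch *l)
    {
        int c = atomic_load(&l->count);

        /* Refuse to touch a write-locked or overflowing counter. */
        while (c >= 0 && c < INT_MAX) {
            if (atomic_compare_exchange_weak(&l->count, &c, c + 1))
                return true;
        }
        return false;       /* caller falls back to mmap_lock */
    }

    static void read_unlock(struct vma_lock_sketch *l)
    {
        atomic_fetch_sub(&l->count, 1);
        /* The kernel version wakes a waiting writer when count reaches 0. */
    }

    static void write_lock(struct vma_lock_sketch *l)
    {
        int expected = 0;

        /* Only one writer can get here (mmap_lock is held in write mode). */
        while (!atomic_compare_exchange_weak(&l->count, &expected, -1)) {
            expected = 0;
            sched_yield();  /* the kernel version sleeps on vma_writer_wait */
        }
    }

    static void write_unlock(struct vma_lock_sketch *l)
    {
        atomic_store(&l->count, 0);
    }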

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 37 +++++++++++++++++++++++++------------
 include/linux/mm_types.h | 10 ++++++++--
 kernel/fork.c            |  6 +++---
 mm/init-mm.c             |  2 ++
 4 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d40bf8a5e19e..294dd44b2198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
 	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
-	if (vma->vm_lock_seq == mm_lock_seq)
+	if (vma->vm_lock->lock_seq == mm_lock_seq)
 		return;
 
-	down_write(&vma->vm_lock->lock);
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
+	if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
+		wait_event(vma->vm_mm->vma_writer_wait,
+			   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
+	vma->vm_lock->lock_seq = mm_lock_seq;
+	/* Write barrier to ensure lock_seq change is visible before count */
+	smp_wmb();
+	atomic_set(&vma->vm_lock->count, 0);
 }
 
 /*
@@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 {
 	/* Check before locking. A race might cause false locked result. */
-	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
 		return false;
 
+	/* If atomic_t overflows, restore and fail to lock. */
+	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
+		if (atomic_dec_and_test(&vma->vm_lock->count))
+			wake_up(&vma->vm_mm->vma_writer_wait);
+		return false;
+	}
+
 	/*
 	 * Overflow might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
 	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 */
-	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+		if (atomic_dec_and_test(&vma->vm_lock->count))
+			wake_up(&vma->vm_mm->vma_writer_wait);
 		return false;
 	}
 	return true;
@@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-	up_read(&vma->vm_lock->lock);
+	if (atomic_dec_and_test(&vma->vm_lock->count))
+		wake_up(&vma->vm_mm->vma_writer_wait);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -674,13 +687,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
-	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+	VM_BUG_ON_VMA(vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
 }
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
-		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+	VM_BUG_ON_VMA(atomic_read(&vma->vm_lock->count) > 0 &&
+		      vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
 		      vma);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faa61b400f9b..a6050c38ca2e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -527,7 +527,13 @@ struct anon_vma_name {
 };
 
 struct vma_lock {
-	struct rw_semaphore lock;
+	/*
+	 * count > 0 ==> read-locked with 'count' number of readers
+	 * count < 0 ==> write-locked
+	 * count = 0 ==> unlocked
+	 */
+	atomic_t count;
+	int lock_seq;
 };
 
 /*
@@ -566,7 +572,6 @@ struct vm_area_struct {
 	unsigned long vm_flags;
 
 #ifdef CONFIG_PER_VMA_LOCK
-	int vm_lock_seq;
 	struct vma_lock *vm_lock;
 #endif
 
@@ -706,6 +711,7 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct wait_queue_head vma_writer_wait;
 		int mm_lock_seq;
 		struct {
 			struct list_head head;
diff --git a/kernel/fork.c b/kernel/fork.c
index 95db6a521cf1..b221ad182d98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -461,9 +461,8 @@ static bool vma_init_lock(struct vm_area_struct *vma)
 	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
 	if (!vma->vm_lock)
 		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = -1;
+	atomic_set(&vma->vm_lock->count, 0);
+	vma->vm_lock->lock_seq = -1;
 
 	return true;
 }
@@ -1229,6 +1228,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
+	init_waitqueue_head(&mm->vma_writer_wait);
 	WRITE_ONCE(mm->mm_lock_seq, 0);
 	INIT_LIST_HEAD(&mm->vma_free_list.head);
 	spin_lock_init(&mm->vma_free_list.lock);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index b53d23c2d7a3..0088e31e5f7e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -38,6 +38,8 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait =
+		__WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= 0,
 	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
 	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
@ 2023-01-10  8:04   ` Vlastimil Babka
  2023-01-10 17:05     ` Suren Baghdasaryan
  2023-01-16 11:14   ` Hyeonggon Yoo
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 186+ messages in thread
From: Vlastimil Babka @ 2023-01-10  8:04 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: michel, jglisse, mhocko, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On 1/9/23 21:53, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore with an atomic variable.
> When a reader takes read lock, it increments the atomic unless the
> value is negative. If that fails read-locking is aborted and mmap_lock
> is used instead.
> When writer takes write lock, it resets atomic value to -1 if the
> current value is 0 (no readers). Since all writers take mmap_lock in
> write mode first, there can be only one writer at a time. If there
> are readers, writer will place itself into a wait queue using new
> mm_struct.vma_writer_wait waitqueue head. The last reader to release
> the vma_lock will signal the writer to wake up.
> vm_lock_seq is also moved into vma_lock and along with atomic_t they
> are nicely packed and consume 8 bytes, bringing the overhead from
> vma_lock from 44 to 16 bytes:
> 
>     slabinfo before the changes:
>      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
>     vm_area_struct    ...    152   53    2 : ...
> 
>     slabinfo with vma_lock:
>      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
>     rw_semaphore      ...      8  512    1 : ...

I guess the cache is called vma_lock, not rw_semaphore?

>     vm_area_struct    ...    160   51    2 : ...
> 
> Assuming 40000 vm_area_structs, memory consumption would be:
> baseline: 6040kB
> vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> Total increase: 556kB
> 
> atomic_t might overflow if there are many competing readers, therefore
> vma_read_trylock() implements an overflow check and if that occurs it
> restors the previous value and exits with a failure to lock.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

This patch is an interesting addition indeed, but I can't help but
think it obsoletes the previous one :) We allocate an extra 8-byte slab
object for the lock, the pointer to it is also 8 bytes, and it requires an
indirection. The vma_lock cache is not cacheline aligned (otherwise it
would be a major waste), so we have potential false sharing with up to 7
other vma_locks.
I'd expect that if the vma_lock was placed next to the relatively cold
fields of vm_area_struct, it shouldn't cause much cache ping-pong when
working with that vma. Even if we don't cache-align the vma to save memory
(it would be 192 bytes instead of 160 when aligned) and place the vma_lock
and the cold fields at the end of the vma, it may false-share a cacheline
with the next vma in the slab. But that's a single vma, not up to 7, so it
shouldn't be worse?
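
(For reference, the cacheline alignment being weighed here would be a
one-flag change to the lock cache added in the previous patch - a sketch of
the trade-off being discussed, not a proposal from this thread:)

    /* Hypothetical: pads each 8-byte vma_lock to a full cacheline. */
    vma_lock_cachep = KMEM_CACHE(vma_lock,
                                 SLAB_PANIC|SLAB_ACCOUNT|SLAB_HWCACHE_ALIGN);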



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-10  8:04   ` Vlastimil Babka
@ 2023-01-10 17:05     ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-10 17:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, michel, jglisse, mhocko, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 10, 2023 at 12:04 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/9/23 21:53, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore with an atomic variable.
> > When a reader takes read lock, it increments the atomic unless the
> > value is negative. If that fails read-locking is aborted and mmap_lock
> > is used instead.
> > When writer takes write lock, it resets atomic value to -1 if the
> > current value is 0 (no readers). Since all writers take mmap_lock in
> > write mode first, there can be only one writer at a time. If there
> > are readers, writer will place itself into a wait queue using new
> > mm_struct.vma_writer_wait waitqueue head. The last reader to release
> > the vma_lock will signal the writer to wake up.
> > vm_lock_seq is also moved into vma_lock and along with atomic_t they
> > are nicely packed and consume 8 bytes, bringing the overhead from
> > vma_lock from 44 to 16 bytes:
> >
> >     slabinfo before the changes:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >     vm_area_struct    ...    152   53    2 : ...
> >
> >     slabinfo with vma_lock:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >     rw_semaphore      ...      8  512    1 : ...
>
> I guess the cache is called vma_lock, not rw_semaphore?

Yes, sorry. Copy/paste error when combining the results. The numbers
though look correct, so I did not screw up that part :)

>
> >     vm_area_struct    ...    160   51    2 : ...
> >
> > Assuming 40000 vm_area_structs, memory consumption would be:
> > baseline: 6040kB
> > vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> > Total increase: 556kB
> >
> > atomic_t might overflow if there are many competing readers, therefore
> > vma_read_trylock() implements an overflow check and if that occurs it
> > restors the previous value and exits with a failure to lock.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> This patch is indeed an interesting addition indeed, but I can't help but
> think it obsoletes the previous one :) We allocate an extra 8 bytes slab
> object for the lock, and the pointer to it is also 8 bytes, and requires an
> indirection. The vma_lock cache is not cacheline aligned (otherwise it would
> be a major waste), so we have potential false sharing with up to 7 other
> vma_lock's.

True, I thought long and hard about combining the last two patches but
decided to keep them separate to document the intent. The previous
patch splits the lock for performance reasons and this one is focused
on memory consumption. I'm open to changing this if it's confusing.

> I'd expect if the vma_lock was placed with the relatively cold fields of
> vm_area_struct, it shouldn't cause much cache ping pong when working with
> that vma. Even if we don't cache align the vma to save memory (would be 192
> bytes instead of 160 when aligned) and place the vma_lock and the cold
> fields at the end of the vma, it may be false sharing the cacheline with the
> next vma in the slab.

I would love to combine the vma_lock with vm_area_struct and I spent
several days trying different combinations to achieve decent
performance. My best achieved result was when I placed the vm_lock
into the third cache line at offset 192 and allocated vm_area_structs
from cache-aligned slab (horrible memory waste with each vma consuming
256 bytes). Even then I see regression in pft-threads test on a NUMA
machine (where cache-bouncing problem is most pronounced):

This is the result with split vma locks (current version). The higher
number the better:

BASE                                PVL
Hmean     faults/sec-1    469201.7282 (   0.00%)   464453.3976 *  -1.01%*
Hmean     faults/sec-4   1754465.6221 (   0.00%)  1660688.0452 *  -5.35%*
Hmean     faults/sec-7   2808141.6711 (   0.00%)  2688910.6458 *  -4.25%*
Hmean     faults/sec-12  3750307.7553 (   0.00%)  3863490.2057 *   3.02%*
Hmean     faults/sec-21  4145672.4677 (   0.00%)  3904532.7241 *  -5.82%*
Hmean     faults/sec-30  3775722.5726 (   0.00%)  3923225.3734 *   3.91%*
Hmean     faults/sec-48  4152563.5864 (   0.00%)  4783720.6811 *  15.20%*
Hmean     faults/sec-56  4163868.7111 (   0.00%)  4851473.7241 *  16.51%*

Here are results with the vma locks integrated into cache-aligned
vm_area_struct:

BASE               PVM_MERGED
Hmean     faults/sec-1    469201.7282 (   0.00%)   465268.1274 *  -0.84%*
Hmean     faults/sec-4   1754465.6221 (   0.00%)  1658538.0217 *  -5.47%*
Hmean     faults/sec-7   2808141.6711 (   0.00%)  2645016.1598 *  -5.81%*
Hmean     faults/sec-12  3750307.7553 (   0.00%)  3664676.6956 *  -2.28%*
Hmean     faults/sec-21  4145672.4677 (   0.00%)  3722203.7950 * -10.21%*
Hmean     faults/sec-30  3775722.5726 (   0.00%)  3821025.6963 *   1.20%*
Hmean     faults/sec-48  4152563.5864 (   0.00%)  4561016.1604 *   9.84%*
Hmean     faults/sec-56  4163868.7111 (   0.00%)  4528123.3737 *   8.75%*

These two compare with the same baseline test results, I just
separated the result into two to have readable email formatting.
It's also hard to find 56 bytes worth of fields in vm_area_struct
which are not used during page faults. So, in the end I decided to
keep vma_locks separate to preserve performance. If you have an idea
on how we can combine vm_area_struct fields in a better way, I would
love to try it out.

> But that's a single vma, not up to 7, so it shouldn't be worse?

Yes, I expected that too but mmtests show very small improvement when
I cache-align vma_locks slab. My spf_test does show about 10%
regression due to vma_lock cache-line bouncing, however considering
that it also shows 90% improvement over baseline, losing 10% of that
improvement to save 56 bytes per vma sounds like a good deal.
I think the lack of considerable regression here is due to the vma_lock
being touched only twice in the page fault path - when we take it and
when we release it - while vm_area_struct fields are used much more
heavily. So, invalidating the vma_lock cache line does not hit us as hard
as invalidating a part of vm_area_struct.

Looking forward to suggestions and thanks for the review, Vlastimil!





^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-09 20:53 ` [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
@ 2023-01-11  0:13   ` Davidlohr Bueso
  2023-01-11  0:44     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Davidlohr Bueso @ 2023-01-11  0:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:

>This configuration variable will be used to build the support for VMA
>locking during page fault handling.
>
>This is enabled by default on supported architectures with SMP and MMU
>set.
>
>The architecture support is needed since the page fault handler is called
>from the architecture's page faulting code which needs modifications to
>handle faults under VMA lock.

I don't think that per-vma locking should be something that is user-configurable.
It should just be dependent on the arch. So maybe just remove CONFIG_PER_VMA_LOCK?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11  0:13   ` Davidlohr Bueso
@ 2023-01-11  0:44     ` Suren Baghdasaryan
  2023-01-11  8:23       ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11  0:44 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
>
> >This configuration variable will be used to build the support for VMA
> >locking during page fault handling.
> >
> >This is enabled by default on supported architectures with SMP and MMU
> >set.
> >
> >The architecture support is needed since the page fault handler is called
> >from the architecture's page faulting code which needs modifications to
> >handle faults under VMA lock.
>
> I don't think that per-vma locking should be something that is user-configurable.
> It should just be depdendant on the arch. So maybe just remove CONFIG_PER_VMA_LOCK?

Thanks for the suggestion! I would be happy to make that change if
there are no objections. I think the only pushback might have been the
vma size increase but with the latest optimization in the last patch
maybe that's less of an issue?

>
> Thanks,
> Davidlohr
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11  0:44     ` Suren Baghdasaryan
@ 2023-01-11  8:23       ` Michal Hocko
  2023-01-11  9:54         ` Ingo Molnar
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-11  8:23 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> >
> > >This configuration variable will be used to build the support for VMA
> > >locking during page fault handling.
> > >
> > >This is enabled by default on supported architectures with SMP and MMU
> > >set.
> > >
> > >The architecture support is needed since the page fault handler is called
> > >from the architecture's page faulting code which needs modifications to
> > >handle faults under VMA lock.
> >
> > I don't think that per-vma locking should be something that is user-configurable.
> > It should just be depdendant on the arch. So maybe just remove CONFIG_PER_VMA_LOCK?
> 
> Thanks for the suggestion! I would be happy to make that change if
> there are no objections. I think the only pushback might have been the
> vma size increase but with the latest optimization in the last patch
> maybe that's less of an issue?

Has vma size ever been a real problem? Sure, there might be a lot of
them, but your patch increases it by an rwsem (without the last patch),
which is something like 40B on top of a 136B vma, so we are talking about
~176B in total, which even with wild mapcount limits shouldn't really be
prohibitive. With the default map count limit we are talking about a 2M
increase at most (per address space).
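
(For concreteness - a back-of-the-envelope figure, not from the thread:
the default map count limit is 65530, so 65530 * 40B ≈ 2.6MB worst case
per address space.)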

Or are you aware of any specific usecases where vma size is a real
problem?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11  8:23       ` Michal Hocko
@ 2023-01-11  9:54         ` Ingo Molnar
       [not found]           ` <6be809f5554a4faaa22c287ba4224bd0@AcuMS.aculab.com>
  0 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2023-01-11  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, willy, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team


* Michal Hocko <mhocko@suse.com> wrote:

> On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> > On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> > >
> > > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> > >
> > > >This configuration variable will be used to build the support for VMA
> > > >locking during page fault handling.
> > > >
> > > >This is enabled by default on supported architectures with SMP and MMU
> > > >set.
> > > >
> > > >The architecture support is needed since the page fault handler is called
> > > >from the architecture's page faulting code which needs modifications to
> > > >handle faults under VMA lock.
> > >
> > > I don't think that per-vma locking should be something that is user-configurable.
> > > It should just be dependent on the arch. So maybe just remove CONFIG_PER_VMA_LOCK?
> > 
> > Thanks for the suggestion! I would be happy to make that change if
> > there are no objections. I think the only pushback might have been the
> > vma size increase but with the latest optimization in the last patch
> > maybe that's less of an issue?
> 
> Has vma size ever been a real problem? Sure there might be a lot of those 
> but your patch increases it by rwsem (without the last patch) which is 
> something like 40B on top of 136B vma so we are talking about 400B in 
> total which even with wild mapcount limits shouldn't really be 
> prohibitive. With a default map count limit we are talking about 2M 
> increase at most (per address space).
> 
> Or are you aware of any specific usecases where vma size is a real 
> problem?

40 bytes for the rwsem, plus the patch also adds a 32-bit sequence counter:

  + int vm_lock_seq;
  + struct rw_semaphore lock;

So it's +44 bytes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-09 20:53 ` [PATCH 13/41] mm: introduce vma->vm_flags modifier functions Suren Baghdasaryan
@ 2023-01-11 15:47   ` Davidlohr Bueso
  2023-01-11 17:36     ` Suren Baghdasaryan
  2023-01-17 15:09   ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Davidlohr Bueso @ 2023-01-11 15:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:

>To keep vma locking correctness when vm_flags are modified, add modifier
>functions to be used whenever flags are updated.

How about moving this patch and the ones that follow out of this series,
into a preliminary patchset? It would reduce the amount of noise in the
per-vma lock changes, which would then only be adding the needed
vma_write_lock()ing.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
       [not found]           ` <6be809f5554a4faaa22c287ba4224bd0@AcuMS.aculab.com>
@ 2023-01-11 16:28             ` Suren Baghdasaryan
  2023-01-11 16:44               ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 16:28 UTC (permalink / raw)
  To: David Laight
  Cc: Ingo Molnar, Michal Hocko, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed, Jan 11, 2023 at 2:03 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Ingo Molnar
> > Sent: 11 January 2023 09:54
> >
> > * Michal Hocko <mhocko@suse.com> wrote:
> >
> > > On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> > > > >
> > > > > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> > > > >
> > > > > >This configuration variable will be used to build the support for VMA
> > > > > >locking during page fault handling.
> > > > > >
> > > > > >This is enabled by default on supported architectures with SMP and MMU
> > > > > >set.
> > > > > >
> > > > > >The architecture support is needed since the page fault handler is called
> > > > > >from the architecture's page faulting code which needs modifications to
> > > > > >handle faults under VMA lock.
> > > > >
> > > > > I don't think that per-vma locking should be something that is user-configurable.
> > > > > It should just be dependent on the arch. So maybe just remove CONFIG_PER_VMA_LOCK?
> > > >
> > > > Thanks for the suggestion! I would be happy to make that change if
> > > > there are no objections. I think the only pushback might have been the
> > > > vma size increase but with the latest optimization in the last patch
> > > > maybe that's less of an issue?
> > >
> > > Has vma size ever been a real problem? Sure there might be a lot of those
> > > but your patch increases it by rwsem (without the last patch) which is
> > > something like 40B on top of 136B vma so we are talking about 400B in
> > > total which even with wild mapcount limits shouldn't really be
> > > prohibitive. With a default map count limit we are talking about 2M
> > > increase at most (per address space).
> > >
> > > Or are you aware of any specific usecases where vma size is a real
> > > problem?

Well, when fixing the cacheline bouncing problem in the initial design
I was adding 44 bytes to the 152-byte vm_area_struct (CONFIG_NUMA
enabled), pushing it just above 192 bytes while allocating these
structures from a cache-aligned slab (keeping the lock in a separate
cacheline to prevent cacheline bouncing). That would use the whole 256
bytes per VMA and it did make me nervous. The current design, with no
need to cache-align vm_area_structs and with the 44-byte overhead
trimmed down to 16 bytes, seems much more palatable.

> >
> > 40 bytes for the rwsem, plus the patch also adds a 32-bit sequence counter:
> >
> >   + int vm_lock_seq;
> >   + struct rw_semaphore lock;
> >
> > So it's +44 bytes.

Correct.

>
> Depending on whether vm_lock_seq goes into a padding hole or not,
> it will be 40 or 48 bytes.
>
> But if these structures are allocated individually (not an array)
> then it depends on how many items kmalloc() fits into a page (or 2, 4).

Yep. Depends on how we arrange the fields.
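
To illustrate the padding-hole point with a toy example (these are not
the real vm_area_struct fields; sizes assume a typical LP64 ABI):

#include <stdio.h>

/* Whether the 4-byte counter lands in an existing padding hole decides
 * whether the struct grows at all. */
struct base       { long a; long b; };                  /* 16 bytes           */
struct base_plus  { long a; long b; int seq; };         /* 24 bytes: +8       */
struct holey      { long a; int c; long b; };           /* 24 bytes, 4B hole  */
struct holey_plus { long a; int c; int seq; long b; };  /* 24 bytes: +0       */

int main(void)
{
        printf("no hole:   %zu -> %zu bytes\n",
               sizeof(struct base), sizeof(struct base_plus));
        printf("with hole: %zu -> %zu bytes\n",
               sizeof(struct holey), sizeof(struct holey_plus));
        return 0;
}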

Anyhow. Sounds like the overhead of the current design is small enough
to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
support?
Thanks,
Suren.

>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 16:28             ` Suren Baghdasaryan
@ 2023-01-11 16:44               ` Michal Hocko
  2023-01-11 17:04                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-11 16:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
[...]
> Anyhow. Sounds like the overhead of the current design is small enough
> to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> support?

Yes. Further optimizations can be done on top. Let's not over optimize
at this stage.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 16:44               ` Michal Hocko
@ 2023-01-11 17:04                 ` Suren Baghdasaryan
  2023-01-11 17:37                   ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 17:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> [...]
> > Anyhow. Sounds like the overhead of the current design is small enough
> > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > support?
>
> Yes. Further optimizations can be done on top. Let's not over optimize
> at this stage.

Sure, I won't optimize any further.
Just to expand on your question. Original design would be problematic
for embedded systems like Android. It notoriously has a high number of
VMAs due to anonymous VMAs being named, which prevents them from
merging. 2M per process increase would raise questions, therefore I
felt the need for optimizing the memory overhead which is done in the
last patch.
Thanks for the feedback!

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-11 15:47   ` Davidlohr Bueso
@ 2023-01-11 17:36     ` Suren Baghdasaryan
  2023-01-11 19:52       ` Davidlohr Bueso
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 17:36 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
>
> >To keep vma locking correctness when vm_flags are modified, add modifier
> >functions to be used whenever flags are updated.
>
> How about moving this patch and the ones that follow out of this series,
> into a preliminary patchset? It would reduce the amount of noise in the
> per-vma lock changes, which would then only be adding the needed
> vma_write_lock()ing.

How about moving those prerequisite patches to the beginning of the
patchset (before maple_tree RCU changes)? I feel like they do belong
in the patchset because as a standalone patchset it would be unclear
why I'm adding all these accessor functions and introducing this
churn. Would that be acceptable?

>
> Thanks,
> Davidlohr

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 17:04                 ` Suren Baghdasaryan
@ 2023-01-11 17:37                   ` Michal Hocko
  2023-01-11 17:49                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-11 17:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > [...]
> > > Anyhow. Sounds like the overhead of the current design is small enough
> > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > support?
> >
> > Yes. Further optimizations can be done on top. Let's not over optimize
> > at this stage.
> 
> Sure, I won't optimize any further.
> Just to expand on your question. Original design would be problematic
> for embedded systems like Android. It notoriously has a high number of
> VMAs due to anonymous VMAs being named, which prevents them from
> merging.

What is the usual number of VMAs in that environment?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 17:37                   ` Michal Hocko
@ 2023-01-11 17:49                     ` Suren Baghdasaryan
  2023-01-11 18:02                       ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 17:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > [...]
> > > > Anyhow. Sounds like the overhead of the current design is small enough
> > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > support?
> > >
> > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > at this stage.
> >
> > Sure, I won't optimize any further.
> > Just to expand on your question. Original design would be problematic
> > for embedded systems like Android. It notoriously has a high number of
> > VMAs due to anonymous VMAs being named, which prevents them from
> > merging.
>
> What is the usual number of VMAs in that environment?

I've seen some games which had over 4000 VMAs, but that's on the upper
side. In my calculations I used 40000 VMAs as a ballpark number, and a
rough estimate showed that before the size optimization memory
consumption would increase by ~2M (depending on the lock placement in
vm_area_struct it would vary a bit). In Android, the performance team
flags any change that exceeds 500KB, so it would raise questions.
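
For reference, the arithmetic behind that ballpark; 44 bytes is the
pre-optimization per-VMA overhead discussed earlier, and the exact total
shifts with field placement and slab rounding:

#include <stdio.h>

int main(void)
{
        unsigned long nr_vmas = 40000;  /* ballpark number used above                 */
        unsigned long per_vma = 44;     /* rwsem + vm_lock_seq, before the last patch */

        printf("extra memory per process: ~%.2f MB (flagging threshold: ~0.5 MB)\n",
               (double)(nr_vmas * per_vma) / (1024 * 1024));
        return 0;
}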

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 17:49                     ` Suren Baghdasaryan
@ 2023-01-11 18:02                       ` Michal Hocko
  2023-01-11 18:09                         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-11 18:02 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed 11-01-23 09:49:08, Suren Baghdasaryan wrote:
> On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Anyhow. Sounds like the overhead of the current design is small enough
> > > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > > support?
> > > >
> > > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > > at this stage.
> > >
> > > Sure, I won't optimize any further.
> > > Just to expand on your question. Original design would be problematic
> > > for embedded systems like Android. It notoriously has a high number of
> > > VMAs due to anonymous VMAs being named, which prevents them from
> > > merging.
> >
> > What is the usual number of VMAs in that environment?
> 
> I've seen some games which had over 4000 VMAs but that's on the upper
> side. In my calculations I used 40000 VMAs as a ballpark number and
> rough calculations before size optimization would increase memory
> consumption by ~2M (depending on the lock placement in vm_area_struct
> it would vary a bit). In Android, the performance team flags any
> change that exceeds 500KB, so it would raise questions.

Thanks, that is useful information! This is just slightly off-topic,
but I am wondering how much memory those vma names consume. Are there
that many unique names, or do they just happen to alternate so that
neighboring ones tend to be different?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
  2023-01-11 18:02                       ` Michal Hocko
@ 2023-01-11 18:09                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Laight, Ingo Molnar, michel, joelaf, songliubraving,
	leewalsh, david, peterz, bigeasy, peterx, dhowells, linux-mm,
	edumazet, jglisse, punit.agrawal, arjunroy, minchan, x86, hughd,
	willy, gurua, laurent.dufour, linux-arm-kernel, rientjes,
	axelrasmussen, kernel-team, soheil, paulmck, jannh, liam.howlett,
	shakeelb, luto, gthelen, ldufour, vbabka, posk, lstoakes,
	peterjung1337, linuxppc-dev, kent.overstreet, hughlynch,
	linux-kernel, hannes, akpm, tatashin

On Wed, Jan 11, 2023 at 10:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 11-01-23 09:49:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Anyhow. Sounds like the overhead of the current design is small enough
> > > > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > > > support?
> > > > >
> > > > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > > > at this stage.
> > > >
> > > > Sure, I won't optimize any further.
> > > > Just to expand on your question. Original design would be problematic
> > > > for embedded systems like Android. It notoriously has a high number of
> > > > VMAs due to anonymous VMAs being named, which prevents them from
> > > > merging.
> > >
> > > What is the usual number of VMAs in that environment?
> >
> > I've seen some games which had over 4000 VMAs, but that's on the upper
> > side. In my calculations I used 40000 VMAs as a ballpark number, and a
> > rough estimate showed that before the size optimization memory
> > consumption would increase by ~2M (depending on the lock placement in
> > vm_area_struct it would vary a bit). In Android, the performance team
> > flags any change that exceeds 500KB, so it would raise questions.
>
> Thanks, that is useful information! This is just slightly off-topic,
> but I am wondering how much memory those vma names consume. Are there
> that many unique names, or do they just happen to alternate so that
> neighboring ones tend to be different?

Good question. I don't have a ready answer to that but will try to
collect some stats. I know that many names are standardized but I
haven't looked at how they are distributed in the address space. Will
follow up once I collect the data.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-11 17:36     ` Suren Baghdasaryan
@ 2023-01-11 19:52       ` Davidlohr Bueso
  2023-01-11 21:23         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Davidlohr Bueso @ 2023-01-11 19:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, 11 Jan 2023, Suren Baghdasaryan wrote:

>On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
>>
>> >To keep vma locking correctness when vm_flags are modified, add modifier
>> >functions to be used whenever flags are updated.
>>
>> How about moving this patch and the ones that follow out of this series,
>> into a preliminary patchset? It would reduce the amount of noise in the
>> per-vma lock changes, which would then only be adding the needed
>> vma_write_lock()ing.
>
>How about moving those prerequisite patches to the beginning of the
>patchset (before maple_tree RCU changes)? I feel like they do belong
>in the patchset because as a standalone patchset it would be unclear
>why I'm adding all these accessor functions and introducing this
>churn. Would that be acceptable?

imo the abstraction of vm_flags handling is worth being standalone and is
easier to pick up before a more complex locking scheme change. But
either way, it's up to you.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-11 19:52       ` Davidlohr Bueso
@ 2023-01-11 21:23         ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-11 21:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Wed, Jan 11, 2023 at 12:19 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Wed, 11 Jan 2023, Suren Baghdasaryan wrote:
>
> >On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
> >>
> >> On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> >>
> >> >To keep vma locking correctness when vm_flags are modified, add modifier
> >> >functions to be used whenever flags are updated.
> >>
> >> How about moving this patch and the ones that follow out of this series,
> >> into a preliminary patchset? It would reduce the amount of noise in the
> >> per-vma lock changes, which would then only be adding the needed
> >> vma_write_lock()ing.
> >
> >How about moving those prerequisite patches to the beginning of the
> >patchset (before maple_tree RCU changes)? I feel like they do belong
> >in the patchset because as a standalone patchset it would be unclear
> >why I'm adding all these accessor functions and introducing this
> >churn. Would that be acceptable?
>
> imo the abstraction of vm_flags handling is worth being standalone and is
> easier to pick up before a more complex locking scheme change. But
> either way, it's up to you.

I see your point. Ok, if you think it makes sense as a stand-alone
patch I can post it separately in the next version.
Thanks,
Suren.

>
> Thanks,
> Davidlohr

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
  2023-01-10  8:04   ` Vlastimil Babka
@ 2023-01-16 11:14   ` Hyeonggon Yoo
  2023-01-16 22:36     ` Suren Baghdasaryan
  2023-01-17  4:14     ` Matthew Wilcox
       [not found]   ` <20230116140649.2012-1-hdanton@sina.com>
  2023-01-17 18:11   ` Jann Horn
  3 siblings, 2 replies; 186+ messages in thread
From: Hyeonggon Yoo @ 2023-01-16 11:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d40bf8a5e19e..294dd44b2198 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
>  	 * mm->mm_lock_seq can't be concurrently modified.
>  	 */
>  	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> -	if (vma->vm_lock_seq == mm_lock_seq)
> +	if (vma->vm_lock->lock_seq == mm_lock_seq)
>  		return;
>  
> -	down_write(&vma->vm_lock->lock);
> -	vma->vm_lock_seq = mm_lock_seq;
> -	up_write(&vma->vm_lock->lock);
> +	if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> +		wait_event(vma->vm_mm->vma_writer_wait,
> +			   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> +	vma->vm_lock->lock_seq = mm_lock_seq;
> +	/* Write barrier to ensure lock_seq change is visible before count */
> +	smp_wmb();
> +	atomic_set(&vma->vm_lock->count, 0);
>  }
>  
>  /*
> @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
>  static inline bool vma_read_trylock(struct vm_area_struct *vma)
>  {
>  	/* Check before locking. A race might cause false locked result. */
> -	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> +	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
>  		return false;
>  
> -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> +	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
>  		return false;
>  
> +	/* If atomic_t overflows, restore and fail to lock. */
> +	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> +		if (atomic_dec_and_test(&vma->vm_lock->count))
> +			wake_up(&vma->vm_mm->vma_writer_wait);
> +		return false;
> +	}
> +
>  	/*
>  	 * Overflow might produce false locked result.
>  	 * False unlocked result is impossible because we modify and check
>  	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>  	 * modification invalidates all existing locks.
>  	 */
> -	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> -		up_read(&vma->vm_lock->lock);
> +	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> +		if (atomic_dec_and_test(&vma->vm_lock->count))
> +			wake_up(&vma->vm_mm->vma_writer_wait);
>  		return false;
>  	}

With this change readers can cause writers to starve.
What about checking waitqueue_active() before or after increasing
vma->vm_lock->count?
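
One way the suggested check could look; this is a sketch of the idea,
not code from the series, and it reuses the vma_writer_wait waitqueue
introduced by this patch:

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
        /* Check before locking. A race might cause false locked result. */
        if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
                return false;

        /* Don't pile more readers on top of a queued writer; let the
         * fault fall back to mmap_lock instead of starving the writer. */
        if (waitqueue_active(&vma->vm_mm->vma_writer_wait))
                return false;

        if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
                return false;

        /* ... overflow and lock_seq re-checks as in the patch ... */
        return true;
}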

--
Thanks,
Hyeonggon

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-16 11:14   ` Hyeonggon Yoo
@ 2023-01-16 22:36     ` Suren Baghdasaryan
  2023-01-17  4:14     ` Matthew Wilcox
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-16 22:36 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 3:15 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index d40bf8a5e19e..294dd44b2198 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >        * mm->mm_lock_seq can't be concurrently modified.
> >        */
> >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > -     if (vma->vm_lock_seq == mm_lock_seq)
> > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> >               return;
> >
> > -     down_write(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = mm_lock_seq;
> > -     up_write(&vma->vm_lock->lock);
> > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > +             wait_event(vma->vm_mm->vma_writer_wait,
> > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > +     /* Write barrier to ensure lock_seq change is visible before count */
> > +     smp_wmb();
> > +     atomic_set(&vma->vm_lock->count, 0);
> >  }
> >
> >  /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >       /* Check before locking. A race might cause false locked result. */
> > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >               return false;
> >
> > +     /* If atomic_t overflows, restore and fail to lock. */
> > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > +             return false;
> > +     }
> > +
> >       /*
> >        * Overflow might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        */
> > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> >               return false;
> >       }
>
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?

The readers are in the page fault path, which is the fast path, while
writers performing updates are in the slow path. Therefore I *think*
starving writers should not be a big issue. So far in benchmarks I
haven't seen issues with that, but maybe there is such a case?

>
> --
> Thanks,
> Hyeonggon

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
       [not found]   ` <20230116140649.2012-1-hdanton@sina.com>
@ 2023-01-16 23:08     ` Suren Baghdasaryan
  2023-01-16 23:11       ` Suren Baghdasaryan
       [not found]       ` <20230117031632.2321-1-hdanton@sina.com>
  0 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Hillf Danton
  Cc: vbabka, hannes, mgorman, peterz, hughd, linux-kernel, linux-mm

On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >        * mm->mm_lock_seq can't be concurrently modified.
> >        */
> >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > -     if (vma->vm_lock_seq == mm_lock_seq)
> > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> >               return;
>
>         lock acquire for write to inform lockdep.

Thanks for the review Hillf!

Good idea. Will add in the next version.

> >
> > -     down_write(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = mm_lock_seq;
> > -     up_write(&vma->vm_lock->lock);
> > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > +             wait_event(vma->vm_mm->vma_writer_wait,
> > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > +     /* Write barrier to ensure lock_seq change is visible before count */
> > +     smp_wmb();
> > +     atomic_set(&vma->vm_lock->count, 0);
> >  }
> >
> >  /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >       /* Check before locking. A race might cause false locked result. */
> > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >               return false;
>
> Add mb to pair with the above wmb like

The wmb above is to ensure the ordering between updates of lock_seq
and vm_lock->count (lock_seq is updated first and vm_lock->count only
after that). The first access to vm_lock->count in this function is
atomic_inc_unless_negative() and it's an atomic RMW operation with a
return value. According to documentation such functions are fully
ordered, therefore I think we already have an implicit full memory
barrier between reads of lock_seq and vm_lock->count here. Am I wrong?

>
>         if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
>                 smp_acquire__after_ctrl_dep();
>                 return false;
>         }
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >               return false;
> >
> > +     /* If atomic_t overflows, restore and fail to lock. */
> > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > +             return false;
> > +     }
> > +
> >       /*
> >        * Overflow might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        */
> > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> >               return false;
> >       }
>
> Simpler way to detect write lock owner and count overflow like
>
>         int count = atomic_read(&vma->vm_lock->count);
>         for (;;) {
>                 int new = count + 1;
>
>                 if (count < 0 || new < 0)
>                         return false;
>
>                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                 if (new == count)
>                         break;
>                 count = new;
>                 cpu_relax();
>         }
>
>         (wake up waiting readers after taking the lock;
>         but the write lock owner before this read trylock should be
>         responsible for waking waiters up.)
>
>         lock acquire for read.

This schema might cause readers to wait, which is not an exact
replacement for down_read_trylock(). The requirement to wake up
waiting readers also complicates things and since we can always fall
back to mmap_lock, that complication is unnecessary IMHO. I could use
part of your suggestion like this:

                 int new = count + 1;

                 if (count < 0 || new < 0)
                         return false;

                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
                 if (new == count)
                         return false;

Compared to doing atomic_inc_unless_negative() first, like I did
originally, this schema opens a bit wider window for the writer to get
in the middle and cause the reader to fail locking but I don't think
it would result in any visible regression.

>
> >       return true;
> > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
>         lock release for read.

Ack.

>
> > -     up_read(&vma->vm_lock->lock);
> > +     if (atomic_dec_and_test(&vma->vm_lock->count))
> > +             wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-16 23:08     ` Suren Baghdasaryan
@ 2023-01-16 23:11       ` Suren Baghdasaryan
       [not found]       ` <20230117031632.2321-1-hdanton@sina.com>
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: vbabka, hannes, mgorman, peterz, hughd, linux-kernel, linux-mm

On Mon, Jan 16, 2023 at 3:08 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >        * mm->mm_lock_seq can't be concurrently modified.
> > >        */
> > >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > > -     if (vma->vm_lock_seq == mm_lock_seq)
> > > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> > >               return;
> >
> >         lock acquire for write to inform lockdep.
>
> Thanks for the review Hillf!
>
> Good idea. Will add in the next version.
>
> > >
> > > -     down_write(&vma->vm_lock->lock);
> > > -     vma->vm_lock_seq = mm_lock_seq;
> > > -     up_write(&vma->vm_lock->lock);
> > > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > > +             wait_event(vma->vm_mm->vma_writer_wait,
> > > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > > +     /* Write barrier to ensure lock_seq change is visible before count */
> > > +     smp_wmb();
> > > +     atomic_set(&vma->vm_lock->count, 0);
> > >  }
> > >
> > >  /*
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >  {
> > >       /* Check before locking. A race might cause false locked result. */
> > > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > >               return false;
> >
> > Add mb to pair with the above wmb like
>
> The wmb above is to ensure the ordering between updates of lock_seq
> and vm_lock->count (lock_seq is updated first and vm_lock->count only
> after that). The first access to vm_lock->count in this function is
> atomic_inc_unless_negative() and it's an atomic RMW operation with a
> return value. According to documentation such functions are fully
> ordered, therefore I think we already have an implicit full memory
> barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> >
> >         if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> >                 smp_acquire__after_ctrl_dep();
> >                 return false;
> >         }
> > >
> > > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > >               return false;
> > >
> > > +     /* If atomic_t overflows, restore and fail to lock. */
> > > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > > +             return false;
> > > +     }
> > > +
> > >       /*
> > >        * Overflow might produce false locked result.
> > >        * False unlocked result is impossible because we modify and check
> > >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > >        * modification invalidates all existing locks.
> > >        */
> > > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > -             up_read(&vma->vm_lock->lock);
> > > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > >               return false;
> > >       }
> >
> > Simpler way to detect write lock owner and count overflow like
> >
> >         int count = atomic_read(&vma->vm_lock->count);
> >         for (;;) {
> >                 int new = count + 1;
> >
> >                 if (count < 0 || new < 0)
> >                         return false;
> >
> >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                 if (new == count)
> >                         break;
> >                 count = new;
> >                 cpu_relax();
> >         }
> >
> >         (wake up waiting readers after taking the lock;
> >         but the write lock owner before this read trylock should be
> >         responsible for waking waiters up.)
> >
> >         lock acquire for read.
>
> This schema might cause readers to wait, which is not an exact
> replacement for down_read_trylock(). The requirement to wake up
> waiting readers also complicates things and since we can always fall
> back to mmap_lock, that complication is unnecessary IMHO. I could use
> part of your suggestion like this:
>
>                  int new = count + 1;
>
>                  if (count < 0 || new < 0)
>                          return false;
>
>                  new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                  if (new == count)
>                          return false;

Made a mistake above. It should have been:
                  if (new != count)
                          return false;
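
Folding that correction in, the cmpxchg-based fast path discussed in
this subthread would look roughly as follows (a sketch of the idea, not
the code that was actually posted):

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
        int count, new;

        /* Check before locking. A race might cause false locked result. */
        if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
                return false;

        count = atomic_read(&vma->vm_lock->count);
        new = count + 1;
        /* Write-locked (count < 0) or the counter would overflow: fail. */
        if (count < 0 || new < 0)
                return false;

        /* Lost a race with a writer or another reader: fail rather than
         * retry, the caller falls back to mmap_lock anyway. */
        if (atomic_cmpxchg(&vma->vm_lock->count, count, new) != count)
                return false;

        /* ... lock_seq re-check and unlock-on-mismatch as in the patch ... */
        return true;
}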


>
> Compared to doing atomic_inc_unless_negative() first, like I did
> originally, this schema opens a bit wider window for the writer to get
> in the middle and cause the reader to fail locking but I don't think
> it would result in any visible regression.
>
> >
> > >       return true;
> > > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >
> > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > >  {
> >         lock release for read.
>
> Ack.
>
> >
> > > -     up_read(&vma->vm_lock->lock);
> > > +     if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +             wake_up(&vma->vm_mm->vma_writer_wait);
> > >  }
> >

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-16 11:14   ` Hyeonggon Yoo
  2023-01-16 22:36     ` Suren Baghdasaryan
@ 2023-01-17  4:14     ` Matthew Wilcox
  2023-01-17  4:34       ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17  4:14 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >  	/* Check before locking. A race might cause false locked result. */
> > -	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >  		return false;
> >  
> > -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >  		return false;
> >  
> > +	/* If atomic_t overflows, restore and fail to lock. */
> > +	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +		if (atomic_dec_and_test(&vma->vm_lock->count))
> > +			wake_up(&vma->vm_mm->vma_writer_wait);
> > +		return false;
> > +	}
> > +
> >  	/*
> >  	 * Overflow might produce false locked result.
> >  	 * False unlocked result is impossible because we modify and check
> >  	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >  	 * modification invalidates all existing locks.
> >  	 */
> > -	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -		up_read(&vma->vm_lock->lock);
> > +	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +		if (atomic_dec_and_test(&vma->vm_lock->count))
> > +			wake_up(&vma->vm_mm->vma_writer_wait);
> >  		return false;
> >  	}
> 
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?

I don't understand how readers can starve a writer.  Readers do
atomic_inc_unless_negative() so a writer can always force readers
to fail.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17  4:14     ` Matthew Wilcox
@ 2023-01-17  4:34       ` Suren Baghdasaryan
  2023-01-17  5:46         ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17  4:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hyeonggon Yoo, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >  {
> > >     /* Check before locking. A race might cause false locked result. */
> > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > >             return false;
> > >
> > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > >             return false;
> > >
> > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > +           return false;
> > > +   }
> > > +
> > >     /*
> > >      * Overflow might produce false locked result.
> > >      * False unlocked result is impossible because we modify and check
> > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > >      * modification invalidates all existing locks.
> > >      */
> > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > -           up_read(&vma->vm_lock->lock);
> > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > >             return false;
> > >     }
> >
> > With this change readers can cause writers to starve.
> > What about checking waitqueue_active() before or after increasing
> > vma->vm_lock->count?
>
> I don't understand how readers can starve a writer.  Readers do
> atomic_inc_unless_negative() so a writer can always force readers
> to fail.

I think the point here was that if page faults keep occurring and they
prevent vm_lock->count from reaching 0, then a writer will be blocked
and there is no reader throttling mechanism (no max time that the writer
will be waiting).

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
       [not found]       ` <20230117031632.2321-1-hdanton@sina.com>
@ 2023-01-17  4:52         ` Suren Baghdasaryan
       [not found]           ` <20230117083355.2374-1-hdanton@sina.com>
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17  4:52 UTC (permalink / raw)
  To: Hillf Danton
  Cc: vbabka, hannes, mgorman, peterz, hughd, linux-kernel, linux-mm

On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 16 Jan 2023 15:08:48 -0800 Suren Baghdasaryan <surenb@google.com>
> > On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
> > > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > >  {
> > > >       /* Check before locking. A race might cause false locked result. */
> > > > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > >               return false;
> > >
> > > Add mb to pair with the above wmb like
> >
> > The wmb above is to ensure the ordering between updates of lock_seq
> > and vm_lock->count (lock_seq is updated first and vm_lock->count only
> > after that). The first access to vm_lock->count in this function is
> > atomic_inc_unless_negative() and it's an atomic RMW operation with a
> > return value. According to documentation such functions are fully
> > ordered, therefore I think we already have an implicit full memory
> > barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> No you are not.

Do you mean I'm not wrong, or the other way around? Please expand a bit.

> Revisit it in case of full mb not ensured.
>
> #ifndef arch_atomic_inc_unless_negative
> static __always_inline bool
> arch_atomic_inc_unless_negative(atomic_t *v)
> {
>         int c = arch_atomic_read(v);
>
>         do {
>                 if (unlikely(c < 0))
>                         return false;
>         } while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
>
>         return true;
> }
> #define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
> #endif

I think your point is that the full mb is not ensured here, specifically if c<0?

>
> > >         int count = atomic_read(&vma->vm_lock->count);
> > >         for (;;) {
> > >                 int new = count + 1;
> > >
> > >                 if (count < 0 || new < 0)
> > >                         return false;
> > >
> > >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > >                 if (new == count)
> > >                         break;
> > >                 count = new;
> > >                 cpu_relax();
> > >         }
> > >
> > >         (wake up waiting readers after taking the lock;
> > >         but the write lock owner before this read trylock should be
> > >         responsible for waking waiters up.)
> > >
> > >         lock acquire for read.
> >
> > This schema might cause readers to wait, which is not an exact
>
> Could you specify a bit more on wait?

Oh, I misunderstood your intent with that for() loop. Indeed, if a
writer took the lock, the count will be negative and the loop terminates
with a failure. Yeah, I think that would work.

>
> > replacement for down_read_trylock(). The requirement to wake up
> > waiting readers also complicates things
>
> If the writer lock owner is preempted by a reader while releasing lock,
>
>         set count to zero
>                           <-- preempt
>         wake up waiters
>
> then lock is owned by reader but with read waiters.
>
> That is buggy if write waiter starvation is allowed in this patchset.

I don't quite understand your point here. Readers don't wait, so there
can't be "read waiters". Could you please expand with a race diagram
maybe?

>
> > and since we can always fall
> > back to mmap_lock, that complication is unnecessary IMHO. I could use
> > part of your suggestion like this:
> >
> >                  int new = count + 1;
> >
> >                  if (count < 0 || new < 0)
> >                          return false;
> >
> >                  new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                  if (new == count)
> >                          return false;
> >
> > Compared to doing atomic_inc_unless_negative() first, like I did
> > originally, this schema opens a bit wider window for the writer to get
> > in the middle and cause the reader to fail locking but I don't think
> > it would result in any visible regression.
>
> It is definitely fine for writer to acquire the lock while reader is
> doing trylock, no?

Yes.

>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17  4:34       ` Suren Baghdasaryan
@ 2023-01-17  5:46         ` Matthew Wilcox
  2023-01-17  5:58           ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17  5:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hyeonggon Yoo, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > >  {
> > > >     /* Check before locking. A race might cause false locked result. */
> > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > >             return false;
> > > >
> > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > >             return false;
> > > >
> > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > +           return false;
> > > > +   }
> > > > +
> > > >     /*
> > > >      * Overflow might produce false locked result.
> > > >      * False unlocked result is impossible because we modify and check
> > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > >      * modification invalidates all existing locks.
> > > >      */
> > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > -           up_read(&vma->vm_lock->lock);
> > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > >             return false;
> > > >     }
> > >
> > > With this change readers can cause writers to starve.
> > > What about checking waitqueue_active() before or after increasing
> > > vma->vm_lock->count?
> >
> > I don't understand how readers can starve a writer.  Readers do
> > atomic_inc_unless_negative() so a writer can always force readers
> > to fail.
> 
> > I think the point here was that if page faults keep occurring and they
> prevent vm_lock->count from reaching 0 then a writer will be blocked
> and there is no reader throttling mechanism (no max time that writer
> will be waiting).

Perhaps I misunderstood your description; I thought that a _waiting_
writer would make the count negative, not a successfully acquiring
writer.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17  5:46         ` Matthew Wilcox
@ 2023-01-17  5:58           ` Suren Baghdasaryan
  2023-01-17 18:23             ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17  5:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hyeonggon Yoo, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > >  {
> > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > >             return false;
> > > > >
> > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > >             return false;
> > > > >
> > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > +           return false;
> > > > > +   }
> > > > > +
> > > > >     /*
> > > > >      * Overflow might produce false locked result.
> > > > >      * False unlocked result is impossible because we modify and check
> > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > >      * modification invalidates all existing locks.
> > > > >      */
> > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > -           up_read(&vma->vm_lock->lock);
> > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >             return false;
> > > > >     }
> > > >
> > > > With this change readers can cause writers to starve.
> > > > What about checking waitqueue_active() before or after increasing
> > > > vma->vm_lock->count?
> > >
> > > I don't understand how readers can starve a writer.  Readers do
> > > atomic_inc_unless_negative() so a writer can always force readers
> > > to fail.
> >
> > I think the point here was that if page faults keep occurring and they
> > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > and there is no reader throttling mechanism (no max time that writer
> > will be waiting).
>
> Perhaps I misunderstood your description; I thought that a _waiting_
> writer would make the count negative, not a successfully acquiring
> writer.

A waiting writer does not modify the counter, instead it's placed on
the wait queue and the last reader which sets the count to 0 while
releasing its read lock will wake it up. Once the writer is woken it
will try to set the count to negative and if successful will own the
lock, otherwise it goes back to sleep.
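Expressed as code, the protocol described above is roughly the following.
This is a sketch only; the helper name and the exact negative value used to
mark the write-locked state are made up here and are not taken from the
patchset.

static void vma_write_lock_sketch(struct vm_area_struct *vma)
{
	mmap_assert_write_locked(vma->vm_mm);

	/*
	 * Sleep until the last reader drops the count to 0 and wakes us,
	 * then claim the lock by making the count negative. If another
	 * reader sneaks in first, the cmpxchg fails and we go back to
	 * sleep.
	 */
	wait_event(vma->vm_mm->vma_writer_wait,
		   atomic_cmpxchg(&vma->vm_lock->count, 0, INT_MIN) == 0);
}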

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 09/41] mm: rcu safe VMA freeing
  2023-01-09 20:53 ` [PATCH 09/41] mm: rcu safe VMA freeing Suren Baghdasaryan
@ 2023-01-17 14:25   ` Michal Hocko
  2023-01-18  2:16     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 14:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
[...]
>  void vm_area_free(struct vm_area_struct *vma)
>  {
>  	free_anon_vma_name(vma);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	call_rcu(&vma->vm_rcu, __vm_area_free);
> +#else
>  	kmem_cache_free(vm_area_cachep, vma);
> +#endif

Is it safe to have a vma with an already freed vma_name? I suspect this
is safe because of mmap_lock, but is there any reason to split the
freeing process and have this potential UAF lurking?

>  }
>  
>  static void account_kernel_stack(struct task_struct *tsk, int account)
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs
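For illustration, the alternative hinted at above would be to defer the name
freeing into the RCU callback as well, so no part of the vma is torn down
before the grace period. A rough sketch, assuming the __vm_area_free()
callback added by this patch; this is not the posted patch:

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* Freed together with the vma, after the grace period. */
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_PER_VMA_LOCK
	call_rcu(&vma->vm_rcu, __vm_area_free);
#else
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
#endif
}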

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
@ 2023-01-17 15:04   ` Michal Hocko
  2023-01-17 15:12     ` Michal Hocko
  2023-01-17 21:08     ` Suren Baghdasaryan
  2023-01-17 15:07   ` Michal Hocko
  2023-01-17 18:02   ` Jann Horn
  2 siblings, 2 replies; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:04 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.

I have to say I was struggling a bit with the above and only understood
what you mean by reading the patch several times. I would phrase it like
this (feel free to use if you consider this to be an improvement).

Introduce a per-VMA rw_semaphore. The lock implementation relies on a
per-vma and per-mm sequence counters to note exclusive locking:
        - read lock - (implemented by vma_read_trylock) requires the
          vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
          differ. If they match then there must be a vma exclusive lock
          held somewhere.
        - read unlock - (implemented by vma_read_unlock) is a trivial
          vma->lock unlock.
        - write lock - (vma_write_lock) requires the mmap_lock to be
          held exclusively and the current mm counter is noted to the vma
          side. This will allow multiple vmas to be locked under a single
          mmap_lock write lock (e.g. during vma merging). The vma counter
          is modified under exclusive vma lock.
        - write unlock - (vma_write_unlock_mm) is a batch release of all
          vma locks held. It doesn't pair with a specific
          vma_write_lock! It is done before exclusive mmap_lock is
          released by incrementing mm sequence counter (mm_lock_seq).
	- write downgrade - if the mmap_lock is downgraded to the read
	  lock all vma write locks are released as well (effectively
	  same as write unlock).

> VMA lock is placed on the cache line boundary so that its 'count' field
> falls into the first cache line while the rest of the fields fall into
> the second cache line. This lets the 'count' field be cached with
> other frequently accessed fields and used quickly in uncontended case
> while 'owner' and other fields used in the contended case will not
> invalidate the first cache line while waiting on the lock.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h        | 80 +++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h  |  8 ++++
>  include/linux/mmap_lock.h | 13 +++++++
>  kernel/fork.c             |  4 ++
>  mm/init-mm.c              |  3 ++
>  5 files changed, 108 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..ec2c4c227d51 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -612,6 +612,85 @@ struct vm_operations_struct {
>  					  unsigned long addr);
>  };
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_init_lock(struct vm_area_struct *vma)
> +{
> +	init_rwsem(&vma->lock);
> +	vma->vm_lock_seq = -1;
> +}
> +
> +static inline void vma_write_lock(struct vm_area_struct *vma)
> +{
> +	int mm_lock_seq;
> +
> +	mmap_assert_write_locked(vma->vm_mm);
> +
> +	/*
> +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +	 * mm->mm_lock_seq can't be concurrently modified.
> +	 */
> +	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> +	if (vma->vm_lock_seq == mm_lock_seq)
> +		return;
> +
> +	down_write(&vma->lock);
> +	vma->vm_lock_seq = mm_lock_seq;
> +	up_write(&vma->lock);
> +}
> +
> +/*
> + * Try to read-lock a vma. The function is allowed to occasionally yield false
> + * locked result to avoid performance overhead, in which case we fall back to
> + * using mmap_lock. The function should never yield false unlocked result.
> + */
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +{
> +	/* Check before locking. A race might cause false locked result. */
> +	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> +		return false;
> +
> +	if (unlikely(down_read_trylock(&vma->lock) == 0))
> +		return false;
> +
> +	/*
> +	 * Overflow might produce false locked result.
> +	 * False unlocked result is impossible because we modify and check
> +	 * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> +	 * modification invalidates all existing locks.
> +	 */
> +	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> +		up_read(&vma->lock);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> +	up_read(&vma->lock);
> +}
> +
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> +{
> +	mmap_assert_write_locked(vma->vm_mm);
> +	/*
> +	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +	 * mm->mm_lock_seq can't be concurrently modified.
> +	 */
> +	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +		{ return false; }
> +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
>  	static const struct vm_operations_struct dummy_vm_ops = {};
> @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma->vm_mm = mm;
>  	vma->vm_ops = &dummy_vm_ops;
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +	vma_init_lock(vma);
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d5cdec1314fe..5f7c5ca89931 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -555,6 +555,11 @@ struct vm_area_struct {
>  	pgprot_t vm_page_prot;
>  	unsigned long vm_flags;		/* Flags, see mm.h. */
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +	int vm_lock_seq;
> +	struct rw_semaphore lock;
> +#endif
> +
>  	/*
>  	 * For areas with an address space and backing store,
>  	 * linkage into the address_space->i_mmap interval tree.
> @@ -680,6 +685,9 @@ struct mm_struct {
>  					  * init_mm.mmlist, and are protected
>  					  * by mmlist_lock
>  					  */
> +#ifdef CONFIG_PER_VMA_LOCK
> +		int mm_lock_seq;
> +#endif
>  
>  
>  		unsigned long hiwater_rss; /* High-watermark of RSS usage */
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index e49ba91bb1f0..40facd4c398b 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
>  	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>  }
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> +{
> +	mmap_assert_write_locked(mm);
> +	/* No races during update due to exclusive mmap_lock being held */
> +	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +}
> +#else
> +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> +#endif
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>  	init_rwsem(&mm->mmap_lock);
> @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
>  static inline void mmap_write_unlock(struct mm_struct *mm)
>  {
>  	__mmap_lock_trace_released(mm, true);
> +	vma_write_unlock_mm(mm);
>  	up_write(&mm->mmap_lock);
>  }
>  
>  static inline void mmap_write_downgrade(struct mm_struct *mm)
>  {
>  	__mmap_lock_trace_acquire_returned(mm, false, true);
> +	vma_write_unlock_mm(mm);
>  	downgrade_write(&mm->mmap_lock);
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  		 */
>  		*new = data_race(*orig);
>  		INIT_LIST_HEAD(&new->anon_vma_chain);
> +		vma_init_lock(new);
>  		dup_anon_vma_name(orig, new);
>  	}
>  	return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	seqcount_init(&mm->write_protect_seq);
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
>  	mm->locked_vm = 0;
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index c9327abb771c..33269314e060 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>  	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
> +#ifdef CONFIG_PER_VMA_LOCK
> +	.mm_lock_seq	= 0,
> +#endif
>  	.user_ns	= &init_user_ns,
>  	.cpu_bitmap	= CPU_BITS_NONE,
>  #ifdef CONFIG_IOMMU_SVA
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
  2023-01-17 15:04   ` Michal Hocko
@ 2023-01-17 15:07   ` Michal Hocko
  2023-01-17 21:09     ` Suren Baghdasaryan
  2023-01-17 18:02   ` Jann Horn
  2 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:07 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  		 */
>  		*new = data_race(*orig);
>  		INIT_LIST_HEAD(&new->anon_vma_chain);
> +		vma_init_lock(new);
>  		dup_anon_vma_name(orig, new);
>  	}
>  	return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	seqcount_init(&mm->write_protect_seq);
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> +	WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif

The mm shouldn't be visible so why WRITE_ONCE?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-09 20:53 ` [PATCH 13/41] mm: introduce vma->vm_flags modifier functions Suren Baghdasaryan
  2023-01-11 15:47   ` Davidlohr Bueso
@ 2023-01-17 15:09   ` Michal Hocko
  2023-01-17 15:15     ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> To keep vma locking correctness when vm_flags are modified, add modifier
> functions to be used whenever flags are updated.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h       | 38 ++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h |  8 +++++++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ec2c4c227d51..35cf0a6cbcc2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	vma_init_lock(vma);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline
> +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> +	WRITE_ONCE(vma->vm_flags, flags);
> +}

Why do we need WRITE_ONCE here? Isn't vma invisible during its
initialization?

> +
> +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> +static inline
> +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> +	vma_write_lock(vma);
> +	init_vm_flags(vma, flags);
> +}
> +
> +static inline
> +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> +	vma_write_lock(vma);
> +	vma->vm_flags |= flags;
> +}
> +
> +static inline
> +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> +	vma_write_lock(vma);
> +	vma->vm_flags &= ~flags;
> +}
> +
> +static inline
> +void mod_vm_flags(struct vm_area_struct *vma,
> +		  unsigned long set, unsigned long clear)
> +{
> +	vma_write_lock(vma);
> +	vma->vm_flags |= set;
> +	vma->vm_flags &= ~clear;
> +}
> +

This is a rather unusual pattern. There is no hint about the locking
involved in the naming, and why is the locking part of this interface in
the first place? I can see a reason for access functions to actually
check lock asserts.
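For comparison, an assert-only variant of one of these helpers would look
roughly like this (a sketch only; the caller would then be responsible for
calling vma_write_lock() itself):

static inline
void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
{
	/* Document and check the locking rule instead of taking the lock. */
	vma_assert_write_locked(vma);
	vma->vm_flags |= flags;
}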
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 15:04   ` Michal Hocko
@ 2023-01-17 15:12     ` Michal Hocko
  2023-01-17 21:21       ` Suren Baghdasaryan
  2023-01-17 21:08     ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> 
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
> 
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
>         - read lock - (implemented by vma_read_trylock) requires the the
>           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>           differ. If they match then there must be a vma exclusive lock
>           held somewhere.
>         - read unlock - (implemented by vma_read_unlock) is a trivial
>           vma->lock unlock.
>         - write lock - (vma_write_lock) requires the mmap_lock to be
>           held exclusively and the current mm counter is noted to the vma
>           side. This will allow multiple vmas to be locked under a single
>           mmap_lock write lock (e.g. during vma merging). The vma counter
>           is modified under exclusive vma lock.

Didn't realize one more thing.
	    Unlike a standard write lock, this implementation allows
	    vma_write_lock to be called multiple times under a single
	    mmap_lock. In a sense it is more of a
	    mark_vma_potentially_modified than a lock.

>         - write unlock - (vma_write_unlock_mm) is a batch release of all
>           vma locks held. It doesn't pair with a specific
>           vma_write_lock! It is done before exclusive mmap_lock is
>           released by incrementing mm sequence counter (mm_lock_seq).
> 	- write downgrade - if the mmap_lock is downgraded to the read
> 	  lock all vma write locks are released as well (effectively
> 	  same as write unlock).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-17 15:09   ` Michal Hocko
@ 2023-01-17 15:15     ` Michal Hocko
  2023-01-18  2:07       ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:15 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > To keep vma locking correctness when vm_flags are modified, add modifier
> > functions to be used whenever flags are updated.
> > 
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h       | 38 ++++++++++++++++++++++++++++++++++++++
> >  include/linux/mm_types.h |  8 +++++++-
> >  2 files changed, 45 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ec2c4c227d51..35cf0a6cbcc2 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	vma_init_lock(vma);
> >  }
> >  
> > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > +static inline
> > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +	WRITE_ONCE(vma->vm_flags, flags);
> > +}
> 
> Why do we need WRITE_ONCE here? Isn't vma invisible during its
> initialization?
> 
> > +
> > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > +static inline
> > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +	vma_write_lock(vma);
> > +	init_vm_flags(vma, flags);
> > +}
> > +
> > +static inline
> > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +	vma_write_lock(vma);
> > +	vma->vm_flags |= flags;
> > +}
> > +
> > +static inline
> > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +	vma_write_lock(vma);
> > +	vma->vm_flags &= ~flags;
> > +}
> > +
> > +static inline
> > +void mod_vm_flags(struct vm_area_struct *vma,
> > +		  unsigned long set, unsigned long clear)
> > +{
> > +	vma_write_lock(vma);
> > +	vma->vm_flags |= set;
> > +	vma->vm_flags &= ~clear;
> > +}
> > +
> 
> This is a rather unusual pattern. There is no hint about the locking
> involved in the naming, and why is the locking part of this interface in
> the first place? I can see a reason for access functions to actually
> check lock asserts.

OK, it took me a while but it is clear to me now. The confusion comes
from the naming: vma_write_lock is not a lock in the usual sense. It is
more of a vma_mark_modified with side effects on read locking, which is
a real lock. With that in mind it makes more sense to have this done in
these helpers rather than requiring all users to keep this subtlety in
mind.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-09 20:53 ` [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call Suren Baghdasaryan
@ 2023-01-17 15:16   ` Michal Hocko
  2023-01-18  2:01     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> Move VMA flag modification (which now implies VMA locking) before
> anon_vma_lock_write to match the locking order of page fault handler.

Does this changelog assume per-vma locking in the #PF?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-09 20:53 ` [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page Suren Baghdasaryan
@ 2023-01-17 15:25   ` Michal Hocko
  2023-01-17 20:28     ` Jann Horn
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> Protect VMA from concurrent page fault handler while collapsing a huge
> page. Page fault handler needs a stable PMD to use PTL and relies on
> per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> not be detected by a page fault handler without proper locking.

I am struggling with this changelog, maybe because my recollection of
the THP collapsing subtleties is weak. But aren't you just trying to say
that #PF handling and THP collapsing currently need to be mutually
exclusive, so in order to keep that assumption you have to mark the vma
write-locked?

Also it is not really clear to me how that handles other vmas which can
share the same thp?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
  2023-01-09 20:53 ` [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
@ 2023-01-17 15:42   ` Michal Hocko
  2023-01-18  1:53     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:42 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> Assert there are no holders of VMA lock for reading when it is about to be
> destroyed.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h | 8 ++++++++
>  kernel/fork.c      | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 594e835bad9c..c464fc8a514c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
>  }
>  
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> +{
> +	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> +		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> +		      vma);

Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
us something is wrong on its own, no? This could be somebody racing with
the vma destruction and using the write lock. Unlikely, but I do not see
why we should narrow the debugging scope.
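In other words, the suggestion amounts to something like this (sketch only,
not the posted code):

static inline void vma_assert_no_reader(struct vm_area_struct *vma)
{
	/* Any holder at destruction time, reader or writer, is a bug. */
	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
}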
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-09 20:53 ` [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code Suren Baghdasaryan
@ 2023-01-17 15:47   ` Michal Hocko
  2023-01-18  1:06     ` Suren Baghdasaryan
  2023-01-17 21:03   ` Jann Horn
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.

Could you describe why only anonymous vmas are handled at this stage and
what (roughly) has to be done to support other vmas? lock_vma_under_rcu
doesn't seem to have any anonymous vma specific requirements AFAICS.

Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
the naming is really the most important part but the rcu locking is
internal to the function so why should we spread this implementation
detail to the world...

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c464fc8a514c..d0fddf6a1de9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
>  		      vma);
>  }
>  
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +					  unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ece18548db1..a658e26d965d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +/*
> + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> + * stable and not isolated. If the VMA is not found or is being modified the
> + * function returns NULL.
> + */
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +					  unsigned long address)
> +{
> +	MA_STATE(mas, &mm->mm_mt, address, address);
> +	struct vm_area_struct *vma, *validate;
> +
> +	rcu_read_lock();
> +	vma = mas_walk(&mas);
> +retry:
> +	if (!vma)
> +		goto inval;
> +
> +	/* Only anonymous vmas are supported for now */
> +	if (!vma_is_anonymous(vma))
> +		goto inval;
> +
> +	if (!vma_read_trylock(vma))
> +		goto inval;
> +
> +	/* Check since vm_start/vm_end might change before we lock the VMA */
> +	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +		vma_read_unlock(vma);
> +		goto inval;
> +	}
> +
> +	/* Check if the VMA got isolated after we found it */
> +	mas.index = address;
> +	validate = mas_walk(&mas);
> +	if (validate != vma) {
> +		vma_read_unlock(vma);
> +		count_vm_vma_lock_event(VMA_LOCK_MISS);
> +		/* The area was replaced with another one. */
> +		vma = validate;
> +		goto retry;
> +	}
> +
> +	rcu_read_unlock();
> +	return vma;
> +inval:
> +	rcu_read_unlock();
> +	count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +	return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  #ifndef __PAGETABLE_P4D_FOLDED
>  /*
>   * Allocate p4d page table.
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-09 20:53 ` [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
@ 2023-01-17 15:57   ` Michal Hocko
  2023-01-18  1:19     ` Suren Baghdasaryan
  2023-01-19 12:59   ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 15:57 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.

What kind of regressions?

> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

Please add some data to justify this additional complexity.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
  2023-01-17 15:04   ` Michal Hocko
  2023-01-17 15:07   ` Michal Hocko
@ 2023-01-17 18:02   ` Jann Horn
  2023-01-17 21:28     ` Suren Baghdasaryan
  2023-01-18 12:28     ` Michal Hocko
  2 siblings, 2 replies; 186+ messages in thread
From: Jann Horn @ 2023-01-17 18:02 UTC (permalink / raw)
  To: Suren Baghdasaryan, peterz, Ingo Molnar, Will Deacon
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

+locking maintainers

On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
[...]
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> +       up_read(&vma->lock);
> +}

One thing that might be gnarly here is that I think you might not be
allowed to use up_read() to fully release ownership of an object -
from what I remember, I think that up_read() (unlike something like
spin_unlock()) can access the lock object after it's already been
acquired by someone else. So if you want to protect against concurrent
deletion, this might have to be something like:

rcu_read_lock(); /* keeps vma alive */
up_read(&vma->lock);
rcu_read_unlock();

But I'm not entirely sure about that, the locking folks might know better.

Also, it might not matter given that the rw_semaphore part is removed
in the current patch 41/41 anyway...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
                     ` (2 preceding siblings ...)
       [not found]   ` <20230116140649.2012-1-hdanton@sina.com>
@ 2023-01-17 18:11   ` Jann Horn
  2023-01-17 18:26     ` Suren Baghdasaryan
  3 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 18:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
[...]
>  static inline void vma_read_unlock(struct vm_area_struct *vma)
>  {
> -       up_read(&vma->vm_lock->lock);
> +       if (atomic_dec_and_test(&vma->vm_lock->count))
> +               wake_up(&vma->vm_mm->vma_writer_wait);
>  }

I haven't properly reviewed this, but this bit looks like a
use-after-free because you're accessing the vma after dropping your
reference on it. You'd have to first look up the vma->vm_mm, then do
the atomic_dec_and_test(), and afterwards do the wake_up() without
touching the vma. Or alternatively wrap the whole thing in an RCU
read-side critical section if the VMA is freed with RCU delay.
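Concretely, the reordering described above would look roughly like this (a
sketch; whether it is sufficient depends on the vma lifetime rules being
discussed):

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/* Grab the mm pointer before dropping our reference on the lock. */
	struct mm_struct *mm = vma->vm_mm;

	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&mm->vma_writer_wait);
}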

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
       [not found]           ` <20230117083355.2374-1-hdanton@sina.com>
@ 2023-01-17 18:21             ` Suren Baghdasaryan
  2023-01-17 18:27               ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:21 UTC (permalink / raw)
  To: Hillf Danton
  Cc: vbabka, hannes, mgorman, peterz, hughd, linux-kernel, linux-mm

On Tue, Jan 17, 2023 at 12:34 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 16 Jan 2023 20:52:45 -0800 Suren Baghdasaryan <surenb@google.com>
> > On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <hdanton@sina.com> wrote:
> > > No you are not.
> >
> > I'm not wrong or the other way around? Please expand a bit.
>
> You are not wrong.

Ok, I think if I rewrite the vma_read_trylock() we should be fine?:

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	int count, new;

	/* Check before locking. A race might cause false locked result. */
	if (READ_ONCE(vma->vm_lock->lock_seq) ==
	    READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	count = atomic_read(&vma->vm_lock->count);
	for (;;) {
		/*
		 * Is the VMA write-locked? Overflow might produce false
		 * locked result.
		 * False unlocked result is impossible because we modify and
		 * check vma->vm_lock_seq under vma->vm_lock protection and
		 * mm->mm_lock_seq modification invalidates all existing locks.
		 */
		if (count < 0)
			return false;

		new = count + 1;
		/* If atomic_t overflows, fail to lock. */
		if (new < 0)
			return false;

		/*
		 * Atomic RMW will provide implicit mb on success to pair with
		 * smp_wmb in vma_write_lock, on failure we retry.
		 */
		new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
		if (new == count)
			break;
		count = new;
		cpu_relax();
	}
	if (unlikely(READ_ONCE(vma->vm_lock->lock_seq) ==
		     READ_ONCE(vma->vm_mm->mm_lock_seq))) {
		if (atomic_dec_and_test(&vma->vm_lock->count))
			wake_up(&vma->vm_mm->vma_writer_wait);
		return false;
	}
	return true;
}
> > >
> > > If the writer lock owner is preempted by a reader while releasing lock,
> > >
> > >         set count to zero
> > >                           <-- preempt
> > >         wake up waiters
> > >
> > > then lock is owned by reader but with read waiters.
> > >
> > > That is buggy if write waiter starvation is allowed in this patchset.
> >
> > I don't quite understand your point here. Readers don't wait, so there
> > can't be "read waiters". Could you please expand with a race diagram
> > maybe?
>
>         cpu3                    cpu2
>         ---                     ---
>         taskA bound to cpu3
>         down_write(&mm->mmap_lock);
>         vma_write_lock L
>                                 taskB fail to take L for read
>                                 taskC fail to take mmap_lock for write
>                                 taskD fail to take L for read
>         vma_write_unlock_mm(mm);
>
>         preempted by taskE
>            taskE take L for read and
>            read waiters of L, taskB and taskD,
>            should be woken up
>
>         up_write(&mm->mmap_lock);

Readers never wait for the vma lock; that's why we have only
vma_read_trylock and no vma_read_lock. In your scenario taskB and
taskD will fall back to taking mmap_lock for read after their
vma_read_trylock fails. Once taskA does up_write(mmap_lock) they will
be woken up since they are blocked on taking mmap_lock for read.
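For context, the fall-back pattern being described looks roughly like this on
the fault path. This is schematic only: do_fault_with_vma_lock is a made-up
wrapper, lock_vma_under_rcu is the helper from patch 28/41, and real fault
handlers do additional checks omitted here.

static vm_fault_t do_fault_with_vma_lock(struct mm_struct *mm,
					 unsigned long address,
					 unsigned int flags,
					 struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t ret;

	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* Fast path: no mmap_lock taken at all. */
		ret = handle_mm_fault(vma, address, flags, regs);
		vma_read_unlock(vma);
		return ret;
	}

	/* Readers never wait on the vma lock; they queue on mmap_lock. */
	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	ret = vma ? handle_mm_fault(vma, address, flags, regs)
		  : VM_FAULT_SIGSEGV;
	mmap_read_unlock(mm);
	return ret;
}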

>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17  5:58           ` Suren Baghdasaryan
@ 2023-01-17 18:23             ` Matthew Wilcox
  2023-01-17 18:28               ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 18:23 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hyeonggon Yoo, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > >             return false;
> > > > > >
> > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > >             return false;
> > > > > >
> > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > +           return false;
> > > > > > +   }
> > > > > > +
> > > > > >     /*
> > > > > >      * Overflow might produce false locked result.
> > > > > >      * False unlocked result is impossible because we modify and check
> > > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > >      * modification invalidates all existing locks.
> > > > > >      */
> > > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > -           up_read(&vma->vm_lock->lock);
> > > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > >             return false;
> > > > > >     }
> > > > >
> > > > > With this change readers can cause writers to starve.
> > > > > What about checking waitqueue_active() before or after increasing
> > > > > vma->vm_lock->count?
> > > >
> > > > I don't understand how readers can starve a writer.  Readers do
> > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > to fail.
> > >
> > > I think the point here was that if page faults keep occurring and they
> > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > and there is no reader throttling mechanism (no max time that writer
> > > will be waiting).
> >
> > Perhaps I misunderstood your description; I thought that a _waiting_
> > writer would make the count negative, not a successfully acquiring
> > writer.
> 
> A waiting writer does not modify the counter, instead it's placed on
> the wait queue and the last reader which sets the count to 0 while
> releasing its read lock will wake it up. Once the writer is woken it
> will try to set the count to negative and if successful will own the
> lock, otherwise it goes back to sleep.

Then yes, that's a starvable lock.  Preventing starvation on the mmap
sem was the original motivation for making rwsems non-starvable, so
changing that behaviour now seems like a bad idea.  For efficiency, I'd
suggest that a waiting writer set the top bit of the counter.  That way,
all new readers will back off without needing to check a second variable
and old readers will know that they *may* need to do the wakeup when
atomic_sub_return_release() is negative.

(rwsem.c has a more complex bitfield, but I don't think we need to go
that far; the important point is that the waiting writer indicates its
presence in the count field so that readers can modify their behaviour)
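A rough sketch of the reader-release side under that scheme (illustrative
only; the bias value, how the writer applies it, and the helper body are
assumptions made for this example, not the posted code):

/*
 * A waiting writer would first make the count negative, e.g. with
 * atomic_add(INT_MIN, &vma->vm_lock->count), so that new readers back
 * off in atomic_inc_unless_negative(). Readers then release like this:
 */
static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;

	/*
	 * A negative result means a writer is waiting. This reader *may*
	 * have been the last one, so issue a (possibly spurious) wakeup
	 * and let the writer re-check whether all readers are gone.
	 */
	if (atomic_sub_return_release(1, &vma->vm_lock->count) < 0)
		wake_up(&mm->vma_writer_wait);
}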

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:11   ` Jann Horn
@ 2023-01-17 18:26     ` Suren Baghdasaryan
  2023-01-17 18:31       ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:26 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> [...]
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
> > -       up_read(&vma->vm_lock->lock);
> > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > +               wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>
> I haven't properly reviewed this, but this bit looks like a
> use-after-free because you're accessing the vma after dropping your
> reference on it. You'd have to first look up the vma->vm_mm, then do
> the atomic_dec_and_test(), and afterwards do the wake_up() without
> touching the vma. Or alternatively wrap the whole thing in an RCU
> read-side critical section if the VMA is freed with RCU delay.

vm_lock->count does not control the lifetime of the VMA; it's a
counter of how many readers took the lock, or a negative value if the
lock is write-locked.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:21             ` Suren Baghdasaryan
@ 2023-01-17 18:27               ` Matthew Wilcox
  2023-01-17 18:31                 ` Suren Baghdasaryan
       [not found]                 ` <20230118062639.2839-1-hdanton@sina.com>
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 18:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Hillf Danton, vbabka, hannes, mgorman, peterz, hughd,
	linux-kernel, linux-mm

On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> static inline bool vma_read_trylock(struct vm_area_struct *vma)
> {
>         int count, new;
> 
>         /* Check before locking. A race might cause false locked result. */
>         if (READ_ONCE(vma->vm_lock->lock_seq) ==
>             READ_ONCE(vma->vm_mm->mm_lock_seq))
>                 return false;
> 
>         count = atomic_read(&vma->vm_lock->count);
>         for (;;) {
>                 /*
>                  * Is the VMA write-locked? Overflow might produce false
>                  * locked result.
>                  * False unlocked result is impossible because we modify
>                  * and check vma->vm_lock_seq under vma->vm_lock protection
>                  * and mm->mm_lock_seq modification invalidates all
>                  * existing locks.
>                  */
>                 if (count < 0)
>                         return false;
> 
>                 new = count + 1;
>                 /* If atomic_t overflows, fail to lock. */
>                 if (new < 0)
>                         return false;
> 
>                 /*
>                  * Atomic RMW will provide implicit mb on success to pair
>                  * with smp_wmb in vma_write_lock, on failure we retry.
>                  */
>                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                 if (new == count)
>                         break;
>                 count = new;
>                 cpu_relax();

The cpu_relax() is exactly the wrong thing to do here.  See this thread:
https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:23             ` Matthew Wilcox
@ 2023-01-17 18:28               ` Suren Baghdasaryan
  2023-01-17 20:31                 ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hyeonggon Yoo, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > >             return false;
> > > > > > >
> > > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > >             return false;
> > > > > > >
> > > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > +           return false;
> > > > > > > +   }
> > > > > > > +
> > > > > > >     /*
> > > > > > >      * Overflow might produce false locked result.
> > > > > > >      * False unlocked result is impossible because we modify and check
> > > > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > > >      * modification invalidates all existing locks.
> > > > > > >      */
> > > > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > -           up_read(&vma->vm_lock->lock);
> > > > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > >             return false;
> > > > > > >     }
> > > > > >
> > > > > > With this change readers can cause writers to starve.
> > > > > > What about checking waitqueue_active() before or after increasing
> > > > > > vma->vm_lock->count?
> > > > >
> > > > > I don't understand how readers can starve a writer.  Readers do
> > > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > > to fail.
> > > >
> > > > I think the point here was that if page faults keep occurring and they
> > > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > > and there is no reader throttling mechanism (no max time that writer
> > > > will be waiting).
> > >
> > > Perhaps I misunderstood your description; I thought that a _waiting_
> > > writer would make the count negative, not a successfully acquiring
> > > writer.
> >
> > A waiting writer does not modify the counter, instead it's placed on
> > the wait queue and the last reader which sets the count to 0 while
> > releasing its read lock will wake it up. Once the writer is woken it
> > will try to set the count to negative and if successful will own the
> > lock, otherwise it goes back to sleep.
>
> Then yes, that's a starvable lock.  Preventing starvation on the mmap
> sem was the original motivation for making rwsems non-starvable, so
> changing that behaviour now seems like a bad idea.  For efficiency, I'd
> suggest that a waiting writer set the top bit of the counter.  That way,
> all new readers will back off without needing to check a second variable
> and old readers will know that they *may* need to do the wakeup when
> atomic_sub_return_release() is negative.
>
> (rwsem.c has a more complex bitfield, but I don't think we need to go
> that far; the important point is that the waiting writer indicates its
> presence in the count field so that readers can modify their behaviour)

Got it. Ok, I think we can figure something out to check if there are
waiting write-lockers and prevent new readers from taking the lock.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:26     ` Suren Baghdasaryan
@ 2023-01-17 18:31       ` Matthew Wilcox
  2023-01-17 18:36         ` Jann Horn
  2023-01-17 18:36         ` Suren Baghdasaryan
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 18:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > [...]
> > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > >  {
> > > -       up_read(&vma->vm_lock->lock);
> > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > >  }
> >
> > I haven't properly reviewed this, but this bit looks like a
> > use-after-free because you're accessing the vma after dropping your
> > reference on it. You'd have to first look up the vma->vm_mm, then do
> > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > touching the vma. Or alternatively wrap the whole thing in an RCU
> > read-side critical section if the VMA is freed with RCU delay.
> 
> vm_lock->count does not control the lifetime of the VMA, it's a
> counter of how many readers took the lock or it's negative if the lock
> is write-locked.

Yes, but ...
	
	Task A:
	atomic_dec_and_test(&vma->vm_lock->count)
			Task B:
			munmap()
			write lock
			free VMA
			synchronize_rcu()
			VMA is really freed
        wake_up(&vma->vm_mm->vma_writer_wait);

... vma is freed.

Now, I think this doesn't occur.  I'm pretty sure that every caller of
vma_read_unlock() is holding the RCU read lock.  But maybe we should
have that assertion?


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:27               ` Matthew Wilcox
@ 2023-01-17 18:31                 ` Suren Baghdasaryan
       [not found]                 ` <20230118062639.2839-1-hdanton@sina.com>
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hillf Danton, vbabka, hannes, mgorman, peterz, hughd,
	linux-kernel, linux-mm

On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> >         int count, new;
> > 
> >         /* Check before locking. A race might cause false locked result. */
> >         if (READ_ONCE(vma->vm_lock->lock_seq) ==
> >             READ_ONCE(vma->vm_mm->mm_lock_seq))
> >                 return false;
> > 
> >         count = atomic_read(&vma->vm_lock->count);
> >         for (;;) {
> >                 /*
> >                  * Is the VMA write-locked? Overflow might produce false
> >                  * locked result.
> >                  * False unlocked result is impossible because we modify
> >                  * and check vma->vm_lock_seq under vma->vm_lock protection
> >                  * and mm->mm_lock_seq modification invalidates all
> >                  * existing locks.
> >                  */
> >                 if (count < 0)
> >                         return false;
> > 
> >                 new = count + 1;
> >                 /* If atomic_t overflows, fail to lock. */
> >                 if (new < 0)
> >                         return false;
> > 
> >                 /*
> >                  * Atomic RMW will provide implicit mb on success to pair
> >                  * with smp_wmb in vma_write_lock, on failure we retry.
> >                  */
> >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                 if (new == count)
> >                         break;
> >                 count = new;
> >                 cpu_relax();
>
> The cpu_relax() is exactly the wrong thing to do here.  See this thread:
> https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/

Thanks for the pointer, Matthew. I think we can safely remove
cpu_relax() since it's unlikely the count is constantly changing under
a reader.

>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct
  2023-01-09 20:53 ` [PATCH 40/41] mm: separate vma->lock from vm_area_struct Suren Baghdasaryan
@ 2023-01-17 18:33   ` Jann Horn
  2023-01-17 19:01     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 18:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> vma->lock being part of the vm_area_struct causes performance regression
> during page faults because during contention its count and owner fields
> are constantly updated and having other parts of vm_area_struct used
> during page fault handling next to them causes constant cache line
> bouncing. Fix that by moving the lock outside of the vm_area_struct.
> All attempts to keep vma->lock inside vm_area_struct in a separate
> cache line still produce performance regression especially on NUMA
> machines. Smallest regression was achieved when lock is placed in the
> fourth cache line but that bloats vm_area_struct to 256 bytes.

Just checking: When you tested putting the lock in different cache
lines, did you force the slab allocator to actually store the
vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
on the slab or with a ____cacheline_aligned_in_smp on the struct
definition)?
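
(Meaning something roughly like this; purely illustrative, I haven't
checked how the cache is actually set up in fork.c:)

        vm_area_cachep = kmem_cache_create("vm_area_struct",
                                           sizeof(struct vm_area_struct), 0,
                                           SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
                                           NULL);

or alternatively on the struct itself:

        struct vm_area_struct {
                [...]
        } ____cacheline_aligned_in_smp;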

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:31       ` Matthew Wilcox
@ 2023-01-17 18:36         ` Jann Horn
  2023-01-17 18:49           ` Suren Baghdasaryan
  2023-01-17 18:36         ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 18:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -       up_read(&vma->vm_lock->lock);
> > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
>         Task A:
>         atomic_dec_and_test(&vma->vm_lock->count)
>                         Task B:
>                         munmap()
>                         write lock
>                         free VMA
>                         synchronize_rcu()
>                         VMA is really freed
>         wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

I don't see that. When do_user_addr_fault() is calling
vma_read_unlock(), there's no RCU read lock held, right?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:31       ` Matthew Wilcox
  2023-01-17 18:36         ` Jann Horn
@ 2023-01-17 18:36         ` Suren Baghdasaryan
  2023-01-17 18:48           ` Matthew Wilcox
  1 sibling, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jann Horn, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -       up_read(&vma->vm_lock->lock);
> > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
>         Task A:
>         atomic_dec_and_test(&vma->vm_lock->count)
>                         Task B:
>                         munmap()
>                         write lock
>                         free VMA
>                         synchronize_rcu()
>                         VMA is really freed
>         wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

Yep, that's what this patch is doing
https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
by calling vma_assert_no_reader() from __vm_area_free().

>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:36         ` Suren Baghdasaryan
@ 2023-01-17 18:48           ` Matthew Wilcox
  2023-01-17 18:55             ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 18:48 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -       up_read(&vma->vm_lock->lock);
> > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> >         Task A:
> >         atomic_dec_and_test(&vma->vm_lock->count)
> >                         Task B:
> >                         munmap()
> >                         write lock
> >                         free VMA
> >                         synchronize_rcu()
> >                         VMA is really freed
> >         wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
> 
> Yep, that's what this patch is doing
> https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> by calling vma_assert_no_reader() from __vm_area_free().

That's not enough though.  Task A still has a pointer to vma after it
has called atomic_dec_and_test(), even after vma has been freed by
Task B, and before Task A dereferences vma->vm_mm.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:36         ` Jann Horn
@ 2023-01-17 18:49           ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Wilcox, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:36 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -       up_read(&vma->vm_lock->lock);
> > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> >         Task A:
> >         atomic_dec_and_test(&vma->vm_lock->count)
> >                         Task B:
> >                         munmap()
> >                         write lock
> >                         free VMA
> >                         synchronize_rcu()
> >                         VMA is really freed
> >         wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
>
> I don't see that. When do_user_addr_fault() is calling
> vma_read_unlock(), there's no RCU read lock held, right?

We free VMAs using call_rcu() after removing them from VMA tree. OTOH
page fault handlers are searching for VMAs from inside RCU read
section and calling vma_read_unlock() from there, see
https://lore.kernel.org/all/20230109205336.3665937-29-surenb@google.com/.
Once we take the VMA read-lock, the VMA can't be write-locked, and
anyone destroying or isolating the VMA needs to write-lock it first.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:48           ` Matthew Wilcox
@ 2023-01-17 18:55             ` Suren Baghdasaryan
  2023-01-17 18:59               ` Jann Horn
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 18:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jann Horn, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > with a simpler structure:
> > > > > [...]
> > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > >  }
> > > > >
> > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > use-after-free because you're accessing the vma after dropping your
> > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > read-side critical section if the VMA is freed with RCU delay.
> > > >
> > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > counter of how many readers took the lock or it's negative if the lock
> > > > is write-locked.
> > >
> > > Yes, but ...
> > >
> > >         Task A:
> > >         atomic_dec_and_test(&vma->vm_lock->count)
> > >                         Task B:
> > >                         munmap()
> > >                         write lock
> > >                         free VMA
> > >                         synchronize_rcu()
> > >                         VMA is really freed
> > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > >
> > > ... vma is freed.
> > >
> > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > have that assertion?
> >
> > Yep, that's what this patch is doing
> > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > by calling vma_assert_no_reader() from __vm_area_free().
>
> That's not enough though.  Task A still has a pointer to vma after it
> has called atomic_dec_and_test(), even after vma has been freed by
> Task B, and before Task A dereferences vma->vm_mm.

Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
local variable and call mmgrab() before atomic_dec_and_test(), then
use it in wake_up() and call mmdrop(). Is that what you are thinking?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:55             ` Suren Baghdasaryan
@ 2023-01-17 18:59               ` Jann Horn
  2023-01-17 19:06                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 18:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > with a simpler structure:
> > > > > > [...]
> > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > >  }
> > > > > >
> > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > >
> > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > is write-locked.
> > > >
> > > > Yes, but ...
> > > >
> > > >         Task A:
> > > >         atomic_dec_and_test(&vma->vm_lock->count)
> > > >                         Task B:
> > > >                         munmap()
> > > >                         write lock
> > > >                         free VMA
> > > >                         synchronize_rcu()
> > > >                         VMA is really freed
> > > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > > >
> > > > ... vma is freed.
> > > >
> > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > have that assertion?
> > >
> > > Yep, that's what this patch is doing
> > > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > > by calling vma_assert_no_reader() from __vm_area_free().
> >
> > That's not enough though.  Task A still has a pointer to vma after it
> > has called atomic_dec_and_test(), even after vma has been freed by
> > Task B, and before Task A dereferences vma->vm_mm.
>
> Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> local variable and call mmgrab() before atomic_dec_and_test(), then
> use it in wake_up() and call mmdrop(). Is that what you are thinking?

You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
for page fault handling must be keeping the mm_struct alive.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct
  2023-01-17 18:33   ` Jann Horn
@ 2023-01-17 19:01     ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 19:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:34 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > vma->lock being part of the vm_area_struct causes performance regression
> > during page faults because during contention its count and owner fields
> > are constantly updated and having other parts of vm_area_struct used
> > during page fault handling next to them causes constant cache line
> > bouncing. Fix that by moving the lock outside of the vm_area_struct.
> > All attempts to keep vma->lock inside vm_area_struct in a separate
> > cache line still produce performance regression especially on NUMA
> > machines. Smallest regression was achieved when lock is placed in the
> > fourth cache line but that bloats vm_area_struct to 256 bytes.
>
> Just checking: When you tested putting the lock in different cache
> lines, did you force the slab allocator to actually store the
> vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
> on the slab or with a ____cacheline_aligned_in_smp on the struct
> definition)?

Yep, I tried all these combinations and still saw noticeable regression.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:59               ` Jann Horn
@ 2023-01-17 19:06                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 19:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Wilcox, akpm, michel, jglisse, mhocko, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 11:00 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > > with a simpler structure:
> > > > > > > [...]
> > > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > >  {
> > > > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > >  }
> > > > > > >
> > > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > > >
> > > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > > is write-locked.
> > > > >
> > > > > Yes, but ...
> > > > >
> > > > >         Task A:
> > > > >         atomic_dec_and_test(&vma->vm_lock->count)
> > > > >                         Task B:
> > > > >                         munmap()
> > > > >                         write lock
> > > > >                         free VMA
> > > > >                         synchronize_rcu()
> > > > >                         VMA is really freed
> > > > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >
> > > > > ... vma is freed.
> > > > >
> > > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > > have that assertion?
> > > >
> > > > Yep, that's what this patch is doing
> > > > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > > > by calling vma_assert_no_reader() from __vm_area_free().
> > >
> > > That's not enough though.  Task A still has a pointer to vma after it
> > > has called atomic_dec_and_test(), even after vma has been freed by
> > > Task B, and before Task A dereferences vma->vm_mm.
> >
> > Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> > local variable and call mmgrab() before atomic_dec_and_test(), then
> > use it in wake_up() and call mmdrop(). Is that what you are thinking?
>
> You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
> for page fault handling must be keeping the mm_struct alive.

Good point. Will update in the next revision to store mm before
dropping the count. Thanks for all the comments folks!
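
FWIW what I have in mind is (untested):

        static inline void vma_read_unlock(struct vm_area_struct *vma)
        {
                /*
                 * The vma can be freed by a write-locker as soon as we drop
                 * our count; the fault path keeps the mm_struct alive.
                 */
                struct mm_struct *mm = vma->vm_mm;

                if (atomic_dec_and_test(&vma->vm_lock->count))
                        wake_up(&mm->vma_writer_wait);
        }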

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock
  2023-01-09 20:53 ` [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock Suren Baghdasaryan
@ 2023-01-17 19:51   ` Jann Horn
  2023-01-17 20:36     ` Jann Horn
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 19:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Suggested-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 20806bc8b4eb..12508f4d845a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma->anon_vma)
>                 goto inval;
>
> +       /*
> +        * Due to the possibility of userfault handler dropping mmap_lock, avoid
> +        * it for now and fall back to page fault handling under mmap_lock.
> +        */
> +       if (userfaultfd_armed(vma))
> +               goto inval;

This looks racy wrt concurrent userfaultfd_register(). I think you'll
want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
and ensure that the userfaultfd code write-locks the VMA before
changing the __VM_UFFD_FLAGS in vma->vm_flags.
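
I.e. roughly this, assuming the second half of that is also done (that
userfaultfd_register() write-locks the VMA before touching the flags):

        if (!vma_read_trylock(vma))
                goto inval;

        /* Recheck under the VMA read lock. */
        if (userfaultfd_armed(vma)) {
                vma_read_unlock(vma);
                goto inval;
        }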

>         if (!vma_read_trylock(vma))
>                 goto inval;
>
> --
> 2.39.0
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-17 15:25   ` Michal Hocko
@ 2023-01-17 20:28     ` Jann Horn
  2023-01-17 21:05       ` Suren Baghdasaryan
  2023-01-18  9:40       ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Jann Horn @ 2023-01-17 20:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while collapsing a huge
> > page. Page fault handler needs a stable PMD to use PTL and relies on
> > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > not be detected by a page fault handler without proper locking.
>
> I am struggling with this changelog. Maybe because my recollection of
> the THP collapsing subtleties is weak. But aren't you just trying to say
> that the current #PF handling and THP collapsing need to be mutually
> exclusive, so in order to keep that assumption you have to mark the
> vma write-locked?
>
> Also it is not really clear to me how that handles other vmas which can
> share the same thp?

It's not about the hugepage itself, it's about how the THP collapse
operation frees page tables.

Before this series, page tables can be walked under any one of the
mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This series adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that one.

There is a codepath in khugepaged that iterates through all mappings
of a file to zap page tables (retract_page_tables()), which locks each
visited mm with mmap_write_trylock() and now also does
vma_write_lock().


I think one aspect of this patch that might cause trouble later on, if
support for non-anonymous VMAs is added, is that retract_page_tables()
now does vma_write_lock() while holding the mapping lock; the page
fault handling path would probably take the locks the other way
around, leading to a deadlock? So the vma_write_lock() in
retract_page_tables() might have to become a trylock later on.

Related: Please add the new VMA lock to the big lock ordering comments
at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
add file VMA support.)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 18:28               ` Suren Baghdasaryan
@ 2023-01-17 20:31                 ` Michal Hocko
  2023-01-17 21:00                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-17 20:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Hyeonggon Yoo, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
[...]
> > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > sem was the original motivation for making rwsems non-starvable, so
> > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > suggest that a waiting writer set the top bit of the counter.  That way,
> > all new readers will back off without needing to check a second variable
> > and old readers will know that they *may* need to do the wakeup when
> > atomic_sub_return_release() is negative.
> >
> > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > that far; the important point is that the waiting writer indicates its
> > presence in the count field so that readers can modify their behaviour)
> 
> Got it. Ok, I think we can figure something out to check if there are
> waiting write-lockers and prevent new readers from taking the lock.

Reinventing locking primitives is a ticket to weird bugs. I would stick
with the rwsem and deal with performance fallouts after it is clear that
the core idea is generally acceptable and based on actual real life
numbers. This whole thing is quite big enough that we do not have to go
through "is this new synchronization primitive correct and behaving
reasonably" exercise.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock
  2023-01-17 19:51   ` Jann Horn
@ 2023-01-17 20:36     ` Jann Horn
  2023-01-17 20:57       ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 20:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 8:51 PM Jann Horn <jannh@google.com> wrote:
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > handling under VMA lock and retry holding mmap_lock. This can be handled
> > more gracefully in the future.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/memory.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 20806bc8b4eb..12508f4d845a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >         if (!vma->anon_vma)
> >                 goto inval;
> >
> > +       /*
> > +        * Due to the possibility of userfault handler dropping mmap_lock, avoid
> > +        * it for now and fall back to page fault handling under mmap_lock.
> > +        */
> > +       if (userfaultfd_armed(vma))
> > +               goto inval;
>
> This looks racy wrt concurrent userfaultfd_register(). I think you'll
> want to do the userfaultfd_armed(vma) check _after_ locking the VMA,

I still think this change is needed...

> and ensure that the userfaultfd code write-locks the VMA before
> changing the __VM_UFFD_FLAGS in vma->vm_flags.

Ah, but now I see you already took care of this half of the issue with
the reset_vm_flags() change in
https://lore.kernel.org/linux-mm/20230109205336.3665937-16-surenb@google.com/
.


> >         if (!vma_read_trylock(vma))
> >                 goto inval;
> >
> > --
> > 2.39.0
> >

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock
  2023-01-17 20:36     ` Jann Horn
@ 2023-01-17 20:57       ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 20:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 12:36 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 8:51 PM Jann Horn <jannh@google.com> wrote:
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Suggested-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/memory.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 20806bc8b4eb..12508f4d845a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >         if (!vma->anon_vma)
> > >                 goto inval;
> > >
> > > +       /*
> > > +        * Due to the possibility of userfault handler dropping mmap_lock, avoid
> > > +        * it for now and fall back to page fault handling under mmap_lock.
> > > +        */
> > > +       if (userfaultfd_armed(vma))
> > > +               goto inval;
> >
> > This looks racy wrt concurrent userfaultfd_register(). I think you'll
> > want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
>
> I still think this change is needed...

Yes, I think you are right. I'll move the check after locking the VMA. Thanks!

>
> > and ensure that the userfaultfd code write-locks the VMA before
> > changing the __VM_UFFD_FLAGS in vma->vm_flags.
>
> Ah, but now I see you already took care of this half of the issue with
> the reset_vm_flags() change in
> https://lore.kernel.org/linux-mm/20230109205336.3665937-16-surenb@google.com/
> .
>
>
> > >         if (!vma_read_trylock(vma))
> > >                 goto inval;
> > >
> > > --
> > > 2.39.0
> > >

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
  2023-01-17 20:31                 ` Michal Hocko
@ 2023-01-17 21:00                   ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Hyeonggon Yoo, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, jannh, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 12:31 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
> [...]
> > > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > > sem was the original motivation for making rwsems non-starvable, so
> > > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > > suggest that a waiting writer set the top bit of the counter.  That way,
> > > all new readers will back off without needing to check a second variable
> > > and old readers will know that they *may* need to do the wakeup when
> > > atomic_sub_return_release() is negative.
> > >
> > > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > > that far; the important point is that the waiting writer indicates its
> > > presence in the count field so that readers can modify their behaviour)
> >
> > Got it. Ok, I think we can figure something out to check if there are
> > waiting write-lockers and prevent new readers from taking the lock.
>
> Reinventing locking primitives is a ticket to weird bugs. I would stick
> with the rwsem and deal with performance fallouts after it is clear that
> the core idea is generally acceptable and based on actual real life
> numbers. This whole thing is quite big enough that we do not have to go
> through "is this new synchronization primitive correct and behaving
> reasonably" exercise.

Point taken. That's one of the reasons I kept this patch separate.
I'll drop this last patch from the series for now. One correction
though: the fallout will not be in performance but in memory
consumption.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-09 20:53 ` [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code Suren Baghdasaryan
  2023-01-17 15:47   ` Michal Hocko
@ 2023-01-17 21:03   ` Jann Horn
  2023-01-17 23:18     ` Liam Howlett
  1 sibling, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-17 21:03 UTC (permalink / raw)
  To: Suren Baghdasaryan, willy, liam.howlett
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	peterz, ldufour, laurent.dufour, paulmck, luto, songliubraving,
	peterx, david, dhowells, hughd, bigeasy, kent.overstreet,
	punit.agrawal, lstoakes, peterjung1337, rientjes, axelrasmussen,
	joelaf, minchan, shakeelb, tatashin, edumazet, gthelen, gurua,
	arjunroy, soheil, hughlynch, leewalsh, posk, linux-mm,
	linux-arm-kernel, linuxppc-dev, x86, linux-kernel, kernel-team

On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
[...]
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +                                         unsigned long address)
> +{
> +       MA_STATE(mas, &mm->mm_mt, address, address);
> +       struct vm_area_struct *vma, *validate;
> +
> +       rcu_read_lock();
> +       vma = mas_walk(&mas);
> +retry:
> +       if (!vma)
> +               goto inval;
> +
> +       /* Only anonymous vmas are supported for now */
> +       if (!vma_is_anonymous(vma))
> +               goto inval;
> +
> +       if (!vma_read_trylock(vma))
> +               goto inval;
> +
> +       /* Check since vm_start/vm_end might change before we lock the VMA */
> +       if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +               vma_read_unlock(vma);
> +               goto inval;
> +       }
> +
> +       /* Check if the VMA got isolated after we found it */
> +       mas.index = address;
> +       validate = mas_walk(&mas);

Question for Maple Tree experts:

Are you allowed to use mas_walk() like this? If the first mas_walk()
call encountered a single-entry tree, it would store mas->node =
MAS_ROOT, right? And then the second call would go into
mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
true, and then mas_state_walk() would return the result of
mas_start(), which is NULL? And we'd end up with mas_walk() returning
NULL on the second run even though the tree hasn't changed?

> +       if (validate != vma) {
> +               vma_read_unlock(vma);
> +               count_vm_vma_lock_event(VMA_LOCK_MISS);
> +               /* The area was replaced with another one. */
> +               vma = validate;
> +               goto retry;
> +       }
> +
> +       rcu_read_unlock();
> +       return vma;
> +inval:
> +       rcu_read_unlock();
> +       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +       return NULL;
> +}


* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-17 20:28     ` Jann Horn
@ 2023-01-17 21:05       ` Suren Baghdasaryan
  2023-01-18  9:40       ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 12:28 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have to mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
>
>
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
>
> Related: Please add the new VMA lock to the big lock ordering comments
> at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
> add file VMA support.)

Thanks for the clarifications and the warning. I'll add appropriate
comments and will take this deadlocking scenario into account when
later implementing support for file-backed page faults.


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 15:04   ` Michal Hocko
  2023-01-17 15:12     ` Michal Hocko
@ 2023-01-17 21:08     ` Suren Baghdasaryan
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on
> per-vma and per-mm sequence counters to note exclusive locking:
>         - read lock - (implemented by vma_read_trylock) requires the
>           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>           differ. If they match then there must be a vma exclusive lock
>           held somewhere.
>         - read unlock - (implemented by vma_read_unlock) is a trivial
>           vma->lock unlock.
>         - write lock - (vma_write_lock) requires the mmap_lock to be
>           held exclusively and the current mm counter is noted to the vma
>           side. This will allow multiple vmas to be locked under a single
>           mmap_lock write lock (e.g. during vma merging). The vma counter
>           is modified under exclusive vma lock.
>         - write unlock - (vma_write_unlock_mm) is a batch release of all
>           vma locks held. It doesn't pair with a specific
>           vma_write_lock! It is done before exclusive mmap_lock is
>           released by incrementing mm sequence counter (mm_lock_seq).
>         - write downgrade - if the mmap_lock is downgraded to the read
>           lock all vma write locks are released as well (effectively
>           same as write unlock).

Thanks for the suggestion, Michal. I'll definitely reuse your description.

>
> > VMA lock is placed on the cache line boundary so that its 'count' field
> > falls into the first cache line while the rest of the fields fall into
> > the second cache line. This lets the 'count' field to be cached with
> > other frequently accessed fields and used quickly in uncontended case
> > while 'owner' and other fields used in the contended case will not
> > invalidate the first cache line while waiting on the lock.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h        | 80 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/mm_types.h  |  8 ++++
> >  include/linux/mmap_lock.h | 13 +++++++
> >  kernel/fork.c             |  4 ++
> >  mm/init-mm.c              |  3 ++
> >  5 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..ec2c4c227d51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -612,6 +612,85 @@ struct vm_operations_struct {
> >                                         unsigned long addr);
> >  };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > +     init_rwsem(&vma->lock);
> > +     vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_write_lock(struct vm_area_struct *vma)
> > +{
> > +     int mm_lock_seq;
> > +
> > +     mmap_assert_write_locked(vma->vm_mm);
> > +
> > +     /*
> > +      * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +      * mm->mm_lock_seq can't be concurrently modified.
> > +      */
> > +     mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > +     if (vma->vm_lock_seq == mm_lock_seq)
> > +             return;
> > +
> > +     down_write(&vma->lock);
> > +     vma->vm_lock_seq = mm_lock_seq;
> > +     up_write(&vma->lock);
> > +}
> > +
> > +/*
> > + * Try to read-lock a vma. The function is allowed to occasionally yield false
> > + * locked result to avoid performance overhead, in which case we fall back to
> > + * using mmap_lock. The function should never yield false unlocked result.
> > + */
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > +     /* Check before locking. A race might cause false locked result. */
> > +     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +             return false;
> > +
> > +     if (unlikely(down_read_trylock(&vma->lock) == 0))
> > +             return false;
> > +
> > +     /*
> > +      * Overflow might produce false locked result.
> > +      * False unlocked result is impossible because we modify and check
> > +      * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> > +      * modification invalidates all existing locks.
> > +      */
> > +     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +             up_read(&vma->lock);
> > +             return false;
> > +     }
> > +     return true;
> > +}
> > +
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +     up_read(&vma->lock);
> > +}
> > +
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > +{
> > +     mmap_assert_write_locked(vma->vm_mm);
> > +     /*
> > +      * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +      * mm->mm_lock_seq can't be concurrently modified.
> > +      */
> > +     VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +             { return false; }
> > +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  {
> >       static const struct vm_operations_struct dummy_vm_ops = {};
> > @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +     vma_init_lock(vma);
> >  }
> >
> >  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index d5cdec1314fe..5f7c5ca89931 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -555,6 +555,11 @@ struct vm_area_struct {
> >       pgprot_t vm_page_prot;
> >       unsigned long vm_flags;         /* Flags, see mm.h. */
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     int vm_lock_seq;
> > +     struct rw_semaphore lock;
> > +#endif
> > +
> >       /*
> >        * For areas with an address space and backing store,
> >        * linkage into the address_space->i_mmap interval tree.
> > @@ -680,6 +685,9 @@ struct mm_struct {
> >                                         * init_mm.mmlist, and are protected
> >                                         * by mmlist_lock
> >                                         */
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +             int mm_lock_seq;
> > +#endif
> >
> >
> >               unsigned long hiwater_rss; /* High-watermark of RSS usage */
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index e49ba91bb1f0..40facd4c398b 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
> >       VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >  }
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> > +{
> > +     mmap_assert_write_locked(mm);
> > +     /* No races during update due to exclusive mmap_lock being held */
> > +     WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +}
> > +#else
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> > +#endif
> > +
> >  static inline void mmap_init_lock(struct mm_struct *mm)
> >  {
> >       init_rwsem(&mm->mmap_lock);
> > @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
> >  static inline void mmap_write_unlock(struct mm_struct *mm)
> >  {
> >       __mmap_lock_trace_released(mm, true);
> > +     vma_write_unlock_mm(mm);
> >       up_write(&mm->mmap_lock);
> >  }
> >
> >  static inline void mmap_write_downgrade(struct mm_struct *mm)
> >  {
> >       __mmap_lock_trace_acquire_returned(mm, false, true);
> > +     vma_write_unlock_mm(mm);
> >       downgrade_write(&mm->mmap_lock);
> >  }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >                */
> >               *new = data_race(*orig);
> >               INIT_LIST_HEAD(&new->anon_vma_chain);
> > +             vma_init_lock(new);
> >               dup_anon_vma_name(orig, new);
> >       }
> >       return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       seqcount_init(&mm->write_protect_seq);
> >       mmap_init_lock(mm);
> >       INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
> >       mm_pgtables_bytes_init(mm);
> >       mm->map_count = 0;
> >       mm->locked_vm = 0;
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index c9327abb771c..33269314e060 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
> >       .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> >       .arg_lock       =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> >       .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     .mm_lock_seq    = 0,
> > +#endif
> >       .user_ns        = &init_user_ns,
> >       .cpu_bitmap     = CPU_BITS_NONE,
> >  #ifdef CONFIG_IOMMU_SVA
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 15:07   ` Michal Hocko
@ 2023-01-17 21:09     ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >                */
> >               *new = data_race(*orig);
> >               INIT_LIST_HEAD(&new->anon_vma_chain);
> > +             vma_init_lock(new);
> >               dup_anon_vma_name(orig, new);
> >       }
> >       return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       seqcount_init(&mm->write_protect_seq);
> >       mmap_init_lock(mm);
> >       INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
>
> The mm shouldn't be visible so why WRITE_ONCE?

True. Will change to a simple assignment.
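
i.e., roughly (a sketch of the follow-up change to mm_init()):

-#ifdef CONFIG_PER_VMA_LOCK
-	WRITE_ONCE(mm->mm_lock_seq, 0);
-#endif
+#ifdef CONFIG_PER_VMA_LOCK
+	mm->mm_lock_seq = 0;
+#endif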

>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 15:12     ` Michal Hocko
@ 2023-01-17 21:21       ` Suren Baghdasaryan
  2023-01-17 21:54         ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> >
> > I have to say I was struggling a bit with the above and only understood
> > what you mean by reading the patch several times. I would phrase it like
> > this (feel free to use if you consider this to be an improvement).
> >
> > Introduce a per-VMA rw_semaphore. The lock implementation relies on
> > per-vma and per-mm sequence counters to note exclusive locking:
> >         - read lock - (implemented by vma_read_trylock) requires the
> >           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> >           differ. If they match then there must be a vma exclusive lock
> >           held somewhere.
> >         - read unlock - (implemented by vma_read_unlock) is a trivial
> >           vma->lock unlock.
> >         - write lock - (vma_write_lock) requires the mmap_lock to be
> >           held exclusively and the current mm counter is noted to the vma
> >           side. This will allow multiple vmas to be locked under a single
> >           mmap_lock write lock (e.g. during vma merging). The vma counter
> >           is modified under exclusive vma lock.
>
> Didn't realize one more thing.
>             Unlike standard write lock this implementation allows to be
>             called multiple times under a single mmap_lock. In a sense
>             it is more of mark_vma_potentially_modified than a lock.

In the RFC it was called vma_mark_locked() originally and renames were
discussed in the email thread ending here:
https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43cb4@suse.cz/.
If other names are preferable I'm open to changing them.

>
> >         - write unlock - (vma_write_unlock_mm) is a batch release of all
> >           vma locks held. It doesn't pair with a specific
> >           vma_write_lock! It is done before exclusive mmap_lock is
> >           released by incrementing mm sequence counter (mm_lock_seq).
> >       - write downgrade - if the mmap_lock is downgraded to the read
> >         lock all vma write locks are released as well (effectively
> >         same as write unlock).
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 18:02   ` Jann Horn
@ 2023-01-17 21:28     ` Suren Baghdasaryan
  2023-01-17 21:45       ` Jann Horn
  2023-01-18 12:28     ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 21:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: peterz, Ingo Molnar, Will Deacon, akpm, michel, jglisse, mhocko,
	vbabka, hannes, mgorman, dave, willy, liam.howlett, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <jannh@google.com> wrote:
>
> +locking maintainers

Thanks! I'll CC the locking maintainers in the next posting.

>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +       up_read(&vma->lock);
> > +}
>
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else. So if you want to protect against concurrent
> deletion, this might have to be something like:
>
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();

But for deleting VMA one would need to write-lock the vma->lock first,
which I assume can't happen until this up_read() is complete. Is that
assumption wrong?

>
> But I'm not entirely sure about that, the locking folks might know better.
>
> Also, it might not matter given that the rw_semaphore part is removed
> in the current patch 41/41 anyway...

This does matter because Michal suggested dropping that last 41/41
patch for now.


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:28     ` Suren Baghdasaryan
@ 2023-01-17 21:45       ` Jann Horn
  2023-01-17 22:36         ` Suren Baghdasaryan
  2023-11-22 14:04         ` Alexander Gordeev
  0 siblings, 2 replies; 186+ messages in thread
From: Jann Horn @ 2023-01-17 21:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: peterz, Ingo Molnar, Will Deacon, akpm, michel, jglisse, mhocko,
	vbabka, hannes, mgorman, dave, willy, liam.howlett, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <jannh@google.com> wrote:
> >
> > +locking maintainers
>
> Thanks! I'll CC the locking maintainers in the next posting.
>
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +       up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else. So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
>
> But for deleting VMA one would need to write-lock the vma->lock first,
> which I assume can't happen until this up_read() is complete. Is that
> assumption wrong?

__up_read() does:

rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
      RWSEM_FLAG_WAITERS)) {
  clear_nonspinnable(sem);
  rwsem_wake(sem);
}

The atomic_long_add_return_release() is the point where we are doing
the main lock-releasing.

So if a reader dropped the read-lock while someone else was waiting on
the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
lock together with it, the reader also does clear_nonspinnable() and
rwsem_wake() afterwards.
But in rwsem_down_write_slowpath(), after we've set
RWSEM_FLAG_WAITERS, we can return successfully immediately once
rwsem_try_write_lock() sees that there are no active readers or
writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
succeeds). We're not necessarily waiting for the "nonspinnable" bit or
the wake.

So yeah, I think down_write() can return successfully before up_read()
is done with its memory accesses.

(Spinlocks are different - the kernel relies on being able to drop
references via spin_unlock() in some places.)


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:21       ` Suren Baghdasaryan
@ 2023-01-17 21:54         ` Matthew Wilcox
  2023-01-17 22:33           ` Suren Baghdasaryan
  2023-01-18  9:18           ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 21:54 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > >
> > > I have to say I was struggling a bit with the above and only understood
> > > what you mean by reading the patch several times. I would phrase it like
> > > this (feel free to use if you consider this to be an improvement).
> > >
> > > Introduce a per-VMA rw_semaphore. The lock implementation relies on
> > > per-vma and per-mm sequence counters to note exclusive locking:
> > >         - read lock - (implemented by vma_read_trylock) requires the
> > >           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > >           differ. If they match then there must be a vma exclusive lock
> > >           held somewhere.
> > >         - read unlock - (implemented by vma_read_unlock) is a trivial
> > >           vma->lock unlock.
> > >         - write lock - (vma_write_lock) requires the mmap_lock to be
> > >           held exclusively and the current mm counter is noted to the vma
> > >           side. This will allow multiple vmas to be locked under a single
> > >           mmap_lock write lock (e.g. during vma merging). The vma counter
> > >           is modified under exclusive vma lock.
> >
> > Didn't realize one more thing.
> >             Unlike standard write lock this implementation allows to be
> >             called multiple times under a single mmap_lock. In a sense
> >             it is more of mark_vma_potentially_modified than a lock.
> 
> In the RFC it was called vma_mark_locked() originally and renames were
> discussed in the email thread ending here:
> https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43cb4@suse.cz/.
> If other names are preferable I'm open to changing them.

I don't want to bikeshed this, but rather than locking it seems to be
more:

	vma_start_read()
	vma_end_read()
	vma_start_write()
	vma_end_write()
	vma_downgrade_write()

... and that these are _implemented_ with locks (in part) is an
implementation detail?

Would that reduce people's confusion?

> >
> > >         - write unlock - (vma_write_unlock_mm) is a batch release of all
> > >           vma locks held. It doesn't pair with a specific
> > >           vma_write_lock! It is done before exclusive mmap_lock is
> > >           released by incrementing mm sequence counter (mm_lock_seq).
> > >       - write downgrade - if the mmap_lock is downgraded to the read
> > >         lock all vma write locks are released as well (effectively
> > >         same as write unlock).
> > --
> > Michal Hocko
> > SUSE Labs


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:54         ` Matthew Wilcox
@ 2023-01-17 22:33           ` Suren Baghdasaryan
  2023-01-18  9:18           ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 22:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 1:54 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > >         - read lock - (implemented by vma_read_trylock) requires the
> > > >           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >           differ. If they match then there must be a vma exclusive lock
> > > >           held somewhere.
> > > >         - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >           vma->lock unlock.
> > > >         - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >           held exclusively and the current mm counter is noted to the vma
> > > >           side. This will allow multiple vmas to be locked under a single
> > > >           mmap_lock write lock (e.g. during vma merging). The vma counter
> > > >           is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > >             Unlike standard write lock this implementation allows to be
> > >             called multiple times under a single mmap_lock. In a sense
> > >             it is more of mark_vma_potentially_modified than a lock.
> >
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43cb4@suse.cz/.
> > If other names are preferable I'm open to changing them.
>
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
>
>         vma_start_read()
>         vma_end_read()
>         vma_start_write()
>         vma_end_write()
>         vma_downgrade_write()

A couple of corrections: we would have to have vma_start_tryread() and
vma_end_write_all(). Also, there is no vma_downgrade_write();
mmap_write_downgrade() simply does vma_end_write_all().
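
To make the mapping concrete, here is a sketch of how the names being
discussed would line up with the current helpers (illustrative only,
not the final API):

static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* May fail; the caller then falls back to taking mmap_lock. */
	return vma_read_trylock(vma);
}

static inline void vma_end_read(struct vm_area_struct *vma)
{
	vma_read_unlock(vma);
}

static inline void vma_start_write(struct vm_area_struct *vma)
{
	/* Marks the vma write-locked; requires mmap_lock held for write. */
	vma_write_lock(vma);
}

static inline void vma_end_write_all(struct mm_struct *mm)
{
	/* Batch release of all vma write locks when mmap_lock is dropped. */
	vma_write_unlock_mm(mm);
}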

>
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?
>
> Would that reduce people's confusion?
>
> > >
> > > >         - write unlock - (vma_write_unlock_mm) is a batch release of all
> > > >           vma locks held. It doesn't pair with a specific
> > > >           vma_write_lock! It is done before exclusive mmap_lock is
> > > >           released by incrementing mm sequence counter (mm_lock_seq).
> > > >       - write downgrade - if the mmap_lock is downgraded to the read
> > > >         lock all vma write locks are released as well (effectively
> > > >         same as write unlock).
> > > --
> > > Michal Hocko
> > > SUSE Labs


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:45       ` Jann Horn
@ 2023-01-17 22:36         ` Suren Baghdasaryan
  2023-01-17 23:15           ` Matthew Wilcox
  2023-11-22 14:04         ` Alexander Gordeev
  1 sibling, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-17 22:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: peterz, Ingo Molnar, Will Deacon, akpm, michel, jglisse, mhocko,
	vbabka, hannes, mgorman, dave, willy, liam.howlett, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 1:46 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +       up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
>
> __up_read() does:
>
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
>       RWSEM_FLAG_WAITERS)) {
>   clear_nonspinnable(sem);
>   rwsem_wake(sem);
> }
>
> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
>
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
>
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.
>
> (Spinlocks are different - the kernel relies on being able to drop
> references via spin_unlock() in some places.)

Thanks for bringing this up. I can add rcu_read_{lock,unlock}() as you
suggested and that would fix the issue because we free VMAs via
call_rcu(). However, this feels to me like an rw_semaphore design
issue: this locking pattern is unsafe and might lead to a UAF.
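
For reference, a minimal sketch of the adjusted helper with Jann's
suggestion applied (assuming VMAs keep being freed via call_rcu()):

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/*
	 * Keep the vma (and the rw_semaphore embedded in it) alive until
	 * up_read() has finished touching the lock word; freeing happens
	 * via call_rcu(), so the RCU read-side section is sufficient.
	 */
	rcu_read_lock();
	up_read(&vma->lock);
	rcu_read_unlock();
}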


* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 22:36         ` Suren Baghdasaryan
@ 2023-01-17 23:15           ` Matthew Wilcox
  0 siblings, 0 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-17 23:15 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, peterz, Ingo Molnar, Will Deacon, akpm, michel,
	jglisse, mhocko, vbabka, hannes, mgorman, dave, liam.howlett,
	ldufour, laurent.dufour, paulmck, luto, songliubraving, peterx,
	david, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 02:36:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 1:46 PM Jann Horn <jannh@google.com> wrote:
> > On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <jannh@google.com> wrote:
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else. So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > >
> > > But for deleting VMA one would need to write-lock the vma->lock first,
> > > which I assume can't happen until this up_read() is complete. Is that
> > > assumption wrong?
> >
> > __up_read() does:
> >
> > rwsem_clear_reader_owned(sem);
> > tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> > DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> > if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> >       RWSEM_FLAG_WAITERS)) {
> >   clear_nonspinnable(sem);
> >   rwsem_wake(sem);
> > }
> >
> > The atomic_long_add_return_release() is the point where we are doing
> > the main lock-releasing.
> >
> > So if a reader dropped the read-lock while someone else was waiting on
> > the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> > lock together with it, the reader also does clear_nonspinnable() and
> > rwsem_wake() afterwards.
> > But in rwsem_down_write_slowpath(), after we've set
> > RWSEM_FLAG_WAITERS, we can return successfully immediately once
> > rwsem_try_write_lock() sees that there are no active readers or
> > writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> > succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> > the wake.
> >
> > So yeah, I think down_write() can return successfully before up_read()
> > is done with its memory accesses.
> >
> > (Spinlocks are different - the kernel relies on being able to drop
> > references via spin_unlock() in some places.)
> 
> Thanks for bringing this up. I can add rcu_read_{lock,unlock}() as you
> suggested and that would fix the issue because we free VMAs via
> call_rcu(). However, this feels to me like an rw_semaphore design
> issue: this locking pattern is unsafe and might lead to a UAF.

We have/had this problem with normal mutexes too.  It was the impetus
for adding the struct completion which is very careful to not touch
anything after the completion is, well, completed.
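
For context, the pattern struct completion enables looks roughly like
this (a generic sketch, not specific to this series):

struct foo {
	struct completion released;
	/* ... */
};

/* Last-user side: must not touch *f after signalling. */
static void foo_put_final(struct foo *f)
{
	complete(&f->released);
}

/*
 * Teardown side: safe to free once the wait returns because, as noted
 * above, complete() is careful not to touch the completion after it
 * has been completed.
 */
static void foo_destroy(struct foo *f)
{
	wait_for_completion(&f->released);
	kfree(f);
}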


* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-17 21:03   ` Jann Horn
@ 2023-01-17 23:18     ` Liam Howlett
  0 siblings, 0 replies; 186+ messages in thread
From: Liam Howlett @ 2023-01-17 23:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, willy, akpm, michel, jglisse, mhocko, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

* Jann Horn <jannh@google.com> [230117 16:04]:
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
> [...]
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                         unsigned long address)
> > +{
> > +       MA_STATE(mas, &mm->mm_mt, address, address);
> > +       struct vm_area_struct *vma, *validate;
> > +
> > +       rcu_read_lock();
> > +       vma = mas_walk(&mas);
> > +retry:
> > +       if (!vma)
> > +               goto inval;
> > +
> > +       /* Only anonymous vmas are supported for now */
> > +       if (!vma_is_anonymous(vma))
> > +               goto inval;
> > +
> > +       if (!vma_read_trylock(vma))
> > +               goto inval;
> > +
> > +       /* Check since vm_start/vm_end might change before we lock the VMA */
> > +       if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +               vma_read_unlock(vma);
> > +               goto inval;
> > +       }
> > +
> > +       /* Check if the VMA got isolated after we found it */
> > +       mas.index = address;
> > +       validate = mas_walk(&mas);
> 
> Question for Maple Tree experts:
> 
> Are you allowed to use mas_walk() like this? If the first mas_walk()
> call encountered a single-entry tree, it would store mas->node =
> MAS_ROOT, right? And then the second call would go into
> mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
> true, and then mas_state_walk() would return the result of
> mas_start(), which is NULL? And we'd end up with mas_walk() returning
> NULL on the second run even though the tree hasn't changed?

This is safe for VMAs.  There might be a bug in the tree regarding
re-walking with a pointer, but it won't matter here.

A single-entry tree will be a pointer if the entry is of the range 0 - 0
(mas.index == 0, mas.last == 0). This would be a zero-sized VMA, which
is not valid.

The second walk will check if the maple node is dead and restart the
walk if it is dead.  If the node isn't dead (almost always the case),
then it will be a very quick walk.

After a mas_walk(), the maple state has mas.index = vma->vm_start
and mas.last = (vma->vm_end - 1). The address is set prior to the second
walk in case of a vma split where mas.index from the first walk
ends up on the other side of the split from the address.

> 
> > +       if (validate != vma) {
> > +               vma_read_unlock(vma);
> > +               count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +               /* The area was replaced with another one. */
> > +               vma = validate;
> > +               goto retry;
> > +       }
> > +
> > +       rcu_read_unlock();
> > +       return vma;
> > +inval:
> > +       rcu_read_unlock();
> > +       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +       return NULL;
> > +}


* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-17 15:47   ` Michal Hocko
@ 2023-01-18  1:06     ` Suren Baghdasaryan
  2023-01-18  2:44       ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  1:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
>
> Could you describe why only anonymous vmas are handled at this stage and
> what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> doesn't seem to have any anonymous vma specific requirements AFAICS.

TBH I haven't spent too much time looking into file-backed page faults
yet, but a couple of tasks I can think of are:
- Ensure that all vma->vm_ops->fault() handlers do not rely on
mmap_lock being read-locked;
- vma->vm_file freeing, like VMA freeing, will need to be done after an
RCU grace period since page fault handlers use it. This will require
some caution because simply adding it into __vm_area_free() called via
call_rcu() will cause the corresponding fops->release() to be called
asynchronously (see the sketch below). I had to solve this issue in the
out-of-tree SPF implementation when the asynchronously called
snd_pcm_release() was problematic.

I'm sure I'm missing more potential issues and maybe Matthew and
Michel can pinpoint more things to resolve here?
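
A sketch of the problematic shape described in the second item above
(hypothetical: the field names and the exact free path may differ; the
point is only where fput(), and hence fops->release(), would end up
running):

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/*
	 * If vm_file were dropped here, the file's release path would run
	 * asynchronously from RCU/workqueue context rather than from the
	 * task doing the unmap, which is what made snd_pcm_release()
	 * problematic in the out-of-tree SPF implementation.
	 */
	if (vma->vm_file)
		fput(vma->vm_file);
	kmem_cache_free(vm_area_cachep, vma);
}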

>
> Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
> the naming is really the most important part but the rcu locking is
> internal to the function so why should we spread this implementation
> detail to the world...

I wanted the name to indicate that the lookup is done with no locks
held. But I'm open to suggestions.

>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h |  3 +++
> >  mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c464fc8a514c..d0fddf6a1de9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> >                     vma);
> >  }
> >
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                       unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ece18548db1..a658e26d965d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +/*
> > + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > + * stable and not isolated. If the VMA is not found or is being modified the
> > + * function returns NULL.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                       unsigned long address)
> > +{
> > +     MA_STATE(mas, &mm->mm_mt, address, address);
> > +     struct vm_area_struct *vma, *validate;
> > +
> > +     rcu_read_lock();
> > +     vma = mas_walk(&mas);
> > +retry:
> > +     if (!vma)
> > +             goto inval;
> > +
> > +     /* Only anonymous vmas are supported for now */
> > +     if (!vma_is_anonymous(vma))
> > +             goto inval;
> > +
> > +     if (!vma_read_trylock(vma))
> > +             goto inval;
> > +
> > +     /* Check since vm_start/vm_end might change before we lock the VMA */
> > +     if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +             vma_read_unlock(vma);
> > +             goto inval;
> > +     }
> > +
> > +     /* Check if the VMA got isolated after we found it */
> > +     mas.index = address;
> > +     validate = mas_walk(&mas);
> > +     if (validate != vma) {
> > +             vma_read_unlock(vma);
> > +             count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +             /* The area was replaced with another one. */
> > +             vma = validate;
> > +             goto retry;
> > +     }
> > +
> > +     rcu_read_unlock();
> > +     return vma;
> > +inval:
> > +     rcu_read_unlock();
> > +     count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +     return NULL;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  #ifndef __PAGETABLE_P4D_FOLDED
> >  /*
> >   * Allocate p4d page table.
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-17 15:57   ` Michal Hocko
@ 2023-01-18  1:19     ` Suren Baghdasaryan
  2023-01-18  9:49       ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  1:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions.
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.

Sorry, should have done that in the first place. A 4.3% regression was
noticed when running the execl test from the unixbench suite. The spawn
test also showed a 1.6% regression. Profiling revealed that vma freeing
was taking
longer due to call_rcu() which is slow when RCU callback offloading is
enabled. I asked Paul McKenney and he explained to me that because the
callbacks are offloaded to some other kthread, possibly running on
some other CPU, it is necessary to use explicit locking.  Locking on a
per-call_rcu() basis would result in excessive contention during
callback flooding. So, by batching call_rcu() work we cut that
overhead and reduce this lock contention.
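
To make the batching idea concrete, the rough shape is something like the
below (illustrative sketch only, not the actual patch; the vma_batch
structure and the vm_free_list member are made up for this example):

/* Sketch: one RCU callback frees a whole batch of VMAs. */
struct vma_batch {
        struct rcu_head rcu;
        struct list_head head;  /* VMAs chained via a made-up vm_free_list */
};

static void vma_batch_free_rcu(struct rcu_head *rcu)
{
        struct vma_batch *batch = container_of(rcu, struct vma_batch, rcu);
        struct vm_area_struct *vma, *next;

        list_for_each_entry_safe(vma, next, &batch->head, vm_free_list)
                kmem_cache_free(vm_area_cachep, vma);
        kfree(batch);
}

That way the per-callback locking cost on the offloading side is paid once
per batch instead of once per VMA.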


> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
  2023-01-17 15:42   ` Michal Hocko
@ 2023-01-18  1:53     ` Suren Baghdasaryan
  2023-01-18  9:43       ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  1:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > Assert there are no holders of VMA lock for reading when it is about to be
> > destroyed.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h | 8 ++++++++
> >  kernel/fork.c      | 2 ++
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 594e835bad9c..c464fc8a514c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >       VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> >  }
> >
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > +{
> > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > +                   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > +                   vma);
>
> Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> us something is wrong on its own, no? This could be somebody racing with
> the vma destruction and using the write lock. Unlikely but I do not see
> why to narrow debugging scope.

I wanted to ensure there are no page fault handlers (read-lockers)
when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
could trigger if someone is concurrently calling vma_write_lock(). But
I don't think we expect someone to be write-locking the VMA while we
are destroying it, so you are right, I'm overcomplicating things here.
I think I can get rid of vma_assert_no_reader() and add
VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
__vm_area_free(). WDYT?
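
Something like this (untested sketch; assumes __vm_area_free() has roughly
this shape, other cleanup elided):

static void __vm_area_free(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);

        /* all readers must be gone by the time the VMA is freed */
        VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
        kmem_cache_free(vm_area_cachep, vma);
}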


> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-17 15:16   ` Michal Hocko
@ 2023-01-18  2:01     ` Suren Baghdasaryan
  2023-01-18  9:23       ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  2:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > Move VMA flag modification (which now implies VMA locking) before
> > anon_vma_lock_write to match the locking order of page fault handler.
>
> Does this changelog assumes per vma locking in the #PF?

Hmm, you are right. Page fault handlers do not use per-vma locks yet
but the changelog already talks about that. Maybe I should change it
to simply:
```
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked.
```
Is that better?

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
  2023-01-17 15:15     ` Michal Hocko
@ 2023-01-18  2:07       ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  2:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 7:15 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > > To keep vma locking correctness when vm_flags are modified, add modifier
> > > functions to be used whenever flags are updated.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/mm.h       | 38 ++++++++++++++++++++++++++++++++++++++
> > >  include/linux/mm_types.h |  8 +++++++-
> > >  2 files changed, 45 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ec2c4c227d51..35cf0a6cbcc2 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > >     vma_init_lock(vma);
> > >  }
> > >
> > > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > > +static inline
> > > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   WRITE_ONCE(vma->vm_flags, flags);
> > > +}
> >
> > Why do we need WRITE_ONCE here? Isn't vma invisible during its
> > initialization?

Ack. Will change to a simple assignment.

> >
> > > +
> > > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > > +static inline
> > > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   init_vm_flags(vma, flags);
> > > +}
> > > +
> > > +static inline
> > > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= flags;
> > > +}
> > > +
> > > +static inline
> > > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags &= ~flags;
> > > +}
> > > +
> > > +static inline
> > > +void mod_vm_flags(struct vm_area_struct *vma,
> > > +             unsigned long set, unsigned long clear)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= set;
> > > +   vma->vm_flags &= ~clear;
> > > +}
> > > +
> >
> > This is rather unusual pattern. There is no note about locking involved
> > in the naming and also why is the locking part of this interface in the
> > first place? I can see reason for access functions to actually check for
> > lock asserts.
>
> OK, it took me a while but it is clear to me now. The confusion comes
> from the naming: vma_write_lock is not a lock in the usual sense. It is
> more of a vma_mark_modified with side effects on read locking, which is a
> real lock. With that it makes more sense to have this done in these
> helpers rather than requiring all users to keep this subtlety in mind.

If renaming vma-locking primitives the way Matthew suggested in
https://lore.kernel.org/all/Y8cZMt01Z1FvVFXh@casper.infradead.org/
makes it easier to read/understand, I'm all for it. Let's discuss the
naming in that email thread because that's where these functions are
introduced.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 09/41] mm: rcu safe VMA freeing
  2023-01-17 14:25   ` Michal Hocko
@ 2023-01-18  2:16     ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18  2:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 6:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
> [...]
> >  void vm_area_free(struct vm_area_struct *vma)
> >  {
> >       free_anon_vma_name(vma);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     call_rcu(&vma->vm_rcu, __vm_area_free);
> > +#else
> >       kmem_cache_free(vm_area_cachep, vma);
> > +#endif
>
> Is it safe to have vma with already freed vma_name? I suspect this is
> safe because of mmap_lock but is there any reason to split the freeing
> process and have this potential UAF lurking?

It should be safe because VMA is either locked or has been isolated
while locked, so no page fault handlers should have access to it. But
you are right, moving free_anon_vma_name() into __vm_area_free() does
seem safer. Will make the change in the next rev.
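
I.e. roughly the following (sketch; assumes __vm_area_free() has this
shape, with the free_anon_vma_name() call dropped from vm_area_free()
itself):

static void __vm_area_free(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);

        /* freed only after the grace period, together with the vma itself */
        free_anon_vma_name(vma);
        kmem_cache_free(vm_area_cachep, vma);
}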

>
> >  }
> >
> >  static void account_kernel_stack(struct task_struct *tsk, int account)
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-18  1:06     ` Suren Baghdasaryan
@ 2023-01-18  2:44       ` Matthew Wilcox
  2023-01-18 21:33         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-18  2:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > page fault handling. When VMA is not found, can't be locked or changes
> > > after being locked, the function returns NULL. The lookup is performed
> > > under RCU protection to prevent the found VMA from being destroyed before
> > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > the results.
> > > For now only anonymous VMAs can be searched this way. In other cases the
> > > function returns NULL.
> >
> > Could you describe why only anonymous vmas are handled at this stage and
> > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > doesn't seem to have any anonymous vma specific requirements AFAICS.
> 
> TBH I haven't spent too much time looking into file-backed page faults
> yet but a couple of tasks I can think of are:
> - Ensure that all vma->vm_ops->fault() handlers do not rely on
> mmap_lock being read-locked;

I think this way lies madness.  There are just too many device drivers
that implement ->fault.  My plan is to call the ->map_pages() method
under RCU without even read-locking the VMA.  If that doesn't satisfy
the fault, then drop all the way back to taking the mmap_sem for read
before calling into ->fault.
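
In very rough pseudo-code, where find_vma_under_rcu() and
do_fault_around_rcu() are invented names just to show the shape of the
idea:

        vm_fault_t ret = VM_FAULT_RETRY;

        rcu_read_lock();
        vma = find_vma_under_rcu(mm, address);           /* invented */
        if (vma && vma->vm_ops && vma->vm_ops->map_pages)
                ret = do_fault_around_rcu(vma, address); /* wraps ->map_pages() */
        rcu_read_unlock();

        if (ret & VM_FAULT_RETRY) {
                /* not satisfied: fall back to mmap_lock for read + ->fault() */
                mmap_read_lock(mm);
                vma = find_vma(mm, address);
                if (vma)
                        ret = handle_mm_fault(vma, address, flags, regs);
                mmap_read_unlock(mm);
        }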


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:54         ` Matthew Wilcox
  2023-01-17 22:33           ` Suren Baghdasaryan
@ 2023-01-18  9:18           ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Michal Hocko @ 2023-01-18  9:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, liam.howlett, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 21:54:58, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > >         - read lock - (implemented by vma_read_trylock) requires the
> > > >           vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >           differ. If they match then there must be a vma exclusive lock
> > > >           held somewhere.
> > > >         - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >           vma->lock unlock.
> > > >         - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >           held exclusively and the current mm counter is noted to the vma
> > > >           side. This will allow multiple vmas to be locked under a single
> > > >           mmap_lock write lock (e.g. during vma merging). The vma counter
> > > >           is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > >             Unlike a standard write lock, this implementation allows it to be
> > >             called multiple times under a single mmap_lock. In a sense
> > >             it is more of a mark_vma_potentially_modified than a lock.
> > 
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43cb4@suse.cz/.
> > If other names are preferable I'm open to changing them.
> 
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
> 
> 	vma_start_read()
> 	vma_end_read()
> 	vma_start_write()
> 	vma_end_write()
> 	vma_downgrade_write()
> 
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?

Agreed!

> Would that reduce people's confusion?

Yes I believe that naming it less like a locking primitive will clarify
it. vma_{start,end}_[try]read is better indeed. I am wondering about the
write side of things because that is where things get confusing. There
is no explicit write lock nor unlock. vma_start_write sounds better than
vma_write_lock, but it still implies a pairing with vma_end_write,
which is never the right thing to call. Wouldn't vma_mark_modified and
vma_publish_changes describe the scheme better?
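
Concretely, I mean roughly this mapping onto the helpers from this patch
(sketch only):

/* mmap_lock is held for write; mark the vma so readers back off */
static inline void vma_mark_modified(struct vm_area_struct *vma)
{
        vma_write_lock(vma);
}

/* called when dropping mmap_lock; makes all marked vmas readable again */
static inline void vma_publish_changes(struct mm_struct *mm)
{
        vma_write_unlock_mm(mm);
}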

The downgrade case is probably the least interesting one because that is
just a one-off thing that can be completely hidden from any code besides
mmap_write_downgrade, so I wouldn't be too concerned about that one.

But as you say, no need to bikeshed this too much. Great naming is hard
and if the scheme is documented properly we can live with a suboptimal
naming as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-18  2:01     ` Suren Baghdasaryan
@ 2023-01-18  9:23       ` Michal Hocko
  2023-01-18 18:09         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18  9:23 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > anon_vma_lock_write to match the locking order of page fault handler.
> >
> > Does this changelog assumes per vma locking in the #PF?
> 
> Hmm, you are right. Page fault handlers do not use per-vma locks yet
> but the changelog already talks about that. Maybe I should change it
> to simply:
> ```
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked.

Because ....

Without that additional reasoning it is not really clear why that is
needed, and it seems arbitrary.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-17 20:28     ` Jann Horn
  2023-01-17 21:05       ` Suren Baghdasaryan
@ 2023-01-18  9:40       ` Michal Hocko
  2023-01-18 12:38         ` Jann Horn
  2023-01-18 17:41         ` Suren Baghdasaryan
  1 sibling, 2 replies; 186+ messages in thread
From: Michal Hocko @ 2023-01-18  9:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Tue 17-01-23 21:28:06, Jann Horn wrote:
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have to mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
> 
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
> 
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
> 
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().

OK, I see. This would be a great addendum to the changelog.
 
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.

This, right?
#PF			retract_page_tables
vma_read_lock
			i_mmap_lock_write
i_mmap_lock_read
			vma_write_lock


I might be missing something but I have only found huge_pmd_share to be
called from the #PF path. That one should be safe as it cannot be a
target for THP. Not that it would matter much because such a dependency
chain would be really subtle.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
  2023-01-18  1:53     ` Suren Baghdasaryan
@ 2023-01-18  9:43       ` Michal Hocko
  2023-01-18 18:06         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18  9:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > Assert there are no holders of VMA lock for reading when it is about to be
> > > destroyed.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/mm.h | 8 ++++++++
> > >  kernel/fork.c      | 2 ++
> > >  2 files changed, 10 insertions(+)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 594e835bad9c..c464fc8a514c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >       VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > >  }
> > >
> > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > +{
> > > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > +                   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > +                   vma);
> >
> > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > us something is wrong on its own, no? This could be somebody racing with
> > the vma destruction and using the write lock. Unlikely but I do not see
> > why to narrow debugging scope.
> 
> I wanted to ensure there are no page fault handlers (read-lockers)
> when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> could trigger if someone is concurrently calling vma_write_lock(). But
> I don't think we expect someone to be write-locking the VMA while we

That would be UAF, no?

> are destroying it, so you are right, I'm overcomplicating things here.
> I think I can get rid of vma_assert_no_reader() and add
> VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> __vm_area_free(). WDYT?

Yes, that adds some debugging. Not sure it is really necessary, but it
is a VM_BUG_ON so why not.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18  1:19     ` Suren Baghdasaryan
@ 2023-01-18  9:49       ` Michal Hocko
  2023-01-18 18:04         ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18  9:49 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed.
> >
> > What kind of regressions.
> >
> > > To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > Please add some data to justify this additional complexity.
> 
> Sorry, should have done that in the first place. A 4.3% regression was
> noticed when running execl test from unixbench suite. spawn test also
> showed 1.6% regression. Profiling revealed that vma freeing was taking
> longer due to call_rcu() which is slow when RCU callback offloading is
> enabled.

Could you be more specific? vma freeing is async with RCU, so how
come this has resulted in a regression? Is there any heavy
synchronize_rcu() in the exec path? That would be interesting
information.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 18:02   ` Jann Horn
  2023-01-17 21:28     ` Suren Baghdasaryan
@ 2023-01-18 12:28     ` Michal Hocko
  2023-01-18 13:23       ` Jann Horn
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18 12:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, peterz, Ingo Molnar, Will Deacon, akpm,
	michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue 17-01-23 19:02:55, Jann Horn wrote:
> +locking maintainers
> 
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +       up_read(&vma->lock);
> > +}
> 
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else.

Yes, I think you are right. From a look into the code it seems that
the UAF is quite unlikely as there is a ton of work to be done between
vma_write_lock used to prepare vma for removal and actual removal.
That doesn't make it less of a problem though.

> So if you want to protect against concurrent
> deletion, this might have to be something like:
> 
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();
> 
> But I'm not entirely sure about that, the locking folks might know better.

I am not a locking expert but to me it looks like this should work
because the final cleanup would have to happen after rcu_read_unlock.

Thanks, I have completely missed this aspect of the locking when looking
into the code.

Btw. looking at this again I have fully realized how hard it is actually
to see that vm_area_free is guaranteed to sync up with ongoing readers.
vma manipulation functions like __adjust_vma make my head spin. Would it
make more sense to have an rcu-style synchronization point in
vm_area_free directly before call_rcu? This would add an overhead of
uncontended down_write of course.
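
I mean something along these lines (sketch; the rest of vm_area_free()
elided):

void vm_area_free(struct vm_area_struct *vma)
{
        /*
         * Synchronization point: wait for any reader that won the race
         * with the write lock and is still inside its critical section.
         * Uncontended down_write should be cheap in the common case.
         */
        down_write(&vma->lock);
        up_write(&vma->lock);

        call_rcu(&vma->vm_rcu, __vm_area_free);
}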
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-18  9:40       ` Michal Hocko
@ 2023-01-18 12:38         ` Jann Horn
  2023-01-18 17:41         ` Suren Baghdasaryan
  1 sibling, 0 replies; 186+ messages in thread
From: Jann Horn @ 2023-01-18 12:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, paulmck, luto, songliubraving, peterx, david,
	dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko <mhocko@suse.com> wrote:
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have to mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                     retract_page_tables
> vma_read_lock
>                         i_mmap_lock_write
> i_mmap_lock_read
>                         vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.

Oops, yeah. Now that I'm looking closer I also don't see a path from
the #PF path to i_mmap_lock_read. Sorry for sending you on a wild
goose chase.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2023-01-09 20:53 ` [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
@ 2023-01-18 12:50   ` Jann Horn
  2023-01-18 17:40     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-18 12:50 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Page fault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to write-lock all
> VMAs and prevent this race with fault handlers that would hold VMA locks.
> VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> locking order as in page fault handlers.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30c7d1c5206e..a256deca0bc0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>   * of mm/rmap.c:
>   *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
>   *     hugetlb mapping);
> + *   - all vmas marked locked

The existing comment above says that this is an *ordered* listing of
which locks are taken.

>   *   - all i_mmap_rwsem locks;
>   *   - all anon_vma->rwseml
>   *
> @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
>         mas_for_each(&mas, vma, ULONG_MAX) {
>                 if (signal_pending(current))
>                         goto out_unlock;
> +               vma_write_lock(vma);
>                 if (vma->vm_file && vma->vm_file->f_mapping &&
>                                 is_vm_hugetlb_page(vma))
>                         vm_lock_mapping(mm, vma->vm_file->f_mapping);

Note that multiple VMAs can have the same ->f_mapping, so with this,
the lock ordering between VMA locks and the mapping locks of hugetlb
VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
->f_mapping, then the following operations happen:

1. lock VMA 1
2. lock mapping of VMAs 1 and 2
3. lock VMA 2
4. [second vm_lock_mapping() is a no-op]

So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
took the mapping lock first.

The existing code has one loop per lock type to ensure that the locks
really are taken in the specified order, even when some of the locks
are associated with multiple VMAs.

If we don't care about the ordering between these two, maybe that's
fine and you just have to adjust the comment; but it would be clearer
to add a separate loop for the VMA locks.
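
I.e. roughly (sketch, reusing the existing loop structure):

        /* take all per-vma locks in their own pass... */
        mas_for_each(&mas, vma, ULONG_MAX) {
                if (signal_pending(current))
                        goto out_unlock;
                vma_write_lock(vma);
        }

        /* ...before any of the mapping locks */
        mas_set(&mas, 0);
        mas_for_each(&mas, vma, ULONG_MAX) {
                if (signal_pending(current))
                        goto out_unlock;
                if (vma->vm_file && vma->vm_file->f_mapping &&
                                is_vm_hugetlb_page(vma))
                        vm_lock_mapping(mm, vma->vm_file->f_mapping);
        }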

> @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
>                 if (vma->vm_file && vma->vm_file->f_mapping)
>                         vm_unlock_mapping(vma->vm_file->f_mapping);
>         }
> +       vma_write_unlock_mm(mm);
>
>         mutex_unlock(&mm_all_locks_mutex);
>  }
> --
> 2.39.0
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-18 12:28     ` Michal Hocko
@ 2023-01-18 13:23       ` Jann Horn
  2023-01-18 15:11         ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Jann Horn @ 2023-01-18 13:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, peterz, Ingo Molnar, Will Deacon, akpm,
	michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
> On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > +locking maintainers
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +       up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
>
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.
>
> > So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
> >
> > But I'm not entirely sure about that, the locking folks might know better.
>
> I am not a locking expert but to me it looks like this should work
> because the final cleanup would have to happen rcu_read_unlock.
>
> Thanks, I have completely missed this aspect of the locking when looking
> into the code.
>
> Btw. looking at this again I have fully realized how hard it is actually
> to see that vm_area_free is guaranteed to sync up with ongoing readers.
> vma manipulation functions like __adjust_vma make my head spin. Would it
> make more sense to have a rcu style synchronization point in
> vm_area_free directly before call_rcu? This would add an overhead of
> uncontended down_write of course.

Something along those lines might be a good idea, but I think that
rather than synchronizing the removal, it should maybe be something
that splats (and bails out?) if it detects pending readers. If we get
to vm_area_free() on a VMA that has pending readers, we might already
be in a lot of trouble because the concurrent readers might have been
traversing page tables while we were tearing them down or fun stuff
like that.

I think maybe Suren was already talking about something like that in
another part of this patch series but I don't remember...
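
E.g. something like this (sketch; deliberately leaking the vma when the
check fires):

void vm_area_free(struct vm_area_struct *vma)
{
        /*
         * A pending reader here may already be walking page tables we
         * are about to tear down; better to leak the vma and scream
         * than to free it under them.
         */
        if (WARN_ON_ONCE(rwsem_is_locked(&vma->lock)))
                return;

        call_rcu(&vma->vm_rcu, __vm_area_free);
}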

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-18 13:23       ` Jann Horn
@ 2023-01-18 15:11         ` Michal Hocko
  2023-01-18 17:36           ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18 15:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, peterz, Ingo Molnar, Will Deacon, akpm,
	michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed 18-01-23 14:23:32, Jann Horn wrote:
> On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > +locking maintainers
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +       up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else.
> >
> > Yes, I think you are right. From a look into the code it seems that
> > the UAF is quite unlikely as there is a ton of work to be done between
> > vma_write_lock used to prepare vma for removal and actual removal.
> > That doesn't make it less of a problem though.
> >
> > > So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> > >
> > > But I'm not entirely sure about that, the locking folks might know better.
> >
> > I am not a locking expert but to me it looks like this should work
> > because the final cleanup would have to happen after rcu_read_unlock.
> >
> > Thanks, I have completely missed this aspect of the locking when looking
> > into the code.
> >
> > Btw. looking at this again I have fully realized how hard it is actually
> > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > vma manipulation functions like __adjust_vma make my head spin. Would it
> > make more sense to have a rcu style synchronization point in
> > vm_area_free directly before call_rcu? This would add an overhead of
> > uncontended down_write of course.
> 
> Something along those lines might be a good idea, but I think that
> rather than synchronizing the removal, it should maybe be something
> that splats (and bails out?) if it detects pending readers. If we get
> to vm_area_free() on a VMA that has pending readers, we might already
> be in a lot of trouble because the concurrent readers might have been
> traversing page tables while we were tearing them down or fun stuff
> like that.
> 
> I think maybe Suren was already talking about something like that in
> another part of this patch series but I don't remember...

This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-18 15:11         ` Michal Hocko
@ 2023-01-18 17:36           ` Suren Baghdasaryan
  2023-01-18 21:28             ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 17:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, peterz, Ingo Molnar, Will Deacon, akpm, michel,
	jglisse, vbabka, hannes, mgorman, dave, willy, liam.howlett,
	ldufour, laurent.dufour, paulmck, luto, songliubraving, peterx,
	david, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > +locking maintainers
> > > >
> > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > > [...]
> > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > +{
> > > > > +       up_read(&vma->lock);
> > > > > +}
> > > >
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else.
> > >
> > > Yes, I think you are right. From a look into the code it seems that
> > > the UAF is quite unlikely as there is a ton of work to be done between
> > > vma_write_lock used to prepare vma for removal and actual removal.
> > > That doesn't make it less of a problem though.
> > >
> > > > So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > > >
> > > > But I'm not entirely sure about that, the locking folks might know better.
> > >
> > > I am not a locking expert but to me it looks like this should work
> > > because the final cleanup would have to happen after rcu_read_unlock.
> > >
> > > Thanks, I have completely missed this aspect of the locking when looking
> > > into the code.
> > >
> > > Btw. looking at this again I have fully realized how hard it is actually
> > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > make more sense to have a rcu style synchronization point in
> > > vm_area_free directly before call_rcu? This would add an overhead of
> > > uncontended down_write of course.
> >
> > Something along those lines might be a good idea, but I think that
> > rather than synchronizing the removal, it should maybe be something
> > that splats (and bails out?) if it detects pending readers. If we get
> > to vm_area_free() on a VMA that has pending readers, we might already
> > be in a lot of trouble because the concurrent readers might have been
> > traversing page tables while we were tearing them down or fun stuff
> > like that.
> >
> > I think maybe Suren was already talking about something like that in
> > another part of this patch series but I don't remember...
>
> This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?

Yes, I spent a lot of time ensuring that __adjust_vma locks the right
VMAs and that VMAs are freed or isolated under VMA write lock
protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
Michal mentioned gets hit then it's a bug in my design and I'll have
to fix it. But please, let's not add synchronize_rcu() in the
vm_area_free(). That will slow down any path that frees a VMA,
especially the exit path which might be freeing thousands of them. I
had an SPF version with synchronize_rcu() in the vm_area_free() and
phone vendors started yelling at me the very next day. call_rcu() with
CONFIG_RCU_NOCB_CPU (which Android uses for power saving purposes) is
already bad enough to show up in the benchmarks and that's why I had
to add call_rcu() batching in
https://lore.kernel.org/all/20230109205336.3665937-40-surenb@google.com.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
  2023-01-18 12:50   ` Jann Horn
@ 2023-01-18 17:40     ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 17:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, michel, jglisse, mhocko, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 4:51 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Page fault handlers might need to fire MMU notifications while a new
> > notifier is being registered. Modify mm_take_all_locks to write-lock all
> > VMAs and prevent this race with fault handlers that would hold VMA locks.
> > VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> > locking order as in page fault handlers.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 30c7d1c5206e..a256deca0bc0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> >   * of mm/rmap.c:
> >   *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> >   *     hugetlb mapping);
> > + *   - all vmas marked locked
>
> The existing comment above says that this is an *ordered* listing of
> which locks are taken.
>
> >   *   - all i_mmap_rwsem locks;
> >   *   - all anon_vma->rwseml
> >   *
> > @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> >         mas_for_each(&mas, vma, ULONG_MAX) {
> >                 if (signal_pending(current))
> >                         goto out_unlock;
> > +               vma_write_lock(vma);
> >                 if (vma->vm_file && vma->vm_file->f_mapping &&
> >                                 is_vm_hugetlb_page(vma))
> >                         vm_lock_mapping(mm, vma->vm_file->f_mapping);
>
> Note that multiple VMAs can have the same ->f_mapping, so with this,
> the lock ordering between VMA locks and the mapping locks of hugetlb
> VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
> ->f_mapping, then the following operations happen:
>
> 1. lock VMA 1
> 2. lock mapping of VMAs 1 and 2
> 3. lock VMA 2
> 4. [second vm_lock_mapping() is a no-op]
>
> So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
> took the mapping lock first.
>
> The existing code has one loop per lock type to ensure that the locks
> really are taken in the specified order, even when some of the locks
> are associated with multiple VMAs.
>
> If we don't care about the ordering between these two, maybe that's
> fine and you just have to adjust the comment; but it would be clearer
> to add a separate loop for the VMA locks.

Oh, thanks for pointing out this detail. A separate loop is definitely
needed here. Will do that in the next respin.

>
> > @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> >                 if (vma->vm_file && vma->vm_file->f_mapping)
> >                         vm_unlock_mapping(vma->vm_file->f_mapping);
> >         }
> > +       vma_write_unlock_mm(mm);
> >
> >         mutex_unlock(&mm_all_locks_mutex);
> >  }
> > --
> > 2.39.0
> >

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
  2023-01-18  9:40       ` Michal Hocko
  2023-01-18 12:38         ` Jann Horn
@ 2023-01-18 17:41         ` Suren Baghdasaryan
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 17:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, akpm, michel, jglisse, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <mhocko@suse.com> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.

I'll add Jann's description in the changelog. Thanks Jann!

>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                     retract_page_tables
> vma_read_lock
>                         i_mmap_lock_write
> i_mmap_lock_read
>                         vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18  9:49       ` Michal Hocko
@ 2023-01-18 18:04         ` Suren Baghdasaryan
  2023-01-18 18:34           ` Paul E. McKenney
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 18:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed.
> > >
> > > What kind of regressions.
> > >
> > > > To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > Please add some data to justify this additional complexity.
> >
> > Sorry, should have done that in the first place. A 4.3% regression was
> > noticed when running execl test from unixbench suite. spawn test also
> > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > longer due to call_rcu() which is slow when RCU callback offloading is
> > enabled.
>
> Could you be more specific? vma freeing is async with the RCU so how
> come this has resulted in a regression? Is there any heavy
> rcu_synchronize in the exec path? That would be an interesting
> information.

No, there is no heavy rcu_synchronize() or any other additional
synchronous load in the exit path. It's the call_rcu() which can block
the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
other call_rcu()'s going on in parallel. Note that call_rcu() calls
rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
revealed that this function was taking multiple ms (don't recall the
actual number, sorry). Paul's explanation implied that this happens
due to contention on the locks taken in this function. For more
in-depth details I'll have to ask Paul for help :) This code is quite
complex and I don't know all the details of RCU implementation.


> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
  2023-01-18  9:43       ` Michal Hocko
@ 2023-01-18 18:06         ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 18:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:43 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > > Assert there are no holders of VMA lock for reading when it is about to be
> > > > destroyed.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  include/linux/mm.h | 8 ++++++++
> > > >  kernel/fork.c      | 2 ++
> > > >  2 files changed, 10 insertions(+)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 594e835bad9c..c464fc8a514c 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > >       VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > > >  }
> > > >
> > > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > > +{
> > > > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > > +                   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > > +                   vma);
> > >
> > > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > > us something is wrong on its own, no? This could be somebody racing with
> > > the vma destruction and using the write lock. Unlikely but I do not see
> > > why to narrow debugging scope.
> >
> > I wanted to ensure there are no page fault handlers (read-lockers)
> > when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> > could trigger if someone is concurrently calling vma_write_lock(). But
> > I don't think we expect someone to be write-locking the VMA while we
>
> That would be UAF, no?

Yes. That's why what I have is overkill (and also racy).

>
> > are destroying it, so you are right, I'm overcomplicating things here.
> > I think I can get rid of vma_assert_no_reader() and add
> > VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> > __vm_area_free(). WDYT?
>
> Yes, that adds some debugging. Not sure it is really necessary but it
> is VM_BUG_ON so why not.

I would like to keep it if possible. If it triggers, that would be a
clear signal of what the issue is. Otherwise it might be hard to debug
such a corner case.

> --
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-18  9:23       ` Michal Hocko
@ 2023-01-18 18:09         ` Suren Baghdasaryan
  2023-01-18 21:33           ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > anon_vma_lock_write to match the locking order of page fault handler.
> > >
> > > Does this changelog assumes per vma locking in the #PF?
> >
> > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > but the changelog already talks about that. Maybe I should change it
> > to simply:
> > ```
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked.
>
> Because ....

because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.

>
> Without that additional reasoning it is not really clear why that is
> needed and seems arbitrary.

Would the above be a good reasoning?

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18 18:04         ` Suren Baghdasaryan
@ 2023-01-18 18:34           ` Paul E. McKenney
  2023-01-18 19:01             ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-18 18:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was
> > > noticed when running execl test from unixbench suite. spawn test also
> > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how
> > come this has resulted in a regression? Is there any heavy
> > rcu_synchronize in the exec path? That would be an interesting
> > information.
> 
> No, there is no heavy rcu_synchronize() or any other additional
> synchronous load in the exit path. It's the call_rcu() which can block
> the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> other call_rcu()'s going on in parallel. Note that call_rcu() calls
> rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> revealed that this function was taking multiple ms (don't recall the
> actual number, sorry). Paul's explanation implied that this happens
> due to contention on the locks taken in this function. For more
> in-depth details I'll have to ask Paul for help :) This code is quite
> complex and I don't know all the details of RCU implementation.

There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_LAZY_RCU=y in
order to save power by batching RCU work over multiple call_rcu()
invocations.  If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead.  Doing this gets back
to the old-school non-laziness, but can of course consume more power.
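
To make that first option concrete, here is a minimal, purely
illustrative sketch (not code from this series; the vm_rcu field and
the callback name are assumptions) of opting a single call site out of
the laziness:

static void vm_area_free_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head,
                                        struct vm_area_struct, vm_rcu);

        kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
        free_anon_vma_name(vma);
        /* bypass callback laziness; trades some power for latency */
        call_rcu_hurry(&vma->vm_rcu, vm_area_free_rcu_cb);
}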

Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay.  The reason for this delay is the use of a separate
->nocb_bypass list.  As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist.  This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.

Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

Or have I lost track of some other slow case?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock
       [not found]                 ` <20230118062639.2839-1-hdanton@sina.com>
@ 2023-01-18 18:35                   ` Matthew Wilcox
  0 siblings, 0 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-18 18:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Suren Baghdasaryan, vbabka, hannes, mgorman, peterz, hughd,
	linux-kernel, linux-mm

On Wed, Jan 18, 2023 at 02:26:39PM +0800, Hillf Danton wrote:
> On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > The cpu_relax() is exactly the wrong thing to do here.  See this thread:
> > https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/
> 
> If you are right, feel free to go and remove every cpu_relax() under the
> kernel/locking directory.

I see you didn't read the whole thread where Linus points out that a
cmpxchg() loop is fundamentally different from a spinlock.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18 18:34           ` Paul E. McKenney
@ 2023-01-18 19:01             ` Suren Baghdasaryan
  2023-01-18 20:20               ` Paul E. McKenney
  2023-01-19 12:52               ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 19:01 UTC (permalink / raw)
  To: paulmck
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed.
> > > > >
> > > > > What kind of regressions.
> > > > >
> > > > > > To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > Please add some data to justify this additional complexity.
> > > >
> > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > noticed when running execl test from unixbench suite. spawn test also
> > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > enabled.
> > >
> > > Could you be more specific? vma freeing is async with the RCU so how
> > > come this has resulted in a regression? Is there any heavy
> > > rcu_synchronize in the exec path? That would be an interesting
> > > information.
> >
> > No, there is no heavy rcu_synchronize() or any other additional
> > synchronous load in the exit path. It's the call_rcu() which can block
> > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > revealed that this function was taking multiple ms (don't recall the
> > actual number, sorry). Paul's explanation implied that this happens
> > due to contention on the locks taken in this function. For more
> > in-depth details I'll have to ask Paul for help :) This code is quite
> > complex and I don't know all the details of RCU implementation.
>
> There are a couple of possibilities here.
>
> First, if I am remembering correctly, the time between the call_rcu()
> and invocation of the corresponding callback was taking multiple seconds,
> but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> order to save power by batching RCU work over multiple call_rcu()
> invocations.  If this is causing a problem for a given call site, the
> shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> to the old-school non-laziness, but can of course consume more power.

That would not be the case because CONFIG_LAZY_RCU was not an option
at the time I was profiling this issue.
Lazy RCU would be a great option to replace this patch but
unfortunately it's not the default behavior, so I would still have to
implement this batching in case lazy RCU is not enabled.

>
> Second, there is a much shorter one-jiffy delay between the call_rcu()
> and the invocation of the corresponding callback in kernels built with
> either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> of this delay is to avoid lock contention, and so this delay is incurred
> only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> invoking call_rcu() at least 16 times within a given jiffy will incur
> the added delay.  The reason for this delay is the use of a separate
> ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> lock contention on the main ->cblist.  This is not needed in old-school
> kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> (including most datacenter kernels) because in that case the callbacks
> enqueued by call_rcu() are touched only by the corresponding CPU, so
> that there is no need for locks.

I believe this is the reason in my profiled case.

>
> Third, if you are instead seeing multiple milliseconds of CPU consumed by
> call_rcu() in the common case (for example, without the aid of interrupts,
> NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

I don't think I've seen such a case.
Thanks for clarifications, Paul!

>
> Or have I lost track of some other slow case?
>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18 19:01             ` Suren Baghdasaryan
@ 2023-01-18 20:20               ` Paul E. McKenney
  2023-01-19 12:52               ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-18 20:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed.
> > > > > >
> > > > > > What kind of regressions.
> > > > > >
> > > > > > > To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > Please add some data to justify this additional complexity.
> > > > >
> > > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > > noticed when running execl test from unixbench suite. spawn test also
> > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > > enabled.
> > > >
> > > > Could you be more specific? vma freeing is async with the RCU so how
> > > > come this has resulted in a regression? Is there any heavy
> > > > rcu_synchronize in the exec path? That would be an interesting
> > > > information.
> > >
> > > No, there is no heavy rcu_synchronize() or any other additional
> > > synchronous load in the exit path. It's the call_rcu() which can block
> > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > > revealed that this function was taking multiple ms (don't recall the
> > > actual number, sorry). Paul's explanation implied that this happens
> > > due to contention on the locks taken in this function. For more
> > > in-depth details I'll have to ask Paul for help :) This code is quite
> > > complex and I don't know all the details of RCU implementation.
> >
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.

Whew!!!  ;-)

> Thanks for clarifications, Paul!

No problem!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-18 17:36           ` Suren Baghdasaryan
@ 2023-01-18 21:28             ` Michal Hocko
  2023-01-18 21:45               ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18 21:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Jann Horn, peterz, Ingo Molnar, Will Deacon, akpm, michel,
	jglisse, vbabka, hannes, mgorman, dave, willy, liam.howlett,
	ldufour, laurent.dufour, paulmck, luto, songliubraving, peterx,
	david, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > +locking maintainers
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > locked.
> > > > > [...]
> > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +       up_read(&vma->lock);
> > > > > > +}
> > > > >
> > > > > One thing that might be gnarly here is that I think you might not be
> > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > from what I remember, I think that up_read() (unlike something like
> > > > > spin_unlock()) can access the lock object after it's already been
> > > > > acquired by someone else.
> > > >
> > > > Yes, I think you are right. From a look into the code it seems that
> > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > That doesn't make it less of a problem though.
> > > >
> > > > > So if you want to protect against concurrent
> > > > > deletion, this might have to be something like:
> > > > >
> > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > up_read(&vma->lock);
> > > > > rcu_read_unlock();
> > > > >
> > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > >
> > > > I am not a locking expert but to me it looks like this should work
> > > > because the final cleanup would have to happen rcu_read_unlock.
> > > >
> > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > into the code.
> > > >
> > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > make more sense to have a rcu style synchronization point in
> > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > uncontended down_write of course.
> > >
> > > Something along those lines might be a good idea, but I think that
> > > rather than synchronizing the removal, it should maybe be something
> > > that splats (and bails out?) if it detects pending readers. If we get
> > > to vm_area_free() on a VMA that has pending readers, we might already
> > > be in a lot of trouble because the concurrent readers might have been
> > > traversing page tables while we were tearing them down or fun stuff
> > > like that.
> > >
> > > I think maybe Suren was already talking about something like that in
> > > another part of this patch series but I don't remember...
> >
> > This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?
> 
> Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> VMAs and that VMAs are freed or isolated under VMA write lock
> protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> Michal mentioned gets hit then it's a bug in my design and I'll have
> to fix it. But please, let's not add synchronize_rcu() in the
> vm_area_free().

Just to clarify. I didn't suggest adding synchronize_rcu into
vm_area_free. What I really meant was a synchronize_rcu-like primitive
to effectively synchronize with any potential pending read locker (so
something like vma_write_lock, or whatever it is called). The point is
that vma freeing is an event all readers should be notified about.
This can be done explicitly for each and every vma before vm_area_free
is called, but that is hard to review and easy to break over time.
See my point?
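
To make the idea concrete, here is a purely illustrative sketch (the
helper name is made up): an uncontended write-lock/unlock cycle on the
per-VMA lock right before the final free, so that any pending read
locker has provably drained by the time the structure is handed to RCU:

static void vma_synchronize_readers(struct vm_area_struct *vma)
{
        /* uncontended in the common case; waits only for pending readers */
        down_write(&vma->lock);
        up_write(&vma->lock);
}

vm_area_free() could call this once, right before its call_rcu(),
instead of relying on every code path having write-locked the VMA
earlier.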

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
  2023-01-18  2:44       ` Matthew Wilcox
@ 2023-01-18 21:33         ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 21:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 6:44 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > > page fault handling. When VMA is not found, can't be locked or changes
> > > > after being locked, the function returns NULL. The lookup is performed
> > > > under RCU protection to prevent the found VMA from being destroyed before
> > > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > > the results.
> > > > For now only anonymous VMAs can be searched this way. In other cases the
> > > > function returns NULL.
> > >
> > > Could you describe why only anonymous vmas are handled at this stage and
> > > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > > doesn't seem to have any anonymous vma specific requirements AFAICS.
> >
> > TBH I haven't spent too much time looking into file-backed page faults
> > yet but a couple of tasks I can think of are:
> > - Ensure that all vma->vm_ops->fault() handlers do not rely on
> > mmap_lock being read-locked;
>
> I think this way lies madness.  There are just too many device drivers
> that implement ->fault.  My plan is to call the ->map_pages() method
> under RCU without even read-locking the VMA.  If that doesn't satisfy
> the fault, then drop all the way back to taking the mmap_sem for read
> before calling into ->fault.

Sounds reasonable to me but I guess the devil is in the details...
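
For context, the shape of the arch-side fast path this helper enables
could look roughly like the sketch below; lock_vma_under_rcu(),
vma_read_unlock() and FAULT_FLAG_VMA_LOCK come from this series, while
the wrapper function and its fallback convention are assumptions made
for illustration:

static vm_fault_t do_page_fault_vma_locked(struct mm_struct *mm,
                                           unsigned long address,
                                           unsigned int flags,
                                           struct pt_regs *regs)
{
        struct vm_area_struct *vma;
        vm_fault_t fault;

        vma = lock_vma_under_rcu(mm, address);
        if (!vma)
                return VM_FAULT_RETRY; /* caller falls back to mmap_lock */

        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK,
                                regs);
        vma_read_unlock(vma);
        return fault;
}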

>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-18 18:09         ` Suren Baghdasaryan
@ 2023-01-18 21:33           ` Michal Hocko
  2023-01-18 21:48             ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-18 21:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > >
> > > > Does this changelog assumes per vma locking in the #PF?
> > >
> > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > but the changelog already talks about that. Maybe I should change it
> > > to simply:
> > > ```
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked.
> >
> > Because ....
> 
> because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.

So it will become:
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.

which is effectively saying
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done under VMA write-lock protection so move VMA flag modifications
before so all of them are covered by the same write protection.

right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-18 21:28             ` Michal Hocko
@ 2023-01-18 21:45               ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 21:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, peterz, Ingo Molnar, Will Deacon, akpm, michel,
	jglisse, vbabka, hannes, mgorman, dave, willy, liam.howlett,
	ldufour, laurent.dufour, paulmck, luto, songliubraving, peterx,
	david, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
	lstoakes, peterjung1337, rientjes, axelrasmussen, joelaf,
	minchan, shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy,
	soheil, hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > > +locking maintainers
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > > locked.
> > > > > > [...]
> > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +       up_read(&vma->lock);
> > > > > > > +}
> > > > > >
> > > > > > One thing that might be gnarly here is that I think you might not be
> > > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > > from what I remember, I think that up_read() (unlike something like
> > > > > > spin_unlock()) can access the lock object after it's already been
> > > > > > acquired by someone else.
> > > > >
> > > > > Yes, I think you are right. From a look into the code it seems that
> > > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > > That doesn't make it less of a problem though.
> > > > >
> > > > > > So if you want to protect against concurrent
> > > > > > deletion, this might have to be something like:
> > > > > >
> > > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > > up_read(&vma->lock);
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > > >
> > > > > I am not a locking expert but to me it looks like this should work
> > > > > because the final cleanup would have to happen rcu_read_unlock.
> > > > >
> > > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > > into the code.
> > > > >
> > > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > > make more sense to have a rcu style synchronization point in
> > > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > > uncontended down_write of course.
> > > >
> > > > Something along those lines might be a good idea, but I think that
> > > > rather than synchronizing the removal, it should maybe be something
> > > > that splats (and bails out?) if it detects pending readers. If we get
> > > > to vm_area_free() on a VMA that has pending readers, we might already
> > > > be in a lot of trouble because the concurrent readers might have been
> > > > traversing page tables while we were tearing them down or fun stuff
> > > > like that.
> > > >
> > > > I think maybe Suren was already talking about something like that in
> > > > another part of this patch series but I don't remember...
> > >
> > > This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?
> >
> > Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> > VMAs and that VMAs are freed or isolated under VMA write lock
> > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> > Michal mentioned gets hit then it's a bug in my design and I'll have
> > to fix it. But please, let's not add synchronize_rcu() in the
> > vm_area_free().
>
> Just to clarify. I didn't suggest to add synchronize_rcu into
> vm_area_free. What I really meant was synchronize_rcu like primitive to
> effectively synchronize with any potential pending read locker (so
> something like vma_write_lock (or whatever it is called). The point is
> that vma freeing is an event all readers should be notified about.

I don't think readers need to be notified if we are ensuring that the
VMA is not used by anyone else and is not reachable by the readers.
This is currently done by write-locking the VMA either before removing
it from the tree or before freeing it.

> This can be done explicitly for each and every vma before vm_area_free
> is called but this is just hard to review and easy to break over time.
> See my point?

I understand your point now, and if we really need that, one way would
be to have a VMA refcount (like Laurent had in his version of the SPF
implementation). I don't think the current implementation needs that
level of VMA lifetime control unless I missed some location that should
take the lock and does not.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-18 21:33           ` Michal Hocko
@ 2023-01-18 21:48             ` Suren Baghdasaryan
  2023-01-19  9:31               ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-18 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > > >
> > > > > Does this changelog assumes per vma locking in the #PF?
> > > >
> > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > > but the changelog already talks about that. Maybe I should change it
> > > > to simply:
> > > > ```
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > > has been locked.
> > >
> > > Because ....
> >
> > because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
>
> So it will become:
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.
>
> which is effectively saying
> vma_adjust_trans_huge() modifies the VMA and such modifications should
> be done under VMA write-lock protection so move VMA flag modifications
> before so all of them are covered by the same write protection.
>
> right?

Yes, and the wording in the latter version is simpler to understand
IMO, so I would like to adopt it. Do you agree?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-18 21:48             ` Suren Baghdasaryan
@ 2023-01-19  9:31               ` Michal Hocko
  2023-01-19 18:53                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-19  9:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > So it will become:
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
> >
> > which is effectively saying
> > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > be done under VMA write-lock protection so move VMA flag modifications
> > before so all of them are covered by the same write protection.
> >
> > right?
> 
> Yes, and the wording in the latter version is simpler to understand
> IMO, so I would like to adopt it. Do you agree?

of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-18 19:01             ` Suren Baghdasaryan
  2023-01-18 20:20               ` Paul E. McKenney
@ 2023-01-19 12:52               ` Michal Hocko
  2023-01-19 19:17                 ` Paul E. McKenney
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-19 12:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: paulmck, akpm, michel, jglisse, vbabka, hannes, mgorman, dave,
	willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> >
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.
> Thanks for clarifications, Paul!

Thanks for the explanation, Paul. I have to say this has caught me by
surprise. There are just not enough details about the benchmark to
understand what is going on, but I find it rather surprising that
call_rcu can induce a higher overhead than the actual kmem_cache_free
which is the callback. My naive understanding has been that call_rcu
is a really fast way to defer the execution to an RCU-safe context for
the final cleanup.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-09 20:53 ` [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
  2023-01-17 15:57   ` Michal Hocko
@ 2023-01-19 12:59   ` Michal Hocko
  2023-01-19 18:52     ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-19 12:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

After some more clarification I can understand how call_rcu might not
be super happy about thousands of callbacks being invoked, and I do
agree that this is not really optimal.

On the other hand I do not like this solution much either.
VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
much with processes with a huge number of vmas either. It would still
mean thousands of callbacks being scheduled without a good reason.

Instead, are there any other cases than remove_vma that need this
batching? We could easily just link all the vmas into a linked list and
use a single call_rcu instead, no? This would both simplify the
implementation and remove the scaling issue, and we would not have to
argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
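
A rough sketch of that single-call_rcu() idea (not the posted patch;
the vm_free and vm_rcu fields in vm_area_struct and the llist-based
vma_free_list in mm_struct are assumptions made only for illustration)
could look like this:

void vm_area_free(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;

        free_anon_vma_name(vma);
        /* lockless add, no throttling or size tracking needed */
        llist_add(&vma->vm_free, &mm->vma_free_list);
}

static void free_vmas_rcu_cb(struct rcu_head *head)
{
        /* the batch's rcu_head lives in its first entry */
        struct vm_area_struct *first = container_of(head,
                                        struct vm_area_struct, vm_rcu);
        struct vm_area_struct *vma, *next;

        llist_for_each_entry_safe(vma, next, &first->vm_free, vm_free)
                kmem_cache_free(vm_area_cachep, vma);
}

/* one call_rcu() per mm, issued once from the teardown path */
void drain_free_vmas(struct mm_struct *mm)
{
        struct llist_node *batch = llist_del_all(&mm->vma_free_list);
        struct vm_area_struct *first;

        if (!batch)
                return;
        first = llist_entry(batch, struct vm_area_struct, vm_free);
        call_rcu(&first->vm_rcu, free_vmas_rcu_cb);
}

The open question with this shape is, of course, when to call
drain_free_vmas() for long-lived processes that keep unmapping.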

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 12:59   ` Michal Hocko
@ 2023-01-19 18:52     ` Suren Baghdasaryan
  2023-01-19 19:20       ` Paul E. McKenney
  2023-01-20  8:52       ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-19 18:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> After some more clarification I can understand how call_rcu might not be
> super happy about thousands of callbacks to be invoked and I do agree
> that this is not really optimal.
>
> On the other hand I do not like this solution much either.
> VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> much with processes with a huge number of vmas either. It would still be
> in housands of callbacks to be scheduled without a good reason.
>
> Instead, are there any other cases than remove_vma that need this
> batching? We could easily just link all the vmas into linked list and
> use a single call_rcu instead, no? This would both simplify the
> implementation, remove the scaling issue as well and we do not have to
> argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.

Yes, I agree the solution is not stellar. I wanted something simple
but this is probably too simple. OTOH keeping all dead vm_area_structs
on the list without hooking up a shrinker (additional complexity) does
not sound too appealing either. WDYT about time-domain throttling to
limit draining the list to, say, once per second, like this:

void vm_area_free(struct vm_area_struct *vma)
{
       struct mm_struct *mm = vma->vm_mm;
       bool drain;

       free_anon_vma_name(vma);

       spin_lock(&mm->vma_free_list.lock);
       list_add(&vma->vm_free_list, &mm->vma_free_list.head);
       mm->vma_free_list.size++;
-       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+       drain = jiffies > mm->last_drain_tm + HZ;

       spin_unlock(&mm->vma_free_list.lock);

-       if (drain)
+       if (drain) {
              drain_free_vmas(mm);
+             mm->last_drain_tm = jiffies;
+       }
}
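
The mm_struct side this assumes would look roughly like below
(vma_free_list is the field already added by this patch, last_drain_tm
would be the new addition; names are illustrative):

	struct {			/* in struct mm_struct */
		struct list_head head;
		spinlock_t lock;
		int size;
	} vma_free_list;
	unsigned long last_drain_tm;	/* jiffies at the last drain */

Also, time_after(jiffies, mm->last_drain_tm + HZ) would be the
wrap-safe way to write that comparison.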

Ultimately we want to prevent very frequent call_rcu() calls, so
throttling in the time domain seems appropriate. That's the simplest
way I can think of to address your concern about a quick spike in VMA
freeing. It does not place any restriction on the list size, and we
might have excessive dead vm_area_structs if, after a large spike, there
are no further vm_area_free() calls, but I don't know whether that's a
real problem, so I'm not sure we should be addressing it at this time. WDYT?


>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
  2023-01-19  9:31               ` Michal Hocko
@ 2023-01-19 18:53                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-19 18:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 1:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > So it will become:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > > modifications should be done under VMA write-lock protection.
> > >
> > > which is effectivelly saying
> > > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > > be done under VMA write-lock protection so move VMA flag modifications
> > > before so all of them are covered by the same write protection.
> > >
> > > right?
> >
> > Yes, and the wording in the latter version is simpler to understand
> > IMO, so I would like to adopt it. Do you agree?
>
> of course.

Will update in the next respin. Thanks!

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 12:52               ` Michal Hocko
@ 2023-01-19 19:17                 ` Paul E. McKenney
  2023-01-20  8:57                   ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-19 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, luto, songliubraving, peterx, david, dhowells,
	hughd, bigeasy, kent.overstreet, punit.agrawal, lstoakes,
	peterjung1337, rientjes, axelrasmussen, joelaf, minchan, jannh,
	shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy, soheil,
	hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> [...]
> > > There are a couple of possibilities here.
> > >
> > > First, if I am remembering correctly, the time between the call_rcu()
> > > and invocation of the corresponding callback was taking multiple seconds,
> > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > order to save power by batching RCU work over multiple call_rcu()
> > > invocations.  If this is causing a problem for a given call site, the
> > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > to the old-school non-laziness, but can of course consume more power.
> > 
> > That would not be the case because CONFIG_LAZY_RCU was not an option
> > at the time I was profiling this issue.
> > Laxy RCU would be a great option to replace this patch but
> > unfortunately it's not the default behavior, so I would still have to
> > implement this batching in case lazy RCU is not enabled.
> > 
> > >
> > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > and the invocation of the corresponding callback in kernels built with
> > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > of this delay is to avoid lock contention, and so this delay is incurred
> > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > the added delay.  The reason for this delay is the use of a separate
> > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > lock contention on the main ->cblist.  This is not needed in old-school
> > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > (including most datacenter kernels) because in that case the callbacks
> > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > that there is no need for locks.
> > 
> > I believe this is the reason in my profiled case.
> > 
> > >
> > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > 
> > I don't think I've seen such a case.
> > Thanks for clarifications, Paul!
> 
> Thanks for the explanation Paul. I have to say this has caught me as a
> surprise. There are just not enough details about the benchmark to
> understand what is going on but I find it rather surprising that
> call_rcu can induce a higher overhead than the actual kmem_cache_free
> which is the callback. My naive understanding has been that call_rcu is
> really fast way to defer the execution to the RCU safe context to do the
> final cleanup.

If I am following along correctly (ha!), then your "induce a higher
overhead" should be something like "induce a higher to-kfree() latency".

Of course, there already is a higher latency-to-kfree via call_rcu()
than via a direct call to kfree(), and callback-offload CPUs that are
being flooded with callbacks raise that latency a jiffy or so more in
order to avoid lock contention.

If this becomes a problem, the callback-offloading code can be a bit
smarter about avoiding lock contention, but I need to see a real problem
before I make that change.  But if there is a real problem I will of
course fix it.

Or did I miss a turn in this discussion?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 18:52     ` Suren Baghdasaryan
@ 2023-01-19 19:20       ` Paul E. McKenney
  2023-01-19 19:47         ` Suren Baghdasaryan
  2023-01-20  8:52       ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-19 19:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in housands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either. WDYT about time domain throttling to
> limit draining the list to say once per second like this:
> 
> void vm_area_free(struct vm_area_struct *vma)
> {
>        struct mm_struct *mm = vma->vm_mm;
>        bool drain;
> 
>        free_anon_vma_name(vma);
> 
>        spin_lock(&mm->vma_free_list.lock);
>        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
>        mm->vma_free_list.size++;
> -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> +       drain = jiffies > mm->last_drain_tm + HZ;
> 
>        spin_unlock(&mm->vma_free_list.lock);
> 
> -       if (drain)
> +       if (drain) {
>               drain_free_vmas(mm);
> +             mm->last_drain_tm = jiffies;
> +       }
> }
> 
> Ultimately we want to prevent very frequent call_rcu() calls, so
> throttling in the time domain seems appropriate. That's the simplest
> way I can think of to address your concern about a quick spike in VMA
> freeing. It does not place any restriction on the list size and we
> might have excessive dead vm_area_structs if after a large spike there
> are no vm_area_free() calls but I don't know if that's a real problem,
> so not sure we should be addressing it at this time. WDYT?

Just to double-check, we really did try the very frequent call_rcu()
invocations and we really did see a problem, correct?

Although it is not perfect, call_rcu() is designed to take a fair amount
of abuse.  So if we didn't see a real problem, the frequent call_rcu()
invocations might be a bit simpler.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 19:20       ` Paul E. McKenney
@ 2023-01-19 19:47         ` Suren Baghdasaryan
  2023-01-19 19:55           ` Paul E. McKenney
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-19 19:47 UTC (permalink / raw)
  To: paulmck
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > in housands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either. WDYT about time domain throttling to
> > limit draining the list to say once per second like this:
> >
> > void vm_area_free(struct vm_area_struct *vma)
> > {
> >        struct mm_struct *mm = vma->vm_mm;
> >        bool drain;
> >
> >        free_anon_vma_name(vma);
> >
> >        spin_lock(&mm->vma_free_list.lock);
> >        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> >        mm->vma_free_list.size++;
> > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > +       drain = jiffies > mm->last_drain_tm + HZ;
> >
> >        spin_unlock(&mm->vma_free_list.lock);
> >
> > -       if (drain)
> > +       if (drain) {
> >               drain_free_vmas(mm);
> > +             mm->last_drain_tm = jiffies;
> > +       }
> > }
> >
> > Ultimately we want to prevent very frequent call_rcu() calls, so
> > throttling in the time domain seems appropriate. That's the simplest
> > way I can think of to address your concern about a quick spike in VMA
> > freeing. It does not place any restriction on the list size and we
> > might have excessive dead vm_area_structs if after a large spike there
> > are no vm_area_free() calls but I don't know if that's a real problem,
> > so not sure we should be addressing it at this time. WDYT?
>
> Just to double-check, we really did try the very frequent call_rcu()
> invocations and we really did see a problem, correct?

Correct. More specifically, with CONFIG_RCU_NOCB_CPU=y we saw
regressions when a process exits and all its VMAs get destroyed,
causing a flood of call_rcu() calls.

>
> Although it is not perfect, call_rcu() is designed to take a fair amount
> of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> invocations might be a bit simpler.
>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 19:47         ` Suren Baghdasaryan
@ 2023-01-19 19:55           ` Paul E. McKenney
  0 siblings, 0 replies; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-19 19:55 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, willy, liam.howlett, peterz, ldufour, laurent.dufour, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu, Jan 19, 2023 at 11:47:36AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in housands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either. WDYT about time domain throttling to
> > > limit draining the list to say once per second like this:
> > >
> > > void vm_area_free(struct vm_area_struct *vma)
> > > {
> > >        struct mm_struct *mm = vma->vm_mm;
> > >        bool drain;
> > >
> > >        free_anon_vma_name(vma);
> > >
> > >        spin_lock(&mm->vma_free_list.lock);
> > >        list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> > >        mm->vma_free_list.size++;
> > > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > > +       drain = jiffies > mm->last_drain_tm + HZ;
> > >
> > >        spin_unlock(&mm->vma_free_list.lock);
> > >
> > > -       if (drain)
> > > +       if (drain) {
> > >               drain_free_vmas(mm);
> > > +             mm->last_drain_tm = jiffies;
> > > +       }
> > > }
> > >
> > > Ultimately we want to prevent very frequent call_rcu() calls, so
> > > throttling in the time domain seems appropriate. That's the simplest
> > > way I can think of to address your concern about a quick spike in VMA
> > > freeing. It does not place any restriction on the list size and we
> > > might have excessive dead vm_area_structs if after a large spike there
> > > are no vm_area_free() calls but I don't know if that's a real problem,
> > > so not sure we should be addressing it at this time. WDYT?
> >
> > Just to double-check, we really did try the very frequent call_rcu()
> > invocations and we really did see a problem, correct?
> 
> Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
> regressions when a process exits and all its VMAs get destroyed,
> causing a flood of call_rcu()'s.

Thank you for the reminder, real problem needs solution.  ;-)

							Thanx, Paul

> > Although it is not perfect, call_rcu() is designed to take a fair amount
> > of abuse.  So if we didn't see a real problem, the frequent call_rcu()
> > invocations might be a bit simpler.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 18:52     ` Suren Baghdasaryan
  2023-01-19 19:20       ` Paul E. McKenney
@ 2023-01-20  8:52       ` Michal Hocko
  2023-01-20 16:20         ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-20  8:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in housands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either.

I suspect you have missed my idea. I do not really want to keep the list
around or any shrinker. It is dead simple. Collect all vmas in
remove_vma and then call_rcu the whole list at once after the whole list
has been collected (be it from exit_mmap or remove_mt). See?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-19 19:17                 ` Paul E. McKenney
@ 2023-01-20  8:57                   ` Michal Hocko
  2023-01-20 16:08                     ` Paul E. McKenney
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-20  8:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, luto, songliubraving, peterx, david, dhowells,
	hughd, bigeasy, kent.overstreet, punit.agrawal, lstoakes,
	peterjung1337, rientjes, axelrasmussen, joelaf, minchan, jannh,
	shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy, soheil,
	hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [...]
> > > > There are a couple of possibilities here.
> > > >
> > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > order to save power by batching RCU work over multiple call_rcu()
> > > > invocations.  If this is causing a problem for a given call site, the
> > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > to the old-school non-laziness, but can of course consume more power.
> > > 
> > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > at the time I was profiling this issue.
> > > Laxy RCU would be a great option to replace this patch but
> > > unfortunately it's not the default behavior, so I would still have to
> > > implement this batching in case lazy RCU is not enabled.
> > > 
> > > >
> > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > and the invocation of the corresponding callback in kernels built with
> > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > the added delay.  The reason for this delay is the use of a separate
> > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > (including most datacenter kernels) because in that case the callbacks
> > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > that there is no need for locks.
> > > 
> > > I believe this is the reason in my profiled case.
> > > 
> > > >
> > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > 
> > > I don't think I've seen such a case.
> > > Thanks for clarifications, Paul!
> > 
> > Thanks for the explanation Paul. I have to say this has caught me as a
> > surprise. There are just not enough details about the benchmark to
> > understand what is going on but I find it rather surprising that
> > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > which is the callback. My naive understanding has been that call_rcu is
> > really fast way to defer the execution to the RCU safe context to do the
> > final cleanup.
> 
> If I am following along correctly (ha!), then your "induce a higher
> overhead" should be something like "induce a higher to-kfree() latency".

Yes, this is expected.

> Of course, there already is a higher latency-to-kfree via call_rcu()
> than via a direct call to kfree(), and callback-offload CPUs that are
> being flooded with callbacks raise that latency a jiffy or so more in
> order to avoid lock contention.
> 
> If this becomes a problem, the callback-offloading code can be a bit
> smarter about avoiding lock contention, but need to see a real problem
> before I make that change.  But if there is a real problem I will of
> course fix it.

I believe that Suren claims that the call_rcu is really visible in the
exit_mmap case. Time-to-free actual vmas shouldn't really be material
for that path. If that happens much later on, there could be some
side effects from increased memory consumption, but that should be
marginal. How fast exit_mmap really is should only depend on direct
calls from that path.

But I guess we need some specific numbers from Suren to be sure what is
going on here.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20  8:57                   ` Michal Hocko
@ 2023-01-20 16:08                     ` Paul E. McKenney
  0 siblings, 0 replies; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-20 16:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, michel, jglisse, vbabka, hannes,
	mgorman, dave, willy, liam.howlett, peterz, ldufour,
	laurent.dufour, luto, songliubraving, peterx, david, dhowells,
	hughd, bigeasy, kent.overstreet, punit.agrawal, lstoakes,
	peterjung1337, rientjes, axelrasmussen, joelaf, minchan, jannh,
	shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy, soheil,
	hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
> On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > [...]
> > > > > There are a couple of possibilities here.
> > > > >
> > > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > > order to save power by batching RCU work over multiple call_rcu()
> > > > > invocations.  If this is causing a problem for a given call site, the
> > > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > > to the old-school non-laziness, but can of course consume more power.
> > > > 
> > > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > > at the time I was profiling this issue.
> > > > Laxy RCU would be a great option to replace this patch but
> > > > unfortunately it's not the default behavior, so I would still have to
> > > > implement this batching in case lazy RCU is not enabled.
> > > > 
> > > > >
> > > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > > and the invocation of the corresponding callback in kernels built with
> > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > > the added delay.  The reason for this delay is the use of a separate
> > > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > > (including most datacenter kernels) because in that case the callbacks
> > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > > that there is no need for locks.
> > > > 
> > > > I believe this is the reason in my profiled case.
> > > > 
> > > > >
> > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > > 
> > > > I don't think I've seen such a case.
> > > > Thanks for clarifications, Paul!
> > > 
> > > Thanks for the explanation Paul. I have to say this has caught me as a
> > > surprise. There are just not enough details about the benchmark to
> > > understand what is going on but I find it rather surprising that
> > > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > > which is the callback. My naive understanding has been that call_rcu is
> > > really fast way to defer the execution to the RCU safe context to do the
> > > final cleanup.
> > 
> > If I am following along correctly (ha!), then your "induce a higher
> > overhead" should be something like "induce a higher to-kfree() latency".
> 
> Yes, this is expected.
> 
> > Of course, there already is a higher latency-to-kfree via call_rcu()
> > than via a direct call to kfree(), and callback-offload CPUs that are
> > being flooded with callbacks raise that latency a jiffy or so more in
> > order to avoid lock contention.
> > 
> > If this becomes a problem, the callback-offloading code can be a bit
> > smarter about avoiding lock contention, but need to see a real problem
> > before I make that change.  But if there is a real problem I will of
> > course fix it.
> 
> I believe that Suren claims that the call_rcu is really visible in the
> exit_mmap case. Time-to-free actual vmas shouldn't really be material
> for that path. If that happens much more later on there could be some
> side effects by an increased memory consumption but that should be
> marginal. How fast exit_mmap really is should only depend on direct
> calls from that path.
> 
> But I guess we need some specific numbers from Suren to be sure what is
> going on here.

Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful.  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20  8:52       ` Michal Hocko
@ 2023-01-20 16:20         ` Suren Baghdasaryan
  2023-01-20 16:45           ` Suren Baghdasaryan
  2023-01-23  9:59           ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-20 16:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > in housands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either.
>
> I suspect you have missed my idea. I do not really want to keep the list
> around or any shrinker. It is dead simple. Collect all vmas in
> remove_vma and then call_rcu the whole list at once after the whole list
> (be it from exit_mmap or remove_mt). See?

Yes, I understood your idea but keeping dead objects until the process
exits even when the system is low on memory (no shrinkers attached)
seems too wasteful. If we do this I would advocate for attaching a
shrinker.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 16:20         ` Suren Baghdasaryan
@ 2023-01-20 16:45           ` Suren Baghdasaryan
  2023-01-20 16:49             ` Matthew Wilcox
  2023-01-23  9:59           ` Michal Hocko
  1 sibling, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-20 16:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in housands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
>
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.

Maybe even simpler: since we are hit with this VMA-freeing flood
during exit_mmap (when all VMAs are destroyed), we could pass a hint to
vm_area_free to batch the destruction there, while all other cases keep
calling call_rcu()? I don't think there will be other cases of VMA
destruction floods.
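
Roughly along these lines (sketch only; the extra "batch" parameter and
the vm_area_free_rcu_cb callback name are hypothetical, while the
vma_free_list fields are the ones from the current patch):

void vm_area_free(struct vm_area_struct *vma, bool batch)
{
	struct mm_struct *mm = vma->vm_mm;

	free_anon_vma_name(vma);

	if (!batch) {
		/* munmap() and friends: one RCU callback per VMA. */
		call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
		return;
	}

	/* exit_mmap(): just queue; drained once at the end of exit. */
	spin_lock(&mm->vma_free_list.lock);
	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
	mm->vma_free_list.size++;
	spin_unlock(&mm->vma_free_list.lock);
}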

>
> >
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 16:45           ` Suren Baghdasaryan
@ 2023-01-20 16:49             ` Matthew Wilcox
  2023-01-20 17:08               ` Liam R. Howlett
  2023-01-20 17:21               ` Paul E. McKenney
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-20 16:49 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, liam.howlett, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > in housands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
> 
> Maybe even simpler, since we are hit with this VMA freeing flood
> during exit_mmap (when all VMAs are destroyed), we pass a hint to
> vm_area_free to batch the destruction and all other cases call
> call_rcu()? I don't think there will be other cases of VMA destruction
> floods.

... or have two different call_rcu functions; one for munmap() and
one for exit.  It'd be nice to use kmem_cache_free_bulk().
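
A rough sketch of how the exit-side variant could look with bulk
freeing (the vma_free_batch container and its size are made up here;
only kmem_cache_free_bulk() itself is the real API):

struct vma_free_batch {
	struct rcu_head rcu;
	unsigned int nr;
	void *vmas[32];			/* arbitrary batch size */
};

static void vma_free_batch_rcu(struct rcu_head *head)
{
	struct vma_free_batch *b = container_of(head, struct vma_free_batch, rcu);

	/* Return the whole batch to the slab in one call. */
	kmem_cache_free_bulk(vm_area_cachep, b->nr, b->vmas);
	kfree(b);
}

The wrinkle is that the batch container itself has to be allocated from
somewhere on the exit path, so chaining the VMAs through a list as the
patch already does may end up simpler.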

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 16:49             ` Matthew Wilcox
@ 2023-01-20 17:08               ` Liam R. Howlett
  2023-01-20 17:17                 ` Suren Baghdasaryan
  2023-01-20 17:21               ` Paul E. McKenney
  1 sibling, 1 reply; 186+ messages in thread
From: Liam R. Howlett @ 2023-01-20 17:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

* Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> > 
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
> 
> ... or have two different call_rcu functions; one for munmap() and
> one for exit.  It'd be nice to use kmem_cache_free_bulk().

Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
Once we have obtained the write lock again, I think it's safe to say we
can just go ahead and free the VMAs directly.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:08               ` Liam R. Howlett
@ 2023-01-20 17:17                 ` Suren Baghdasaryan
  2023-01-20 17:32                   ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-20 17:17 UTC (permalink / raw)
  To: Liam R. Howlett, Matthew Wilcox, Suren Baghdasaryan,
	Michal Hocko, akpm, michel, jglisse, vbabka, hannes, mgorman,
	dave, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit.  It'd be nice to use kmem_cache_free_bulk().
>
> Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> Once we have obtained the write lock again, I think it's safe to say we
> can just go ahead and free the VMAs directly.

I think that would still be racy if the page fault handler found that
VMA under RCU read protection but did not lock it yet (no locks are
held yet). If it's preempted, the VMA can be freed and destroyed from
under it without an RCU grace period.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 16:49             ` Matthew Wilcox
  2023-01-20 17:08               ` Liam R. Howlett
@ 2023-01-20 17:21               ` Paul E. McKenney
  2023-01-20 18:42                 ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Paul E. McKenney @ 2023-01-20 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, luto, songliubraving, peterx, david, dhowells,
	hughd, bigeasy, kent.overstreet, punit.agrawal, lstoakes,
	peterjung1337, rientjes, axelrasmussen, joelaf, minchan, jannh,
	shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy, soheil,
	hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> > 
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
> 
> ... or have two different call_rcu functions; one for munmap() and
> one for exit.  It'd be nice to use kmem_cache_free_bulk().

Good point, kfree_rcu(p, r), where "r" is the name of the structure's
rcu_head field, is much more cache-efficient.

The penalty is that there is no callback function to do any cleanup.
There is just a kfree()/kvfree() (the bulk version where applicable),
nothing else.
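
For illustration, the two forms look roughly like this (sketch only;
"vm_rcu" is whatever the rcu_head field in vm_area_struct ends up being
called, and this assumes a plain kfree() of the object is acceptable):

	/* call_rcu(): one callback invocation per object */
	static void vm_area_free_rcu_cb(struct rcu_head *head)
	{
		struct vm_area_struct *vma = container_of(head,
					struct vm_area_struct, vm_rcu);

		/* arbitrary cleanup could go here */
		kmem_cache_free(vm_area_cachep, vma);
	}

	/* at the free site: */
	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);

	/* kfree_rcu(): no callback, pointers are batched and freed in bulk */
	kfree_rcu(vma, vm_rcu);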

							Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:17                 ` Suren Baghdasaryan
@ 2023-01-20 17:32                   ` Matthew Wilcox
  2023-01-20 17:50                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-20 17:32 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > >
> > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > that this is not really optimal.
> > > > > > > >
> > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > >
> > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > >
> > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > not sound too appealing either.
> > > > > >
> > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > (be it from exit_mmap or remove_mt). See?
> > > > >
> > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > shrinker.
> > > >
> > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > vm_area_free to batch the destruction and all other cases call
> > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > floods.
> > >
> > > ... or have two different call_rcu functions; one for munmap() and
> > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> >
> > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > Once we have obtained the write lock again, I think it's safe to say we
> > can just go ahead and free the VMAs directly.
> 
> I think that would be still racy if the page fault handler found that
> VMA under read-RCU protection but did not lock it yet (no locks are
> held yet). If it's preempted, the VMA can be freed and destroyed from
> under it without RCU grace period.

The page fault handler (or whatever other reader -- ptrace, proc, etc)
should have a refcount on the mm_struct, so we can't be in this path
trying to free VMAs.  Right?

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:32                   ` Matthew Wilcox
@ 2023-01-20 17:50                     ` Suren Baghdasaryan
  2023-01-20 19:23                       ` Liam R. Howlett
  2023-01-23  9:56                       ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-20 17:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Liam R. Howlett, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > >
> > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > that this is not really optimal.
> > > > > > > > >
> > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > > >
> > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > >
> > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > not sound too appealing either.
> > > > > > >
> > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > >
> > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > shrinker.
> > > > >
> > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > vm_area_free to batch the destruction and all other cases call
> > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > floods.
> > > >
> > > > ... or have two different call_rcu functions; one for munmap() and
> > > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> > >
> > > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > Once we have obtained the write lock again, I think it's safe to say we
> > > can just go ahead and free the VMAs directly.
> >
> > I think that would be still racy if the page fault handler found that
> > VMA under read-RCU protection but did not lock it yet (no locks are
> > held yet). If it's preempted, the VMA can be freed and destroyed from
> > under it without RCU grace period.
>
> The page fault handler (or whatever other reader -- ptrace, proc, etc)
> should have a refcount on the mm_struct, so we can't be in this path
> trying to free VMAs.  Right?

Hmm. That sounds right. I checked process_mrelease() as well, which
operates on the mm with only mmgrab()+mmap_read_lock(), but it only
unmaps VMAs without freeing them, so we are still good. Michal, do you
agree this is ok?

lock_vma_under_rcu() receives mm as a parameter, so I guess it's
implied that the caller should either mmget() it or operate on
current->mm, so there is no need to document this requirement?
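
Something like this is what I would expect from such a caller (sketch;
vma_read_unlock() stands for whatever the matching unlock primitive is):

	/* hypothetical future user operating on a foreign mm */
	if (!mmget_not_zero(mm))	/* keep mm_users elevated */
		return -ESRCH;

	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* ... use the vma ... */
		vma_read_unlock(vma);
	}
	mmput(mm);			/* only after we are done with the vma */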

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:21               ` Paul E. McKenney
@ 2023-01-20 18:42                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-20 18:42 UTC (permalink / raw)
  To: paulmck
  Cc: Matthew Wilcox, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, liam.howlett, peterz, ldufour,
	laurent.dufour, luto, songliubraving, peterx, david, dhowells,
	hughd, bigeasy, kent.overstreet, punit.agrawal, lstoakes,
	peterjung1337, rientjes, axelrasmussen, joelaf, minchan, jannh,
	shakeelb, tatashin, edumazet, gthelen, gurua, arjunroy, soheil,
	hughlynch, leewalsh, posk, linux-mm, linux-arm-kernel,
	linuxppc-dev, x86, linux-kernel, kernel-team

On Fri, Jan 20, 2023 at 9:21 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit.  It'd be nice to use kmem_cache_free_bulk().
>
> Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head
> structure's field, is much more cache-efficient.
>
> The penalty is that there is no callback function to do any cleanup.
> There is just a kfree()/kvfree (bulk version where applicable),
> nothing else.

If Liam's suggestion works then we won't need anything additional. We
will free the vm_area_structs directly on process exit and use
call_rcu() in all other cases. Let's see if Michal knows of any case
which still needs an RCU grace period during exit_mmap.
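
IOW something along these lines (sketch, the names are made up):

	void vm_area_free_rcu(struct vm_area_struct *vma)	/* munmap() etc. */
	{
		call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
	}

	void vm_area_free_exit(struct vm_area_struct *vma)	/* exit_mmap() only */
	{
		kmem_cache_free(vm_area_cachep, vma);
	}

exit_mmap() could also collect the pointers into an array and hand them
to kmem_cache_free_bulk() in one go, as Matthew suggested.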

>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:50                     ` Suren Baghdasaryan
@ 2023-01-20 19:23                       ` Liam R. Howlett
  2023-01-23  9:56                       ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Liam R. Howlett @ 2023-01-20 19:23 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Michal Hocko, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

* Suren Baghdasaryan <surenb@google.com> [230120 12:50]:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Matthew Wilcox <willy@infradead.org> [230120 11:50]:
> > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > > >
> > > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > > that this is not really optimal.
> > > > > > > > > >
> > > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > > > >
> > > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > > >
> > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > > not sound too appealing either.
> > > > > > > >
> > > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > > >
> > > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > > shrinker.
> > > > > >
> > > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > > vm_area_free to batch the destruction and all other cases call
> > > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > > floods.
> > > > >
> > > > > ... or have two different call_rcu functions; one for munmap() and
> > > > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> > > >
> > > > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > > Once we have obtained the write lock again, I think it's safe to say we
> > > > can just go ahead and free the VMAs directly.
> > >
> > > I think that would be still racy if the page fault handler found that
> > > VMA under read-RCU protection but did not lock it yet (no locks are
> > > held yet). If it's preempted, the VMA can be freed and destroyed from
> > > under it without RCU grace period.
> >
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?
> 
> lock_vma_under_rcu() receives mm as a parameter, so I guess it's
> implied that the caller should either mmget() it or operate on
> current->mm, so no need to document this requirement?

It is also implied by the vma->vm_mm link.  Otherwise any RCU holder of
the VMA could have an unsafe pointer.  In fact, if this isn't true then
we need to change the callers to take the ref count to avoid just this
scenario.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 17:50                     ` Suren Baghdasaryan
  2023-01-20 19:23                       ` Liam R. Howlett
@ 2023-01-23  9:56                       ` Michal Hocko
  2023-01-23 16:22                         ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-23  9:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
[...]
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?

Don't we need RCU protection for the vma lifetime assurance? Jann has
already shown how rwsem is not safe wrt unlock and free without RCU.
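
Schematically, the problem (as I understand Jann's point) is of this
shape, with vma->lock being an rwsem:

	/* reader (#PF) */
	down_read(&vma->lock);
	...				/* done with the vma */
	up_read(&vma->lock);		/* may still touch the rwsem internals
					   after the reader count drops */

	/* writer (unmap path), concurrently */
	down_write(&vma->lock);		/* can be granted as soon as the reader
					   count reaches zero */
	up_write(&vma->lock);
	vm_area_free(vma);		/* an immediate free here can overlap
					   with the tail of up_read() above */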

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-20 16:20         ` Suren Baghdasaryan
  2023-01-20 16:45           ` Suren Baghdasaryan
@ 2023-01-23  9:59           ` Michal Hocko
  2023-01-23 17:43             ` Suren Baghdasaryan
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-23  9:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in housands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
> 
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.

I am still not sure we are on the same page here. No, vmas shouldn't lie
around until the process exits. I am really suggesting queuing only for
remove_vma paths. You can have a different rcu callback than the one
used for trivial single vma removal paths.
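
If it helps, roughly like this (sketch; vm_free_next is a made-up link
field and the first entry's rcu_head is reused as the anchor):

	static void vm_area_free_batch_cb(struct rcu_head *head)
	{
		struct vm_area_struct *vma, *next;

		vma = container_of(head, struct vm_area_struct, vm_rcu);
		while (vma) {
			next = vma->vm_free_next;
			kmem_cache_free(vm_area_cachep, vma);
			vma = next;
		}
	}

	/* remove_vma(): only chain the vma onto the batch */
	vma->vm_free_next = batch;
	batch = vma;

	/* end of remove_mt()/exit_mmap(): one grace period for the batch */
	if (batch)
		call_rcu(&batch->vm_rcu, vm_area_free_batch_cb);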

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23  9:56                       ` Michal Hocko
@ 2023-01-23 16:22                         ` Suren Baghdasaryan
  2023-01-23 16:55                           ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 16:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> [...]
> > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > should have a refcount on the mm_struct, so we can't be in this path
> > > trying to free VMAs.  Right?
> >
> > Hmm. That sounds right. I checked process_mrelease() as well, which
> > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > VMAs without freeing them, so we are still good. Michal, do you agree
> > this is ok?
>
> Don't we need RCU procetions for the vma life time assurance? Jann has
> already shown how rwsem is not safe wrt to unlock and free without RCU.

Jann's case requires the thread freeing the VMA to be blocked on the vma
write lock, waiting for the vma real lock to be released by a page
fault handler. However, exit_mmap() means mm->mm_users==0, which in
turn suggests that there are no racing page fault handlers and no new
page fault handlers will appear. Is that a correct assumption? If so,
then races with page fault handlers can't happen while in exit_mmap().
Any path other than the page fault handler accesses vma->lock
under protection of mmap_lock (for read or write, it does not matter).
One exception is when we operate on an isolated VMA; then we don't
need mmap_lock protection, but exit_mmap() does not deal with isolated
VMAs, so that is out of scope here. exit_mmap() frees vm_area_structs
under protection of mmap_lock in write mode, so races with anything
other than the page fault handler should be safe, as they are today.

That said, possible future users of lock_vma_under_rcu() that use a VMA
without mmap_lock protection will have to ensure the mm's stability
while they are using the obtained VMA. IOW they should elevate the mm's
refcount and keep it elevated as long as they are using that VMA, not
dropping it before vma->lock is released. I guess it would be a good
idea to document that requirement in the lock_vma_under_rcu() comments
if we decide to take this route.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 16:22                         ` Suren Baghdasaryan
@ 2023-01-23 16:55                           ` Michal Hocko
  2023-01-23 17:07                             ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-23 16:55 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > [...]
> > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > trying to free VMAs.  Right?
> > >
> > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > this is ok?
> >
> > Don't we need RCU procetions for the vma life time assurance? Jann has
> > already shown how rwsem is not safe wrt to unlock and free without RCU.
> 
> Jann's case requires a thread freeing the VMA to be blocked on vma
> write lock waiting for the vma real lock to be released by a page
> fault handler. However exit_mmap() means mm->mm_users==0, which in
> turn suggests that there are no racing page fault handlers and no new
> page fault handlers will appear. Is that a correct assumption? If so,
> then races with page fault handlers can't happen while in exit_mmap().
> Any other path (other than page fault handlers), accesses vma->lock
> under protection of mmap_lock (for read or write, does not matter).
> One exception is when we operate on an isolated VMA, then we don't
> need mmap_lock protection, but exit_mmap() does not deal with isolated
> VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> protection of mmap_lock in write mode, so races with anything other
> than page fault handler should be safe as they are today.

I do not see you talking about the #PF path (RCU + vma read lock
protected) racing with munmap. It is my understanding that the latter
will synchronize over the per-vma lock (along with mmap_lock exclusive
locking). But then we are back to the lifetime guarantees, or am I
missing anything?
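
To make sure we are talking about the same thing, my mental model of
the munmap side is (schematic only, the helper names may not match the
patches exactly):

	mmap_write_lock(mm);
	vma_write_lock(vma);		/* synchronizes against #PF holders of
					   the per-VMA read lock */
	...				/* detach from the tree, unmap */
	vm_area_free(vma);		/* today: deferred via call_rcu() */
	mmap_write_unlock(mm);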
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 16:55                           ` Michal Hocko
@ 2023-01-23 17:07                             ` Suren Baghdasaryan
  2023-01-23 17:16                               ` Michal Hocko
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > [...]
> > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > trying to free VMAs.  Right?
> > > >
> > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > this is ok?
> > >
> > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> >
> > Jann's case requires a thread freeing the VMA to be blocked on vma
> > write lock waiting for the vma real lock to be released by a page
> > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > turn suggests that there are no racing page fault handlers and no new
> > page fault handlers will appear. Is that a correct assumption? If so,
> > then races with page fault handlers can't happen while in exit_mmap().
> > Any other path (other than page fault handlers), accesses vma->lock
> > under protection of mmap_lock (for read or write, does not matter).
> > One exception is when we operate on an isolated VMA, then we don't
> > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > protection of mmap_lock in write mode, so races with anything other
> > than page fault handler should be safe as they are today.
>
> I do not see you talking about #PF (RCU + vma read lock protected) with
> munmap. It is my understanding that the latter will synchronize over per
> vma lock (along with mmap_lock exclusive locking). But then we are back
> to the lifetime guarantees, or do I miss anything.

munmap() or any VMA-freeing operation other than exit_mmap() will free
using call_rcu(), as implemented today. The suggestion is to free VMAs
directly, without an RCU grace period, only when done from exit_mmap().
That's because the VMA-freeing flood has been seen so far only in the
case of exit_mmap(), and we assume the other cases are not heavy enough
for the call_rcu() flood to cause regressions. That assumption might
prove false, but we can deal with that once we know it needs fixing.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 17:07                             ` Suren Baghdasaryan
@ 2023-01-23 17:16                               ` Michal Hocko
  2023-01-23 17:46                                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-23 17:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > [...]
> > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > trying to free VMAs.  Right?
> > > > >
> > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > this is ok?
> > > >
> > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > >
> > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > write lock waiting for the vma real lock to be released by a page
> > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > turn suggests that there are no racing page fault handlers and no new
> > > page fault handlers will appear. Is that a correct assumption? If so,
> > > then races with page fault handlers can't happen while in exit_mmap().
> > > Any other path (other than page fault handlers), accesses vma->lock
> > > under protection of mmap_lock (for read or write, does not matter).
> > > One exception is when we operate on an isolated VMA, then we don't
> > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > protection of mmap_lock in write mode, so races with anything other
> > > than page fault handler should be safe as they are today.
> >
> > I do not see you talking about #PF (RCU + vma read lock protected) with
> > munmap. It is my understanding that the latter will synchronize over per
> > vma lock (along with mmap_lock exclusive locking). But then we are back
> > to the lifetime guarantees, or do I miss anything.
> 
> munmap() or any VMA-freeing operation other than exit_mmap() will free
> using call_rcu(), as implemented today. The suggestion is to free VMAs
> directly, without RCU grace period only when done from exit_mmap().

OK, I have clearly missed that. This makes more sense, but it also adds
some more complexity and assumptions - harder to maintain code in the
end. Whoever wants to touch this scheme in the future would have to
re-evaluate all of them. So, I would just avoid that special casing if
that is feasible.

The flood of call_rcu during exit_mmap is a trivial thing to deal
with, as proposed elsewhere (just batch all of them in a single
run). This will surely add some more code but at least the locking
would be consistent.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23  9:59           ` Michal Hocko
@ 2023-01-23 17:43             ` Suren Baghdasaryan
  0 siblings, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 17:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, michel, jglisse, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, peterz, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 1:59 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > in housands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
>
> I am still not sure we are on the same page here. No, vmas shouldn't lay
> around un ntil the process exit. I am really suggesting queuing only for
> remove_vma paths. You can have a different rcu callback than the one
> used for trivial single vma removal paths.

Oh, I also missed the remove_mt() part and thought you wanted to drain
the list only in exit_mmap(). I think that's a good option!

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 17:16                               ` Michal Hocko
@ 2023-01-23 17:46                                 ` Suren Baghdasaryan
  2023-01-23 18:23                                   ` Matthew Wilcox
  0 siblings, 1 reply; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 17:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > [...]
> > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > trying to free VMAs.  Right?
> > > > > >
> > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > this is ok?
> > > > >
> > > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > > >
> > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > write lock waiting for the vma real lock to be released by a page
> > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > turn suggests that there are no racing page fault handlers and no new
> > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > under protection of mmap_lock (for read or write, does not matter).
> > > > One exception is when we operate on an isolated VMA, then we don't
> > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > protection of mmap_lock in write mode, so races with anything other
> > > > than page fault handler should be safe as they are today.
> > >
> > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > munmap. It is my understanding that the latter will synchronize over per
> > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > to the lifetime guarantees, or do I miss anything.
> >
> > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > directly, without RCU grace period only when done from exit_mmap().
>
> OK, I have clearly missed that. This makes more sense but it also adds
> some more complexity and assumptions - a harder to maintain code in the
> end. Whoever wants to touch this scheme in future would have to
> re-evaluate all of them. So, I would just avoid that special casing if
> that is feasible.

Ok, I understand your point.

>
> Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> to deal with as proposed elsewhere (just batch all of them in a single
> run). This will surely add some more code but at least the locking would
> consistent.

Yes, batching the vmas into a list and draining it in remove_mt() and
exit_mmap() as you suggested makes sense to me and is quite simple.
Let's do that if nobody has objections.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 17:46                                 ` Suren Baghdasaryan
@ 2023-01-23 18:23                                   ` Matthew Wilcox
  2023-01-23 18:47                                     ` Suren Baghdasaryan
  2023-01-23 19:18                                     ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-23 18:23 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > [...]
> > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > trying to free VMAs.  Right?
> > > > > > >
> > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > this is ok?
> > > > > >
> > > > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > > > >
> > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > write lock waiting for the vma real lock to be released by a page
> > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > than page fault handler should be safe as they are today.
> > > >
> > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > munmap. It is my understanding that the latter will synchronize over per
> > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > to the lifetime guarantees, or do I miss anything.
> > >
> > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > directly, without RCU grace period only when done from exit_mmap().
> >
> > OK, I have clearly missed that. This makes more sense but it also adds
> > some more complexity and assumptions - a harder to maintain code in the
> > end. Whoever wants to touch this scheme in future would have to
> > re-evaluate all of them. So, I would just avoid that special casing if
> > that is feasible.
> 
> Ok, I understand your point.
> 
> >
> > The flood of call_rcu calls during exit_mmap is a trivial thing to deal
> > with as proposed elsewhere (just batch all of them in a single run).
> > This will surely add some more code but at least the locking would be
> > consistent.
> 
> Yes, batching the vmas into a list and draining it in remove_mt() and
> exit_mmap() as you suggested makes sense to me and is quite simple.
> Let's do that if nobody has objections.

I object.  We *know* nobody has a reference to any of the VMAs because
you have to have a refcount on the mm before you can get a reference
to a VMA.  If Michal is saying that somebody could do:

	mmget(mm);
	vma = find_vma(mm);
	lock_vma(vma);
	mmput(mm);
	vma->a = b;
	unlock_vma(mm, vma);

then that's something we'd catch in review -- you obviously can't use
the mm after you've dropped your reference to it.

Having all this extra code to solve two problems badly is a very poor
choice.  We have two distinct problems, each of which has a simple,
efficient solution.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 18:23                                   ` Matthew Wilcox
@ 2023-01-23 18:47                                     ` Suren Baghdasaryan
  2023-01-23 19:18                                     ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 18:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 10:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > [...]
> > > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > > trying to free VMAs.  Right?
> > > > > > > >
> > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > > this is ok?
> > > > > > >
> > > > > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > > > > >
> > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > > write lock waiting for the vma read lock to be released by a page
> > > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > > Any other path (other than page fault handlers) accesses vma->lock
> > > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > > than page fault handler should be safe as they are today.
> > > > >
> > > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > > munmap. It is my understanding that the latter will synchronize over per
> > > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > > to the lifetime guarantees, or do I miss anything.
> > > >
> > > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > > directly, without RCU grace period only when done from exit_mmap().
> > >
> > > OK, I have clearly missed that. This makes more sense but it also adds
> > > some more complexity and assumptions - harder-to-maintain code in the
> > > end. Whoever wants to touch this scheme in the future would have to
> > > re-evaluate all of them. So, I would just avoid that special casing if
> > > that is feasible.
> >
> > Ok, I understand your point.
> >
> > >
> > > The flood of call_rcu calls during exit_mmap is a trivial thing to deal
> > > with as proposed elsewhere (just batch all of them in a single run).
> > > This will surely add some more code but at least the locking would be
> > > consistent.
> >
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
>
> I object.  We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA.  If Michal is saying that somebody could do:
>
>         mmget(mm);
>         vma = find_vma(mm);
>         lock_vma(vma);
>         mmput(mm);
>         vma->a = b;
>         unlock_vma(mm, vma);

More precisely, it's:
         mmget(mm);
         vma = lock_vma_under_rcu(mm, addr); -> calls vma_read_trylock(vma)
         mmput(mm);
         vma->a = b;
         vma_read_unlock(vma);

To be fair, vma_read_unlock() does not take mm as a parameter, so one
could get the impression that the mm does not need to be pinned at the
time of the call.
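
Spelled out, with the pinning requirement made explicit (a sketch only,
reusing the helper names above rather than the series' literal code):

        mmget(mm);                              /* pin the mm for the whole window */
        vma = lock_vma_under_rcu(mm, addr);     /* RCU lookup + vma_read_trylock() */
        if (vma) {
                /* ... handle the fault against a now-stable VMA ... */
                vma_read_unlock(vma);           /* the mm must still be pinned here */
        }
        mmput(mm);                              /* drop the reference only after the unlock */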


>
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.
>
> Having all this extra code to solve two problems badly is a very poor
> choice.  We have two distinct problems, each of which has a simple,
> efficient solution.
>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 18:23                                   ` Matthew Wilcox
  2023-01-23 18:47                                     ` Suren Baghdasaryan
@ 2023-01-23 19:18                                     ` Michal Hocko
  2023-01-23 19:30                                       ` Matthew Wilcox
  1 sibling, 1 reply; 186+ messages in thread
From: Michal Hocko @ 2023-01-23 19:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, michel, jglisse,
	vbabka, hannes, mgorman, dave, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
[...]
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
> 
> I object.  We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA.  If Michal is saying that somebody could do:
> 
> 	mmget(mm);
> 	vma = find_vma(mm);
> 	lock_vma(vma);
> 	mmput(mm);
> 	vma->a = b;
> 	unlock_vma(mm, vma);
> 
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.

I am not claiming this is possible now. I do not think we want to have
something like that in the future either but that is really hard to
envision. I am claiming that it is subtle and potentially error prone to
have two different ways of mass vma freeing wrt. locking. Also, don't we
have a very similar situation during last munmaps?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 19:18                                     ` Michal Hocko
@ 2023-01-23 19:30                                       ` Matthew Wilcox
  2023-01-23 19:57                                         ` Suren Baghdasaryan
  2023-01-23 20:00                                         ` Michal Hocko
  0 siblings, 2 replies; 186+ messages in thread
From: Matthew Wilcox @ 2023-01-23 19:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, michel, jglisse,
	vbabka, hannes, mgorman, dave, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> [...]
> > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > Let's do that if nobody has objections.
> > 
> > I object.  We *know* nobody has a reference to any of the VMAs because
> > you have to have a refcount on the mm before you can get a reference
> > to a VMA.  If Michal is saying that somebody could do:
> > 
> > 	mmget(mm);
> > 	vma = find_vma(mm);
> > 	lock_vma(vma);
> > 	mmput(mm);
> > 	vma->a = b;
> > 	unlock_vma(mm, vma);
> > 
> > then that's something we'd catch in review -- you obviously can't use
> > the mm after you've dropped your reference to it.
> 
> I am not claiming this is possible now. I do not think we want to have
> something like that in the future either but that is really hard to
> envision. I am claiming that it is subtle and potentially error prone to
> have two different ways of mass vma freeing wrt. locking. Also, don't we
> have a very similar situation during last munmaps?

We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
There are two cases; there's munmap(), which typically frees a single
VMA (yes, theoretically, you can free hundreds of VMAs with a single
call which spans multiple VMAs, but in practice that doesn't happen),
and there's exit_mmap() which happens on exec() and exit().

For the munmap() case, just RCU-free each one individually.  For the
exit_mmap() case, there's no need to use RCU because nobody should still
have a VMA pointer after calling mmdrop() [1]

[1] Sorry, the above example should have been mmgrab()/mmdrop(), not
mmget()/mmput(); you're not allowed to look at the VMA list with an
mmget(), you need to have grabbed it.
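
To make the munmap()/exit_mmap() split concrete, it could be shaped
roughly like the sketch below; the names (vm_area_free_rcu_cb, vm_rcu,
vm_area_free_on_exit) and the vm_area_cachep reference are illustrative
assumptions rather than the series' exact code:

        /* munmap() path: a reader may still be inside the per-VMA lock, so
         * defer the actual free behind an RCU grace period. */
        static void vm_area_free_rcu_cb(struct rcu_head *head)
        {
                struct vm_area_struct *vma = container_of(head,
                                        struct vm_area_struct, vm_rcu);

                kmem_cache_free(vm_area_cachep, vma);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
        }

        /* exit_mmap() path: mm_users has already hit zero, so no page fault
         * handler can hold or take the per-VMA lock; free directly. */
        void vm_area_free_on_exit(struct vm_area_struct *vma)
        {
                kmem_cache_free(vm_area_cachep, vma);
        }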

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 19:30                                       ` Matthew Wilcox
@ 2023-01-23 19:57                                         ` Suren Baghdasaryan
  2023-01-23 20:00                                         ` Michal Hocko
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 19:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 11:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > >
> > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA.  If Michal is saying that somebody could do:
> > >
> > >     mmget(mm);
> > >     vma = find_vma(mm);
> > >     lock_vma(vma);
> > >     mmput(mm);
> > >     vma->a = b;
> > >     unlock_vma(mm, vma);
> > >
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> >
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
>
> We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().
>
> For the munmap() case, just RCU-free each one individually.  For the
> exit_mmap() case, there's no need to use RCU because nobody should still
> have a VMA pointer after calling mmdrop() [1]
>
> [1] Sorry, the above example should have been mmgrab()/mmdrop(), not
> mmget()/mmput(); you're not allowed to look at the VMA list with an
> mmget(), you need to have grabbed.

I think it's clear that this would work with the current code and that
the concern is about possible future misuse. So, it would be preferable
to proactively prevent such misuse.

vma_write_lock() and vma_write_unlock_mm() both have
mmap_assert_write_locked(), so they always happen under mmap_lock
protection and therefore do not pose any danger. The only issue we
need to be careful with is calling
vma_read_trylock()/vma_read_unlock() outside of mmap_lock protection
while mm is unstable. I don't think doing mmget/mmput inside these
functions is called for but maybe some assertions would prevent future
misuse?
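
One possible shape for such an assertion -- a sketch only, assuming the
rw_semaphore-based vma->lock from patch 12, and deliberately omitting the
sequence-number check the real vma_read_trylock() has to do:

        static inline bool vma_read_trylock(struct vm_area_struct *vma)
        {
                /*
                 * A reader must either hold an mm_users reference (the page
                 * fault path) or hold mmap_lock; warn if neither is true, to
                 * catch the kind of misuse discussed above.
                 */
                VM_WARN_ON_ONCE(!atomic_read(&vma->vm_mm->mm_users) &&
                                !rwsem_is_locked(&vma->vm_mm->mmap_lock));

                return down_read_trylock(&vma->lock);
        }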

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 19:30                                       ` Matthew Wilcox
  2023-01-23 19:57                                         ` Suren Baghdasaryan
@ 2023-01-23 20:00                                         ` Michal Hocko
  2023-01-23 20:08                                           ` Suren Baghdasaryan
  2023-01-23 20:38                                           ` Liam R. Howlett
  1 sibling, 2 replies; 186+ messages in thread
From: Michal Hocko @ 2023-01-23 20:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, michel, jglisse,
	vbabka, hannes, mgorman, dave, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > > 
> > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA.  If Michal is saying that somebody could do:
> > > 
> > > 	mmget(mm);
> > > 	vma = find_vma(mm);
> > > 	lock_vma(vma);
> > > 	mmput(mm);
> > > 	vma->a = b;
> > > 	unlock_vma(mm, vma);
> > > 
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> > 
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
> 
> We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().

This requires special-casing remove_vma for those two different paths
(exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
code just to avoid handling a potentially large munmap, which might very
well be a rare thing as you say. But haven't we learned that sooner or
later we will find out there is somebody who cares after all? Anyway,
this is not something I care about all that much. It is just weird to
special-case exit_mmap, if you ask me. Up to Suren to decide which way
he wants to go. I just really didn't like the initial implementation of
batching based on a completely arbitrary batch limit and lazy freeing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 20:00                                         ` Michal Hocko
@ 2023-01-23 20:08                                           ` Suren Baghdasaryan
  2023-01-23 20:38                                           ` Liam R. Howlett
  1 sibling, 0 replies; 186+ messages in thread
From: Suren Baghdasaryan @ 2023-01-23 20:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Liam R. Howlett, akpm, michel, jglisse, vbabka,
	hannes, mgorman, dave, peterz, ldufour, laurent.dufour, paulmck,
	luto, songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Mon, Jan 23, 2023 at 12:00 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > >
> > > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA.  If Michal is saying that somebody could do:
> > > >
> > > >   mmget(mm);
> > > >   vma = find_vma(mm);
> > > >   lock_vma(vma);
> > > >   mmput(mm);
> > > >   vma->a = b;
> > > >   unlock_vma(mm, vma);
> > > >
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > >
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> >
> > We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
>
> This requires special-casing remove_vma for those two different paths
> (exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody who cares after all? Anyway,
> this is not something I care about all that much. It is just weird to
> special-case exit_mmap, if you ask me. Up to Suren to decide which way
> he wants to go. I just really didn't like the initial implementation of
> batching based on a completely arbitrary batch limit and lazy freeing.

I would prefer to go with the simplest sufficient solution. A
potential issue with a large munmap might prove to be real but I think
we know how to easily fix that with batching if the issue ever
materializes (I'll have a fix ready implementing Michal's suggestion).
So, I suggest going with Liam's/Matthew's solution and converting to
Michal's solution if a regression shows up anywhere else. Would that be
acceptable?
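
For reference, that batching fallback could be shaped roughly like the
sketch below; this is only an illustration under the assumption that
detached VMAs can be strung onto a list (the free_list field and every
name here are made up):

        struct vma_free_batch {
                struct list_head vmas;          /* detached, to-be-freed VMAs */
                struct rcu_head rcu;
        };

        static void vma_batch_free_cb(struct rcu_head *head)
        {
                struct vma_free_batch *batch =
                        container_of(head, struct vma_free_batch, rcu);
                struct vm_area_struct *vma, *next;

                list_for_each_entry_safe(vma, next, &batch->vmas, free_list)
                        kmem_cache_free(vm_area_cachep, vma);
                kfree(batch);
        }

        /* Called once at the end of remove_mt()/exit_mmap(): one RCU callback
         * for the whole batch instead of one call_rcu() per VMA. */
        static void vma_batch_drain(struct vma_free_batch *batch)
        {
                if (list_empty(&batch->vmas)) {
                        kfree(batch);
                        return;
                }
                call_rcu(&batch->rcu, vma_batch_free_cb);
        }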

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
  2023-01-23 20:00                                         ` Michal Hocko
  2023-01-23 20:08                                           ` Suren Baghdasaryan
@ 2023-01-23 20:38                                           ` Liam R. Howlett
  1 sibling, 0 replies; 186+ messages in thread
From: Liam R. Howlett @ 2023-01-23 20:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Suren Baghdasaryan, akpm, michel, jglisse,
	vbabka, hannes, mgorman, dave, peterz, ldufour, laurent.dufour,
	paulmck, luto, songliubraving, peterx, david, dhowells, hughd,
	bigeasy, kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, jannh, shakeelb,
	tatashin, edumazet, gthelen, gurua, arjunroy, soheil, hughlynch,
	leewalsh, posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

* Michal Hocko <mhocko@suse.com> [230123 15:00]:
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > > 
> > > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA.  If Michal is saying that somebody could do:
> > > > 
> > > > 	mmget(mm);
> > > > 	vma = find_vma(mm);
> > > > 	lock_vma(vma);
> > > > 	mmput(mm);
> > > > 	vma->a = b;
> > > > 	unlock_vma(mm, vma);
> > > > 
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > > 
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> > 
> > We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
> 
> This requires special-casing remove_vma for those two different paths
> (exit_mmap and remove_mt).  If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody who cares after all? Anyway,
> this is not something I care about all that much. It is just weird to
> special-case exit_mmap, if you ask me.

exit_mmap() is already a special case for locking (and statistics).
This exists today to optimize the special exit scenario.  I don't think
it's a question of sub-optimal code but of what we can get away with not
doing in the process-exit case.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
  2023-01-17 21:45       ` Jann Horn
  2023-01-17 22:36         ` Suren Baghdasaryan
@ 2023-11-22 14:04         ` Alexander Gordeev
  1 sibling, 0 replies; 186+ messages in thread
From: Alexander Gordeev @ 2023-11-22 14:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Suren Baghdasaryan, peterz, Ingo Molnar, Will Deacon, akpm,
	michel, jglisse, mhocko, vbabka, hannes, mgorman, dave, willy,
	liam.howlett, ldufour, laurent.dufour, paulmck, luto,
	songliubraving, peterx, david, dhowells, hughd, bigeasy,
	kent.overstreet, punit.agrawal, lstoakes, peterjung1337,
	rientjes, axelrasmussen, joelaf, minchan, shakeelb, tatashin,
	edumazet, gthelen, gurua, arjunroy, soheil, hughlynch, leewalsh,
	posk, linux-mm, linux-arm-kernel, linuxppc-dev, x86,
	linux-kernel, kernel-team

On Tue, Jan 17, 2023 at 10:45:25PM +0100, Jann Horn wrote:

Hi Jann,

> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +       up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
> 
> __up_read() does:
> 
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
>       RWSEM_FLAG_WAITERS)) {
>   clear_nonspinnable(sem);
>   rwsem_wake(sem);
> }

This sequence is covered by preempt_disable()/preempt_enable().
Wouldn't it preserve the RCU grace period until after __up_read()
has exited?

> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
> 
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
> 
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.

Thanks!

^ permalink raw reply	[flat|nested] 186+ messages in thread

end of thread, other threads:[~2023-11-22 14:06 UTC | newest]

Thread overview: 186+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-09 20:52 [PATCH 00/41] Per-VMA locks Suren Baghdasaryan
2023-01-09 20:52 ` [PATCH 01/41] maple_tree: Be more cautious about dead nodes Suren Baghdasaryan
2023-01-09 20:52 ` [PATCH 02/41] maple_tree: Detect dead nodes in mas_start() Suren Baghdasaryan
2023-01-09 20:52 ` [PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode Suren Baghdasaryan
2023-01-09 20:52 ` [PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves() Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 07/41] mm: Enable maple tree RCU mode by default Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK Suren Baghdasaryan
2023-01-11  0:13   ` Davidlohr Bueso
2023-01-11  0:44     ` Suren Baghdasaryan
2023-01-11  8:23       ` Michal Hocko
2023-01-11  9:54         ` Ingo Molnar
     [not found]           ` <6be809f5554a4faaa22c287ba4224bd0@AcuMS.aculab.com>
2023-01-11 16:28             ` Suren Baghdasaryan
2023-01-11 16:44               ` Michal Hocko
2023-01-11 17:04                 ` Suren Baghdasaryan
2023-01-11 17:37                   ` Michal Hocko
2023-01-11 17:49                     ` Suren Baghdasaryan
2023-01-11 18:02                       ` Michal Hocko
2023-01-11 18:09                         ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 09/41] mm: rcu safe VMA freeing Suren Baghdasaryan
2023-01-17 14:25   ` Michal Hocko
2023-01-18  2:16     ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 10/41] mm: move mmap_lock assert function definitions Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 11/41] mm: export dump_mm() Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 12/41] mm: add per-VMA lock and helper functions to control it Suren Baghdasaryan
2023-01-17 15:04   ` Michal Hocko
2023-01-17 15:12     ` Michal Hocko
2023-01-17 21:21       ` Suren Baghdasaryan
2023-01-17 21:54         ` Matthew Wilcox
2023-01-17 22:33           ` Suren Baghdasaryan
2023-01-18  9:18           ` Michal Hocko
2023-01-17 21:08     ` Suren Baghdasaryan
2023-01-17 15:07   ` Michal Hocko
2023-01-17 21:09     ` Suren Baghdasaryan
2023-01-17 18:02   ` Jann Horn
2023-01-17 21:28     ` Suren Baghdasaryan
2023-01-17 21:45       ` Jann Horn
2023-01-17 22:36         ` Suren Baghdasaryan
2023-01-17 23:15           ` Matthew Wilcox
2023-11-22 14:04         ` Alexander Gordeev
2023-01-18 12:28     ` Michal Hocko
2023-01-18 13:23       ` Jann Horn
2023-01-18 15:11         ` Michal Hocko
2023-01-18 17:36           ` Suren Baghdasaryan
2023-01-18 21:28             ` Michal Hocko
2023-01-18 21:45               ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 13/41] mm: introduce vma->vm_flags modifier functions Suren Baghdasaryan
2023-01-11 15:47   ` Davidlohr Bueso
2023-01-11 17:36     ` Suren Baghdasaryan
2023-01-11 19:52       ` Davidlohr Bueso
2023-01-11 21:23         ` Suren Baghdasaryan
2023-01-17 15:09   ` Michal Hocko
2023-01-17 15:15     ` Michal Hocko
2023-01-18  2:07       ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call Suren Baghdasaryan
2023-01-17 15:16   ` Michal Hocko
2023-01-18  2:01     ` Suren Baghdasaryan
2023-01-18  9:23       ` Michal Hocko
2023-01-18 18:09         ` Suren Baghdasaryan
2023-01-18 21:33           ` Michal Hocko
2023-01-18 21:48             ` Suren Baghdasaryan
2023-01-19  9:31               ` Michal Hocko
2023-01-19 18:53                 ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page Suren Baghdasaryan
2023-01-17 15:25   ` Michal Hocko
2023-01-17 20:28     ` Jann Horn
2023-01-17 21:05       ` Suren Baghdasaryan
2023-01-18  9:40       ` Michal Hocko
2023-01-18 12:38         ` Jann Horn
2023-01-18 17:41         ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction Suren Baghdasaryan
2023-01-17 15:42   ` Michal Hocko
2023-01-18  1:53     ` Suren Baghdasaryan
2023-01-18  9:43       ` Michal Hocko
2023-01-18 18:06         ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration Suren Baghdasaryan
2023-01-18 12:50   ` Jann Horn
2023-01-18 17:40     ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code Suren Baghdasaryan
2023-01-17 15:47   ` Michal Hocko
2023-01-18  1:06     ` Suren Baghdasaryan
2023-01-18  2:44       ` Matthew Wilcox
2023-01-18 21:33         ` Suren Baghdasaryan
2023-01-17 21:03   ` Jann Horn
2023-01-17 23:18     ` Liam Howlett
2023-01-09 20:53 ` [PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock Suren Baghdasaryan
2023-01-17 19:51   ` Jann Horn
2023-01-17 20:36     ` Jann Horn
2023-01-17 20:57       ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 33/41] mm: introduce per-VMA lock statistics Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 34/41] x86/mm: try VMA lock-based page fault handling first Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 35/41] arm64/mm: " Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 36/41] powerc/mm: " Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 37/41] mm: introduce mod_vm_flags_nolock Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 38/41] mm: avoid assertion in untrack_pfn Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free Suren Baghdasaryan
2023-01-17 15:57   ` Michal Hocko
2023-01-18  1:19     ` Suren Baghdasaryan
2023-01-18  9:49       ` Michal Hocko
2023-01-18 18:04         ` Suren Baghdasaryan
2023-01-18 18:34           ` Paul E. McKenney
2023-01-18 19:01             ` Suren Baghdasaryan
2023-01-18 20:20               ` Paul E. McKenney
2023-01-19 12:52               ` Michal Hocko
2023-01-19 19:17                 ` Paul E. McKenney
2023-01-20  8:57                   ` Michal Hocko
2023-01-20 16:08                     ` Paul E. McKenney
2023-01-19 12:59   ` Michal Hocko
2023-01-19 18:52     ` Suren Baghdasaryan
2023-01-19 19:20       ` Paul E. McKenney
2023-01-19 19:47         ` Suren Baghdasaryan
2023-01-19 19:55           ` Paul E. McKenney
2023-01-20  8:52       ` Michal Hocko
2023-01-20 16:20         ` Suren Baghdasaryan
2023-01-20 16:45           ` Suren Baghdasaryan
2023-01-20 16:49             ` Matthew Wilcox
2023-01-20 17:08               ` Liam R. Howlett
2023-01-20 17:17                 ` Suren Baghdasaryan
2023-01-20 17:32                   ` Matthew Wilcox
2023-01-20 17:50                     ` Suren Baghdasaryan
2023-01-20 19:23                       ` Liam R. Howlett
2023-01-23  9:56                       ` Michal Hocko
2023-01-23 16:22                         ` Suren Baghdasaryan
2023-01-23 16:55                           ` Michal Hocko
2023-01-23 17:07                             ` Suren Baghdasaryan
2023-01-23 17:16                               ` Michal Hocko
2023-01-23 17:46                                 ` Suren Baghdasaryan
2023-01-23 18:23                                   ` Matthew Wilcox
2023-01-23 18:47                                     ` Suren Baghdasaryan
2023-01-23 19:18                                     ` Michal Hocko
2023-01-23 19:30                                       ` Matthew Wilcox
2023-01-23 19:57                                         ` Suren Baghdasaryan
2023-01-23 20:00                                         ` Michal Hocko
2023-01-23 20:08                                           ` Suren Baghdasaryan
2023-01-23 20:38                                           ` Liam R. Howlett
2023-01-20 17:21               ` Paul E. McKenney
2023-01-20 18:42                 ` Suren Baghdasaryan
2023-01-23  9:59           ` Michal Hocko
2023-01-23 17:43             ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 40/41] mm: separate vma->lock from vm_area_struct Suren Baghdasaryan
2023-01-17 18:33   ` Jann Horn
2023-01-17 19:01     ` Suren Baghdasaryan
2023-01-09 20:53 ` [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock Suren Baghdasaryan
2023-01-10  8:04   ` Vlastimil Babka
2023-01-10 17:05     ` Suren Baghdasaryan
2023-01-16 11:14   ` Hyeonggon Yoo
2023-01-16 22:36     ` Suren Baghdasaryan
2023-01-17  4:14     ` Matthew Wilcox
2023-01-17  4:34       ` Suren Baghdasaryan
2023-01-17  5:46         ` Matthew Wilcox
2023-01-17  5:58           ` Suren Baghdasaryan
2023-01-17 18:23             ` Matthew Wilcox
2023-01-17 18:28               ` Suren Baghdasaryan
2023-01-17 20:31                 ` Michal Hocko
2023-01-17 21:00                   ` Suren Baghdasaryan
     [not found]   ` <20230116140649.2012-1-hdanton@sina.com>
2023-01-16 23:08     ` Suren Baghdasaryan
2023-01-16 23:11       ` Suren Baghdasaryan
     [not found]       ` <20230117031632.2321-1-hdanton@sina.com>
2023-01-17  4:52         ` Suren Baghdasaryan
     [not found]           ` <20230117083355.2374-1-hdanton@sina.com>
2023-01-17 18:21             ` Suren Baghdasaryan
2023-01-17 18:27               ` Matthew Wilcox
2023-01-17 18:31                 ` Suren Baghdasaryan
     [not found]                 ` <20230118062639.2839-1-hdanton@sina.com>
2023-01-18 18:35                   ` Matthew Wilcox
2023-01-17 18:11   ` Jann Horn
2023-01-17 18:26     ` Suren Baghdasaryan
2023-01-17 18:31       ` Matthew Wilcox
2023-01-17 18:36         ` Jann Horn
2023-01-17 18:49           ` Suren Baghdasaryan
2023-01-17 18:36         ` Suren Baghdasaryan
2023-01-17 18:48           ` Matthew Wilcox
2023-01-17 18:55             ` Suren Baghdasaryan
2023-01-17 18:59               ` Jann Horn
2023-01-17 19:06                 ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).