* [PATCH v5 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
@ 2021-12-09 19:13 Suren Baghdasaryan
  2021-12-09 19:13 ` [PATCH v5 2/3] mm: document locking restrictions for vm_operations_struct::close Suren Baghdasaryan
  2021-12-09 19:13 ` [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
  0 siblings, 2 replies; 5+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 19:13 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team, surenb

The oom-reaper and the process_mrelease system call should protect
against races with exit_mmap, which can destroy page tables while they
walk the VMA tree. The oom-reaper protects against that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP before
taking and releasing mmap_write_lock. process_mrelease instead has to
elevate mm->mm_users to prevent such a race. Both the oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree. The
locking rules and mechanisms could be simpler if exit_mmap took
mmap_write_lock while executing destructive operations such as
free_pgtables.

Change exit_mmap to hold the mmap_write_lock when calling unlock_range,
free_pgtables and remove_vma. Note also that because the oom-reaper
checks the VM_LOCKED flag, unlock_range() must not be allowed to race
with it. Before this patch, remove_vma used to be called with no locks
held; however, with fput being executed asynchronously and
vm_ops->close not being allowed to take mmap_lock (it is called from
__split_vma with mmap_sem held for write), changing that should be
fine.

In most cases this lock should be uncontended. Previously, Kirill
reported a ~4% regression caused by a similar change [1]. We reran the
same test and, although the individual results are quite noisy, the
percentiles show a lower regression, with 1.6% being the worst case
[2]. The change allows the oom-reaper and process_mrelease to execute
safely under mmap_read_lock without worrying that exit_mmap might
destroy page tables from under them.

[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
changes in v5
- Corrected patch description, per Michal Hocko
- Added Acked-by

 mm/mmap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..f4e09d390a07 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * This needs to be done before calling unlock_range(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test it.
 		 */
 		(void)__oom_reap_task_mm(mm);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-		mmap_write_lock(mm);
-		mmap_write_unlock(mm);
 	}
 
+	mmap_write_lock(mm);
 	if (mm->locked_vm)
 		unlock_range(mm->mmap, ULONG_MAX);
 
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
-	if (!vma)	/* Can happen if dup_mmap() received an OOM */
+	if (!vma) {
+		/* Can happen if dup_mmap() received an OOM */
+		mmap_write_unlock(mm);
 		return;
+	}
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
-	/*
-	 * Walk the list again, actually closing and freeing it,
-	 * with preemption enabled, without holding any MM locks.
-	 */
+	/* Walk the list again, actually closing and freeing it. */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v5 2/3] mm: document locking restrictions for vm_operations_struct::close
  2021-12-09 19:13 [PATCH v5 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
@ 2021-12-09 19:13 ` Suren Baghdasaryan
  2021-12-09 19:13 ` [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
  1 sibling, 0 replies; 5+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 19:13 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team, surenb

Add comments for vm_operations_struct::close documenting the locking
requirements for this callback and its callers.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
changes in v5
- Added Acked-by

 include/linux/mm.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..b9b88ba7564b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -577,6 +577,10 @@ enum page_entry_size {
  */
 struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
+	/**
+	 * @close: Called when the VMA is being removed from the MM.
+	 * Context: User context.  May sleep.  Caller holds mmap_lock.
+	 */
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
  2021-12-09 19:13 [PATCH v5 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
  2021-12-09 19:13 ` [PATCH v5 2/3] mm: document locking restrictions for vm_operations_struct::close Suren Baghdasaryan
@ 2021-12-09 19:13 ` Suren Baghdasaryan
  2021-12-09 20:48   ` Jason Gunthorpe
  1 sibling, 1 reply; 5+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 19:13 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team, surenb

With exit_mmap holding mmap_write_lock during the free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destroying page tables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which can lead to waiting for IO completion
in exit_aio.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
changes in v5
- Removed Fixes: tag, per Michal Hocko
- Added Acked-by's

 mm/oom_kill.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		goto put_task;
 	}
 
-	if (mmget_not_zero(p->mm)) {
-		mm = p->mm;
-		if (task_will_free_mem(p))
-			reap = true;
-		else {
-			/* Error only if the work has not been done already */
-			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
-				ret = -EINVAL;
-		}
+	mm = p->mm;
+	mmgrab(mm);
+
+	if (task_will_free_mem(p))
+		reap = true;
+	else {
+		/* Error only if the work has not been done already */
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+			ret = -EINVAL;
 	}
 	task_unlock(p);
 
@@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
-	if (!__oom_reap_task_mm(mm))
+	/*
+	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+	 * possible change in exit_mmap is seen
+	 */
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);
 
 drop_mm:
-	if (mm)
-		mmput(mm);
+	mmdrop(mm);
 put_task:
 	put_task_struct(task);
 	return ret;
-- 
2.34.1.173.g76aa8bc2d0-goog



* Re: [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
  2021-12-09 19:13 ` [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
@ 2021-12-09 20:48   ` Jason Gunthorpe
  2021-12-10 15:54     ` Suren Baghdasaryan
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2021-12-09 20:48 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, kirill, aarcange, christian, hch, oleg, david, jannh,
	shakeelb, luto, christian.brauner, fweimer, jengelh, timmurray,
	linux-mm, linux-kernel, kernel-team

On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> With exit_mmap holding mmap_write_lock during the free_pgtables call,
> process_mrelease does not need to elevate mm->mm_users in order to
> prevent exit_mmap from destroying page tables while __oom_reap_task_mm
> is walking the VMA tree. The change prevents process_mrelease from
> calling the last mmput, which can lead to waiting for IO completion
> in exit_aio.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> changes in v5
> - Removed Fixes: tag, per Michal Hocko
> - Added Acked-by's
> 
>  mm/oom_kill.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

There are mmget_not_zero()s all over the place; can others be cleaned
up after this series goes ahead too?

It seems like anything doing the mmget just to look at the vma list
under the mmap lock is now fine with only a mmgrab?

A few I know about:

drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {

This is because mmu_interval_notifier_insert() might call
mm_take_all_locks() which was unsafe with concurrent exit_mmap

drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {

This is because it calls hmm_range_fault(), which iterates over the
vma list, and that is safe now.

drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);

It calls find_extend_vma() - but it also doesn't seem to take an mmgrab
when it does that mmget. The RCU is messed up here too, so hmm.

Jason


* Re: [PATCH v5 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
  2021-12-09 20:48   ` Jason Gunthorpe
@ 2021-12-10 15:54     ` Suren Baghdasaryan
  0 siblings, 0 replies; 5+ messages in thread
From: Suren Baghdasaryan @ 2021-12-10 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, kirill, aarcange, christian, hch, oleg, david, jannh,
	shakeelb, luto, christian.brauner, fweimer, jengelh, timmurray,
	linux-mm, linux-kernel, kernel-team

On Thu, Dec 9, 2021 at 12:48 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> > With exit_mmap holding mmap_write_lock during the free_pgtables call,
> > process_mrelease does not need to elevate mm->mm_users in order to
> > prevent exit_mmap from destroying page tables while __oom_reap_task_mm
> > is walking the VMA tree. The change prevents process_mrelease from
> > calling the last mmput, which can lead to waiting for IO completion
> > in exit_aio.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> > changes in v5
> > - Removed Fixes: tag, per Michal Hocko
> > - Added Acked-by's
> >
> >  mm/oom_kill.c | 27 +++++++++++++++------------
> >  1 file changed, 15 insertions(+), 12 deletions(-)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> There are mmget_not_zero()s all over the place; can others be cleaned
> up after this series goes ahead too?
>
> It seems like anything doing the mmget just to look at the vma list
> under the mmap lock is now fine with only a mmgrab?

Sounds reasonable to me. I'll try to carve out some time to look into
it but no promises. Lots of loose ends to tie at the end of the year
:(

>
> A few I know about:
>
> drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {
>
> This is because mmu_interval_notifier_insert() might call
> mm_take_all_locks() which was unsafe with concurrent exit_mmap
>
> drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {
>
> This is because it calls hmm_range_fault(), which iterates over the
> vma list, and that is safe now.
>
> drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
> drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);
>
> It calls find_extend_vma() - but it also doesn't seem to take an
> mmgrab when it does that mmget. The RCU is messed up here too, so hmm.
>
> Jason

