linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
@ 2021-12-07 21:50 Suren Baghdasaryan
  2021-12-07 21:50 ` [PATCH v3 2/2] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
  2021-12-07 22:07 ` [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Matthew Wilcox
  0 siblings, 2 replies; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-07 21:50 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team, surenb

oom-reaper and process_mrelease system call should protect against
races with exit_mmap which can destroy page tables while they
walk the VMA tree. oom-reaper protects from that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
before taking and releasing mmap_write_lock. process_mrelease has
to elevate mm->mm_users to prevent such race. Both oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree.
The locking rules and mechanisms could be simpler if exit_mmap takes
mmap_write_lock while executing destructive operations such as
free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables and remove_vma. Operations like unmap_vmas and
unlock_range are not destructive and could run under mmap_read_lock
but for simplicity we take one mmap_write_lock during almost the entire
operation. Note also that because oom-reaper checks VM_LOCKED flag,
unlock_range() should not be allowed to race with it.
Before this patch, remove_vma used to be called with no locks held,
however with fput being executed asynchronously and vm_ops->close
not being allowed to hold mmap_lock (it is called from __split_vma
with mmap_sem held for write), changing that should be fine.
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.

[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
changes in v3
- Amended patch description to explain why remove_vma can be called while
holding mmap_write_lock, per Michal Hocko
- Added a comment for vm_operations_struct::close, documenting restriction for
taking mmap_lock in the callback, per Michal Hocko and Matthew Wilcox

 include/linux/mm.h |  4 ++++
 mm/mmap.c          | 16 ++++++++--------
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..97e1a05c3b2c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -577,6 +577,10 @@ enum page_entry_size {
  */
 struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
+    /*
+     * Called with mmap_lock lock held for write from __split_vma and
+     * remove_vma, therefore should never take that lock.
+     */
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..f4e09d390a07 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * This needs to be done before calling unlock_range(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test it.
 		 */
 		(void)__oom_reap_task_mm(mm);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-		mmap_write_lock(mm);
-		mmap_write_unlock(mm);
 	}
 
+	mmap_write_lock(mm);
 	if (mm->locked_vm)
 		unlock_range(mm->mmap, ULONG_MAX);
 
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
-	if (!vma)	/* Can happen if dup_mmap() received an OOM */
+	if (!vma) {
+		/* Can happen if dup_mmap() received an OOM */
+		mmap_write_unlock(mm);
 		return;
+	}
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
-	/*
-	 * Walk the list again, actually closing and freeing it,
-	 * with preemption enabled, without holding any MM locks.
-	 */
+	/* Walk the list again, actually closing and freeing it. */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/2] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
  2021-12-07 21:50 [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
@ 2021-12-07 21:50 ` Suren Baghdasaryan
  2021-12-07 22:07 ` [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-07 21:50 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team, surenb

With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destrying pagetables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which can lead to waiting for IO completion
in exit_aio.

Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/oom_kill.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		goto put_task;
 	}
 
-	if (mmget_not_zero(p->mm)) {
-		mm = p->mm;
-		if (task_will_free_mem(p))
-			reap = true;
-		else {
-			/* Error only if the work has not been done already */
-			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
-				ret = -EINVAL;
-		}
+	mm = p->mm;
+	mmgrab(mm);
+
+	if (task_will_free_mem(p))
+		reap = true;
+	else {
+		/* Error only if the work has not been done already */
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+			ret = -EINVAL;
 	}
 	task_unlock(p);
 
@@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
-	if (!__oom_reap_task_mm(mm))
+	/*
+	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+	 * possible change in exit_mmap is seen
+	 */
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);
 
 drop_mm:
-	if (mm)
-		mmput(mm);
+	mmdrop(mm);
 put_task:
 	put_task_struct(task);
 	return ret;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-07 21:50 [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
  2021-12-07 21:50 ` [PATCH v3 2/2] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
@ 2021-12-07 22:07 ` Matthew Wilcox
  2021-12-07 22:47   ` Suren Baghdasaryan
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2021-12-07 22:07 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Tue, Dec 07, 2021 at 01:50:30PM -0800, Suren Baghdasaryan wrote:
> - Added a comment for vm_operations_struct::close, documenting restriction for
> taking mmap_lock in the callback, per Michal Hocko and Matthew Wilcox

This should be a separate patch because it stands alone, but ...

>  struct vm_operations_struct {
>  	void (*open)(struct vm_area_struct * area);
> +    /*
> +     * Called with mmap_lock lock held for write from __split_vma and
> +     * remove_vma, therefore should never take that lock.
> +     */

Your whitespace indentation is weird.  And it'd be nice to make this a
kernel-doc comment (I know none of the others are, but that should be
fixed too).  And naming the functions that call it is wrong too.

	/**
	 * @close: Called when the VMA is being removed from the MM.
	 * Context: Caller holds mmap_lock.
	 */


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-07 22:07 ` [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Matthew Wilcox
@ 2021-12-07 22:47   ` Suren Baghdasaryan
  2021-12-07 23:08     ` Suren Baghdasaryan
  0 siblings, 1 reply; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-07 22:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, mhocko, mhocko, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Tue, Dec 7, 2021 at 2:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Dec 07, 2021 at 01:50:30PM -0800, Suren Baghdasaryan wrote:
> > - Added a comment for vm_operations_struct::close, documenting restriction for
> > taking mmap_lock in the callback, per Michal Hocko and Matthew Wilcox
>
> This should be a separate patch because it stands alone, but ...

I thought about it and since it was relevant to the change in
remove_vma locking, I thought it would fit here. However, if you
insist on splitting it, I'll post it as a separate patch. Please let
me know.

>
> >  struct vm_operations_struct {
> >       void (*open)(struct vm_area_struct * area);
> > +    /*
> > +     * Called with mmap_lock lock held for write from __split_vma and
> > +     * remove_vma, therefore should never take that lock.
> > +     */
>
> Your whitespace indentation is weird.  And it'd be nice to make this a
> kernel-doc comment (I know none of the others are, but that should be
> fixed too).  And naming the functions that call it is wrong too.
>
>         /**
>          * @close: Called when the VMA is being removed from the MM.
>          * Context: Caller holds mmap_lock.
>          */

Ack. Will change and include in the next respin once I hear from you
on the preference for a separate patch.
Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-07 22:47   ` Suren Baghdasaryan
@ 2021-12-07 23:08     ` Suren Baghdasaryan
  2021-12-08  8:56       ` Michal Hocko
  2021-12-08 15:01       ` Matthew Wilcox
  0 siblings, 2 replies; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-07 23:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, mhocko, mhocko, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Tue, Dec 7, 2021 at 2:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Dec 7, 2021 at 2:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Dec 07, 2021 at 01:50:30PM -0800, Suren Baghdasaryan wrote:
> > > - Added a comment for vm_operations_struct::close, documenting restriction for
> > > taking mmap_lock in the callback, per Michal Hocko and Matthew Wilcox
> >
> > This should be a separate patch because it stands alone, but ...
>
> I thought about it and since it was relevant to the change in
> remove_vma locking, I thought it would fit here. However, if you
> insist on splitting it, I'll post it as a separate patch. Please let
> me know.
>
> >
> > >  struct vm_operations_struct {
> > >       void (*open)(struct vm_area_struct * area);
> > > +    /*
> > > +     * Called with mmap_lock lock held for write from __split_vma and
> > > +     * remove_vma, therefore should never take that lock.
> > > +     */
> >
> > Your whitespace indentation is weird.  And it'd be nice to make this a
> > kernel-doc comment (I know none of the others are, but that should be
> > fixed too).  And naming the functions that call it is wrong too.
> >
> >         /**
> >          * @close: Called when the VMA is being removed from the MM.
> >          * Context: Caller holds mmap_lock.

BTW, is the caller always required to hold mmap_lock for write or it
*might* hold it?

> >          */
>
> Ack. Will change and include in the next respin once I hear from you
> on the preference for a separate patch.
> Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-07 23:08     ` Suren Baghdasaryan
@ 2021-12-08  8:56       ` Michal Hocko
  2021-12-08 15:01       ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2021-12-08  8:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Tue 07-12-21 15:08:19, Suren Baghdasaryan wrote:
> On Tue, Dec 7, 2021 at 2:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Dec 7, 2021 at 2:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Dec 07, 2021 at 01:50:30PM -0800, Suren Baghdasaryan wrote:
> > > > - Added a comment for vm_operations_struct::close, documenting restriction for
> > > > taking mmap_lock in the callback, per Michal Hocko and Matthew Wilcox
> > >
> > > This should be a separate patch because it stands alone, but ...
> >
> > I thought about it and since it was relevant to the change in
> > remove_vma locking, I thought it would fit here. However, if you
> > insist on splitting it, I'll post it as a separate patch. Please let
> > me know.
> >
> > >
> > > >  struct vm_operations_struct {
> > > >       void (*open)(struct vm_area_struct * area);
> > > > +    /*
> > > > +     * Called with mmap_lock lock held for write from __split_vma and
> > > > +     * remove_vma, therefore should never take that lock.
> > > > +     */
> > >
> > > Your whitespace indentation is weird.  And it'd be nice to make this a
> > > kernel-doc comment (I know none of the others are, but that should be
> > > fixed too).  And naming the functions that call it is wrong too.
> > >
> > >         /**
> > >          * @close: Called when the VMA is being removed from the MM.
> > >          * Context: Caller holds mmap_lock.
> 
> BTW, is the caller always required to hold mmap_lock for write or it
> *might* hold it?

I would go with might

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-07 23:08     ` Suren Baghdasaryan
  2021-12-08  8:56       ` Michal Hocko
@ 2021-12-08 15:01       ` Matthew Wilcox
  2021-12-08 15:51         ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2021-12-08 15:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > >         /**
> > >          * @close: Called when the VMA is being removed from the MM.
> > >          * Context: Caller holds mmap_lock.
> 
> BTW, is the caller always required to hold mmap_lock for write or it
> *might* hold it?

__do_munmap() might hold it for read, thanks to:

        if (downgrade)
                mmap_write_downgrade(mm);

Should probably say:

	* Context: User context.  May sleep.  Caller holds mmap_lock.

I don't think we should burden the implementor of the vm_ops with the
knowledge that the VM chooses to not hold the mmap_lock under certain
circumstances when it doesn't matter whether it's holding the mmap_lock
or not.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 15:01       ` Matthew Wilcox
@ 2021-12-08 15:51         ` Michal Hocko
  2021-12-08 16:05           ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2021-12-08 15:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > >         /**
> > > >          * @close: Called when the VMA is being removed from the MM.
> > > >          * Context: Caller holds mmap_lock.
> > 
> > BTW, is the caller always required to hold mmap_lock for write or it
> > *might* hold it?
> 
> __do_munmap() might hold it for read, thanks to:
> 
>         if (downgrade)
>                 mmap_write_downgrade(mm);
> 
> Should probably say:
> 
> 	* Context: User context.  May sleep.  Caller holds mmap_lock.
> 
> I don't think we should burden the implementor of the vm_ops with the
> knowledge that the VM chooses to not hold the mmap_lock under certain
> circumstances when it doesn't matter whether it's holding the mmap_lock
> or not.

If we document it like that some code might depend on that lock to be
held. I think we only want to document that the holder itself is not
allowed to take mmap sem or a depending lock.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 15:51         ` Michal Hocko
@ 2021-12-08 16:05           ` Matthew Wilcox
  2021-12-08 16:50             ` Suren Baghdasaryan
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2021-12-08 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed, Dec 08, 2021 at 04:51:58PM +0100, Michal Hocko wrote:
> On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> > On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > > >         /**
> > > > >          * @close: Called when the VMA is being removed from the MM.
> > > > >          * Context: Caller holds mmap_lock.
> > > 
> > > BTW, is the caller always required to hold mmap_lock for write or it
> > > *might* hold it?
> > 
> > __do_munmap() might hold it for read, thanks to:
> > 
> >         if (downgrade)
> >                 mmap_write_downgrade(mm);
> > 
> > Should probably say:
> > 
> > 	* Context: User context.  May sleep.  Caller holds mmap_lock.
> > 
> > I don't think we should burden the implementor of the vm_ops with the
> > knowledge that the VM chooses to not hold the mmap_lock under certain
> > circumstances when it doesn't matter whether it's holding the mmap_lock
> > or not.
> 
> If we document it like that some code might depend on that lock to be
> held. I think we only want to document that the holder itself is not
> allowed to take mmap sem or a depending lock.

The only place where we're not currently holding the mmap_lock is at
task exit, where the mmap_lock is effectively held because nobody else
can modify the task's mm.  Besides, Suren is changing that in this patch
series anyway, so it will be always true.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 16:05           ` Matthew Wilcox
@ 2021-12-08 16:50             ` Suren Baghdasaryan
  2021-12-08 19:13               ` Suren Baghdasaryan
  0 siblings, 1 reply; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-08 16:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed, Dec 8, 2021 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Dec 08, 2021 at 04:51:58PM +0100, Michal Hocko wrote:
> > On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> > > On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > > > >         /**
> > > > > >          * @close: Called when the VMA is being removed from the MM.
> > > > > >          * Context: Caller holds mmap_lock.
> > > >
> > > > BTW, is the caller always required to hold mmap_lock for write or it
> > > > *might* hold it?
> > >
> > > __do_munmap() might hold it for read, thanks to:
> > >
> > >         if (downgrade)
> > >                 mmap_write_downgrade(mm);
> > >
> > > Should probably say:
> > >
> > >     * Context: User context.  May sleep.  Caller holds mmap_lock.
> > >
> > > I don't think we should burden the implementor of the vm_ops with the
> > > knowledge that the VM chooses to not hold the mmap_lock under certain
> > > circumstances when it doesn't matter whether it's holding the mmap_lock
> > > or not.
> >
> > If we document it like that some code might depend on that lock to be
> > held. I think we only want to document that the holder itself is not
> > allowed to take mmap sem or a depending lock.
>
> The only place where we're not currently holding the mmap_lock is at
> task exit, where the mmap_lock is effectively held because nobody else
> can modify the task's mm.  Besides, Suren is changing that in this patch
> series anyway, so it will be always true.

Ok, I'll make it a separate patch after the patch that changes
exit_mmap and this statement will become always true. Sounds
reasonable?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 16:50             ` Suren Baghdasaryan
@ 2021-12-08 19:13               ` Suren Baghdasaryan
  2021-12-08 19:22                 ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-08 19:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed, Dec 8, 2021 at 8:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Dec 8, 2021 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Dec 08, 2021 at 04:51:58PM +0100, Michal Hocko wrote:
> > > On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> > > > On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > > > > >         /**
> > > > > > >          * @close: Called when the VMA is being removed from the MM.
> > > > > > >          * Context: Caller holds mmap_lock.
> > > > >
> > > > > BTW, is the caller always required to hold mmap_lock for write or it
> > > > > *might* hold it?
> > > >
> > > > __do_munmap() might hold it for read, thanks to:
> > > >
> > > >         if (downgrade)
> > > >                 mmap_write_downgrade(mm);
> > > >
> > > > Should probably say:
> > > >
> > > >     * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > >
> > > > I don't think we should burden the implementor of the vm_ops with the
> > > > knowledge that the VM chooses to not hold the mmap_lock under certain
> > > > circumstances when it doesn't matter whether it's holding the mmap_lock
> > > > or not.
> > >
> > > If we document it like that some code might depend on that lock to be
> > > held. I think we only want to document that the holder itself is not
> > > allowed to take mmap sem or a depending lock.
> >
> > The only place where we're not currently holding the mmap_lock is at
> > task exit, where the mmap_lock is effectively held because nobody else
> > can modify the task's mm.  Besides, Suren is changing that in this patch
> > series anyway, so it will be always true.
>
> Ok, I'll make it a separate patch after the patch that changes
> exit_mmap and this statement will become always true. Sounds
> reasonable?

Actually, while today vma_ops->close is called with mmap_lock held, I
believe we want this comment to reflect the restrictions on the
callback itself, not on the user. IOW, we want to say that the
callback should not take mmap_lock while the caller might or might not
hold it. If so, I think *might* would make more sense here, like this:

* Context: User context.  May sleep.  Caller might hold mmap_lock.

WDYT?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 19:13               ` Suren Baghdasaryan
@ 2021-12-08 19:22                 ` Matthew Wilcox
  2021-12-08 21:24                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2021-12-08 19:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed, Dec 08, 2021 at 11:13:42AM -0800, Suren Baghdasaryan wrote:
> On Wed, Dec 8, 2021 at 8:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Dec 8, 2021 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Dec 08, 2021 at 04:51:58PM +0100, Michal Hocko wrote:
> > > > On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> > > > > On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > > > > > >         /**
> > > > > > > >          * @close: Called when the VMA is being removed from the MM.
> > > > > > > >          * Context: Caller holds mmap_lock.
> > > > > >
> > > > > > BTW, is the caller always required to hold mmap_lock for write or it
> > > > > > *might* hold it?
> > > > >
> > > > > __do_munmap() might hold it for read, thanks to:
> > > > >
> > > > >         if (downgrade)
> > > > >                 mmap_write_downgrade(mm);
> > > > >
> > > > > Should probably say:
> > > > >
> > > > >     * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > >
> > > > > I don't think we should burden the implementor of the vm_ops with the
> > > > > knowledge that the VM chooses to not hold the mmap_lock under certain
> > > > > circumstances when it doesn't matter whether it's holding the mmap_lock
> > > > > or not.
> > > >
> > > > If we document it like that some code might depend on that lock to be
> > > > held. I think we only want to document that the holder itself is not
> > > > allowed to take mmap sem or a depending lock.
> > >
> > > The only place where we're not currently holding the mmap_lock is at
> > > task exit, where the mmap_lock is effectively held because nobody else
> > > can modify the task's mm.  Besides, Suren is changing that in this patch
> > > series anyway, so it will be always true.
> >
> > Ok, I'll make it a separate patch after the patch that changes
> > exit_mmap and this statement will become always true. Sounds
> > reasonable?
> 
> Actually, while today vma_ops->close is called with mmap_lock held, I
> believe we want this comment to reflect the restrictions on the
> callback itself, not on the user. IOW, we want to say that the
> callback should not take mmap_lock while the caller might or might not
> hold it. If so, I think *might* would make more sense here, like this:
> 
> * Context: User context.  May sleep.  Caller might hold mmap_lock.
> 
> WDYT?

We're documenting the contract between the caller and the callee.
That implies responsibilities on both sides.  For example, we're
placing requirements on the caller that they're not going to tear
down the VMA in interrupt context.  So I preferred what previous-Suren
said to current-Suren, "this statement will become always true".


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
  2021-12-08 19:22                 ` Matthew Wilcox
@ 2021-12-08 21:24                   ` Suren Baghdasaryan
  0 siblings, 0 replies; 13+ messages in thread
From: Suren Baghdasaryan @ 2021-12-08 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, akpm, rientjes, hannes, guro, riel, minchan,
	kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
	luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
	linux-kernel, kernel-team

On Wed, Dec 8, 2021 at 11:22 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Dec 08, 2021 at 11:13:42AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Dec 8, 2021 at 8:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Dec 8, 2021 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Dec 08, 2021 at 04:51:58PM +0100, Michal Hocko wrote:
> > > > > On Wed 08-12-21 15:01:24, Matthew Wilcox wrote:
> > > > > > On Tue, Dec 07, 2021 at 03:08:19PM -0800, Suren Baghdasaryan wrote:
> > > > > > > > >         /**
> > > > > > > > >          * @close: Called when the VMA is being removed from the MM.
> > > > > > > > >          * Context: Caller holds mmap_lock.
> > > > > > >
> > > > > > > BTW, is the caller always required to hold mmap_lock for write or it
> > > > > > > *might* hold it?
> > > > > >
> > > > > > __do_munmap() might hold it for read, thanks to:
> > > > > >
> > > > > >         if (downgrade)
> > > > > >                 mmap_write_downgrade(mm);
> > > > > >
> > > > > > Should probably say:
> > > > > >
> > > > > >     * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > > >
> > > > > > I don't think we should burden the implementor of the vm_ops with the
> > > > > > knowledge that the VM chooses to not hold the mmap_lock under certain
> > > > > > circumstances when it doesn't matter whether it's holding the mmap_lock
> > > > > > or not.
> > > > >
> > > > > If we document it like that some code might depend on that lock to be
> > > > > held. I think we only want to document that the holder itself is not
> > > > > allowed to take mmap sem or a depending lock.
> > > >
> > > > The only place where we're not currently holding the mmap_lock is at
> > > > task exit, where the mmap_lock is effectively held because nobody else
> > > > can modify the task's mm.  Besides, Suren is changing that in this patch
> > > > series anyway, so it will be always true.
> > >
> > > Ok, I'll make it a separate patch after the patch that changes
> > > exit_mmap and this statement will become always true. Sounds
> > > reasonable?
> >
> > Actually, while today vma_ops->close is called with mmap_lock held, I
> > believe we want this comment to reflect the restrictions on the
> > callback itself, not on the user. IOW, we want to say that the
> > callback should not take mmap_lock while the caller might or might not
> > hold it. If so, I think *might* would make more sense here, like this:
> >
> > * Context: User context.  May sleep.  Caller might hold mmap_lock.
> >
> > WDYT?
>
> We're documenting the contract between the caller and the callee.
> That implies responsibilities on both sides.  For example, we're
> placing requirements on the caller that they're not going to tear
> down the VMA in interrupt context.  So I preferred what previous-Suren
> said to current-Suren, "this statement will become always true".
>

previous-Suren posted v4 at
https://lore.kernel.org/all/20211208212211.2860249-1-surenb@google.com
Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-12-08 21:24 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-07 21:50 [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
2021-12-07 21:50 ` [PATCH v3 2/2] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
2021-12-07 22:07 ` [PATCH v3 1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Matthew Wilcox
2021-12-07 22:47   ` Suren Baghdasaryan
2021-12-07 23:08     ` Suren Baghdasaryan
2021-12-08  8:56       ` Michal Hocko
2021-12-08 15:01       ` Matthew Wilcox
2021-12-08 15:51         ` Michal Hocko
2021-12-08 16:05           ` Matthew Wilcox
2021-12-08 16:50             ` Suren Baghdasaryan
2021-12-08 19:13               ` Suren Baghdasaryan
2021-12-08 19:22                 ` Matthew Wilcox
2021-12-08 21:24                   ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).