From: Liam Howlett <liam.howlett@oracle.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"guro@fb.com" <guro@fb.com>,
	"minchan@kernel.org" <minchan@kernel.org>,
	"kirill@shutemov.name" <kirill@shutemov.name>,
	"aarcange@redhat.com" <aarcange@redhat.com>,
	"brauner@kernel.org" <brauner@kernel.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"oleg@redhat.com" <oleg@redhat.com>,
	"david@redhat.com" <david@redhat.com>,
	"jannh@google.com" <jannh@google.com>,
	"shakeelb@google.com" <shakeelb@google.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"jhubbard@nvidia.com" <jhubbard@nvidia.com>,
	"shuah@kernel.org" <shuah@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kernel-team@android.com" <kernel-team@android.com>
Subject: Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
Date: Thu, 19 May 2022 22:34:44 +0000	[thread overview]
Message-ID: <20220519223438.qx35hbpfnnfnpouw@revolver> (raw)
In-Reply-To: <CAJuCfpF4XjBmNGBe57aP0MYQguR4qHqeP=jeG87RcrAV4ODZYg@mail.gmail.com>

* Suren Baghdasaryan <surenb@google.com> [220519 17:44]:
> On Thu, May 19, 2022 at 1:33 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Andrew Morton <akpm@linux-foundation.org> [220519 15:29]:
> > > On Mon, 16 May 2022 00:56:18 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > The primary reason to invoke the oom reaper from the exit_mmap path
> > > > used to be to prevent excessive oom killing if the oom victim's exit
> > > > races with the oom reaper (see [1] for more details). The invocation has
> > > > moved around since then because of the interaction with the munlock
> > > > logic, but the underlying reason has remained the same (see [2]).
> > > >
> > > > The munlock code is no longer a problem since [3], and there shouldn't be
> > > > any blocking operation before the memory is unmapped by exit_mmap, so
> > > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > > with the non-exclusive mmap_sem, and the exclusive one is only required
> > > > when page tables are freed.
> > > >
> > > > Remove the oom_reaper from exit_mmap, which will make the code easier to
> > > > read. This is really unlikely to make any observable difference, although
> > > > some microbenchmarks could benefit from one less branch that needs to be
> > > > evaluated even though it is almost never true.
> > > >
> > >
> > > Liam, this mucks "mm: start tracking VMAs with maple tree" somewhat.
> > >
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > > >     return 0;
> > > >  }
> > > >
> > > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > > -
> > > >  long oom_badness(struct task_struct *p,
> > > >             unsigned long totalpages);
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 313b57d55a63..ded42150e706 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > > >     /* mm's last user has gone, and its about to be pulled down */
> > > >     mmu_notifier_release(mm);
> > > >
> > > > -   if (unlikely(mm_is_oom_victim(mm))) {
> > > > -           /*
> > > > -            * Manually reap the mm to free as much memory as possible.
> > > > -            * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > > -            * this mm from further consideration.  Taking mm->mmap_lock for
> > > > -            * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > > -            * reaper will not run on this mm again after mmap_lock is
> > > > -            * dropped.
> > > > -            *
> > > > -            * Nothing can be holding mm->mmap_lock here and the above call
> > > > -            * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > > -            * __oom_reap_task_mm() will not block.
> > > > -            */
> > > > -           (void)__oom_reap_task_mm(mm);
> > > > -           set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > -   }
> > > > -
> > > > -   mmap_write_lock(mm);
> > > > +   mmap_read_lock(mm);
> > > >     arch_exit_mmap(mm);
> > > >
> > > >     vma = mm->mmap;
> > > >     if (!vma) {
> > > >             /* Can happen if dup_mmap() received an OOM */
> > > > -           mmap_write_unlock(mm);
> > > > +           mmap_read_unlock(mm);
> > > >             return;
> > > >     }
> > > >
> > > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > > >     /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >     /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > >     unmap_vmas(&tlb, vma, 0, -1);
> > > > +   mmap_read_unlock(mm);
> > > > +
> > > > +   /*
> > > > +    * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > > +    * because the memory has been already freed. Do not bother checking
> > > > +    * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > > +    */
> > > > +   set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +
> > > > +   mmap_write_lock(mm);
> > > >     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >     tlb_finish_mmu(&tlb);
> > > >
> > >
> > > I ended up with the below rework of "mm: start tracking VMAs with maple
> > > tree".  Please triple check?
> >
> > One small fix in the first one.  Suren found a race with the oom killer
> > or process_mrelease that needed the linked list head to be cleared here.
> > Please correct me if I am mistaken, Suren?
> >
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >       struct mmu_gather tlb;
> > >       struct vm_area_struct *vma;
> > >       unsigned long nr_accounted = 0;
> > >
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > >       mmap_write_lock(mm);
> > >       arch_exit_mmap(mm);
> > >       vma = mm->mmap;
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > >               mmap_write_unlock(mm);
> > >               return;
> > >       }
> > >
> > >       lru_add_drain();
> > >       flush_cache_mm(mm);
> > >       tlb_gather_mmu_fullmm(&tlb, mm);
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > >
> > >       /*
> > >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >        * because the memory has been already freed. Do not bother checking
> > >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >        */
> > >       set_bit(MMF_OOM_SKIP, &mm->flags);
> > >
> > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > >       /* Walk the list again, actually closing and freeing it. */
> > >       while (vma) {
> > >               if (vma->vm_flags & VM_ACCOUNT)
> > >                       nr_accounted += vma_pages(vma);
> > >               vma = remove_vma(vma);
> > >               cond_resched();
> > >       }
> > >
> > >       trace_exit_mmap(mm);
> > >       __mt_destroy(&mm->mm_mt);
> >
> > +       mm->mmap = NULL;
> 
> That's correct. We need to reset mm->mmap so that the loop in
> __oom_reap_task_mm() stops immediately. However with maple trees I
> believe that loop is different and with an empty tree there would be
> no dereferencing. Liam?

That's correct after "mm: remove the vma linked list".  At this
point we still need it.  The maple tree is tracking the VMAs but it's
not being used yet - it is just checked to match the linked list.

> 
> >
> > >       mmap_write_unlock(mm);
> > >       vm_unacct_memory(nr_accounted);
> > > }
> > >
> > >
> > > And "mm: remove the vma linked list" needed further reworking.  I ended
> > > up with
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >       struct mmu_gather tlb;
> > >       struct vm_area_struct *vma;
> > >       unsigned long nr_accounted = 0;
> > >       MA_STATE(mas, &mm->mm_mt, 0, 0);
> > >       int count = 0;
> > >
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > >       mmap_write_lock(mm);
> > >       arch_exit_mmap(mm);
> > >       vma = mas_find(&mas, ULONG_MAX);
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > >               mmap_write_unlock(mm);
> > >               return;
> > >       }
> > >
> > >       lru_add_drain();
> > >       flush_cache_mm(mm);
> > >       tlb_gather_mmu_fullmm(&tlb, mm);
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > >
> > >       /*
> > >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >        * because the memory has been already freed. Do not bother checking
> > >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >        */
> > >       set_bit(MMF_OOM_SKIP, &mm->flags);
> > >
> > >       free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > >       /*
> > >        * Walk the list again, actually closing and freeing it, with preemption
> > >        * enabled, without holding any MM locks besides the unreachable
> > >        * mmap_write_lock.
> > >        */
> > >       do {
> > >               if (vma->vm_flags & VM_ACCOUNT)
> > >                       nr_accounted += vma_pages(vma);
> > >               remove_vma(vma);
> > >               count++;
> > >               cond_resched();
> > >       } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > >
> > >       BUG_ON(count != mm->map_count);
> > >
> > >       trace_exit_mmap(mm);
> > >       __mt_destroy(&mm->mm_mt);
> > >       mmap_write_unlock(mm);
> > >       vm_unacct_memory(nr_accounted);
> > > }
> >
> > It is worth noting that this drops the mmap_read_lock/unlock before the
> > write locking.  I'm not sure why Suren had it in his patches and I've
> > responded just now asking about it.  It may be an important aspect of
> > what he was planning.
> 
> unmap_vmas() does not require the mmap_write_lock, and doing it under
> read lock protection allows the OOM-killer and process_mrelease to run
> in parallel with exit_mmap. That was the reason I started with
> mmap_read_lock and then switched to mmap_write_lock. If that creates
> issues we should take the mmap_write_lock for the whole duration of
> this call.
> Thanks,
> Suren.
> 
> >
> > Thanks,
> > Liam
> >
> >
> > >
> > >
> > > The mapletree patches remain hidden from mm.git until, I expect, next week.
> > >
> > > Thanks.
> > >
> > >
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >

Thread overview: 16+ messages
2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
2022-05-17  3:13   ` Shuah Khan
2022-05-17  3:12 ` [PATCH v2 1/2] mm: drop oom code from exit_mmap Shuah Khan
2022-05-19 19:29 ` Andrew Morton
2022-05-19 20:33   ` Liam Howlett
2022-05-19 21:43     ` Suren Baghdasaryan
2022-05-19 22:34       ` Liam Howlett [this message]
2022-05-19 20:22 ` Liam Howlett
2022-05-19 21:33   ` Suren Baghdasaryan
2022-05-19 22:14     ` Suren Baghdasaryan
2022-05-19 22:56     ` Liam Howlett
2022-05-19 23:08       ` Suren Baghdasaryan
2022-05-20  7:21     ` Michal Hocko
2022-05-20 15:55       ` Suren Baghdasaryan
2022-05-20 16:17         ` Suren Baghdasaryan
