From: Minchan Kim <minchan@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tim Murray <timmurray@google.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Daniel Colascione <dancol@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Sonny Rao <sonnyrao@google.com>,
	oleksandr@redhat.com, hdanton@sina.com,
	Benoit Lize <lizeb@google.com>,
	Dave Hansen <dave.hansen@intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v5 1/5] mm: introduce MADV_COLD
Date: Tue, 23 Jul 2019 14:47:29 +0900
Message-ID: <20190723054729.GC128252@google.com>
In-Reply-To: <CAJuCfpFqnXshLH=sW4GLEFixjTWNSh0Dap3Qt-E-Ho2uy-R43w@mail.gmail.com>

On Wed, Jul 17, 2019 at 03:14:57PM -0700, Suren Baghdasaryan wrote:
> Hi Minchan,
> Couple comments inline.
> Thanks!
> 
> On Sun, Jul 14, 2019 at 4:34 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > When a process expects no accesses to a certain memory range, it can
> > give a hint to the kernel that the pages can be reclaimed when memory
> > pressure happens but that the data should be preserved for future use.
> > This can reduce workingset eviction and so improve performance.
> >
> > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help the kernel decide which
> > pages to evict early under memory pressure.
> >
> > It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves
> >
> >         active file page -> inactive file LRU
> >         active anon page -> inactive anon LRU
> >
> > Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
> > the inactive file LRU, because MADV_COLD has slightly different semantics.
> > MADV_FREE means it's okay to discard the pages under memory pressure
> > because their content is *garbage*, so freeing them has almost zero
> > overhead: we don't need to swap them out, and a later access causes just
> > a minor fault. Thus, it makes sense to put those freeable pages on the
> > inactive file LRU to compete with other used-once pages. It makes sense
> > from an implementation point of view, too, because the memory is no
> > longer swap-backed until it is re-dirtied. As a bonus, such pages can
> > even be reclaimed on a swapless system. However, MADV_COLD doesn't mean
> > garbage, so reclaiming those pages requires swap-out/in in the end, which
> > costs more. Since VM LRU aging is designed around a cost model, cold
> > anonymous pages are better placed on the inactive anon LRU list, not the
> > file LRU. Furthermore, this helps avoid unnecessary scanning if the
> > system doesn't have a swap device. Let's start the simpler way without
> > adding complexity at this moment. However, keep in mind the caveat that
> > workloads with a lot of page cache are likely to ignore MADV_COLD on
> > anonymous memory because we rarely age the anonymous LRU lists.
> >
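The LRU placement described above can be pictured with a small toy model in userspace C. This is only a sketch under stated assumptions: the `toy_*` names, the enum, and the `young` flag are hypothetical stand-ins for illustration and are not kernel structures or the patch's implementation. It models just the two moves listed in the description, plus the MADV_FREE case it contrasts them with.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the four LRU lists named in the description. */
enum toy_lru {
	TOY_INACTIVE_ANON,
	TOY_ACTIVE_ANON,
	TOY_INACTIVE_FILE,
	TOY_ACTIVE_FILE,
};

/* Hypothetical page record: which list it is on and whether it looks young. */
struct toy_page {
	enum toy_lru lru;
	bool young;
};

/*
 * MADV_COLD as described: demote an active page to the inactive list of
 * the same type (anon stays anon, file stays file) and clear "young".
 */
static void toy_madv_cold(struct toy_page *p)
{
	if (p->lru == TOY_ACTIVE_ANON)
		p->lru = TOY_INACTIVE_ANON;
	else if (p->lru == TOY_ACTIVE_FILE)
		p->lru = TOY_INACTIVE_FILE;
	p->young = false;
}

/*
 * MADV_FREE as contrasted in the description: an active anonymous page
 * is moved to the inactive *file* LRU because its content is garbage.
 */
static void toy_madv_free(struct toy_page *p)
{
	if (p->lru == TOY_ACTIVE_ANON)
		p->lru = TOY_INACTIVE_FILE;
}
```

The point of the contrast is the anon case: the same active anonymous page ends up on different lists depending on the hint, because only one of the hints allows the content to be thrown away.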
> > * man-page material
> >
> > MADV_COLD (since Linux x.x)
> >
> > Pages in the specified regions will be treated as less-recently-accessed
> > compared to pages in the system with similar access frequencies.
> > In contrast to MADV_FREE, the contents of the region are preserved
> > regardless of subsequent writes to pages.
> >
> > MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
> > pages.
> >
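From userspace, the man-page text above might translate into something like the following sketch. The helper names `hint_cold()` and `cold_preserves_contents()` are hypothetical. The fallback value 20 for MADV_COLD is an assumption taken from mainline headers that carry the hint; this v5 series proposes 5, so the fallback would need adjusting on a kernel built from exactly this patch. On kernels that do not know the hint, madvise(2) is expected to fail (typically with EINVAL), which the sketch treats as harmless.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: value from mainline headers; this v5 patch proposes 5. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: tell the kernel [addr, addr + len) is cold.
 * Returns 0 if the hint was accepted, otherwise the errno value.
 */
static int hint_cold(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLD) == 0 ? 0 : errno;
}

/*
 * Hypothetical demo: fill an anonymous mapping, mark it cold, and check
 * that the data is still there. Returns 1 if the contents survived,
 * 0 if they did not, and -1 if the mapping could not be created.
 */
static int cold_preserves_contents(size_t npages)
{
	size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ok = 1;

	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xAB, len);

	/* A failed hint is harmless: the pages simply stay where they are. */
	(void)hint_cold(buf, len);

	/* Unlike MADV_FREE, the contents must be preserved. */
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != 0xAB) {
			ok = 0;
			break;
		}
	}

	munmap(buf, len);
	return ok;
}
```

Calling `cold_preserves_contents(16)` from a small `main()` should return 1 whether or not the running kernel supports the hint, since the hint only affects reclaim order and never the data.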
> > * v2
> >  * add a warning about workloads with lots of page cache - mhocko
> >  * add man page stuff - dave
> >
> > * v1
> >  * remove page_mapcount filter - hannes, mhocko
> >  * remove idle page handling - joelaf
> >
> > * RFCv2
> >  * add more description - mhocko
> >
> > * RFCv1
> >  * renaming from MADV_COOL to MADV_COLD - hannes
> >
> > * internal review
> >  * use clear_page_young in deactivate_page - joelaf
> >  * Revise the description - surenb
> >  * Renaming from MADV_WARM to MADV_COOL - surenb
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/internal.h                          |   2 +-
> >  mm/madvise.c                           | 180 ++++++++++++++++++++++++-
> >  mm/oom_kill.c                          |   2 +-
> >  mm/swap.c                              |  42 ++++++
> >  6 files changed, 224 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index de2c67a33b7e..0ce997edb8bb 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > +extern void deactivate_page(struct page *page);
> >  extern void mark_page_lazyfree(struct page *page);
> >  extern void swap_setup(void);
> >
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 63b1f506ea67..ef8a56927b12 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -45,6 +45,7 @@
> >  #define MADV_SEQUENTIAL        2               /* expect sequential page references */
> >  #define MADV_WILLNEED  3               /* will need these pages */
> >  #define MADV_DONTNEED  4               /* don't need these pages */
> > +#define MADV_COLD      5               /* deactivatie these pages */
> 
> s/deactivatie/deactivate

Fixed.

> 
> >
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE      8               /* free pages only if memory pressure */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index f53a14d67538..c61b215ff265 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -39,7 +39,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >                 unsigned long floor, unsigned long ceiling);
> >
> > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> >  {
> >         return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> >  }
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 968df3aa069f..bae0055f9724 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_REMOVE:
> >         case MADV_WILLNEED:
> >         case MADV_DONTNEED:
> > +       case MADV_COLD:
> >         case MADV_FREE:
> >                 return 0;
> >         default:
> > @@ -307,6 +308,178 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >         return 0;
> >  }
> >
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +                               unsigned long end, struct mm_walk *walk)
> > +{
> > +       struct mmu_gather *tlb = walk->private;
> > +       struct mm_struct *mm = tlb->mm;
> > +       struct vm_area_struct *vma = walk->vma;
> > +       pte_t *orig_pte, *pte, ptent;
> > +       spinlock_t *ptl;
> > +       struct page *page;
> > +       unsigned long next;
> > +
> > +       next = pmd_addr_end(addr, end);
> > +       if (pmd_trans_huge(*pmd)) {
> > +               pmd_t orig_pmd;
> > +
> > +               tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > +               ptl = pmd_trans_huge_lock(pmd, vma);
> > +               if (!ptl)
> > +                       return 0;
> > +
> > +               orig_pmd = *pmd;
> > +               if (is_huge_zero_pmd(orig_pmd))
> > +                       goto huge_unlock;
> > +
> > +               if (unlikely(!pmd_present(orig_pmd))) {
> > +                       VM_BUG_ON(thp_migration_supported() &&
> > +                                       !is_pmd_migration_entry(orig_pmd));
> > +                       goto huge_unlock;
> > +               }
> > +
> > +               page = pmd_page(orig_pmd);
> > +               if (next - addr != HPAGE_PMD_SIZE) {
> > +                       int err;
> > +
> > +                       if (page_mapcount(page) != 1)
> > +                               goto huge_unlock;
> > +
> > +                       get_page(page);
> > +                       spin_unlock(ptl);
> > +                       lock_page(page);
> > +                       err = split_huge_page(page);
> > +                       unlock_page(page);
> > +                       put_page(page);
> > +                       if (!err)
> > +                               goto regular_page;
> > +                       return 0;
> > +               }
> > +
> > +               if (pmd_young(orig_pmd)) {
> > +                       pmdp_invalidate(vma, addr, pmd);
> > +                       orig_pmd = pmd_mkold(orig_pmd);
> > +
> > +                       set_pmd_at(mm, addr, pmd, orig_pmd);
> > +                       tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > +               }
> > +
> > +               test_and_clear_page_young(page);
> > +               deactivate_page(page);
> > +huge_unlock:
> > +               spin_unlock(ptl);
> > +               return 0;
> > +       }
> > +
> > +       if (pmd_trans_unstable(pmd))
> > +               return 0;
> > +
> > +regular_page:
> > +       tlb_change_page_size(tlb, PAGE_SIZE);
> > +       orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +       flush_tlb_batched_pending(mm);
> > +       arch_enter_lazy_mmu_mode();
> > +       for (; addr < end; pte++, addr += PAGE_SIZE) {
> > +               ptent = *pte;
> > +
> > +               if (pte_none(ptent))
> > +                       continue;
> > +
> > +               if (!pte_present(ptent))
> > +                       continue;
> > +
> > +               page = vm_normal_page(vma, addr, ptent);
> > +               if (!page)
> > +                       continue;
> > +
> > +               /*
> > +                * Creating a THP is expensive, so split it only if we
> > +                * are sure it's worth it: split it if we are the only owner.
> > +                */
> > +               if (PageTransCompound(page)) {
> > +                       if (page_mapcount(page) != 1)
> > +                               break;
> > +                       get_page(page);
> > +                       if (!trylock_page(page)) {
> > +                               put_page(page);
> > +                               break;
> > +                       }
> > +                       pte_unmap_unlock(orig_pte, ptl);
> > +                       if (split_huge_page(page)) {
> > +                               unlock_page(page);
> > +                               put_page(page);
> > +                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +                               break;
> > +                       }
> > +                       unlock_page(page);
> > +                       put_page(page);
> > +                       pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +                       pte--;
> > +                       addr -= PAGE_SIZE;
> > +                       continue;
> > +               }
> > +
> > +               VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > +
> > +               if (pte_young(ptent)) {
> > +                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +                                                       tlb->fullmm);
> > +                       ptent = pte_mkold(ptent);
> > +                       set_pte_at(mm, addr, pte, ptent);
> > +                       tlb_remove_tlb_entry(tlb, pte, addr);
> > +               }
> > +
> > +               /*
> > +                * We are deactivating the page to accelerate its reclaim.
> > +                * The VM can't reclaim the page unless we clear PG_young.
> > +                * As a side effect, this confuses idle-page tracking,
> > +                * which will miss the recent reference history.
> > +                */
> > +               test_and_clear_page_young(page);
> > +               deactivate_page(page);
> > +       }
> > +
> > +       arch_enter_lazy_mmu_mode();
> 
> Did you mean to say arch_leave_lazy_mmu_mode() ?

Oops, Fixed.
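
For anyone following along, the pairing problem Suren spotted can be modelled in a few lines of userspace C. This is a toy sketch only: the `toy_*` functions and the depth counter are hypothetical and say nothing about how any architecture implements lazy MMU mode. It just shows that calling enter twice leaves the walk unbalanced, while enter followed by leave does not.

```c
#include <assert.h>

/* Toy counter standing in for "are we inside lazy MMU mode?". */
static int toy_lazy_mmu_depth;

static void toy_enter_lazy_mmu_mode(void)
{
	toy_lazy_mmu_depth++;
}

static void toy_leave_lazy_mmu_mode(void)
{
	toy_lazy_mmu_depth--;
}

/*
 * Hypothetical PTE walk: enter lazy mode, visit nr_ptes entries, then
 * finish with either leave (use_leave != 0, the fixed code) or a second
 * enter (use_leave == 0, the typo in v5). Returns the depth left behind;
 * 0 means the walk is balanced.
 */
static int toy_walk(int nr_ptes, int use_leave)
{
	toy_lazy_mmu_depth = 0;

	toy_enter_lazy_mmu_mode();
	for (int i = 0; i < nr_ptes; i++) {
		/* batched PTE updates would happen here */
	}

	if (use_leave)
		toy_leave_lazy_mmu_mode();
	else
		toy_enter_lazy_mmu_mode();

	return toy_lazy_mmu_depth;
}
```

With the typo, `toy_walk(8, 0)` returns 2 (never left), and with the fix, `toy_walk(8, 1)` returns 0.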

Thanks, Suren!
