linux-kernel.vger.kernel.org archive mirror
From: Yu Zhao <yuzhao@google.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Rik van Riel <riel@surriel.com>,
	linux-mm@kvack.org, Alex Shi <alex.shi@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Yang Shi <shy828301@gmail.com>,
	linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
Date: Tue, 16 Mar 2021 01:56:37 -0600
Message-ID: <YFBktbCH9JFcT0rL@google.com>
In-Reply-To: <871rcfzjg0.fsf@yhuang6-desk1.ccr.corp.intel.com>

On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> Rik van Riel <riel@surriel.com> writes:
> >> 
> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >
> >> >> +/*
> >> >> + * After pages are faulted in, they become the youngest generation. They must
> >> >> + * go through aging process twice before they can be evicted. After first scan,
> >> >> + * their accessed bit set during initial faults are cleared and they become the
> >> >> + * second youngest generation. And second scan makes sure they haven't been used
> >> >> + * since the first.
> >> >> + */
> >> >
> >> > I have to wonder if the reductions in OOM kills and
> >> > low-memory tab discards are due to this aging policy
> >> > change, rather than to the switch to virtual scanning.
> >
> > There are no policy changes per se. The current page reclaim also
> > scans a faulted-in page at least twice before it can reclaim it.
> > That said, the new aging yields a better overall result because it
> > discovers every page that has been referenced since the last scan,
> > in addition to what Ying has mentioned. The current page scan
> > stops once it finds enough candidates, which may seem more
> > efficient, but it pays the price of not finding the best ones.
> >
> >> If my understanding were correct, the temperature of the processes is
> >> considered in addition to that of the individual pages.  That is, the
> >> pages of the processes that haven't been scheduled after the previous
> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >
> > Yes, that's correct.
> >
> >> If so, how about just taking advantage of that information for OOM killing
> >> and page reclaiming?  For example, if a process hasn't been scheduled
> >> for a long time, just reclaim its private pages.
> >
> > This is how it works. Pages that haven't been scanned grow older
> > automatically because those that have been scanned will be tagged with
> > younger generation numbers. Eviction does bucket sort based on
> > generation numbers and attacks the oldest.
> 
> Sorry, my original wording was misleading.  What I wanted to ask was
> whether the following is good enough:
> 
> - Do not change the core algorithm of current page reclaiming.
> 
> - Add some new logic to reclaim the process private pages regardless of
>   the Accessed bits if the processes have not been scheduled for a long
>   enough time.  This can be done before the normal page reclaiming.

This is a good idea, which is being used on Android and Chrome OS. We
call it per-process reclaim, and I've mentioned it here:
https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
  On Android, our most advanced simulation that generates memory
  pressure from realistic user behavior shows 18% fewer low-memory
  kills, which in turn reduces cold starts by 16%. This is on top of
  per-process reclaim, a predecessor of ``MADV_COLD`` and
  ``MADV_PAGEOUT``, against background apps.

The patches landed not long ago :) See mm/madvise.c.
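For readers who have not used these hints, here is a minimal user-space sketch, assuming Linux 5.4 or later (where MADV_COLD and MADV_PAGEOUT were added). It is not code from the per-process reclaim patches: pageout_demo() and try_hint() are hypothetical helper names. The sketch maps and touches an anonymous region, asks the kernel to deactivate it and then reclaim it, and checks that the data is still there. Older or sandboxed kernels may reject the advice values with EINVAL, so failures are printed rather than treated as fatal; hinting another process's memory would go through process_madvise(2) instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for C library headers that predate Linux 5.4 (uapi values). */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* Issue one hint and report the result instead of failing, since older or
 * sandboxed kernels may reject these advice values. */
static void try_hint(void *addr, size_t len, int advice, const char *name)
{
	if (madvise(addr, len, advice) == 0)
		printf("%s: ok\n", name);
	else
		printf("%s: %s\n", name, strerror(errno));
}

/* Hypothetical demo: map and touch npages anonymous pages, hint them cold
 * and then out, and return 1 if the data is still intact, 0 if not, or -1
 * if the mapping could not be created. */
int pageout_demo(size_t npages)
{
	size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xab, len);                           /* fault the pages in */
	try_hint(buf, len, MADV_COLD, "MADV_COLD");       /* deactivate, no I/O */
	try_hint(buf, len, MADV_PAGEOUT, "MADV_PAGEOUT"); /* reclaim now */

	/* Reclaim must not lose data: swapped-out pages fault back in. */
	int intact = buf[0] == 0xab && buf[len - 1] == 0xab;
	munmap(buf, len);
	return intact;
}
```

Calling pageout_demo(64) from main() should return 1. With no swap configured, the anonymous pages are expected to simply stay resident.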

> So this is a small-step improvement to the current page reclaiming
> algorithm that takes advantage of the scheduler information.  It's
> clearly not as sophisticated as your new algorithm; for example, the cold
> pages in the hot processes will not be reclaimed in this stage.  But it
> can reduce the overhead of scanning too.

The general problems with the direction of per-process reclaim:
  1) we can't find the coldest pages, as you have mentioned.
  2) we can't reach file pages accessed via file descriptors only,
  especially those caching config files that were read only once.
  3) we can't reclaim LRU pages and slab objects proportionally, and
  therefore we leave many stale slab objects behind.
  4) we have to be proactive, as you suggested (once again, you were
  right), and this has a serious problem: clients' battery life can
  be affected.
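To make point 1) concrete, here is a toy model in C, not code from this series. Each page carries a hypothetical owner and a generation number (0 = oldest). Reclaim either takes the oldest generations first across all processes, in the spirit of the generation-based eviction described earlier in this thread, or takes pages from one idle process regardless of their age. struct toy_page, TOY_NR_GENS, select_oldest() and select_process() are illustrative names only, and a real implementation would keep one list per generation rather than rescanning an array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_GENS 4	/* hypothetical: generation 0 is the oldest */

struct toy_page {
	int owner;	/* hypothetical process id */
	unsigned gen;	/* generation number, 0..TOY_NR_GENS-1 */
};

/* Global selection: visit generations from oldest to youngest and take
 * pages from any process. Writes up to nr_to_reclaim page indices to out[]
 * and returns how many were written. */
size_t select_oldest(const struct toy_page *pages, size_t n,
		     size_t nr_to_reclaim, size_t *out)
{
	size_t taken = 0;

	for (unsigned gen = 0; gen < TOY_NR_GENS && taken < nr_to_reclaim; gen++)
		for (size_t i = 0; i < n && taken < nr_to_reclaim; i++)
			if (pages[i].gen == gen)
				out[taken++] = i;
	return taken;
}

/* Per-process selection: take pages of one idle process, whatever their
 * generation. Same output convention as select_oldest(). */
size_t select_process(const struct toy_page *pages, size_t n, int owner,
		      size_t nr_to_reclaim, size_t *out)
{
	size_t taken = 0;

	for (size_t i = 0; i < n && taken < nr_to_reclaim; i++)
		if (pages[i].owner == owner)
			out[taken++] = i;
	return taken;
}
```

With a hot process that still holds two generation-0 pages and an idle process whose pages are in generation 1, select_oldest() picks the hot process's generation-0 pages, while select_process() on the idle process picks its generation-1 pages and leaves the colder ones behind.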

The scanning overhead is only one of the two major problems of the
current page reclaim. The other problem is the coarse granularity of
the active/inactive list sizes. We stopped using them in making job
scheduling decisions a long time ago. I know another large internet
company has adopted a similar approach to ours, and I'm wondering how
everybody else is coping with the discrepancies in those counters.
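One way to see the granularity issue, under assumptions: the active and inactive list sizes give a single split, while per-generation page counts can answer "how much memory has not been referenced in the last k aging passes" for every k. The histogram layout (index 0 = youngest) and the cold_pages() helper below are hypothetical and for illustration only, not an interface from this series.

```c
#include <assert.h>

/* Hypothetical histogram: nr_pages[age] is the number of pages that are
 * `age` aging passes old (age 0 = youngest). Returns the number of pages
 * at least min_age passes old, i.e. not referenced in that many passes. */
unsigned long cold_pages(const unsigned long *nr_pages, unsigned nr_gens,
			 unsigned min_age)
{
	unsigned long sum = 0;

	for (unsigned age = min_age; age < nr_gens; age++)
		sum += nr_pages[age];
	return sum;
}
```

For counts {100, 50, 30, 20}, cold_pages() returns 50 for min_age = 2 and 20 for min_age = 3, whereas two list sizes offer only one such threshold.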

> All in all, some of your ideas may help the original LRU algorithm too.
> Or some can be experimented with without replacing the original algorithm.
> 
> But from another point of view, your solution can be seen as a kind of
> improvement on top of the original LRU algorithm too.  It moves the
> recently accessed pages to kind of multiple active lists based on
> scanning page tables directly (instead of reversely).

We hope this series can be a framework or an infrastructure flexible
enough that people can build their complex use cases upon it, e.g.,
proactive reclaim (machine-wide, not per process), cold memory
estimation (for job scheduling), and AEP demotion. Specifically, we want
people to use it with what you and Dave are working on here:
https://patchwork.kernel.org/project/linux-mm/cover/20210304235949.7922C1C3@viggo.jf.intel.com/
