linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Yu Zhao <yuzhao@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andi Kleen <ak@linux.intel.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
	Jesse Barnes <jsbarnes@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Michael Larabel <Michael@michaellarabel.com>,
	Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org,
	Konstantin Kharlamov <Hi-Angel@yandex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging
Date: Fri, 7 Jan 2022 09:43:49 +0100	[thread overview]
Message-ID: <Ydf9RXPch5ddg/WC@dhcp22.suse.cz> (raw)
In-Reply-To: <Ydde2F4Oi0wKx//y@google.com>

On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
[...]
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index 2db9a1432511..9c7a4fae0661 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -57,6 +57,22 @@ struct oom_control {
> > >  extern struct mutex oom_lock;
> > >  extern struct mutex oom_adj_mutex;
> > >  
> > > +#ifdef CONFIG_MMU
> > > +extern struct task_struct *oom_reaper_list;
> > > +extern struct wait_queue_head oom_reaper_wait;
> > > +
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > +	/* a racy check can be used to reduce the chance of overkilling */
> > > +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > +}
> > > +#else
> > > +static inline bool oom_reaping_in_progress(void)
> > > +{
> > > +	return false;
> > > +}
> > > +#endif
> > 
> > I do not like this. These are internal oom reaper's and no code should
> > really make any decisions based on that. oom_reaping_in_progress is not
> > telling much anyway.
> 
> There is a perfectly legitimate reason for this.
> 
> If there is already a oom kill victim and the oom reaper is making
> progress, the system may still be under memory pressure until the oom
> reaping is done. The page reclaim has two choices in this transient
> state: kill more processes or keep reclaiming (a few more) hot pages.
> 
> The first choice, AKA overkilling, is generally a bad one. The oom
> reaper is single threaded and it can't go faster with additional
> victims. Additional processes are sacrificed for nothing -- this is
> an overcorrection of a system that tries to strike a balance between
> the tendencies to release memory pressure and to improve memory
> utilization.
> 
> > This is a global queue for oom reaper that can
> > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > OOM or memory policy OOM).
> 
> True, but this is a wrong reason to make the conclusion below. Oom
> kill scopes do NOT matter; only the pool the freed memory goes into
> does. And there is only one global pool free pages.
> 
> > Your lru_gen_age_node uses this to decide whether to trigger
> > out_of_memory and that is clearly wrong for the above reasons.
> 
> I hope my explanation above is clear enough. There is nothing wrong
> with the purpose and the usage of oom_reaping_in_progress(), and it
> has been well tested in the Arch Linux Zen kernel.

I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
any base for any decisions in reclaim in other domain (say memcg B or
even a global reclaim). Those are fundamentally different conditions.

> Without it, overkills can be easily reproduced by the following simple
> script. That is additional oom kills happen to processes other than
> "tail".
> 
>   # enable zram
>   while true;
>   do
>       tail /dev/zero
>   done

I would be interested to hear more (care to send oom reports?).

> > out_of_memory is designed to skip over any action if there is an oom
> > victim pending from the oom domain (have a look at oom_evaluate_task).
> 
> Where exactly? Point me to the code please.
> 
> I don't see such a logic inside out_of_memory() or
> oom_evaluate_task(). Currently the only thing that could remotely
> prevent overkills is oom_lock. But it's inadequate.

OK, let me try to exaplain. The protocol is rather convoluted. Once the
oom killer is invoked it choses a victim to kill. oom_evaluate_task will
evaluate _all_ tasks from the oom respective domain (select_bad_process
which distinguishes memcg vs global oom kill and oom_cpuset_eligible for
the cpuset domains). If there is any pre-existing oom victim
(tsk_is_oom_victim) then the scan is aborted and the oom killer bails
out. OOM victim stops being considered as relevant once the oom reaper
manages to release its address space (or give up on the mmap_sem
contention) and sets MMF_OOM_SKIP flag for the mm.

That being said the out_of_memory automatically backs off and relies on
the oom reaper to process its queue.

Does it make more clear for you now?

> This is the entire pipeline:
> low on memory -> out_of_memory() -> oom_reaper() -> free memory
> 
> To avoid overkills, we need to consider the later half of it too.
> oom_reaping_in_progress() is exactly for this purpose.
> 
> > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > +		       unsigned long min_ttl)
> > > +{
> > > +	bool need_aging;
> > > +	long nr_to_scan;
> > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > +	int swappiness = get_swappiness(memcg);
> > > +	DEFINE_MAX_SEQ(lruvec);
> > > +	DEFINE_MIN_SEQ(lruvec);
> > > +
> > > +	if (mem_cgroup_below_min(memcg))
> > > +		return false;
> > 
> > mem_cgroup_below_min requires effective values to be calculated for the
> > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
> 
> I always keep that in mind, and age_lruvec() is called *after*
> mem_cgroup_calculate_protection():

>   balance_pgdat()
>     memcgs_need_aging = 0
>     do {
>       lru_gen_age_node()
>         if (!memcgs_need_aging) {
>             memcgs_need_aging = 1
>             return
>         }
>         age_lruvec()
> 
>       shrink_node_memcgs()
>         mem_cgroup_calculate_protection()
>         lru_gen_shrink_lruvec()
>           if ...
>             memcgs_need_aging = 0
>     } while ...

Uff, this is really subtle. I really think you should be following the
existing pattern when the effective values are calculated right in the
same context as they are evaluated.
-- 
Michal Hocko
SUSE Labs


  reply	other threads:[~2022-01-07  8:43 UTC|newest]

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-04 20:22 [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao
2022-01-04 20:22 ` [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2022-01-05 10:45   ` Will Deacon
2022-01-05 20:47     ` Yu Zhao
2022-01-06 10:30       ` Will Deacon
2022-01-07  7:25         ` Yu Zhao
2022-01-11 14:19           ` Will Deacon
2022-01-11 22:27             ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 2/9] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2022-01-04 21:24   ` Linus Torvalds
2022-01-04 20:22 ` [PATCH v6 3/9] mm/vmscan.c: refactor shrink_node() Yu Zhao
2022-01-04 20:22 ` [PATCH v6 4/9] mm: multigenerational lru: groundwork Yu Zhao
2022-01-04 21:34   ` Linus Torvalds
2022-01-11  8:16   ` Aneesh Kumar K.V
2022-01-12  2:16     ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 5/9] mm: multigenerational lru: mm_struct list Yu Zhao
2022-01-07  9:06   ` Michal Hocko
2022-01-08  0:19     ` Yu Zhao
2022-01-10 15:21       ` Michal Hocko
2022-01-12  8:08         ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 6/9] mm: multigenerational lru: aging Yu Zhao
2022-01-06 16:06   ` Michal Hocko
2022-01-06 21:27     ` Yu Zhao
2022-01-07  8:43       ` Michal Hocko [this message]
2022-01-07 21:12         ` Yu Zhao
2022-01-06 16:12   ` Michal Hocko
2022-01-06 21:41     ` Yu Zhao
2022-01-07  8:55       ` Michal Hocko
2022-01-07  9:00         ` Michal Hocko
2022-01-10  3:58           ` Yu Zhao
2022-01-10 14:37             ` Michal Hocko
2022-01-13  9:43               ` Yu Zhao
2022-01-13 12:02                 ` Michal Hocko
2022-01-19  6:31                   ` Yu Zhao
2022-01-19  9:44                     ` Michal Hocko
2022-01-10 15:01     ` Michal Hocko
2022-01-10 16:01       ` Vlastimil Babka
2022-01-10 16:25         ` Michal Hocko
2022-01-11 23:16       ` Yu Zhao
2022-01-12 10:28         ` Michal Hocko
2022-01-13  9:25           ` Yu Zhao
2022-01-07 13:11   ` Michal Hocko
2022-01-07 23:36     ` Yu Zhao
2022-01-10 15:35       ` Michal Hocko
2022-01-11  1:18         ` Yu Zhao
2022-01-11  9:00           ` Michal Hocko
     [not found]         ` <1641900108.61dd684cb0e59@mail.inbox.lv>
2022-01-11 12:15           ` Michal Hocko
2022-01-11 14:22         ` Alexey Avramov
2022-01-07 14:44   ` Michal Hocko
2022-01-10  4:47     ` Yu Zhao
2022-01-10 10:54       ` Michal Hocko
2022-01-19  7:04         ` Yu Zhao
2022-01-19  9:42           ` Michal Hocko
2022-01-23 21:28             ` Yu Zhao
2022-01-24 14:01               ` Michal Hocko
2022-01-10 16:57   ` Michal Hocko
2022-01-12  1:01     ` Yu Zhao
2022-01-12 10:17       ` Michal Hocko
2022-01-12 23:43         ` Yu Zhao
2022-01-13 11:57           ` Michal Hocko
2022-01-23 21:40             ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 7/9] mm: multigenerational lru: eviction Yu Zhao
2022-01-11 10:37   ` Aneesh Kumar K.V
2022-01-12  8:05     ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 8/9] mm: multigenerational lru: user interface Yu Zhao
2022-01-10 10:27   ` Mike Rapoport
2022-01-12  8:35     ` Yu Zhao
2022-01-12 10:31       ` Michal Hocko
2022-01-12 15:45       ` Mike Rapoport
2022-01-13  9:47         ` Yu Zhao
2022-01-13 10:31   ` Aneesh Kumar K.V
2022-01-13 23:02     ` Yu Zhao
2022-01-14  5:20       ` Aneesh Kumar K.V
2022-01-14  6:50         ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 9/9] mm: multigenerational lru: Kconfig Yu Zhao
2022-01-04 21:39   ` Linus Torvalds
2022-01-04 20:22 ` [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao
2022-01-04 20:30 ` Yu Zhao
2022-01-04 21:43   ` Linus Torvalds
2022-01-05 21:12     ` Yu Zhao
2022-01-07  9:38   ` Michal Hocko
2022-01-07 18:45     ` Yu Zhao
2022-01-10 15:39       ` Michal Hocko
2022-01-10 22:04         ` Yu Zhao
2022-01-10 22:46           ` Jesse Barnes
2022-01-11  1:41             ` Linus Torvalds
2022-01-11 10:40             ` Michal Hocko
2022-01-11  8:41   ` Yu Zhao
2022-01-11  8:53     ` Holger Hoffstätte
2022-01-11  9:26     ` Jan Alexander Steffens (heftig)
2022-01-11 16:04     ` Shuang Zhai
2022-01-12  1:46     ` Suleiman Souhlal
2022-01-12  6:07     ` Sofia Trinh
2022-01-12 16:17       ` Daniel Byrne
2022-01-18  9:21     ` Yu Zhao
2022-01-18  9:36     ` Donald Carr
2022-01-19 20:19     ` Steven Barrett
2022-01-19 22:25     ` Brian Geffon
2022-01-05  2:44 ` Shuang Zhai
2022-01-05  8:55 ` SeongJae Park
2022-01-05 10:53   ` Yu Zhao
2022-01-05 11:25     ` SeongJae Park
2022-01-05 21:06       ` Yu Zhao
2022-01-10 14:49 ` Alexey Avramov
2022-01-11 10:24 ` Alexey Avramov
2022-01-12 20:56 ` Oleksandr Natalenko
2022-01-13  8:59   ` Yu Zhao
2022-01-23  5:43 ` Barry Song
2022-01-25  6:48   ` Yu Zhao
2022-01-28  8:54     ` Barry Song
2022-02-08  9:16       ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ydf9RXPch5ddg/WC@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=Hi-Angel@yandex.ru \
    --cc=Michael@michaellarabel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=jsbarnes@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=page-reclaim@google.com \
    --cc=riel@surriel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).