All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Alexey Avramov <hakavlad@inbox.lv>
Cc: Yu Zhao <yuzhao@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andi Kleen <ak@linux.intel.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
	Jesse Barnes <jsbarnes@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Michael Larabel <Michael@michaellarabel.com>,
	Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org,
	Konstantin Kharlamov <Hi-Angel@yandex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging
Date: Tue, 11 Jan 2022 13:15:50 +0100	[thread overview]
Message-ID: <Yd109jeRllJbjV9o@dhcp22.suse.cz> (raw)
In-Reply-To: <1641900108.61dd684cb0e59@mail.inbox.lv>

On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why an userspace based trashing
> > detection cannot be used for those.
> 
> Firsly,
> because this is the task of the kernel, not the user space.
> Memory is managed by the kernel, not by the user space.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot into process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
> and depends on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. Read through them and my understanding is that
most of those are very specific to the tool used and nothing really
fundamental because of lack of kernel support.

> One of the biggest challenges for userspace oom-killers is to potentially
> function under intense memory pressure and are prone to getting stuck in
> memory reclaim themselves [4].

This one is more interesting and the truth is that handling the complete
OOM situation from the userspace is really tricky. Especially when with
a more complex oom decision policy. In the past we have discussed
potential ways to implement a oom kill policy be kernel modules or eBPF.
Without anybody following up on that.

But I suspect you are mixing up two things here. One of them is out
of memory situation where no memory can be reclaimed and allocated.

The other is one where the memory can be reclaimed, a progress is made,
but that leads to a trashing when the most of the time is spent on
refaulting a memory reclaimed shortly before.

The first one is addressed by the global oom killer and it tries to
be really conservative as much as possible because this is a very
disruptive operation. But the later one is more complex and a proper
handling really depends on the particular workload to be handled
properly because it is more of a QoS than an emergency action to keep
the system alive.

There are workloads which prefer a temporary trashing over its working
set during a peak memory demand rather than an OOM kill because way too
much work would be thrown away. On the other side workloads that are
latency sensitive can see even the direct reclaim as a runtime visible
problem.

I hope you can imagine there is a really large gap between those
two cases and no simple solution can be applied to the whole
range. Therefore we have PSI and refault stats exported to the userspace
so that a workload specific policy can be implemented there.

If userspace has hard time to use that data and action upon then let's
talk about specifics. For the most steady trashing situations I have
seen the userspace with mlocked memory and the code can make a forward
progress and mediate the situation.

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/
-- 
Michal Hocko
SUSE Labs

WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@suse.com>
To: Alexey Avramov <hakavlad@inbox.lv>
Cc: Yu Zhao <yuzhao@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andi Kleen <ak@linux.intel.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
	Jesse Barnes <jsbarnes@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Michael Larabel <Michael@michaellarabel.com>,
	Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org,
	Konstantin Kharlamov <Hi-Angel@yandex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging
Date: Tue, 11 Jan 2022 13:15:50 +0100	[thread overview]
Message-ID: <Yd109jeRllJbjV9o@dhcp22.suse.cz> (raw)
In-Reply-To: <1641900108.61dd684cb0e59@mail.inbox.lv>

On Tue 11-01-22 11:21:48, Alexey Avramov wrote:
> > I do not really see any arguments why an userspace based trashing
> > detection cannot be used for those.
> 
> Firsly,
> because this is the task of the kernel, not the user space.
> Memory is managed by the kernel, not by the user space.
> The absence of such a mechanism in the kernel is a fundamental problem.
> The userspace tools are ugly hacks:
> some of them consume a lot of CPU [1],
> some of them consume a lot of memory [2],
> some of them cannot into process_mrelease() (earlyoom, nohang),
> some of them kill only the whole cgroup (systemd-oomd, oomd) [3]
> and depends on systemd and cgroup_v2 (oomd, systemd-oomd).

Thanks for those links. Read through them and my understanding is that
most of those are very specific to the tool used and nothing really
fundamental because of lack of kernel support.

> One of the biggest challenges for userspace oom-killers is to potentially
> function under intense memory pressure and are prone to getting stuck in
> memory reclaim themselves [4].

This one is more interesting and the truth is that handling the complete
OOM situation from the userspace is really tricky. Especially when with
a more complex oom decision policy. In the past we have discussed
potential ways to implement a oom kill policy be kernel modules or eBPF.
Without anybody following up on that.

But I suspect you are mixing up two things here. One of them is out
of memory situation where no memory can be reclaimed and allocated.

The other is one where the memory can be reclaimed, a progress is made,
but that leads to a trashing when the most of the time is spent on
refaulting a memory reclaimed shortly before.

The first one is addressed by the global oom killer and it tries to
be really conservative as much as possible because this is a very
disruptive operation. But the later one is more complex and a proper
handling really depends on the particular workload to be handled
properly because it is more of a QoS than an emergency action to keep
the system alive.

There are workloads which prefer a temporary trashing over its working
set during a peak memory demand rather than an OOM kill because way too
much work would be thrown away. On the other side workloads that are
latency sensitive can see even the direct reclaim as a runtime visible
problem.

I hope you can imagine there is a really large gap between those
two cases and no simple solution can be applied to the whole
range. Therefore we have PSI and refault stats exported to the userspace
so that a workload specific policy can be implemented there.

If userspace has hard time to use that data and action upon then let's
talk about specifics. For the most steady trashing situations I have
seen the userspace with mlocked memory and the code can make a forward
progress and mediate the situation.

[...]

> [1] https://github.com/facebookincubator/oomd/issues/79
> [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage
> [3] https://github.com/facebookincubator/oomd/issues/125
> [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/
> [5] https://github.com/zen-kernel/zen-kernel/issues/223
> [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2
> [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/
-- 
Michal Hocko
SUSE Labs

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2022-01-11 12:15 UTC|newest]

Thread overview: 223+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-04 20:22 [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao
2022-01-04 20:22 ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-05 10:45   ` Will Deacon
2022-01-05 10:45     ` Will Deacon
2022-01-05 20:47     ` Yu Zhao
2022-01-05 20:47       ` Yu Zhao
2022-01-06 10:30       ` Will Deacon
2022-01-06 10:30         ` Will Deacon
2022-01-07  7:25         ` Yu Zhao
2022-01-07  7:25           ` Yu Zhao
2022-01-11 14:19           ` Will Deacon
2022-01-11 14:19             ` Will Deacon
2022-01-11 22:27             ` Yu Zhao
2022-01-11 22:27               ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 2/9] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-04 21:24   ` Linus Torvalds
2022-01-04 21:24     ` Linus Torvalds
2022-01-04 20:22 ` [PATCH v6 3/9] mm/vmscan.c: refactor shrink_node() Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 4/9] mm: multigenerational lru: groundwork Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-04 21:34   ` Linus Torvalds
2022-01-04 21:34     ` Linus Torvalds
2022-01-11  8:16   ` Aneesh Kumar K.V
2022-01-11  8:16     ` Aneesh Kumar K.V
2022-01-12  2:16     ` Yu Zhao
2022-01-12  2:16       ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 5/9] mm: multigenerational lru: mm_struct list Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-07  9:06   ` Michal Hocko
2022-01-07  9:06     ` Michal Hocko
2022-01-08  0:19     ` Yu Zhao
2022-01-08  0:19       ` Yu Zhao
2022-01-10 15:21       ` Michal Hocko
2022-01-10 15:21         ` Michal Hocko
2022-01-12  8:08         ` Yu Zhao
2022-01-12  8:08           ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 6/9] mm: multigenerational lru: aging Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-06 16:06   ` Michal Hocko
2022-01-06 16:06     ` Michal Hocko
2022-01-06 21:27     ` Yu Zhao
2022-01-06 21:27       ` Yu Zhao
2022-01-07  8:43       ` Michal Hocko
2022-01-07  8:43         ` Michal Hocko
2022-01-07 21:12         ` Yu Zhao
2022-01-07 21:12           ` Yu Zhao
2022-01-06 16:12   ` Michal Hocko
2022-01-06 16:12     ` Michal Hocko
2022-01-06 21:41     ` Yu Zhao
2022-01-06 21:41       ` Yu Zhao
2022-01-07  8:55       ` Michal Hocko
2022-01-07  8:55         ` Michal Hocko
2022-01-07  9:00         ` Michal Hocko
2022-01-07  9:00           ` Michal Hocko
2022-01-10  3:58           ` Yu Zhao
2022-01-10  3:58             ` Yu Zhao
2022-01-10 14:37             ` Michal Hocko
2022-01-10 14:37               ` Michal Hocko
2022-01-13  9:43               ` Yu Zhao
2022-01-13  9:43                 ` Yu Zhao
2022-01-13 12:02                 ` Michal Hocko
2022-01-13 12:02                   ` Michal Hocko
2022-01-19  6:31                   ` Yu Zhao
2022-01-19  6:31                     ` Yu Zhao
2022-01-19  9:44                     ` Michal Hocko
2022-01-19  9:44                       ` Michal Hocko
2022-01-10 15:01     ` Michal Hocko
2022-01-10 15:01       ` Michal Hocko
2022-01-10 16:01       ` Vlastimil Babka
2022-01-10 16:01         ` Vlastimil Babka
2022-01-10 16:25         ` Michal Hocko
2022-01-10 16:25           ` Michal Hocko
2022-01-11 23:16       ` Yu Zhao
2022-01-11 23:16         ` Yu Zhao
2022-01-12 10:28         ` Michal Hocko
2022-01-12 10:28           ` Michal Hocko
2022-01-13  9:25           ` Yu Zhao
2022-01-13  9:25             ` Yu Zhao
2022-01-07 13:11   ` Michal Hocko
2022-01-07 13:11     ` Michal Hocko
2022-01-07 23:36     ` Yu Zhao
2022-01-07 23:36       ` Yu Zhao
2022-01-10 15:35       ` Michal Hocko
2022-01-10 15:35         ` Michal Hocko
2022-01-11  1:18         ` Yu Zhao
2022-01-11  1:18           ` Yu Zhao
2022-01-11  9:00           ` Michal Hocko
2022-01-11  9:00             ` Michal Hocko
     [not found]         ` <1641900108.61dd684cb0e59@mail.inbox.lv>
2022-01-11 12:15           ` Michal Hocko [this message]
2022-01-11 12:15             ` Michal Hocko
2022-01-13 17:00             ` Alexey Avramov
2022-01-13 17:00               ` Alexey Avramov
2022-01-11 14:22         ` Alexey Avramov
2022-01-11 14:22           ` Alexey Avramov
2022-01-07 14:44   ` Michal Hocko
2022-01-07 14:44     ` Michal Hocko
2022-01-10  4:47     ` Yu Zhao
2022-01-10  4:47       ` Yu Zhao
2022-01-10 10:54       ` Michal Hocko
2022-01-10 10:54         ` Michal Hocko
2022-01-19  7:04         ` Yu Zhao
2022-01-19  7:04           ` Yu Zhao
2022-01-19  9:42           ` Michal Hocko
2022-01-19  9:42             ` Michal Hocko
2022-01-23 21:28             ` Yu Zhao
2022-01-23 21:28               ` Yu Zhao
2022-01-24 14:01               ` Michal Hocko
2022-01-24 14:01                 ` Michal Hocko
2022-01-10 16:57   ` Michal Hocko
2022-01-10 16:57     ` Michal Hocko
2022-01-12  1:01     ` Yu Zhao
2022-01-12  1:01       ` Yu Zhao
2022-01-12 10:17       ` Michal Hocko
2022-01-12 10:17         ` Michal Hocko
2022-01-12 23:43         ` Yu Zhao
2022-01-12 23:43           ` Yu Zhao
2022-01-13 11:57           ` Michal Hocko
2022-01-13 11:57             ` Michal Hocko
2022-01-23 21:40             ` Yu Zhao
2022-01-23 21:40               ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 7/9] mm: multigenerational lru: eviction Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-11 10:37   ` Aneesh Kumar K.V
2022-01-11 10:37     ` Aneesh Kumar K.V
2022-01-12  8:05     ` Yu Zhao
2022-01-12  8:05       ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 8/9] mm: multigenerational lru: user interface Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-10 10:27   ` Mike Rapoport
2022-01-10 10:27     ` Mike Rapoport
2022-01-12  8:35     ` Yu Zhao
2022-01-12  8:35       ` Yu Zhao
2022-01-12 10:31       ` Michal Hocko
2022-01-12 10:31         ` Michal Hocko
2022-01-12 15:45       ` Mike Rapoport
2022-01-12 15:45         ` Mike Rapoport
2022-01-13  9:47         ` Yu Zhao
2022-01-13  9:47           ` Yu Zhao
2022-01-13 10:31   ` Aneesh Kumar K.V
2022-01-13 10:31     ` Aneesh Kumar K.V
2022-01-13 23:02     ` Yu Zhao
2022-01-13 23:02       ` Yu Zhao
2022-01-14  5:20       ` Aneesh Kumar K.V
2022-01-14  5:20         ` Aneesh Kumar K.V
2022-01-14  6:50         ` Yu Zhao
2022-01-14  6:50           ` Yu Zhao
2022-01-04 20:22 ` [PATCH v6 9/9] mm: multigenerational lru: Kconfig Yu Zhao
2022-01-04 20:22   ` Yu Zhao
2022-01-04 21:39   ` Linus Torvalds
2022-01-04 21:39     ` Linus Torvalds
2022-01-04 20:22 ` [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao
2022-01-04 20:30 ` Yu Zhao
2022-01-04 20:30   ` Yu Zhao
2022-01-04 21:43   ` Linus Torvalds
2022-01-04 21:43     ` Linus Torvalds
2022-01-05 21:12     ` Yu Zhao
2022-01-05 21:12       ` Yu Zhao
2022-01-07  9:38   ` Michal Hocko
2022-01-07  9:38     ` Michal Hocko
2022-01-07 18:45     ` Yu Zhao
2022-01-07 18:45       ` Yu Zhao
2022-01-10 15:39       ` Michal Hocko
2022-01-10 15:39         ` Michal Hocko
2022-01-10 22:04         ` Yu Zhao
2022-01-10 22:04           ` Yu Zhao
2022-01-10 22:46           ` Jesse Barnes
2022-01-10 22:46             ` Jesse Barnes
2022-01-11  1:41             ` Linus Torvalds
2022-01-11  1:41               ` Linus Torvalds
2022-01-11 10:40             ` Michal Hocko
2022-01-11 10:40               ` Michal Hocko
2022-01-11  8:41   ` Yu Zhao
2022-01-11  8:41     ` Yu Zhao
2022-01-11  8:53     ` Holger Hoffstätte
2022-01-11  8:53       ` Holger Hoffstätte
2022-01-11  9:26     ` Jan Alexander Steffens (heftig)
2022-01-11 16:04     ` Shuang Zhai
2022-01-11 16:04       ` Shuang Zhai
2022-01-12  1:46     ` Suleiman Souhlal
2022-01-12  1:46       ` Suleiman Souhlal
2022-01-12  6:07     ` Sofia Trinh
2022-01-12  6:07       ` Sofia Trinh
2022-01-12 16:17       ` Daniel Byrne
2022-01-18  9:21     ` Yu Zhao
2022-01-18  9:21       ` Yu Zhao
2022-01-18  9:36     ` Donald Carr
2022-01-18  9:36       ` Donald Carr
2022-01-19 20:19     ` Steven Barrett
2022-01-19 20:19       ` Steven Barrett
2022-01-19 22:25     ` Brian Geffon
2022-01-19 22:25       ` Brian Geffon
2022-01-05  2:44 ` Shuang Zhai
2022-01-05  2:44   ` Shuang Zhai
2022-01-05  8:55 ` SeongJae Park
2022-01-05  8:55   ` SeongJae Park
2022-01-05 10:53   ` Yu Zhao
2022-01-05 10:53     ` Yu Zhao
2022-01-05 11:12     ` Borislav Petkov
2022-01-05 11:12       ` Borislav Petkov
2022-01-05 11:25     ` SeongJae Park
2022-01-05 11:25       ` SeongJae Park
2022-01-05 21:06       ` Yu Zhao
2022-01-05 21:06         ` Yu Zhao
2022-01-10 14:49 ` Alexey Avramov
2022-01-10 14:49   ` Alexey Avramov
2022-01-11 10:24 ` Alexey Avramov
2022-01-11 10:24   ` Alexey Avramov
2022-01-12 20:56 ` Oleksandr Natalenko
2022-01-12 20:56   ` Oleksandr Natalenko
2022-01-13  8:59   ` Yu Zhao
2022-01-13  8:59     ` Yu Zhao
2022-01-23  5:43 ` Barry Song
2022-01-23  5:43   ` Barry Song
2022-01-25  6:48   ` Yu Zhao
2022-01-25  6:48     ` Yu Zhao
2022-01-28  8:54     ` Barry Song
2022-01-28  8:54       ` Barry Song
2022-02-08  9:16       ` Yu Zhao
2022-02-08  9:16         ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yd109jeRllJbjV9o@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=Hi-Angel@yandex.ru \
    --cc=Michael@michaellarabel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=hakavlad@inbox.lv \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=jsbarnes@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=page-reclaim@google.com \
    --cc=riel@surriel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.