From: Michal Hocko <mhocko@suse.com> To: Alexey Avramov <hakavlad@inbox.lv> Cc: Yu Zhao <yuzhao@google.com>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, Andi Kleen <ak@linux.intel.com>, Catalin Marinas <catalin.marinas@arm.com>, Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>, Jesse Barnes <jsbarnes@google.com>, Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>, Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>, Michael Larabel <Michael@michaellarabel.com>, Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org, Konstantin Kharlamov <Hi-Angel@yandex.ru> Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging Date: Tue, 11 Jan 2022 13:15:50 +0100 [thread overview] Message-ID: <Yd109jeRllJbjV9o@dhcp22.suse.cz> (raw) In-Reply-To: <1641900108.61dd684cb0e59@mail.inbox.lv> On Tue 11-01-22 11:21:48, Alexey Avramov wrote: > > I do not really see any arguments why an userspace based trashing > > detection cannot be used for those. > > Firsly, > because this is the task of the kernel, not the user space. > Memory is managed by the kernel, not by the user space. > The absence of such a mechanism in the kernel is a fundamental problem. > The userspace tools are ugly hacks: > some of them consume a lot of CPU [1], > some of them consume a lot of memory [2], > some of them cannot into process_mrelease() (earlyoom, nohang), > some of them kill only the whole cgroup (systemd-oomd, oomd) [3] > and depends on systemd and cgroup_v2 (oomd, systemd-oomd). Thanks for those links. Read through them and my understanding is that most of those are very specific to the tool used and nothing really fundamental because of lack of kernel support. > One of the biggest challenges for userspace oom-killers is to potentially > function under intense memory pressure and are prone to getting stuck in > memory reclaim themselves [4]. This one is more interesting and the truth is that handling the complete OOM situation from the userspace is really tricky. Especially when with a more complex oom decision policy. In the past we have discussed potential ways to implement a oom kill policy be kernel modules or eBPF. Without anybody following up on that. But I suspect you are mixing up two things here. One of them is out of memory situation where no memory can be reclaimed and allocated. The other is one where the memory can be reclaimed, a progress is made, but that leads to a trashing when the most of the time is spent on refaulting a memory reclaimed shortly before. The first one is addressed by the global oom killer and it tries to be really conservative as much as possible because this is a very disruptive operation. But the later one is more complex and a proper handling really depends on the particular workload to be handled properly because it is more of a QoS than an emergency action to keep the system alive. There are workloads which prefer a temporary trashing over its working set during a peak memory demand rather than an OOM kill because way too much work would be thrown away. On the other side workloads that are latency sensitive can see even the direct reclaim as a runtime visible problem. I hope you can imagine there is a really large gap between those two cases and no simple solution can be applied to the whole range. Therefore we have PSI and refault stats exported to the userspace so that a workload specific policy can be implemented there. If userspace has hard time to use that data and action upon then let's talk about specifics. For the most steady trashing situations I have seen the userspace with mlocked memory and the code can make a forward progress and mediate the situation. [...] > [1] https://github.com/facebookincubator/oomd/issues/79 > [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage > [3] https://github.com/facebookincubator/oomd/issues/125 > [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/ > [5] https://github.com/zen-kernel/zen-kernel/issues/223 > [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2 > [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/ -- Michal Hocko SUSE Labs
WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@suse.com> To: Alexey Avramov <hakavlad@inbox.lv> Cc: Yu Zhao <yuzhao@google.com>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, Andi Kleen <ak@linux.intel.com>, Catalin Marinas <catalin.marinas@arm.com>, Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>, Jesse Barnes <jsbarnes@google.com>, Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>, Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>, Michael Larabel <Michael@michaellarabel.com>, Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org, Konstantin Kharlamov <Hi-Angel@yandex.ru> Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging Date: Tue, 11 Jan 2022 13:15:50 +0100 [thread overview] Message-ID: <Yd109jeRllJbjV9o@dhcp22.suse.cz> (raw) In-Reply-To: <1641900108.61dd684cb0e59@mail.inbox.lv> On Tue 11-01-22 11:21:48, Alexey Avramov wrote: > > I do not really see any arguments why an userspace based trashing > > detection cannot be used for those. > > Firsly, > because this is the task of the kernel, not the user space. > Memory is managed by the kernel, not by the user space. > The absence of such a mechanism in the kernel is a fundamental problem. > The userspace tools are ugly hacks: > some of them consume a lot of CPU [1], > some of them consume a lot of memory [2], > some of them cannot into process_mrelease() (earlyoom, nohang), > some of them kill only the whole cgroup (systemd-oomd, oomd) [3] > and depends on systemd and cgroup_v2 (oomd, systemd-oomd). Thanks for those links. Read through them and my understanding is that most of those are very specific to the tool used and nothing really fundamental because of lack of kernel support. > One of the biggest challenges for userspace oom-killers is to potentially > function under intense memory pressure and are prone to getting stuck in > memory reclaim themselves [4]. This one is more interesting and the truth is that handling the complete OOM situation from the userspace is really tricky. Especially when with a more complex oom decision policy. In the past we have discussed potential ways to implement a oom kill policy be kernel modules or eBPF. Without anybody following up on that. But I suspect you are mixing up two things here. One of them is out of memory situation where no memory can be reclaimed and allocated. The other is one where the memory can be reclaimed, a progress is made, but that leads to a trashing when the most of the time is spent on refaulting a memory reclaimed shortly before. The first one is addressed by the global oom killer and it tries to be really conservative as much as possible because this is a very disruptive operation. But the later one is more complex and a proper handling really depends on the particular workload to be handled properly because it is more of a QoS than an emergency action to keep the system alive. There are workloads which prefer a temporary trashing over its working set during a peak memory demand rather than an OOM kill because way too much work would be thrown away. On the other side workloads that are latency sensitive can see even the direct reclaim as a runtime visible problem. I hope you can imagine there is a really large gap between those two cases and no simple solution can be applied to the whole range. Therefore we have PSI and refault stats exported to the userspace so that a workload specific policy can be implemented there. If userspace has hard time to use that data and action upon then let's talk about specifics. For the most steady trashing situations I have seen the userspace with mlocked memory and the code can make a forward progress and mediate the situation. [...] > [1] https://github.com/facebookincubator/oomd/issues/79 > [2] https://github.com/hakavlad/nohang#memory-and-cpu-usage > [3] https://github.com/facebookincubator/oomd/issues/125 > [4] https://lore.kernel.org/all/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com/ > [5] https://github.com/zen-kernel/zen-kernel/issues/223 > [6] https://raw.githubusercontent.com/hakavlad/cache-tests/main/mg-LRU-v3_vs_classic-LRU/3-firefox-tail-OOM/mg-LRU-1/psi2 > [7] https://lore.kernel.org/linux-mm/20211202150614.22440-1-mgorman@techsingularity.net/ -- Michal Hocko SUSE Labs _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2022-01-11 12:15 UTC|newest] Thread overview: 223+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-01-04 20:22 [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 1/9] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-05 10:45 ` Will Deacon 2022-01-05 10:45 ` Will Deacon 2022-01-05 20:47 ` Yu Zhao 2022-01-05 20:47 ` Yu Zhao 2022-01-06 10:30 ` Will Deacon 2022-01-06 10:30 ` Will Deacon 2022-01-07 7:25 ` Yu Zhao 2022-01-07 7:25 ` Yu Zhao 2022-01-11 14:19 ` Will Deacon 2022-01-11 14:19 ` Will Deacon 2022-01-11 22:27 ` Yu Zhao 2022-01-11 22:27 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 2/9] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-04 21:24 ` Linus Torvalds 2022-01-04 21:24 ` Linus Torvalds 2022-01-04 20:22 ` [PATCH v6 3/9] mm/vmscan.c: refactor shrink_node() Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 4/9] mm: multigenerational lru: groundwork Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-04 21:34 ` Linus Torvalds 2022-01-04 21:34 ` Linus Torvalds 2022-01-11 8:16 ` Aneesh Kumar K.V 2022-01-11 8:16 ` Aneesh Kumar K.V 2022-01-12 2:16 ` Yu Zhao 2022-01-12 2:16 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 5/9] mm: multigenerational lru: mm_struct list Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-07 9:06 ` Michal Hocko 2022-01-07 9:06 ` Michal Hocko 2022-01-08 0:19 ` Yu Zhao 2022-01-08 0:19 ` Yu Zhao 2022-01-10 15:21 ` Michal Hocko 2022-01-10 15:21 ` Michal Hocko 2022-01-12 8:08 ` Yu Zhao 2022-01-12 8:08 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 6/9] mm: multigenerational lru: aging Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-06 16:06 ` Michal Hocko 2022-01-06 16:06 ` Michal Hocko 2022-01-06 21:27 ` Yu Zhao 2022-01-06 21:27 ` Yu Zhao 2022-01-07 8:43 ` Michal Hocko 2022-01-07 8:43 ` Michal Hocko 2022-01-07 21:12 ` Yu Zhao 2022-01-07 21:12 ` Yu Zhao 2022-01-06 16:12 ` Michal Hocko 2022-01-06 16:12 ` Michal Hocko 2022-01-06 21:41 ` Yu Zhao 2022-01-06 21:41 ` Yu Zhao 2022-01-07 8:55 ` Michal Hocko 2022-01-07 8:55 ` Michal Hocko 2022-01-07 9:00 ` Michal Hocko 2022-01-07 9:00 ` Michal Hocko 2022-01-10 3:58 ` Yu Zhao 2022-01-10 3:58 ` Yu Zhao 2022-01-10 14:37 ` Michal Hocko 2022-01-10 14:37 ` Michal Hocko 2022-01-13 9:43 ` Yu Zhao 2022-01-13 9:43 ` Yu Zhao 2022-01-13 12:02 ` Michal Hocko 2022-01-13 12:02 ` Michal Hocko 2022-01-19 6:31 ` Yu Zhao 2022-01-19 6:31 ` Yu Zhao 2022-01-19 9:44 ` Michal Hocko 2022-01-19 9:44 ` Michal Hocko 2022-01-10 15:01 ` Michal Hocko 2022-01-10 15:01 ` Michal Hocko 2022-01-10 16:01 ` Vlastimil Babka 2022-01-10 16:01 ` Vlastimil Babka 2022-01-10 16:25 ` Michal Hocko 2022-01-10 16:25 ` Michal Hocko 2022-01-11 23:16 ` Yu Zhao 2022-01-11 23:16 ` Yu Zhao 2022-01-12 10:28 ` Michal Hocko 2022-01-12 10:28 ` Michal Hocko 2022-01-13 9:25 ` Yu Zhao 2022-01-13 9:25 ` Yu Zhao 2022-01-07 13:11 ` Michal Hocko 2022-01-07 13:11 ` Michal Hocko 2022-01-07 23:36 ` Yu Zhao 2022-01-07 23:36 ` Yu Zhao 2022-01-10 15:35 ` Michal Hocko 2022-01-10 15:35 ` Michal Hocko 2022-01-11 1:18 ` Yu Zhao 2022-01-11 1:18 ` Yu Zhao 2022-01-11 9:00 ` Michal Hocko 2022-01-11 9:00 ` Michal Hocko [not found] ` <1641900108.61dd684cb0e59@mail.inbox.lv> 2022-01-11 12:15 ` Michal Hocko [this message] 2022-01-11 12:15 ` Michal Hocko 2022-01-13 17:00 ` Alexey Avramov 2022-01-13 17:00 ` Alexey Avramov 2022-01-11 14:22 ` Alexey Avramov 2022-01-11 14:22 ` Alexey Avramov 2022-01-07 14:44 ` Michal Hocko 2022-01-07 14:44 ` Michal Hocko 2022-01-10 4:47 ` Yu Zhao 2022-01-10 4:47 ` Yu Zhao 2022-01-10 10:54 ` Michal Hocko 2022-01-10 10:54 ` Michal Hocko 2022-01-19 7:04 ` Yu Zhao 2022-01-19 7:04 ` Yu Zhao 2022-01-19 9:42 ` Michal Hocko 2022-01-19 9:42 ` Michal Hocko 2022-01-23 21:28 ` Yu Zhao 2022-01-23 21:28 ` Yu Zhao 2022-01-24 14:01 ` Michal Hocko 2022-01-24 14:01 ` Michal Hocko 2022-01-10 16:57 ` Michal Hocko 2022-01-10 16:57 ` Michal Hocko 2022-01-12 1:01 ` Yu Zhao 2022-01-12 1:01 ` Yu Zhao 2022-01-12 10:17 ` Michal Hocko 2022-01-12 10:17 ` Michal Hocko 2022-01-12 23:43 ` Yu Zhao 2022-01-12 23:43 ` Yu Zhao 2022-01-13 11:57 ` Michal Hocko 2022-01-13 11:57 ` Michal Hocko 2022-01-23 21:40 ` Yu Zhao 2022-01-23 21:40 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 7/9] mm: multigenerational lru: eviction Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-11 10:37 ` Aneesh Kumar K.V 2022-01-11 10:37 ` Aneesh Kumar K.V 2022-01-12 8:05 ` Yu Zhao 2022-01-12 8:05 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 8/9] mm: multigenerational lru: user interface Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-10 10:27 ` Mike Rapoport 2022-01-10 10:27 ` Mike Rapoport 2022-01-12 8:35 ` Yu Zhao 2022-01-12 8:35 ` Yu Zhao 2022-01-12 10:31 ` Michal Hocko 2022-01-12 10:31 ` Michal Hocko 2022-01-12 15:45 ` Mike Rapoport 2022-01-12 15:45 ` Mike Rapoport 2022-01-13 9:47 ` Yu Zhao 2022-01-13 9:47 ` Yu Zhao 2022-01-13 10:31 ` Aneesh Kumar K.V 2022-01-13 10:31 ` Aneesh Kumar K.V 2022-01-13 23:02 ` Yu Zhao 2022-01-13 23:02 ` Yu Zhao 2022-01-14 5:20 ` Aneesh Kumar K.V 2022-01-14 5:20 ` Aneesh Kumar K.V 2022-01-14 6:50 ` Yu Zhao 2022-01-14 6:50 ` Yu Zhao 2022-01-04 20:22 ` [PATCH v6 9/9] mm: multigenerational lru: Kconfig Yu Zhao 2022-01-04 20:22 ` Yu Zhao 2022-01-04 21:39 ` Linus Torvalds 2022-01-04 21:39 ` Linus Torvalds 2022-01-04 20:22 ` [PATCH v6 0/9] Multigenerational LRU Framework Yu Zhao 2022-01-04 20:30 ` Yu Zhao 2022-01-04 20:30 ` Yu Zhao 2022-01-04 21:43 ` Linus Torvalds 2022-01-04 21:43 ` Linus Torvalds 2022-01-05 21:12 ` Yu Zhao 2022-01-05 21:12 ` Yu Zhao 2022-01-07 9:38 ` Michal Hocko 2022-01-07 9:38 ` Michal Hocko 2022-01-07 18:45 ` Yu Zhao 2022-01-07 18:45 ` Yu Zhao 2022-01-10 15:39 ` Michal Hocko 2022-01-10 15:39 ` Michal Hocko 2022-01-10 22:04 ` Yu Zhao 2022-01-10 22:04 ` Yu Zhao 2022-01-10 22:46 ` Jesse Barnes 2022-01-10 22:46 ` Jesse Barnes 2022-01-11 1:41 ` Linus Torvalds 2022-01-11 1:41 ` Linus Torvalds 2022-01-11 10:40 ` Michal Hocko 2022-01-11 10:40 ` Michal Hocko 2022-01-11 8:41 ` Yu Zhao 2022-01-11 8:41 ` Yu Zhao 2022-01-11 8:53 ` Holger Hoffstätte 2022-01-11 8:53 ` Holger Hoffstätte 2022-01-11 9:26 ` Jan Alexander Steffens (heftig) 2022-01-11 16:04 ` Shuang Zhai 2022-01-11 16:04 ` Shuang Zhai 2022-01-12 1:46 ` Suleiman Souhlal 2022-01-12 1:46 ` Suleiman Souhlal 2022-01-12 6:07 ` Sofia Trinh 2022-01-12 6:07 ` Sofia Trinh 2022-01-12 16:17 ` Daniel Byrne 2022-01-18 9:21 ` Yu Zhao 2022-01-18 9:21 ` Yu Zhao 2022-01-18 9:36 ` Donald Carr 2022-01-18 9:36 ` Donald Carr 2022-01-19 20:19 ` Steven Barrett 2022-01-19 20:19 ` Steven Barrett 2022-01-19 22:25 ` Brian Geffon 2022-01-19 22:25 ` Brian Geffon 2022-01-05 2:44 ` Shuang Zhai 2022-01-05 2:44 ` Shuang Zhai 2022-01-05 8:55 ` SeongJae Park 2022-01-05 8:55 ` SeongJae Park 2022-01-05 10:53 ` Yu Zhao 2022-01-05 10:53 ` Yu Zhao 2022-01-05 11:12 ` Borislav Petkov 2022-01-05 11:12 ` Borislav Petkov 2022-01-05 11:25 ` SeongJae Park 2022-01-05 11:25 ` SeongJae Park 2022-01-05 21:06 ` Yu Zhao 2022-01-05 21:06 ` Yu Zhao 2022-01-10 14:49 ` Alexey Avramov 2022-01-10 14:49 ` Alexey Avramov 2022-01-11 10:24 ` Alexey Avramov 2022-01-11 10:24 ` Alexey Avramov 2022-01-12 20:56 ` Oleksandr Natalenko 2022-01-12 20:56 ` Oleksandr Natalenko 2022-01-13 8:59 ` Yu Zhao 2022-01-13 8:59 ` Yu Zhao 2022-01-23 5:43 ` Barry Song 2022-01-23 5:43 ` Barry Song 2022-01-25 6:48 ` Yu Zhao 2022-01-25 6:48 ` Yu Zhao 2022-01-28 8:54 ` Barry Song 2022-01-28 8:54 ` Barry Song 2022-02-08 9:16 ` Yu Zhao 2022-02-08 9:16 ` Yu Zhao
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=Yd109jeRllJbjV9o@dhcp22.suse.cz \ --to=mhocko@suse.com \ --cc=Hi-Angel@yandex.ru \ --cc=Michael@michaellarabel.com \ --cc=ak@linux.intel.com \ --cc=akpm@linux-foundation.org \ --cc=axboe@kernel.dk \ --cc=catalin.marinas@arm.com \ --cc=corbet@lwn.net \ --cc=dave.hansen@linux.intel.com \ --cc=hakavlad@inbox.lv \ --cc=hannes@cmpxchg.org \ --cc=hdanton@sina.com \ --cc=jsbarnes@google.com \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-doc@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mgorman@suse.de \ --cc=page-reclaim@google.com \ --cc=riel@surriel.com \ --cc=torvalds@linux-foundation.org \ --cc=vbabka@suse.cz \ --cc=will@kernel.org \ --cc=willy@infradead.org \ --cc=x86@kernel.org \ --cc=ying.huang@intel.com \ --cc=yuzhao@google.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.