Re: [PATCH] mm, oom: make the calculation of oom badness more accurate

From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more accurate
Date: Thu, 9 Jul 2020 10:14:14 +0800	[thread overview]
Message-ID: <CALOAHbBJ=Kh-fQGuN_e9oF68ywS9YDFYb=zUz2t5K9BsukP2Gw@mail.gmail.com> (raw)
In-Reply-To: <20200708190225.GM7271@dhcp22.suse.cz>

On Thu, Jul 9, 2020 at 3:02 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 08-07-20 10:57:27, David Rientjes wrote:
> > On Wed, 8 Jul 2020, Michal Hocko wrote:
> >
> > > I have only now realized that David is not on Cc. Add him here. The
> > > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > >
> > > I believe the main problem is that we are normalizing to oom_score_adj
> > > units rather than usage/total. I have a very vague recollection this has
> > > been done in the past but I didn't get to dig into details yet.
> > >
> >
> > The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> > page adjustment of:
> >
> > adj = -998 * 4194304 / 1000 = −4185915 pages
> >
> > The largest pid 58406 (data_sim) has rss 3967322 pages,
> > pgtables 37101568 / 4096 = 9058 pages, and swapents 0.  So it's unadjusted
> > badness is
> >
> > 3967322 + 9058 pages = 3976380 pages
> >
> > Factoring in oom_score_adj, all of these processes will have a badness of
> > 1 because oom_badness() doesn't underflow, which I think is the point of
> > Yafang's proposal.
> >
> > I think the patch can work but, as you mention, also needs an update to
> > proc_oom_score().  proc_oom_score() is using the global amount of memory
> > so Yafang is likely not seeing it go negative for that reason but it could
> > happen.
>
> Yes, memcg just makes it more obvious but the same might happen for the
> global case. I am not sure how we can both alow underflow and present
> the value that would fit the existing model. The exported value should
> really reflect what the oom killer is using for the calculation or we
> are going to see discrepancies between the real oom decision and
> presented values. So I believe we really have to change the calculation
> rather than just make it tolerant to underflows.
>

Hi Michal,

- Before my patch,
The result of oom_badness() is [1,  2 * totalpages),
and the result of proc_oom_score() is  [0, 2000).

While the badness score in the Documentation/filesystems/proc.rst is: [0, 1000]
"The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted"

That means, we need to update the documentation anyway unless my
calculation is wrong.
So the point will be how to change it ?

- After my patch
oom_badness():  (-totalpages, 2 * totalpages)
proc_oom_score(): (-1000, 2000)

If we allow underflow, we can change the documentation as "from -1000
(never kill) to 2000(always kill)".
While if we don't allow underflow,  we can make bellow simple change,

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 774784587..0da8efa41 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
struct pid_namespace *ns,
        unsigned long totalpages = totalram_pages + total_swap_pages;
        unsigned long points = 0;

-       points = oom_badness(task, NULL, NULL, totalpages) *
+       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
                                      1000 / totalpages;
        seq_printf(m, "%lu\n", points);

And then update the documentation as "from 0 (never kill) to 3000
(always kill)"


-- 
Thanks
Yafang