Date: Wed, 8 Jul 2020 10:57:27 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Yafang Shao, akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more accurate
In-Reply-To: <20200708143211.GK7271@dhcp22.suse.cz>
References: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com>
 <20200708142806.GJ7271@dhcp22.suse.cz> <20200708143211.GK7271@dhcp22.suse.cz>

On Wed, 8 Jul 2020, Michal Hocko wrote:

> I have only now realized that David is not on Cc. Add him here. The
> patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
>
> I believe the main problem is that we are normalizing to oom_score_adj
> units rather than usage/total. I have a very vague recollection this has
> been done in the past but I didn't get to dig into details yet.
>

The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
page adjustment of:

	adj = -998 * 4194304 / 1000 = -4185915 pages

The largest pid, 58406 (data_sim), has rss 3967322 pages, pgtables
37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted badness is

	3967322 + 9058 pages = 3976380 pages

Factoring in oom_score_adj, all of these processes will have a badness of
1 because oom_badness() doesn't underflow, which I think is the point of
Yafang's proposal.

I think the patch can work but, as you mention, it also needs an update to
proc_oom_score(). proc_oom_score() uses the global amount of memory, so
Yafang is likely not seeing it go negative for that reason, but it could
happen.
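To make the clamping concrete, here is a minimal standalone sketch of the
arithmetic above (a simplified model of the oom_badness() calculation, not
the actual mm/oom_kill.c code; the helper name and hard-coded numbers are
illustrative only):

#include <stdio.h>

/*
 * Simplified model of the current badness calculation: rss, swap entries
 * and page-table pages are summed, the oom_score_adj bias is applied in
 * thousandths of the usable pages, and any non-positive result is clamped
 * to 1, which is why every task in this memcg scores the same.
 */
static long badness(long rss, long swapents, long pgtable_pages,
		    long oom_score_adj, long totalpages)
{
	long points = rss + swapents + pgtable_pages;

	points += oom_score_adj * (totalpages / 1000);
	return points > 0 ? points : 1;	/* the clamp hides the ordering */
}

int main(void)
{
	long totalpages = 4194304;	/* 16777216kB memcg limit, 4K pages */

	/* data_sim: rss 3967322, pgtables 37101568 bytes -> 9058 pages */
	printf("data_sim: %ld\n", badness(3967322, 0, 9058, -998, totalpages));
	/* pause: rss 1, pgtables 32768 bytes -> 8 pages */
	printf("pause:    %ld\n", badness(1, 0, 8, -998, totalpages));
	return 0;
}

Both print 1, so victim selection degenerates to scan order, which matches
the log quoted below.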
> On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > Recently we found an issue on our production environment that when memcg
> > > oom is triggered the oom killer doesn't choose the process with the largest
> > > resident memory but chooses the first scanned process. Note that all
> > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > should choose the process with the largest resident memory.
> > >
> > > Below is part of the oom info, which is enough to analyze this issue.
> > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > [...]
> > > [7516987.983293] [  pid  ]   uid   tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > [7516987.983510] [   5740]     0   5740      257        1       32768        0          -998 pause
> > > [7516987.983574] [  58804]     0  58804     4594      771       81920        0          -998 entry_point.bas
> > > [7516987.983577] [  58908]     0  58908     7089      689       98304        0          -998 cron
> > > [7516987.983580] [  58910]     0  58910    16235     5576      163840        0          -998 supervisord
> > > [7516987.983590] [  59620]     0  59620    18074     1395      188416        0          -998 sshd
> > > [7516987.983594] [  59622]     0  59622    18680     6679      188416        0          -998 python
> > > [7516987.983598] [  59624]     0  59624  1859266     5161      548864        0          -998 odin-agent
> > > [7516987.983600] [  59625]     0  59625   707223     9248      983040        0          -998 filebeat
> > > [7516987.983604] [  59627]     0  59627   416433    64239      774144        0          -998 odin-log-agent
> > > [7516987.983607] [  59631]     0  59631   180671    15012      385024        0          -998 python3
> > > [7516987.983612] [  61396]     0  61396   791287     3189      352256        0          -998 client
> > > [7516987.983615] [  61641]     0  61641  1844642    29089      946176        0          -998 client
> > > [7516987.983765] [   9236]     0   9236     2642      467       53248        0          -998 php_scanner
> > > [7516987.983911] [  42898]     0  42898    15543      838      167936        0          -998 su
> > > [7516987.983915] [  42900]  1000  42900     3673      867       77824        0          -998 exec_script_vr2
> > > [7516987.983918] [  42925]  1000  42925    36475    19033      335872        0          -998 python
> > > [7516987.983921] [  57146]  1000  57146     3673      848       73728        0          -998 exec_script_J2p
> > > [7516987.983925] [  57195]  1000  57195   186359    22958      491520        0          -998 python2
> > > [7516987.983928] [  58376]  1000  58376   275764    14402      290816        0          -998 rosmaster
> > > [7516987.983931] [  58395]  1000  58395   155166     4449      245760        0          -998 rosout
> > > [7516987.983935] [  58406]  1000  58406 18285584  3967322    37101568        0          -998 data_sim
> > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > >
> > > We can find that the first scanned process, 5740 (pause), was killed, but its
> > > rss is only one page. That is because, when we calculate the oom badness in
> > > oom_badness(), we always ignore the negative points and convert all of these
> > > negative points to 1. Now, as the oom_score_adj of all the processes in this
> > > targeted memcg has the same value -998, the points of these processes are
> > > all negative values. As a result, the first scanned process will be killed.
> >
> > Such a large bias can skew results quite considerably.
> >
> > > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > > a Guaranteed pod, which has higher priority to prevent it from being
> > > killed by a system oom.
> >
> > This is really interesting! I assume that the oom_score_adj is set to
> > protect from the global oom situation, right? I am struggling to
> > understand what the expected behavior is when the oom is internal to
> > such a group, though. Is killing a single task from such a group a
> > sensible choice? I am not really familiar with kubelet, but can it cope
> > with data_sim going away from under it while the rest would still run?
> > Wouldn't it make more sense to simply tear down the whole thing?
> >
> > But that is a separate thing.
> >
> > > To fix this issue, we should make the calculation of the oom points more
> > > accurate. We can achieve that by converting the chosen_points from
> > > 'unsigned long' to 'long'.
> >
> > oom_score has very coarse units because it maps all the consumed
> > memory onto a 0 - 1000 scale, so effectively per-mille of the usable
> > memory. oom_score_adj acts on top of that as a bias. This is
> > exported to userspace and I do not think we can change that (see
> > Documentation/filesystems/proc.rst), unfortunately. So your patch cannot
> > really be accepted as is because it would start reporting values outside
> > of the allowed range, unless I am doing some math incorrectly.
> >
> > On the other hand, in this particular case I believe the existing
> > calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> > the top consumer is 3976380 pages, so 94.8%; the lowest memory consumer
> > is effectively 0%. Even if we discount 94.8% by 99.8%, we should still be
> > left with something like 7950 pages. So the normalization oom_badness()
> > does cuts results too aggressively. There was quite some churn in the
> > calculation in the past fixing weird rounding bugs, so I have to think
> > about how to fix this properly some more.
> >
> > That being said, even though the configuration is weird, I do agree that
> > the oom_badness scaling is really unexpected and the memory consumption
> > in this particular example should be quite telling about whom to choose
> > as an oom victim.
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs
>
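For completeness, here is the same kind of standalone sketch for the scaling
point quoted above (purely an illustration of the numbers in this thread;
the "multiplicative" variant is only one way to read the "discount 94.8% by
99.8%" example, not a proposed kernel change):

#include <stdio.h>

int main(void)
{
	long totalpages = 4194304;	/* 16777216kB memcg limit in 4K pages */
	long data_sim = 3976380;	/* rss + pgtable pages for pid 58406 */
	long pause = 9;			/* rss + pgtable pages for pid 5740 */
	long adj = -998;

	/* current additive bias: both tasks go far negative, then clamp to 1 */
	printf("additive:       data_sim %ld, pause %ld\n",
	       data_sim + adj * (totalpages / 1000),
	       pause + adj * (totalpages / 1000));

	/* discounting each task's own usage by 99.8% keeps the ordering */
	printf("multiplicative: data_sim %ld, pause %ld\n",
	       data_sim * (1000 + adj) / 1000,
	       pause * (1000 + adj) / 1000);
	return 0;
}

The additive bias drives every task below zero (hence the clamp to 1), while
discounting each task's own usage still leaves data_sim at roughly 7952
pages -- the "something like 7950 pages" figure above.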