On Wed, 8 Jul 2020, Michal Hocko wrote:

> I have only now realized that David is not on Cc. Add him here. The
> patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> 
> I believe the main problem is that we are normalizing to oom_score_adj
> units rather than usage/total. I have a very vague recollection this has
> been done in the past but I didn't get to dig into details yet.
> 

The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a 
page adjustment of:

adj = -998 * 4194304 / 1000 = -4185915 pages

The largest pid 58406 (data_sim) has rss 3967322 pages,
pgtables 37101568 / 4096 = 9058 pages, and swapents 0.  So its unadjusted 
badness is

3967322 + 9058 pages = 3976380 pages

Factoring in oom_score_adj, all of these processes will have a badness of 
1 because oom_badness() clamps a non-positive result to 1 rather than 
letting it underflow, which I think is the point of Yafang's proposal.
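
To make the arithmetic above concrete, here is a rough Python sketch (not 
the kernel code).  It assumes 4 KiB pages, as in the pgtables division 
above, and applies the adjustment formula exactly as written there; the 
kernel's own integer rounding may differ slightly.  div_trunc(), 
unadjusted() and badness() are names made up for this illustration.

```python
# Illustrative sketch of the badness arithmetic from this thread; not kernel code.
PAGE_SIZE = 4096        # assumed 4 KiB pages
TOTALPAGES = 4194304    # memcg limit: 16777216 kB / 4 kB
OOM_SCORE_ADJ = -998    # same value for every task in this memcg


def div_trunc(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def unadjusted(rss, pgtables_bytes, swapents):
    """rss + page-table pages + swap entries, all in pages."""
    return rss + pgtables_bytes // PAGE_SIZE + swapents


def badness(rss, pgtables_bytes, swapents, clamp=True):
    """Adjusted badness; with clamp=True any non-positive result becomes 1."""
    points = unadjusted(rss, pgtables_bytes, swapents)
    points += div_trunc(OOM_SCORE_ADJ * TOTALPAGES, 1000)
    if clamp:
        return points if points > 0 else 1
    return points


# (name, rss pages, pgtables_bytes, swapents) copied from the OOM dump
tasks = [
    ("pause", 1, 32768, 0),
    ("odin-log-agent", 64239, 774144, 0),
    ("data_sim", 3967322, 37101568, 0),
]

for name, rss, pgt, swap in tasks:
    print(name, badness(rss, pgt, swap), badness(rss, pgt, swap, clamp=False))
```

With the clamp, all three tasks tie at 1, so memory usage no longer 
decides the victim.  Without it, data_sim is the least negative and ranks 
first.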

I think the patch can work but, as you mention, it also needs an update to 
proc_oom_score().  proc_oom_score() uses the global amount of memory, so 
Yafang is likely not seeing it go negative for that reason, but it could 
happen.
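
As a rough illustration of why proc_oom_score() would need care, the 
Python sketch below assumes the exported score is the badness scaled by 
1000 / totalpages, with totalpages being global memory.  GLOBAL_PAGES is 
a made-up machine size, and div_trunc(), badness() and oom_score() are 
hypothetical helpers, not kernel functions.

```python
# Illustrative only: how a signed badness could leak into the exported score.
GLOBAL_PAGES = 67108864   # hypothetical: 256 GiB of RAM + swap in 4 KiB pages
OOM_SCORE_ADJ = -998


def div_trunc(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def badness(usage_pages, totalpages, signed):
    """Usage plus the oom_score_adj bias; clamped to 1 unless signed=True."""
    points = usage_pages + div_trunc(OOM_SCORE_ADJ * totalpages, 1000)
    if signed:
        return points
    return points if points > 0 else 1


def oom_score(usage_pages, totalpages, signed):
    """Assumed scaling of badness onto the per-mille range."""
    return div_trunc(badness(usage_pages, totalpages, signed) * 1000, totalpages)


data_sim = 3976380  # unadjusted badness of pid 58406, in pages

print(oom_score(data_sim, GLOBAL_PAGES, signed=False))
print(oom_score(data_sim, GLOBAL_PAGES, signed=True))
```

Under these assumptions the clamped variant reports 0, while the signed 
variant reports a negative number unless proc_oom_score() handles it.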

> On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > Recently we found an issue in our production environment: when memcg
> > > oom is triggered, the oom killer doesn't choose the process with the
> > > largest resident memory but chooses the first scanned process. Note that
> > > all processes in this memcg have the same oom_score_adj, so the oom killer
> > > should choose the process with the largest resident memory.
> > > 
> > > Below is part of the oom info, which is enough to analyze this issue.
> > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > [...]
> > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > 
> > > We can see that the first scanned process 5740 (pause) was killed, but its
> > > rss is only one page. That is because, when we calculate the oom badness in
> > > oom_badness(), we always ignore negative points and convert all of them
> > > to 1. Since all the processes in the targeted memcg have the same
> > > oom_score_adj of -998, their points are all negative. As a result, the
> > > first scanned process is killed.
> > 
> > Such a large bias can skew results quite considerably. 
> > 
> > > The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> > > a Guaranteed pod, which has higher priority to protect it from being
> > > killed by the system oom.
> > 
> > This is really interesting! I assume that the oom_score_adj is set to
> > protect from the global oom situation, right? I am struggling to
> > understand what the expected behavior is when the oom is internal to
> > such a group though. Is killing a single task from such a group a
> > sensible choice? I am not really familiar with kubelet but can it cope
> > with data_sim going away from under it while the rest would still run?
> > Wouldn't it make more sense to simply tear down the whole thing?
> > 
> > But that is a separate thing.
> > 
> > > To fix this issue, we should make the calculation of oom point more
> > > accurate. We can achieve it by converting the chosen_point from 'unsigned
> > > long' to 'long'.
> > 
> > oom_score has very coarse units because it maps all the consumed
> > memory onto a 0 - 1000 scale, so effectively per-mille of the usable
> > memory. oom_score_adj acts on top of that as a bias. This is
> > exported to the userspace and I do not think we can change that (see
> > Documentation/filesystems/proc.rst) unfortunately. So your patch cannot
> > really be accepted as is because it would start reporting values outside
> > of the allowed range unless I am doing some math incorrectly.
> > 
> > On the other hand, in this particular case I believe the existing
> > calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> > the top consumer is 3976380 pages, so 94.8%, and the lowest memory consumer
> > is effectively 0%. Even if we discount 94.8% by 99.8% then we should
> > still have something like 7950 pages. So the normalization oom_badness
> > does cuts results too aggressively. There was quite some churn in the
> > calculation in the past fixing weird rounding bugs so I have to think
> > about how to fix this properly some more.
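
For reference, the percentages quoted here can be reproduced with a few 
lines of Python.  This is only one possible reading of "discount 94.8% by 
99.8%", namely scaling usage by (1000 + oom_score_adj) / 1000; the helper 
names share_permille() and discounted() are invented for the example and 
are not a proposed implementation.

```python
# Illustrative check of the usage/total figures; not a proposed kernel change.
TOTALPAGES = 4194304   # 16777216 kB / 4 kB
PAGE_SIZE = 4096       # assumed 4 KiB pages


def share_permille(pages):
    """Usage as per-mille of the usable memory."""
    return pages * 1000 // TOTALPAGES


def discounted(pages, oom_score_adj):
    """Usage scaled down by the oom_score_adj bias instead of offset by it."""
    return pages * (1000 + oom_score_adj) // 1000


top = 3967322 + 37101568 // PAGE_SIZE   # data_sim: rss + pgtables pages
lowest = 1 + 32768 // PAGE_SIZE         # pause: rss + pgtables pages

print(share_permille(top), discounted(top, -998))
print(share_permille(lowest), discounted(lowest, -998))
```

Read this way, the top consumer keeps a few thousand pages of badness 
after the discount while the smallest task rounds down to zero, so the 
ordering by memory usage survives.
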
> > 
> > That being said, even though the configuration is weird I do agree that
> > oom_badness scaling is really unexpected and the memory consumption
> > in this particular example should be quite telling about whom to choose as
> > an oom victim.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Michal Hocko
> SUSE Labs
>