From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=aM0h=AT=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=3.0 tests=MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 696C8C433DF
	for <linux-mm@archiver.kernel.org>; Wed,  8 Jul 2020 14:28:12 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 25E0E206DF
	for <linux-mm@archiver.kernel.org>; Wed,  8 Jul 2020 14:28:12 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 25E0E206DF
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id AD94E6B00BC; Wed,  8 Jul 2020 10:28:11 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id A89476B00C0; Wed,  8 Jul 2020 10:28:11 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 979756B00C2; Wed,  8 Jul 2020 10:28:11 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0124.hostedemail.com [216.40.44.124])
	by kanga.kvack.org (Postfix) with ESMTP id 82DD46B00BC
	for <linux-mm@kvack.org>; Wed,  8 Jul 2020 10:28:11 -0400 (EDT)
Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id 0FE3C1EE6
	for <linux-mm@kvack.org>; Wed,  8 Jul 2020 14:28:11 +0000 (UTC)
X-FDA: 77015138382.02.oil32_5706c8626ebd
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin02.hostedemail.com (Postfix) with ESMTP id 712E1300018B3BB9
	for <linux-mm@kvack.org>; Wed,  8 Jul 2020 14:28:10 +0000 (UTC)
X-HE-Tag: oil32_5706c8626ebd
X-Filterd-Recvd-Size: 8434
Received: from mail-wm1-f66.google.com (mail-wm1-f66.google.com [209.85.128.66])
	by imf41.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Wed,  8 Jul 2020 14:28:09 +0000 (UTC)
Received: by mail-wm1-f66.google.com with SMTP id 17so3427400wmo.1
        for <linux-mm@kvack.org>; Wed, 08 Jul 2020 07:28:09 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=u1iy5n/S0yzL83myREK+dbsCgPknhHxSYxuJHEPR9uU=;
        b=Y7NdeOX1Q5at0KKhe7gv7cfmQ/aL3aOBV0P44wAT0usAsUe+35fP8LeSCtKVGQKggx
         7V0WICH325yw0nEfzaRWngrTIjfS/+8BZppwTnAl+1/qvJ1GycoLQ6WZX3ywSplwoXof
         o6sbEimGAJuo7KXRoKmU4e/vi8bK4C55o1nY8qtkTqoamYEFXPll6DNy/OXScLkrqxhO
         pZyXkPH25QhuE+2KEMxoGpEb0wHjMhMogXkKAwI+/RWk3JBAISqFStQeBQm0gKsmhzS3
         yC5CCxc3UweY2fcw7h/lrVN9w6Zx7lhtXYTMaF5Y0NsWLId/Vzbj78BghbdZp5qXY2iy
         S+sA==
X-Gm-Message-State: AOAM532ZgC5WmZBk3Mb6I+K5ypVfIFv08vud5MwTUjA2XrimmIUDYiOQ
	yG4rNxVV2ReL66sJUn5ip5Y=
X-Google-Smtp-Source: ABdhPJxzKClrNoj/obCE5zreAF9KkAvftdt5KrMKH2/KW/Lzw+57tzKrEBKJ8BnvvWL3MQ3bLoNABQ==
X-Received: by 2002:a05:600c:204d:: with SMTP id p13mr9562973wmg.88.1594218488728;
        Wed, 08 Jul 2020 07:28:08 -0700 (PDT)
Received: from localhost (ip-37-188-179-51.eurotel.cz. [37.188.179.51])
        by smtp.gmail.com with ESMTPSA id b23sm6904869wmd.37.2020.07.08.07.28.07
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 08 Jul 2020 07:28:07 -0700 (PDT)
Date: Wed, 8 Jul 2020 16:28:06 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm, oom: make the calculation of oom badness more
 accurate
Message-ID: <20200708142806.GJ7271@dhcp22.suse.cz>
References: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1594214649-9837-1-git-send-email-laoar.shao@gmail.com>
X-Rspamd-Queue-Id: 712E1300018B3BB9
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam03
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> Recently we found an issue on our production environment that when memcg
> oom is triggered the oom killer doesn't choose the process with largest
> resident memory but chooses the first scanned process. Note that all
> processes in this memcg have the same oom_score_adj, so the oom killer
> should choose the process with largest resident memory.
> 
> Below is part of the oom info, which is enough to analyze this issue.
> [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> [...]
> [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> We can see that the first scanned process 5740 (pause) was killed, but its
> rss is only one page. That is because, when we calculate the oom badness in
> oom_badness(), we always ignore negative points and convert all of these
> negative points to 1. Now as oom_score_adj of all the processes in this
> targeted memcg have the same value -998, the points of these processes are
> all negative values. As a result, the first scanned process will be killed.

Such a large bias can skew results quite considerably. 
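
To make the effect concrete, here is a minimal Python model of the
arithmetic described in the quoted text. It is only a sketch, not kernel
code: the function name, the 4 KiB page size and the exact form of the
scaling (oom_score_adj multiplied by totalpages / 1000, non-positive
results reported as 1) are assumptions based on the description above.
The inputs are the pause and data_sim rows from the quoted oom report.

```python
# Simplified model of the oom_badness() arithmetic discussed above.
# Not kernel code: names and structure are illustrative assumptions.

PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages


def badness(rss_pages, swapents, pgtables_bytes, oom_score_adj, total_pages):
    # Base score: resident pages + swap entries + page-table pages.
    points = rss_pages + swapents + pgtables_bytes // PAGE_SIZE
    # Bias: oom_score_adj scaled to pages in units of total_pages / 1000.
    points += oom_score_adj * (total_pages // 1000)
    # Any non-positive result is reported as 1.
    return points if points > 0 else 1


TOTAL_PAGES = 16777216 // 4  # 16777216 kB memcg limit -> 4194304 pages

pause = badness(1, 0, 32768, -998, TOTAL_PAGES)
data_sim = badness(3967322, 0, 37101568, -998, TOTAL_PAGES)
print(pause, data_sim)  # prints "1 1"
```

With an adjustment of -998 the bias is -4185612 pages, which is larger
than even data_sim's footprint, so every task in the memcg ends up with
the same score.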

> The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
> Guaranteed pod, which has higher priority to avoid being killed by the
> system oom.

This is really interesting! I assume that the oom_score_adj is set to
protect from the global oom situation, right? I am struggling to
understand what the expected behavior is when the oom is internal to
such a group though. Is killing a single task from such a group a
sensible choice? I am not really familiar with kubelet but can it cope
with data_sim going away from under it while the rest would still run?
Wouldn't it make more sense to simply tear down the whole thing?

But that is a separate thing.

> To fix this issue, we should make the calculation of oom point more
> accurate. We can achieve it by converting the chosen_point from 'unsigned
> long' to 'long'.

oom_score has very coarse units because it maps all the consumed
memory onto a 0 - 1000 scale, so effectively per-mille of the usable
memory. oom_score_adj acts on top of that as a bias. This is
exported to the userspace and I do not think we can change that (see
Documentation/filesystems/proc.rst) unfortunately. So your patch cannot
really be accepted as is because it would start reporting values outside
of the allowed range, unless I am doing some math incorrectly.

On the other hand, in this particular case I believe the existing
calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
the top consumer is 3976380 pages (data_sim's rss plus page tables), so
94.8%, while the lowest memory consumer is effectively 0%. Even if we
discount 94.8% by 99.8% we should still have something like 7950 pages.
So the normalization that oom_badness does cuts results too
aggressively. There was quite some churn in the calculation in the past
fixing weird rounding bugs so I have to think about how to fix this
properly some more.
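
The numbers in the paragraph above can be reproduced with a few lines of
Python. This is only a worked check of the arithmetic, assuming 4 KiB
pages and reading "discount by 99.8%" as keeping 2 per-mille of the
task's footprint; the variable names are illustrative.

```python
# Worked check of the figures above, using the data_sim row of the oom report.
PAGE_SIZE = 4096                      # bytes; assumes 4 KiB pages

total_pages = 16777216 // 4           # 16777216 kB limit -> pages
top_pages = 3967322 + 37101568 // PAGE_SIZE   # rss + page-table pages

share = top_pages / total_pages       # fraction of usable memory
remaining = top_pages * (1000 - 998) / 1000   # after a 99.8% discount

print(total_pages)                    # 4194304
print(top_pages)                      # 3976380
print(round(share * 100, 1))          # 94.8
print(int(remaining))                 # 7952
```

So a discount proportional to the task's own usage would still leave
data_sim several thousand pages ahead of pause, instead of both landing
on the same minimal score.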

That being said, even though the configuration is weird I do agree that
the oom_badness scaling is really unexpected and the memory consumption
in this particular example should be quite telling about who to choose
as an oom victim.
-- 
Michal Hocko
SUSE Labs