* [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory
@ 2021-08-18 14:59 George Kennedy
  2021-08-18 14:59 ` [PATCH 5.4.y 1/1] mm, oom: make the calculation of oom badness more accurate George Kennedy
  2021-08-28  1:36 ` [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory Sasha Levin
  0 siblings, 2 replies; 3+ messages in thread
From: George Kennedy @ 2021-08-18 14:59 UTC (permalink / raw)
  To: gregkh, laoar.shao
  Cc: george.kennedy, akpm, surenb, stable, christian, keescook, dhaval.giani

Upstream commit 9066e5c is missing from 5.4.y causing
kernel panic: System is deadlocked on memory
during 5.4.141-rc1 Syzkaller reproducer testing.

9066e5c 2020-08-11 Yafang Shao mm, oom: make the calculation of oom badness more accurate

Out of memory and no killable processes...
Kernel panic - not syncing: System is deadlocked on memory
CPU: 0 PID: 1 Comm: systemd Not tainted 5.4.141-rc1-syzk #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0xd4/0x119 lib/dump_stack.c:118
 panic+0x28f/0x6ad kernel/panic.c:221
 out_of_memory mm/oom_kill.c:1110 [inline]
 out_of_memory.cold.36+0xf4/0x174 mm/oom_kill.c:1045
 __alloc_pages_may_oom mm/page_alloc.c:3879 [inline]
 __alloc_pages_slowpath+0x1b30/0x2240 mm/page_alloc.c:4623
 __alloc_pages_nodemask+0x515/0x760 mm/page_alloc.c:4793
 alloc_pages_vma+0xe2/0x560 mm/mempolicy.c:2155
 __read_swap_cache_async+0x40e/0x770 mm/swap_state.c:399
 read_swap_cache_async+0x96/0x100 mm/swap_state.c:454
 swap_cluster_readahead+0x448/0x860 mm/swap_state.c:597
 swapin_readahead+0xbf/0xd40 mm/swap_state.c:789
 do_swap_page+0x812/0x1dc0 mm/memory.c:2937
 handle_pte_fault mm/memory.c:4003 [inline]
 __handle_mm_fault+0x17ad/0x24b0 mm/memory.c:4123
 handle_mm_fault+0x1f0/0x700 mm/memory.c:4160
 do_user_addr_fault arch/x86/mm/fault.c:1463 [inline]
 __do_page_fault+0x59e/0xd20 arch/x86/mm/fault.c:1528
 do_page_fault+0x52/0x390 arch/x86/mm/fault.c:1552
 do_async_page_fault+0x64/0xf0 arch/x86/kernel/kvm.c:253
 async_page_fault+0x3e/0x50 arch/x86/entry/entry_64.S:1206
RIP: 0010:ep_send_events_proc+0x2db/0xad0 fs/eventpoll.c:1751
Code: ff e8 79 f5 ff ff 31 ff 41 89 c7 89 c6 e8 6d b1 a3 ff 45 85 ff 0f 84 08 01 00 00 e8 4f b0 a3 ff 66 66 90 48 8b 85 50 ff ff ff <44> 89 38 e8 3d b0 a3 ff 66 66 90 48 8d 7b 74 48 89 f8 48 89 fe 48
RSP: 0018:ffff8881079e7ab0 EFLAGS: 00010293
RAX: 00007ffe9db639a0 RBX: ffff8880b490e180 RCX: ffffffff81d19d23
RDX: 0000000000000000 RSI: ffffffff81d19d31 RDI: 0000000000000005
RBP: ffff8881079e7bb0 R08: ffff8881079d8000 R09: ffffed1020f3cf2d
R10: ffffed1020f3cf2d R11: 0000000000000003 R12: dffffc0000000000
R13: ffff8880b490e198 R14: ffff8881079e7c10 R15: 0000000000000001
 ep_scan_ready_list.constprop.20+0x265/0x920 fs/eventpoll.c:702
 ep_send_events fs/eventpoll.c:1791 [inline]
 ep_poll+0x166/0xd70 fs/eventpoll.c:1939
 do_epoll_wait+0x192/0x1d0 fs/eventpoll.c:2291
 __do_sys_epoll_wait fs/eventpoll.c:2301 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2298 [inline]
 __x64_sys_epoll_wait+0x9c/0x100 fs/eventpoll.c:2298
 do_syscall_64+0xe6/0x4d0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4f31aad543
Code: Bad RIP value.
RSP: 002b:00007ffe9db63990 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
RAX: ffffffffffffffda RBX: 00007ffe9db639a0 RCX: 00007f4f31aad543
RDX: 000000000000002a RSI: 00007ffe9db639a0 RDI: 0000000000000004
RBP: 00007ffe9db63c90 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
R13: ffffffffffffffff R14: 0000000000007500 R15: 000055da3b5769c0


Yafang Shao (1):
  mm, oom: make the calculation of oom badness more accurate

 fs/proc/base.c      | 11 ++++++++++-
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)

-- 
1.8.3.1



* [PATCH 5.4.y 1/1] mm, oom: make the calculation of oom badness more accurate
  2021-08-18 14:59 [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory George Kennedy
@ 2021-08-18 14:59 ` George Kennedy
  2021-08-28  1:36 ` [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory Sasha Levin
  1 sibling, 0 replies; 3+ messages in thread
From: George Kennedy @ 2021-08-18 14:59 UTC (permalink / raw)
  To: gregkh, laoar.shao
  Cc: george.kennedy, akpm, surenb, stable, christian, keescook, dhaval.giani

From: Yafang Shao <laoar.shao@gmail.com>

Recently we found an issue in our production environment: when a memcg
oom is triggered, the oom killer doesn't choose the process with the
largest resident memory but chooses the first scanned process.  Note that
all processes in this memcg have the same oom_score_adj, so the oom killer
should choose the process with the largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore negative points and convert
all of them to 1.  Since all the processes in the targeted memcg have the
same oom_score_adj of -998, the points of these processes are all
negative.  As a result, the first scanned process is killed.
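
As a rough illustration of the clamping described above, the sketch below
compares pre-patch and post-patch scoring for two entries from the task
dump.  The struct toy_task type and both function names are hypothetical.
The page counts fold rss, swapents and page-table pages into one field, and
totalpages is assumed to be the 16777216kB memcg limit expressed in 4 KiB
pages.  This is a simplified model, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task: only the inputs the score depends on. */
struct toy_task {
	long pages;          /* rss + swapents + page-table pages */
	long oom_score_adj;  /* -1000..1000 */
};

/* Pre-patch scoring: any non-positive result is clamped to 1. */
static unsigned long old_badness(const struct toy_task *t,
				 unsigned long totalpages)
{
	long points;

	assert(t != NULL);
	points = t->pages + t->oom_score_adj * (long)(totalpages / 1000);
	return points > 0 ? (unsigned long)points : 1;
}

/* Post-patch scoring: the signed result is returned unchanged. */
static long new_badness(const struct toy_task *t, unsigned long totalpages)
{
	assert(t != NULL);
	return t->pages + t->oom_score_adj * (long)(totalpages / 1000);
}
```

With pause modelled as { 9, -998 }, data_sim as { 3976380, -998 } and
totalpages = 4194304, old_badness() returns 1 for both tasks, while
new_badness() gives -4185603 for pause and -209232 for data_sim, so
data_sim ranks higher.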

The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
Guaranteed pod, which has higher priority so that it is not killed by the
system oom.

To fix this issue, we should make the calculation of oom points more
accurate.  We can achieve this by converting chosen_points from 'unsigned
long' to 'long'.
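
The selection rule that follows from this change can be sketched as below.
This is a minimal standalone model, assuming the scores have already been
computed into an array.  pick_victim() is a hypothetical name, and LONG_MIN
stands for "not eligible", mirroring the sentinel the patch uses.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Return the index of the task with the highest signed score, or -1 if no
 * task is eligible.  LONG_MIN marks a task that must never be selected.
 */
static int pick_victim(const long *points, size_t n)
{
	long chosen_points = LONG_MIN;
	int chosen = -1;
	size_t i;

	assert(points != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Skip ineligible tasks and tasks scoring below the current pick. */
		if (points[i] == LONG_MIN || points[i] < chosen_points)
			continue;
		chosen = (int)i;
		chosen_points = points[i];
	}
	return chosen;
}
```

Because the comparison is signed, a task with a score of -209232 is
preferred over one with -4185603, even though both are negative.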

[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9066e5cfb73cdbcdbb49e87999482ab615e9fc76)
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
---
 fs/proc/base.c      | 11 ++++++++++-
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 90d2f62..5a187e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -549,8 +549,17 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 {
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
 	unsigned long points = 0;
+	long badness;
+
+	badness = oom_badness(task, totalpages);
+	/*
+	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
+	 * badness value into [0, 2000] range which we have been
+	 * exporting for a long time so userspace might depend on it.
+	 */
+	if (badness != LONG_MIN)
+		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
 
-	points = oom_badness(task, totalpages) * 1000 / totalpages;
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b9df343..2db9a14 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,7 +48,7 @@ struct oom_control {
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
-	unsigned long chosen_points;
+	long chosen_points;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
@@ -108,7 +108,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
-extern unsigned long oom_badness(struct task_struct *p,
+long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 212e718..f1b810d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -197,17 +197,17 @@ static bool is_dump_unreclaim_slabs(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages)
 {
 	long points;
 	long adj;
 
 	if (oom_unkillable_task(p))
-		return 0;
+		return LONG_MIN;
 
 	p = find_lock_task_mm(p);
 	if (!p)
-		return 0;
+		return LONG_MIN;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
@@ -219,7 +219,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
-		return 0;
+		return LONG_MIN;
 	}
 
 	/*
@@ -234,11 +234,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 	adj *= totalpages / 1000;
 	points += adj;
 
-	/*
-	 * Never return 0 for an eligible task regardless of the root bonus and
-	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
-	 */
-	return points > 0 ? points : 1;
+	return points;
 }
 
 static const char * const oom_constraint_text[] = {
@@ -311,7 +307,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
-	unsigned long points;
+	long points;
 
 	if (oom_unkillable_task(task))
 		goto next;
@@ -337,12 +333,12 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * killed first if it triggers an oom, then select it.
 	 */
 	if (oom_task_origin(task)) {
-		points = ULONG_MAX;
+		points = LONG_MAX;
 		goto select;
 	}
 
 	points = oom_badness(task, oc->totalpages);
-	if (!points || points < oc->chosen_points)
+	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
@@ -366,6 +362,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
  */
 static void select_bad_process(struct oom_control *oc)
 {
+	oc->chosen_points = LONG_MIN;
+
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
 	else {
-- 
1.8.3.1
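
For readers checking the proc_oom_score() hunk above, the scaling into the
[0, 2000] range can be worked through in isolation.  The helper name
scaled_oom_score() is hypothetical.  The sketch assumes badness lies
between -totalpages and 2 * totalpages and that badness * 1000 fits in a
long.

```c
#include <assert.h>
#include <limits.h>

/*
 * Map a signed badness value to the value printed in /proc/<pid>/oom_score,
 * using the same expression as the patch.  LONG_MIN (ineligible) maps to 0.
 */
static unsigned long scaled_oom_score(long badness, unsigned long totalpages)
{
	unsigned long points = 0;

	assert(totalpages > 0);

	if (badness != LONG_MIN)
		points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
	return points;
}
```

With totalpages = 1000000, a badness of -1000000 maps to 0, a badness of 0
maps to 666, and a badness of 2000000 maps to 2000.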



* Re: [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory
  2021-08-18 14:59 [PATCH 5.4.y 0/1] missing upstream commit 9066e5c causing: kernel panic: System is deadlocked on memory George Kennedy
  2021-08-18 14:59 ` [PATCH 5.4.y 1/1] mm, oom: make the calculation of oom badness more accurate George Kennedy
@ 2021-08-28  1:36 ` Sasha Levin
  1 sibling, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2021-08-28  1:36 UTC (permalink / raw)
  To: George Kennedy
  Cc: gregkh, laoar.shao, akpm, surenb, stable, christian, keescook,
	dhaval.giani

On Wed, Aug 18, 2021 at 09:59:06AM -0500, George Kennedy wrote:
>Upstream commit 9066e5c is missing from 5.4.y causing
>kernel panic: System is deadlocked on memory
>during 5.4.141-rc1 Syzkaller reproducer testing.

Queued up, thanks!

-- 
Thanks,
Sasha
