From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id 960E46B0647 for ; Wed, 2 Aug 2017 20:04:12 -0400 (EDT)
Received: by mail-wr0-f199.google.com with SMTP id g32so719752wrd.8 for ; Wed, 02 Aug 2017 17:04:12 -0700 (PDT)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id p15si327899wma.123.2017.08.02.17.04.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 02 Aug 2017 17:04:11 -0700 (PDT)
Date: Wed, 2 Aug 2017 17:04:09 -0700
From: Andrew Morton
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-Id: <20170802170409.caaaab2a866cf8ac210291cc@linux-foundation.org>
In-Reply-To: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tetsuo Handa
Cc: linux-mm@kvack.org, David Rientjes, Manish Jaggi, Michal Hocko, Oleg Nesterov, Vladimir Davydov

On Thu, 3 Aug 2017 08:55:04 +0900 Tetsuo Handa wrote:

> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer.
>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -652,6 +652,7 @@ struct task_struct {
>  	/* disallow userland-initiated cgroup migration */
>  	unsigned no_cgroup_migration:1;
>  #endif
> +	unsigned oom_kill_free_check_raced:1;
>
>  	unsigned long atomic_flags; /* Flags requiring atomic access. */
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f0..a1ae78d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -780,11 +780,19 @@ static bool task_will_free_mem(struct task_struct *task)
>  		return false;
>
>  	/*
> -	 * This task has already been drained by the oom reaper so there are
> -	 * only small chances it will free some more
> +	 * It is possible that current thread fails to try allocation from
> +	 * memory reserves if the OOM reaper set MMF_OOM_SKIP on this mm before
> +	 * current thread calls out_of_memory() in order to get TIF_MEMDIE.
> +	 * In that case, allow current thread to try TIF_MEMDIE allocation
> +	 * before start selecting next OOM victims.
>  	 */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		if (task == current && !task->oom_kill_free_check_raced) {
> +			task->oom_kill_free_check_raced = true;

OK, caller's task_lock() prevents races here.

nit: task->oom_kill_free_check_raced is `unsigned', so " = 1" would be
more truthful here...

> +			return true;
> +		}
>  		return false;
> +	}
>
>  	if (atomic_read(&mm->mm_users) <= 1)
>  		return true;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 5496C6B0649 for ; Wed, 2 Aug 2017 20:39:36 -0400 (EDT)
Received: by mail-pg0-f70.google.com with SMTP id y190so66886114pgb.3 for ; Wed, 02 Aug 2017 17:39:36 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp.
    [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id e70si4571666pgc.620.2017.08.02.17.39.34 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 02 Aug 2017 17:39:34 -0700 (PDT)
From: Tetsuo Handa
Subject: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Date: Thu, 3 Aug 2017 08:55:04 +0900
Message-Id: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
Sender: owner-linux-mm@kvack.org
List-ID:
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Tetsuo Handa, David Rientjes, Manish Jaggi, Michal Hocko, Oleg Nesterov, Vladimir Davydov

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer.

----------
oom02 0 TINFO : start OOM testing for mlocked pages.
oom02 0 TINFO : expected victim is 4578.
oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes.
oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes.
(...snipped...)
oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes.
[ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
(...snipped...)
[ 365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
[ 365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
[ 365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[ 365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[ 365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
[ 365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
[ 365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
[ 365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
[ 365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
[ 365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
[ 365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
[ 365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
[ 365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
[ 365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
[ 365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[ 365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
[ 365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
[ 365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
[ 365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
[ 365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
[ 365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
[ 365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
[ 365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
[ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
[ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
[ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL
[ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
[ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
[ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
(...snipped...)
[ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
(...snipped...)
[ 365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[ 365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[ 365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[ 365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
[ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
----------

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing
the victim's mm were not able to try allocation from memory reserves
after the OOM reaper gave up reclaiming memory.

We don't need to give up task_will_free_mem(current) without trying
allocation from memory reserves. We will need to select next OOM victim
only when allocation from memory reserves did not help.

Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
for once so that task_will_free_mem(current) will not start selecting next
OOM victim without trying allocation from memory reserves.
Link: http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
Reported-by: Manish Jaggi
Signed-off-by: Tetsuo Handa
Cc: Michal Hocko
Cc: Oleg Nesterov
Cc: Vladimir Davydov
Cc: David Rientjes
Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 14 +++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 94137e7..88da211 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned no_cgroup_migration:1;
 #endif
+	unsigned oom_kill_free_check_raced:1;

 	unsigned long atomic_flags; /* Flags requiring atomic access. */

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..a1ae78d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -780,11 +780,19 @@ static bool task_will_free_mem(struct task_struct *task)
 		return false;

 	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
+	 * It is possible that current thread fails to try allocation from
+	 * memory reserves if the OOM reaper set MMF_OOM_SKIP on this mm before
+	 * current thread calls out_of_memory() in order to get TIF_MEMDIE.
+	 * In that case, allow current thread to try TIF_MEMDIE allocation
+	 * before start selecting next OOM victims.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		if (task == current && !task->oom_kill_free_check_raced) {
+			task->oom_kill_free_check_raced = true;
+			return true;
+		}
 		return false;
+	}

 	if (atomic_read(&mm->mm_users) <= 1)
 		return true;
--
1.8.3.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 3F8526B0669 for ; Thu, 3 Aug 2017 03:10:56 -0400 (EDT)
Received: by mail-wm0-f70.google.com with SMTP id y206so1086364wmd.1 for ; Thu, 03 Aug 2017 00:10:56 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id f40si1029738wra.464.2017.08.03.00.10.54 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 03 Aug 2017 00:10:54 -0700 (PDT)
Date: Thu, 3 Aug 2017 09:10:52 +0200
From: Michal Hocko
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-ID: <20170803071051.GB12521@dhcp22.suse.cz>
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tetsuo Handa
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, David Rientjes, Manish Jaggi, Oleg Nesterov, Vladimir Davydov

On Thu 03-08-17 08:55:04, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer.
>
> ----------
> oom02 0 TINFO : start OOM testing for mlocked pages.
> oom02 0 TINFO : expected victim is 4578.
> oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes.
> oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes.
> (...snipped...)
> oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes.
> [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
> [ 365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
> [ 365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [ 365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [ 365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
> [ 365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
> [ 365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
> [ 365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
> [ 365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
> [ 365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
> [ 365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
> [ 365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
> [ 365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
> [ 365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
> [ 365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [ 365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
> [ 365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
> [ 365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
> [ 365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
> [ 365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
> [ 365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
> [ 365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
> [ 365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
> [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
> [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
> [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL
> [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
> [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
> [ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [ 365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [ 365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [ 365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
> [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
> ----------
>
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing
> the victim's mm were not able to try allocation from memory reserves
> after the OOM reaper gave up reclaiming memory.
>
> We don't need to give up task_will_free_mem(current) without trying
> allocation from memory reserves. We will need to select next OOM victim
> only when allocation from memory reserves did not help.
>
> Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> for once so that task_will_free_mem(current) will not start selecting next
> OOM victim without trying allocation from memory reserves.

As I've already said this is an ugly hack and once we have
http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
then it even shouldn't be needed because _all_ threads of the oom victim
will have an instant access to memory reserves.

So I do not think we want to merge this.

> Link: http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> Reported-by: Manish Jaggi
> Signed-off-by: Tetsuo Handa
> Cc: Michal Hocko
> Cc: Oleg Nesterov
> Cc: Vladimir Davydov
> Cc: David Rientjes
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> ---
>  include/linux/sched.h |  1 +
>  mm/oom_kill.c         | 14 +++++++++++---
>  2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 94137e7..88da211 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -652,6 +652,7 @@ struct task_struct {
>  	/* disallow userland-initiated cgroup migration */
>  	unsigned no_cgroup_migration:1;
>  #endif
> +	unsigned oom_kill_free_check_raced:1;
>
>  	unsigned long atomic_flags; /* Flags requiring atomic access. */
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f0..a1ae78d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -780,11 +780,19 @@ static bool task_will_free_mem(struct task_struct *task)
>  		return false;
>
>  	/*
> -	 * This task has already been drained by the oom reaper so there are
> -	 * only small chances it will free some more
> +	 * It is possible that current thread fails to try allocation from
> +	 * memory reserves if the OOM reaper set MMF_OOM_SKIP on this mm before
> +	 * current thread calls out_of_memory() in order to get TIF_MEMDIE.
> +	 * In that case, allow current thread to try TIF_MEMDIE allocation
> +	 * before start selecting next OOM victims.
>  	 */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		if (task == current && !task->oom_kill_free_check_raced) {
> +			task->oom_kill_free_check_raced = true;
> +			return true;
> +		}
>  		return false;
> +	}
>
>  	if (atomic_read(&mm->mm_users) <= 1)
>  		return true;
> --
> 1.8.3.1
>

--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 175F56B066D for ; Thu, 3 Aug 2017 03:53:43 -0400 (EDT)
Received: by mail-pg0-f71.google.com with SMTP id u7so6896092pgo.6 for ; Thu, 03 Aug 2017 00:53:43 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id j24si5174848pfk.548.2017.08.03.00.53.41 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 03 Aug 2017 00:53:41 -0700 (PDT)
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
From: Tetsuo Handa
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <20170803071051.GB12521@dhcp22.suse.cz>
In-Reply-To: <20170803071051.GB12521@dhcp22.suse.cz>
Message-Id: <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp>
Date: Thu, 3 Aug 2017 16:53:40 +0900
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
List-ID:
To: mhocko@suse.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com

Michal Hocko wrote:
> > We don't need to give up task_will_free_mem(current) without trying
> > allocation from memory reserves. We will need to select next OOM victim
> > only when allocation from memory reserves did not help.
> >
> > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> > for once so that task_will_free_mem(current) will not start selecting next
> > OOM victim without trying allocation from memory reserves.
>
> As I've already said this is an ugly hack and once we have
> http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
> then it even shouldn't be needed because _all_ threads of the oom victim
> will have an instant access to memory reserves.
>
> So I do not think we want to merge this.
>

No, we still want to merge this, for 4.8+ kernels which won't get your patch
backported will need this. Even after your patch is merged, there is a race
window where allocating threads are between after gfp_pfmemalloc_allowed() and
before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that
some threads could call out_of_memory() and hit this task_will_free_mem(current)
test. Ignoring MMF_OOM_SKIP for once is still useful.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 30E9A6B0677 for ; Thu, 3 Aug 2017 04:15:02 -0400 (EDT)
Received: by mail-wm0-f69.google.com with SMTP id i187so1247378wma.15 for ; Thu, 03 Aug 2017 01:15:02 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id p25si1067519wrp.296.2017.08.03.01.15.00 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 03 Aug 2017 01:15:00 -0700 (PDT)
Date: Thu, 3 Aug 2017 10:14:59 +0200
From: Michal Hocko
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-ID: <20170803081459.GD12521@dhcp22.suse.cz>
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <20170803071051.GB12521@dhcp22.suse.cz> <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tetsuo Handa
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com

On Thu 03-08-17 16:53:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > We don't need to give up task_will_free_mem(current) without trying
> > > allocation from memory reserves. We will need to select next OOM victim
> > > only when allocation from memory reserves did not help.
> > >
> > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> > > for once so that task_will_free_mem(current) will not start selecting next
> > > OOM victim without trying allocation from memory reserves.
> >
> > As I've already said this is an ugly hack and once we have
> > http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
> > then it even shouldn't be needed because _all_ threads of the oom victim
> > will have an instant access to memory reserves.
> >
> > So I do not think we want to merge this.
> >
>
> No, we still want to merge this, for 4.8+ kernels which won't get your patch
> backported will need this. Even after your patch is merged, there is a race
> window where allocating threads are between after gfp_pfmemalloc_allowed() and
> before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that
> some threads could call out_of_memory() and hit this task_will_free_mem(current)
> test. Ignoring MMF_OOM_SKIP for once is still useful.

I disagree. I am _highly_ skeptical this is a stable material. The
mentioned test case is artificial and the source of the problem is
somewhere else. Moreover the culprit is somewhere else. It is in the oom
reaper setting MMF_OOM_SKIP too early and it should be addressed there.
Do not add workarounds where they are not appropriate.

--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id BF6AF6B0736 for ; Fri, 4 Aug 2017 07:10:11 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id k190so14735424pge.9 for ; Fri, 04 Aug 2017 04:10:11 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id e131si794706pgc.786.2017.08.04.04.10.10 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 04:10:10 -0700 (PDT)
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
From: Tetsuo Handa
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <20170803071051.GB12521@dhcp22.suse.cz> <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp> <20170803081459.GD12521@dhcp22.suse.cz>
In-Reply-To: <20170803081459.GD12521@dhcp22.suse.cz>
Message-Id: <201708042010.HDD60496.LFtOQMFJOSFHOV@I-love.SAKURA.ne.jp>
Date: Fri, 4 Aug 2017 20:10:09 +0900
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
List-ID:
To: mhocko@suse.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com

Michal Hocko wrote:
> On Thu 03-08-17 16:53:40, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > We don't need to give up task_will_free_mem(current) without trying
> > > > allocation from memory reserves. We will need to select next OOM victim
> > > > only when allocation from memory reserves did not help.
> > > >
> > > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> > > > for once so that task_will_free_mem(current) will not start selecting next
> > > > OOM victim without trying allocation from memory reserves.
> > >
> > > As I've already said this is an ugly hack and once we have
> > > http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
> > > then it even shouldn't be needed because _all_ threads of the oom victim
> > > will have an instant access to memory reserves.
> > >
> > > So I do not think we want to merge this.
> > >
> >
> > No, we still want to merge this, for 4.8+ kernels which won't get your patch
> > backported will need this. Even after your patch is merged, there is a race
> > window where allocating threads are between after gfp_pfmemalloc_allowed() and
> > before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that
> > some threads could call out_of_memory() and hit this task_will_free_mem(current)
> > test. Ignoring MMF_OOM_SKIP for once is still useful.
>
> I disagree. I am _highly_ skeptical this is a stable material. The
> mentioned test case is artificial and the source of the problem is
> somewhere else. Moreover the culprit is somewhere else. It is in the oom
> reaper setting MMF_OOM_SKIP too early and it should be addressed there.
> Do not add workarounds where they are not appropriate.
>
So, what alternative can you provide us for now?

    The patch titled
         Subject: mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
    has been removed from the -mm tree. Its filename was
         mm-oom-task_will_free_memcurrent-should-ignore-mmf_oom_skip-for-once.patch

    This patch was dropped because an updated version will be merged

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 3CCA02802FE for ; Fri, 4 Aug 2017 07:26:03 -0400 (EDT)
Received: by mail-wm0-f72.google.com with SMTP id q189so5344649wmd.6 for ; Fri, 04 Aug 2017 04:26:03 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id l19si3403475wrl.11.2017.08.04.04.26.01 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 04:26:01 -0700 (PDT)
Date: Fri, 4 Aug 2017 13:26:00 +0200
From: Michal Hocko
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-ID: <20170804112600.GL26029@dhcp22.suse.cz>
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <20170803071051.GB12521@dhcp22.suse.cz> <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp> <20170803081459.GD12521@dhcp22.suse.cz> <201708042010.HDD60496.LFtOQMFJOSFHOV@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201708042010.HDD60496.LFtOQMFJOSFHOV@I-love.SAKURA.ne.jp>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tetsuo Handa
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com

On Fri 04-08-17 20:10:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 03-08-17 16:53:40, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > We don't need to give up task_will_free_mem(current) without trying
> > > > > allocation from memory reserves. We will need to select next OOM victim
> > > > > only when allocation from memory reserves did not help.
> > > > >
> > > > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> > > > > for once so that task_will_free_mem(current) will not start selecting next
> > > > > OOM victim without trying allocation from memory reserves.
> > > >
> > > > As I've already said this is an ugly hack and once we have
> > > > http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
> > > > then it even shouldn't be needed because _all_ threads of the oom victim
> > > > will have an instant access to memory reserves.
> > > >
> > > > So I do not think we want to merge this.
> > > >
> > >
> > > No, we still want to merge this, for 4.8+ kernels which won't get your patch
> > > backported will need this. Even after your patch is merged, there is a race
> > > window where allocating threads are between after gfp_pfmemalloc_allowed() and
> > > before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that
> > > some threads could call out_of_memory() and hit this task_will_free_mem(current)
> > > test. Ignoring MMF_OOM_SKIP for once is still useful.
> >
> > I disagree. I am _highly_ skeptical this is a stable material. The
> > mentioned test case is artificial and the source of the problem is
> > somewhere else. Moreover the culprit is somewhere else. It is in the oom
> > reaper setting MMF_OOM_SKIP too early and it should be addressed there.
> > Do not add workarounds where they are not appropriate.
> >
> So, what alternative can you provide us for now?

As I've already said http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org
seems to be a better alternative. I am waiting for further review
feedback before reposting it again.

--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id C04AD6B05B4 for ; Fri, 4 Aug 2017 07:44:54 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id i192so15405000pgc.11 for ; Fri, 04 Aug 2017 04:44:54 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id d15si820480pga.905.2017.08.04.04.44.52 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 04:44:53 -0700 (PDT)
Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
From: Tetsuo Handa
References: <20170803071051.GB12521@dhcp22.suse.cz> <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp> <20170803081459.GD12521@dhcp22.suse.cz> <201708042010.HDD60496.LFtOQMFJOSFHOV@I-love.SAKURA.ne.jp> <20170804112600.GL26029@dhcp22.suse.cz>
In-Reply-To: <20170804112600.GL26029@dhcp22.suse.cz>
Message-Id: <201708042044.JDB64025.SOMOFOHJFFtQLV@I-love.SAKURA.ne.jp>
Date: Fri, 4 Aug 2017 20:44:52 +0900
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
List-ID:
To: mhocko@suse.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com

Michal Hocko wrote:
> On Fri 04-08-17 20:10:09, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 03-08-17 16:53:40, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > > We don't need to give up task_will_free_mem(current) without trying
> > > > > > allocation from memory reserves. We will need to select next OOM victim
> > > > > > only when allocation from memory reserves did not help.
> > > > > >
> > > > > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
> > > > > > for once so that task_will_free_mem(current) will not start selecting next
> > > > > > OOM victim without trying allocation from memory reserves.
> > > > >
> > > > > As I've already said this is an ugly hack and once we have
> > > > > http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged
> > > > > then it even shouldn't be needed because _all_ threads of the oom victim
> > > > > will have an instant access to memory reserves.
> > > > >
> > > > > So I do not think we want to merge this.
> > > > >
> > > >
> > > > No, we still want to merge this, for 4.8+ kernels which won't get your patch
> > > > backported will need this. Even after your patch is merged, there is a race
> > > > window where allocating threads are between after gfp_pfmemalloc_allowed() and
> > > > before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that
> > > > some threads could call out_of_memory() and hit this task_will_free_mem(current)
> > > > test. Ignoring MMF_OOM_SKIP for once is still useful.
> > >
> > > I disagree. I am _highly_ skeptical this is a stable material. The
> > > mentioned test case is artificial and the source of the problem is
> > > somewhere else. Moreover the culprit is somewhere else. It is in the oom
> > > reaper setting MMF_OOM_SKIP too early and it should be addressed there.
> > > Do not add workarounds where they are not appropriate.
> > >
> > So, what alternative can you provide us for now?
>
> As I've already said http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org
> seems to be a better alternative. I am waiting for further review
> feedback before reposting it again.
>

As I've already said, your patch does not close this race completely.
Your patch will be too drastic/risky for stable material.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 447BC6B05B8 for ; Fri, 4 Aug 2017 07:52:31 -0400 (EDT)
Received: by mail-wr0-f198.google.com with SMTP id q50so5588432wrb.14 for ; Fri, 04 Aug 2017 04:52:31 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id 12si2980542wme.0.2017.08.04.04.52.30 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 04:52:30 -0700 (PDT) Date: Fri, 4 Aug 2017 13:52:28 +0200 From: Michal Hocko Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. Message-ID: <20170804115228.GO26029@dhcp22.suse.cz> References: <20170803071051.GB12521@dhcp22.suse.cz> <201708031653.JGD57352.OQFtVLSFOMOHJF@I-love.SAKURA.ne.jp> <20170803081459.GD12521@dhcp22.suse.cz> <201708042010.HDD60496.LFtOQMFJOSFHOV@I-love.SAKURA.ne.jp> <20170804112600.GL26029@dhcp22.suse.cz> <201708042044.JDB64025.SOMOFOHJFFtQLV@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201708042044.JDB64025.SOMOFOHJFFtQLV@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com On Fri 04-08-17 20:44:52, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Fri 04-08-17 20:10:09, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Thu 03-08-17 16:53:40, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > > > > > > We don't need to give up task_will_free_mem(current) without trying > > > > > > > allocation from memory reserves. We will need to select next OOM victim > > > > > > > only when allocation from memory reserves did not help. > > > > > > > > > > > > > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP > > > > > > > for once so that task_will_free_mem(current) will not start selecting next > > > > > > > OOM victim without trying allocation from memory reserves. 
> > > > > > > > > > > > As I've already said this is an ugly hack and once we have > > > > > > http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org merged > > > > > > then it even shouldn't be needed because _all_ threads of the oom victim > > > > > > will have an instant access to memory reserves. > > > > > > > > > > > > So I do not think we want to merge this. > > > > > > > > > > > > > > > > No, we still want to merge this, for 4.8+ kernels which won't get your patch > > > > > backported will need this. Even after your patch is merged, there is a race > > > > > window where allocating threads are between after gfp_pfmemalloc_allowed() and > > > > > before mutex_trylock(&oom_lock) in __alloc_pages_may_oom() which means that > > > > > some threads could call out_of_memory() and hit this task_will_free_mem(current) > > > > > test. Ignoring MMF_OOM_SKIP for once is still useful. > > > > > > > > I disagree. I am _highly_ skeptical this is a stable material. The > > > > mentioned test case is artificial and the source of the problem is > > > > somewhere else. Moreover the culprit is somewhere else. It is in the oom > > > > reaper setting MMF_OOM_SKIP too early and it should be addressed there. > > > > Do not add workarounds where they are not appropriate. > > > > > > > So, what alternative can you provide us for now? > > > > As I've already said http://lkml.kernel.org/r/20170727090357.3205-2-mhocko@kernel.org > > seems to be a better alternative. I am waiting for further review > > feedback before reposting it again. > > > As I've already said, your patch does not close this race completely. Neither this patch. > Your patch will be too drastic/risky for stable material. As I've said this doesn't look like a stable material. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id CFFEA28038D for ; Fri, 4 Aug 2017 07:55:22 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id g9so14965715pfk.13 for ; Fri, 04 Aug 2017 04:55:22 -0700 (PDT) Received: from NAM03-DM3-obe.outbound.protection.outlook.com (mail-dm3nam03on0062.outbound.protection.outlook.com. [104.47.41.62]) by mx.google.com with ESMTPS id i70si917179pfk.223.2017.08.04.04.55.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 04 Aug 2017 04:55:21 -0700 (PDT) Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> From: Manish Jaggi Message-ID: Date: Fri, 4 Aug 2017 17:24:48 +0530 MIME-Version: 1.0 In-Reply-To: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa , akpm@linux-foundation.org Cc: linux-mm@kvack.org, David Rientjes , Michal Hocko , Oleg Nesterov , Vladimir Davydov Hi Tetsuo Handa, On 8/3/2017 5:25 AM, Tetsuo Handa wrote: > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core > count causes random kernel panics when an OOM victim which consumed memory > in a way the OOM reaper does not help was selected by the OOM killer. > > ---------- > oom02 0 TINFO : start OOM testing for mlocked pages. > oom02 0 TINFO : expected victim is 4578. > oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes. > oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes. > (...snipped...) > oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes. 
> [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > (...snipped...) > [ 365.036127] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > [ 365.044691] [ 1905] 0 1905 3236 1714 10 4 0 0 systemd-journal > [ 365.054172] [ 1908] 0 1908 20247 590 8 4 0 0 lvmetad > [ 365.062959] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > [ 365.072266] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > [ 365.080963] [ 3145] 0 3145 1086 630 6 4 0 0 systemd-logind > [ 365.090353] [ 3146] 0 3146 1208 596 7 3 0 0 irqbalance > [ 365.099413] [ 3147] 81 3147 1118 625 5 4 0 -900 dbus-daemon > [ 365.108548] [ 3149] 998 3149 116294 4180 26 5 0 0 polkitd > [ 365.117333] [ 3164] 997 3164 19992 785 9 3 0 0 chronyd > [ 365.126118] [ 3180] 0 3180 55605 7880 29 3 0 0 firewalld > [ 365.135075] [ 3187] 0 3187 87842 3033 26 3 0 0 NetworkManager > [ 365.144465] [ 3290] 0 3290 43037 1224 16 5 0 0 rsyslogd > [ 365.153335] [ 3295] 0 3295 108279 6617 30 3 0 0 tuned > [ 365.161944] [ 3308] 0 3308 27846 676 11 3 0 0 crond > [ 365.170554] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > [ 365.179076] [ 3371] 0 3371 27307 364 6 3 0 0 agetty > [ 365.187790] [ 3375] 0 3375 29397 1125 11 3 0 0 login > [ 365.196402] [ 4178] 0 4178 4797 1119 14 4 0 0 master > [ 365.205101] [ 4209] 89 4209 4823 1396 12 4 0 0 pickup > [ 365.213798] [ 4211] 89 4211 4842 1485 12 3 0 0 qmgr > [ 365.222325] [ 4491] 0 4491 27965 1022 8 3 0 0 bash > [ 365.230849] [ 4513] 0 4513 670 365 5 3 0 0 oom02 > [ 365.239459] [ 4578] 0 4578 37776030 32890957 64257 138 0 0 oom02 > [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child > [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB > [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL > [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, 
shmem-rss:0kB > [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB > [ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > (...snipped...) > [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > (...snipped...) > [ 365.576298] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > [ 365.576338] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > [ 365.576342] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > [ 365.576347] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > [ 365.576356] [ 4580] 0 4578 37776030 32890417 64258 138 0 0 oom02 > [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes... > ---------- Wanted to understand the envisaged effect of this patch - would this patch kill the task fully or it will still take few more iterations of oom-kill to kill other process to free memory - when I apply this patch I see other tasks getting killed, though I didnt got panic in initial testing, I saw login process getting killed. So I am not sure if this patch works... > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory() > to return false as soon as MMF_OOM_SKIP is set, many threads sharing > the victim's mm were not able to try allocation from memory reserves > after the OOM reaper gave up reclaiming memory. > > We don't need to give up task_will_free_mem(current) without trying > allocation from memory reserves. We will need to select next OOM victim > only when allocation from memory reserves did not help. 
> > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP > for once so that task_will_free_mem(current) will not start selecting next > OOM victim without trying allocation from memory reserves. > > Link: http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com > Reported-by: Manish Jaggi > Signed-off-by: Tetsuo Handa > Cc: Michal Hocko > Cc: Oleg Nesterov > Cc: Vladimir Davydov > Cc: David Rientjes > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks") > --- > include/linux/sched.h | 1 + > mm/oom_kill.c | 14 +++++++++++--- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 94137e7..88da211 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -652,6 +652,7 @@ struct task_struct { > /* disallow userland-initiated cgroup migration */ > unsigned no_cgroup_migration:1; > #endif > + unsigned oom_kill_free_check_raced:1; > > unsigned long atomic_flags; /* Flags requiring atomic access. */ > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 9e8b4f0..a1ae78d 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -780,11 +780,19 @@ static bool task_will_free_mem(struct task_struct *task) > return false; > > /* > - * This task has already been drained by the oom reaper so there are > - * only small chances it will free some more > + * It is possible that current thread fails to try allocation from > + * memory reserves if the OOM reaper set MMF_OOM_SKIP on this mm before > + * current thread calls out_of_memory() in order to get TIF_MEMDIE. > + * In that case, allow current thread to try TIF_MEMDIE allocation > + * before start selecting next OOM victims. 
> */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) > + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > + if (task == current && !task->oom_kill_free_check_raced) { > + task->oom_kill_free_check_raced = true; > + return true; > + } > return false; > + } > > if (atomic_read(&mm->mm_users) <= 1) > return true; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id EC9456B06F2 for ; Fri, 4 Aug 2017 11:24:04 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id 24so20624207pfk.5 for ; Fri, 04 Aug 2017 08:24:04 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id h194si1155676pfe.672.2017.08.04.08.24.02 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 08:24:02 -0700 (PDT) Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. From: Tetsuo Handa References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> In-Reply-To: Message-Id: <201708050024.ABD87010.SFFOVQOFOJMHtL@I-love.SAKURA.ne.jp> Date: Sat, 5 Aug 2017 00:24:01 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mjaggi@caviumnetworks.com, akpm@linux-foundation.org Cc: linux-mm@kvack.org, rientjes@google.com, mhocko@suse.com, oleg@redhat.com, vdavydov.dev@gmail.com Manish Jaggi wrote: > Hi Tetsuo Handa, > > On 8/3/2017 5:25 AM, Tetsuo Handa wrote: > > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core > > count causes random kernel panics when an OOM victim which consumed memory > > in a way the OOM reaper does not help was selected by the OOM killer. 
> > > > ---------- > > oom02 0 TINFO : start OOM testing for mlocked pages. > > oom02 0 TINFO : expected victim is 4578. > > oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes. > > oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes. > > (...snipped...) > > oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes. > > [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > > (...snipped...) > > [ 365.036127] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > > [ 365.044691] [ 1905] 0 1905 3236 1714 10 4 0 0 systemd-journal > > [ 365.054172] [ 1908] 0 1908 20247 590 8 4 0 0 lvmetad > > [ 365.062959] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > > [ 365.072266] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > > [ 365.080963] [ 3145] 0 3145 1086 630 6 4 0 0 systemd-logind > > [ 365.090353] [ 3146] 0 3146 1208 596 7 3 0 0 irqbalance > > [ 365.099413] [ 3147] 81 3147 1118 625 5 4 0 -900 dbus-daemon > > [ 365.108548] [ 3149] 998 3149 116294 4180 26 5 0 0 polkitd > > [ 365.117333] [ 3164] 997 3164 19992 785 9 3 0 0 chronyd > > [ 365.126118] [ 3180] 0 3180 55605 7880 29 3 0 0 firewalld > > [ 365.135075] [ 3187] 0 3187 87842 3033 26 3 0 0 NetworkManager > > [ 365.144465] [ 3290] 0 3290 43037 1224 16 5 0 0 rsyslogd > > [ 365.153335] [ 3295] 0 3295 108279 6617 30 3 0 0 tuned > > [ 365.161944] [ 3308] 0 3308 27846 676 11 3 0 0 crond > > [ 365.170554] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > > [ 365.179076] [ 3371] 0 3371 27307 364 6 3 0 0 agetty > > [ 365.187790] [ 3375] 0 3375 29397 1125 11 3 0 0 login > > [ 365.196402] [ 4178] 0 4178 4797 1119 14 4 0 0 master > > [ 365.205101] [ 4209] 89 4209 4823 1396 12 4 0 0 pickup > > [ 365.213798] [ 4211] 89 4211 4842 1485 12 3 0 0 qmgr > > [ 365.222325] [ 4491] 0 4491 27965 1022 8 3 0 0 bash > > [ 365.230849] [ 4513] 0 4513 670 365 5 3 0 0 oom02 > > [ 365.239459] [ 4578] 0 4578 
37776030 32890957 64257 138 0 0 oom02 > > [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child > > [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB > > [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL > > [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB > > [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB > > [ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > > (...snipped...) > > [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > > (...snipped...) > > [ 365.576298] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > > [ 365.576338] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > > [ 365.576342] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > > [ 365.576347] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > > [ 365.576356] [ 4580] 0 4578 37776030 32890417 64258 138 0 0 oom02 > > [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes... > > ---------- > Wanted to understand the envisaged effect of this patch > - would this patch kill the task fully or it will still take few more > iterations of oom-kill to kill other process to free memory > - when I apply this patch I see other tasks getting killed, though I > didnt got panic in initial testing, I saw login process getting killed. > So I am not sure if this patch works... Thank you for testing. This patch is working as intended. This patch (or any other patches) won't wait for the OOM victim (in this case oom02) to be fully killed. We don't want to risk OOM lockup situation by waiting for the OOM victim to be fully killed. 
If the OOM reaper kernel thread waits for the OOM victim forever, different OOM stress will trigger OOM lockup situation. Thus, the OOM reaper kernel thread gives up waiting for the OOM victim as soon as memory which can be reclaimed before __mmput() from mmput() from exit_mm() from do_exit() is called is reclaimed and sets MMF_OOM_SKIP.

Other tasks might be getting killed, for threads which task_will_free_mem(current) returns false will call select_bad_process() and select_bad_process() will ignore existing OOM victims with MMF_OOM_SKIP already set. Compared to older kernels which do not have the OOM reaper support, this behavior looks like a regression. But please be patient. This behavior is our choice for not to risk OOM lockup situation.

This patch will prevent _all_ threads which task_will_free_mem(current) returns true from calling select_bad_process(). And Michal's patch will prevent _most_ threads which task_will_free_mem(current) returns true from calling select_bad_process(). Since oom02 has many threads which task_will_free_mem(current) returns true, this patch (or Michal's patch) will reduce possibility of killing all threads.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 217396B06F6 for ; Fri, 4 Aug 2017 11:54:50 -0400 (EDT)
Received: by mail-pf0-f197.google.com with SMTP id r187so21063847pfr.8 for ; Fri, 04 Aug 2017 08:54:50 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp.
[2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id m23si1306522plk.947.2017.08.04.08.54.47 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 04 Aug 2017 08:54:48 -0700 (PDT) Subject: Re: [PATCH] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. From: Tetsuo Handa References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <201708050024.ABD87010.SFFOVQOFOJMHtL@I-love.SAKURA.ne.jp> In-Reply-To: <201708050024.ABD87010.SFFOVQOFOJMHtL@I-love.SAKURA.ne.jp> Message-Id: <201708050054.FDD64564.tMQSVOFOFOLFJH@I-love.SAKURA.ne.jp> Date: Sat, 5 Aug 2017 00:54:46 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mjaggi@caviumnetworks.com, akpm@linux-foundation.org Cc: linux-mm@kvack.org, rientjes@google.com, mhocko@suse.com, oleg@redhat.com, vdavydov.dev@gmail.com Tetsuo Handa wrote: > Manish Jaggi wrote: > > Wanted to understand the envisaged effect of this patch > > - would this patch kill the task fully or it will still take few more > > iterations of oom-kill to kill other process to free memory > > - when I apply this patch I see other tasks getting killed, though I > > didnt got panic in initial testing, I saw login process getting killed. > > So I am not sure if this patch works... > > Thank you for testing. This patch is working as intended. > > This patch (or any other patches) won't wait for the OOM victim (in this case > oom02) to be fully killed. We don't want to risk OOM lockup situation by waiting > for the OOM victim to be fully killed. If the OOM reaper kernel thread waits for > the OOM victim forever, different OOM stress will trigger OOM lockup situation. > Thus, the OOM reaper kernel thread gives up waiting for the OOM victim as soon as > memory which can be reclaimed before __mmput() from mmput() from exit_mm() from > do_exit() is called is reclaimed and sets MMF_OOM_SKIP. 
> > Other tasks might be getting killed, for threads which task_will_free_mem(current) > returns false will call select_bad_process() and select_bad_process() will ignore > existing OOM victims with MMF_OOM_SKIP already set. Compared to older kernels > which do not have the OOM reaper support, this behavior looks like a regression. > But please be patient. This behavior is our choice for not to risk OOM lockup > situation. > > This patch will prevent _all_ threads which task_will_free_mem(current) returns > true from calling select_bad_process(). And Michal's patch will prevent _most_ > threads which task_will_free_mem(current) returns true from calling select_bad_process(). > Since oom02 has many threads which task_will_free_mem(current) returns true, > this patch (or Michal's patch) will reduce possibility of killing all threads. > Oh, the last line was confusing. Since oom02 has many threads which task_will_free_mem(current) returns true, this patch (or Michal's patch) will reduce possibility of killing other tasks (i.e. processes other than oom02) by increasing possibility of allocations by OOM victim threads (i.e. threads in oom02) to succeed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 3AF362802FE for ; Sat, 19 Aug 2017 02:23:27 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id k3so25611795pfc.0 for ; Fri, 18 Aug 2017 23:23:27 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. 
[2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id e184si4472474pgc.782.2017.08.18.23.23.23 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 18 Aug 2017 23:23:24 -0700 (PDT) Subject: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. From: Tetsuo Handa References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> In-Reply-To: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> Message-Id: <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp> Date: Sat, 19 Aug 2017 15:23:19 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org, mhocko@suse.com Cc: linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com Tetsuo Handa wrote at http://lkml.kernel.org/r/201708102328.ACD34352.OHFOLJMQVSFOFt@I-love.SAKURA.ne.jp : > Michal Hocko wrote: > > On Thu 10-08-17 21:10:30, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote: > > > > > > > Michal Hocko wrote: > > > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote: > > > > > > > > > My question is, how can users know it if somebody was OOM-killed needlessly > > > > > > > > > by allowing MMF_OOM_SKIP to race. > > > > > > > > > > > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP? > > > > > > > > > > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is > > > > > > > a pain which is difficult to explain to and persuade some of customers. > > > > > > > > > > > > How is this any different from a race with a task exiting an releasing > > > > > > some memory after we have crossed the point of no return and will kill > > > > > > something? 
> > > > > > > > > > I'm not complaining about an exiting task releasing some memory after we have > > > > > crossed the point of no return. > > > > > > > > > > What I'm saying is that we can postpone "the point of no return" if we ignore > > > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using oom_lock" > > > > > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for > > > > > once." thread). These are race conditions we can avoid without crystal ball. > > > > > > > > If those races are really that common than we can handle them even > > > > without "try once more" tricks. Really this is just an ugly hack. If you > > > > really care then make sure that we always try to allocate from memory > > > > reserves before going down the oom path. In other words, try to find a > > > > robust solution rather than tweaks around a problem. > > > > > > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes > > > oom_lock serialization from the OOM reaper, possibility of calling out_of_memory() > > > due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper set > > > MMF_OOM_SKIP quickly. > > > > > > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set > > > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before > > > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ? > > > > > > Excuse me? Are you suggesting to try memory reserves before > > > task_is_oom_victim(current) becomes true? > > > > No what I've tried to say is that if this really is a real problem, > > which I am not sure about, then the proper way to handle that is to > > attempt to allocate from memory reserves for an oom victim. I would be > > even willing to take the oom_lock back into the oom reaper path if the > > former turnes out to be awkward to implement. But all this assumes this > > is a _real_ problem. > > Aren't we back to square one? 
My question is, how can users know it if > somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race. > > You don't want to call get_page_from_freelist() from out_of_memory(), do you? > But without passing a flag "whether get_page_from_freelist() with memory reserves > was already attempted if current thread is an OOM victim" to task_will_free_mem() > in out_of_memory() and a flag "whether get_page_from_freelist() without memory > reserves was already attempted if current thread is not an OOM victim" to > test_bit(MMF_OOM_SKIP) in oom_evaluate_task(), we won't be able to know > if somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race. Michal, I did not get your answer, and your "mm, oom: do not rely on TIF_MEMDIE for memory reserves access" did not help solving this problem. (I confirmed it by reverting your "mm, oom: allow oom reaper to race with exit_mmap" and applying Andrea's "mm: oom: let oom_reap_task and exit_mmap run concurrently" and this patch on top of linux-next-20170817.) 
-----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#define NUMTHREADS 2
#define MMAPSIZE ((4096 * 1048576UL) / NUMTHREADS)
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	if (pipe(pipe_fd))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}
-----------
-----------
[ 204.413605] Out of memory: Kill process 9286 (a.out) score 930 or sacrifice child
[ 204.416241] Killed process 9286 (a.out) total-vm:4198476kB, anon-rss:72kB, file-rss:0kB, shmem-rss:3465520kB
[ 204.419783] oom_reaper: reaped process 9286 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:3465720kB
[ 204.455864] ------------[ cut here ]------------
[ 204.457921] kernel BUG at mm/oom_kill.c:786!
[ 204.459844] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 204.461877] Modules linked in: coretemp pcspkr sg vmw_vmci i2c_piix4 shpchp sd_mod ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm drm libahci ata_piix e1000 mptbase i2c_core libata ipv6
[ 204.469328] CPU: 1 PID: 9287 Comm: a.out Not tainted 4.13.0-rc5-next-20170817+ #662
[ 204.472117] Hardware name: VMware, Inc.
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 204.475265] task: ffff880135c88040 task.stack: ffff880137554000 [ 204.477556] RIP: 0010:task_will_free_mem+0x1a7/0x240 [ 204.479651] RSP: 0018:ffff880137557698 EFLAGS: 00010246 [ 204.481750] RAX: 0000000000000000 RBX: ffff880135c88040 RCX: 00000000ffffffff [ 204.484344] RDX: ffff880135c88040 RSI: 0000000000000000 RDI: ffff880135c88040 [ 204.487077] RBP: ffff8801375576b0 R08: 0000000000000000 R09: 0000000000000e6d [ 204.489565] R10: 0000000000000000 R11: 0000000000000e95 R12: ffff880133b48040 [ 204.492019] R13: ffff88013f7fea20 R14: 0000000000000000 R15: 00000000014200ca [ 204.494467] FS: 00007fc4d067d740(0000) GS:ffff88013a000000(0000) knlGS:0000000000000000 [ 204.497075] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 204.499231] CR2: 00007fc4b9ec2000 CR3: 0000000137b39004 CR4: 00000000001606e0 [ 204.501907] Call Trace: [ 204.503426] out_of_memory+0x54/0x560 [ 204.505137] __alloc_pages_nodemask+0xe91/0xf50 [ 204.507002] alloc_pages_vma+0x76/0x1a0 [ 204.508694] shmem_alloc_page+0x71/0xb0 [ 204.510351] ? native_sched_clock+0x36/0xa0 [ 204.512059] ? native_sched_clock+0x36/0xa0 [ 204.513721] ? find_get_entry+0x191/0x280 [ 204.515327] shmem_alloc_and_acct_page+0x83/0x230 [ 204.517330] shmem_getpage_gfp+0x1b6/0xe30 [ 204.519005] shmem_fault+0x97/0x200 [ 204.520558] ? __lock_acquire+0x4a7/0x1c20 [ 204.522101] ? __lock_acquire+0x4a7/0x1c20 [ 204.523619] __do_fault+0x19/0x120 [ 204.524965] __handle_mm_fault+0x8e3/0x1250 [ 204.526484] ? native_sched_clock+0x36/0xa0 [ 204.527945] handle_mm_fault+0x186/0x360 [ 204.529355] ? 
handle_mm_fault+0x47/0x360 [ 204.530771] __do_page_fault+0x1d2/0x510 [ 204.532177] do_page_fault+0x21/0x70 [ 204.533563] page_fault+0x22/0x30 [ 204.534906] RIP: 0010:__clear_user+0x3d/0x70 [ 204.536445] RSP: 0018:ffff880137557d58 EFLAGS: 00010206 [ 204.538108] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000200 [ 204.540173] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00007fc4b9ec2000 [ 204.542215] RBP: ffff880137557d68 R08: 0000000000000001 R09: 0000000000000000 [ 204.544263] R10: 0000000000000001 R11: 0000000000000001 R12: 00007fc4b9ec2000 [ 204.546300] R13: ffff880137557e18 R14: 0000000069e16000 R15: 0000000000001000 [ 204.548338] ? __clear_user+0x1e/0x70 [ 204.549679] clear_user+0x34/0x50 [ 204.551088] iov_iter_zero+0x88/0x380 [ 204.552403] read_iter_zero+0x38/0xb0 [ 204.553839] new_sync_read+0xcc/0x110 [ 204.555215] __vfs_read+0x27/0x40 [ 204.556605] vfs_read+0xa0/0x160 [ 204.557749] SyS_read+0x53/0xc0 [ 204.558907] do_syscall_64+0x61/0x1d0 [ 204.560197] entry_SYSCALL64_slow_path+0x25/0x25 [ 204.561628] RIP: 0033:0x7fc4d0194c30 [ 204.562819] RSP: 002b:00007fc4d068afd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 204.564793] RAX: ffffffffffffffda RBX: 00007fc4500ac000 RCX: 00007fc4d0194c30 [ 204.566802] RDX: 0000000080000000 RSI: 00007fc4500ac000 RDI: 0000000000000005 [ 204.568760] RBP: 0000000000000005 R08: ffffffffffffffff R09: 0000000000000000 [ 204.570641] R10: 00007fc4d068ad60 R11: 0000000000000246 R12: 00000000004006d9 [ 204.572522] R13: 00007ffc72529cc0 R14: 0000000000000000 R15: 0000000000000000 [ 204.574393] Code: 83 c4 08 89 d8 5b 41 5c 5d c3 65 48 8b 14 25 00 c6 00 00 31 c0 48 39 d3 0f 85 90 fe ff ff f6 83 78 16 00 00 02 0f 85 83 fe ff ff <0f> 0b 80 3d b6 f3 b5 00 00 0f 85 f7 fe ff ff e8 05 ec f8 ff 84 [ 204.579049] RIP: task_will_free_mem+0x1a7/0x240 RSP: ffff880137557698 [ 204.580840] ---[ end trace 2be364e2657b83fa ]--- ----------- Therefore, I propose this patch for inclusion. 
---------------------------------------- >>From cf6ef5a7b110d12e98bb2928e839abee16418188 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 17 Aug 2017 14:45:31 +0900 Subject: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core count causes random kernel panics when an OOM victim which consumed memory in a way the OOM reaper does not help was selected by the OOM killer [1]. ---------- oom02 0 TINFO : start OOM testing for mlocked pages. oom02 0 TINFO : expected victim is 4578. oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes. oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes. (...snipped...) oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes. [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 (...snipped...) [ 365.036127] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 365.044691] [ 1905] 0 1905 3236 1714 10 4 0 0 systemd-journal [ 365.054172] [ 1908] 0 1908 20247 590 8 4 0 0 lvmetad [ 365.062959] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd [ 365.072266] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd [ 365.080963] [ 3145] 0 3145 1086 630 6 4 0 0 systemd-logind [ 365.090353] [ 3146] 0 3146 1208 596 7 3 0 0 irqbalance [ 365.099413] [ 3147] 81 3147 1118 625 5 4 0 -900 dbus-daemon [ 365.108548] [ 3149] 998 3149 116294 4180 26 5 0 0 polkitd [ 365.117333] [ 3164] 997 3164 19992 785 9 3 0 0 chronyd [ 365.126118] [ 3180] 0 3180 55605 7880 29 3 0 0 firewalld [ 365.135075] [ 3187] 0 3187 87842 3033 26 3 0 0 NetworkManager [ 365.144465] [ 3290] 0 3290 43037 1224 16 5 0 0 rsyslogd [ 365.153335] [ 3295] 0 3295 108279 6617 30 3 0 0 tuned [ 365.161944] [ 3308] 0 3308 27846 676 11 3 0 0 crond [ 365.170554] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd [ 365.179076] [ 3371] 0 3371 27307 364 6 3 0 0 agetty [ 
365.187790] [ 3375] 0 3375 29397 1125 11 3 0 0 login [ 365.196402] [ 4178] 0 4178 4797 1119 14 4 0 0 master [ 365.205101] [ 4209] 89 4209 4823 1396 12 4 0 0 pickup [ 365.213798] [ 4211] 89 4211 4842 1485 12 3 0 0 qmgr [ 365.222325] [ 4491] 0 4491 27965 1022 8 3 0 0 bash [ 365.230849] [ 4513] 0 4513 670 365 5 3 0 0 oom02 [ 365.239459] [ 4578] 0 4578 37776030 32890957 64257 138 0 0 oom02 [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB [ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 (...snipped...) [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 (...snipped...) [ 365.576298] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 365.576338] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd [ 365.576342] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd [ 365.576347] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd [ 365.576356] [ 4580] 0 4578 37776030 32890417 64258 138 0 0 oom02 [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes... ---------- Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory() to return false as soon as MMF_OOM_SKIP is set, many threads sharing the victim's mm were not able to try allocation from memory reserves after the OOM reaper gave up reclaiming memory. 
Until Linux 4.7, we were using

  if (current->mm &&
      (fatal_signal_pending(current) || task_will_free_mem(current)))

as the condition for trying allocation from memory reserves. That carried
the risk of OOM lockup, but reports like [1] were impossible. Linux 4.8+
has regressed compared to Linux 4.7 because of the risk of needlessly
selecting more OOM victims. There is no need to give up on
task_will_free_mem(current) before trying allocation from memory reserves.
We need to select the next OOM victim only when allocation from memory
reserves did not help.

The OOM victim does not have to be a malicious process that consumes all
memory. It is possible that a multithreaded process which is not a memory
hog is selected by the OOM killer, that the OOM reaper fails to reclaim
memory due to e.g. khugepaged [2], and that the process never gets to try
allocation from memory reserves.

Although "mm, oom: do not rely on TIF_MEMDIE for memory reserves access"
tried to reduce this race window by replacing TIF_MEMDIE with oom_mm, and
"mm: oom: let oom_reap_task and exit_mmap run concurrently" did not remove
oom_lock serialization, this race window is still easy to trigger. You can
confirm it by adding "BUG_ON(1);" at "task->oom_kill_free_check_raced = 1;"
of this patch.

Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP
once, so that the current thread will not start selecting the next OOM
victim without trying allocation from memory reserves.
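The once-only behaviour described in this changelog can be modelled in
userspace. What follows is a minimal sketch, not kernel code: struct
model_mm, struct model_task, model_task_will_free_mem(), model_demo() and
the is_current parameter are hypothetical stand-ins introduced only for
illustration, and the other checks the real task_will_free_mem() performs
are assumed away or reduced to a single comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of mm_struct this sketch needs. */
struct model_mm {
	bool oom_skip;	/* models MMF_OOM_SKIP in mm->flags */
	int mm_users;	/* models mm->mm_users */
};

/* Hypothetical stand-in for the parts of task_struct this sketch needs. */
struct model_task {
	struct model_mm *mm;
	bool free_check_raced;	/* models task->oom_kill_free_check_raced */
};

/*
 * Models only the hunk changed by the patch. "is_current" replaces the
 * kernel's "task == current" comparison, which has no userspace equivalent.
 */
bool model_task_will_free_mem(struct model_task *task, bool is_current)
{
	struct model_mm *mm = task->mm;

	if (!mm)
		return false;

	if (mm->oom_skip) {
		/* The first call by the current thread gets one more try. */
		if (is_current && !task->free_check_raced) {
			task->free_check_raced = true;
			return true;
		}
		return false;
	}

	/* Simplification: the real function also inspects other mm users. */
	return mm->mm_users <= 1;
}

/* Shows the "ignore once" behaviour for a reaped, shared mm. */
void model_demo(void)
{
	struct model_mm mm = { .oom_skip = true, .mm_users = 2 };
	struct model_task t = { .mm = &mm, .free_check_raced = false };

	assert(model_task_will_free_mem(&t, true));	/* first call: retry */
	assert(!model_task_will_free_mem(&t, true));	/* second call: give up */
}
```

In this model, as in the patch, the bit lives in the task rather than in
the mm, so each thread sharing the victim's mm gets its own single retry,
and nothing ever clears the bit again.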
[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com [2] http://lkml.kernel.org/r/201708090835.ICI69305.VFFOLMHOStJOQF@I-love.SAKURA.ne.jp Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks") Reported-by: Manish Jaggi Signed-off-by: Tetsuo Handa Cc: Michal Hocko Cc: Oleg Nesterov Cc: Vladimir Davydov Cc: David Rientjes --- include/linux/sched.h | 1 + mm/oom_kill.c | 14 +++++++++++--- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6110471..11f8d54 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,7 @@ struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_kill_free_check_raced:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ab8348d..c5fb8a3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -749,11 +749,19 @@ static bool task_will_free_mem(struct task_struct *task) return false; /* - * This task has already been drained by the oom reaper so there are - * only small chances it will free some more + * The current thread might fail to try OOM_ALLOC allocation if the OOM + * reaper set MMF_OOM_SKIP on this mm when the current thread was + * between after __gfp_pfmemalloc_flags() and before out_of_memory(). + * Make sure that the current thread has tried OOM_ALLOC allocation + * before starting to select the next OOM victims. */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + if (task == current && !task->oom_kill_free_check_raced) { + task->oom_kill_free_check_raced = 1; + return true; + } return false; + } if (atomic_read(&mm->mm_users) <= 1) return true; -- 2.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 8BDB5280310 for ; Mon, 21 Aug 2017 04:43:11 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id p14so19308684wrg.8 for ; Mon, 21 Aug 2017 01:43:11 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id s26si8981295wrs.289.2017.08.21.01.43.09 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 21 Aug 2017 01:43:09 -0700 (PDT) Date: Mon, 21 Aug 2017 10:43:07 +0200 From: Michal Hocko Subject: Re: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. Message-ID: <20170821084307.GB25956@dhcp22.suse.cz> References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com On Sat 19-08-17 15:23:19, Tetsuo Handa wrote: > Tetsuo Handa wrote at http://lkml.kernel.org/r/201708102328.ACD34352.OHFOLJMQVSFOFt@I-love.SAKURA.ne.jp : > > Michal Hocko wrote: > > > On Thu 10-08-17 21:10:30, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > > > > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote: > > > > > > Michal Hocko wrote: > > > > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote: > > > > > > > > Michal Hocko wrote: > > > > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote: > > > > > > > > > > My question is, how can users know it if somebody was OOM-killed needlessly > > > > > > > > > > by allowing MMF_OOM_SKIP to race. 
> > > > > > > > > > > > > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP? > > > > > > > > > > > > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is > > > > > > > > a pain which is difficult to explain to and persuade some of customers. > > > > > > > > > > > > > > How is this any different from a race with a task exiting an releasing > > > > > > > some memory after we have crossed the point of no return and will kill > > > > > > > something? > > > > > > > > > > > > I'm not complaining about an exiting task releasing some memory after we have > > > > > > crossed the point of no return. > > > > > > > > > > > > What I'm saying is that we can postpone "the point of no return" if we ignore > > > > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using oom_lock" > > > > > > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for > > > > > > once." thread). These are race conditions we can avoid without crystal ball. > > > > > > > > > > If those races are really that common than we can handle them even > > > > > without "try once more" tricks. Really this is just an ugly hack. If you > > > > > really care then make sure that we always try to allocate from memory > > > > > reserves before going down the oom path. In other words, try to find a > > > > > robust solution rather than tweaks around a problem. > > > > > > > > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes > > > > oom_lock serialization from the OOM reaper, possibility of calling out_of_memory() > > > > due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper set > > > > MMF_OOM_SKIP quickly. > > > > > > > > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set > > > > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before > > > > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ? 
> > > > > > > > Excuse me? Are you suggesting to try memory reserves before > > > > task_is_oom_victim(current) becomes true? > > > > > > No what I've tried to say is that if this really is a real problem, > > > which I am not sure about, then the proper way to handle that is to > > > attempt to allocate from memory reserves for an oom victim. I would be > > > even willing to take the oom_lock back into the oom reaper path if the > > > former turnes out to be awkward to implement. But all this assumes this > > > is a _real_ problem. > > > > Aren't we back to square one? My question is, how can users know it if > > somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race. > > > > You don't want to call get_page_from_freelist() from out_of_memory(), do you? > > But without passing a flag "whether get_page_from_freelist() with memory reserves > > was already attempted if current thread is an OOM victim" to task_will_free_mem() > > in out_of_memory() and a flag "whether get_page_from_freelist() without memory > > reserves was already attempted if current thread is not an OOM victim" to > > test_bit(MMF_OOM_SKIP) in oom_evaluate_task(), we won't be able to know > > if somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race. > > Michal, I did not get your answer, and your "mm, oom: do not rely on > TIF_MEMDIE for memory reserves access" did not help solving this problem. > (I confirmed it by reverting your "mm, oom: allow oom reaper to race with > exit_mmap" and applying Andrea's "mm: oom: let oom_reap_task and exit_mmap > run concurrently" and this patch on top of linux-next-20170817.) By "this patch" you probably mean a BUG_ON(tsk_is_oom_victim) somewhere in task_will_free_mem right? I do not see anything like that in you email. 
> [ 204.413605] Out of memory: Kill process 9286 (a.out) score 930 or sacrifice child
> [ 204.416241] Killed process 9286 (a.out) total-vm:4198476kB, anon-rss:72kB, file-rss:0kB, shmem-rss:3465520kB
> [ 204.419783] oom_reaper: reaped process 9286 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:3465720kB
> [ 204.455864] ------------[ cut here ]------------
> [ 204.457921] kernel BUG at mm/oom_kill.c:786!
>
> Therefore, I propose this patch for inclusion.

I've already told you that this is a wrong approach to handle a possible
race and offered you an alternative. I really fail to see why you keep
reposting it. So to make myself absolutely clear

Nacked-by: Michal Hocko to the patch below.

> >From cf6ef5a7b110d12e98bb2928e839abee16418188 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa
> Date: Thu, 17 Aug 2017 14:45:31 +0900
> Subject: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
>
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
>
> ----------
> oom02 0 TINFO : start OOM testing for mlocked pages.
> oom02 0 TINFO : expected victim is 4578.
> oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes.
> oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes.
> (...snipped...)
> oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes.
> [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.036127] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > [ 365.044691] [ 1905] 0 1905 3236 1714 10 4 0 0 systemd-journal > [ 365.054172] [ 1908] 0 1908 20247 590 8 4 0 0 lvmetad > [ 365.062959] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > [ 365.072266] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > [ 365.080963] [ 3145] 0 3145 1086 630 6 4 0 0 systemd-logind > [ 365.090353] [ 3146] 0 3146 1208 596 7 3 0 0 irqbalance > [ 365.099413] [ 3147] 81 3147 1118 625 5 4 0 -900 dbus-daemon > [ 365.108548] [ 3149] 998 3149 116294 4180 26 5 0 0 polkitd > [ 365.117333] [ 3164] 997 3164 19992 785 9 3 0 0 chronyd > [ 365.126118] [ 3180] 0 3180 55605 7880 29 3 0 0 firewalld > [ 365.135075] [ 3187] 0 3187 87842 3033 26 3 0 0 NetworkManager > [ 365.144465] [ 3290] 0 3290 43037 1224 16 5 0 0 rsyslogd > [ 365.153335] [ 3295] 0 3295 108279 6617 30 3 0 0 tuned > [ 365.161944] [ 3308] 0 3308 27846 676 11 3 0 0 crond > [ 365.170554] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > [ 365.179076] [ 3371] 0 3371 27307 364 6 3 0 0 agetty > [ 365.187790] [ 3375] 0 3375 29397 1125 11 3 0 0 login > [ 365.196402] [ 4178] 0 4178 4797 1119 14 4 0 0 master > [ 365.205101] [ 4209] 89 4209 4823 1396 12 4 0 0 pickup > [ 365.213798] [ 4211] 89 4211 4842 1485 12 3 0 0 qmgr > [ 365.222325] [ 4491] 0 4491 27965 1022 8 3 0 0 bash > [ 365.230849] [ 4513] 0 4513 670 365 5 3 0 0 oom02 > [ 365.239459] [ 4578] 0 4578 37776030 32890957 64257 138 0 0 oom02 > [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child > [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB > [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL > [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB > [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB > [ 365.283361] oom02:4586 invoked 
oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > (...snipped...) > [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0 > (...snipped...) > [ 365.576298] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name > [ 365.576338] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd > [ 365.576342] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd > [ 365.576347] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd > [ 365.576356] [ 4580] 0 4578 37776030 32890417 64258 138 0 0 oom02 > [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes... > ---------- > > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory() > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the > victim's mm were not able to try allocation from memory reserves after the > OOM reaper gave up reclaiming memory. > > Until Linux 4.7, we were using > > if (current->mm && > (fatal_signal_pending(current) || task_will_free_mem(current))) > > as a condition to try allocation from memory reserves with the risk of OOM > lockup, but reports like [1] were impossible. Linux 4.8+ are regressed > compared to Linux 4.7 due to the risk of needlessly selecting more OOM > victims. We don't need to give up task_will_free_mem(current) before trying > allocation from memory reserves. We will need to select next OOM victim > only when allocation from memory reserves did not help. > > There is no need that the OOM victim is such malicious that consumes all > memory. It is possible that a multithreaded but non memory hog process is > selected by the OOM killer, and the OOM reaper fails to reclaim memory due > to e.g. khugepaged [2], and the process fails to try allocation from memory > reserves. 
> > Although "mm, oom: do not rely on TIF_MEMDIE for memory reserves access" > tried to reduce this race window by replacing TIF_MEMDIE with oom_mm, and > "mm: oom: let oom_reap_task and exit_mmap run concurrently" did not remove > oom_lock serialization, this race window is still easy to trigger. You can > confirm it by adding "BUG_ON(1);" at "task->oom_kill_free_check_raced = 1;" > of this patch. > > Thus, this patch allows task_will_free_mem(current) to ignore MMF_OOM_SKIP > for once so that task_will_free_mem(current) will not start selecting next > OOM victim without trying allocation from memory reserves. > > [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com > [2] http://lkml.kernel.org/r/201708090835.ICI69305.VFFOLMHOStJOQF@I-love.SAKURA.ne.jp > > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks") > Reported-by: Manish Jaggi > Signed-off-by: Tetsuo Handa > Cc: Michal Hocko > Cc: Oleg Nesterov > Cc: Vladimir Davydov > Cc: David Rientjes > --- > include/linux/sched.h | 1 + > mm/oom_kill.c | 14 +++++++++++--- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 6110471..11f8d54 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -652,6 +652,7 @@ struct task_struct { > /* disallow userland-initiated cgroup migration */ > unsigned no_cgroup_migration:1; > #endif > + unsigned oom_kill_free_check_raced:1; > > unsigned long atomic_flags; /* Flags requiring atomic access. 
*/ > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ab8348d..c5fb8a3 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -749,11 +749,19 @@ static bool task_will_free_mem(struct task_struct *task) > return false; > > /* > - * This task has already been drained by the oom reaper so there are > - * only small chances it will free some more > + * The current thread might fail to try OOM_ALLOC allocation if the OOM > + * reaper set MMF_OOM_SKIP on this mm when the current thread was > + * between after __gfp_pfmemalloc_flags() and before out_of_memory(). > + * Make sure that the current thread has tried OOM_ALLOC allocation > + * before starting to select the next OOM victims. > */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) > + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > + if (task == current && !task->oom_kill_free_check_raced) { > + task->oom_kill_free_check_raced = 1; > + return true; > + } > return false; > + } > > if (atomic_read(&mm->mm_users) <= 1) > return true; > -- > 2.9.5 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 420AB6B04DB for ; Mon, 21 Aug 2017 07:42:04 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id u1so13410840pgq.9 for ; Mon, 21 Aug 2017 04:42:04 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id t10si657518pge.766.2017.08.21.04.42.01 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 21 Aug 2017 04:42:02 -0700 (PDT) Subject: Re: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once. 
From: Tetsuo Handa References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp> <20170821084307.GB25956@dhcp22.suse.cz> In-Reply-To: <20170821084307.GB25956@dhcp22.suse.cz> Message-Id: <201708212041.GAJ05272.VOMOJOFSQLFtHF@I-love.SAKURA.ne.jp> Date: Mon, 21 Aug 2017 20:41:52 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@suse.com Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com Michal Hocko wrote: > On Sat 19-08-17 15:23:19, Tetsuo Handa wrote: > > Tetsuo Handa wrote at http://lkml.kernel.org/r/201708102328.ACD34352.OHFOLJMQVSFOFt@I-love.SAKURA.ne.jp : > > > Michal Hocko wrote: > > > > On Thu 10-08-17 21:10:30, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > > > > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote: > > > > > > > Michal Hocko wrote: > > > > > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote: > > > > > > > > > Michal Hocko wrote: > > > > > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote: > > > > > > > > > > > My question is, how can users know it if somebody was OOM-killed needlessly > > > > > > > > > > > by allowing MMF_OOM_SKIP to race. > > > > > > > > > > > > > > > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP? > > > > > > > > > > > > > > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is > > > > > > > > > a pain which is difficult to explain to and persuade some of customers. > > > > > > > > > > > > > > > > How is this any different from a race with a task exiting an releasing > > > > > > > > some memory after we have crossed the point of no return and will kill > > > > > > > > something? 
> > > > > > > > > > > > > > I'm not complaining about an exiting task releasing some memory after we have > > > > > > > crossed the point of no return. > > > > > > > > > > > > > > What I'm saying is that we can postpone "the point of no return" if we ignore > > > > > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using oom_lock" > > > > > > > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for > > > > > > > once." thread). These are race conditions we can avoid without crystal ball. > > > > > > > > > > > > If those races are really that common than we can handle them even > > > > > > without "try once more" tricks. Really this is just an ugly hack. If you > > > > > > really care then make sure that we always try to allocate from memory > > > > > > reserves before going down the oom path. In other words, try to find a > > > > > > robust solution rather than tweaks around a problem. > > > > > > > > > > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes > > > > > oom_lock serialization from the OOM reaper, possibility of calling out_of_memory() > > > > > due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper set > > > > > MMF_OOM_SKIP quickly. > > > > > > > > > > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set > > > > > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before > > > > > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ? > > > > > > > > > > Excuse me? Are you suggesting to try memory reserves before > > > > > task_is_oom_victim(current) becomes true? > > > > > > > > No what I've tried to say is that if this really is a real problem, > > > > which I am not sure about, then the proper way to handle that is to > > > > attempt to allocate from memory reserves for an oom victim. 
I would be
> > > > even willing to take the oom_lock back into the oom reaper path if the
> > > > former turns out to be awkward to implement. But all this assumes this
> > > > is a _real_ problem.
> > >
> > > Aren't we back to square one? My question is, how can users know it if
> > > somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > >
> > > You don't want to call get_page_from_freelist() from out_of_memory(), do you?
> > > But without passing a flag "whether get_page_from_freelist() with memory reserves
> > > was already attempted if current thread is an OOM victim" to task_will_free_mem()
> > > in out_of_memory() and a flag "whether get_page_from_freelist() without memory
> > > reserves was already attempted if current thread is not an OOM victim" to
> > > test_bit(MMF_OOM_SKIP) in oom_evaluate_task(), we won't be able to know
> > > if somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> >
> > Michal, I did not get your answer, and your "mm, oom: do not rely on
> > TIF_MEMDIE for memory reserves access" did not help solving this problem.
> > (I confirmed it by reverting your "mm, oom: allow oom reaper to race with
> > exit_mmap" and applying Andrea's "mm: oom: let oom_reap_task and exit_mmap
> > run concurrently" and this patch on top of linux-next-20170817.)
>
> By "this patch" you probably mean a BUG_ON(tsk_is_oom_victim) somewhere
> in task_will_free_mem right? I do not see anything like that in your
> email.

I wrote

  You can confirm it by adding "BUG_ON(1);" at "task->oom_kill_free_check_raced = 1;"
  of this patch.

in the patch description.
>
> > [ 204.413605] Out of memory: Kill process 9286 (a.out) score 930 or sacrifice child
> > [ 204.416241] Killed process 9286 (a.out) total-vm:4198476kB, anon-rss:72kB, file-rss:0kB, shmem-rss:3465520kB
> > [ 204.419783] oom_reaper: reaped process 9286 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:3465720kB
> > [ 204.455864] ------------[ cut here ]------------
> > [ 204.457921] kernel BUG at mm/oom_kill.c:786!
> >
> > Therefore, I propose this patch for inclusion.
>
> I've already told you that this is a wrong approach to handle a possible
> race and offered you an alternative. I really fail to see why you keep
> reposting it. So to make myself absolutely clear
>
> Nacked-by: Michal Hocko to the patch below.

Where is your alternative?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id 2E5D66B04DF for ; Mon, 21 Aug 2017 08:10:26 -0400 (EDT)
Received: by mail-wr0-f200.google.com with SMTP id f8so17328866wrf.2 for ; Mon, 21 Aug 2017 05:10:26 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id y94si9225424wrc.530.2017.08.21.05.10.24 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 21 Aug 2017 05:10:24 -0700 (PDT)
Date: Mon, 21 Aug 2017 14:10:22 +0200
From: Michal Hocko
Subject: Re: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-ID: <20170821121022.GF25956@dhcp22.suse.cz> References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp> <20170821084307.GB25956@dhcp22.suse.cz> <201708212041.GAJ05272.VOMOJOFSQLFtHF@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201708212041.GAJ05272.VOMOJOFSQLFtHF@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov@virtuozzo.com On Mon 21-08-17 20:41:52, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sat 19-08-17 15:23:19, Tetsuo Handa wrote: > > > Tetsuo Handa wrote at http://lkml.kernel.org/r/201708102328.ACD34352.OHFOLJMQVSFOFt@I-love.SAKURA.ne.jp : > > > > Michal Hocko wrote: > > > > > On Thu 10-08-17 21:10:30, Tetsuo Handa wrote: > > > > > > Michal Hocko wrote: > > > > > > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote: > > > > > > > > Michal Hocko wrote: > > > > > > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote: > > > > > > > > > > Michal Hocko wrote: > > > > > > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote: > > > > > > > > > > > > My question is, how can users know it if somebody was OOM-killed needlessly > > > > > > > > > > > > by allowing MMF_OOM_SKIP to race. > > > > > > > > > > > > > > > > > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP? > > > > > > > > > > > > > > > > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is > > > > > > > > > > a pain which is difficult to explain to and persuade some of customers. > > > > > > > > > > > > > > > > > > How is this any different from a race with a task exiting an releasing > > > > > > > > > some memory after we have crossed the point of no return and will kill > > > > > > > > > something? 
> > > > > > > >
> > > > > > > > I'm not complaining about an exiting task releasing some memory after we have
> > > > > > > > crossed the point of no return.
> > > > > > > >
> > > > > > > > What I'm saying is that we can postpone "the point of no return" if we ignore
> > > > > > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using oom_lock"
> > > > > > > > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for
> > > > > > > > once." thread). These are race conditions we can avoid without crystal ball.
> > > > > > >
> > > > > > > If those races are really that common than we can handle them even
> > > > > > > without "try once more" tricks. Really this is just an ugly hack. If you
> > > > > > > really care then make sure that we always try to allocate from memory
> > > > > > > reserves before going down the oom path. In other words, try to find a
> > > > > > > robust solution rather than tweaks around a problem.
> > > > > >
> > > > > > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> > > > > > oom_lock serialization from the OOM reaper, possibility of calling out_of_memory()
> > > > > > due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper set
> > > > > > MMF_OOM_SKIP quickly.
> > > > > >
> > > > > > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> > > > > > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before
> > > > > > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ?
> > > > > >
> > > > > > Excuse me? Are you suggesting to try memory reserves before
> > > > > > task_is_oom_victim(current) becomes true?
> > > > >
> > > > > No what I've tried to say is that if this really is a real problem,
> > > > > which I am not sure about, then the proper way to handle that is to
> > > > > attempt to allocate from memory reserves for an oom victim. I would be
> > > > > even willing to take the oom_lock back into the oom reaper path if the
> > > > > former turnes out to be awkward to implement. But all this assumes this
> > > > > is a _real_ problem.
> > > >
> > > > Aren't we back to square one? My question is, how can users know it if
> > > > somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > > >
> > > > You don't want to call get_page_from_freelist() from out_of_memory(), do you?
> > > > But without passing a flag "whether get_page_from_freelist() with memory reserves
> > > > was already attempted if current thread is an OOM victim" to task_will_free_mem()
> > > > in out_of_memory() and a flag "whether get_page_from_freelist() without memory
> > > > reserves was already attempted if current thread is not an OOM victim" to
> > > > test_bit(MMF_OOM_SKIP) in oom_evaluate_task(), we won't be able to know
> > > > if somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > >
> > > Michal, I did not get your answer, and your "mm, oom: do not rely on
> > > TIF_MEMDIE for memory reserves access" did not help solving this problem.
> > > (I confirmed it by reverting your "mm, oom: allow oom reaper to race with
> > > exit_mmap" and applying Andrea's "mm: oom: let oom_reap_task and exit_mmap
> > > run concurrently" and this patch on top of linux-next-20170817.)
> >
> > By "this patch" you probably mean a BUG_ON(tsk_is_oom_victim) somewhere
> > in task_will_free_mem right? I do not see anything like that in you
> > email.
>
> I wrote
>
> You can confirm it by adding "BUG_ON(1);" at "task->oom_kill_free_check_raced = 1;"
> of this patch.
>
> in the patch description.

Ahh, OK so it was in the changelog. Your wording suggested a debugging
patch which you forgot to add.
> >
> > > [  204.413605] Out of memory: Kill process 9286 (a.out) score 930 or sacrifice child
> > > [  204.416241] Killed process 9286 (a.out) total-vm:4198476kB, anon-rss:72kB, file-rss:0kB, shmem-rss:3465520kB
> > > [  204.419783] oom_reaper: reaped process 9286 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:3465720kB
> > > [  204.455864] ------------[ cut here ]------------
> > > [  204.457921] kernel BUG at mm/oom_kill.c:786!
> > >
> > > Therefore, I propose this patch for inclusion.
> >
> > i've already told you that this is a wrong approach to handle a possible
> > race and offered you an alternative. I realy fail to see why you keep
> > reposting it. So to make myself absolutely clear
> >
> > Nacked-by: Michal Hocko to the patch below.
>
> Where is your alternative?

Sigh... Let me repeat for the last time (this whole thread is largely a
waste of time to be honest). Find a _robust_ solution rather than
fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
_before_ we do any disruptive action (aka kill a victim). This would
help other cases when we race with an exiting tasks or somebody managed
to free memory while we were selecting an oom victim which can take
quite some time.
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70])
        by kanga.kvack.org (Postfix) with ESMTP id 78545280310
        for ; Mon, 21 Aug 2017 08:57:50 -0400 (EDT)
Received: by mail-pg0-f70.google.com with SMTP id q3so66324384pgr.3
        for ; Mon, 21 Aug 2017 05:57:50 -0700 (PDT)
Received: from www262.sakura.ne.jp (www262.sakura.ne.jp.
        [2001:e42:101:1:202:181:97:72])
        by mx.google.com with ESMTPS id q8si8197874plk.491.2017.08.21.05.57.48
        for (version=TLS1 cipher=AES128-SHA bits=128/128);
        Mon, 21 Aug 2017 05:57:48 -0700 (PDT)
Subject: Re: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
From: Tetsuo Handa
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
 <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp>
 <20170821084307.GB25956@dhcp22.suse.cz>
 <201708212041.GAJ05272.VOMOJOFSQLFtHF@I-love.SAKURA.ne.jp>
 <20170821121022.GF25956@dhcp22.suse.cz>
In-Reply-To: <20170821121022.GF25956@dhcp22.suse.cz>
Message-Id: <201708212157.DFB00801.tLMOFFSOOVQFJH@I-love.SAKURA.ne.jp>
Date: Mon, 21 Aug 2017 21:57:44 +0900
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
List-ID:
To: mhocko@suse.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com

Michal Hocko wrote:
> On Mon 21-08-17 20:41:52, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 19-08-17 15:23:19, Tetsuo Handa wrote:
> > > > Tetsuo Handa wrote at http://lkml.kernel.org/r/201708102328.ACD34352.OHFOLJMQVSFOFt@I-love.SAKURA.ne.jp :
> > > > > Michal Hocko wrote:
> > > > > > On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> > > > > > > Michal Hocko wrote:
> > > > > > > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > > > > > > > Michal Hocko wrote:
> > > > > > > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > > > > > > > Michal Hocko wrote:
> > > > > > > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > > > > > > > My question is, how can users know it if somebody was OOM-killed needlessly
> > > > > > > > > > > > > by allowing MMF_OOM_SKIP to race.
> > > > > > > > > > > >
> > > > > > > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > > > > > > > >
> > > > > > > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > > > > > > > > a pain which is difficult to explain to and persuade some of customers.
> > > > > > > > > >
> > > > > > > > > > How is this any different from a race with a task exiting an releasing
> > > > > > > > > > some memory after we have crossed the point of no return and will kill
> > > > > > > > > > something?
> > > > > > > > >
> > > > > > > > > I'm not complaining about an exiting task releasing some memory after we have
> > > > > > > > > crossed the point of no return.
> > > > > > > > >
> > > > > > > > > What I'm saying is that we can postpone "the point of no return" if we ignore
> > > > > > > > > MMF_OOM_SKIP for once (both this "oom_reaper: close race without using oom_lock"
> > > > > > > > > thread and "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for
> > > > > > > > > once." thread). These are race conditions we can avoid without crystal ball.
> > > > > > > >
> > > > > > > > If those races are really that common than we can handle them even
> > > > > > > > without "try once more" tricks. Really this is just an ugly hack. If you
> > > > > > > > really care then make sure that we always try to allocate from memory
> > > > > > > > reserves before going down the oom path. In other words, try to find a
> > > > > > > > robust solution rather than tweaks around a problem.
> > > > > > >
> > > > > > > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> > > > > > > oom_lock serialization from the OOM reaper, possibility of calling out_of_memory()
> > > > > > > due to successful mutex_trylock(&oom_lock) would increase when the OOM reaper set
> > > > > > > MMF_OOM_SKIP quickly.
> > > > > > >
> > > > > > > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> > > > > > > on current->mm between after __gfp_pfmemalloc_flags() returned 0 and before
> > > > > > > out_of_memory() is called (due to successful mutex_trylock(&oom_lock)) ?
> > > > > > >
> > > > > > > Excuse me? Are you suggesting to try memory reserves before
> > > > > > > task_is_oom_victim(current) becomes true?
> > > > > >
> > > > > > No what I've tried to say is that if this really is a real problem,
> > > > > > which I am not sure about, then the proper way to handle that is to
> > > > > > attempt to allocate from memory reserves for an oom victim. I would be
> > > > > > even willing to take the oom_lock back into the oom reaper path if the
> > > > > > former turnes out to be awkward to implement. But all this assumes this
> > > > > > is a _real_ problem.
> > > > >
> > > > > Aren't we back to square one? My question is, how can users know it if
> > > > > somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > > > >
> > > > > You don't want to call get_page_from_freelist() from out_of_memory(), do you?
> > > > > But without passing a flag "whether get_page_from_freelist() with memory reserves
> > > > > was already attempted if current thread is an OOM victim" to task_will_free_mem()
> > > > > in out_of_memory() and a flag "whether get_page_from_freelist() without memory
> > > > > reserves was already attempted if current thread is not an OOM victim" to
> > > > > test_bit(MMF_OOM_SKIP) in oom_evaluate_task(), we won't be able to know
> > > > > if somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > > >
> > > > Michal, I did not get your answer, and your "mm, oom: do not rely on
> > > > TIF_MEMDIE for memory reserves access" did not help solving this problem.
> > > > (I confirmed it by reverting your "mm, oom: allow oom reaper to race with
> > > > exit_mmap" and applying Andrea's "mm: oom: let oom_reap_task and exit_mmap
> > > > run concurrently" and this patch on top of linux-next-20170817.)
> > >
> > > By "this patch" you probably mean a BUG_ON(tsk_is_oom_victim) somewhere
> > > in task_will_free_mem right? I do not see anything like that in you
> > > email.
> >
> > I wrote
> >
> > You can confirm it by adding "BUG_ON(1);" at "task->oom_kill_free_check_raced = 1;"
> > of this patch.
> >
> > in the patch description.
>
> Ahh, OK so it was in the changelog. Your wording suggested a debugging
> patch which you forgot to add.
>
> > >
> > > > [  204.413605] Out of memory: Kill process 9286 (a.out) score 930 or sacrifice child
> > > > [  204.416241] Killed process 9286 (a.out) total-vm:4198476kB, anon-rss:72kB, file-rss:0kB, shmem-rss:3465520kB
> > > > [  204.419783] oom_reaper: reaped process 9286 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:3465720kB
> > > > [  204.455864] ------------[ cut here ]------------
> > > > [  204.457921] kernel BUG at mm/oom_kill.c:786!
> > > >
> > > > Therefore, I propose this patch for inclusion.
> > >
> > > i've already told you that this is a wrong approach to handle a possible
> > > race and offered you an alternative. I realy fail to see why you keep
> > > reposting it. So to make myself absolutely clear
> > >
> > > Nacked-by: Michal Hocko to the patch below.
> >
> > Where is your alternative?
>
> Sigh... Let me repeat for the last time (this whole thread is largely a
> waste of time to be honest). Find a _robust_ solution rather than
> fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
> _before_ we do any disruptive action (aka kill a victim). This would
> help other cases when we race with an exiting tasks or somebody managed
> to free memory while we were selecting an oom victim which can take
> quite some time.

I did not get your answer to my question:

You don't want to call get_page_from_freelist() from out_of_memory(), do you?

Since David Rientjes wrote "how sloppy this would be because it's blurring
the line between oom killer and page allocator." and you responded as
"Yes the layer violation is definitely not nice." at
http://lkml.kernel.org/r/20160129152307.GF32174@dhcp22.suse.cz ,
I assumed that you don't want to call get_page_from_freelist() from out_of_memory().

But now, you are suggesting to do an allocation attempt _before_ we do any
disruptive action. Did you change your mind to accept calling
get_page_from_freelist() from out_of_memory() ? If yes, I will try to write
such patch.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200])
        by kanga.kvack.org (Postfix) with ESMTP id 1B596280310
        for ; Mon, 21 Aug 2017 09:18:56 -0400 (EDT)
Received: by mail-wr0-f200.google.com with SMTP id z91so25042073wrc.4
        for ; Mon, 21 Aug 2017 06:18:56 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15])
        by mx.google.com with ESMTPS id m10si9637989wrb.254.2017.08.21.06.18.54
        for (version=TLS1 cipher=AES128-SHA bits=128/128);
        Mon, 21 Aug 2017 06:18:54 -0700 (PDT)
Date: Mon, 21 Aug 2017 15:18:52 +0200
From: Michal Hocko
Subject: Re: [PATCH v2] mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once.
Message-ID: <20170821131851.GJ25956@dhcp22.suse.cz>
References: <1501718104-8099-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
 <201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp>
 <20170821084307.GB25956@dhcp22.suse.cz>
 <201708212041.GAJ05272.VOMOJOFSQLFtHF@I-love.SAKURA.ne.jp>
 <20170821121022.GF25956@dhcp22.suse.cz>
 <201708212157.DFB00801.tLMOFFSOOVQFJH@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201708212157.DFB00801.tLMOFFSOOVQFJH@I-love.SAKURA.ne.jp>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tetsuo Handa
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, rientjes@google.com, mjaggi@caviumnetworks.com, oleg@redhat.com, vdavydov.dev@gmail.com

On Mon 21-08-17 21:57:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Sigh... Let me repeat for the last time (this whole thread is largely a
> > waste of time to be honest). Find a _robust_ solution rather than
> > fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
> > _before_ we do any disruptive action (aka kill a victim). This would
> > help other cases when we race with an exiting tasks or somebody managed
> > to free memory while we were selecting an oom victim which can take
> > quite some time.
>
> I did not get your answer to my question:
>
> You don't want to call get_page_from_freelist() from out_of_memory(), do you?
>
> Since David Rientjes wrote "how sloppy this would be because it's blurring
> the line between oom killer and page allocator." and you responded as
> "Yes the layer violation is definitely not nice." at
> http://lkml.kernel.org/r/20160129152307.GF32174@dhcp22.suse.cz ,
> I assumed that you don't want to call get_page_from_freelist() from out_of_memory().

Yes that would be a layering violation and I do not like that very much.
And that is why I keep repeating that this is something to handle only
_if_ the problem is real and happens with _sensible_ workloads so often
that we really have to care. If this happens only under oom stress
testing then I would be tempted to not care all that much.

Please try to understand that OOM killer will never be perfect and
adding more kludges and hacks make it more fragile so each additional
heuristic should be considered carefully.
-- 
Michal Hocko
SUSE Labs