From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f199.google.com (mail-ob0-f199.google.com [209.85.214.199]) by kanga.kvack.org (Postfix) with ESMTP id E7F456B0253 for ; Thu, 7 Jul 2016 11:58:55 -0400 (EDT) Received: by mail-ob0-f199.google.com with SMTP id gv4so40690196obc.3 for ; Thu, 07 Jul 2016 08:58:55 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id j7si2476438oia.289.2016.07.07.08.58.54 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 08:58:54 -0700 (PDT) Subject: [PATCH v2 0/6] Change OOM killer to use list of mm_struct. From: Tetsuo Handa Message-Id: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 00:58:41 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org This series is an update of http://lkml.kernel.org/r/201607031135.AAH95347.MVOHQtFJFLOOFS@I-love.SAKURA.ne.jp . This series is based on top of linux-next-20160707 + http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org . include/linux/mm_types.h | 7 + include/linux/oom.h | 14 -- include/linux/sched.h | 2 kernel/exit.c | 2 kernel/fork.c | 4 mm/memcontrol.c | 14 -- mm/oom_kill.c | 297 ++++++++++++++++++----------------------------- 7 files changed, 140 insertions(+), 200 deletions(-) [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage. [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice. [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims. [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread. This series does not include patches for use_mm() users and wait_event() in oom_killer_disable(). We can apply http://lkml.kernel.org/r/1467365190-24640-3-git-send-email-mhocko@kernel.org on top of this series. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id C03AE6B025E for ; Thu, 7 Jul 2016 12:00:26 -0400 (EDT) Received: by mail-pa0-f70.google.com with SMTP id cx13so40155219pac.2 for ; Thu, 07 Jul 2016 09:00:26 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id xg7si4801264pab.222.2016.07.07.09.00.24 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:00:25 -0700 (PDT) Subject: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080100.BFB78123.OJFtLVHFFMOSQO@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:00:13 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From 70de3fe92435095b6ecbb400c61e84a99f639d56 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:28:12 +0900 Subject: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage. Since holding mm_struct with elevated mm_count for a second is harmless, we can determine mm_struct and hold it upon entry of oom_reap_task(). This patch has no functional change. Future patch in this series will eliminate find_lock_task_mm() usage from the OOM reaper. Signed-off-by: Tetsuo Handa --- mm/oom_kill.c | 79 ++++++++++++++++++++++++++++------------------------------- 1 file changed, 37 insertions(+), 42 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7d0a275..951eb1b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -452,12 +452,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); -static bool __oom_reap_task(struct task_struct *tsk) +static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) { struct mmu_gather tlb; struct vm_area_struct *vma; - struct mm_struct *mm = NULL; - struct task_struct *p; struct zap_details details = {.check_swap_entries = true, .ignore_dirty = true}; bool ret = true; @@ -478,22 +476,9 @@ static bool __oom_reap_task(struct task_struct *tsk) */ mutex_lock(&oom_lock); - /* - * Make sure we find the associated mm_struct even when the particular - * thread has already terminated and cleared its mm. - * We might have race with exit path so consider our work done if there - * is no mm. - */ - p = find_lock_task_mm(tsk); - if (!p) - goto unlock_oom; - mm = p->mm; - atomic_inc(&mm->mm_count); - task_unlock(p); - if (!down_read_trylock(&mm->mmap_sem)) { ret = false; - goto mm_drop; + goto unlock_oom; } /* @@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk) */ if (!mmget_not_zero(mm)) { up_read(&mm->mmap_sem); - goto mm_drop; + goto unlock_oom; } tlb_gather_mmu(&tlb, mm, 0, -1); @@ -551,8 +536,6 @@ static bool __oom_reap_task(struct task_struct *tsk) * put the oom_reaper out of the way. */ mmput_async(mm); -mm_drop: - mmdrop(mm); unlock_oom: mutex_unlock(&oom_lock); return ret; @@ -562,36 +545,45 @@ unlock_oom: static void oom_reap_task(struct task_struct *tsk) { int attempts = 0; + struct mm_struct *mm = NULL; + struct task_struct *p = find_lock_task_mm(tsk); + + /* + * Make sure we find the associated mm_struct even when the particular + * thread has already terminated and cleared its mm. + * We might have race with exit path so consider our work done if there + * is no mm. + */ + if (!p) + goto done; + mm = p->mm; + atomic_inc(&mm->mm_count); + task_unlock(p); /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk)) + while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) schedule_timeout_idle(HZ/10); - if (attempts > MAX_OOM_REAP_RETRIES) { - struct task_struct *p; + if (attempts <= MAX_OOM_REAP_RETRIES) + goto done; - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); - /* - * If we've already tried to reap this task in the past and - * failed it probably doesn't make much sense to try yet again - * so hide the mm from the oom killer so that it can move on - * to another task with a different mm struct. - */ - p = find_lock_task_mm(tsk); - if (p) { - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) { - pr_info("oom_reaper: giving up pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - set_bit(MMF_OOM_REAPED, &p->mm->flags); - } - task_unlock(p); - } - - debug_show_all_locks(); + /* + * If we've already tried to reap this task in the past and + * failed it probably doesn't make much sense to try yet again + * so hide the mm from the oom killer so that it can move on + * to another task with a different mm struct. + */ + if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) { + pr_info("oom_reaper: giving up pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); + set_bit(MMF_OOM_REAPED, &mm->flags); } + debug_show_all_locks(); +done: /* * Clear TIF_MEMDIE because the task shouldn't be sitting on a * reasonably reclaimable memory anymore or it is not a good candidate @@ -603,6 +595,9 @@ static void oom_reap_task(struct task_struct *tsk) /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); + /* Drop a reference taken above. */ + if (mm) + mmdrop(mm); } static int oom_reaper(void *unused) -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id C649F6B0253 for ; Thu, 7 Jul 2016 12:01:46 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id j185so57545619ith.0 for ; Thu, 07 Jul 2016 09:01:46 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id l204si2623044oig.178.2016.07.07.09.01.45 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:01:46 -0700 (PDT) Subject: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080101.CCE57351.tQFOOVOFJSHMLF@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:01:36 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From 50b3a862b136c783be6ce25e0f22446f15a0ab03 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:28:50 +0900 Subject: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice. "mm, oom_reaper: do not attempt to reap a task twice" tried to give OOM reaper one more chance to retry. But since OOM killer or one of threads sharing that mm will queue that mm immediately, retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Signed-off-by: Tetsuo Handa --- include/linux/sched.h | 1 - mm/oom_kill.c | 15 +++------------ 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 553af29..c0efd80 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -523,7 +523,6 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_REAPED 21 /* mm has been already reaped */ -#define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 951eb1b..9f0022e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -567,20 +567,11 @@ static void oom_reap_task(struct task_struct *tsk) if (attempts <= MAX_OOM_REAP_RETRIES) goto done; + /* Ignore this mm because somebody can't call up_write(mmap_sem). */ + set_bit(MMF_OOM_REAPED, &mm->flags); + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); - - /* - * If we've already tried to reap this task in the past and - * failed it probably doesn't make much sense to try yet again - * so hide the mm from the oom killer so that it can move on - * to another task with a different mm struct. - */ - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) { - pr_info("oom_reaper: giving up pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - set_bit(MMF_OOM_REAPED, &mm->flags); - } debug_show_all_locks(); done: -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 1651C6B025E for ; Thu, 7 Jul 2016 12:03:23 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id d132so38225863oig.0 for ; Thu, 07 Jul 2016 09:03:23 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id y198si2379868oia.31.2016.07.07.09.03.21 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:03:22 -0700 (PDT) Subject: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080103.CDH12401.LFOHStQFOOFVJM@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:03:11 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From 5fbd16cffd5dc51f9ba8591fc18d315ff6ff9b96 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:33:13 +0900 Subject: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. Currently, we walk process list in order to find existing TIF_MEMDIE threads. But if we remember list of mm_struct used by TIF_MEMDIE threads, we can avoid walking process list. Next patch in this series allows OOM reaper to use list of mm_struct introduced by this patch. This patch reverts commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task") because oom_has_pending_mm() will prevent that race. Since CONFIG_MMU=y kernel has OOM reaper callback hook which can remove mm_struct from the list, let the OOM reaper call exit_oom_mm(mm). This patch temporarily fails to call exit_oom_mm(mm) when find_lock_task_mm() in oom_reap_task() failed. It will be fixed by next patch. But since CONFIG_MMU=n kernel does not have OOM reaper callback hook, call exit_oom_mm(mm) from __mmput(mm) if that mm is used by OOM victims. Signed-off-by: Tetsuo Handa --- include/linux/mm_types.h | 7 +++++ include/linux/oom.h | 3 ++ kernel/fork.c | 4 +++ mm/memcontrol.c | 5 ++++ mm/oom_kill.c | 72 +++++++++++++++++++++++++++++++----------------- 5 files changed, 66 insertions(+), 25 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e093e1d..7c1370a 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -392,6 +392,12 @@ struct mm_rss_stat { atomic_long_t count[NR_MM_COUNTERS]; }; +struct oom_mm { + struct list_head list; /* Linked to oom_mm_list list. */ + /* Thread which was passed to mark_oom_victim() for the last time. */ + struct task_struct *victim; +}; + struct kioctx_table; struct mm_struct { struct vm_area_struct *mmap; /* list of VMAs */ @@ -515,6 +521,7 @@ struct mm_struct { #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; #endif + struct oom_mm oom_mm; #ifdef CONFIG_MMU struct work_struct async_put_work; #endif diff --git a/include/linux/oom.h b/include/linux/oom.h index 5bc0457..bdcb331 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -91,6 +91,9 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p, extern void check_panic_on_oom(struct oom_control *oc, enum oom_constraint constraint); +extern void exit_oom_mm(struct mm_struct *mm); +extern bool oom_has_pending_mm(struct mem_cgroup *memcg, + const nodemask_t *nodemask); extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, struct task_struct *task); diff --git a/kernel/fork.c b/kernel/fork.c index 7926993..8e469e0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); +#ifndef CONFIG_MMU + if (mm->oom_mm.victim) + exit_oom_mm(mm); +#endif mmdrop(mm); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 40dfca3..8f7a5b7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1241,6 +1241,11 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, } check_panic_on_oom(&oc, CONSTRAINT_MEMCG); + if (oom_has_pending_mm(memcg, NULL)) { + /* Set a dummy value to return "true". */ + chosen = (void *) 1; + goto unlock; + } totalpages = mem_cgroup_get_limit(memcg) ? : 1; for_each_mem_cgroup_tree(iter, memcg) { struct css_task_iter it; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 9f0022e..87e7ff3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -275,6 +275,28 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc, } #endif +static LIST_HEAD(oom_mm_list); + +void exit_oom_mm(struct mm_struct *mm) +{ + mutex_lock(&oom_lock); + list_del(&mm->oom_mm.list); + put_task_struct(mm->oom_mm.victim); + mm->oom_mm.victim = NULL; + mmdrop(mm); + mutex_unlock(&oom_lock); +} + +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + struct mm_struct *mm; + + list_for_each_entry(mm, &oom_mm_list, oom_mm.list) + if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask)) + return true; + return false; +} + enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, struct task_struct *task) { @@ -458,28 +480,9 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) struct vm_area_struct *vma; struct zap_details details = {.check_swap_entries = true, .ignore_dirty = true}; - bool ret = true; - /* - * We have to make sure to not race with the victim exit path - * and cause premature new oom victim selection: - * __oom_reap_task exit_mm - * mmget_not_zero - * mmput - * atomic_dec_and_test - * exit_oom_victim - * [...] - * out_of_memory - * select_bad_process - * # no TIF_MEMDIE task selects new victim - * unmap_page_range # frees some memory - */ - mutex_lock(&oom_lock); - - if (!down_read_trylock(&mm->mmap_sem)) { - ret = false; - goto unlock_oom; - } + if (!down_read_trylock(&mm->mmap_sem)) + return false; /* * increase mm_users only after we know we will reap something so @@ -488,7 +491,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) */ if (!mmget_not_zero(mm)) { up_read(&mm->mmap_sem); - goto unlock_oom; + return true; } tlb_gather_mmu(&tlb, mm, 0, -1); @@ -536,9 +539,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) * put the oom_reaper out of the way. */ mmput_async(mm); -unlock_oom: - mutex_unlock(&oom_lock); - return ret; + return true; } #define MAX_OOM_REAP_RETRIES 10 @@ -586,6 +587,9 @@ done: /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); + /* Drop references taken by mark_oom_victim() */ + if (mm) + exit_oom_mm(mm); /* Drop a reference taken above. */ if (mm) mmdrop(mm); @@ -653,6 +657,9 @@ subsys_initcall(oom_init) */ void mark_oom_victim(struct task_struct *tsk) { + struct mm_struct *mm = tsk->mm; + struct task_struct *old_tsk = mm->oom_mm.victim; + WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk) */ __thaw_task(tsk); atomic_inc(&oom_victims); + /* + * Since mark_oom_victim() is called from multiple threads, + * connect this mm to oom_mm_list only if not yet connected. + */ + get_task_struct(tsk); + mm->oom_mm.victim = tsk; + if (!old_tsk) { + atomic_inc(&mm->mm_count); + list_add_tail(&mm->oom_mm.list, &oom_mm_list); + } else { + put_task_struct(old_tsk); + } } /** @@ -1026,6 +1045,9 @@ bool out_of_memory(struct oom_control *oc) return true; } + if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask)) + return true; + p = select_bad_process(oc, &points, totalpages); /* Found nothing?!?! Either we hang forever, or we panic. */ if (!p && !is_sysrq_oom(oc)) { -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id A73CB6B0253 for ; Thu, 7 Jul 2016 12:04:58 -0400 (EDT) Received: by mail-pa0-f70.google.com with SMTP id ib6so33106962pad.0 for ; Thu, 07 Jul 2016 09:04:58 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id m87si4555834pfi.190.2016.07.07.09.04.57 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:04:57 -0700 (PDT) Subject: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:04:46 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From acc85fdd36452e39bace6aa73b3aaa41bbe776a5 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:39:36 +0900 Subject: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. Since OOM reaping is per mm_struct operation, it is natural to use list of mm_struct used by OOM victims. By using list of mm_struct, we can eliminate find_lock_task_mm() usage from the OOM reaper. Signed-off-by: Tetsuo Handa --- include/linux/oom.h | 8 ----- mm/memcontrol.c | 1 - mm/oom_kill.c | 100 +++++++++++++++------------------------------------- 3 files changed, 29 insertions(+), 80 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index bdcb331..cb3f041 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -72,14 +72,6 @@ static inline bool oom_task_origin(const struct task_struct *p) extern void mark_oom_victim(struct task_struct *tsk); -#ifdef CONFIG_MMU -extern void wake_oom_reaper(struct task_struct *tsk); -#else -static inline void wake_oom_reaper(struct task_struct *tsk) -{ -} -#endif - extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8f7a5b7..5043324 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1236,7 +1236,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, */ if (task_will_free_mem(current)) { mark_oom_victim(current); - wake_oom_reaper(current); goto unlock; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 87e7ff3..223e1fe 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -471,8 +471,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) */ static struct task_struct *oom_reaper_th; static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); -static struct task_struct *oom_reaper_list; -static DEFINE_SPINLOCK(oom_reaper_lock); static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) { @@ -543,30 +541,23 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) } #define MAX_OOM_REAP_RETRIES 10 -static void oom_reap_task(struct task_struct *tsk) +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) { int attempts = 0; - struct mm_struct *mm = NULL; - struct task_struct *p = find_lock_task_mm(tsk); /* - * Make sure we find the associated mm_struct even when the particular - * thread has already terminated and cleared its mm. - * We might have race with exit path so consider our work done if there - * is no mm. + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm + * pinned. */ - if (!p) - goto done; - mm = p->mm; - atomic_inc(&mm->mm_count); - task_unlock(p); + if (test_bit(MMF_OOM_REAPED, &mm->flags)) + return; /* Retry the down_read_trylock(mmap_sem) a few times */ while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES) - goto done; + return; /* Ignore this mm because somebody can't call up_write(mmap_sem). */ set_bit(MMF_OOM_REAPED, &mm->flags); @@ -574,25 +565,6 @@ static void oom_reap_task(struct task_struct *tsk) pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); debug_show_all_locks(); - -done: - /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore or it is not a good candidate - * for the oom victim right now because it cannot release its memory - * itself nor by the oom reaper. - */ - tsk->oom_reaper_list = NULL; - exit_oom_victim(tsk); - - /* Drop a reference taken by wake_oom_reaper */ - put_task_struct(tsk); - /* Drop references taken by mark_oom_victim() */ - if (mm) - exit_oom_mm(mm); - /* Drop a reference taken above. */ - if (mm) - mmdrop(mm); } static int oom_reaper(void *unused) @@ -600,41 +572,31 @@ static int oom_reaper(void *unused) set_freezable(); while (true) { - struct task_struct *tsk = NULL; - - wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); - spin_lock(&oom_reaper_lock); - if (oom_reaper_list != NULL) { - tsk = oom_reaper_list; - oom_reaper_list = tsk->oom_reaper_list; - } - spin_unlock(&oom_reaper_lock); - - if (tsk) - oom_reap_task(tsk); + struct mm_struct *mm; + struct task_struct *victim; + + wait_event_freezable(oom_reaper_wait, + !list_empty(&oom_mm_list)); + mutex_lock(&oom_lock); + mm = list_first_entry(&oom_mm_list, struct mm_struct, + oom_mm.list); + victim = mm->oom_mm.victim; + /* + * Take a reference on current victim thread in case + * oom_reap_task() raced with mark_oom_victim() by + * other threads sharing this mm. + */ + get_task_struct(victim); + mutex_unlock(&oom_lock); + oom_reap_task(victim, mm); + put_task_struct(victim); + /* Drop references taken by mark_oom_victim() */ + exit_oom_mm(mm); } return 0; } -void wake_oom_reaper(struct task_struct *tsk) -{ - if (!oom_reaper_th) - return; - - /* tsk is already queued? */ - if (tsk == oom_reaper_list || tsk->oom_reaper_list) - return; - - get_task_struct(tsk); - - spin_lock(&oom_reaper_lock); - tsk->oom_reaper_list = oom_reaper_list; - oom_reaper_list = tsk; - spin_unlock(&oom_reaper_lock); - wake_up(&oom_reaper_wait); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -682,6 +644,9 @@ void mark_oom_victim(struct task_struct *tsk) if (!old_tsk) { atomic_inc(&mm->mm_count); list_add_tail(&mm->oom_mm.list, &oom_mm_list); +#ifdef CONFIG_MMU + wake_up(&oom_reaper_wait); +#endif } else { put_task_struct(old_tsk); } @@ -826,7 +791,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - bool can_oom_reap = true; /* * If the task is already exiting, don't alarm the sysadmin or kill @@ -835,7 +799,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, task_lock(p); if (task_will_free_mem(p)) { mark_oom_victim(p); - wake_oom_reaper(p); task_unlock(p); put_task_struct(p); return; @@ -925,7 +888,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, * memory might be still used. Hide the mm from the oom * killer to guarantee OOM forward progress. */ - can_oom_reap = false; set_bit(MMF_OOM_REAPED, &mm->flags); pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", task_pid_nr(victim), victim->comm, @@ -936,9 +898,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, } rcu_read_unlock(); - if (can_oom_reap) - wake_oom_reaper(victim); - mmdrop(mm); put_task_struct(victim); } @@ -1014,7 +973,6 @@ bool out_of_memory(struct oom_control *oc) */ if (task_will_free_mem(current)) { mark_oom_victim(current); - wake_oom_reaper(current); return true; } -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f197.google.com (mail-io0-f197.google.com [209.85.223.197]) by kanga.kvack.org (Postfix) with ESMTP id 524866B0260 for ; Thu, 7 Jul 2016 12:06:25 -0400 (EDT) Received: by mail-io0-f197.google.com with SMTP id l202so48632154ioe.1 for ; Thu, 07 Jul 2016 09:06:25 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id 185si38764oia.232.2016.07.07.09.06.24 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:06:24 -0700 (PDT) Subject: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080106.DCD82819.tJFHVLSOOQFMFO@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:06:12 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From 2da51204250a2f1127792f98c12436771c07b67d Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:41:38 +0900 Subject: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims. Since oom_has_pending_mm() controls whether to select next OOM victim, we no longer need to abort OOM victim selection loop using OOM_SCAN_ABORT case. Also, since signal_struct->oom_victims was used only for allowing oom_scan_process_thread() to return OOM_SCAN_ABORT, we no longer need to use signal_struct->oom_victims. Signed-off-by: Tetsuo Handa --- include/linux/oom.h | 1 - include/linux/sched.h | 1 - mm/memcontrol.c | 8 -------- mm/oom_kill.c | 26 +------------------------- 4 files changed, 1 insertion(+), 35 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index cb3f041..27be4ba 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -49,7 +49,6 @@ enum oom_constraint { enum oom_scan_t { OOM_SCAN_OK, /* scan thread and find its badness */ OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ - OOM_SCAN_ABORT, /* abort the iteration and return */ OOM_SCAN_SELECT, /* always select this thread first */ }; diff --git a/include/linux/sched.h b/include/linux/sched.h index c0efd80..2f57cf1c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -670,7 +670,6 @@ struct signal_struct { atomic_t sigcnt; atomic_t live; int nr_threads; - atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */ struct list_head thread_head; wait_queue_head_t wait_chldexit; /* for wait4() */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5043324..6afe1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1262,14 +1262,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - css_task_iter_end(&it); - mem_cgroup_iter_break(memcg, iter); - if (chosen) - put_task_struct(chosen); - /* Set a dummy value to return "true". */ - chosen = (void *) 1; - goto unlock; case OOM_SCAN_OK: break; }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 223e1fe..b6b79ae 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -304,25 +304,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, return OOM_SCAN_CONTINUE; /* - * This task already has access to memory reserves and is being killed. - * Don't allow any other task to have access to the reserves unless - * the task has MMF_OOM_REAPED because chances that it would release - * any memory is quite low. - */ - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { - struct task_struct *p = find_lock_task_mm(task); - enum oom_scan_t ret = OOM_SCAN_ABORT; - - if (p) { - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) - ret = OOM_SCAN_CONTINUE; - task_unlock(p); - } - - return ret; - } - - /* * If task is allocating a lot of memory and has been marked to be * killed first if it triggers an oom, then select it. */ @@ -354,9 +335,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc, /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - rcu_read_unlock(); - return (struct task_struct *)(-1UL); case OOM_SCAN_OK: break; }; @@ -626,7 +604,6 @@ void mark_oom_victim(struct task_struct *tsk) /* OOM killer might race with memcg OOM */ if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) return; - atomic_inc(&tsk->signal->oom_victims); /* * Make sure that the task is woken up from uninterruptible sleep * if it is frozen because OOM killer wouldn't be able to free @@ -659,7 +636,6 @@ void exit_oom_victim(struct task_struct *tsk) { if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) return; - atomic_dec(&tsk->signal->oom_victims); if (!atomic_dec_return(&oom_victims)) wake_up_all(&oom_victims_wait); @@ -1012,7 +988,7 @@ bool out_of_memory(struct oom_control *oc) dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (p && p != (void *)-1UL) { + if (p) { oom_kill_process(oc, p, points, totalpages, "Out of memory"); /* * Give the killed process a good chance to exit before trying -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f198.google.com (mail-io0-f198.google.com [209.85.223.198]) by kanga.kvack.org (Postfix) with ESMTP id 2388A6B0253 for ; Thu, 7 Jul 2016 12:08:01 -0400 (EDT) Received: by mail-io0-f198.google.com with SMTP id t74so50242515ioi.3 for ; Thu, 07 Jul 2016 09:08:01 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id p82si216135oih.291.2016.07.07.09.08.00 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 09:08:00 -0700 (PDT) Subject: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> In-Reply-To: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> Message-Id: <201607080107.ADG65168.QJLFFOFOHMOSVt@I-love.SAKURA.ne.jp> Date: Fri, 8 Jul 2016 01:07:47 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com, mhocko@suse.com, mhocko@kernel.org >>From fefcbf2412e38b006281cf1d20f9e3a2bb76714f Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 8 Jul 2016 00:42:48 +0900 Subject: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread. Since oom_has_pending_mm() controls whether to select next OOM victim, we are no longer clearing TIF_MEMDIE on remote thread. Signed-off-by: Tetsuo Handa --- include/linux/oom.h | 2 +- kernel/exit.c | 2 +- mm/oom_kill.c | 5 ++--- 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 27be4ba..90e98ea 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -90,7 +90,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, extern bool out_of_memory(struct oom_control *oc); -extern void exit_oom_victim(struct task_struct *tsk); +extern void exit_oom_victim(void); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/kernel/exit.c b/kernel/exit.c index 84ae830..1b1dada 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk) mm_update_next_owner(mm); mmput(mm); if (test_thread_flag(TIF_MEMDIE)) - exit_oom_victim(tsk); + exit_oom_victim(); } static struct task_struct *find_alive_thread(struct task_struct *p) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index b6b79ae..d120cb1 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -632,10 +632,9 @@ void mark_oom_victim(struct task_struct *tsk) /** * exit_oom_victim - note the exit of an OOM victim */ -void exit_oom_victim(struct task_struct *tsk) +void exit_oom_victim(void) { - if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) - return; + clear_thread_flag(TIF_MEMDIE); if (!atomic_dec_return(&oom_victims)) wake_up_all(&oom_victims_wait); -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id B1D4A6B0005 for ; Mon, 11 Jul 2016 08:02:36 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id o80so24001710wme.1 for ; Mon, 11 Jul 2016 05:02:36 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id j7si14752084wmd.131.2016.07.11.05.02.35 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 05:02:35 -0700 (PDT) Date: Mon, 11 Jul 2016 14:02:33 +0200 From: Michal Hocko Subject: Re: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage. Message-ID: <20160711120232.GD1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080100.BFB78123.OJFtLVHFFMOSQO@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080100.BFB78123.OJFtLVHFFMOSQO@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:00:13, Tetsuo Handa wrote: > >From 70de3fe92435095b6ecbb400c61e84a99f639d56 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:28:12 +0900 > Subject: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage. > > Since holding mm_struct with elevated mm_count for a second is harmless, > we can determine mm_struct and hold it upon entry of oom_reap_task(). > This patch has no functional change. Future patch in this series will > eliminate find_lock_task_mm() usage from the OOM reaper. the changelog is quite poor to be honest. It doesn't explain why this is really needed. What do you think about the following: " __oom_reap_task can be simplified a bit if it received a valid mm from oom_reap_task which might need it as well. We could drop one find_lock_task_mm call and also make the __oom_reap_task code flow easier to follow. Moreover this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmfull because this will not pin much memory. This patch doesn't introduce any functional change. " > > Signed-off-by: Tetsuo Handa Other than that the patch looks good to me. Acked-by: Michal Hocko > --- > mm/oom_kill.c | 79 ++++++++++++++++++++++++++++------------------------------- > 1 file changed, 37 insertions(+), 42 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 7d0a275..951eb1b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -452,12 +452,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); > static struct task_struct *oom_reaper_list; > static DEFINE_SPINLOCK(oom_reaper_lock); > > -static bool __oom_reap_task(struct task_struct *tsk) > +static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > { > struct mmu_gather tlb; > struct vm_area_struct *vma; > - struct mm_struct *mm = NULL; > - struct task_struct *p; > struct zap_details details = {.check_swap_entries = true, > .ignore_dirty = true}; > bool ret = true; > @@ -478,22 +476,9 @@ static bool __oom_reap_task(struct task_struct *tsk) > */ > mutex_lock(&oom_lock); > > - /* > - * Make sure we find the associated mm_struct even when the particular > - * thread has already terminated and cleared its mm. > - * We might have race with exit path so consider our work done if there > - * is no mm. > - */ > - p = find_lock_task_mm(tsk); > - if (!p) > - goto unlock_oom; > - mm = p->mm; > - atomic_inc(&mm->mm_count); > - task_unlock(p); > - > if (!down_read_trylock(&mm->mmap_sem)) { > ret = false; > - goto mm_drop; > + goto unlock_oom; > } > > /* > @@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk) > */ > if (!mmget_not_zero(mm)) { > up_read(&mm->mmap_sem); > - goto mm_drop; > + goto unlock_oom; > } > > tlb_gather_mmu(&tlb, mm, 0, -1); > @@ -551,8 +536,6 @@ static bool __oom_reap_task(struct task_struct *tsk) > * put the oom_reaper out of the way. > */ > mmput_async(mm); > -mm_drop: > - mmdrop(mm); > unlock_oom: > mutex_unlock(&oom_lock); > return ret; > @@ -562,36 +545,45 @@ unlock_oom: > static void oom_reap_task(struct task_struct *tsk) > { > int attempts = 0; > + struct mm_struct *mm = NULL; > + struct task_struct *p = find_lock_task_mm(tsk); > + > + /* > + * Make sure we find the associated mm_struct even when the particular > + * thread has already terminated and cleared its mm. > + * We might have race with exit path so consider our work done if there > + * is no mm. > + */ > + if (!p) > + goto done; > + mm = p->mm; > + atomic_inc(&mm->mm_count); > + task_unlock(p); > > /* Retry the down_read_trylock(mmap_sem) a few times */ > - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk)) > + while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > schedule_timeout_idle(HZ/10); > > - if (attempts > MAX_OOM_REAP_RETRIES) { > - struct task_struct *p; > + if (attempts <= MAX_OOM_REAP_RETRIES) > + goto done; > > - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > - task_pid_nr(tsk), tsk->comm); > + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > + task_pid_nr(tsk), tsk->comm); > > - /* > - * If we've already tried to reap this task in the past and > - * failed it probably doesn't make much sense to try yet again > - * so hide the mm from the oom killer so that it can move on > - * to another task with a different mm struct. > - */ > - p = find_lock_task_mm(tsk); > - if (p) { > - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) { > - pr_info("oom_reaper: giving up pid:%d (%s)\n", > - task_pid_nr(tsk), tsk->comm); > - set_bit(MMF_OOM_REAPED, &p->mm->flags); > - } > - task_unlock(p); > - } > - > - debug_show_all_locks(); > + /* > + * If we've already tried to reap this task in the past and > + * failed it probably doesn't make much sense to try yet again > + * so hide the mm from the oom killer so that it can move on > + * to another task with a different mm struct. > + */ > + if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) { > + pr_info("oom_reaper: giving up pid:%d (%s)\n", > + task_pid_nr(tsk), tsk->comm); > + set_bit(MMF_OOM_REAPED, &mm->flags); > } > + debug_show_all_locks(); > > +done: > /* > * Clear TIF_MEMDIE because the task shouldn't be sitting on a > * reasonably reclaimable memory anymore or it is not a good candidate > @@ -603,6 +595,9 @@ static void oom_reap_task(struct task_struct *tsk) > > /* Drop a reference taken by wake_oom_reaper */ > put_task_struct(tsk); > + /* Drop a reference taken above. */ > + if (mm) > + mmdrop(mm); > } > > static int oom_reaper(void *unused) > -- > 1.8.3.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 4E1156B0005 for ; Mon, 11 Jul 2016 08:15:53 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id r190so54465478wmr.0 for ; Mon, 11 Jul 2016 05:15:53 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id j73si8500917wmj.93.2016.07.11.05.15.52 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 05:15:52 -0700 (PDT) Date: Mon, 11 Jul 2016 14:15:51 +0200 From: Michal Hocko Subject: Re: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice. Message-ID: <20160711121550.GE1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080101.CCE57351.tQFOOVOFJSHMLF@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080101.CCE57351.tQFOOVOFJSHMLF@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:01:36, Tetsuo Handa wrote: > >From 50b3a862b136c783be6ce25e0f22446f15a0ab03 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:28:50 +0900 > Subject: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice. > > "mm, oom_reaper: do not attempt to reap a task twice" tried to give > OOM reaper one more chance to retry. But since OOM killer or one of > threads sharing that mm will queue that mm immediately, retrying > MMF_OOM_NOT_REAPABLE mm is unlikely helpful. I agree that the usefulness of the flag is rather limited and actually never shown in practice. Considering it was added as a just-in-case measure it is really hard to reason about. And I believe this should be stated in the changelog. > Let's always set MMF_OOM_REAPED. > > Signed-off-by: Tetsuo Handa That being said, feel free to add my Acked-by after the changelog is more clear about the motivation. > --- > include/linux/sched.h | 1 - > mm/oom_kill.c | 15 +++------------ > 2 files changed, 3 insertions(+), 13 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 553af29..c0efd80 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -523,7 +523,6 @@ static inline int get_dumpable(struct mm_struct *mm) > #define MMF_HAS_UPROBES 19 /* has uprobes */ > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > #define MMF_OOM_REAPED 21 /* mm has been already reaped */ > -#define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */ > > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 951eb1b..9f0022e 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -567,20 +567,11 @@ static void oom_reap_task(struct task_struct *tsk) > if (attempts <= MAX_OOM_REAP_RETRIES) > goto done; > > + /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > + set_bit(MMF_OOM_REAPED, &mm->flags); > + > pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > task_pid_nr(tsk), tsk->comm); > - > - /* > - * If we've already tried to reap this task in the past and > - * failed it probably doesn't make much sense to try yet again > - * so hide the mm from the oom killer so that it can move on > - * to another task with a different mm struct. > - */ > - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) { > - pr_info("oom_reaper: giving up pid:%d (%s)\n", > - task_pid_nr(tsk), tsk->comm); > - set_bit(MMF_OOM_REAPED, &mm->flags); > - } > debug_show_all_locks(); > > done: > -- > 1.8.3.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f72.google.com (mail-lf0-f72.google.com [209.85.215.72]) by kanga.kvack.org (Postfix) with ESMTP id 2AF1C6B0005 for ; Mon, 11 Jul 2016 08:50:55 -0400 (EDT) Received: by mail-lf0-f72.google.com with SMTP id a4so69259189lfa.1 for ; Mon, 11 Jul 2016 05:50:55 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 192si14960698wmf.136.2016.07.11.05.50.53 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 05:50:53 -0700 (PDT) Date: Mon, 11 Jul 2016 14:50:51 +0200 From: Michal Hocko Subject: Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. Message-ID: <20160711125051.GF1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080103.CDH12401.LFOHStQFOOFVJM@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080103.CDH12401.LFOHStQFOOFVJM@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:03:11, Tetsuo Handa wrote: > >From 5fbd16cffd5dc51f9ba8591fc18d315ff6ff9b96 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:33:13 +0900 > Subject: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. > > Currently, we walk process list in order to find existing TIF_MEMDIE > threads. But if we remember list of mm_struct used by TIF_MEMDIE threads, > we can avoid walking process list. Next patch in this series allows > OOM reaper to use list of mm_struct introduced by this patch. I believe the changelog doesn't tell the whole story and it is silent about many aspects which are not obvious at first sight. E.g. why it is any better to iterate over all mms rather than existing tasks? This is a slow path. Also it is quite possible to have thousands of mms linked on the list because of many memcgs hitting the oom. Sure, highly unlikely, but it would be better to note that this has been considered being acceptable with an explanation why. > This patch reverts commit e2fe14564d3316d1 ("oom_reaper: close race with > exiting task") because oom_has_pending_mm() will prevent that race. I guess a worth a separate patch. > Since CONFIG_MMU=y kernel has OOM reaper callback hook which can remove > mm_struct from the list, let the OOM reaper call exit_oom_mm(mm). This > patch temporarily fails to call exit_oom_mm(mm) when find_lock_task_mm() > in oom_reap_task() failed. It will be fixed by next patch. > > But since CONFIG_MMU=n kernel does not have OOM reaper callback hook, > call exit_oom_mm(mm) from __mmput(mm) if that mm is used by OOM victims. I guess referring to the MMU configuration is more confusing than helpful. The life time on the list is quite straightforward. mm is is unlinked by exit_oom_mm after the address space has been unmapped from __mmput or from the oom_reaper when available. This will guarantee that the mm doesn't block the next oom victim selection after the memory was reclaimed. > Signed-off-by: Tetsuo Handa > --- > include/linux/mm_types.h | 7 +++++ > include/linux/oom.h | 3 ++ > kernel/fork.c | 4 +++ > mm/memcontrol.c | 5 ++++ > mm/oom_kill.c | 72 +++++++++++++++++++++++++++++++----------------- > 5 files changed, 66 insertions(+), 25 deletions(-) > [...] > diff --git a/kernel/fork.c b/kernel/fork.c > index 7926993..8e469e0 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm) > } > if (mm->binfmt) > module_put(mm->binfmt->module); > +#ifndef CONFIG_MMU > + if (mm->oom_mm.victim) > + exit_oom_mm(mm); > +#endif This ifdef is not really needed. There is no reason we should wait for the oom_reaper to unlink the mm. > mmdrop(mm); > } > [...] > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 9f0022e..87e7ff3 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -275,6 +275,28 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc, > } > #endif > > +static LIST_HEAD(oom_mm_list); > + > +void exit_oom_mm(struct mm_struct *mm) > +{ > + mutex_lock(&oom_lock); > + list_del(&mm->oom_mm.list); > + put_task_struct(mm->oom_mm.victim); > + mm->oom_mm.victim = NULL; > + mmdrop(mm); > + mutex_unlock(&oom_lock); > +} > + > +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct mm_struct *mm; > + > + list_for_each_entry(mm, &oom_mm_list, oom_mm.list) > + if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask)) > + return true; The condition is quite hard to read. Moreover 2 of 4 conditions are never true. Wouldn't it be better to do something like the following? diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d120cb103507..df4b2b3ad7d0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -132,6 +132,20 @@ static inline bool is_sysrq_oom(struct oom_control *oc) return oc->order == -1; } +static bool task_in_oom_domain(struct task_struct *p, + struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + /* When mem_cgroup_out_of_memory() and p is not member of the group */ + if (memcg && !task_in_mem_cgroup(p, memcg)) + return false; + + /* p may not have freeable memory in nodemask */ + if (!has_intersects_mems_allowed(p, nodemask)) + return false; + + return true; +} + /* return true if the task is not adequate as candidate victim task. */ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask) @@ -141,12 +155,7 @@ static bool oom_unkillable_task(struct task_struct *p, if (p->flags & PF_KTHREAD) return true; - /* When mem_cgroup_out_of_memory() and p is not member of the group */ - if (memcg && !task_in_mem_cgroup(p, memcg)) - return true; - - /* p may not have freeable memory in nodemask */ - if (!has_intersects_mems_allowed(p, nodemask)) + if (!task_in_oom_domain(p, memcg, nodemask)) return true; return false; @@ -292,7 +301,7 @@ bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask) struct mm_struct *mm; list_for_each_entry(mm, &oom_mm_list, oom_mm.list) - if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask)) + if (task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask)) return true; return false; } [...] > @@ -653,6 +657,9 @@ subsys_initcall(oom_init) > */ > void mark_oom_victim(struct task_struct *tsk) > { > + struct mm_struct *mm = tsk->mm; > + struct task_struct *old_tsk = mm->oom_mm.victim; > + > WARN_ON(oom_killer_disabled); > /* OOM killer might race with memcg OOM */ > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk) > */ > __thaw_task(tsk); > atomic_inc(&oom_victims); > + /* > + * Since mark_oom_victim() is called from multiple threads, > + * connect this mm to oom_mm_list only if not yet connected. > + */ > + get_task_struct(tsk); > + mm->oom_mm.victim = tsk; > + if (!old_tsk) { > + atomic_inc(&mm->mm_count); > + list_add_tail(&mm->oom_mm.list, &oom_mm_list); > + } else { > + put_task_struct(old_tsk); > + } Isn't this overcomplicated? Why do we need to replace the old task by the current one? > } > > /** > @@ -1026,6 +1045,9 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask)) > + return true; > + > p = select_bad_process(oc, &points, totalpages); > /* Found nothing?!?! Either we hang forever, or we panic. */ > if (!p && !is_sysrq_oom(oc)) { > -- > 1.8.3.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 338626B0005 for ; Mon, 11 Jul 2016 09:16:21 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id o80so25764823wme.1 for ; Mon, 11 Jul 2016 06:16:21 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 203si910227wmh.93.2016.07.11.06.16.19 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 06:16:20 -0700 (PDT) Date: Mon, 11 Jul 2016 15:16:18 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. Message-ID: <20160711131618.GG1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:04:46, Tetsuo Handa wrote: > >From acc85fdd36452e39bace6aa73b3aaa41bbe776a5 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:39:36 +0900 > Subject: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. > > Since OOM reaping is per mm_struct operation, it is natural to use > list of mm_struct used by OOM victims. By using list of mm_struct, > we can eliminate find_lock_task_mm() usage from the OOM reaper. Again, the changelog doesn't state why getting rid of find_lock_task_mm is useful. You are also dropping exit_oom_victim from the oom reaper without any explanation in the changelog. > Signed-off-by: Tetsuo Handa I haven't spotted anything obviously wrong with the patch. Conceptually this sounds like a right approach. It is a bit sad we have to keep the oom victim around as well but I guess this is acceptable. That being said, I think that going mm queuing way is better than keeping mm alive in signal_struct long term. > --- > include/linux/oom.h | 8 ----- > mm/memcontrol.c | 1 - > mm/oom_kill.c | 100 +++++++++++++++------------------------------------- > 3 files changed, 29 insertions(+), 80 deletions(-) The code reduction is really nice! Few notes below [...] > #define MAX_OOM_REAP_RETRIES 10 > -static void oom_reap_task(struct task_struct *tsk) > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > { > int attempts = 0; > - struct mm_struct *mm = NULL; > - struct task_struct *p = find_lock_task_mm(tsk); > > /* > - * Make sure we find the associated mm_struct even when the particular > - * thread has already terminated and cleared its mm. > - * We might have race with exit path so consider our work done if there > - * is no mm. > + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm > + * pinned. > */ > - if (!p) > - goto done; > - mm = p->mm; > - atomic_inc(&mm->mm_count); > - task_unlock(p); > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > + return; > > /* Retry the down_read_trylock(mmap_sem) a few times */ > while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > schedule_timeout_idle(HZ/10); > > if (attempts <= MAX_OOM_REAP_RETRIES) > - goto done; > + return; > > /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > set_bit(MMF_OOM_REAPED, &mm->flags); This seems unnecessary when oom_reaper always calls exit_oom_mm. The same applies to __oom_reap_task. Which then means that the flag is turning into a misnomer. MMF_SKIP_OOM would fit better its current meaning. [...] > @@ -600,41 +572,31 @@ static int oom_reaper(void *unused) > set_freezable(); > > while (true) { > - struct task_struct *tsk = NULL; > - > - wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); > - spin_lock(&oom_reaper_lock); > - if (oom_reaper_list != NULL) { > - tsk = oom_reaper_list; > - oom_reaper_list = tsk->oom_reaper_list; > - } > - spin_unlock(&oom_reaper_lock); > - > - if (tsk) > - oom_reap_task(tsk); > + struct mm_struct *mm; > + struct task_struct *victim; > + > + wait_event_freezable(oom_reaper_wait, > + !list_empty(&oom_mm_list)); > + mutex_lock(&oom_lock); > + mm = list_first_entry(&oom_mm_list, struct mm_struct, > + oom_mm.list); > + victim = mm->oom_mm.victim; > + /* > + * Take a reference on current victim thread in case > + * oom_reap_task() raced with mark_oom_victim() by > + * other threads sharing this mm. > + */ > + get_task_struct(victim); If you didn't play old_task games in mark_oom_victim then you wouldn't need this AFAIU. > + mutex_unlock(&oom_lock); > + oom_reap_task(victim, mm); > + put_task_struct(victim); > + /* Drop references taken by mark_oom_victim() */ > + exit_oom_mm(mm); [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f69.google.com (mail-lf0-f69.google.com [209.85.215.69]) by kanga.kvack.org (Postfix) with ESMTP id 7D4406B025E for ; Mon, 11 Jul 2016 09:19:03 -0400 (EDT) Received: by mail-lf0-f69.google.com with SMTP id p41so16693198lfi.0 for ; Mon, 11 Jul 2016 06:19:03 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id f197si10313454wmf.73.2016.07.11.06.19.02 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 06:19:02 -0700 (PDT) Date: Mon, 11 Jul 2016 15:19:01 +0200 From: Michal Hocko Subject: Re: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims. Message-ID: <20160711131901.GH1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080106.DCD82819.tJFHVLSOOQFMFO@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080106.DCD82819.tJFHVLSOOQFMFO@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:06:12, Tetsuo Handa wrote: > >From 2da51204250a2f1127792f98c12436771c07b67d Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:41:38 +0900 > Subject: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims. > > Since oom_has_pending_mm() controls whether to select next OOM victim, > we no longer need to abort OOM victim selection loop using OOM_SCAN_ABORT > case. Also, since signal_struct->oom_victims was used only for allowing > oom_scan_process_thread() to return OOM_SCAN_ABORT, we no longer need to > use signal_struct->oom_victims. > > Signed-off-by: Tetsuo Handa Yes this looks good! Acked-by: Michal Hocko > --- > include/linux/oom.h | 1 - > include/linux/sched.h | 1 - > mm/memcontrol.c | 8 -------- > mm/oom_kill.c | 26 +------------------------- > 4 files changed, 1 insertion(+), 35 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > index cb3f041..27be4ba 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -49,7 +49,6 @@ enum oom_constraint { > enum oom_scan_t { > OOM_SCAN_OK, /* scan thread and find its badness */ > OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ > - OOM_SCAN_ABORT, /* abort the iteration and return */ > OOM_SCAN_SELECT, /* always select this thread first */ > }; > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index c0efd80..2f57cf1c 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -670,7 +670,6 @@ struct signal_struct { > atomic_t sigcnt; > atomic_t live; > int nr_threads; > - atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */ > struct list_head thread_head; > > wait_queue_head_t wait_chldexit; /* for wait4() */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5043324..6afe1c5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1262,14 +1262,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > /* fall through */ > case OOM_SCAN_CONTINUE: > continue; > - case OOM_SCAN_ABORT: > - css_task_iter_end(&it); > - mem_cgroup_iter_break(memcg, iter); > - if (chosen) > - put_task_struct(chosen); > - /* Set a dummy value to return "true". */ > - chosen = (void *) 1; > - goto unlock; > case OOM_SCAN_OK: > break; > }; > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 223e1fe..b6b79ae 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -304,25 +304,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > return OOM_SCAN_CONTINUE; > > /* > - * This task already has access to memory reserves and is being killed. > - * Don't allow any other task to have access to the reserves unless > - * the task has MMF_OOM_REAPED because chances that it would release > - * any memory is quite low. > - */ > - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > - struct task_struct *p = find_lock_task_mm(task); > - enum oom_scan_t ret = OOM_SCAN_ABORT; > - > - if (p) { > - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > - ret = OOM_SCAN_CONTINUE; > - task_unlock(p); > - } > - > - return ret; > - } > - > - /* > * If task is allocating a lot of memory and has been marked to be > * killed first if it triggers an oom, then select it. > */ > @@ -354,9 +335,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc, > /* fall through */ > case OOM_SCAN_CONTINUE: > continue; > - case OOM_SCAN_ABORT: > - rcu_read_unlock(); > - return (struct task_struct *)(-1UL); > case OOM_SCAN_OK: > break; > }; > @@ -626,7 +604,6 @@ void mark_oom_victim(struct task_struct *tsk) > /* OOM killer might race with memcg OOM */ > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > return; > - atomic_inc(&tsk->signal->oom_victims); > /* > * Make sure that the task is woken up from uninterruptible sleep > * if it is frozen because OOM killer wouldn't be able to free > @@ -659,7 +636,6 @@ void exit_oom_victim(struct task_struct *tsk) > { > if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > return; > - atomic_dec(&tsk->signal->oom_victims); > > if (!atomic_dec_return(&oom_victims)) > wake_up_all(&oom_victims_wait); > @@ -1012,7 +988,7 @@ bool out_of_memory(struct oom_control *oc) > dump_header(oc, NULL); > panic("Out of memory and no killable processes...\n"); > } > - if (p && p != (void *)-1UL) { > + if (p) { > oom_kill_process(oc, p, points, totalpages, "Out of memory"); > /* > * Give the killed process a good chance to exit before trying > -- > 1.8.3.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f71.google.com (mail-lf0-f71.google.com [209.85.215.71]) by kanga.kvack.org (Postfix) with ESMTP id CD4DF6B0005 for ; Mon, 11 Jul 2016 09:22:51 -0400 (EDT) Received: by mail-lf0-f71.google.com with SMTP id p41so16789412lfi.0 for ; Mon, 11 Jul 2016 06:22:51 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d17si3999310wmh.142.2016.07.11.06.22.50 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 06:22:50 -0700 (PDT) Date: Mon, 11 Jul 2016 15:22:49 +0200 From: Michal Hocko Subject: Re: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread. Message-ID: <20160711132248.GI1811@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080107.ADG65168.QJLFFOFOHMOSVt@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607080107.ADG65168.QJLFFOFOHMOSVt@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Fri 08-07-16 01:07:47, Tetsuo Handa wrote: > >From fefcbf2412e38b006281cf1d20f9e3a2bb76714f Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa > Date: Fri, 8 Jul 2016 00:42:48 +0900 > Subject: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread. > > Since oom_has_pending_mm() controls whether to select next OOM victim, > we are no longer clearing TIF_MEMDIE on remote thread. The changelog is quite cryptic. What about: " Since no kernel code path needs to clear TIF_MEMDIE flag on a remote thread we can drop the task parameter and enforce that actually. " > > Signed-off-by: Tetsuo Handa with the updated changelog Acked-by: Michal Hocko > --- > include/linux/oom.h | 2 +- > kernel/exit.c | 2 +- > mm/oom_kill.c | 5 ++--- > 3 files changed, 4 insertions(+), 5 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 27be4ba..90e98ea 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -90,7 +90,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > extern bool out_of_memory(struct oom_control *oc); > > -extern void exit_oom_victim(struct task_struct *tsk); > +extern void exit_oom_victim(void); > > extern int register_oom_notifier(struct notifier_block *nb); > extern int unregister_oom_notifier(struct notifier_block *nb); > diff --git a/kernel/exit.c b/kernel/exit.c > index 84ae830..1b1dada 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk) > mm_update_next_owner(mm); > mmput(mm); > if (test_thread_flag(TIF_MEMDIE)) > - exit_oom_victim(tsk); > + exit_oom_victim(); > } > > static struct task_struct *find_alive_thread(struct task_struct *p) > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index b6b79ae..d120cb1 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -632,10 +632,9 @@ void mark_oom_victim(struct task_struct *tsk) > /** > * exit_oom_victim - note the exit of an OOM victim > */ > -void exit_oom_victim(struct task_struct *tsk) > +void exit_oom_victim(void) > { > - if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > - return; > + clear_thread_flag(TIF_MEMDIE); > > if (!atomic_dec_return(&oom_victims)) > wake_up_all(&oom_victims_wait); > -- > 1.8.3.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id 6A8FA6B0253 for ; Tue, 12 Jul 2016 02:00:52 -0400 (EDT) Received: by mail-pa0-f70.google.com with SMTP id ib6so11884465pad.0 for ; Mon, 11 Jul 2016 23:00:52 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id fu3si2140543pad.147.2016.07.11.23.00.50 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 11 Jul 2016 23:00:51 -0700 (PDT) Subject: Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080103.CDH12401.LFOHStQFOOFVJM@I-love.SAKURA.ne.jp> <20160711125051.GF1811@dhcp22.suse.cz> In-Reply-To: <20160711125051.GF1811@dhcp22.suse.cz> Message-Id: <201607121500.AGE04699.FFQOFHVSOtOLMJ@I-love.SAKURA.ne.jp> Date: Tue, 12 Jul 2016 15:00:41 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@suse.cz Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com Michal Hocko wrote: > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 7926993..8e469e0 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm) > > } > > if (mm->binfmt) > > module_put(mm->binfmt->module); > > +#ifndef CONFIG_MMU > > + if (mm->oom_mm.victim) > > + exit_oom_mm(mm); > > +#endif > > This ifdef is not really needed. There is no reason we should wait for > the oom_reaper to unlink the mm. Oleg wanted to avoid adding OOM related hooks if possible ( http://lkml.kernel.org/r/20160705205231.GA25340@redhat.com ), but I thought that calling exit_oom_mm() from here is better for CONFIG_MMU=n case ( http://lkml.kernel.org/r/201607062043.FEC86485.JFFVLtFOQOSHMO@I-love.SAKURA.ne.jp ). I think that not calling exit_oom_mm() from here is better for CONFIG_MMU=y case. Calling exit_oom_mm() from here will require !list_empty() check after holding oom_lock at oom_reaper(). Instead, we can do +#ifdef CONFIG_MMU + if (mm->oom_mm.victim) + set_bit(MMF_OOM_REAPED, &mm->flags); +#else + if (mm->oom_mm.victim) + exit_oom_mm(mm); +#endif here and let oom_has_pending_mm() check for MMF_OOM_REAPED. > > +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask) > > +{ > > + struct mm_struct *mm; > > + > > + list_for_each_entry(mm, &oom_mm_list, oom_mm.list) > > + if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask)) > > + return true; > > The condition is quite hard to read. Moreover 2 of 4 conditions are > never true. Wouldn't it be better to do something like the following? > No problem. > > @@ -653,6 +657,9 @@ subsys_initcall(oom_init) > > */ > > void mark_oom_victim(struct task_struct *tsk) > > { > > + struct mm_struct *mm = tsk->mm; > > + struct task_struct *old_tsk = mm->oom_mm.victim; > > + > > WARN_ON(oom_killer_disabled); > > /* OOM killer might race with memcg OOM */ > > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk) > > */ > > __thaw_task(tsk); > > atomic_inc(&oom_victims); > > + /* > > + * Since mark_oom_victim() is called from multiple threads, > > + * connect this mm to oom_mm_list only if not yet connected. > > + */ > > + get_task_struct(tsk); > > + mm->oom_mm.victim = tsk; > > + if (!old_tsk) { > > + atomic_inc(&mm->mm_count); > > + list_add_tail(&mm->oom_mm.list, &oom_mm_list); > > + } else { > > + put_task_struct(old_tsk); > > + } > > Isn't this overcomplicated? Why do we need to replace the old task by > the current one? I'm not sure whether task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in oom_has_pending_mm() will work as expected, especially when all threads in one thread group (which mm->oom_mm.victim belongs to) reached TASK_DEAD state. ( http://lkml.kernel.org/r/201607042150.CIB00512.FSOtMHLOOVFFQJ@I-love.SAKURA.ne.jp ) I guess that task_in_oom_domain() will return false, and that mm will be selected by another thread group (which mm->oom_mm.victim does not belongs to). Therefore, I think we need to replace the old task with the new task (at least when task_in_oom_domain() returned false) at mark_oom_victim(). If task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in oom_has_pending_mm() does not work as expected even if we replace the old task with the new task at mark_oom_victim(), I think we after all need to use something like struct task_struct { (...snipped...) + struct mm_struct *oom_mm; /* current->mm as of getting TIF_MEMDIE */ + struct task_struct *oom_mm_list; /* Connected to oom_mm_list global list. */ -#ifdef CONFIG_MMU - struct task_struct *oom_reaper_list; -#endif (...snipped...) }; or your signal_struct->oom_mm approach. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f70.google.com (mail-lf0-f70.google.com [209.85.215.70]) by kanga.kvack.org (Postfix) with ESMTP id 7E9686B0262 for ; Tue, 12 Jul 2016 03:09:07 -0400 (EDT) Received: by mail-lf0-f70.google.com with SMTP id 33so4090254lfw.1 for ; Tue, 12 Jul 2016 00:09:07 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 19si1652412wmv.99.2016.07.12.00.09.05 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 12 Jul 2016 00:09:05 -0700 (PDT) Date: Tue, 12 Jul 2016 09:09:04 +0200 From: Michal Hocko Subject: Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims. Message-ID: <20160712070903.GB14586@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080103.CDH12401.LFOHStQFOOFVJM@I-love.SAKURA.ne.jp> <20160711125051.GF1811@dhcp22.suse.cz> <201607121500.AGE04699.FFQOFHVSOtOLMJ@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607121500.AGE04699.FFQOFHVSOtOLMJ@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Tue 12-07-16 15:00:41, Tetsuo Handa wrote: > Michal Hocko wrote: > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > index 7926993..8e469e0 100644 > > > --- a/kernel/fork.c > > > +++ b/kernel/fork.c > > > @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm) > > > } > > > if (mm->binfmt) > > > module_put(mm->binfmt->module); > > > +#ifndef CONFIG_MMU > > > + if (mm->oom_mm.victim) > > > + exit_oom_mm(mm); > > > +#endif > > > > This ifdef is not really needed. There is no reason we should wait for > > the oom_reaper to unlink the mm. > > Oleg wanted to avoid adding OOM related hooks if possible > ( http://lkml.kernel.org/r/20160705205231.GA25340@redhat.com ), > but I thought that calling exit_oom_mm() from here is better for CONFIG_MMU=n case > ( http://lkml.kernel.org/r/201607062043.FEC86485.JFFVLtFOQOSHMO@I-love.SAKURA.ne.jp ). __mmput is a MM part of the exit path so I think sticking oom related things there is not harmful. Yes we do take a global lock (btw. the lock contention could be reduced if you preserve the existing spinlock and use it for enqueing. Besides that the ifdef is really ugly. > I think that not calling exit_oom_mm() from here is better for CONFIG_MMU=y case. > Calling exit_oom_mm() from here will require !list_empty() check after holding > oom_lock at oom_reaper(). Instead, we can do > > +#ifdef CONFIG_MMU > + if (mm->oom_mm.victim) > + set_bit(MMF_OOM_REAPED, &mm->flags); > +#else > + if (mm->oom_mm.victim) > + exit_oom_mm(mm); > +#endif > > here and let oom_has_pending_mm() check for MMF_OOM_REAPED. Yes that would be possible but why should we make this more complicated than necessary. It is natural that exit_oom_mm is called after the address space has been torn down. This would be the most common path. We have oom_reaper as a backup if this doesn't happen in time. It really doesn't make much sense to keep mm on the list artificially if the oom_reaper should just skip over it because it is already empty. So please let's make it as simple as possible. [...] > > > @@ -653,6 +657,9 @@ subsys_initcall(oom_init) > > > */ > > > void mark_oom_victim(struct task_struct *tsk) > > > { > > > + struct mm_struct *mm = tsk->mm; > > > + struct task_struct *old_tsk = mm->oom_mm.victim; > > > + > > > WARN_ON(oom_killer_disabled); > > > /* OOM killer might race with memcg OOM */ > > > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > > @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk) > > > */ > > > __thaw_task(tsk); > > > atomic_inc(&oom_victims); > > > + /* > > > + * Since mark_oom_victim() is called from multiple threads, > > > + * connect this mm to oom_mm_list only if not yet connected. > > > + */ > > > + get_task_struct(tsk); > > > + mm->oom_mm.victim = tsk; > > > + if (!old_tsk) { > > > + atomic_inc(&mm->mm_count); > > > + list_add_tail(&mm->oom_mm.list, &oom_mm_list); > > > + } else { > > > + put_task_struct(old_tsk); > > > + } > > > > Isn't this overcomplicated? Why do we need to replace the old task by > > the current one? > > I'm not sure whether task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in > oom_has_pending_mm() will work as expected, especially when all threads in > one thread group (which mm->oom_mm.victim belongs to) reached TASK_DEAD state. > ( http://lkml.kernel.org/r/201607042150.CIB00512.FSOtMHLOOVFFQJ@I-love.SAKURA.ne.jp ) > > I guess that task_in_oom_domain() will return false, and that mm will be selected > by another thread group (which mm->oom_mm.victim does not belongs to). Therefore, > I think we need to replace the old task with the new task (at least when > task_in_oom_domain() returned false) at mark_oom_victim(). Can we do that in a separate patch then. It would make this patch easier to review and we can discuss about this corner case without distracting from the main point of this patch series. > If task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in oom_has_pending_mm() > does not work as expected even if we replace the old task with the new task at > mark_oom_victim(), I think we after all need to use something like > > struct task_struct { > (...snipped...) > + struct mm_struct *oom_mm; /* current->mm as of getting TIF_MEMDIE */ > + struct task_struct *oom_mm_list; /* Connected to oom_mm_list global list. */ > -#ifdef CONFIG_MMU > - struct task_struct *oom_reaper_list; > -#endif > (...snipped...) > }; > > or your signal_struct->oom_mm approach. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id 900F46B0005 for ; Tue, 12 Jul 2016 09:38:51 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id u186so34775639ita.0 for ; Tue, 12 Jul 2016 06:38:51 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id j123si2185219ioe.157.2016.07.12.06.38.50 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 12 Jul 2016 06:38:50 -0700 (PDT) Subject: Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. From: Tetsuo Handa References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> <20160711131618.GG1811@dhcp22.suse.cz> In-Reply-To: <20160711131618.GG1811@dhcp22.suse.cz> Message-Id: <201607122238.HJI78681.QOHVSMFtFJOOFL@I-love.SAKURA.ne.jp> Date: Tue, 12 Jul 2016 22:38:42 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@suse.cz Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com Michal Hocko wrote: > > #define MAX_OOM_REAP_RETRIES 10 > > -static void oom_reap_task(struct task_struct *tsk) > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > > { > > int attempts = 0; > > - struct mm_struct *mm = NULL; > > - struct task_struct *p = find_lock_task_mm(tsk); > > > > /* > > - * Make sure we find the associated mm_struct even when the particular > > - * thread has already terminated and cleared its mm. > > - * We might have race with exit path so consider our work done if there > > - * is no mm. > > + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm > > + * pinned. > > */ > > - if (!p) > > - goto done; > > - mm = p->mm; > > - atomic_inc(&mm->mm_count); > > - task_unlock(p); > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > + return; > > > > /* Retry the down_read_trylock(mmap_sem) a few times */ > > while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > > schedule_timeout_idle(HZ/10); > > > > if (attempts <= MAX_OOM_REAP_RETRIES) > > - goto done; > > + return; > > > > /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > > set_bit(MMF_OOM_REAPED, &mm->flags); > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The > same applies to __oom_reap_task. Which then means that the flag is > turning into a misnomer. MMF_SKIP_OOM would fit better its current > meaning. Large oom_score_adj value or being a child process of highest OOM score might cause the same mm being selected again. I think these set_bit() are necessary in order to avoid the same mm being selected again. Other than that, I sent v3 patchset. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f71.google.com (mail-lf0-f71.google.com [209.85.215.71]) by kanga.kvack.org (Postfix) with ESMTP id A71B76B0005 for ; Tue, 12 Jul 2016 09:47:00 -0400 (EDT) Received: by mail-lf0-f71.google.com with SMTP id l89so11905527lfi.3 for ; Tue, 12 Jul 2016 06:47:00 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 202si3391476wmt.105.2016.07.12.06.46.59 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 12 Jul 2016 06:46:59 -0700 (PDT) Date: Tue, 12 Jul 2016 15:46:57 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. Message-ID: <20160712134657.GJ14586@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> <20160711131618.GG1811@dhcp22.suse.cz> <201607122238.HJI78681.QOHVSMFtFJOOFL@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607122238.HJI78681.QOHVSMFtFJOOFL@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Tue 12-07-16 22:38:42, Tetsuo Handa wrote: > Michal Hocko wrote: > > > #define MAX_OOM_REAP_RETRIES 10 > > > -static void oom_reap_task(struct task_struct *tsk) > > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > > > { > > > int attempts = 0; > > > - struct mm_struct *mm = NULL; > > > - struct task_struct *p = find_lock_task_mm(tsk); > > > > > > /* > > > - * Make sure we find the associated mm_struct even when the particular > > > - * thread has already terminated and cleared its mm. > > > - * We might have race with exit path so consider our work done if there > > > - * is no mm. > > > + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm > > > + * pinned. > > > */ > > > - if (!p) > > > - goto done; > > > - mm = p->mm; > > > - atomic_inc(&mm->mm_count); > > > - task_unlock(p); > > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > > + return; > > > > > > /* Retry the down_read_trylock(mmap_sem) a few times */ > > > while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > > > schedule_timeout_idle(HZ/10); > > > > > > if (attempts <= MAX_OOM_REAP_RETRIES) > > > - goto done; > > > + return; > > > > > > /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > > > set_bit(MMF_OOM_REAPED, &mm->flags); > > > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The > > same applies to __oom_reap_task. Which then means that the flag is > > turning into a misnomer. MMF_SKIP_OOM would fit better its current > > meaning. > > Large oom_score_adj value or being a child process of highest OOM score > might cause the same mm being selected again. I think these set_bit() are > necessary in order to avoid the same mm being selected again. I do not understand. Child will have a different mm struct from the parent and I do not see how oom_score_adj is relevant here. Could you elaborate, please? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id A83AB6B025F for ; Tue, 12 Jul 2016 09:55:08 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id o80so14141602wme.1 for ; Tue, 12 Jul 2016 06:55:08 -0700 (PDT) Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id t138si3499646wmd.116.2016.07.12.06.55.07 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 12 Jul 2016 06:55:07 -0700 (PDT) Date: Tue, 12 Jul 2016 15:55:06 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. Message-ID: <20160712135506.GK14586@dhcp22.suse.cz> References: <201607080058.BFI87504.JtFOOFQFVHSLOM@I-love.SAKURA.ne.jp> <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> <20160711131618.GG1811@dhcp22.suse.cz> <201607122238.HJI78681.QOHVSMFtFJOOFL@I-love.SAKURA.ne.jp> <20160712134657.GJ14586@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160712134657.GJ14586@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com On Tue 12-07-16 15:46:57, Michal Hocko wrote: > On Tue 12-07-16 22:38:42, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > > #define MAX_OOM_REAP_RETRIES 10 > > > > -static void oom_reap_task(struct task_struct *tsk) > > > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > > > > { > > > > int attempts = 0; > > > > - struct mm_struct *mm = NULL; > > > > - struct task_struct *p = find_lock_task_mm(tsk); > > > > > > > > /* > > > > - * Make sure we find the associated mm_struct even when the particular > > > > - * thread has already terminated and cleared its mm. > > > > - * We might have race with exit path so consider our work done if there > > > > - * is no mm. > > > > + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm > > > > + * pinned. > > > > */ > > > > - if (!p) > > > > - goto done; > > > > - mm = p->mm; > > > > - atomic_inc(&mm->mm_count); > > > > - task_unlock(p); > > > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > > > + return; > > > > > > > > /* Retry the down_read_trylock(mmap_sem) a few times */ > > > > while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > > > > schedule_timeout_idle(HZ/10); > > > > > > > > if (attempts <= MAX_OOM_REAP_RETRIES) > > > > - goto done; > > > > + return; > > > > > > > > /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > > > > set_bit(MMF_OOM_REAPED, &mm->flags); > > > > > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The > > > same applies to __oom_reap_task. Which then means that the flag is > > > turning into a misnomer. MMF_SKIP_OOM would fit better its current > > > meaning. > > > > Large oom_score_adj value or being a child process of highest OOM score > > might cause the same mm being selected again. I think these set_bit() are > > necessary in order to avoid the same mm being selected again. > > I do not understand. Child will have a different mm struct from the > parent and I do not see how oom_score_adj is relevant here. Could you > elaborate, please? OK, I guess I got your point. You mean we can select the same child/task again after it has passed its exit_oom_mm. Trying to oom_reap such a task would be obviously pointless. Then it would be better to stich that set_bit into exit_oom_mm. Renaming it would be also better in that context. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f199.google.com (mail-io0-f199.google.com [209.85.223.199]) by kanga.kvack.org (Postfix) with ESMTP id ECE616B0005 for ; Tue, 12 Jul 2016 10:01:40 -0400 (EDT) Received: by mail-io0-f199.google.com with SMTP id u25so36730624ioi.1 for ; Tue, 12 Jul 2016 07:01:40 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id o37si1594729otc.168.2016.07.12.07.01.39 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 12 Jul 2016 07:01:40 -0700 (PDT) Subject: Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct. From: Tetsuo Handa References: <201607080104.JDA41505.OtOFMSLOQVJFHF@I-love.SAKURA.ne.jp> <20160711131618.GG1811@dhcp22.suse.cz> <201607122238.HJI78681.QOHVSMFtFJOOFL@I-love.SAKURA.ne.jp> <20160712134657.GJ14586@dhcp22.suse.cz> <20160712135506.GK14586@dhcp22.suse.cz> In-Reply-To: <20160712135506.GK14586@dhcp22.suse.cz> Message-Id: <201607122301.FFD43735.SOVOFFMHLQFJtO@I-love.SAKURA.ne.jp> Date: Tue, 12 Jul 2016 23:01:27 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@suse.cz Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mst@redhat.com Michal Hocko wrote: > On Tue 12-07-16 15:46:57, Michal Hocko wrote: > > On Tue 12-07-16 22:38:42, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > > #define MAX_OOM_REAP_RETRIES 10 > > > > > -static void oom_reap_task(struct task_struct *tsk) > > > > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm) > > > > > { > > > > > int attempts = 0; > > > > > - struct mm_struct *mm = NULL; > > > > > - struct task_struct *p = find_lock_task_mm(tsk); > > > > > > > > > > /* > > > > > - * Make sure we find the associated mm_struct even when the particular > > > > > - * thread has already terminated and cleared its mm. > > > > > - * We might have race with exit path so consider our work done if there > > > > > - * is no mm. > > > > > + * Check MMF_OOM_REAPED in case oom_kill_process() found this mm > > > > > + * pinned. > > > > > */ > > > > > - if (!p) > > > > > - goto done; > > > > > - mm = p->mm; > > > > > - atomic_inc(&mm->mm_count); > > > > > - task_unlock(p); > > > > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > > > > + return; > > > > > > > > > > /* Retry the down_read_trylock(mmap_sem) a few times */ > > > > > while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm)) > > > > > schedule_timeout_idle(HZ/10); > > > > > > > > > > if (attempts <= MAX_OOM_REAP_RETRIES) > > > > > - goto done; > > > > > + return; > > > > > > > > > > /* Ignore this mm because somebody can't call up_write(mmap_sem). */ > > > > > set_bit(MMF_OOM_REAPED, &mm->flags); > > > > > > > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The > > > > same applies to __oom_reap_task. Which then means that the flag is > > > > turning into a misnomer. MMF_SKIP_OOM would fit better its current > > > > meaning. > > > > > > Large oom_score_adj value or being a child process of highest OOM score > > > might cause the same mm being selected again. I think these set_bit() are > > > necessary in order to avoid the same mm being selected again. > > > > I do not understand. Child will have a different mm struct from the > > parent and I do not see how oom_score_adj is relevant here. Could you > > elaborate, please? > > OK, I guess I got your point. You mean we can select the same child/task > again after it has passed its exit_oom_mm. Trying to oom_reap such a > task would be obviously pointless. Then it would be better to stich that > set_bit into exit_oom_mm. Renaming it would be also better in that > context. > Right. oom_kill_process() receives a task which oom_badness() returned highest score. But list_for_each_entry(child, &t->children, sibling) selects a child process if that task has any OOM killable child process. The OOM killer will kill that child process and the OOM reaper reaps memory from that child process. But if the OOM reaper does not set MMF_OOM_REAPED after reaping that child's memory, next round of the OOM killer will select that child process again. It is possible that the child process was consuming too little memory to solve the OOM situation. Even if that task does not have any OOM killable child process, it is possible that that task was consuming too little memory to solve the OOM situation. We can misguide the OOM killer to select such process by setting oom_score_adj to 1000. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org