Subject: Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko
Cc: David Rientjes, Roman Gushchin, linux-mm@kvack.org, Andrew Morton, LKML,
 Linus Torvalds
Date: Fri, 16 Nov 2018 19:06:06 +0900
Message-ID: <2d7d4cf6-bdd4-741d-764b-beb96d7af296@i-love.sakura.ne.jp>
In-Reply-To: <20181115113653.GO23831@dhcp22.suse.cz>
References: <20181025082403.3806-1-mhocko@kernel.org>
 <20181108093224.GS27423@dhcp22.suse.cz>
 <9dfd5c87-ae48-8ffb-fbc6-706d627658ff@i-love.sakura.ne.jp>
 <20181114101604.GM23419@dhcp22.suse.cz>
 <0648083a-3112-97ff-edd7-1444c1be529a@i-love.sakura.ne.jp>
 <20181115113653.GO23831@dhcp22.suse.cz>

On 2018/11/15 20:36, Michal Hocko wrote:
> On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> > On 2018/11/14 19:16, Michal Hocko wrote:
> > > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > > [...]
> > > > There is always an invisible lock called "scheduling priority". You can't
> > > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > > handling the worst case.
> > >
> > > And that problem is all over the memory reclaim. You can get starved
> > > to death and block other resources.
> > > And the memory reclaim is not the only one.
> >
> > I think that it is a manner for kernel developers that no thread keeps
> > consuming CPU resources forever. In the kernel world, doing
> >
> >   while (1);
> >
> > is not permitted. Likewise, doing
> >
> >   for (i = 0; i < very_large_value; i++)
> >     do_something_which_does_not_yield_CPU_to_others();
>
> There is nothing like that proposed in this series.
>
> > has to be avoided, in order to avoid lockup problems. We are required to
> > yield CPU to others when we are waiting for somebody else to make progress.
> > It is the page allocator who is refusing to yield CPU to those who need CPU.
>
> And we do that in the reclaim path.

No, we don't. Please explain why holding oom_lock versus not holding
oom_lock in the patch shown below makes a difference (i.e. why holding
oom_lock can avoid lockups while not holding oom_lock fails to avoid them).

----------
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3137,50 +3137,53 @@ void exit_mmap(struct mm_struct *mm)
 	/*
 	 * oom_reaper cannot race with the page tables teardown but we
 	 * want to make sure that the exit path can take over the full
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
 		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
+		mutex_lock(&oom_lock);
 		/*
 		 * the exit path is guaranteed to finish the memory tear down
 		 * without any unbound blocking at this stage so make it clear
 		 * to the oom_reaper
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
 		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
+	}
+
+	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	/*
-	 * Now that the full address space is torn down, make sure the
-	 * OOM killer skips over this task
-	 */
-	if (oom)
-		set_bit(MMF_OOM_SKIP, &mm->flags);
 }
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
  * then i_mmap_rwsem is taken here.
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
----------

> > Since the OOM reaper kernel thread "has normal priority" and "can run on any
> > CPU", the possibility of failing to run is lower than an OOM victim thread
> > which "has idle priority" and "can run on only limited CPUs". You are trying
> > to add a dependency on such a thread, and I'm saying that adding a dependency
> > on such a thread increases the possibility of a lockup.
>
> Sigh. No, this is not the case. All this patch series does is that we
> hand over to the exiting task once it doesn't block on any locks
> anymore. If the thread is low priority then it is quite likely that the
> oom reaper is done by the time the victim even reaches the exit path.

Not true. A thread executing exit_mmap() can change its priority at any
moment. Also, the OOM reaper kernel thread can fail to complete OOM reaping
before exit_mmap() does mm->mmap = NULL due to e.g. a
down_read_trylock(&mm->mmap_sem) failure, OOM reaping of other OOM victims
in different OOM domains, or preemption because of scheduling priority.
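(To make the down_read_trylock() part explicit: the reaper never sleeps
waiting for mmap_sem. A minimal sketch of that backoff pattern, with helper
names of my own rather than the exact upstream code, looks like this:

----------
/*
 * Sketch only: if the trylock fails, the reaper does not sleep on
 * mmap_sem; it backs off, retries a bounded number of times, and then
 * has to give up on this victim.
 */
static bool try_to_reap_one_mm(struct mm_struct *mm)
{
	if (!down_read_trylock(&mm->mmap_sem))
		return false;	/* somebody holds mmap_sem; back off */
	/* ... unmap the reapable ranges under the read lock ... */
	up_read(&mm->mmap_sem);
	return true;
}

	/* caller side: a few retries, then give up on this mm */
	while (attempts++ < MAX_RETRIES && !try_to_reap_one_mm(mm))
		schedule_timeout_idle(HZ / 10);
----------

Each failed attempt widens the window for exit_mmap() to reach
mm->mmap = NULL before any reaping has been done.)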
You will notice it by trying both

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and checking how frequently the "Handed over " messages are printed.

> > Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> > busy with realtime threads waiting for the OOM reaper kernel thread to make
> > progress. In that case, we had better stop relying on asynchronous memory
> > reclaim, and switch to direct OOM reaping by allocating threads.
> >
> > But what I demonstrated is that
> >
> >   /*
> >    * the exit path is guaranteed to finish the memory tear down
> >    * without any unbound blocking at this stage so make it clear
> >    * to the oom_reaper
> >    */
> >
> > becomes a lie even when only one CPU was busy with realtime threads waiting
> > for an idle thread to make progress. If the page allocator stops telling a
> > lie that "an OOM victim is making progress on behalf of me", we can avoid
> > the lockup.
>
> OK, I stopped reading right here. This discussion is pointless. Once you
> busy loop all CPUs you are screwed.

Look at the log files. I made only 1 CPU (out of 8 CPUs) busy.

> Are you going to blame a filesystem
> that no progress can be made if a code path holding an important lock
> is preempted by high priority stuff and no further progress can be
> made?

I don't blame that case, because what it is doing is not a kernel bug.

> This is just ridiculous. What you are arguing here is not fixable
> with the current upstream kernel.

I do blame the memory allocation case, because it is doing something which
is a kernel bug, one that can be avoided if we stop telling a lie. No
future kernel is fixable as long as we keep telling a lie.

> Even your so beloved timeout based
> solution doesn't cope with that because oom reaper can be preempted for
> unbound amount of time.

Yes, that's the reason I suggest direct OOM reaping.

> Your argument just doesn't make much sense in
> the context of the current kernel. Full stop.

I won't full stop at all. What I'm saying is "don't rely on a busy polling
loop". The reclaim path becomes a no-op for a thread executing
__free_pgtables() when a memcg OOM is waiting for that thread to complete
__free_pgtables() and set MMF_OOM_SKIP.
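To make "direct OOM reaping" concrete, here is a rough sketch of the
direction I have in mind. This is only an illustration: direct_oom_reap()
and its call site are hypothetical, and a real change would have to sort
out victim selection and locking. The point is that a thread which needs
memory does the reaping itself, so that forward progress no longer depends
on the scheduling priority of the oom_reaper kernel thread or of the
victim:

----------
/*
 * Hypothetical sketch of direct OOM reaping: a thread looping in the
 * allocation path tries to reap the victim's address space itself
 * instead of busy-waiting for somebody else to set MMF_OOM_SKIP.
 */
static void direct_oom_reap(struct mm_struct *mm)
{
	/* We are in the allocation path; never sleep on mmap_sem here. */
	if (!down_read_trylock(&mm->mmap_sem))
		return;
	if (!test_bit(MMF_OOM_SKIP, &mm->flags))
		__oom_reap_task_mm(mm);	/* unmap the reapable ranges */
	up_read(&mm->mmap_sem);
}
----------

Even if one caller is preempted, any other thread waiting for memory can
make the same attempt, so progress does not hinge on a single busy polling
loop.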