Message-ID: <5B47F357.7020202@intel.com>
Date: Fri, 13 Jul 2018 08:33:27 +0800
From: Wei Wang
To: Michal Hocko
CC: Linus Torvalds, virtio-dev@lists.oasis-open.org, Linux Kernel Mailing List,
    virtualization, KVM list, linux-mm, "Michael S. Tsirkin", Andrew Morton,
    Paolo Bonzini, liliang.opensource@gmail.com, yang.zhang.wz@gmail.com,
    quan.xu0@gmail.com, nilal@redhat.com, Rik van Riel, peterx@redhat.com
Subject: Re: [PATCH v35 1/5] mm: support to get hints of free page blocks
References: <5B455D50.90902@intel.com> <20180711092152.GE20050@dhcp22.suse.cz>
    <5B46BB46.2080802@intel.com> <5B46C258.40601@intel.com>
    <20180712081317.GD32648@dhcp22.suse.cz> <5B473CB8.1050306@intel.com>
    <20180712114946.GI32648@dhcp22.suse.cz>
In-Reply-To: <20180712114946.GI32648@dhcp22.suse.cz>

On 07/12/2018 07:49 PM, Michal Hocko wrote:
> On Thu 12-07-18 19:34:16, Wei Wang wrote:
>> On 07/12/2018 04:13 PM, Michal Hocko wrote:
>>> On Thu 12-07-18 10:52:08, Wei Wang wrote:
>>>> On 07/12/2018 10:30 AM, Linus Torvalds wrote:
>>>>> On Wed, Jul 11, 2018 at 7:17 PM Wei Wang wrote:
>>>>>> Would it be better to remove __GFP_THISNODE? We actually want to get all
>>>>>> the guest free pages (from all the nodes).
>>>>> Maybe. Or maybe it would be better to have the memory balloon logic be
>>>>> per-node? Maybe you don't want to remove too much memory from one
>>>>> node? I think it's one of those "play with it" things.
>>>>>
>>>>> I don't think that's the big issue, actually. I think the real issue
>>>>> is how to react quickly and gracefully to "oops, I'm trying to give
>>>>> memory away, but now the guest wants it back" while you're in the
>>>>> middle of trying to create that 2TB list of pages.
>>>> OK. virtio-balloon has already registered an oom notifier
>>>> (virtballoon_oom_notify). I plan to add some control there. If oom happens,
>>>> - stop the page allocation;
>>>> - immediately give back the allocated pages to mm.
>>> Please don't.
>>> Oom notifier is an absolutely hideous interface which
>>> should go away sooner or later (I would much rather like the former) so
>>> do not build a new logic on top of it. I would appreciate if you
>>> actually remove the notifier much more.
>>>
>>> You can give memory back from the standard shrinker interface. If we are
>>> reaching low reclaim priorities then we are struggling to reclaim memory
>>> and then you can start returning pages back.
>> OK. Just curious why oom notifier is thought to be hideous, and has it been
>> a consensus?
> Because it is a completely non-transparent callout from the OOM context
> which is really subtle on its own. It is just too easy to end up in
> weird corner cases. We really have to be careful and be as swift as
> possible. Any potential sleep would make the OOM situation much worse
> because nobody would be able to make a forward progress or (in)direct
> dependency on MM subsystem can easily deadlock. Those are really hard
> to track down and defining the notifier as blockable by design which
> just asks for bad implementations because most people simply do not
> realize how subtle the oom context is.
>
> Another thing is that it happens way too late when we have basically
> reclaimed the world and didn't get out of the memory pressure so you can
> expect any workload is suffering already. Anybody sitting on a large
> amount of reclaimable memory should have released that memory by that
> time. Proportionally to the reclaim pressure ideally.
>
> The notifier API is completely unaware of oom constrains. Just imagine
> you are OOM in a subset of numa nodes. Callback doesn't have any idea
> about that.
>
> Moreover we do have proper reclaim mechanism that has a feedback
> loop and that should be always preferable to an abrupt reclaim.

Sounds very reasonable, thanks for the elaboration. I'll try with shrinker.

Best,
Wei
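
Editorial sketch: the shrinker suggestion in this thread amounts to giving balloon pages back in proportion to reclaim pressure, rather than all at once from an OOM callout. The toy user-space model below illustrates only that feedback loop and is not kernel code. The callback names count_objects and scan_objects mirror the fields of the kernel's struct shrinker, and DEF_PRIORITY matches the kernel constant; ToyBalloon, reclaim_pass, and the "freeable >> priority" sizing rule are assumptions made for illustration and may not match how any given kernel version sizes its scan requests.

```python
# Toy user-space model, NOT kernel code. ToyBalloon, reclaim_pass and the
# shift-by-priority rule are hypothetical; only the callback names and
# DEF_PRIORITY are borrowed from the kernel.

DEF_PRIORITY = 12  # reclaim starts here and counts down as pressure grows


class ToyBalloon:
    """Holds pages taken from the guest and returns them on request."""

    def __init__(self, held_pages):
        self.held_pages = held_pages

    def count_objects(self):
        # Report how many pages could be given back right now.
        return self.held_pages

    def scan_objects(self, nr_to_scan):
        # Give back at most nr_to_scan pages and report how many were freed.
        freed = min(nr_to_scan, self.held_pages)
        self.held_pages -= freed
        return freed


def reclaim_pass(balloon, priority):
    """One reclaim pass: request a share of the balloon sized by priority."""
    freeable = balloon.count_objects()
    # Assumed simplification: a lower priority value requests a larger share.
    nr_to_scan = freeable >> priority
    return balloon.scan_objects(nr_to_scan)


if __name__ == "__main__":
    balloon = ToyBalloon(held_pages=1 << 20)
    for priority in range(DEF_PRIORITY, -1, -1):
        freed = reclaim_pass(balloon, priority)
        print(f"priority {priority:2d}: freed {freed:7d}, "
              f"held {balloon.held_pages:7d}")
```

In this model the balloon releases a small slice at high priority values and progressively more as the priority drops, so memory flows back before the system is anywhere near OOM.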