From: Nitin Gupta
Date: Thu, 28 May 2020 17:55:22 -0700
Subject: Re: [PATCH v5] mm: Proactive compaction
To: Khalid Aziz
Cc: Nitin Gupta, Mel Gorman, Michal Hocko, Vlastimil Babka,
 Matthew Wilcox, Andrew Morton, Mike Kravetz, Joonsoo Kim,
 David Rientjes, linux-kernel, linux-mm, Linux API
References: <20200518181446.25759-1-nigupta@nvidia.com>

On Thu, May 28, 2020 at 4:32 PM Khalid Aziz wrote:
>
> This looks good to me. I like the idea overall of controlling
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at what a reasonable
> value would be for this based upon their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like
'swappiness', will always require some experimentation from the user.

> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the
> > kernel is able to restore a highly fragmented memory state to a
> > fairly compacted memory state within <1 sec for a 32G system. Such
> > data suggests that a more proactive compaction can help us allocate
> > a large fraction of memory as hugepages while keeping allocation
> > latencies low.
> >
> > For a more proactive compaction, the approach taken here is to
> > define a new tunable called 'proactiveness' which dictates bounds
> > for external fragmentation wrt the HUGETLB_PAGE_ORDER order which
> > kcompactd tries to maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes a value in the range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean the system would
> continuously strive to drive the level of fragmentation down to 0,
> which cannot be reasonable and would bog the system down. A cap lower
> than 100 might be a good idea to keep kcompactd from dragging the
> system down.
>

Yes, I understand that a value of 100 would be a continuous compaction
storm, but I still don't want to artificially cap the tunable. The
interpretation of this tunable can change in the future, and a range of
[0, 100] seems more intuitive than, say, [0, 90]. Still, I think a word
of caution should be added to its documentation
(admin-guide/sysctl/vm.rst).

> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out
> > of 762G total free => 98% of free memory could be allocated as
> > hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> >   echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

Oops... I forgot to update the patch description. This is from the v4
patch, which used sysfs; v5 switched to using sysctl.

> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst
> > b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is
> > available in contiguous blocks where possible.
> > This can be important, for example, in the allocation of huge
> > pages, although processes will also directly compact memory as
> > required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default value of
> > +20. This tunable determines how aggressively compaction is done in the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> A value of 100 would cause kcompactd to try to bring fragmentation
> down to 0. If hugepages are being consumed and released continuously
> by the workload, it is possible that kcompactd keeps making progress
> (and hence passes the test "proactive_defer = score < prev_score ?")
> continuously but cannot reach a fragmentation score of 0, and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to something
> less than 100 or set a floor for wmark_low above 0.
>
> Some more guidance regarding the value for this tunable might be
> helpful here, something along the lines of what a value of 100 means
> in terms of how kcompactd will behave. It would then give end users a
> better idea of what they are getting at what cost. You touch upon the
> cost above. Just add some more details so an end user can get a
> better idea of the size of the cost for higher values of this tunable.
>

I like the idea of capping wmark_low at, say, 5 to prevent admins from
overloading the system.
Similarly, wmark_high should be capped at, say, 95 to allow tunable
values below 10 to have any effect: currently such low tunable values
would give wmark_high=100, which would cause proactive compaction to
never be triggered.

Finally, I see your concern about the lack of guidance on extreme
values of the tunable. I will address this in the next (v6) iteration.

Thanks,
Nitin