From: Nitin Gupta <ngupta@nitingupta.dev>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Nitin Gupta <nigupta@nvidia.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH v5] mm: Proactive compaction
Date: Thu, 28 May 2020 17:55:22 -0700
Message-ID: <CAB6CXpDjOFP-GCRk6ukU4SbeutfeBaTV6nH=JA-nTkKifSNQzA@mail.gmail.com>
In-Reply-To: <a83a4df1dc2ed76aa497be3454c67ee4437d2883.camel@oracle.com>

On Thu, May 28, 2020 at 4:32 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> This looks good to me. I like the idea overall of controlling
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at what a reasonable
> value would be for this based upon their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like
'swappiness', will always require some experimentation from the user.


> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a
> > large fraction of memory as hugepages keeping allocation latencies
> > low.
> >
> > For a more proactive compaction, the approach taken here is to
> > define a new tunable called 'proactiveness' which dictates bounds
> > for external fragmentation wrt HUGETLB_PAGE_ORDER order which
> > kcompactd tries to maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean the system would
> continuously strive to drive the level of fragmentation down to 0,
> which cannot be reasonable and would bog the system down. A cap lower
> than 100 might be a good idea to keep kcompactd from dragging the
> system down.
>

Yes, I understand that a value of 100 would mean a continuous compaction
storm, but I still don't want to artificially cap the tunable. The
interpretation of this tunable can change in the future, and a range of
[0, 100] seems more intuitive than, say, [0, 90]. Still, I think a word
of caution should be added to its documentation
(admin-guide/sysctl/vm.rst).
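
For reference, the fragmentation score that 'proactiveness' acts on
boils down to the classic extfrag calculation wrt HUGETLB_PAGE_ORDER; a
paraphrased sketch (the exact code lives in the patch):

  /*
   * Percentage of free pages that are not part of free blocks large
   * enough to satisfy an allocation of the given order: 0 means no
   * fragmentation wrt this order, 100 means all free memory sits in
   * smaller blocks.
   */
  static unsigned int extfrag_for_order(struct zone *zone, unsigned int order)
  {
          struct contig_page_info info;

          fill_contig_page_info(zone, order, &info);
          if (info.free_pages == 0)
                  return 0;

          return div_u64((info.free_pages -
                          (info.free_blocks_suitable << order)) * 100,
                         info.free_pages);
  }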


> >

> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out
> > of 762G total free => 98% of free memory could be allocated as
> > hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

Oops... I forgot to update the patch description. This snippet is from
the v4 patch, which used sysfs; v5 switched to sysctl.


> >

> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is available in contiguous
> >  blocks where possible. This can be important for example in the allocation of
> >  huge pages although processes will also directly compact memory as required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default value of
> > +20. This tunable determines how aggressively compaction is done in the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> Value of 100 would cause kcompactd to try to bring fragmentation down
> to 0. If hugepages are being consumed and released continuously by the
> workload, it is possible that kcompactd keeps making progress (and
> hence passes the test "proactive_defer = score < prev_score ?")
> continuously but cannot reach a fragmentation score of 0, and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to something
> less than 100 or set a floor for wmark_low above 0.
>
> Some more guidance regarding the value for this tunable might be
> helpful here, something along the lines of what a value of 100 means
> in terms of how kcompactd will behave. That would give the end user a
> better idea of what they are getting and at what cost. You touch upon
> the cost above; just add some more details so an end user can get a
> better idea of the size of the cost for higher values of this tunable.
>
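
(For context, the progress check quoted above is the deferral logic in
kcompactd's main loop; a simplified sketch, paraphrased from the patch:)

  if (proactive_defer) {
          proactive_defer--;      /* still backing off */
          continue;
  }
  prev_score = fragmentation_score_node(pgdat);
  proactive_compact_node(pgdat);
  score = fragmentation_score_node(pgdat);

  /* Defer if no progress was made, i.e. the score did not drop. */
  proactive_defer = score < prev_score ?
                          0 : 1 << COMPACT_MAX_DEFER_SHIFT;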

I like the idea of capping wmark_low at, say, 5 to prevent admins from
overloading the system. Similarly, wmark_high should be capped at, say,
95 so that tunable values below 10 have any effect: currently, such low
values yield wmark_high=100, which causes proactive compaction to never
trigger.
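
A sketch of what the capped mapping could look like, assuming the v5
arithmetic of wmark_low = 100 - proactiveness and wmark_high =
wmark_low + 10 (variable names are illustrative; exact numbers to be
settled in v6):

  static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
  {
          unsigned int wmark_low;

          /*
           * Floor wmark_low at 5 so that proactiveness=100 cannot
           * demand a fragmentation score of 0, i.e. a continuous
           * compaction storm.
           */
          wmark_low = max(100U - sysctl_compaction_proactiveness, 5U);

          /*
           * Cap wmark_high at 95 so that tunable values below 10 still
           * leave a reachable trigger threshold.
           */
          return low ? wmark_low : min(wmark_low + 10, 95U);
  }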

Finally, I see your concern about the lack of guidance on extreme values
of the tunable. I will address this in the next (v6) iteration.

Thanks,
Nitin
