From: Michal Hocko <mhocko@kernel.org>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Vlastimil Babka <vbabka@suse.cz>, Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org
Subject: Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
Date: Tue, 8 Oct 2019 09:07:01 +0200 [thread overview]
Message-ID: <20191008070701.GA6681@dhcp22.suse.cz> (raw)
In-Reply-To: <312f4447-a260-8876-2f9d-f300c5c99dc3@oracle.com>
On Mon 07-10-19 12:03:30, Mike Kravetz wrote:
> On 10/7/19 12:55 AM, Michal Hocko wrote:
> > From: David Rientjes <rientjes@google.com>
> >
> > b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> > may not succeed") has chnaged the allocator to bail out from the
> > allocator early to prevent from a potentially excessive memory
> > reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
> > reclaim and compaction loop as long as there is a reasonable chance to
> > make a forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED
> > at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
> >
> > The most obvious affected subsystem is hugetlbfs which allocates huge
> > pages based on an admin request (or via admin configured overcommit).
> > I have done a simple test which tries to allocate half of the memory
> > for hugetlb pages while the memory is full of a clean page cache. This
> > is not an unusual situation because we try to cache as much of the
> > memory as possible and sysctl/sysfs interface to allocate huge pages is
> > there for flexibility to allocate hugetlb pages at any time.
> >
> > System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
> > after the memory is prefilled by a clean page cache:
> > root@test1:~# cat hugetlb_test.sh
> >
> > set -x
> > echo 0 > /proc/sys/vm/nr_hugepages
> > echo 3 > /proc/sys/vm/drop_caches
> > echo 1 > /proc/sys/vm/compact_memory
> > dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> > TS=$(date +%s)
> > echo 256 > /proc/sys/vm/nr_hugepages
> > cat /proc/sys/vm/nr_hugepages
> >
> > The results for 2 consecutive runs on clean 5.3
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> > + date +%s
> > + TS=1569905284
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 256
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> > + date +%s
> > + TS=1569905311
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 256
> >
> > Now with b39d0ee2632d applied
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> > + date +%s
> > + TS=1569905516
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 11
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> > + date +%s
> > + TS=1569905541
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 12
> >
> > The success rate went down by factor of 20!
> >
> > Although hugetlb allocation requests might fail and it is reasonable to
> > expect them to under extremely fragmented memory or when the memory is
> > under a heavy pressure but the above situation is not that case.
> >
> > Fix the regression by reverting back to the previous behavior for
> > __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
> > those requests.
>
> Thank you Michal for doing this.
>
> hugetlbfs allocations are commonly done via sysctl/sysfs shortly after boot
> where this may not be as much of an issue. However, I am aware of at least
> three use cases where allocations are made after the system has been up and
> running for quite some time:
> - DB reconfiguration. If sysctl/sysfs fails to get required number of huge
> pages, system is rebooted to perform allocation after boot.
> - VM provisioning. If unable get required number of huge pages, fall back
> to base pages.
> - An application that does not preallocate pool, but rather allocates pages
> at fault time for optimal NUMA locality.
> In all cases, I would expect b39d0ee2632d to cause regressions and noticable
> behavior changes.
Thanks a lot Mike. This is a very useful addition and I can see Andrew
has already added it to the changelog (thx). The usecase I keep hearing
most from the field is the first and the second one.
> My quick/limited testing in [1] was insufficient. It was also mentioned that
> if something like b39d0ee2632d went forward, I would like exemptions for
> __GFP_RETRY_MAYFAIL requests as in this patch.
>
> >
> > [mhocko@suse.com: reworded changelog]
> > Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
>
> FWIW,
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Thanks!
> [1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
> --
> Mike Kravetz
--
Michal Hocko
SUSE Labs
next prev parent reply other threads:[~2019-10-08 7:07 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-10-07 7:55 [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim Michal Hocko
2019-10-07 19:03 ` Mike Kravetz
2019-10-08 7:07 ` Michal Hocko [this message]
2019-10-08 7:21 ` Vlastimil Babka
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191008070701.GA6681@dhcp22.suse.cz \
--to=mhocko@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mike.kravetz@oracle.com \
--cc=rientjes@google.com \
--cc=torvalds@linux-foundation.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).