linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
@ 2019-10-07  7:55 Michal Hocko
  2019-10-07 19:03 ` Mike Kravetz
  2019-10-08  7:21 ` Vlastimil Babka
  0 siblings, 2 replies; 4+ messages in thread
From: Michal Hocko @ 2019-10-07  7:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Vlastimil Babka, Mike Kravetz, Mel Gorman,
	Andrew Morton, LKML, linux-mm, Michal Hocko

From: David Rientjes <rientjes@google.com>

b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
may not succeed") has changed the allocator to bail out early in order
to prevent potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is
designed to retry the allocation, reclaim and compaction loop as long
as there is a reasonable chance to make forward progress. Neither
COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY
compaction attempt gives this feedback.

The most obviously affected subsystem is hugetlbfs, which allocates
huge pages based on an admin request (or via admin-configured
overcommit). I have done a simple test which tries to allocate half of
the memory for hugetlb pages while the memory is full of clean page
cache. This is not an unusual situation because we try to cache as much
of the memory as possible, and the sysctl/sysfs interface to allocate
huge pages is there precisely so that hugetlb pages can be allocated at
any time.
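
The pool is grown by high-order allocations of roughly the following
shape (a simplified kernel-side sketch; the helper name is made up and
this is not the exact hugetlb call chain, but the gfp flags are the
kind such a request carries):

static struct page *pool_page_alloc_sketch(int nid, unsigned int order)
{
	/*
	 * __GFP_RETRY_MAYFAIL: keep retrying the reclaim/compaction
	 * loop as long as forward progress looks possible, but the
	 * allocation is still allowed to fail in the end and does not
	 * trigger the OOM killer.
	 */
	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_COMP |
		    __GFP_RETRY_MAYFAIL | __GFP_NOWARN;

	return alloc_pages_node(nid, gfp, order);
}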

System has 1GB of RAM and we are requesting 512MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:
root@test1:~# cat hugetlb_test.sh

set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages

The results for 2 consecutive runs on clean 5.3
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256

Now with b39d0ee2632d applied
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12

The success rate went down by a factor of 20!

Hugetlb allocation requests might fail, and it is reasonable to expect
them to fail under extremely fragmented memory or heavy memory
pressure, but the above situation is not such a case.

Fix the regression by reverting to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
those requests.

[mhocko@suse.com: reworded changelog]
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this has been posted by David as an RFC [1]. David doesn't seem to
appreciate the level of the regression, so I have largely rewritten the
changelog to be more explicit. I haven't changed the patch itself,
so I have preserved his s-o-b.

I would also like to emphasise that I am not overly happy about the
patch. Vlastimil has posted [2] an alternative solution which looks
better, but it is also slightly more complex. We can do that in a
follow-up, though, so let's go with the simplest hack^Wsolution for now.

[1] http://lkml.kernel.org/r/alpine.DEB.2.21.1910021556270.187014@chino.kir.corp.google.com
[2] http://lkml.kernel.org/r/20191001054343.GA15624@dhcp22.suse.cz

 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..01aa46acee76 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
 			 * or is prohibited because it recently failed at this
-			 * order, fail immediately.
+			 * order, fail immediately unless the allocator has
+			 * requested compaction and reclaim retry.
 			 *
 			 * Reclaim is
 			 *  - potentially very expensive because zones are far
-- 
2.20.1



* Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
  2019-10-07  7:55 [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim Michal Hocko
@ 2019-10-07 19:03 ` Mike Kravetz
  2019-10-08  7:07   ` Michal Hocko
  2019-10-08  7:21 ` Vlastimil Babka
  1 sibling, 1 reply; 4+ messages in thread
From: Mike Kravetz @ 2019-10-07 19:03 UTC (permalink / raw)
  To: Michal Hocko, Linus Torvalds
  Cc: David Rientjes, Vlastimil Babka, Mel Gorman, Andrew Morton, LKML,
	linux-mm, Michal Hocko

On 10/7/19 12:55 AM, Michal Hocko wrote:
> From: David Rientjes <rientjes@google.com>
> 
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") has changed the allocator to bail out early in order
> to prevent potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is
> designed to retry the allocation, reclaim and compaction loop as long
> as there is a reasonable chance to make forward progress. Neither
> COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY
> compaction attempt gives this feedback.
> 
> The most obviously affected subsystem is hugetlbfs, which allocates
> huge pages based on an admin request (or via admin-configured
> overcommit). I have done a simple test which tries to allocate half of
> the memory for hugetlb pages while the memory is full of clean page
> cache. This is not an unusual situation because we try to cache as much
> of the memory as possible, and the sysctl/sysfs interface to allocate
> huge pages is there precisely so that hugetlb pages can be allocated at
> any time.
> 
> System has 1GB of RAM and we are requesting 512MB worth of hugetlb pages
> after the memory is prefilled by a clean page cache:
> root@test1:~# cat hugetlb_test.sh
> 
> set -x
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/compact_memory
> dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> TS=$(date +%s)
> echo 256 > /proc/sys/vm/nr_hugepages
> cat /proc/sys/vm/nr_hugepages
> 
> The results for 2 consecutive runs on clean 5.3
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> + date +%s
> + TS=1569905284
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> + date +%s
> + TS=1569905311
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> 
> Now with b39d0ee2632d applied
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> + date +%s
> + TS=1569905516
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 11
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> + date +%s
> + TS=1569905541
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 12
> 
> The success rate went down by a factor of 20!
> 
> Hugetlb allocation requests might fail, and it is reasonable to expect
> them to fail under extremely fragmented memory or heavy memory
> pressure, but the above situation is not such a case.
> 
> Fix the regression by reverting to the previous behavior for
> __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
> those requests.

Thank you Michal for doing this.

hugetlbfs allocations are commonly done via sysctl/sysfs shortly after boot
where this may not be as much of an issue.  However, I am aware of at least
three use cases where allocations are made after the system has been up and
running for quite some time:
- DB reconfiguration.  If sysctl/sysfs fails to get the required number of
  huge pages, the system is rebooted to perform the allocation after boot.
- VM provisioning.  If unable to get the required number of huge pages, fall
  back to base pages.
- An application that does not preallocate a pool, but rather allocates pages
  at fault time for optimal NUMA locality (see the sketch below).
In all cases, I would expect b39d0ee2632d to cause regressions and noticeable
behavior changes.
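
For the third case, the application side looks roughly like the following
(a minimal user-space sketch, assuming 2MB huge pages and
vm.nr_overcommit_hugepages set so surplus pages can be allocated at
fault time):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(64UL << 21)	/* 64 huge pages of 2MB each */

int main(void)
{
	/*
	 * MAP_HUGETLB + MAP_NORESERVE: no pages are allocated (or
	 * reserved) up front; each 2MB page is allocated from the buddy
	 * allocator when it is first touched, which gives NUMA-local
	 * placement but depends on high-order allocations succeeding at
	 * runtime.
	 */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, LEN);	/* fault in the huge pages */
	munmap(p, LEN);
	return 0;
}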

My quick/limited testing in [1] was insufficient.  I also mentioned there
that if something like b39d0ee2632d went forward, I would like an exemption
for __GFP_RETRY_MAYFAIL requests, as in this patch.

> 
> [mhocko@suse.com: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

FWIW,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
-- 
Mike Kravetz


* Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
  2019-10-07 19:03 ` Mike Kravetz
@ 2019-10-08  7:07   ` Michal Hocko
  0 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2019-10-08  7:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linus Torvalds, David Rientjes, Vlastimil Babka, Mel Gorman,
	Andrew Morton, LKML, linux-mm

On Mon 07-10-19 12:03:30, Mike Kravetz wrote:
> On 10/7/19 12:55 AM, Michal Hocko wrote:
> > From: David Rientjes <rientjes@google.com>
> > 
> > b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> > may not succeed") has changed the allocator to bail out early in order
> > to prevent potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is
> > designed to retry the allocation, reclaim and compaction loop as long
> > as there is a reasonable chance to make forward progress. Neither
> > COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY
> > compaction attempt gives this feedback.
> > 
> > The most obviously affected subsystem is hugetlbfs, which allocates
> > huge pages based on an admin request (or via admin-configured
> > overcommit). I have done a simple test which tries to allocate half of
> > the memory for hugetlb pages while the memory is full of clean page
> > cache. This is not an unusual situation because we try to cache as much
> > of the memory as possible, and the sysctl/sysfs interface to allocate
> > huge pages is there precisely so that hugetlb pages can be allocated at
> > any time.
> > 
> > System has 1GB of RAM and we are requesting 512MB worth of hugetlb pages
> > after the memory is prefilled by a clean page cache:
> > root@test1:~# cat hugetlb_test.sh
> > 
> > set -x
> > echo 0 > /proc/sys/vm/nr_hugepages
> > echo 3 > /proc/sys/vm/drop_caches
> > echo 1 > /proc/sys/vm/compact_memory
> > dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> > TS=$(date +%s)
> > echo 256 > /proc/sys/vm/nr_hugepages
> > cat /proc/sys/vm/nr_hugepages
> > 
> > The results for 2 consecutive runs on clean 5.3
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> > + date +%s
> > + TS=1569905284
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 256
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> > + date +%s
> > + TS=1569905311
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 256
> > 
> > Now with b39d0ee2632d applied
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> > + date +%s
> > + TS=1569905516
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 11
> > root@test1:~# sh hugetlb_test.sh
> > + echo 0
> > + echo 3
> > + echo 1
> > + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> > 262144+0 records in
> > 262144+0 records out
> > 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> > + date +%s
> > + TS=1569905541
> > + echo 256
> > + cat /proc/sys/vm/nr_hugepages
> > 12
> > 
> > The success rate went down by a factor of 20!
> > 
> > Hugetlb allocation requests might fail, and it is reasonable to expect
> > them to fail under extremely fragmented memory or heavy memory
> > pressure, but the above situation is not such a case.
> > 
> > Fix the regression by reverting to the previous behavior for
> > __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
> > those requests.
> 
> Thank you Michal for doing this.
> 
> hugetlbfs allocations are commonly done via sysctl/sysfs shortly after boot
> where this may not be as much of an issue.  However, I am aware of at least
> three use cases where allocations are made after the system has been up and
> running for quite some time:
> - DB reconfiguration.  If sysctl/sysfs fails to get the required number of
>   huge pages, the system is rebooted to perform the allocation after boot.
> - VM provisioning.  If unable to get the required number of huge pages, fall
>   back to base pages.
> - An application that does not preallocate a pool, but rather allocates pages
>   at fault time for optimal NUMA locality.
> In all cases, I would expect b39d0ee2632d to cause regressions and noticeable
> behavior changes.

Thanks a lot Mike. This is a very useful addition and I can see Andrew
has already added it to the changelog (thx). The use cases I keep
hearing about most from the field are the first and the second ones.

> My quick/limited testing in [1] was insufficient.  I also mentioned there
> that if something like b39d0ee2632d went forward, I would like an exemption
> for __GFP_RETRY_MAYFAIL requests, as in this patch.
> 
> > 
> > [mhocko@suse.com: reworded changelog]
> > Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> FWIW,
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks!

> [1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
> -- 
> Mike Kravetz

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
  2019-10-07  7:55 [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim Michal Hocko
  2019-10-07 19:03 ` Mike Kravetz
@ 2019-10-08  7:21 ` Vlastimil Babka
  1 sibling, 0 replies; 4+ messages in thread
From: Vlastimil Babka @ 2019-10-08  7:21 UTC (permalink / raw)
  To: Michal Hocko, Linus Torvalds
  Cc: David Rientjes, Mike Kravetz, Mel Gorman, Andrew Morton, LKML,
	linux-mm, Michal Hocko

On 10/7/19 9:55 AM, Michal Hocko wrote:
> From: David Rientjes <rientjes@google.com>
Nit: the subject is still somewhat misleading IMHO, especially in light
of Mike's responses. I would say "reclaim as needed" instead of
"excessively reclaim". The excessive reclaim behavior in hugetlb nr_pages
setting was a bug that was addressed by a different series.
 
...

> [mhocko@suse.com: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I still believe that using __GFP_NORETRY as needed is a cleaner solution
than a check for pageblock order and __GFP_IO, but that can always be
changed later. This patch does fix the hugetlbfs regression, so

Acked-by: Vlastimil Babka <vbabka@suse.cz>


