From: Liam Howlett <liam.howlett@oracle.com>
To: Guenter Roeck <linux@roeck-us.net>,
	Heiko Carstens <hca@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: Re: [PATCH] mapletree-vs-khugepaged
Date: Tue, 31 May 2022 18:56:34 +0000	[thread overview]
Message-ID: <20220531185626.yvlmymbxyoe5vags@revolver> (raw)
In-Reply-To: <20220530173812.ehckwwrb5fk7mjfd@revolver>

[-- Attachment #1: Type: text/plain, Size: 5292 bytes --]

* Liam R. Howlett <Liam.Howlett@Oracle.com> [220530 13:38]:
> * Guenter Roeck <linux@roeck-us.net> [220519 17:42]:
> > On 5/19/22 07:35, Liam Howlett wrote:
> > > * Guenter Roeck <linux@roeck-us.net> [220517 10:32]:
> > > 
> > > ...
> > > > 
> > > > Another bisect result, boot failures with nommu targets (arm:mps2-an385,
> > > > m68k:mcf5208evb). Bisect log is the same for both.
> > > ...
> > > > # first bad commit: [bd773a78705fb58eeadd80e5b31739df4c83c559] nommu: remove uses of VMA linked list
> > > 
> > > I cannot reproduce this on my side, even with that specific commit.  Can
> > > you point me to the failure log, config file, etc?  Do you still see
> > > this with the fixes I've sent recently?
> > > 
> > 
> > This was in linux-next; most recently with next-20220517.
> > I don't know if that was up-to-date with your patches.
> > The problem seems to be memory allocation failures.
> > A sample log is at
> > https://kerneltests.org/builders/qemu-m68k-next/builds/1065/steps/qemubuildcommand/logs/stdio
> > The log history at
> > https://kerneltests.org/builders/qemu-m68k-next?numbuilds=30
> > will give you a variety of logs.
> > 
> > The configuration is derived from m5208evb_defconfig, with initrd
> > and command line embedded in the image. You can see the detailed
> > configuration updates at
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/run-qemu-m68k.sh
> > 
> > Qemu command line is
> > 
> > qemu-system-m68k -M mcf5208evb -kernel vmlinux \
> >     -cpu m5208 -no-reboot -nographic -monitor none \
> >     -append "rdinit=/sbin/init console=ttyS0,115200"
> > 
> > with initrd from
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/rootfs-5208.cpio.gz
> > 
> > I use qemu v6.2, but any recent qemu version should work.
> 
> I have qemu 7.0, which seems to have changed the default memory size
> from 32MB to 128MB.  The 32MB total can be seen in your log here:
> 
> Memory: 27928K/32768K available (2827K kernel code, 160K rwdata, 432K rodata, 1016K init, 66K bss, 4840K reserved, 0K cma-reserved)
> 
> With 128MB the kernel boots.  With 64MB it also boots.  With 32MB it
> fails with an OOM.  Looking into it more, I see that the OOM is caused
> by a contiguous page allocation of 1MB (order 7 with 8K pages).  This
> can be seen in the log as well:
> 
> Running sysctl: echo: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL), nodemask=(null)
> ...
> nommu: Allocation of length 884736 from process 63 (echo) failed
> 
> This last log message above comes from the code path that uses
> alloc_pages_exact().
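> 
> For reference, a quick sketch of the order math on this config (8K
> pages, so PAGE_SHIFT == 13); get_order() rounds a request up to the
> next power-of-two number of pages:
> 
> 	/* 884736 bytes / 8192 bytes per page = 108 pages,      */
> 	/* rounded up to 128 pages = order 7 = 1MB contiguous:  */
> 	unsigned int order = get_order(884736);	/* order == 7 here */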
> 
> I don't see why my 256-byte nodes (a single order-0 allocation yields
> 32 nodes) would fragment the memory beyond use on boot.  I checked for
> some sort of massive leak by adding a static node count to the code and
> only ever hit ~12 nodes.  Consulting the OOM log from the above link
> again:
> 
> DMA: 0*8kB 1*16kB (U) 9*32kB (U) 7*64kB (U) 21*128kB (U) 7*256kB (U) 6*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8304kB
> 
> So to get to the point of breaking up a 1MB block, we'd need an obscene
> number of nodes.
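> 
> (For reference, that free list sums to 1*16 + 9*32 + 7*64 + 21*128 +
> 7*256 + 6*512 = 8304kB free in total, yet shows zero free blocks of
> 1024kB or larger - which is exactly why the order-7 request fails
> despite ~8MB being free.)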
> 
> Furthermore, the OOM on boot does not always happen.  When boot
> succeeds without an OOM, slabinfo shows that maple_node has 32 active
> objects (32 x 256 bytes = one 8K page), which is a single order-0
> allocation - although boot does mostly end in an OOM.  It is worth
> noting that slabinfo is lazy about counting active objects, so the real
> number is most likely lower than this value.
> 
> Does anyone have any idea why nommu would be getting this fragmented?

Answer: Why, yes.  Matthew does.  Using alloc_pages_exact() means we
allocate a huge chunk of memory and then immediately free the leftover
pages.  Those freed leftover pages are handed out first on the next
request - which happens to come from the maple tree.
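
A rough sketch of the sequence, simplified from mm/page_alloc.c (the
comments are my reading of the pre-patch behaviour, not the literal
code):

	/* alloc_pages_exact(884736, GFP_KERNEL) effectively does: */
	unsigned int order = get_order(884736);		/* order 7 = 1MB */
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);

	/* make_alloc_exact() then splits the block and frees the tail: */
	split_page(virt_to_page((void *)addr), order);	/* 128 order-0 pages */
	/* Each leftover page is freed to the HEAD of the order-0 free
	 * list, so the very next order-0 allocation - here, the
	 * maple_node slab - grabs one and pins it, preventing the
	 * buddies from merging back into a large contiguous block. */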

It seems nommu is so close to OOMing already that this makes a
difference.  Attached is a patch which _almost_ solves the issue by
making those leftover pages less likely to be reused, but whether boot
OOMs is still a matter of timing.  It reduces the failure rate by a
large margin - roughly 1 in 10 boots fail instead of 4 in 5.  This patch
is probably worth taking on its own, as it reduces memory fragmentation
for short-lived allocations that use alloc_pages_exact().
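
For context (my reading of the fpi_flags handling in mm/page_alloc.c,
not part of the patch itself): the flag only changes which end of the
buddy free list the freed page lands on:

	/* FPI_NONE (the default): the freed page goes to the head of
	 * the free list and is handed out again on the next request. */
	__free_one_page(page, pfn, zone, order, migratetype, FPI_NONE);

	/* FPI_TO_TAIL: the freed page goes to the tail, so it is
	 * reused last and has more time to merge with its buddies. */
	__free_one_page(page, pfn, zone, order, migratetype, FPI_TO_TAIL);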

I changed the nommu code a bit to reduce memory usage as well.  During a
split event, I no longer delete and then re-add the VMA, and I only
preallocate a single time for the two writes associated with a split.  I
also moved my preallocation ahead of the call path that does
alloc_pages_exact().  This all but ensures we won't fragment the larger
chunks of memory, since a single page yields enough nodes to get at
least through boot.  However, the failure rate remained at 1/10 with
this change.
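
For illustration, the shape of that nommu change (a hypothetical sketch
of the idea only - the helper names below are made up, not the actual
maple tree calls in the diff):

	/* One node preallocation now covers both tree writes of a
	 * split, and the original VMA is adjusted in place rather
	 * than deleted and re-added: */
	if (preallocate_split_nodes(&mas, GFP_KERNEL))	/* hypothetical */
		return -ENOMEM;
	write_lower_range(&mas, vma);	/* hypothetical: shrink old VMA */
	write_upper_range(&mas, new);	/* hypothetical: insert new VMA */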

I had accepted the scenario that this all just worked before, but my
setup is different from Guenter's.  I am using buildroot-2022.02.1 and
qemu 7.0 for my testing.  My configuration OOMs 12/13 times without the
maple tree, so I think we actually lowered the memory pressure on boot
with these changes.  Obviously there is an element of timing that causes
variation in the testing, so exact numbers are not possible.

Thanks,
Liam


[-- Attachment #2: 0001-mm-page_alloc-Reduce-potential-fragmentation-in-make.patch --]
[-- Type: text/x-diff; name="0001-mm-page_alloc-Reduce-potential-fragmentation-in-make.patch", Size: 1664 bytes --]

From abef6d264d2413a625670bdb873133576d5cce5f Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Date: Tue, 31 May 2022 09:20:51 -0400
Subject: [PATCH] mm/page_alloc:  Reduce potential fragmentation in
 make_alloc_exact()

Try to avoid handing out the leftover pages from a split page on the
next request by freeing them with __free_pages_ok() and FPI_TO_TAIL.
This increases the potential for memory to defragment when it is used
for a short period of time.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/page_alloc.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f01c71e41bcf..8b6d6cada684 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5580,14 +5580,18 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
 {
 	if (addr) {
-		unsigned long alloc_end = addr + (PAGE_SIZE << order);
-		unsigned long used = addr + PAGE_ALIGN(size);
-
-		split_page(virt_to_page((void *)addr), order);
-		while (used < alloc_end) {
-			free_page(used);
-			used += PAGE_SIZE;
-		}
+		unsigned long nr = DIV_ROUND_UP(size, PAGE_SIZE);
+		struct page *page = virt_to_page((void *)addr);
+		struct page *last = page + nr;
+
+		split_page_owner(page, 1 << order);
+		split_page_memcg(page, 1 << order);
+		while (page < --last)
+			set_page_refcounted(last);
+
+		last = page + (1UL << order);
+		for (page += nr; page < last; page++)
+			__free_pages_ok(page, 0, FPI_TO_TAIL);
 	}
 	return (void *)addr;
 }
-- 
2.35.1

