All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] Pages not released from memblock to the buddy allocator
@ 2023-01-04  7:43 Aaron Thompson
  2023-01-05  1:43 ` David Rientjes
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Aaron Thompson @ 2023-01-04  7:43 UTC (permalink / raw)
  To: linux-mm, Mike Rapoport
  Cc: H. Peter Anvin, Alexander Potapenko, Andrew Morton,
	Andy Shevchenko, Ard Biesheuvel, Borislav Petkov, Darren Hart,
	Dave Hansen, Dmitry Vyukov, Ingo Molnar, Marco Elver,
	Thomas Gleixner, kasan-dev, linux-efi, linux-kernel,
	platform-driver-x86, x86, Aaron Thompson

Hi all,

(I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code change
I'm proposing is in memblock.)

I've run into a case where pages are not released from memblock to the buddy
allocator. If deferred struct page init is enabled, and memblock_free_late() is
called before page_alloc_init_late() has run, and the pages being freed are in
the deferred init range, then the pages are never released. memblock_free_late()
calls memblock_free_pages() which only releases the pages if they are not in the
deferred range. That is correct for free pages because they will be initialized
and released by page_alloc_init_late(), but memblock_free_late() is dealing with
reserved pages. If memblock_free_late() doesn't release those pages, they will
forever be reserved. All reserved pages were initialized by memblock_free_all(),
so I believe the fix is to simply have memblock_free_late() call
__free_pages_core() directly instead of memblock_free_pages().

In addition, there was a recent change (3c20650982609 "init: kmsan: call KMSAN
initialization routines") that added a call to kmsan_memblock_free_pages() in
memblock_free_pages(). It looks to me like it would also be incorrect to make
that call in the memblock_free_late() case, because the KMSAN metadata was
already initialized for all reserved pages by kmsan_init_shadow(), which runs
before memblock_free_all(). Having memblock_free_late() call __free_pages_core()
directly also fixes this issue.

I encountered this issue when I tried to switch some x86_64 VMs I was running
from BIOS boot to EFI boot. The x86 EFI code reserves all EFI boot services
ranges via memblock_reserve() (part of setup_arch()), and it frees them later
via memblock_free_late() (part of efi_enter_virtual_mode()). The EFI
implementation of the VM I was attempting this on, an Amazon EC2 t3.micro
instance, maps north of 170 MB in boot services ranges that happen to fall in
the deferred init range. I certainly noticed when that much memory went missing
on a 1 GB VM.

I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some real
x86_64 EFI systems, and they all look good to me. However, the physical systems
that I have don't actually trigger this issue because they all have more than 4
GB of RAM, so their deferred init range starts above 4 GB (it's always in the
highest zone and ZONE_DMA32 ends at 4 GB) while their EFI boot services mappings
are below 4 GB.

Deferred struct page init can't be enabled on x86_32 so those systems are
unaffected. I haven't found any other code paths that would trigger this issue,
though I can't promise that there aren't any. I did run with this patch on an
arm64 VM as a sanity check, but memblock=debug didn't show any calls to
memblock_free_late() so that system was unaffected as well.

I am guessing that this change should also go the stable kernels but it may not
apply cleanly (__free_pages_core() was __free_pages_boot_core() and
memblock_free_pages() was __free_pages_bootmem() when this issue was first
introduced). I haven't gone through that process before so please let me know if
I can help with that.

This is the end result on an EC2 t3.micro instance booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816


Aaron Thompson (1):
  mm: Always release pages to the buddy allocator in
    memblock_free_late().

 mm/memblock.c                     | 2 +-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/1] Pages not released from memblock to the buddy allocator
  2023-01-04  7:43 [PATCH 0/1] Pages not released from memblock to the buddy allocator Aaron Thompson
@ 2023-01-05  1:43 ` David Rientjes
  2023-01-05  4:17 ` [PATCH v2 " Aaron Thompson
       [not found] ` <20230105041650.1485-1-dev@aaront.org>
  2 siblings, 0 replies; 9+ messages in thread
From: David Rientjes @ 2023-01-05  1:43 UTC (permalink / raw)
  To: Aaron Thompson
  Cc: linux-mm, Mike Rapoport, H. Peter Anvin, Alexander Potapenko,
	Andrew Morton, Andy Shevchenko, Ard Biesheuvel, Borislav Petkov,
	Darren Hart, Dave Hansen, Dmitry Vyukov, Ingo Molnar,
	Marco Elver, Thomas Gleixner, kasan-dev, linux-efi, linux-kernel,
	platform-driver-x86, x86

On Wed, 4 Jan 2023, Aaron Thompson wrote:

> Hi all,
> 
> (I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code change
> I'm proposing is in memblock.)
> 
> I've run into a case where pages are not released from memblock to the buddy
> allocator. If deferred struct page init is enabled, and memblock_free_late() is
> called before page_alloc_init_late() has run, and the pages being freed are in
> the deferred init range, then the pages are never released. memblock_free_late()
> calls memblock_free_pages() which only releases the pages if they are not in the
> deferred range. That is correct for free pages because they will be initialized
> and released by page_alloc_init_late(), but memblock_free_late() is dealing with
> reserved pages. If memblock_free_late() doesn't release those pages, they will
> forever be reserved. All reserved pages were initialized by memblock_free_all(),
> so I believe the fix is to simply have memblock_free_late() call
> __free_pages_core() directly instead of memblock_free_pages().
> 
> In addition, there was a recent change (3c20650982609 "init: kmsan: call KMSAN
> initialization routines") that added a call to kmsan_memblock_free_pages() in
> memblock_free_pages(). It looks to me like it would also be incorrect to make
> that call in the memblock_free_late() case, because the KMSAN metadata was
> already initialized for all reserved pages by kmsan_init_shadow(), which runs
> before memblock_free_all(). Having memblock_free_late() call __free_pages_core()
> directly also fixes this issue.
> 
> I encountered this issue when I tried to switch some x86_64 VMs I was running
> from BIOS boot to EFI boot. The x86 EFI code reserves all EFI boot services
> ranges via memblock_reserve() (part of setup_arch()), and it frees them later
> via memblock_free_late() (part of efi_enter_virtual_mode()). The EFI
> implementation of the VM I was attempting this on, an Amazon EC2 t3.micro
> instance, maps north of 170 MB in boot services ranges that happen to fall in
> the deferred init range. I certainly noticed when that much memory went missing
> on a 1 GB VM.
> 
> I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some real
> x86_64 EFI systems, and they all look good to me. However, the physical systems
> that I have don't actually trigger this issue because they all have more than 4
> GB of RAM, so their deferred init range starts above 4 GB (it's always in the
> highest zone and ZONE_DMA32 ends at 4 GB) while their EFI boot services mappings
> are below 4 GB.
> 
> Deferred struct page init can't be enabled on x86_32 so those systems are
> unaffected. I haven't found any other code paths that would trigger this issue,
> though I can't promise that there aren't any. I did run with this patch on an
> arm64 VM as a sanity check, but memblock=debug didn't show any calls to
> memblock_free_late() so that system was unaffected as well.
> 
> I am guessing that this change should also go the stable kernels but it may not
> apply cleanly (__free_pages_core() was __free_pages_boot_core() and
> memblock_free_pages() was __free_pages_bootmem() when this issue was first
> introduced). I haven't gone through that process before so please let me know if
> I can help with that.
> 
> This is the end result on an EC2 t3.micro instance booting via EFI:
> 
> v6.2-rc2:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  178867
> 
> v6.2-rc2 + patch:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  222816
> 

The above before + after seems useful information to include in the commit 
description of the change.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 0/1] Pages not released from memblock to the buddy allocator
  2023-01-04  7:43 [PATCH 0/1] Pages not released from memblock to the buddy allocator Aaron Thompson
  2023-01-05  1:43 ` David Rientjes
@ 2023-01-05  4:17 ` Aaron Thompson
       [not found] ` <20230105041650.1485-1-dev@aaront.org>
  2 siblings, 0 replies; 9+ messages in thread
From: Aaron Thompson @ 2023-01-05  4:17 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: H. Peter Anvin, Alexander Potapenko, Andrew Morton,
	Andy Shevchenko, Ard Biesheuvel, Borislav Petkov, Darren Hart,
	Dave Hansen, David Rientjes, Dmitry Vyukov, Ingo Molnar,
	Marco Elver, Thomas Gleixner, kasan-dev, linux-efi, linux-kernel,
	platform-driver-x86, x86, Aaron Thompson

Changelog:
v2:
  - Add comment in memblock_free_late() (suggested by Mike Rapoport)
  - Improve commit message, including an explanation of the x86_64 EFI boot
    issue (suggested by Mike Rapoport and David Rientjes)

Hi all,

(I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code change
I'm proposing is in memblock.)

I've run into a case where pages are not released from memblock to the buddy
allocator. If deferred struct page init is enabled, and memblock_free_late() is
called before page_alloc_init_late() has run, and the pages being freed are in
the deferred init range, then the pages are never released. memblock_free_late()
calls memblock_free_pages() which only releases the pages if they are not in the
deferred range. That is correct for free pages because they will be initialized
and released by page_alloc_init_late(), but memblock_free_late() is dealing with
reserved pages. If memblock_free_late() doesn't release those pages, they will
forever be reserved. All reserved pages were initialized by memblock_free_all(),
so I believe the fix is to simply have memblock_free_late() call
__free_pages_core() directly instead of memblock_free_pages().

In addition, there was a recent change (3c20650982609 "init: kmsan: call KMSAN
initialization routines") that added a call to kmsan_memblock_free_pages() in
memblock_free_pages(). It looks to me like it would also be incorrect to make
that call in the memblock_free_late() case, because the KMSAN metadata was
already initialized for all reserved pages by kmsan_init_shadow(), which runs
before memblock_free_all(). Having memblock_free_late() call __free_pages_core()
directly also fixes this issue.

I encountered this issue when I tried to switch some x86_64 VMs I was running
from BIOS boot to EFI boot. The x86 EFI code reserves all EFI boot services
ranges via memblock_reserve() (part of setup_arch()), and it frees them later
via memblock_free_late() (part of efi_enter_virtual_mode()). The EFI
implementation of the VM I was attempting this on, an Amazon EC2 t3.micro
instance, maps north of 170 MB in boot services ranges that happen to fall in
the deferred init range. I certainly noticed when that much memory went missing
on a 1 GB VM.

I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some real
x86_64 EFI systems, and they all look good to me. However, the physical systems
that I have don't actually trigger this issue because they all have more than 4
GB of RAM, so their deferred init range starts above 4 GB (it's always in the
highest zone and ZONE_DMA32 ends at 4 GB) while their EFI boot services mappings
are below 4 GB.

Deferred struct page init can't be enabled on x86_32 so those systems are
unaffected. I haven't found any other code paths that would trigger this issue,
though I can't promise that there aren't any. I did run with this patch on an
arm64 VM as a sanity check, but memblock=debug didn't show any calls to
memblock_free_late() so that system was unaffected as well.

I am guessing that this change should also go the stable kernels but it may not
apply cleanly (__free_pages_core() was __free_pages_boot_core() and
memblock_free_pages() was __free_pages_bootmem() when this issue was first
introduced). I haven't gone through that process before so please let me know if
I can help with that.

This is the end result on an EC2 t3.micro instance booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816


Aaron Thompson (1):
  mm: Always release pages to the buddy allocator in
    memblock_free_late().

 mm/memblock.c                     | 8 +++++++-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
@ 2023-01-30  7:40     ` Xu Yu
  0 siblings, 0 replies; 9+ messages in thread
From: Aaron Thompson @ 2023-01-05  4:17 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: H. Peter Anvin, Alexander Potapenko, Andrew Morton,
	Andy Shevchenko, Ard Biesheuvel, Borislav Petkov, Darren Hart,
	Dave Hansen, David Rientjes, Dmitry Vyukov, Ingo Molnar,
	Marco Elver, Thomas Gleixner, kasan-dev, linux-efi, linux-kernel,
	platform-driver-x86, x86, Aaron Thompson

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.

In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().

For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.

One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happen to fall within the deferred
init range, the pages will not be released and that memory will be
unavailable.

For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev@aaront.org>
---
 mm/memblock.c                     | 8 +++++++-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 511d4783dcf1..fc3d8fbd2060 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 	end = PFN_DOWN(base + size);
 
 	for (; cursor < end; cursor++) {
-		memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+		/*
+		 * Reserved pages are always initialized by the end of
+		 * memblock_free_all() (by memmap_init() and, if deferred
+		 * initialization is enabled, memmap_init_reserved_pages()), so
+		 * these pages can be released directly to the buddy allocator.
+		 */
+		__free_pages_core(pfn_to_page(cursor), 0);
 		totalram_pages_inc();
 	}
 }
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
 
 struct page {};
 
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
 void memblock_free_pages(struct page *page, unsigned long pfn,
 			 unsigned int order)
 {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
  2023-01-30  7:40     ` Xu Yu
  (?)
@ 2023-01-05 10:48     ` Ingo Molnar
  2023-01-06  2:02       ` Aaron Thompson
  -1 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2023-01-05 10:48 UTC (permalink / raw)
  To: Aaron Thompson
  Cc: Mike Rapoport, linux-mm, H. Peter Anvin, Alexander Potapenko,
	Andrew Morton, Andy Shevchenko, Ard Biesheuvel, Borislav Petkov,
	Darren Hart, Dave Hansen, David Rientjes, Dmitry Vyukov,
	Ingo Molnar, Marco Elver, Thomas Gleixner, kasan-dev, linux-efi,
	linux-kernel, platform-driver-x86, x86


* Aaron Thompson <dev@aaront.org> wrote:

> For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
> 
> v6.2-rc2:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  178867
> 
> v6.2-rc2 + patch:
>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>   Node 0, zone      DMA
>           spanned  4095
>           present  3999
>           managed  3840
>   Node 0, zone    DMA32
>           spanned  246652
>           present  245868
>           managed  222816   # +43,949 pages

[ Note the annotation I added to the output - might be useful in the changelog too. ]

So this patch adds around +17% of RAM to this 1 GB virtual system? That 
looks rather significant ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
  2023-01-05 10:48     ` Ingo Molnar
@ 2023-01-06  2:02       ` Aaron Thompson
  2023-01-06  3:12         ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Aaron Thompson @ 2023-01-06  2:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Rapoport, linux-mm, H. Peter Anvin, Alexander Potapenko,
	Andrew Morton, Andy Shevchenko, Ard Biesheuvel, Borislav Petkov,
	Darren Hart, Dave Hansen, David Rientjes, Dmitry Vyukov,
	Ingo Molnar, Marco Elver, Thomas Gleixner, kasan-dev, linux-efi,
	linux-kernel, platform-driver-x86, x86


On 2023-01-05 02:48, Ingo Molnar wrote:
> * Aaron Thompson <dev@aaront.org> wrote:
> 
>> For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
>> 
>> v6.2-rc2:
>>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>>   Node 0, zone      DMA
>>           spanned  4095
>>           present  3999
>>           managed  3840
>>   Node 0, zone    DMA32
>>           spanned  246652
>>           present  245868
>>           managed  178867
>> 
>> v6.2-rc2 + patch:
>>   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>>   Node 0, zone      DMA
>>           spanned  4095
>>           present  3999
>>           managed  3840
>>   Node 0, zone    DMA32
>>           spanned  246652
>>           present  245868
>>           managed  222816   # +43,949 pages
> 
> [ Note the annotation I added to the output - might be useful in the
> changelog too. ]
> 
> So this patch adds around +17% of RAM to this 1 GB virtual system? That
> looks rather significant ...
> 
> Thanks,
> 
> 	Ingo

It is significant, but I wouldn't describe it as being added. I would 
say that the system is currently losing 17% of RAM due to a bug, and 
this patch fixes that bug.

The actual numbers depend on the mappings given by the EFI, so they're 
largely out of our control. As an example, similar VMs that I run with 
the OVMF EFI lose about 3%. I couldn't say for sure which is the 
outlier, but my point is that the specific values are not really the 
focus, this is just an example that shows that the issue can be 
encountered in the wild with real impact. I know I'll be happy to get 
that memory back, whether it is 3% or 17% :)

Thanks,
-- Aaron

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
  2023-01-06  2:02       ` Aaron Thompson
@ 2023-01-06  3:12         ` Ingo Molnar
  0 siblings, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2023-01-06  3:12 UTC (permalink / raw)
  To: Aaron Thompson
  Cc: Mike Rapoport, linux-mm, H. Peter Anvin, Alexander Potapenko,
	Andrew Morton, Andy Shevchenko, Ard Biesheuvel, Borislav Petkov,
	Darren Hart, Dave Hansen, David Rientjes, Dmitry Vyukov,
	Ingo Molnar, Marco Elver, Thomas Gleixner, kasan-dev, linux-efi,
	linux-kernel, platform-driver-x86, x86


* Aaron Thompson <dev@aaront.org> wrote:

> 
> On 2023-01-05 02:48, Ingo Molnar wrote:
> > * Aaron Thompson <dev@aaront.org> wrote:
> > 
> > > For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
> > > 
> > > v6.2-rc2:
> > >   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
> > >   Node 0, zone      DMA
> > >           spanned  4095
> > >           present  3999
> > >           managed  3840
> > >   Node 0, zone    DMA32
> > >           spanned  246652
> > >           present  245868
> > >           managed  178867
> > > 
> > > v6.2-rc2 + patch:
> > >   # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
> > >   Node 0, zone      DMA
> > >           spanned  4095
> > >           present  3999
> > >           managed  3840
> > >   Node 0, zone    DMA32
> > >           spanned  246652
> > >           present  245868
> > >           managed  222816   # +43,949 pages
> > 
> > [ Note the annotation I added to the output - might be useful in the
> > changelog too. ]
> > 
> > So this patch adds around +17% of RAM to this 1 GB virtual system? That
> > looks rather significant ...
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> It is significant, but I wouldn't describe it as being added. I would say
> that the system is currently losing 17% of RAM due to a bug, and this patch
> fixes that bug.

To the end-user gaining +17% [or +3%] extra usable RAM compared to what 
they had before is what matters, and it's a big deal. :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
@ 2023-01-30  7:40     ` Xu Yu
  0 siblings, 0 replies; 9+ messages in thread
From: Xu Yu @ 2023-01-30  7:40 UTC (permalink / raw)
  To: baolin.wang, Mike Rapoport, linux-mm; +Cc: alikernel-developer

From: Aaron Thompson <dev@aaront.org>

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.

In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().

For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.

One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happen to fall within the deferred
init range, the pages will not be released and that memory will be
unavailable.

For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev@aaront.org>
---
 mm/memblock.c                     | 8 +++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 511d4783dcf1..fc3d8fbd2060 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 	end = PFN_DOWN(base + size);
 
 	for (; cursor < end; cursor++) {
-		memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+		/*
+		 * Reserved pages are always initialized by the end of
+		 * memblock_free_all() (by memmap_init() and, if deferred
+		 * initialization is enabled, memmap_init_reserved_pages()), so
+		 * these pages can be released directly to the buddy allocator.
+		 */
+		__free_pages_core(pfn_to_page(cursor), 0);
 		totalram_pages_inc();
 	}
 }
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late().
  2023-01-30  7:40     ` Xu Yu
  (?)
  (?)
@ 2023-01-30  7:47     ` Xu Yu
  -1 siblings, 0 replies; 9+ messages in thread
From: Xu Yu @ 2023-01-30  7:47 UTC (permalink / raw)
  To: baolin.wang, Mike Rapoport, linux-mm

Sorry for sending out the email accidentally, please ignore it.

Actually, I was trying this fix in my environment.

On 1/30/23 3:40 PM, Xu Yu wrote:
> From: Aaron Thompson <dev@aaront.org>
> 
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
> only releases pages to the buddy allocator if they are not in the
> deferred range. This is correct for free pages (as defined by
> for_each_free_mem_pfn_range_in_zone()) because free pages in the
> deferred range will be initialized and released as part of the deferred
> init process. memblock_free_pages() is called by memblock_free_late(),
> which is used to free reserved ranges after memblock_free_all() has
> run. All pages in reserved ranges have been initialized at that point,
> and accordingly, those pages are not touched by the deferred init
> process. This means that currently, if the pages that
> memblock_free_late() intends to release are in the deferred range, they
> will never be released to the buddy allocator. They will forever be
> reserved.
> 
> In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
> which is also correct for free pages but is not correct for reserved
> pages. KMSAN metadata for reserved pages is initialized by
> kmsan_init_shadow(), which runs shortly before memblock_free_all().
> 
> For both of these reasons, memblock_free_pages() should only be called
> for free pages, and memblock_free_late() should call __free_pages_core()
> directly instead.
> 
> One case where this issue can occur in the wild is EFI boot on
> x86_64. The x86 EFI code reserves all EFI boot services memory ranges
> via memblock_reserve() and frees them later via memblock_free_late()
> (efi_reserve_boot_services() and efi_free_boot_services(),
> respectively). If any of those ranges happen to fall within the deferred
> init range, the pages will not be released and that memory will be
> unavailable.
> 
> For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
> 
> v6.2-rc2:
>    # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>    Node 0, zone      DMA
>            spanned  4095
>            present  3999
>            managed  3840
>    Node 0, zone    DMA32
>            spanned  246652
>            present  245868
>            managed  178867
> 
> v6.2-rc2 + patch:
>    # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
>    Node 0, zone      DMA
>            spanned  4095
>            present  3999
>            managed  3840
>    Node 0, zone    DMA32
>            spanned  246652
>            present  245868
>            managed  222816
> 
> Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
> Signed-off-by: Aaron Thompson <dev@aaront.org>
> ---
>   mm/memblock.c                     | 8 +++++++-
>   2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 511d4783dcf1..fc3d8fbd2060 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
>   	end = PFN_DOWN(base + size);
>   
>   	for (; cursor < end; cursor++) {
> -		memblock_free_pages(pfn_to_page(cursor), cursor, 0);
> +		/*
> +		 * Reserved pages are always initialized by the end of
> +		 * memblock_free_all() (by memmap_init() and, if deferred
> +		 * initialization is enabled, memmap_init_reserved_pages()), so
> +		 * these pages can be released directly to the buddy allocator.
> +		 */
> +		__free_pages_core(pfn_to_page(cursor), 0);
>   		totalram_pages_inc();
>   	}
>   }
-- 
Thanks,
Yu



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-01-30  7:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-04  7:43 [PATCH 0/1] Pages not released from memblock to the buddy allocator Aaron Thompson
2023-01-05  1:43 ` David Rientjes
2023-01-05  4:17 ` [PATCH v2 " Aaron Thompson
     [not found] ` <20230105041650.1485-1-dev@aaront.org>
2023-01-05  4:17   ` [PATCH v2 1/1] mm: Always release pages to the buddy allocator in memblock_free_late() Aaron Thompson
2023-01-30  7:40     ` Xu Yu
2023-01-05 10:48     ` Ingo Molnar
2023-01-06  2:02       ` Aaron Thompson
2023-01-06  3:12         ` Ingo Molnar
2023-01-30  7:47     ` Xu Yu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.