linux-mm.kvack.org archive mirror
* [PATCH v4 0/2] mm: fix initialization of struct page for holes in memory layout
@ 2021-01-30 22:10 Mike Rapoport
  2021-01-30 22:10 ` [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory Mike Rapoport
  2021-01-30 22:10 ` [PATCH v4 2/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
  0 siblings, 2 replies; 14+ messages in thread
From: Mike Rapoport @ 2021-01-30 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Baoquan He, Borislav Petkov, Chris Wilson,
	David Hildenbrand, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Qian Cai, Sarvela, Tomi P, Thomas Gleixner,
	Vlastimil Babka, linux-kernel, linux-mm, stable, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

Commit 73a6e474cb37 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN") exposed several issues with
the memory map initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported back
in April [1]. It seemed back then that the problem was fixed, but a few
weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional
discussion at [3].

I didn't appreciate the variety of ways BIOSes can report memory in the
first megabyte, so v3 of this set caused boot failures on several x86
systems. Hopefully this time I covered all the bases.

The first patch here complements commit bde9cfa3afe4 ("x86/setup: don't
remove E820_TYPE_RAM for pfn 0") for the cases when BIOS reports the first
page as absent or reserved.

The second patch is a more robust version of d3921cb8be29 ("mm: fix
initialization of struct page for holes in memory layout") that can now
handle the above cases as well.

v4:
* make sure pages in the range 0 - start_pfn_of_lowest_zone are initialized
  even if an architecture hides them from the generic mm
* finally make pfn 0 on x86 a part of the memory visible to the generic
  mm as reserved memory.

v3: https://lore.kernel.org/lkml/20210111194017.22696-1-rppt@kernel.org
* use architectural zone constraints to set zone links for struct pages
  corresponding to the holes
* drop implicit update of memblock.memory
* add a patch that sets pfn 0 to E820_TYPE_RAM on x86

v2: https://lore.kernel.org/lkml/20201209214304.6812-1-rppt@kernel.org/
* added patch that adds all regions in memblock.reserved that do not
  overlap with memblock.memory to memblock.memory in the beginning of
  free_area_init()

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org

Mike Rapoport (2):
  x86/setup: always add the beginning of RAM as memblock.memory
  mm: fix initialization of struct page for holes in memory layout

 arch/x86/kernel/setup.c |  8 ++++
 mm/page_alloc.c         | 85 ++++++++++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 34 deletions(-)

-- 
2.28.0




* [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-30 22:10 [PATCH v4 0/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
@ 2021-01-30 22:10 ` Mike Rapoport
  2021-01-31  0:37   ` Linus Torvalds
  2021-02-01  9:32   ` David Hildenbrand
  2021-01-30 22:10 ` [PATCH v4 2/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
  1 sibling, 2 replies; 14+ messages in thread
From: Mike Rapoport @ 2021-01-30 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Baoquan He, Borislav Petkov, Chris Wilson,
	David Hildenbrand, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Qian Cai, Sarvela, Tomi P, Thomas Gleixner,
	Vlastimil Babka, linux-kernel, linux-mm, stable, x86

From: Mike Rapoport <rppt@linux.ibm.com>

The physical memory on an x86 system starts at address 0, but this is not
always reflected in the e820 map. For example, the BIOS can provide e820
entries like

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable

or

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x0000000000057fff] usable

In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
to memblock.memory and later during memory map initialization this range is
left outside any zone.

With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
page will have its zone link wrong no matter what value will be set there.

To avoid this inconsistency, add the beginning of RAM to memblock.memory.
Limit the added chunk size to match the reserved memory to avoid
registering memory that may be used by the firmware but never reserved at
e820__memblock_setup() time.

Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/setup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3412c4595efd..67c77ed6eef8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
 	 * Kconfig help text for X86_RESERVE_LOW.
 	 */
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
+
+	/*
+	 * Even if the firmware does not report the memory at address 0 as
+	 * usable, inform the generic memory management about its existence
+	 * to ensure it is a part of ZONE_DMA and the memory map for it is
+	 * properly initialized.
+	 */
+	memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
 }
 	
 /*
-- 
2.28.0
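
The situation this patch describes can be sketched with a toy model of
memblock, assuming two independent lists: one for registered memory and one
for reservations. The names toy_type, toy_add_range(), toy_contains() and
toy_demo() are hypothetical and exist only for illustration; the real
memblock_add() and memblock_reserve() also merge and sort regions. The 64K
value used for reserve_low is an assumption as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins, NOT the kernel's struct memblock_region/memblock_type. */
struct toy_region {
	uint64_t base, size;
};

struct toy_type {
	struct toy_region regions[8];
	size_t cnt;
};

/* Append a range to a list; no merging or sorting, unlike real memblock. */
static bool toy_add_range(struct toy_type *t, uint64_t base, uint64_t size)
{
	if (size == 0 || t->cnt == 8)
		return false;
	t->regions[t->cnt].base = base;
	t->regions[t->cnt].size = size;
	t->cnt++;
	return true;
}

/* True if addr falls inside any region of the list. */
static bool toy_contains(const struct toy_type *t, uint64_t addr)
{
	for (size_t i = 0; i < t->cnt; i++) {
		const struct toy_region *r = &t->regions[i];

		if (addr >= r->base && addr - r->base < r->size)
			return true;
	}
	return false;
}

static void toy_demo(void)
{
	struct toy_type memory = { .cnt = 0 }, reserved = { .cnt = 0 };
	const uint64_t reserve_low = 0x10000;	/* assumed: 64K */

	/* e820 reports only [0x1000, 0x9ffff] as usable. */
	toy_add_range(&memory, 0x1000, 0x9f000);
	/* The low range is reserved, as in trim_low_memory_range(). */
	toy_add_range(&reserved, 0, reserve_low);

	/* Address 0 is reserved, yet it is not part of registered memory. */
	assert(toy_contains(&reserved, 0));
	assert(!toy_contains(&memory, 0));

	/* The extra add from this patch makes address 0 visible as memory. */
	toy_add_range(&memory, 0, reserve_low);
	assert(toy_contains(&memory, 0));
}
```

Calling toy_demo() walks through the first e820 layout quoted above: the
reservation alone does not register address 0 as memory in this model, and
the additional add does.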




* [PATCH v4 2/2] mm: fix initialization of struct page for holes in memory layout
  2021-01-30 22:10 [PATCH v4 0/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
  2021-01-30 22:10 ` [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory Mike Rapoport
@ 2021-01-30 22:10 ` Mike Rapoport
  1 sibling, 0 replies; 14+ messages in thread
From: Mike Rapoport @ 2021-01-30 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Baoquan He, Borislav Petkov, Chris Wilson,
	David Hildenbrand, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Qian Cai, Sarvela, Tomi P, Thomas Gleixner,
	Vlastimil Babka, linux-kernel, linux-mm, stable, x86

From: Mike Rapoport <rppt@linux.ibm.com>

There could be struct pages that are not backed by actual physical
memory.  This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized by the init_unavailable_mem()
function, which iterates through PFNs in holes in memblock.memory. If
there is a struct page corresponding to a PFN, the fields of this page
are set to default values and the page is marked as Reserved.

init_unavailable_mem() does not take into account the zone and node the
page belongs to and sets both zone and node links in struct page to zero.

On a system that has firmware reserved holes in a zone above ZONE_DMA,
for instance in the configuration below:

	# grep -A1 E820 /proc/iomem
	7a17b000-7a216fff : Unknown E820 type
	7a217000-7bffffff : System RAM

an unset zone link in struct page will trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
in struct page) in the same pageblock.

Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly set up the zone link and use the node ID of the
adjacent range in memblock.memory to set the node link.

Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
---
 mm/page_alloc.c | 85 +++++++++++++++++++++++++++++--------------------
 1 file changed, 51 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..444642393bb6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7080,23 +7080,27 @@ void __init free_area_init_memoryless_node(int nid)
  * Initialize all valid struct pages in the range [spfn, epfn) and mark them
  * PageReserved(). Return the number of struct pages that were initialized.
  */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
+static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int nid)
 {
-	unsigned long pfn;
+	unsigned long pfn, zone_epfn, zone_spfn = 0;
 	u64 pgcnt = 0;
 
+	if (zone)
+		zone_spfn = arch_zone_highest_possible_pfn[zone - 1];
+	zone_epfn = arch_zone_highest_possible_pfn[zone];
+
+	spfn = clamp(spfn, zone_spfn, zone_epfn);
+	epfn = clamp(epfn, zone_spfn, zone_epfn);
+
 	for (pfn = spfn; pfn < epfn; pfn++) {
 		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
 			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
 				+ pageblock_nr_pages - 1;
 			continue;
 		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+
+		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
 		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}
@@ -7105,51 +7109,64 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 }
 
 /*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init().
  *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ * - physical memory bank size is not necessarily an exact multiple of the
+ *   arbitrary section size
+ * - early reserved memory may not be listed in memblock.memory
+ * - memory layouts defined with memmap= kernel parameter may not align
+ *   nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ * - PG_reserved is set
+ * - zone link is set according to the architecture constraints
+ * - node is set to node id of the next populated region except for the
+ *   trailing hole where last node id is used
  */
-static void __init init_unavailable_mem(void)
+static void __init init_zone_unavailable_mem(int zone)
 {
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
+	unsigned long start, end;
+	int i, nid;
+	u64 pgcnt;
+	unsigned long next = 0;
 
 	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
+	 * Loop through holes in memblock.memory and initialize struct
+	 * pages corresponding to these holes
 	 */
 	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
 		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
+			pgcnt += init_unavailable_range(next, start, zone, nid);
 		next = end;
 	}
 
 	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
+	 * Last section may surpass the actual end of memory (e.g. we can
+	 * have 1Gb section and 512Mb of RAM populated).
+	 * Make sure that memmap has a well defined state in this case.
 	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
+	end = round_up(max_pfn, PAGES_PER_SECTION);
+	pgcnt += init_unavailable_range(next, end, zone, nid);
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
 	 * firmware is using some of this memory, or for some other reasons.
 	 */
 	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
+		pr_info("Zone %s: zeroed struct page in unavailable ranges: %lld pages", zone_names[zone], pgcnt);
+}
+
+static void __init init_unavailable_mem(void)
+{
+	int zone;
+
+	for (zone = 0; zone < ZONE_MOVABLE; zone++)
+		init_zone_unavailable_mem(zone);
 }
 #else
 static inline void __init init_unavailable_mem(void)
-- 
2.28.0
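
The clamping step at the top of init_unavailable_range() can be sketched in
isolation. This is a simplified user-space model, not the kernel code:
toy_clamp(), toy_hole_pages_in_zone() and toy_zone_demo() are hypothetical
names, and the zone limits below assume an x86-64 layout with 4K pages
(ZONE_DMA below 16M, ZONE_DMA32 below 4G, and an arbitrary 8G end for
ZONE_NORMAL).

```c
#include <assert.h>

#define TOY_NR_ZONES 3

/* Clamp v into [lo, hi]; a plain-function stand-in for the clamp() macro. */
static unsigned long toy_clamp(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Count the pfns of the hole [spfn, epfn) that fall inside zone `zone`.
 * zone_max[i] is the exclusive upper pfn limit of zone i, playing the role
 * that arch_zone_highest_possible_pfn[] plays in the patch.
 */
static unsigned long toy_hole_pages_in_zone(const unsigned long *zone_max,
					    int zone, unsigned long spfn,
					    unsigned long epfn)
{
	unsigned long zone_spfn = zone ? zone_max[zone - 1] : 0;
	unsigned long zone_epfn = zone_max[zone];

	spfn = toy_clamp(spfn, zone_spfn, zone_epfn);
	epfn = toy_clamp(epfn, zone_spfn, zone_epfn);

	return epfn - spfn;
}

static void toy_zone_demo(void)
{
	/* Assumed limits: DMA < 16M, DMA32 < 4G, NORMAL < 8G (4K pages). */
	const unsigned long zone_max[TOY_NR_ZONES] = {
		0x1000, 0x100000, 0x200000
	};
	/* The "Unknown E820 type" hole 0x7a17b000-0x7a216fff as pfns. */
	const unsigned long spfn = 0x7a17b, epfn = 0x7a217;

	/* The hole lies entirely in the second zone (DMA32 in this model). */
	assert(toy_hole_pages_in_zone(zone_max, 0, spfn, epfn) == 0);
	assert(toy_hole_pages_in_zone(zone_max, 1, spfn, epfn) == 156);
	assert(toy_hole_pages_in_zone(zone_max, 2, spfn, epfn) == 0);
}
```

Because each zone clamps the same hole to its own limits, a hole that
crosses a zone boundary is split between the two zones, and every pfn in it
is attributed to exactly one zone.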




* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-30 22:10 ` [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory Mike Rapoport
@ 2021-01-31  0:37   ` Linus Torvalds
  2021-01-31  8:03     ` Mike Rapoport
  2021-02-01  9:32   ` David Hildenbrand
  1 sibling, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2021-01-31  0:37 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, David Hildenbrand, H. Peter Anvin, Ingo Molnar,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	Linux Kernel Mailing List, Linux-MM, stable,
	the arch/x86 maintainers

On Sat, Jan 30, 2021 at 2:10 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> to memblock.memory and later during memory map initialization this range is
> left outside any zone.

Honestly, this just sounds like memblock being stupid in the first place.

Why aren't these zones padded to sane alignments?

This patch smells like working around the memblock code being fragile
rather than a real fix.

That's *particularly* true when the very line above it did a
"memblock_reserve()" of the exact same range that the memblock_add()
"adds".

              Linus



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-31  0:37   ` Linus Torvalds
@ 2021-01-31  8:03     ` Mike Rapoport
  2021-01-31 21:49       ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Rapoport @ 2021-01-31  8:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, David Hildenbrand, H. Peter Anvin, Ingo Molnar,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	Linux Kernel Mailing List, Linux-MM, stable,
	the arch/x86 maintainers

On Sat, Jan 30, 2021 at 04:37:54PM -0800, Linus Torvalds wrote:
> On Sat, Jan 30, 2021 at 2:10 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> > to memblock.memory and later during memory map initialization this range is
> > left outside any zone.
> 
> Honestly, this just sounds like memblock being stupid in the first place.
> 
> Why aren't these zones padded to sane alignments?
 
The implicit alignment of zones would be a guess. What alignment would be
sane here? 1M? MAX_ORDER? pageblock_order?

I'm not sure that if an architecture reports its memory at X and we use,
say, round_down(X, 1M) for node[0]->node_start_pfn and
zone[0]->zone_start_pfn it wouldn't cause boot failure on some system out
there in the wild.

> This patch smells like working around the memblock code being fragile
> rather than a real fix.
>
> That's *particularly* true when the very line above it did a
> "memblock_reserve()" of the exact same range that the memblock_add()
> "adds".

The most correct thing to do would have been to 

	memblock_add(0, end_of_first_memory_bank);

Somewhere at e820__memblock_setup().

But that would mean we also must change the way e820__memblock_setup()
reserves memory and that seemed to me like really asking for trouble, so
I've limited the registration of memory to the range that's for sure
reserved.

A part of the problem is that x86 adds only usable memory to
memblock.memory omitting holes and reserved areas, while free_area_init()
presumes that memblock.memory covers populated physical memory.

I've tried implicitly adding ranges from memblock.reserved to
memblock.memory if they were not there and it broke some arm machines:

https://lore.kernel.org/lkml/127999c4-7d56-0c36-7f88-8e1a5c934cae@collabora.com

I do feel that free_area_init() is fragile and no doubt there is room for
improvement there. But I think the safer way forward is to reduce
inconsistencies between arch and generic code, so that we won't need to
guess what the memory layout is at free_area_init() time.
 
>               Linus

-- 
Sincerely yours,
Mike.



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-31  8:03     ` Mike Rapoport
@ 2021-01-31 21:49       ` Linus Torvalds
  2021-02-01 14:06         ` Mike Rapoport
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2021-01-31 21:49 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, David Hildenbrand, H. Peter Anvin, Ingo Molnar,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	Linux Kernel Mailing List, Linux-MM, stable,
	the arch/x86 maintainers

On Sun, Jan 31, 2021 at 12:04 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> >
> > That's *particularly* true when the very line above it did a
> > "memblock_reserve()" of the exact same range that the memblock_add()
> > "adds".
>
> The most correct thing to do would have been to
>
>         memblock_add(0, end_of_first_memory_bank);
>
> Somewhere at e820__memblock_setup().

You miss my complaint.

Why does the memblock code care about this magical "memblock_add()",
when we just told it that the SAME REGION is reserved by doing a
"memblock_reserve()"?

IOW, I'm not interested in "the correct thing to do would have been
[another memblock_add()]". I'm saying that the memblock code itself is
being confused, and no additional thing should have been required at
all, because we already *did* that memblock_reserve().

See?

Honestly, I'm not seeing it being a good thing to move further towards
memblock code as the primary model for memory initialization, when the
memblock code is so confused.

              Linus



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-30 22:10 ` [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory Mike Rapoport
  2021-01-31  0:37   ` Linus Torvalds
@ 2021-02-01  9:32   ` David Hildenbrand
  2021-02-01 11:26     ` Baoquan He
  2021-02-01 14:30     ` Mike Rapoport
  1 sibling, 2 replies; 14+ messages in thread
From: David Hildenbrand @ 2021-02-01  9:32 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Andrea Arcangeli, Baoquan He, Borislav Petkov, Chris Wilson,
	H. Peter Anvin, Ingo Molnar, Linus Torvalds, Łukasz Majczak,
	Mel Gorman, Michal Hocko, Mike Rapoport, Qian Cai, Sarvela,
	Tomi P, Thomas Gleixner, Vlastimil Babka, linux-kernel, linux-mm,
	stable, x86

On 30.01.21 23:10, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The physical memory on an x86 system starts at address 0, but this is not
> always reflected in e820 map. For example, the BIOS can have e820 entries
> like
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
> 
> or
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x0000000000057fff] usable
> 
> In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> to memblock.memory and later during memory map initialization this range is
> left outside any zone.
> 
> With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
> page will have its zone link wrong no matter what value will be set there.
> 
> To avoid this inconsistency, add the beginning of RAM to memblock.memory.
> Limit the added chunk size to match the reserved memory to avoid
> registering memory that may be used by the firmware but never reserved at
> e820__memblock_setup() time.
> 
> Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: stable@vger.kernel.org
> ---
>   arch/x86/kernel/setup.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3412c4595efd..67c77ed6eef8 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
>   	 * Kconfig help text for X86_RESERVE_LOW.
>   	 */
>   	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> +
> +	/*
> +	 * Even if the firmware does not report the memory at address 0 as
> +	 * usable, inform the generic memory management about its existence
> +	 * to ensure it is a part of ZONE_DMA and the memory map for it is
> +	 * properly initialized.
> +	 */
> +	memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
>   }
>   	
>   /*
> 

I think, to make that code more robust, and to not rely on archs to do 
the right thing, we should do something like

1) Make sure in free_area_init() that each PFN with a memmap (i.e.,
falls into a partially present section) is spanned by a zone; that would
include PFN 0 in this case.

2) In init_zone_unavailable_mem(), similar to round_up(max_pfn, 
PAGES_PER_SECTION) handling, consider range
	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
which would handle in the x86-64 case [0..0] and, therefore, initialize 
PFN 0.

Also, I think the special case of PFN 0 is analogous to the
round_up(max_pfn, PAGES_PER_SECTION) handling in
init_zone_unavailable_mem(): who guarantees that these PFNs above the
highest present PFN are actually spanned by a zone?

I'd suggest going through all zone ranges in free_area_init() first, 
dealing with zones that have "not section aligned start/end", clamping 
them up/down if required such that no holes within a section are left 
uncovered by a zone.

-- 
Thanks,

David / dhildenb
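
The two section-alignment ranges mentioned above can be sketched as plain
arithmetic. This is only an illustration: the toy_* names are hypothetical,
ranges are half-open [start, end) rather than inclusive, and the section
size assumes 128 MiB sections with 4 KiB pages, which may differ per
architecture and configuration.

```c
#include <assert.h>

/* Assumption: 128 MiB sections with 4 KiB pages. */
#define TOY_PAGES_PER_SECTION 32768UL

/* Half-open pfn range [start, end). */
struct toy_pfn_range {
	unsigned long start, end;
};

static unsigned long toy_round_down(unsigned long x, unsigned long align)
{
	return x - x % align;
}

static unsigned long toy_round_up(unsigned long x, unsigned long align)
{
	return toy_round_down(x + align - 1, align);
}

/* pfns between the start of the first section and the lowest present pfn. */
static struct toy_pfn_range toy_leading_hole(unsigned long min_pfn)
{
	struct toy_pfn_range r = {
		.start = toy_round_down(min_pfn, TOY_PAGES_PER_SECTION),
		.end = min_pfn,
	};
	return r;
}

/* pfns between the end of memory and the end of the last section. */
static struct toy_pfn_range toy_trailing_hole(unsigned long max_pfn)
{
	struct toy_pfn_range r = {
		.start = max_pfn,
		.end = toy_round_up(max_pfn, TOY_PAGES_PER_SECTION),
	};
	return r;
}

static void toy_section_demo(void)
{
	/* Lowest present pfn is 1: the leading hole is exactly pfn 0. */
	struct toy_pfn_range lead = toy_leading_hole(1);
	/* Memory ends at 0x7c000000, i.e. max_pfn 0x7c000. */
	struct toy_pfn_range trail = toy_trailing_hole(0x7c000);

	assert(lead.start == 0 && lead.end == 1);
	assert(trail.start == 0x7c000 && trail.end == 0x80000);
}
```

With min_pfn equal to 1 the leading range covers only PFN 0, which matches
the [0..0] case described for x86-64. When max_pfn is already section
aligned, the trailing range is empty.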




* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01  9:32   ` David Hildenbrand
@ 2021-02-01 11:26     ` Baoquan He
  2021-02-01 14:34       ` Mike Rapoport
  2021-02-01 14:30     ` Mike Rapoport
  1 sibling, 1 reply; 14+ messages in thread
From: Baoquan He @ 2021-02-01 11:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Andrea Arcangeli, Borislav Petkov,
	Chris Wilson, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	linux-kernel, linux-mm, stable, x86

On 02/01/21 at 10:32am, David Hildenbrand wrote:
> On 30.01.21 23:10, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > The physical memory on an x86 system starts at address 0, but this is not
> > always reflected in e820 map. For example, the BIOS can have e820 entries
> > like
> > 
> > [    0.000000] BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
> > 
> > or
> > 
> > [    0.000000] BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x0000000000057fff] usable
> > 
> > In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> > to memblock.memory and later during memory map initialization this range is
> > left outside any zone.
> > 
> > With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
> > page will have its zone link wrong no matter what value will be set there.
> > 
> > To avoid this inconsistency, add the beginning of RAM to memblock.memory.
> > Limit the added chunk size to match the reserved memory to avoid
> > registering memory that may be used by the firmware but never reserved at
> > e820__memblock_setup() time.
> > 
> > Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: stable@vger.kernel.org
> > ---
> >   arch/x86/kernel/setup.c | 8 ++++++++
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3412c4595efd..67c77ed6eef8 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
> >   	 * Kconfig help text for X86_RESERVE_LOW.
> >   	 */
> >   	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> > +
> > +	/*
> > +	 * Even if the firmware does not report the memory at address 0 as
> > +	 * usable, inform the generic memory management about its existence
> > +	 * to ensure it is a part of ZONE_DMA and the memory map for it is
> > +	 * properly initialized.
> > +	 */
> > +	memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
> >   }
> >   	
> >   /*
> > 
> 
> I think, to make that code more robust, and to not rely on archs to do the
> right thing, we should do something like
> 
> 1) Make sure in free_area_init() that each PFN with a memmap (i.e., falls
> into a partially present section) is spanned by a zone; that would include
> PFN 0 in this case.
> 
> 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> PAGES_PER_SECTION) handling, consider range
> 	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
> 0.

Sounds reasonable. Maybe we can change to get the real expected lowest
pfn from find_min_pfn_for_node() by iterating memblock.memory and
memblock.reserved and comparing.

> 
> Also, I think the special case of PFN 0 is analogous to the
> round_up(max_pfn, PAGES_PER_SECTION) handling in
> init_zone_unavailable_mem(): who guarantees that these PFNs above the
> highest present PFN are actually spanned by a zone?
> 
> I'd suggest going through all zone ranges in free_area_init() first, dealing
> with zones that have "not section aligned start/end", clamping them up/down
> if required such that no holes within a section are left uncovered by a
> zone.
> 
> -- 
> Thanks,
> 
> David / dhildenb




* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-01-31 21:49       ` Linus Torvalds
@ 2021-02-01 14:06         ` Mike Rapoport
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Rapoport @ 2021-02-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, David Hildenbrand, H. Peter Anvin, Ingo Molnar,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	Linux Kernel Mailing List, Linux-MM, stable,
	the arch/x86 maintainers

On Sun, Jan 31, 2021 at 01:49:27PM -0800, Linus Torvalds wrote:
> On Sun, Jan 31, 2021 at 12:04 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > >
> > > That's *particularly* true when the very line above it did a
> > > "memblock_reserve()" of the exact same range that the memblock_add()
> > > "adds".
> >
> > The most correct thing to do would have been to
> >
> >         memblock_add(0, end_of_first_memory_bank);
> >
> > Somewhere at e820__memblock_setup().
> 
> You miss my complaint.
> 
> Why does the memblock code care about this magical "memblock_add()",
> when we just told it that the SAME REGION is reserved by doing a
> "memblock_reserve()"?
> 
> IOW, I'm not interested in "the correct thing to do would have been
> [another memblock_add()]". I'm saying that the memblock code itself is
> being confused, and no additional thing should have been required at
> all, because we already *did* that memblock_reserve().
> 
> See?

There is nothing magical about memblock_add().

Memblock presumes that arch code uses memblock_add() to register populated
physical memory ranges and memblock_reserve() to protect memory ranges that
should not be touched. These ranges do not necessarily overlap, so there
may be reserved ranges that do not have corresponding registered memory.

This lets architectures say "here are the memory banks I have" and "this
memory is in use" (or even "this memory _might_ be in use") independently
of each other.

The downside is that if there is a reserved range there is no way to tell
whether it is backed by populated memory.

We could change these semantics and enforce the overlap, e.g. by
implicitly adding all the reserved ranges to the registered memory.
I've already tried that and found out that there are systems that rely
on memblock's ability to track reserved and available ranges independently.
For example, the arm systems I mentioned in the previous mail always have a
reserved chunk at 0xfe000000 in their DTS, but they may have only 2G of
memory actually populated.

Now, on x86 there has been a gap between e820 and memblock since the 2.6
days. As of now, only E820_TYPE_RAM is added to memblock as memory, some of
the E820_*_RESERVED are reserved and on top there are reservations of the
memory that's known to be used by BIOS or kernel.

I'm trying to close this gap in small steps, with changes that I believe
will not break so many things at once that it becomes unmanageable.

> Honestly, I'm not seeing it being a good thing to move further towards
> memblock code as the primary model for memory initialization, when the
> memblock code is so confused.

I'm not sure I follow you here.
If I'm not mistaken, memblock has been used as the primary model for memmap
and page allocator initialization for almost a decade now...
 
>               Linus

-- 
Sincerely yours,
Mike.
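
The point about reserved and registered ranges being tracked independently
can be sketched with a small overlap computation. This is an illustration
only: toy_backed_bytes() and toy_backed_demo() are hypothetical names, not
memblock API, and the memory bank base (0), the size of the reserved chunk
(16M) and the 64K low reservation are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes of the reserved range [res_base, res_base + res_size)
 * that are covered by the memory range [mem_base, mem_base + mem_size).
 */
static uint64_t toy_backed_bytes(uint64_t mem_base, uint64_t mem_size,
				 uint64_t res_base, uint64_t res_size)
{
	uint64_t start = mem_base > res_base ? mem_base : res_base;
	uint64_t mem_end = mem_base + mem_size;
	uint64_t res_end = res_base + res_size;
	uint64_t end = mem_end < res_end ? mem_end : res_end;

	return end > start ? end - start : 0;
}

static void toy_backed_demo(void)
{
	/*
	 * arm-like case: 2G of memory (assumed to start at 0) and a
	 * reserved chunk at 0xfe000000 (assumed size 16M). The
	 * reservation is not backed by any registered memory.
	 */
	assert(toy_backed_bytes(0, 0x80000000ULL,
				0xfe000000ULL, 0x1000000) == 0);

	/*
	 * x86-like case: usable memory [0x1000, 0x9ffff] and an assumed
	 * 64K low reservation. Everything but the first page is backed.
	 */
	assert(toy_backed_bytes(0x1000, 0x9f000, 0, 0x10000) == 0xf000);
}
```

In both cases the reservation alone says nothing about whether memory is
populated behind it; that information comes only from the registered
memory ranges.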



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01  9:32   ` David Hildenbrand
  2021-02-01 11:26     ` Baoquan He
@ 2021-02-01 14:30     ` Mike Rapoport
  2021-02-01 14:32       ` David Hildenbrand
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Rapoport @ 2021-02-01 14:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	linux-kernel, linux-mm, stable, x86

On Mon, Feb 01, 2021 at 10:32:44AM +0100, David Hildenbrand wrote:
> On 30.01.21 23:10, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > The physical memory on an x86 system starts at address 0, but this is not
> > always reflected in e820 map. For example, the BIOS can have e820 entries
> > like
> > 
> > [    0.000000] BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
> > 
> > or
> > 
> > [    0.000000] BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x0000000000057fff] usable
> > 
> > In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> > to memblock.memory and later during memory map initialization this range is
> > left outside any zone.
> > 
> > With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
> > page will have it's zone link wrong no matter what value will be set there.
> > 
> > To avoid this inconsistency, add the beginning of RAM to memblock.memory.
> > Limit the added chunk size to match the reserved memory to avoid
> > registering memory that may be used by the firmware but never reserved at
> > e820__memblock_setup() time.
> > 
> > Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: stable@vger.kernel.org
> > ---
> >   arch/x86/kernel/setup.c | 8 ++++++++
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3412c4595efd..67c77ed6eef8 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
> >   	 * Kconfig help text for X86_RESERVE_LOW.
> >   	 */
> >   	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> > +
> > +	/*
> > +	 * Even if the firmware does not report the memory at address 0 as
> > +	 * usable, inform the generic memory management about its existence
> > +	 * to ensure it is a part of ZONE_DMA and the memory map for it is
> > +	 * properly initialized.
> > +	 */
> > +	memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
> >   }
> >   	
> >   /*
> > 
> 
> I think, to make that code more robust, and to not rely on archs to do the
> right thing, we should do something like
> 
> 1) Make sure in free_area_init() that each PFN with a memmap (i.e., falls
> into a partial present section) is spanned by a zone; that would include PFN
> 0 in this case.
> 
> 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> PAGES_PER_SECTION) handling, consider range
> 	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
> 0.
> 
> Also, I think the special-case of PFN 0 is analogous to the
> round_up(max_pfn, PAGES_PER_SECTION) handling in
> init_zone_unavailable_mem(): who guarantees that these PFN above the highest
> present PFN are actually spanned by a zone?
> 
> I'd suggest going through all zone ranges in free_area_init() first, dealing
> with zones that have "not section aligned start/end", clamping them up/down
> if required such that no holes within a section are left uncovered by a
> zone.

I thought about changing the way zone extents are calculated so that zone
start/end will always be on a section boundary, but zone->zone_start_pfn
depends on node->node_start_pfn, which is defined by hardware, and expanding
a node to make its start pfn aligned to the section boundary might violate
the HW addressing scheme.

Maybe this could never happen, or maybe it's not really important as the
pages there will be reserved anyway, but I'm not sure I can estimate all
the implications. 
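
As a worked example of the range in suggestion 2) above, here is a small
user-space sketch in C. The TOY_* names and helpers are hypothetical, and
the constants assume x86-64 with 4K pages and 128M sections; other
configurations use different values. With the lowest usable PFN at 1, as in
the e820 maps quoted above, the range below min_pfn in the same section is
just PFN 0.

```c
#include <assert.h>
#include <limits.h>

/* Assumed x86-64 values: 4K pages and 128M sections. */
#define TOY_PAGE_SHIFT		12
#define TOY_SECTION_SIZE_BITS	27
#define TOY_PAGES_PER_SECTION	(1UL << (TOY_SECTION_SIZE_BITS - TOY_PAGE_SHIFT))

/* Round pfn down to the first PFN of its section. */
static unsigned long toy_section_start(unsigned long pfn)
{
	return pfn & ~(TOY_PAGES_PER_SECTION - 1);
}

/* Round pfn up to the next section boundary (unchanged if aligned). */
static unsigned long toy_section_end(unsigned long pfn)
{
	/* Guard against wrap-around in the addition below. */
	assert(pfn <= ULONG_MAX - (TOY_PAGES_PER_SECTION - 1));

	return (pfn + TOY_PAGES_PER_SECTION - 1) & ~(TOY_PAGES_PER_SECTION - 1);
}

/*
 * Number of PFNs in [round_down(min_pfn), min_pfn - 1]: struct pages that
 * exist in the section of min_pfn but sit below the lowest reported PFN.
 */
static unsigned long toy_leading_hole_pages(unsigned long min_pfn)
{
	return min_pfn - toy_section_start(min_pfn);
}
```

Under these assumptions toy_leading_hole_pages(1) is 1, matching the [0..0]
range mentioned above, and toy_section_end() gives the upper bound used for
the analogous hole after max_pfn.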

-- 
Sincerely yours,
Mike.


* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01 14:30     ` Mike Rapoport
@ 2021-02-01 14:32       ` David Hildenbrand
  2021-02-01 23:22         ` Mike Rapoport
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2021-02-01 14:32 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	linux-kernel, linux-mm, stable, x86

On 01.02.21 15:30, Mike Rapoport wrote:
> On Mon, Feb 01, 2021 at 10:32:44AM +0100, David Hildenbrand wrote:
>> On 30.01.21 23:10, Mike Rapoport wrote:
>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>
>>> The physical memory on an x86 system starts at address 0, but this is not
>>> always reflected in e820 map. For example, the BIOS can have e820 entries
>>> like
>>>
>>> [    0.000000] BIOS-provided physical RAM map:
>>> [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
>>>
>>> or
>>>
>>> [    0.000000] BIOS-provided physical RAM map:
>>> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
>>> [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x0000000000057fff] usable
>>>
>>> In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
>>> to memblock.memory and later during memory map initialization this range is
>>> left outside any zone.
>>>
>>> With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
>>> page will have it's zone link wrong no matter what value will be set there.
>>>
>>> To avoid this inconsistency, add the beginning of RAM to memblock.memory.
>>> Limit the added chunk size to match the reserved memory to avoid
>>> registering memory that may be used by the firmware but never reserved at
>>> e820__memblock_setup() time.
>>>
>>> Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>>    arch/x86/kernel/setup.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index 3412c4595efd..67c77ed6eef8 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
>>>    	 * Kconfig help text for X86_RESERVE_LOW.
>>>    	 */
>>>    	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
>>> +
>>> +	/*
>>> +	 * Even if the firmware does not report the memory at address 0 as
>>> +	 * usable, inform the generic memory management about its existence
>>> +	 * to ensure it is a part of ZONE_DMA and the memory map for it is
>>> +	 * properly initialized.
>>> +	 */
>>> +	memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
>>>    }
>>>    	
>>>    /*
>>>
>>
>> I think, to make that code more robust, and to not rely on archs to do the
>> right thing, we should do something like
>>
>> 1) Make sure in free_area_init() that each PFN with a memmap (i.e., falls
>> into a partial present section) is spanned by a zone; that would include PFN
>> 0 in this case.
>>
>> 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
>> PAGES_PER_SECTION) handling, consider range
>> 	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
>> which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
>> 0.
>>
>> Also, I think the special-case of PFN 0 is analogous to the
>> round_up(max_pfn, PAGES_PER_SECTION) handling in
>> init_zone_unavailable_mem(): who guarantees that these PFN above the highest
>> present PFN are actually spanned by a zone?
>>
>> I'd suggest going through all zone ranges in free_area_init() first, dealing
>> with zones that have "not section aligned start/end", clamping them up/down
>> if required such that no holes within a section are left uncovered by a
>> zone.
> 
> I thought about changing the way zone extents are calculated so that zone
> start/end will be always on a section boundary, but zone->zone_start_pfn
> depends on node->node_start_pfn which is defined by hardware and expanding
> a node to make its start pfn aligned at the section boundary might violate
> the HW addressing scheme.
> 
> Maybe this could never happen, or maybe it's not really important as the
> pages there will be reserved anyway, but I'm not sure I can estimate all
> the implications.
> 

I'm suggesting that we let zone (+node?) ranges cover memory holes with a
valid memmap, not that we move actual memory between nodes/zones.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01 11:26     ` Baoquan He
@ 2021-02-01 14:34       ` Mike Rapoport
  2021-02-01 14:55         ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Rapoport @ 2021-02-01 14:34 UTC (permalink / raw)
  To: Baoquan He
  Cc: David Hildenbrand, Andrew Morton, Andrea Arcangeli,
	Borislav Petkov, Chris Wilson, H. Peter Anvin, Ingo Molnar,
	Linus Torvalds, Łukasz Majczak, Mel Gorman, Michal Hocko,
	Mike Rapoport, Qian Cai, Sarvela, Tomi P, Thomas Gleixner,
	Vlastimil Babka, linux-kernel, linux-mm, stable, x86

On Mon, Feb 01, 2021 at 07:26:05PM +0800, Baoquan He wrote:
> On 02/01/21 at 10:32am, David Hildenbrand wrote:
> > 
> > 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> > PAGES_PER_SECTION) handling, consider range
> > 	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> > which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
> > 0.
> 
> Sounds reasonable. Maybe we can change to get the real expected lowest
> pfn from find_min_pfn_for_node() by iterating memblock.memory and
> memblock.reserved and comparing.

As I've found out the hard way [1], reserved memory is not necessarily
present.

There could be a system that, instead of reserving memory at 0xfe000000 as
in Guillaume's report, has it reserved at 0x0 and populated only from the
first gigabyte...
 
[1] https://lore.kernel.org/lkml/127999c4-7d56-0c36-7f88-8e1a5c934cae@collabora.com


-- 
Sincerely yours,
Mike.


* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01 14:34       ` Mike Rapoport
@ 2021-02-01 14:55         ` Baoquan He
  0 siblings, 0 replies; 14+ messages in thread
From: Baoquan He @ 2021-02-01 14:55 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Andrew Morton, Andrea Arcangeli,
	Borislav Petkov, Chris Wilson, H. Peter Anvin, Ingo Molnar,
	Linus Torvalds, Łukasz Majczak, Mel Gorman, Michal Hocko,
	Mike Rapoport, Qian Cai, Sarvela, Tomi P, Thomas Gleixner,
	Vlastimil Babka, linux-kernel, linux-mm, stable, x86

On 02/01/21 at 04:34pm, Mike Rapoport wrote:
> On Mon, Feb 01, 2021 at 07:26:05PM +0800, Baoquan He wrote:
> > On 02/01/21 at 10:32am, David Hildenbrand wrote:
> > > 
> > > 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> > > PAGES_PER_SECTION) handling, consider range
> > > 	[round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> > > which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
> > > 0.
> > 
> > Sounds reasonable. Maybe we can change to get the real expected lowest
> > pfn from find_min_pfn_for_node() by iterating memblock.memory and
> > memblock.reserved and comparing.
> 
> As I've found out the hard way [1], reserved memory is not necessary present.
> 
> There could be a system that instead of reserving memory at 0xfe000000 like
> in Guillaume's report, could have it reserved at 0x0 and populated only
> from the first gigabyte...

OK. I thought that we could even compare memblock.memory.regions[0].base
with memblock.reserved.regions[0].base and take the smaller one as the
lowest pfn and assign it to arch_zone_lowest_possible_pfn[0]. When we
try to get the present pages, we still check memblock.memory with
for_each_mem_pfn_range(). Since we will consider and take reserved
memory into the zone anyway, arch_zone_lowest_possible_pfn[] only impacts
the boundary of the zone. Just a rough thought, please ignore it if I've
missed something.

Thanks
Baoquan



* Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory
  2021-02-01 14:32       ` David Hildenbrand
@ 2021-02-01 23:22         ` Mike Rapoport
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Rapoport @ 2021-02-01 23:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Andrea Arcangeli, Baoquan He, Borislav Petkov,
	Chris Wilson, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Łukasz Majczak, Mel Gorman, Michal Hocko, Mike Rapoport,
	Qian Cai, Sarvela, Tomi P, Thomas Gleixner, Vlastimil Babka,
	linux-kernel, linux-mm, stable, x86

On Mon, Feb 01, 2021 at 03:32:33PM +0100, David Hildenbrand wrote:
> On 01.02.21 15:30, Mike Rapoport wrote:
> > > 
> > > I'd suggest going through all zone ranges in free_area_init() first, dealing
> > > with zones that have "not section aligned start/end", clamping them up/down
> > > if required such that no holes within a section are left uncovered by a
> > > zone.
> > 
> > I thought about changing the way zone extents are calculated so that zone
> > start/end will be always on a section boundary, but zone->zone_start_pfn
> > depends on node->node_start_pfn which is defined by hardware and expanding
> > a node to make its start pfn aligned at the section boundary might violate
> > the HW addressing scheme.
> > 
> > Maybe this could never happen, or maybe it's not really important as the
> > pages there will be reserved anyway, but I'm not sure I can estimate all
> > the implications.
> > 
> 
> I'm suggesting to let zone (+node?) ranges cover memory holes with a valid
> memmap. Not to move actual memory between nodes/zones.

I didn't think you were suggesting moving actual memory :)

My concern was that extending the node range might cause trouble, but TBH,
I cannot think of a memory layout crazy enough to actually get us into
that trouble.

So something like the patch below might work. It'll need nice wrapping and
some comments, but generally it implements your suggestion to extend the
node's range to include partial sections, and then interleaves
initialization of struct pages representing unpopulated memory with the
initialization of the "real" memory map. Since a zone's start/end are
derived from the node's start/end, we also get zones covering the holes.
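
The interleaving described above can be modeled in user space with a short
C sketch. The toy_* names are hypothetical; the loop is modeled on the
memmap_init() hunk in the patch below but only counts PFNs instead of
initializing struct pages, and it adds one extra clamp before the
trailing-gap check so that an empty range list does not count PFNs below
the zone start.

```c
#include <assert.h>
#include <stddef.h>

/* Toy PFN range [start, end), standing in for one memblock.memory entry. */
struct toy_pfn_range {
	unsigned long start;
	unsigned long end;
};

static unsigned long toy_clamp(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Walk sorted, non-overlapping memory ranges and count the PFNs inside
 * [zone_start, zone_end) that fall into gaps between them, including the
 * gap after the last range. Populated PFNs are returned in *populated.
 */
static unsigned long toy_count_hole_pfns(const struct toy_pfn_range *mem,
					 size_t nr,
					 unsigned long zone_start,
					 unsigned long zone_end,
					 unsigned long *populated)
{
	unsigned long next = 0, holes = 0;

	assert(zone_start <= zone_end);
	*populated = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long start = toy_clamp(mem[i].start, zone_start, zone_end);
		unsigned long end = toy_clamp(mem[i].end, zone_start, zone_end);

		next = toy_clamp(next, zone_start, zone_end);

		if (end > start)
			*populated += end - start;	/* "real" memory map */
		if (next < start)
			holes += start - next;		/* gap before this range */
		next = end;
	}

	/* Extra clamp (not in the patch) for the empty-list case. */
	next = toy_clamp(next, zone_start, zone_end);
	if (next < zone_end)
		holes += zone_end - next;		/* gap after the last range */

	return holes;
}
```

For a zone spanning PFNs [0, 0x8000) and memory at [1, 0xa0) and
[0x100, 0x8000), the holes are PFN 0 and [0xa0, 0x100), so every PFN in the
zone is accounted for either as populated or as a hole.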


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..179d1eb4a9bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6257,24 +6257,69 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	}
 }
 
+#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+static u64 __meminit init_unavailable_range(unsigned long spfn,
+					    unsigned long epfn,
+					    int zone, int node)
+{
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
+		__SetPageReserved(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+#else
+static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
+{
+	return 0;
+}
+#endif
+
+
 void __meminit __weak memmap_init(unsigned long size, int nid,
 				  unsigned long zone,
 				  unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn;
+	unsigned long start_pfn, end_pfn, next_pfn = 0;
 	unsigned long range_end_pfn = range_start_pfn + size;
+	u64 pgcnt = 0;
 	int i;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
 			memmap_init_zone(size, nid, zone, start_pfn, range_end_pfn,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
+
+		if (next_pfn < start_pfn)
+			pgcnt += init_unavailable_range(next_pfn, start_pfn,
+							zone, nid);
+		next_pfn = end_pfn;
 	}
+
+	if (next_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+						zone, nid);
+
+	if (pgcnt)
+		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
+			zone_names[zone], pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -6523,6 +6568,12 @@ void __init get_pfn_range_for_nid(unsigned int nid,
 
 	if (*start_pfn == -1UL)
 		*start_pfn = 0;
+	else {
+#ifdef CONFIG_SPARSEMEM
+		*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
+		*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
+#endif
+	}
 }
 
 /*
@@ -7075,88 +7126,6 @@ void __init free_area_init_memoryless_node(int nid)
 	free_area_init_node(nid);
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-/*
- * Initialize all valid struct pages in the range [spfn, epfn) and mark them
- * PageReserved(). Return the number of struct pages that were initialized.
- */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
-{
-	unsigned long pfn;
-	u64 pgcnt = 0;
-
-	for (pfn = spfn; pfn < epfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-				+ pageblock_nr_pages - 1;
-			continue;
-		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
-		__SetPageReserved(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
-	return pgcnt;
-}
-
-/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
- *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
- */
-static void __init init_unavailable_mem(void)
-{
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
-
-	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
-	 */
-	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
-		next = end;
-	}
-
-	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
-	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
-
-	/*
-	 * Struct pages that do not have backing memory. This could be because
-	 * firmware is using some of this memory, or for some other reasons.
-	 */
-	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-}
-#else
-static inline void __init init_unavailable_mem(void)
-{
-}
-#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7516,7 +7485,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
 
-	start_pfn = find_min_pfn_with_active_regions();
+	start_pfn = 0;
 	descending = arch_has_descending_max_zone_pfns();
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7580,7 +7549,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 

-- 
Sincerely yours,
Mike.


end of thread, other threads:[~2021-02-01 23:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-30 22:10 [PATCH v4 0/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
2021-01-30 22:10 ` [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory Mike Rapoport
2021-01-31  0:37   ` Linus Torvalds
2021-01-31  8:03     ` Mike Rapoport
2021-01-31 21:49       ` Linus Torvalds
2021-02-01 14:06         ` Mike Rapoport
2021-02-01  9:32   ` David Hildenbrand
2021-02-01 11:26     ` Baoquan He
2021-02-01 14:34       ` Mike Rapoport
2021-02-01 14:55         ` Baoquan He
2021-02-01 14:30     ` Mike Rapoport
2021-02-01 14:32       ` David Hildenbrand
2021-02-01 23:22         ` Mike Rapoport
2021-01-30 22:10 ` [PATCH v4 2/2] mm: fix initialization of struct page for holes in memory layout Mike Rapoport
