* [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
@ 2018-08-23 18:25 Masayoshi Mizuma
  2018-08-23 18:25 ` [PATCH 2/2] mm: zero remaining unavailable struct pages Masayoshi Mizuma
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-08-23 18:25 UTC (permalink / raw)
  To: linux-mm, Naoya Horiguchi
  Cc: Masayoshi Mizuma, Masayoshi Mizuma, linux-kernel, x86

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") breaks the movable_node kernel option because it
changes the memory gap ranges into reserved memblock regions. As a
result, the node is marked as a Normal zone even if the SRAT marks it
as hot pluggable.

    =====================================================================
    kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
    kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
    ...
    kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
    ...
    kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
    kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
    ...
    kernel: Movable zone start for each node
    kernel:  Node 3: 0x00001c0000000000
    kernel: Early memory node ranges
    ...
    =====================================================================

Naoya's v1 patch [*] fixes the original issue, and with it this
movable_node issue does not occur.
Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved") and apply the v1 patch.

[*] https://lkml.org/lkml/2018/6/13/27

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 arch/x86/kernel/e820.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c88c23c658c1..d1f25c831447 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
 {
 	int i;
 	u64 end;
-	u64 addr = 0;
 
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
@@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
 		struct e820_entry *entry = &e820_table->entries[i];
 
 		end = entry->addr + entry->size;
-		if (addr < entry->addr)
-			memblock_reserve(addr, entry->addr - addr);
-		addr = end;
 		if (end != (resource_size_t)end)
 			continue;
 
-		/*
-		 * all !E820_TYPE_RAM ranges (including gap ranges) are put
-		 * into memblock.reserved to make sure that struct pages in
-		 * such regions are not left uninitialized after bootup.
-		 */
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
-			memblock_reserve(entry->addr, entry->size);
-		else
-			memblock_add(entry->addr, entry->size);
+			continue;
+
+		memblock_add(entry->addr, entry->size);
 	}
 
 	/* Throw away partial pages: */
-- 
2.18.0



* [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-08-23 18:25 [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Masayoshi Mizuma
@ 2018-08-23 18:25 ` Masayoshi Mizuma
  2018-08-27 23:33   ` Pasha Tatashin
  2018-08-24  0:03 ` [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Naoya Horiguchi
  2018-08-27 23:18 ` Pasha Tatashin
  2 siblings, 1 reply; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-08-23 18:25 UTC (permalink / raw)
  To: linux-mm, Naoya Horiguchi; +Cc: Masayoshi Mizuma, linux-kernel, x86

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

There is a kernel panic that is triggered when reading /proc/kpageflags
on a kernel booted with the kernel parameter 'memmap=nn[KMG]!ss[KMG]':

  BUG: unable to handle kernel paging request at fffffffffffffffe
  PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
  RIP: 0010:stable_page_flags+0x27/0x3c0
  Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
  RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
  RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
  R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
  R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
  FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
  Call Trace:
   kpageflags_read+0xc7/0x120
   proc_reg_read+0x3c/0x60
   __vfs_read+0x36/0x170
   vfs_read+0x89/0x130
   ksys_pread64+0x71/0x90
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7efc42e75e23
  Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9, which changed how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
the default (no memmap= given) memblock layout is like below:

  MEMBLOCK configuration:
   memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x4
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
   memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

  MEMBLOCK configuration:
   memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x3
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

This shrinks node 0's pfn range, because the range is calculated from
the address range of memblock.memory, so some of the struct pages in
the gap range are left uninitialized.

We have a function, zero_resv_unavail(), which zeroes the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable ranges (i.e. memblock.reserved && !memblock.memory).
This patch extends it to cover all unavailable ranges, which fixes
the reported issue.
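
In outline, the extended logic walks memblock.memory in ascending
address order and zeroes the struct pages of every hole between
regions. A simplified sketch of the loop this patch adds, with
zero_pfn_range() used purely as shorthand for the inner zeroing loop
(the real code below also skips pageblocks that fail pfn_valid()):

	phys_addr_t start, end, next = 0;
	u64 i;

	for_each_mem_range(i, &memblock.memory, NULL,
			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
		if (next < start)	/* hole in front of this region */
			zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
		next = end;
	}
	zero_pfn_range(PFN_DOWN(next), max_pfn);	/* trailing hole */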

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 include/linux/memblock.h | 15 ---------------
 mm/page_alloc.c          | 36 +++++++++++++++++++++++++-----------
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 516920549378..2acdd046df2d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -265,21 +265,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
-/**
- * for_each_resv_unavail_range - iterate through reserved and unavailable memory
- * @i: u64 used as loop variable
- * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
- * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
- *
- * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
- * Available as soon as memblock is initialized.
- * Note: because this memory does not belong to any physical node, flags and
- * nid arguments do not make sense and thus not exported as arguments.
- */
-#define for_each_resv_unavail_range(i, p_start, p_end)			\
-	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
-			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
-
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     enum memblock_flags flags)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c677c1506d73..bdd3cba1d547 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6447,29 +6447,42 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
  * struct pages which are reserved in memblock allocator and their fields
  * may be accessed (for example page_to_pfn() on some configuration accesses
  * flags). We must explicitly zero those struct pages.
+ *
+ * This function also addresses a similar issue where struct pages are left
+ * uninitialized because the physical address range is not covered by
+ * memblock.memory or memblock.reserved. That could happen when memblock
+ * layout is manually configured via memmap=.
  */
 void __init zero_resv_unavail(void)
 {
 	phys_addr_t start, end;
 	unsigned long pfn;
 	u64 i, pgcnt;
+	phys_addr_t next = 0;
 
 	/*
-	 * Loop through ranges that are reserved, but do not have reported
-	 * physical memory backing.
+	 * Loop through unavailable ranges not covered by memblock.memory.
 	 */
 	pgcnt = 0;
-	for_each_resv_unavail_range(i, &start, &end) {
-		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
-			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-				pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-					+ pageblock_nr_pages - 1;
-				continue;
+	for_each_mem_range(i, &memblock.memory, NULL,
+			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
+		if (next < start) {
+			for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
+				if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+					continue;
+				mm_zero_struct_page(pfn_to_page(pfn));
+				pgcnt++;
 			}
-			mm_zero_struct_page(pfn_to_page(pfn));
-			pgcnt++;
 		}
+		next = end;
 	}
+	for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+			continue;
+		mm_zero_struct_page(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
@@ -6479,7 +6492,8 @@ void __init zero_resv_unavail(void)
 	 * this code can be removed.
 	 */
 	if (pgcnt)
-		pr_info("Reserved but unavailable: %lld pages", pgcnt);
+		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
+
 }
 #endif /* CONFIG_HAVE_MEMBLOCK && !CONFIG_FLAT_NODE_MEM_MAP */
 
-- 
2.18.0



* Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  2018-08-23 18:25 [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Masayoshi Mizuma
  2018-08-23 18:25 ` [PATCH 2/2] mm: zero remaining unavailable struct pages Masayoshi Mizuma
@ 2018-08-24  0:03 ` Naoya Horiguchi
  2018-08-24  8:29   ` Michal Hocko
  2018-08-27 23:18 ` Pasha Tatashin
  2 siblings, 1 reply; 13+ messages in thread
From: Naoya Horiguchi @ 2018-08-24  0:03 UTC (permalink / raw)
  To: Masayoshi Mizuma
  Cc: linux-mm, Masayoshi Mizuma, linux-kernel, x86, osalvador,
	pasha.tatashin, mhocko

(CCed related people)

Hi Mizuma-san,

Thank you for the report.
The mentioned patch was created based on feedback from reviewers/maintainers,
so I'd like to hear from them about how we should handle the issue.

And one note is that there is a follow-up patch for "x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved" which might be affected by your changes.

> commit e181ae0c5db9544de9c53239eb22bc012ce75033
> Author: Pavel Tatashin <pasha.tatashin@oracle.com>
> Date:   Sat Jul 14 09:15:07 2018 -0400
> 
>     mm: zero unavailable pages before memmap init

Thanks,
Naoya Horiguchi

On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
> From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> 
> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved") breaks the movable_node kernel option because it
> changes the memory gap ranges into reserved memblock regions. As a
> result, the node is marked as a Normal zone even if the SRAT marks it
> as hot pluggable.
> 
>     =====================================================================
>     kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
>     kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
>     ...
>     kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
>     ...
>     kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
>     kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
>     ...
>     kernel: Movable zone start for each node
>     kernel:  Node 3: 0x00001c0000000000
>     kernel: Early memory node ranges
>     ...
>     =====================================================================
> 
> Naoya's v1 patch [*] fixes the original issue, and with it this
> movable_node issue does not occur.
> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> regions into memblock.reserved") and apply the v1 patch.
> 
> [*] https://lkml.org/lkml/2018/6/13/27
> 
> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> ---
>  arch/x86/kernel/e820.c | 15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c88c23c658c1..d1f25c831447 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
>  {
>  	int i;
>  	u64 end;
> -	u64 addr = 0;
>  
>  	/*
>  	 * The bootstrap memblock region count maximum is 128 entries
> @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
>  		struct e820_entry *entry = &e820_table->entries[i];
>  
>  		end = entry->addr + entry->size;
> -		if (addr < entry->addr)
> -			memblock_reserve(addr, entry->addr - addr);
> -		addr = end;
>  		if (end != (resource_size_t)end)
>  			continue;
>  
> -		/*
> -		 * all !E820_TYPE_RAM ranges (including gap ranges) are put
> -		 * into memblock.reserved to make sure that struct pages in
> -		 * such regions are not left uninitialized after bootup.
> -		 */
>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> -			memblock_reserve(entry->addr, entry->size);
> -		else
> -			memblock_add(entry->addr, entry->size);
> +			continue;
> +
> +		memblock_add(entry->addr, entry->size);
>  	}
>  
>  	/* Throw away partial pages: */
> -- 
> 2.18.0
> 
> 


* Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  2018-08-24  0:03 ` [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Naoya Horiguchi
@ 2018-08-24  8:29   ` Michal Hocko
  2018-08-27 12:31     ` Masayoshi Mizuma
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2018-08-24  8:29 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Masayoshi Mizuma, linux-mm, Masayoshi Mizuma, linux-kernel, x86,
	osalvador, Pavel Tatashin

On Fri 24-08-18 00:03:25, Naoya Horiguchi wrote:
> (CCed related people)

Fixup Pavel email.

> 
> Hi Mizuma-san,
> 
> Thank you for the report.
> The mentioned patch was created based on feedback from reviewers/maintainers,
> so I'd like to hear from them about how we should handle the issue.
> 
> And one note is that there is a follow-up patch for "x86/e820: put !E820_TYPE_RAM
> regions into memblock.reserved" which might be affected by your changes.
> 
> > commit e181ae0c5db9544de9c53239eb22bc012ce75033
> > Author: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Date:   Sat Jul 14 09:15:07 2018 -0400
> > 
> >     mm: zero unavailable pages before memmap init
> 
> Thanks,
> Naoya Horiguchi
> 
> On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
> > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> > 
> > commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> > memblock.reserved") breaks the movable_node kernel option because it
> > changes the memory gap ranges into reserved memblock regions. As a
> > result, the node is marked as a Normal zone even if the SRAT marks it
> > as hot pluggable.
> > 
> >     =====================================================================
> >     kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
> >     kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
> >     ...
> >     kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
> >     ...
> >     kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
> >     kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
> >     ...
> >     kernel: Movable zone start for each node
> >     kernel:  Node 3: 0x00001c0000000000
> >     kernel: Early memory node ranges
> >     ...
> >     =====================================================================
> > 
> > Naoya's v1 patch [*] fixes the original issue, and with it this
> > movable_node issue does not occur.
> > Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> > regions into memblock.reserved") and apply the v1 patch.
> > 
> > [*] https://lkml.org/lkml/2018/6/13/27
> > 
> > Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> > ---
> >  arch/x86/kernel/e820.c | 15 +++------------
> >  1 file changed, 3 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index c88c23c658c1..d1f25c831447 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
> >  {
> >  	int i;
> >  	u64 end;
> > -	u64 addr = 0;
> >  
> >  	/*
> >  	 * The bootstrap memblock region count maximum is 128 entries
> > @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
> >  		struct e820_entry *entry = &e820_table->entries[i];
> >  
> >  		end = entry->addr + entry->size;
> > -		if (addr < entry->addr)
> > -			memblock_reserve(addr, entry->addr - addr);
> > -		addr = end;
> >  		if (end != (resource_size_t)end)
> >  			continue;
> >  
> > -		/*
> > -		 * all !E820_TYPE_RAM ranges (including gap ranges) are put
> > -		 * into memblock.reserved to make sure that struct pages in
> > -		 * such regions are not left uninitialized after bootup.
> > -		 */
> >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > -			memblock_reserve(entry->addr, entry->size);
> > -		else
> > -			memblock_add(entry->addr, entry->size);
> > +			continue;
> > +
> > +		memblock_add(entry->addr, entry->size);
> >  	}
> >  
> >  	/* Throw away partial pages: */
> > -- 
> > 2.18.0
> > 
> > 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  2018-08-24  8:29   ` Michal Hocko
@ 2018-08-27 12:31     ` Masayoshi Mizuma
  2018-08-27 13:29       ` Pasha Tatashin
  0 siblings, 1 reply; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-08-27 12:31 UTC (permalink / raw)
  To: mhocko, n-horiguchi, Pavel.Tatashin
  Cc: linux-mm, m.mizuma, linux-kernel, x86, osalvador

Hi Pavel,

I would appreciate it if you could send feedback on the patch.

Thanks!
Masa

On 08/24/2018 04:29 AM, Michal Hocko wrote:
> On Fri 24-08-18 00:03:25, Naoya Horiguchi wrote:
>> (CCed related people)
> 
> Fixup Pavel email.
> 
>>
>> Hi Mizuma-san,
>>
>> Thank you for the report.
>> The mentioned patch was created based on feedback from reviewers/maintainers,
>> so I'd like to hear from them about how we should handle the issue.
>>
>> And one note is that there is a follow-up patch for "x86/e820: put !E820_TYPE_RAM
>> regions into memblock.reserved" which might be affected by your changes.
>>
>>> commit e181ae0c5db9544de9c53239eb22bc012ce75033
>>> Author: Pavel Tatashin <pasha.tatashin@oracle.com>
>>> Date:   Sat Jul 14 09:15:07 2018 -0400
>>>
>>>     mm: zero unavailable pages before memmap init
>>
>> Thanks,
>> Naoya Horiguchi
>>
>> On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
>>> From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>>>
>>> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
>>> memblock.reserved") breaks the movable_node kernel option because it
>>> changes the memory gap ranges into reserved memblock regions. As a
>>> result, the node is marked as a Normal zone even if the SRAT marks it
>>> as hot pluggable.
>>>
>>>     =====================================================================
>>>     kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
>>>     kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
>>>     ...
>>>     kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
>>>     ...
>>>     kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
>>>     kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
>>>     ...
>>>     kernel: Movable zone start for each node
>>>     kernel:  Node 3: 0x00001c0000000000
>>>     kernel: Early memory node ranges
>>>     ...
>>>     =====================================================================
>>>
>>> Naoya's v1 patch [*] fixes the original issue, and with it this
>>> movable_node issue does not occur.
>>> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
>>> regions into memblock.reserved") and apply the v1 patch.
>>>
>>> [*] https://lkml.org/lkml/2018/6/13/27
>>>
>>> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>>> ---
>>>  arch/x86/kernel/e820.c | 15 +++------------
>>>  1 file changed, 3 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index c88c23c658c1..d1f25c831447 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
>>>  {
>>>  	int i;
>>>  	u64 end;
>>> -	u64 addr = 0;
>>>  
>>>  	/*
>>>  	 * The bootstrap memblock region count maximum is 128 entries
>>> @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
>>>  		struct e820_entry *entry = &e820_table->entries[i];
>>>  
>>>  		end = entry->addr + entry->size;
>>> -		if (addr < entry->addr)
>>> -			memblock_reserve(addr, entry->addr - addr);
>>> -		addr = end;
>>>  		if (end != (resource_size_t)end)
>>>  			continue;
>>>  
>>> -		/*
>>> -		 * all !E820_TYPE_RAM ranges (including gap ranges) are put
>>> -		 * into memblock.reserved to make sure that struct pages in
>>> -		 * such regions are not left uninitialized after bootup.
>>> -		 */
>>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>> -			memblock_reserve(entry->addr, entry->size);
>>> -		else
>>> -			memblock_add(entry->addr, entry->size);
>>> +			continue;
>>> +
>>> +		memblock_add(entry->addr, entry->size);
>>>  	}
>>>  
>>>  	/* Throw away partial pages: */
>>> -- 
>>> 2.18.0
>>>
>>>
> 


* Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  2018-08-27 12:31     ` Masayoshi Mizuma
@ 2018-08-27 13:29       ` Pasha Tatashin
  0 siblings, 0 replies; 13+ messages in thread
From: Pasha Tatashin @ 2018-08-27 13:29 UTC (permalink / raw)
  To: msys.mizuma
  Cc: mhocko, n-horiguchi, Linux Memory Management List,
	Masayoshi Mizuma, LKML, x86, osalvador

On Mon, Aug 27, 2018 at 8:31 AM Masayoshi Mizuma <msys.mizuma@gmail.com> wrote:
>
> Hi Pavel,
>
> I would appreciate it if you could send feedback on the patch.

I will study it today.

Pavel

>
> Thanks!
> Masa
>
> On 08/24/2018 04:29 AM, Michal Hocko wrote:
> > On Fri 24-08-18 00:03:25, Naoya Horiguchi wrote:
> >> (CCed related people)
> >
> > Fixup Pavel email.
> >
> >>
> >> Hi Mizuma-san,
> >>
> >> Thank you for the report.
> >> The mentioned patch was created based on feedback from reviewers/maintainers,
> >> so I'd like to hear from them about how we should handle the issue.
> >>
> >> And one note is that there is a follow-up patch for "x86/e820: put !E820_TYPE_RAM
> >> regions into memblock.reserved" which might be affected by your changes.
> >>
> >>> commit e181ae0c5db9544de9c53239eb22bc012ce75033
> >>> Author: Pavel Tatashin <pasha.tatashin@oracle.com>
> >>> Date:   Sat Jul 14 09:15:07 2018 -0400
> >>>
> >>>     mm: zero unavailable pages before memmap init
> >>
> >> Thanks,
> >> Naoya Horiguchi
> >>
> >> On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
> >>> From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> >>>
> >>> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> >>> memblock.reserved") breaks the movable_node kernel option because it
> >>> changes the memory gap ranges into reserved memblock regions. As a
> >>> result, the node is marked as a Normal zone even if the SRAT marks it
> >>> as hot pluggable.
> >>>
> >>>     =====================================================================
> >>>     kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
> >>>     kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
> >>>     ...
> >>>     kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
> >>>     ...
> >>>     kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
> >>>     kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
> >>>     ...
> >>>     kernel: Movable zone start for each node
> >>>     kernel:  Node 3: 0x00001c0000000000
> >>>     kernel: Early memory node ranges
> >>>     ...
> >>>     =====================================================================
> >>>
> >>> Naoya's v1 patch [*] fixes the original issue, and with it this
> >>> movable_node issue does not occur.
> >>> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> >>> regions into memblock.reserved") and apply the v1 patch.
> >>>
> >>> [*] https://lkml.org/lkml/2018/6/13/27
> >>>
> >>> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> >>> ---
> >>>  arch/x86/kernel/e820.c | 15 +++------------
> >>>  1 file changed, 3 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >>> index c88c23c658c1..d1f25c831447 100644
> >>> --- a/arch/x86/kernel/e820.c
> >>> +++ b/arch/x86/kernel/e820.c
> >>> @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
> >>>  {
> >>>     int i;
> >>>     u64 end;
> >>> -   u64 addr = 0;
> >>>
> >>>     /*
> >>>      * The bootstrap memblock region count maximum is 128 entries
> >>> @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
> >>>             struct e820_entry *entry = &e820_table->entries[i];
> >>>
> >>>             end = entry->addr + entry->size;
> >>> -           if (addr < entry->addr)
> >>> -                   memblock_reserve(addr, entry->addr - addr);
> >>> -           addr = end;
> >>>             if (end != (resource_size_t)end)
> >>>                     continue;
> >>>
> >>> -           /*
> >>> -            * all !E820_TYPE_RAM ranges (including gap ranges) are put
> >>> -            * into memblock.reserved to make sure that struct pages in
> >>> -            * such regions are not left uninitialized after bootup.
> >>> -            */
> >>>             if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >>> -                   memblock_reserve(entry->addr, entry->size);
> >>> -           else
> >>> -                   memblock_add(entry->addr, entry->size);
> >>> +                   continue;
> >>> +
> >>> +           memblock_add(entry->addr, entry->size);
> >>>     }
> >>>
> >>>     /* Throw away partial pages: */
> >>> --
> >>> 2.18.0
> >>>
> >>>
> >
>


* Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  2018-08-23 18:25 [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Masayoshi Mizuma
  2018-08-23 18:25 ` [PATCH 2/2] mm: zero remaining unavailable struct pages Masayoshi Mizuma
  2018-08-24  0:03 ` [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Naoya Horiguchi
@ 2018-08-27 23:18 ` Pasha Tatashin
  2 siblings, 0 replies; 13+ messages in thread
From: Pasha Tatashin @ 2018-08-27 23:18 UTC (permalink / raw)
  To: Masayoshi Mizuma, linux-mm, Naoya Horiguchi
  Cc: Masayoshi Mizuma, linux-kernel, x86

On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> 
> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved") breaks the movable_node kernel option because it
> changes the memory gap ranges into reserved memblock regions. As a
> result, the node is marked as a Normal zone even if the SRAT marks it
> as hot pluggable.
> 
>     =====================================================================
>     kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
>     kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
>     ...
>     kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
>     ...
>     kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
>     kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
>     ...
>     kernel: Movable zone start for each node
>     kernel:  Node 3: 0x00001c0000000000
>     kernel: Early memory node ranges
>     ...
>     =====================================================================
> 
> Naoya's v1 patch [*] fixes the original issue, and with it this
> movable_node issue does not occur.
> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> regions into memblock.reserved") and apply the v1 patch.
> 
> [*] https://lkml.org/lkml/2018/6/13/27
> 
> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>


* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-08-23 18:25 ` [PATCH 2/2] mm: zero remaining unavailable struct pages Masayoshi Mizuma
@ 2018-08-27 23:33   ` Pasha Tatashin
  2018-08-29 15:16     ` Masayoshi Mizuma
  0 siblings, 1 reply; 13+ messages in thread
From: Pasha Tatashin @ 2018-08-27 23:33 UTC (permalink / raw)
  To: Masayoshi Mizuma, linux-mm, Naoya Horiguchi; +Cc: linux-kernel, x86

On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on a kernel booted with the kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> 
>   BUG: unable to handle kernel paging request at fffffffffffffffe
>   PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
>   Oops: 0000 [#1] SMP PTI
>   CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
>   RIP: 0010:stable_page_flags+0x27/0x3c0
>   Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
>   RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
>   RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
>   RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
>   RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
>   R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
>   R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
>   FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
>   Call Trace:
>    kpageflags_read+0xc7/0x120
>    proc_reg_read+0x3c/0x60
>    __vfs_read+0x36/0x170
>    vfs_read+0x89/0x130
>    ksys_pread64+0x71/0x90
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   RIP: 0033:0x7efc42e75e23
>   Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> 
> According to kernel bisection, this problem became visible due to commit
> f7f99100d8d9, which changed how struct pages are initialized.
> 
> Memblock layout affects the pfn ranges covered by node/zone. Consider
> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> the default (no memmap= given) memblock layout is like below:
> 
>   MEMBLOCK configuration:
>    memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
>    memory.cnt  = 0x4
>    memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
>    memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
>    memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
>    memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
>    ...
> 
> If you give memmap=1G!4G (so it just covers memory[0x2]),
> the range [0x100000000-0x13fffffff] is gone:
> 
>   MEMBLOCK configuration:
>    memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
>    memory.cnt  = 0x3
>    memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
>    memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
>    memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
>    ...
> 
> This shrinks node 0's pfn range, because the range is calculated from
> the address range of memblock.memory, so some of the struct pages in
> the gap range are left uninitialized.
> 
> We have a function, zero_resv_unavail(), which zeroes the struct pages
> outside memblock.memory, but currently it covers only the reserved
> unavailable ranges (i.e. memblock.reserved && !memblock.memory).
> This patch extends it to cover all unavailable ranges, which fixes
> the reported issue.
> 
> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Tested-by: Oscar Salvador <osalvador@suse.de>
> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

Also, please review and add the following patch to this series:

From 6d23e66e979244734a06c1b636742c2568121b39 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pavel.tatashin@microsoft.com>
Date: Mon, 27 Aug 2018 19:10:35 -0400
Subject: [PATCH] mm: return zero_resv_unavail optimization

When checking for valid pfns in zero_resv_unavail(), it is not necessary to
verify every pfn within a pageblock_nr_pages range; only the first one needs
to be checked. This is because memory for struct pages is allocated in
contiguous chunks that each cover pageblock_nr_pages struct pages.
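
As a back-of-envelope example (assuming x86_64 defaults: 4KB pages and
pageblock_nr_pages = 512), for a 1GB hole this reduces the number of
pfn_valid() checks from one per pfn to one per pageblock:

	1GB / 4KB    = 262144 pfns to zero
	262144 / 512 =    512 pfn_valid() checks (one per pageblock)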

Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
---
 mm/page_alloc.c | 46 ++++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 650d8f16a67e..5dfc206db40e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6441,6 +6441,29 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
 }
 
 #if defined(CONFIG_HAVE_MEMBLOCK) && !defined(CONFIG_FLAT_NODE_MEM_MAP)
+
+/*
+ * Zero all valid struct pages in range [spfn, epfn), return number of struct
+ * pages zeroed
+ */
+static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
+{
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		mm_zero_struct_page(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+
 /*
  * Only struct pages that are backed by physical memory are zeroed and
  * initialized by going through __init_single_page(). But, there are some
@@ -6456,7 +6479,6 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
 void __init zero_resv_unavail(void)
 {
 	phys_addr_t start, end;
-	unsigned long pfn;
 	u64 i, pgcnt;
 	phys_addr_t next = 0;
 
@@ -6466,34 +6488,18 @@ void __init zero_resv_unavail(void)
 	pgcnt = 0;
 	for_each_mem_range(i, &memblock.memory, NULL,
 			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
-		if (next < start) {
-			for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
-				if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
-					continue;
-				mm_zero_struct_page(pfn_to_page(pfn));
-				pgcnt++;
-			}
-		}
+		if (next < start)
+			pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
 		next = end;
 	}
-	for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
-			continue;
-		mm_zero_struct_page(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
+	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
 	 * firmware is using some of this memory, or for some other reasons.
-	 * Once memblock is changed so such behaviour is not allowed: i.e.
-	 * list of "reserved" memory must be a subset of list of "memory", then
-	 * this code can be removed.
 	 */
 	if (pgcnt)
 		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-
 }
 #endif /* CONFIG_HAVE_MEMBLOCK && !CONFIG_FLAT_NODE_MEM_MAP */
 
-- 
2.18.0


* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-08-27 23:33   ` Pasha Tatashin
@ 2018-08-29 15:16     ` Masayoshi Mizuma
  2018-08-31  2:55       ` Naoya Horiguchi
  0 siblings, 1 reply; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-08-29 15:16 UTC (permalink / raw)
  To: Pavel.Tatashin, linux-mm, n-horiguchi, mhocko; +Cc: linux-kernel, x86

Hi Horiguchi-san and Pavel

Thank you for your comments!
Pavel's additional patch looks good to me, so I will add it to this series.

Unfortunately, however, something is still wrong with the movable_node
option... When I offline memory which belongs to the movable zone, I get
the following warning. I'm trying to debug it.

I will describe the issue below.
If you have any comments, please let me know.

WARNING: CPU: 156 PID: 25611 at mm/page_alloc.c:7730 has_unmovable_pages+0x1bf/0x200
RIP: 0010:has_unmovable_pages+0x1bf/0x200
...
Call Trace:
 is_mem_section_removable+0xd3/0x160
 show_mem_removable+0x8e/0xb0
 dev_attr_show+0x1c/0x50
 sysfs_kf_seq_show+0xb3/0x110
 seq_read+0xee/0x480
 __vfs_read+0x36/0x190
 vfs_read+0x89/0x130
 ksys_read+0x52/0xc0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fe7b7823f70
...

I added a printk to catch the unmovable page.
---
@@ -7713,8 +7719,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
                 * is set to both of a memory hole page and a _used_ kernel
                 * page at boot.
                 */
-               if (found > count)
+               if (found > count) {
+                       pr_info("DEBUG: %s zone: %lx page: %lx pfn: %lx flags: %lx found: %ld count: %ld \n",
+                               __func__, zone, page, page_to_pfn(page), page->flags, found, count);
                        goto unmovable;
+               }
---

Then I got the following. The page (PFN: 0x1c0ff130d) has flags
0xdfffffc0040048 (uptodate|active|swapbacked):

---
DEBUG: has_unmovable_pages zone: 0xffff8c0ffff80380 page: 0xffffea703fc4c340 pfn: 0x1c0ff130d flags: 0xdfffffc0040048 found: 1 count: 0 
---

And I got the owner from /sys/kernel/debug/page_owner.

Page allocated via order 0, mask 0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
PFN 7532909325 type Movable Block 14712713 type Movable Flags 0xdfffffc0040048(uptodate|active|swapbacked)
 __alloc_pages_nodemask+0xfc/0x270
 alloc_pages_vma+0x7c/0x1e0
 handle_pte_fault+0x399/0xe50
 __handle_mm_fault+0x38e/0x520
 handle_mm_fault+0xdc/0x210
 __do_page_fault+0x243/0x4c0
 do_page_fault+0x31/0x130
 page_fault+0x1e/0x30

The page is allocated as an anonymous page via a page fault.
I'm not sure, but shouldn't the LRU flag be set on the page...?
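
(For reference, the owner information above comes from the page_owner
facility; roughly, assuming a kernel built with CONFIG_PAGE_OWNER=y and
booted with page_owner=on:

	# cat /sys/kernel/debug/page_owner > dump
	# grep -B 1 -A 10 'PFN 7532909325' dump

The exact grep pattern is just for illustration.)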

Thanks,
Masa

On 08/27/2018 07:33 PM, Pasha Tatashin wrote:
> On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> There is a kernel panic that is triggered when reading /proc/kpageflags
>> on a kernel booted with the kernel parameter 'memmap=nn[KMG]!ss[KMG]':
>>
>>   BUG: unable to handle kernel paging request at fffffffffffffffe
>>   PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
>>   Oops: 0000 [#1] SMP PTI
>>   CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
>>   RIP: 0010:stable_page_flags+0x27/0x3c0
>>   Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
>>   RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
>>   RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
>>   RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
>>   RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
>>   R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
>>   R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
>>   FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
>>   Call Trace:
>>    kpageflags_read+0xc7/0x120
>>    proc_reg_read+0x3c/0x60
>>    __vfs_read+0x36/0x170
>>    vfs_read+0x89/0x130
>>    ksys_pread64+0x71/0x90
>>    do_syscall_64+0x5b/0x160
>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>   RIP: 0033:0x7efc42e75e23
>>   Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
>>
>> According to kernel bisection, this problem became visible due to commit
>> f7f99100d8d9, which changed how struct pages are initialized.
>>
>> Memblock layout affects the pfn ranges covered by node/zone. Consider
>> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
>> the default (no memmap= given) memblock layout is like below:
>>
>>   MEMBLOCK configuration:
>>    memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
>>    memory.cnt  = 0x4
>>    memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
>>    memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
>>    memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
>>    memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
>>    ...
>>
>> If you give memmap=1G!4G (so it just covers memory[0x2]),
>> the range [0x100000000-0x13fffffff] is gone:
>>
>>   MEMBLOCK configuration:
>>    memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
>>    memory.cnt  = 0x3
>>    memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
>>    memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
>>    memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
>>    ...
>>
>> This shrinks node 0's pfn range, because the range is calculated from
>> the address range of memblock.memory, so some of the struct pages in
>> the gap range are left uninitialized.
>>
>> We have a function, zero_resv_unavail(), which zeroes the struct pages
>> outside memblock.memory, but currently it covers only the reserved
>> unavailable ranges (i.e. memblock.reserved && !memblock.memory).
>> This patch extends it to cover all unavailable ranges, which fixes
>> the reported issue.
>>
>> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Tested-by: Oscar Salvador <osalvador@suse.de>
>> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
> 
> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
> 
> Also, please review and add the following patch to this series:
> 
> From 6d23e66e979244734a06c1b636742c2568121b39 Mon Sep 17 00:00:00 2001
> From: Pavel Tatashin <pavel.tatashin@microsoft.com>
> Date: Mon, 27 Aug 2018 19:10:35 -0400
> Subject: [PATCH] mm: return zero_resv_unavail optimization
> 
> When checking for valid pfns in zero_resv_unavail(), it is not necessary to
> verify every pfn within a pageblock_nr_pages range; only the first one needs
> to be checked. This is because memory for struct pages is allocated in
> contiguous chunks that each cover pageblock_nr_pages struct pages.
> 
> Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
> ---
>  mm/page_alloc.c | 46 ++++++++++++++++++++++++++--------------------
>  1 file changed, 26 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 650d8f16a67e..5dfc206db40e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6441,6 +6441,29 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
>  }
>  
>  #if defined(CONFIG_HAVE_MEMBLOCK) && !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +
> +/*
> + * Zero all valid struct pages in range [spfn, epfn), return number of struct
> + * pages zeroed
> + */
> +static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
> +{
> +	unsigned long pfn;
> +	u64 pgcnt = 0;
> +
> +	for (pfn = spfn; pfn < epfn; pfn++) {
> +		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> +			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> +				+ pageblock_nr_pages - 1;
> +			continue;
> +		}
> +		mm_zero_struct_page(pfn_to_page(pfn));
> +		pgcnt++;
> +	}
> +
> +	return pgcnt;
> +}
> +
>  /*
>   * Only struct pages that are backed by physical memory are zeroed and
>   * initialized by going through __init_single_page(). But, there are some
> @@ -6456,7 +6479,6 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
>  void __init zero_resv_unavail(void)
>  {
>  	phys_addr_t start, end;
> -	unsigned long pfn;
>  	u64 i, pgcnt;
>  	phys_addr_t next = 0;
>  
> @@ -6466,34 +6488,18 @@ void __init zero_resv_unavail(void)
>  	pgcnt = 0;
>  	for_each_mem_range(i, &memblock.memory, NULL,
>  			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
> -		if (next < start) {
> -			for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
> -				if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> -					continue;
> -				mm_zero_struct_page(pfn_to_page(pfn));
> -				pgcnt++;
> -			}
> -		}
> +		if (next < start)
> +			pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
>  		next = end;
>  	}
> -	for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
> -		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> -			continue;
> -		mm_zero_struct_page(pfn_to_page(pfn));
> -		pgcnt++;
> -	}
> -
> +	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
>  
>  	/*
>  	 * Struct pages that do not have backing memory. This could be because
>  	 * firmware is using some of this memory, or for some other reasons.
> -	 * Once memblock is changed so such behaviour is not allowed: i.e.
> -	 * list of "reserved" memory must be a subset of list of "memory", then
> -	 * this code can be removed.
>  	 */
>  	if (pgcnt)
>  		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
> -
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK && !CONFIG_FLAT_NODE_MEM_MAP */
>  
> 


* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-08-29 15:16     ` Masayoshi Mizuma
@ 2018-08-31  2:55       ` Naoya Horiguchi
  2018-09-17 13:26         ` Masayoshi Mizuma
  0 siblings, 1 reply; 13+ messages in thread
From: Naoya Horiguchi @ 2018-08-31  2:55 UTC (permalink / raw)
  To: Masayoshi Mizuma; +Cc: Pavel.Tatashin, linux-mm, mhocko, linux-kernel, x86

On Wed, Aug 29, 2018 at 11:16:30AM -0400, Masayoshi Mizuma wrote:
> Hi Horiguchi-san and Pavel
> 
> Thank you for your comments!
> Pavel's additional patch looks good to me, so I will add it to this series.
> 
> Unfortunately, however, something is still wrong with the movable_node
> option... When I offline memory which belongs to the movable zone, I get
> the following warning. I'm trying to debug it.
> 
> I will describe the issue below.
> If you have any comments, please let me know.
> 
> WARNING: CPU: 156 PID: 25611 at mm/page_alloc.c:7730 has_unmovable_pages+0x1bf/0x200
> RIP: 0010:has_unmovable_pages+0x1bf/0x200
> ...
> Call Trace:
>  is_mem_section_removable+0xd3/0x160
>  show_mem_removable+0x8e/0xb0
>  dev_attr_show+0x1c/0x50
>  sysfs_kf_seq_show+0xb3/0x110
>  seq_read+0xee/0x480
>  __vfs_read+0x36/0x190
>  vfs_read+0x89/0x130
>  ksys_read+0x52/0xc0
>  do_syscall_64+0x5b/0x180
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fe7b7823f70
> ...
> 
> I added a printk to catch the unmovable page.
> ---
> @@ -7713,8 +7719,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>                  * is set to both of a memory hole page and a _used_ kernel
>                  * page at boot.
>                  */
> -               if (found > count)
> +               if (found > count) {
> +                       pr_info("DEBUG: %s zone: %lx page: %lx pfn: %lx flags: %lx found: %ld count: %ld \n",
> +                               __func__, zone, page, page_to_pfn(page), page->flags, found, count);
>                         goto unmovable;
> +               }
> ---
> 
> Then I got the following. The page (PFN: 0x1c0ff130d) has flags
> 0xdfffffc0040048 (uptodate|active|swapbacked):
> 
> ---
> DEBUG: has_unmovable_pages zone: 0xffff8c0ffff80380 page: 0xffffea703fc4c340 pfn: 0x1c0ff130d flags: 0xdfffffc0040048 found: 1 count: 0 
> ---
> 
> And I got the owner from /sys/kernel/debug/page_owner.
> 
> Page allocated via order 0, mask 0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> PFN 7532909325 type Movable Block 14712713 type Movable Flags 0xdfffffc0040048(uptodate|active|swapbacked)
>  __alloc_pages_nodemask+0xfc/0x270
>  alloc_pages_vma+0x7c/0x1e0
>  handle_pte_fault+0x399/0xe50
>  __handle_mm_fault+0x38e/0x520
>  handle_mm_fault+0xdc/0x210
>  __do_page_fault+0x243/0x4c0
>  do_page_fault+0x31/0x130
>  page_fault+0x1e/0x30
> 
> > The page is allocated as an anonymous page via a page fault.
> > I'm not sure, but shouldn't the LRU flag be set on the page...?

There is a small window just after page allocation, before the page is
linked to an LRU list, in which the PageLRU flag is not set.
This kind of unmovability is transient, so retrying can work.

I guess this warning has been visible since commit 15c30bc09085
("mm, memory_hotplug: make has_unmovable_pages more robust"),
which turned off the optimization based on the assumption that pages
under ZONE_MOVABLE are always movable.
I think it helps developers find cases where permanently unmovable
pages are accidentally located in the ZONE_MOVABLE zone.
But even the ZONE_MOVABLE zone can contain transiently unmovable pages,
so the reported warning looks to me like a false positive and should be
avoided. Might doing lru_add_drain_all()/drain_all_pages() before
has_unmovable_pages() be helpful?
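
A very rough sketch of that idea (hypothetical call site; the exact
placement and whether both drains are needed would have to be checked):

	/* e.g. in is_mem_section_removable(), before scanning the range */
	lru_add_drain_all();	/* flush per-CPU pages pending LRU addition */
	drain_all_pages(zone);	/* flush per-CPU free pages back to the buddy */
	/* ... then call has_unmovable_pages() as before ... */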

Thanks,
Naoya Horiguchi


* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-08-31  2:55       ` Naoya Horiguchi
@ 2018-09-17 13:26         ` Masayoshi Mizuma
  2018-09-19  1:54           ` Naoya Horiguchi
  0 siblings, 1 reply; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-09-17 13:26 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Pavel.Tatashin, linux-mm, mhocko, linux-kernel, x86

On Fri, Aug 31, 2018 at 02:55:36AM +0000, Naoya Horiguchi wrote:
> On Wed, Aug 29, 2018 at 11:16:30AM -0400, Masayoshi Mizuma wrote:
> > Hi Horiguchi-san and Pavel
> > 
> > Thank you for your comments!
> > Pavel's additional patch looks good to me, so I will add it to this series.
> > 
> > Unfortunately, however, something is still wrong with the movable_node
> > option... When I offline memory which belongs to the movable zone, I get
> > the following warning. I'm trying to debug it.
> > 
> > I will describe the issue below.
> > If you have any comments, please let me know.
> > 
> > WARNING: CPU: 156 PID: 25611 at mm/page_alloc.c:7730 has_unmovable_pages+0x1bf/0x200
> > RIP: 0010:has_unmovable_pages+0x1bf/0x200
> > ...
> > Call Trace:
> >  is_mem_section_removable+0xd3/0x160
> >  show_mem_removable+0x8e/0xb0
> >  dev_attr_show+0x1c/0x50
> >  sysfs_kf_seq_show+0xb3/0x110
> >  seq_read+0xee/0x480
> >  __vfs_read+0x36/0x190
> >  vfs_read+0x89/0x130
> >  ksys_read+0x52/0xc0
> >  do_syscall_64+0x5b/0x180
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7fe7b7823f70
> > ...
> > 
> > I added a printk to catch the unmovable page.
> > ---
> > @@ -7713,8 +7719,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
> >                  * is set to both of a memory hole page and a _used_ kernel
> >                  * page at boot.
> >                  */
> > -               if (found > count)
> > +               if (found > count) {
> > +                       pr_info("DEBUG: %s zone: %lx page: %lx pfn: %lx flags: %lx found: %ld count: %ld \n",
> > +                               __func__, zone, page, page_to_pfn(page), page->flags, found, count);
> >                         goto unmovable;
> > +               }
> > ---
> > 
> > Then I got the following. The page (PFN: 0x1c0ff130d) has flags
> > 0xdfffffc0040048 (uptodate|active|swapbacked):
> > 
> > ---
> > DEBUG: has_unmovable_pages zone: 0xffff8c0ffff80380 page: 0xffffea703fc4c340 pfn: 0x1c0ff130d flags: 0xdfffffc0040048 found: 1 count: 0 
> > ---
> > 
> > And I got the owner from /sys/kernel/debug/page_owner.
> > 
> > Page allocated via order 0, mask 0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
> > PFN 7532909325 type Movable Block 14712713 type Movable Flags 0xdfffffc0040048(uptodate|active|swapbacked)
> >  __alloc_pages_nodemask+0xfc/0x270
> >  alloc_pages_vma+0x7c/0x1e0
> >  handle_pte_fault+0x399/0xe50
> >  __handle_mm_fault+0x38e/0x520
> >  handle_mm_fault+0xdc/0x210
> >  __do_page_fault+0x243/0x4c0
> >  do_page_fault+0x31/0x130
> >  page_fault+0x1e/0x30
> > 
> > > The page is allocated as an anonymous page via a page fault.
> > > I'm not sure, but shouldn't the LRU flag be set on the page...?
> 
> There is a small window just after page allocation, before the page is
> linked to an LRU list, in which the PageLRU flag is not set.
> This kind of unmovability is transient, so retrying can work.
> 
> I guess this warning has been visible since commit 15c30bc09085
> ("mm, memory_hotplug: make has_unmovable_pages more robust"),
> which turned off the optimization based on the assumption that pages
> under ZONE_MOVABLE are always movable.
> I think it helps developers find cases where permanently unmovable
> pages are accidentally located in the ZONE_MOVABLE zone.
> But even the ZONE_MOVABLE zone can contain transiently unmovable pages,
> so the reported warning looks to me like a false positive and should be
> avoided. Might doing lru_add_drain_all()/drain_all_pages() before
> has_unmovable_pages() be helpful?

Thank you for your proposal! And sorry for the delayed response.

lru_add_drain_all()/drain_all_pages() might be helpful, but it
seems the window is not so small: I tried offlining several
times, and every attempt failed...
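
For reference, what I tried is roughly the following (a sketch only;
the exact hunk I tested is not preserved here, and the call site in
the offline path is my assumption of where it would go):

---
	/* Settle transiently off-LRU pages before the movability scan. */
	lru_add_drain_all();	/* drain per-CPU pagevecs onto the LRU lists */
	drain_all_pages(zone);	/* return per-CPU free pages to the buddy allocator */
	/* ... then run the has_unmovable_pages() scan as before ... */
---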

I have another idea. I found that if the page belongs to the
Movable zone and has the Uptodate flag, the page will go to an LRU
list soon, so I think we can skip the page.
Does the idea make sense? As far as I have tested it, it works well.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52d9efe8c9fb..ecf87bec8ac6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7758,6 +7758,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
                if (__PageMovable(page))
                        continue;

+               if ((zone_idx(zone) == ZONE_MOVABLE) && PageUptodate(page))
+                       continue;
+
                if (!PageLRU(page))
                        found++;
                /*

Thanks,
Masa

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-09-17 13:26         ` Masayoshi Mizuma
@ 2018-09-19  1:54           ` Naoya Horiguchi
  2018-09-19 18:15             ` Masayoshi Mizuma
  0 siblings, 1 reply; 13+ messages in thread
From: Naoya Horiguchi @ 2018-09-19  1:54 UTC (permalink / raw)
  To: Masayoshi Mizuma; +Cc: Pavel.Tatashin, linux-mm, mhocko, linux-kernel, x86

On Mon, Sep 17, 2018 at 09:26:07AM -0400, Masayoshi Mizuma wrote:
> On Fri, Aug 31, 2018 at 02:55:36AM +0000, Naoya Horiguchi wrote:
> > [...]
> > Doing lru_add_drain_all()/drain_all_pages() before has_unmovable_pages()
> > might be helpful?
> 
> Thank you for your proposal! And sorry for the delayed response.
> 
> lru_add_drain_all()/drain_all_pages() might be helpful, but it
> seems the window is not so small: I tried offlining several
> times, and every attempt failed...

OK, so this doesn't work, thank you for trying.

> 
> I have another idea. I found that if the page belongs to the
> Movable zone and has the Uptodate flag, the page will go to an LRU
> list soon, so I think we can skip the page.
> Does the idea make sense? As far as I have tested it, it works well.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 52d9efe8c9fb..ecf87bec8ac6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7758,6 +7758,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>                 if (__PageMovable(page))
>                         continue;
> 
> +               if ((zone_idx(zone) == ZONE_MOVABLE) && PageUptodate(page))
> +                       continue;
> +

We have many call sites calling SetPageUptodate (many of them in
filesystems), so I'm concerned that some caller might set PageUptodate
on non-LRU pages.
Could you explain a little more how/why this check cleanly separates
movable pages from unmovable pages?
(Is filesystem metadata never allocated from ZONE_MOVABLE?)
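
For example, a read path can look roughly like this (a hypothetical
helper, only to illustrate that Uptodate means "the contents are
valid" and implies nothing about LRU linkage):

---
/* Hypothetical example, not taken from any real filesystem. */
static int example_fill_page(struct page *page)
{
	/* ... read the backing device into the page here ... */
	SetPageUptodate(page);	/* contents valid; PageLRU may still be clear */
	unlock_page(page);
	return 0;
}
---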

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm: zero remaining unavailable struct pages
  2018-09-19  1:54           ` Naoya Horiguchi
@ 2018-09-19 18:15             ` Masayoshi Mizuma
  0 siblings, 0 replies; 13+ messages in thread
From: Masayoshi Mizuma @ 2018-09-19 18:15 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Pavel.Tatashin, linux-mm, mhocko, linux-kernel, x86

On Wed, Sep 19, 2018 at 01:54:40AM +0000, Naoya Horiguchi wrote:
> On Mon, Sep 17, 2018 at 09:26:07AM -0400, Masayoshi Mizuma wrote:
> > [...]
> > 
> > I have another idea. I found that if the page belongs to the
> > Movable zone and has the Uptodate flag, the page will go to an LRU
> > list soon, so I think we can skip the page.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 52d9efe8c9fb..ecf87bec8ac6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7758,6 +7758,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
> >                 if (__PageMovable(page))
> >                         continue;
> > 
> > +               if ((zone_idx(zone) == ZONE_MOVABLE) && PageUptodate(page))
> > +                       continue;
> > +
> 
> We have many call sites calling SetPageUptodate (many of them in
> filesystems), so I'm concerned that some caller might set PageUptodate
> on non-LRU pages.
> Could you explain a little more how/why this check cleanly separates
> movable pages from unmovable pages?
> (Is filesystem metadata never allocated from ZONE_MOVABLE?)

Thanks, this is a good question.
As far as I can see, a caller that gets pages from the movable zone
either sets PageUptodate or the page goes to an LRU list soon. But,
yes, that is not guaranteed, so we should not use the check...

I have rethought this. We may not need the Uptodate flag check
here, because ZONE_MOVABLE basically contains only movable pages
and the additional checks are already done there.
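
By the additional checks I mean the scan already special-cases
hugepages and non-LRU movable pages; a rough paraphrase of the
existing loop (not an exact quote):

---
		if (PageHuge(page)) {
			/* movable only if hugepage migration is supported */
			if (!hugepage_migration_supported(page_hstate(page)))
				goto unmovable;
			continue;
		}
		if (__PageMovable(page))
			continue;	/* driver-marked non-LRU movable page */
		if (!PageLRU(page))
			found++;	/* otherwise counted as possibly unmovable */
---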

Alternatively, PAGE_MAPPING_MOVABLE could be set in page->mapping
when the movable page is allocated.
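
For reference, the existing convention for non-LRU movable pages is
keyed off the low bits of page->mapping; the current helper in
include/linux/page-flags.h looks roughly like this:

---
static __always_inline int __PageMovable(struct page *page)
{
	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
				PAGE_MAPPING_MOVABLE;
}
---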

Thanks,
Masa

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread

Thread overview: 13+ messages
2018-08-23 18:25 [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Masayoshi Mizuma
2018-08-23 18:25 ` [PATCH 2/2] mm: zero remaining unavailable struct pages Masayoshi Mizuma
2018-08-27 23:33   ` Pasha Tatashin
2018-08-29 15:16     ` Masayoshi Mizuma
2018-08-31  2:55       ` Naoya Horiguchi
2018-09-17 13:26         ` Masayoshi Mizuma
2018-09-19  1:54           ` Naoya Horiguchi
2018-09-19 18:15             ` Masayoshi Mizuma
2018-08-24  0:03 ` [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" Naoya Horiguchi
2018-08-24  8:29   ` Michal Hocko
2018-08-27 12:31     ` Masayoshi Mizuma
2018-08-27 13:29       ` Pasha Tatashin
2018-08-27 23:18 ` Pasha Tatashin
