* [v6 00/15] complete deferred page initialization
@ 2017-08-07 20:38 Pavel Tatashin
  2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only existing low pages Pavel Tatashin
                   ` (15 more replies)
  0 siblings, 16 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Changelog:
v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested the ARM64 code in QEMU and found a few more issues, which I fixed in
  this iteration
- Added page roundup/rounddown to the x86 and arm zeroing routines to zero the
  whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before it is initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
    * Split the platform changes into 4 patches
    * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all CPUs have been started so
it can be done in parallel.
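
For reference, the option is gated at build time; a rough sketch of its
Kconfig entry (abbreviated, from memory of mm/Kconfig in this era, so the
exact dependencies may differ):

    config DEFERRED_STRUCT_PAGE_INIT
            bool "Defer initialisation of struct pages to kthreads"
            depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
            depends on NO_BOOTMEM && MEMORY_HOTPLUG
            depends on !FLATMEM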

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read a struct page until it has been initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

Performance improvements on an x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz

Single threaded struct page init: 7.6s/T improvement
Deferred struct page init: 10.2s/T improvement
(s/T: seconds saved per terabyte of system memory)

Pavel Tatashin (15):
  x86/mm: reserve only existing low pages
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: discard memblock data later
  mm: don't access uninitialized struct pages
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: explicitly zero pagetable memory
  mm: stop zeroing memory during allocation in vmemmap
  mm: optimize early system hash allocations
  mm: debug for raw allocator

 arch/arm64/mm/kasan_init.c          |  42 ++++++++++
 arch/sparc/include/asm/pgtable_64.h |  30 +++++++
 arch/sparc/mm/init_64.c             |  31 +++-----
 arch/x86/kernel/setup.c             |   5 +-
 arch/x86/mm/init_64.c               |   9 ++-
 arch/x86/mm/kasan_init_64.c         |  67 ++++++++++++++++
 include/linux/bootmem.h             |  27 +++++++
 include/linux/memblock.h            |   9 ++-
 include/linux/mm.h                  |   9 +++
 mm/memblock.c                       | 152 ++++++++++++++++++++++++++++--------
 mm/nobootmem.c                      |  16 ----
 mm/page_alloc.c                     |  31 +++++---
 mm/sparse-vmemmap.c                 |  10 ++-
 mm/sparse.c                         |   6 +-
 14 files changed, 356 insertions(+), 88 deletions(-)

-- 
2.14.0


* [v6 01/15] x86/mm: reserve only existing low pages
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11  8:07   ` Michal Hocko
  2017-08-14 13:55   ` Michal Hocko
  2017-08-07 20:38 ` [v6 02/15] x86/mm: setting fields in deferred pages Pavel Tatashin
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Struct pages are initialized by going through __init_single_page(). Since
the existing physical memory in memblock is represented in the
memblock.memory list, the struct page for every page from this list goes
through __init_single_page().

The second memblock list, memblock.reserved, manages allocated memory:
memory that won't be available to the kernel allocator. Every page from
this list goes through reserve_bootmem_region(), where certain struct page
fields are set on the assumption that the struct pages have been
initialized beforehand.

In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
memblock.memory might start at a later PFN. For example, in QEMU,
e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
so PFN 0 is not on memblock.memory (and hence isn't initialized via
__init_single_page) but is on memblock.reserved (and hence we set fields in
the uninitialized struct page).

Currently, the struct page memory is always zeroed during allocation,
which prevents this problem from being detected. But, if some asserts
provided by CONFIG_DEBUG_VM_PGFLAGS are tightened, this problem may become
visible in existing kernels.

In this patchset we will stop zeroing struct page memory during allocation.
Therefore, this bug must be fixed in order to avoid random assert failures
caused by CONFIG_DEBUG_VM_PGFLAGS triggers.

The fix is to reserve memory from the first existing PFN.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/x86/kernel/setup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d0498800..489cdc141bcb 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
 
 static void __init trim_low_memory_range(void)
 {
-	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
+	unsigned long min_pfn = find_min_pfn_with_active_regions();
+	phys_addr_t base = min_pfn << PAGE_SHIFT;
+
+	memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
 }
 	
 /*
-- 
2.14.0


* [v6 02/15] x86/mm: setting fields in deferred pages
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
  2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only existing low pages Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11  9:02   ` Michal Hocko
  2017-08-07 20:38 ` [v6 03/15] sparc64/mm: " Pavel Tatashin
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to their
first initialization in __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initialization:

        mem_init() {
                register_page_bootmem_info();
                free_all_bootmem();
                ...
        }

When register_page_bootmem_info() is called, only non-deferred struct pages
have been initialized. But this function walks some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
    register_page_bootmem_info_node
     get_page_bootmem
      .. setting fields here ..
      such as: page->freelist = (void *)type;
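
For reference, get_page_bootmem() in mm/memory_hotplug.c looks roughly like
this at the time of this series (quoted from memory; details may differ):

    void get_page_bootmem(unsigned long info, struct page *page,
                          unsigned long type)
    {
            page->freelist = (void *)type;  /* store before page is init'd */
            SetPagePrivate(page);           /* flag bit set on raw memory */
            set_page_private(page, info);
            page_ref_inc(page);             /* refcount bump on raw memory */
    }

All four stores touch fields of a struct page that may not have been
initialized yet.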

We end up with a similar issue as in the previous patch: currently we do
not observe the problem because the memory is zeroed, but if the flag
asserts are tightened we can start hitting failures.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the two calls above.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/x86/mm/init_64.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 136422d7d539..1e863baec847 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1165,12 +1165,17 @@ void __init mem_init(void)
 
 	/* clear_bss() already clear the empty_zero_page */
 
-	register_page_bootmem_info();
-
 	/* this will put all memory onto the freelists */
 	free_all_bootmem();
 	after_bootmem = 1;
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/* Register memory areas for /proc/kcore */
 	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 			 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.0


* [v6 03/15] sparc64/mm: setting fields in deferred pages
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
  2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only existing low pages Pavel Tatashin
  2017-08-07 20:38 ` [v6 02/15] x86/mm: setting fields in deferred pages Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-07 20:38 ` [v6 04/15] mm: discard memblock data later Pavel Tatashin
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to their
first initialization in __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initialization:

 mem_init() {
	 register_page_bootmem_info();
	 free_all_bootmem();
	 ...
 }

When register_page_bootmem_info() is called, only non-deferred struct pages
have been initialized. But this function walks some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

mem_init
register_page_bootmem_info
 register_page_bootmem_info_node
  get_page_bootmem
   .. setting fields here ..
   such as: page->freelist = (void *)type;

We end up with a similar issue as in the previous patch: currently we do
not observe the problem because the memory is zeroed, but if the flag
asserts are tightened we can start hitting failures.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the two calls above.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/sparc/mm/init_64.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index fed73f14aa49..25ded711ab6c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2487,9 +2487,15 @@ void __init mem_init(void)
 {
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-	register_page_bootmem_info();
 	free_all_bootmem();
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/*
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
-- 
2.14.0


* [v6 04/15] mm: discard memblock data later
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (2 preceding siblings ...)
  2017-08-07 20:38 ` [v6 03/15] sparc64/mm: " Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11  9:32   ` Michal Hocko
  2017-08-07 20:38 ` [v6 05/15] mm: don't access uninitialized struct pages Pavel Tatashin
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

There is an existing use-after-free bug when deferred struct pages are
enabled:

memblock_add() allocates memory for the memory regions array if more than
128 entries are needed.  See the comment in e820__memblock_setup():

  * The bootstrap memblock region count maximum is 128 entries
  * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
  * than that - so allow memblock resizing.

This memblock memory is freed here:
        free_low_memory_core_early()

We access the freed memblock.memory later in boot when deferred pages are
initialized in this path:

        deferred_init_memmap()
                for_each_mem_pfn_range()
                  __next_mem_pfn_range()
                    type = &memblock.memory;

One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded,
at least on systems where deferred struct pages were enabled.

Another reason to fix this problem in this patch series is that, in the
next patch, we will need to access memblock.reserved from
deferred_init_memmap().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h |  6 ++++--
 mm/memblock.c            | 38 +++++++++++++++++---------------------
 mm/nobootmem.c           | 16 ----------------
 mm/page_alloc.c          |  4 ++++
 4 files changed, 25 insertions(+), 39 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 77d427974f57..bae11c7e7bf3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -61,6 +61,7 @@ extern int memblock_debug;
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
 #define __init_memblock __meminit
 #define __initdata_memblock __meminitdata
+void memblock_discard(void);
 #else
 #define __init_memblock
 #define __initdata_memblock
@@ -74,8 +75,6 @@ phys_addr_t memblock_find_in_range_node(phys_addr_t size, phys_addr_t align,
 					int nid, ulong flags);
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
 				   phys_addr_t size, phys_addr_t align);
-phys_addr_t get_allocated_memblock_reserved_regions_info(phys_addr_t *addr);
-phys_addr_t get_allocated_memblock_memory_regions_info(phys_addr_t *addr);
 void memblock_allow_resize(void);
 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
 int memblock_add(phys_addr_t base, phys_addr_t size);
@@ -110,6 +109,9 @@ void __next_mem_range_rev(u64 *idx, int nid, ulong flags,
 void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
 				phys_addr_t *out_end);
 
+void __memblock_free_early(phys_addr_t base, phys_addr_t size);
+void __memblock_free_late(phys_addr_t base, phys_addr_t size);
+
 /**
  * for_each_mem_range - iterate through memblock areas from type_a and not
  * included in type_b. Or just type_a if type_b is NULL.
diff --git a/mm/memblock.c b/mm/memblock.c
index 2cb25fe4452c..bf14aea6ab70 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
 }
 
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
-
-phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
-					phys_addr_t *addr)
-{
-	if (memblock.reserved.regions == memblock_reserved_init_regions)
-		return 0;
-
-	*addr = __pa(memblock.reserved.regions);
-
-	return PAGE_ALIGN(sizeof(struct memblock_region) *
-			  memblock.reserved.max);
-}
-
-phys_addr_t __init_memblock get_allocated_memblock_memory_regions_info(
-					phys_addr_t *addr)
+/**
+ * Discard the memory and reserved arrays if they were dynamically allocated
+ */
+void __init memblock_discard(void)
 {
-	if (memblock.memory.regions == memblock_memory_init_regions)
-		return 0;
+	phys_addr_t addr, size;
 
-	*addr = __pa(memblock.memory.regions);
+	if (memblock.reserved.regions != memblock_reserved_init_regions) {
+		addr = __pa(memblock.reserved.regions);
+		size = PAGE_ALIGN(sizeof(struct memblock_region) *
+				  memblock.reserved.max);
+		__memblock_free_late(addr, size);
+	}
 
-	return PAGE_ALIGN(sizeof(struct memblock_region) *
-			  memblock.memory.max);
+	if (memblock.memory.regions != memblock_memory_init_regions) {
+		addr = __pa(memblock.memory.regions);
+		size = PAGE_ALIGN(sizeof(struct memblock_region) *
+				  memblock.memory.max);
+		__memblock_free_late(addr, size);
+	}
 }
-
 #endif
 
 /**
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 36454d0f96ee..3637809a18d0 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -146,22 +146,6 @@ static unsigned long __init free_low_memory_core_early(void)
 				NULL)
 		count += __free_memory_core(start, end);
 
-#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
-	{
-		phys_addr_t size;
-
-		/* Free memblock.reserved array if it was allocated */
-		size = get_allocated_memblock_reserved_regions_info(&start);
-		if (size)
-			count += __free_memory_core(start, start + size);
-
-		/* Free memblock.memory array if it was allocated */
-		size = get_allocated_memblock_memory_regions_info(&start);
-		if (size)
-			count += __free_memory_core(start, start + size);
-	}
-#endif
-
 	return count;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc32aa81f359..63d16c185736 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1584,6 +1584,10 @@ void __init page_alloc_init_late(void)
 	/* Reinit limits that are based on free pages after the kernel is up */
 	files_maxfiles_init();
 #endif
+#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
+	/* Discard memblock private memory */
+	memblock_discard();
+#endif
 
 	for_each_populated_zone(zone)
 		set_zone_contiguous(zone);
-- 
2.14.0


* [v6 05/15] mm: don't access uninitialized struct pages
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (3 preceding siblings ...)
  2017-08-07 20:38 ` [v6 04/15] mm: discard memblock data later Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11  9:37   ` Michal Hocko
  2017-08-15  9:33   ` Michal Hocko
  2017-08-07 20:38 ` [v6 06/15] sparc64: simplify vmemmap_populate Pavel Tatashin
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

In deferred_init_memmap(), where all deferred struct pages are initialized,
we have a check like this:

    if (page->flags) {
            VM_BUG_ON(page_zone(page) != zone);
            goto free_range;
    }

This checks whether the current deferred page has already been initialized.
It works because the memory for struct pages has been zeroed, so the only
way flags can be non-zero is if the page already went through
__init_single_page(). But once we change the current behavior and stop
zeroing the memory in the memblock allocator, we cannot trust anything
inside a struct page until it is initialized. This patch fixes that.

This patch defines a new accessor, memblock_get_reserved_pfn_range(), which
returns successive ranges of reserved PFNs.  deferred_init_memmap() calls
it to determine whether a PFN and its struct page have already been
initialized.
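
As an illustration of the accessor's contract (a hypothetical caller, not
part of this patch), walking all reserved PFN ranges looks like:

    unsigned long resv_start, resv_end, pfn = 0;

    for (;;) {
            memblock_get_reserved_pfn_range(pfn, &resv_start, &resv_end);
            if (resv_start == ULONG_MAX)
                    break;          /* no further reserved regions */
            /* PFNs in [resv_start, resv_end) are memblock-reserved */
            pfn = resv_end;
    }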

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h |  3 +++
 mm/memblock.c            | 54 ++++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c          | 11 +++++++++-
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..b6a2a610f5e1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -320,6 +320,9 @@ int memblock_is_map_memory(phys_addr_t addr);
 int memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
 bool memblock_is_reserved(phys_addr_t addr);
 bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
+void memblock_get_reserved_pfn_range(unsigned long pfn,
+				     unsigned long *pfn_start,
+				     unsigned long *pfn_end);
 
 extern void __memblock_dump_all(void);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index bf14aea6ab70..08f449acfdd1 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1580,7 +1580,13 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
 	memblock_cap_memory_range(0, max_addr);
 }
 
-static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
+/**
+ * Return the index in the regions array if addr is within a region. Otherwise
+ * return -1. If -1 is returned and next_idx is not %NULL, set it to the index
+ * of the next region, or -1 if there is none.
+ */
+static int __init_memblock memblock_search(struct memblock_type *type,
+					   phys_addr_t addr, int *next_idx)
 {
 	unsigned int left = 0, right = type->cnt;
 
@@ -1595,22 +1601,26 @@ static int __init_memblock memblock_search(struct memblock_type *type, phys_addr
 		else
 			return mid;
 	} while (left < right);
+
+	if (next_idx)
+		*next_idx = (right == type->cnt) ? -1 : right;
+
 	return -1;
 }
 
 bool __init memblock_is_reserved(phys_addr_t addr)
 {
-	return memblock_search(&memblock.reserved, addr) != -1;
+	return memblock_search(&memblock.reserved, addr, NULL) != -1;
 }
 
 bool __init_memblock memblock_is_memory(phys_addr_t addr)
 {
-	return memblock_search(&memblock.memory, addr) != -1;
+	return memblock_search(&memblock.memory, addr, NULL) != -1;
 }
 
 int __init_memblock memblock_is_map_memory(phys_addr_t addr)
 {
-	int i = memblock_search(&memblock.memory, addr);
+	int i = memblock_search(&memblock.memory, addr, NULL);
 
 	if (i == -1)
 		return false;
@@ -1622,7 +1632,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
 			 unsigned long *start_pfn, unsigned long *end_pfn)
 {
 	struct memblock_type *type = &memblock.memory;
-	int mid = memblock_search(type, PFN_PHYS(pfn));
+	int mid = memblock_search(type, PFN_PHYS(pfn), NULL);
 
 	if (mid == -1)
 		return -1;
@@ -1646,7 +1656,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
  */
 int __init_memblock memblock_is_region_memory(phys_addr_t base, phys_addr_t size)
 {
-	int idx = memblock_search(&memblock.memory, base);
+	int idx = memblock_search(&memblock.memory, base, NULL);
 	phys_addr_t end = base + memblock_cap_size(base, &size);
 
 	if (idx == -1)
@@ -1655,6 +1665,38 @@ int __init_memblock memblock_is_region_memory(phys_addr_t base, phys_addr_t size
 		 memblock.memory.regions[idx].size) >= end;
 }
 
+/**
+ * memblock_get_reserved_pfn_range - search for the next reserved region
+ *
+ * @pfn: start searching from this pfn.
+ *
+ * RETURNS:
+ * [start_pfn, end_pfn), where start_pfn >= pfn. If none is found,
+ * start_pfn and end_pfn are both set to ULONG_MAX.
+ */
+void __init_memblock memblock_get_reserved_pfn_range(unsigned long pfn,
+						     unsigned long *start_pfn,
+						     unsigned long *end_pfn)
+{
+	struct memblock_type *type = &memblock.reserved;
+	int next_idx, idx;
+
+	idx = memblock_search(type, PFN_PHYS(pfn), &next_idx);
+	if (idx == -1 && next_idx == -1) {
+		*start_pfn = ULONG_MAX;
+		*end_pfn = ULONG_MAX;
+		return;
+	}
+
+	if (idx == -1) {
+		idx = next_idx;
+		*start_pfn = PFN_DOWN(type->regions[idx].base);
+	} else {
+		*start_pfn = pfn;
+	}
+	*end_pfn = PFN_DOWN(type->regions[idx].base + type->regions[idx].size);
+}
+
 /**
  * memblock_is_region_reserved - check if a region intersects reserved memory
  * @base: base of region to check
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63d16c185736..983de0a8047b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1447,6 +1447,7 @@ static int __init deferred_init_memmap(void *data)
 	pg_data_t *pgdat = data;
 	int nid = pgdat->node_id;
 	struct mminit_pfnnid_cache nid_init_state = { };
+	unsigned long resv_start_pfn = 0, resv_end_pfn = 0;
 	unsigned long start = jiffies;
 	unsigned long nr_pages = 0;
 	unsigned long walk_start, walk_end;
@@ -1491,6 +1492,10 @@ static int __init deferred_init_memmap(void *data)
 			pfn = zone->zone_start_pfn;
 
 		for (; pfn < end_pfn; pfn++) {
+			if (pfn >= resv_end_pfn)
+				memblock_get_reserved_pfn_range(pfn,
+								&resv_start_pfn,
+								&resv_end_pfn);
 			if (!pfn_valid_within(pfn))
 				goto free_range;
 
@@ -1524,7 +1529,11 @@ static int __init deferred_init_memmap(void *data)
 				cond_resched();
 			}
 
-			if (page->flags) {
+			/*
+			 * Check if this page has already been initialized due
+			 * to being reserved during boot in memblock.
+			 */
+			if (pfn >= resv_start_pfn) {
 				VM_BUG_ON(page_zone(page) != zone);
 				goto free_range;
 			}
-- 
2.14.0


* [v6 06/15] sparc64: simplify vmemmap_populate
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (4 preceding siblings ...)
  2017-08-07 20:38 ` [v6 05/15] mm: don't access uninitialized struct pages Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-07 20:38 ` [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Remove duplicated code by using the common functions
vmemmap_pud_populate() and vmemmap_pgd_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/sparc/mm/init_64.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 25ded711ab6c..ffb25be19389 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2590,30 +2590,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.0


* [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (5 preceding siblings ...)
  2017-08-07 20:38 ` [v6 06/15] sparc64: simplify vmemmap_populate Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 12:39   ` Michal Hocko
  2017-08-07 20:38 ` [v6 08/15] mm: zero struct pages during initialization Pavel Tatashin
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

A new variant of the memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
    - Does not zero the allocated memory
    - Does not panic if the request cannot be satisfied
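
A minimal caller sketch (hypothetical; the real conversions come in later
patches of this series). The caller must handle a NULL return and must not
assume the memory is zeroed:

    void *map = memblock_virt_alloc_try_nid_raw(size, PAGE_SIZE,
                                                __pa(MAX_DMA_ADDRESS),
                                                BOOTMEM_ALLOC_ACCESSIBLE,
                                                nid);
    if (!map)
            return -ENOMEM; /* the _raw variant does not panic */
    /* every byte of map must be written before it is read */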

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/bootmem.h | 27 +++++++++++++++++++++++++
 mm/memblock.c           | 53 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
 					    NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+					    BOOTMEM_ALLOC_ACCESSIBLE,
+					    NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
 	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	if (!align)
+		align = SMP_CACHE_BYTES;
+	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
 					  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+				min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t size, phys_addr_t align,
 			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 08f449acfdd1..3fbf3bcb52d9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,38 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr,
+			int nid)
+{
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, (void *)_RET_IP_);
+
+	return memblock_virt_alloc_internal(size, align,
+					    min_addr, max_addr, nid);
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1351,8 +1382,8 @@ static void * __init memblock_virt_alloc_internal(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
- * additional debug information (including caller info), if enabled.
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. This function zeroes the allocated memory.
  *
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
@@ -1362,11 +1393,17 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 				phys_addr_t min_addr, phys_addr_t max_addr,
 				int nid)
 {
+	void *ptr;
+
 	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+	if (ptr)
+		memset(ptr, 0, size);
+	return ptr;
 }
 
 /**
@@ -1380,7 +1417,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public panicking version of _memblock_virt_alloc_try_nid_nopanic()
+ * Public panicking version of memblock_virt_alloc_try_nid_nopanic()
  * which provides debug information (including caller info), if enabled,
  * and panics if the request can not be satisfied.
  *
@@ -1399,8 +1436,10 @@ void * __init memblock_virt_alloc_try_nid(
 		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
 					   min_addr, max_addr, nid);
-	if (ptr)
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-- 
2.14.0


* [v6 08/15] mm: zero struct pages during initialization
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (6 preceding siblings ...)
  2017-08-07 20:38 ` [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 12:50   ` Michal Hocko
  2017-08-07 20:38 ` [v6 09/15] sparc64: optimized struct page zeroing Pavel Tatashin
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Add struct page zeroing as part of the initialization of the other fields
in __init_single_page().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/mm.h | 9 +++++++++
 mm/page_alloc.c    | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..183ac5e733db 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -93,6 +93,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 983de0a8047b..4d32c1fa4c6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
-- 
2.14.0


* [v6 09/15] sparc64: optimized struct page zeroing
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (7 preceding siblings ...)
  2017-08-07 20:38 ` [v6 08/15] mm: zero struct pages during initialization Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 12:53   ` Michal Hocko
  2017-08-07 20:38 ` [v6 10/15] x86/kasan: explicitly zero kasan shadow memory Pavel Tatashin
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, based on the size of
struct page. The compiler optimizes out the conditions of the switch()
statement.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931f0570..cee5cc7ccc51 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * bytes or shrinks below 64 bytes.
+ * The idea is that the compiler optimizes out the switch() statement and
+ * leaves only clrx instructions.
+ */
+#define	mm_zero_struct_page(pp) do {					\
+	unsigned long *_pp = (void *)(pp);				\
+									\
+	 /* Check that struct page is either 64, 72, or 80 bytes */	\
+	BUILD_BUG_ON(sizeof(struct page) & 7);				\
+	BUILD_BUG_ON(sizeof(struct page) < 64);				\
+	BUILD_BUG_ON(sizeof(struct page) > 80);				\
+									\
+	switch (sizeof(struct page)) {					\
+	case 80:							\
+		_pp[9] = 0;	/* fallthrough */			\
+	case 72:							\
+		_pp[8] = 0;	/* fallthrough */			\
+	default:							\
+		_pp[7] = 0;						\
+		_pp[6] = 0;						\
+		_pp[5] = 0;						\
+		_pp[4] = 0;						\
+		_pp[3] = 0;						\
+		_pp[2] = 0;						\
+		_pp[1] = 0;						\
+		_pp[0] = 0;						\
+	}								\
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.0


* [v6 10/15] x86/kasan: explicitly zero kasan shadow memory
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (8 preceding siblings ...)
  2017-08-07 20:38 ` [v6 09/15] sparc64: optimized struct page zeroing Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-07 20:38 ` [v6 11/15] arm64/kasan: " Pavel Tatashin
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 67 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 02c9d7553409..ec6b2272fd80 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -84,6 +84,66 @@ static struct notifier_block kasan_die_notifier = {
 };
 #endif
 
+/*
+ * x86 variant of vmemmap_populate() uses either PMD_SIZE pages or base pages
+ * to map allocated memory.  This routine determines the page size for the given
+ * address from vmemmap.
+ */
+static u64 get_vmemmap_pgsz(u64 addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(addr);
+	BUG_ON(pgd_none(*pgd) || pgd_large(*pgd));
+
+	p4d = p4d_offset(pgd, addr);
+	BUG_ON(p4d_none(*p4d) || p4d_large(*p4d));
+
+	pud = pud_offset(p4d, addr);
+	BUG_ON(pud_none(*pud) || pud_large(*pud));
+
+	pmd = pmd_offset(pud, addr);
+	BUG_ON(pmd_none(*pmd));
+
+	if (pmd_large(*pmd))
+		return PMD_SIZE;
+	return PAGE_SIZE;
+}
+
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	u64 i, start, end;
+
+	for (i = 0; i < E820_MAX_ENTRIES && pfn_mapped[i].end; i++) {
+		void *kaddr_start = pfn_to_kaddr(pfn_mapped[i].start);
+		void *kaddr_end = pfn_to_kaddr(pfn_mapped[i].end);
+
+		start = (u64)kasan_mem_to_shadow(kaddr_start);
+		end = (u64)kasan_mem_to_shadow(kaddr_end);
+
+	/* Round start down and end up to the mapped page boundaries */
+		start = rounddown(start, get_vmemmap_pgsz(start));
+		end = roundup(end, get_vmemmap_pgsz(start));
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_stext);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round start down and end up to the mapped page boundaries */
+	start = rounddown(start, get_vmemmap_pgsz(start));
+	end = roundup(end, get_vmemmap_pgsz(start));
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_early_init(void)
 {
 	int i;
@@ -156,6 +216,13 @@ void __init kasan_init(void)
 		pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO);
 		set_pte(&kasan_zero_pte[i], pte);
 	}
+
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/* Flush TLBs again to be sure that write protection applied. */
 	__flush_tlb_all();
 
-- 
2.14.0


* [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (9 preceding siblings ...)
  2017-08-07 20:38 ` [v6 10/15] x86/kasan: explicitly zero kasan shadow memory Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-08  9:07   ` Will Deacon
  2017-08-07 20:38 ` [v6 12/15] mm: explicitly zero pagetable memory Pavel Tatashin
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..e78a9ecbb687 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	struct memblock_region *reg;
+	u64 start, end;
+
+	for_each_memblock(memory, reg) {
+		start = __phys_to_virt(reg->base);
+		end = __phys_to_virt(reg->base + reg->size);
+
+		if (start >= end)
+			break;
+
+		start = (u64)kasan_mem_to_shadow((void *)start);
+		end = (u64)kasan_mem_to_shadow((void *)end);
+
+		/* Round start down and end up to the mapped page boundaries */
+		start = round_down(start, SWAPPER_BLOCK_SIZE);
+		end = round_up(end, SWAPPER_BLOCK_SIZE);
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_text);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round start down and end up to the mapped page boundaries */
+	start = round_down(start, SWAPPER_BLOCK_SIZE);
+	end = round_up(end, SWAPPER_BLOCK_SIZE);
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
@@ -205,8 +240,15 @@ void __init kasan_init(void)
 			pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO));
 
 	memset(kasan_zero_page, 0, PAGE_SIZE);
+
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/* At this point kasan is fully initialized. Enable error messages */
 	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
-- 
2.14.0


* [v6 12/15] mm: explicitly zero pagetable memory
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (10 preceding siblings ...)
  2017-08-07 20:38 ` [v6 11/15] arm64/kasan: " Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-07 20:38 ` [v6 13/15] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Soon vmemmap_alloc_block() will no longer zero the block, so zero the
memory at its call sites for everything except struct pages.  Struct page
memory is zeroed by struct page initialization.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 mm/sparse-vmemmap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c50b1a14d55e..d40c721ab19f 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -191,6 +191,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
+		memset(p, 0, PAGE_SIZE);
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -203,6 +204,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
+		memset(p, 0, PAGE_SIZE);
 		pud_populate(&init_mm, pud, p);
 	}
 	return pud;
@@ -215,6 +217,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
+		memset(p, 0, PAGE_SIZE);
 		p4d_populate(&init_mm, p4d, p);
 	}
 	return p4d;
@@ -227,6 +230,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
+		memset(p, 0, PAGE_SIZE);
 		pgd_populate(&init_mm, pgd, p);
 	}
 	return pgd;
-- 
2.14.0


* [v6 13/15] mm: stop zeroing memory during allocation in vmemmap
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (11 preceding siblings ...)
  2017-08-07 20:38 ` [v6 12/15] mm: explicitly zero pagetable memory Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 13:04   ` Michal Hocko
  2017-08-07 20:38 ` [v6 14/15] mm: optimize early system hash allocations Pavel Tatashin
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Replace the allocators in sparse-vmemmap with the non-zeroing variant, so
we get a performance improvement from zeroing the memory in parallel when
struct pages are initialized.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 mm/sparse-vmemmap.c | 6 +++---
 mm/sparse.c         | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d40c721ab19f..3b646b5ce1b6 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
 		if (node_state(node, N_HIGH_MEMORY))
 			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
+				node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		else
 			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
+				GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		if (page)
 			return page_address(page);
diff --git a/mm/sparse.c b/mm/sparse.c
index 7b4be3fd5cac..0e315766ad11 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -441,9 +441,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.0


* [v6 14/15] mm: optimize early system hash allocations
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (12 preceding siblings ...)
  2017-08-07 20:38 ` [v6 13/15] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 13:05   ` Michal Hocko
  2017-08-07 20:38 ` [v6 15/15] mm: debug for raw allocator Pavel Tatashin
  2017-08-11  7:58 ` [v6 00/15] complete deferred page initialization Michal Hocko
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for the system hash needs to be zeroed;
otherwise, the memory does not need to be zeroed, and the client will
initialize it.

If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface, and thus improve boot performance.
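
For example, the dentry hash table is an early caller that does want zeroed
memory (illustrative snippet based on fs/dcache.c of this era):

    dentry_hashtable =
            alloc_large_system_hash("Dentry cache",
                                    sizeof(struct hlist_bl_head),
                                    dhash_entries, 13,
                                    HASH_EARLY | HASH_ZERO,
                                    &d_hash_shift, &d_hash_mask, 0, 0);

With this patch, HASH_EARLY callers that do not pass HASH_ZERO skip the
memset() by going through memblock_virt_alloc_raw().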

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 mm/page_alloc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4d32c1fa4c6c..000806298dfb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7354,18 +7354,17 @@ void *__init alloc_large_system_hash(const char *tablename,
 
 	log2qty = ilog2(numentries);
 
-	/*
-	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
-	 * currently not used when HASH_EARLY is specified.
-	 */
 	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
 		size = bucketsize << log2qty;
-		if (flags & HASH_EARLY)
-			table = memblock_virt_alloc_nopanic(size, 0);
-		else if (hashdist)
+		if (flags & HASH_EARLY) {
+			if (flags & HASH_ZERO)
+				table = memblock_virt_alloc_nopanic(size, 0);
+			else
+				table = memblock_virt_alloc_raw(size, 0);
+		} else if (hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
-		else {
+		} else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
 			 * some pages at the end of hash table which
-- 
2.14.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [v6 15/15] mm: debug for raw allocator
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (13 preceding siblings ...)
  2017-08-07 20:38 ` [v6 14/15] mm: optimize early system hash allocations Pavel Tatashin
@ 2017-08-07 20:38 ` Pavel Tatashin
  2017-08-11 13:08   ` Michal Hocko
  2017-08-11  7:58 ` [v6 00/15] complete deferred page initialization Michal Hocko
  15 siblings, 1 reply; 70+ messages in thread
From: Pavel Tatashin @ 2017-08-07 20:38 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, will.deacon,
	catalin.marinas, sam

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
callers expect zeroed memory.
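
For illustration (hypothetical code, not part of the patch), a caller
that still assumed the raw allocator returns zeroed memory would now see
the 0xff pattern instead of zeroes that merely happened to be there:

	phys_addr_t size = PAGE_SIZE;
	int nid = NUMA_NO_NODE;
	struct page *map;

	map = memblock_virt_alloc_try_nid_raw(size, PAGE_SIZE,
					      __pa(MAX_DMA_ADDRESS),
					      BOOTMEM_ALLOC_ACCESSIBLE, nid);
	/* Wrong assumption: with CONFIG_DEBUG_VM map[0].flags is now
	 * 0xff..ff rather than 0, so this assert fires and makes the
	 * bad assumption visible instead of silently passing.
	 */
	VM_BUG_ON(map && map[0].flags != 0);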

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 mm/memblock.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 3fbf3bcb52d9..29fcb1dd8a81 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1363,12 +1363,19 @@ void * __init memblock_virt_alloc_try_nid_raw(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
+	void *ptr;
+
 	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
 
-	return memblock_virt_alloc_internal(size, align,
-					    min_addr, max_addr, nid);
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+#ifdef CONFIG_DEBUG_VM
+	if (ptr && size > 0)
+		memset(ptr, 0xff, size);
+#endif
+	return ptr;
 }
 
 /**
-- 
2.14.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-07 20:38 ` [v6 11/15] arm64/kasan: " Pavel Tatashin
@ 2017-08-08  9:07   ` Will Deacon
  2017-08-08 11:49     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Will Deacon @ 2017-08-08  9:07 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, catalin.marinas, sam

On Mon, Aug 07, 2017 at 04:38:45PM -0400, Pavel Tatashin wrote:
> To optimize the performance of struct page initialization,
> vmemmap_populate() will no longer zero memory.
> 
> We must explicitly zero the memory that is allocated by vmemmap_populate()
> for kasan, as this memory does not go through struct page initialization
> path.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  arch/arm64/mm/kasan_init.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
> index 81f03959a4ab..e78a9ecbb687 100644
> --- a/arch/arm64/mm/kasan_init.c
> +++ b/arch/arm64/mm/kasan_init.c
> @@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
>  		set_pgd(pgd_offset_k(start), __pgd(0));
>  }
>  
> +/*
> + * Memory that was allocated by vmemmap_populate is not zeroed, so we must
> + * zero it here explicitly.
> + */
> +static void
> +zero_vmemmap_populated_memory(void)
> +{
> +	struct memblock_region *reg;
> +	u64 start, end;
> +
> +	for_each_memblock(memory, reg) {
> +		start = __phys_to_virt(reg->base);
> +		end = __phys_to_virt(reg->base + reg->size);
> +
> +		if (start >= end)
> +			break;
> +
> +		start = (u64)kasan_mem_to_shadow((void *)start);
> +		end = (u64)kasan_mem_to_shadow((void *)end);
> +
> +		/* Round to the start and end of the mapped pages */
> +		start = round_down(start, SWAPPER_BLOCK_SIZE);
> +		end = round_up(end, SWAPPER_BLOCK_SIZE);
> +		memset((void *)start, 0, end - start);
> +	}
> +
> +	start = (u64)kasan_mem_to_shadow(_text);
> +	end = (u64)kasan_mem_to_shadow(_end);
> +
> +	/* Round to the start and end of the mapped pages */
> +	start = round_down(start, SWAPPER_BLOCK_SIZE);
> +	end = round_up(end, SWAPPER_BLOCK_SIZE);
> +	memset((void *)start, 0, end - start);
> +}

I can't help but think this would be an awful lot nicer if you made
vmemmap_alloc_block take extra GFP flags as a parameter. That way, we could
implement a version of vmemmap_populate that does the zeroing when we need
it, without having to duplicate a bunch of the code like this. I think it
would also be less error-prone, because you wouldn't have to do the
allocation and the zeroing in two separate steps.
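
Roughly what I have in mind, as an untested sketch (the early-boot
__GFP_ZERO handling and the dropped highmem-node check are my
simplifications):

	void * __meminit vmemmap_alloc_block(unsigned long size, int node,
					     gfp_t gfp_mask)
	{
		void *p;

		/* If the main allocator is up use that, fallback to bootmem. */
		if (slab_is_available()) {
			struct page *page = alloc_pages_node(node,
					gfp_mask | __GFP_RETRY_MAYFAIL,
					get_order(size));

			return page ? page_address(page) : NULL;
		}

		/* Early path: honour __GFP_ZERO by hand, the raw
		 * allocator will not do it for us.
		 */
		p = __earlyonly_bootmem_alloc(node, size, size,
					      __pa(MAX_DMA_ADDRESS));
		if (p && (gfp_mask & __GFP_ZERO))
			memset(p, 0, size);
		return p;
	}

kasan's populate path would then pass GFP_KERNEL | __GFP_ZERO and the
struct page path plain GFP_KERNEL.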

Will

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-08  9:07   ` Will Deacon
@ 2017-08-08 11:49     ` Pasha Tatashin
  2017-08-08 12:30       ` Will Deacon
  2017-08-08 13:15       ` David Laight
  0 siblings, 2 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-08 11:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, catalin.marinas, sam

Hi Will,

Thank you for looking at this change. What you described was in my 
previous iterations of this project.

See for example here: https://lkml.org/lkml/2017/5/5/369

I was asked to remove that flag, and only zero memory in place when 
needed. Overall the current approach is better everywhere else in the 
kernel, but it adds a little extra code to kasan initialization.

Pasha

On 08/08/2017 05:07 AM, Will Deacon wrote:
> On Mon, Aug 07, 2017 at 04:38:45PM -0400, Pavel Tatashin wrote:
>> To optimize the performance of struct page initialization,
>> vmemmap_populate() will no longer zero memory.
>>
>> We must explicitly zero the memory that is allocated by vmemmap_populate()
>> for kasan, as this memory does not go through struct page initialization
>> path.
>>
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
>> ---
>>   arch/arm64/mm/kasan_init.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
>> index 81f03959a4ab..e78a9ecbb687 100644
>> --- a/arch/arm64/mm/kasan_init.c
>> +++ b/arch/arm64/mm/kasan_init.c
>> @@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
>>   		set_pgd(pgd_offset_k(start), __pgd(0));
>>   }
>>   
>> +/*
>> + * Memory that was allocated by vmemmap_populate is not zeroed, so we must
>> + * zero it here explicitly.
>> + */
>> +static void
>> +zero_vmemmap_populated_memory(void)
>> +{
>> +	struct memblock_region *reg;
>> +	u64 start, end;
>> +
>> +	for_each_memblock(memory, reg) {
>> +		start = __phys_to_virt(reg->base);
>> +		end = __phys_to_virt(reg->base + reg->size);
>> +
>> +		if (start >= end)
>> +			break;
>> +
>> +		start = (u64)kasan_mem_to_shadow((void *)start);
>> +		end = (u64)kasan_mem_to_shadow((void *)end);
>> +
>> +		/* Round to the start and end of the mapped pages */
>> +		start = round_down(start, SWAPPER_BLOCK_SIZE);
>> +		end = round_up(end, SWAPPER_BLOCK_SIZE);
>> +		memset((void *)start, 0, end - start);
>> +	}
>> +
>> +	start = (u64)kasan_mem_to_shadow(_text);
>> +	end = (u64)kasan_mem_to_shadow(_end);
>> +
>> +	/* Round to the start and end of the mapped pages */
>> +	start = round_down(start, SWAPPER_BLOCK_SIZE);
>> +	end = round_up(end, SWAPPER_BLOCK_SIZE);
>> +	memset((void *)start, 0, end - start);
>> +}
> 
> I can't help but think this would be an awful lot nicer if you made
> vmemmap_alloc_block take extra GFP flags as a parameter. That way, we could
> implement a version of vmemmap_populate that does the zeroing when we need
> it, without having to duplicate a bunch of the code like this. I think it
> would also be less error-prone, because you wouldn't have to do the
> allocation and the zeroing in two separate steps.
> 
> Will
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-08 11:49     ` Pasha Tatashin
@ 2017-08-08 12:30       ` Will Deacon
  2017-08-08 12:49         ` Pasha Tatashin
  2017-08-08 13:15       ` David Laight
  1 sibling, 1 reply; 70+ messages in thread
From: Will Deacon @ 2017-08-08 12:30 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, catalin.marinas, sam

On Tue, Aug 08, 2017 at 07:49:22AM -0400, Pasha Tatashin wrote:
> Hi Will,
> 
> Thank you for looking at this change. What you described was in my previous
> iterations of this project.
> 
> See for example here: https://lkml.org/lkml/2017/5/5/369
> 
> I was asked to remove that flag, and only zero memory in place when needed.
> Overall the current approach is better everywhere else in the kernel, but it
> adds a little extra code to kasan initialization.

Damn, I actually prefer the flag :)

But actually, if you look at our implementation of vmemmap_populate, then we
have our own version of vmemmap_populate_basepages that terminates at the
pmd level anyway if ARM64_SWAPPER_USES_SECTION_MAPS. If there's resistance
to do this in the core code, then I'd be inclined to replace our
vmemmap_populate implementation in the arm64 code with a single version that
can terminate at either the PMD or the PTE level, and do zeroing if
required. We're already special-casing it, so we don't really lose anything
imo.

Will

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-08 12:30       ` Will Deacon
@ 2017-08-08 12:49         ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-08 12:49 UTC (permalink / raw)
  To: Will Deacon, ard.biesheuvel
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, catalin.marinas, sam

Hi Will,

 > Damn, I actually prefer the flag :)
 >
 > But actually, if you look at our implementation of vmemmap_populate,
 > then we have our own version of vmemmap_populate_basepages that
 > terminates at the pmd level anyway if ARM64_SWAPPER_USES_SECTION_MAPS.
 > If there's resistance to do this in the core code, then I'd be inclined
 > to replace our vmemmap_populate implementation in the arm64 code with a
 > single version that can terminate at either the PMD or the PTE level,
 > and do zeroing if required. We're already special-casing it, so we
 > don't really lose anything imo.

Another approach is to create a new mapping interface for kasan only, as
Ard Biesheuvel wrote:

 > KASAN uses vmemmap_populate as a convenience: kasan has nothing to do
 > with vmemmap, but the function already existed and happened to do what
 > KASAN requires.
 >
 > Given that that will no longer be the case, it would be far better to
 > stop using vmemmap_populate altogether, and clone it into a KASAN
 > specific version (with an appropriate name) with the zeroing folded
 > into it.

I agree with this statement, but I think it should not be part of this 
project.
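
As an untested sketch, the allocator in such a kasan-only clone could
fold the zeroing in; the function name below is made up:

	static void * __init kasan_alloc_shadow_block(unsigned long size,
						      int node)
	{
		void *p = memblock_virt_alloc_try_nid_raw(size, size,
				__pa(MAX_DMA_ADDRESS),
				BOOTMEM_ALLOC_ACCESSIBLE, node);

		if (p)
			memset(p, 0, size);	/* shadow must start zeroed */
		return p;
	}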

Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-08 11:49     ` Pasha Tatashin
  2017-08-08 12:30       ` Will Deacon
@ 2017-08-08 13:15       ` David Laight
  2017-08-08 13:30         ` Pasha Tatashin
  1 sibling, 1 reply; 70+ messages in thread
From: David Laight @ 2017-08-08 13:15 UTC (permalink / raw)
  To: 'Pasha Tatashin', Will Deacon
  Cc: linux-s390, ard.biesheuvel, sam, borntraeger, catalin.marinas,
	x86, heiko.carstens, linux-kernel, kasan-dev, mhocko, linux-mm,
	willy, sparclinux, linuxppc-dev, davem, linux-arm-kernel

From: Pasha Tatashin
> Sent: 08 August 2017 12:49
> Thank you for looking at this change. What you described was in my
> previous iterations of this project.
> 
> See for example here: https://lkml.org/lkml/2017/5/5/369
> 
> I was asked to remove that flag, and only zero memory in place when
> needed. Overall the current approach is better everywhere else in the
> kernel, but it adds a little extra code to kasan initialization.

Perhaps you could #define the function prototype(s?) so that the flags
are not passed unless it is a kasan build?

	David

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory
  2017-08-08 13:15       ` David Laight
@ 2017-08-08 13:30         ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-08 13:30 UTC (permalink / raw)
  To: David Laight, Will Deacon
  Cc: linux-s390, ard.biesheuvel, sam, borntraeger, catalin.marinas,
	x86, heiko.carstens, linux-kernel, kasan-dev, mhocko, linux-mm,
	willy, sparclinux, linuxppc-dev, davem, linux-arm-kernel

On 2017-08-08 09:15, David Laight wrote:
> From: Pasha Tatashin
>> Sent: 08 August 2017 12:49
>> Thank you for looking at this change. What you described was in my
>> previous iterations of this project.
>>
>> See for example here: https://lkml.org/lkml/2017/5/5/369
>>
>> I was asked to remove that flag, and only zero memory in place when
>> needed. Overall the current approach is better everywhere else in the
>> kernel, but it adds a little extra code to kasan initialization.
> 
> Perhaps you could #define the function prototype(s?) so that the flags
> are not passed unless it is a kasan build?
> 

Hi David,

Thank you for the suggestion. I think a kasan-specific vmemmap (what I
described in the previous e-mail) would be a better solution than having
different prototypes for different builds. It would be cleaner to have
all kasan-specific code in one place.

Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 00/15] complete deferred page initialization
  2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
                   ` (14 preceding siblings ...)
  2017-08-07 20:38 ` [v6 15/15] mm: debug for raw allocator Pavel Tatashin
@ 2017-08-11  7:58 ` Michal Hocko
  2017-08-11 15:13   ` Pasha Tatashin
  15 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11  7:58 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

[I am sorry I didn't get to your previous versions]

On Mon 07-08-17 16:38:34, Pavel Tatashin wrote:
[...]
> SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
> which defers initializing struct pages until all cpus have been started so
> it can be done in parallel.
> 
> However, this feature is sub-optimal, because the deferred page
> initialization code expects that the struct pages have already been zeroed,
> and the zeroing is done early in boot with a single thread only.  Also, we
> access that memory and set flags before struct pages are initialized. All
> of this is fixed in this patchset.
> 
> In this work we do the following:
> - Never read access struct page until it was initialized

How is this enforced? What about pfn walkers? E.g. page_ext
initialization code (page owner in particular)

> - Never set any fields in struct pages before they are initialized
> - Zero struct page at the beginning of struct page initialization

Please give us a more high-level description of how your reimplementation
works and how the patchset is organized. I will go through those patches,
but it is always good to give an overview in the cover letter to make
the review easier.

> Performance improvements on x86 machine with 8 nodes:
> Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
> 
> Single threaded struct page init: 7.6s/T improvement
> Deferred struct page init: 10.2s/T improvement

What are the before and after numbers, and how have you measured them?
> 
> Pavel Tatashin (15):
>   x86/mm: reserve only existing low pages
>   x86/mm: setting fields in deferred pages
>   sparc64/mm: setting fields in deferred pages
>   mm: discard memblock data later
>   mm: don't access uninitialized struct pages
>   sparc64: simplify vmemmap_populate
>   mm: defining memblock_virt_alloc_try_nid_raw
>   mm: zero struct pages during initialization
>   sparc64: optimized struct page zeroing
>   x86/kasan: explicitly zero kasan shadow memory
>   arm64/kasan: explicitly zero kasan shadow memory
>   mm: explicitly zero pagetable memory
>   mm: stop zeroing memory during allocation in vmemmap
>   mm: optimize early system hash allocations
>   mm: debug for raw allocator
> 
>  arch/arm64/mm/kasan_init.c          |  42 ++++++++++
>  arch/sparc/include/asm/pgtable_64.h |  30 +++++++
>  arch/sparc/mm/init_64.c             |  31 +++-----
>  arch/x86/kernel/setup.c             |   5 +-
>  arch/x86/mm/init_64.c               |   9 ++-
>  arch/x86/mm/kasan_init_64.c         |  67 ++++++++++++++++
>  include/linux/bootmem.h             |  27 +++++++
>  include/linux/memblock.h            |   9 ++-
>  include/linux/mm.h                  |   9 +++
>  mm/memblock.c                       | 152 ++++++++++++++++++++++++++++--------
>  mm/nobootmem.c                      |  16 ----
>  mm/page_alloc.c                     |  31 +++++---
>  mm/sparse-vmemmap.c                 |  10 ++-
>  mm/sparse.c                         |   6 +-
>  14 files changed, 356 insertions(+), 88 deletions(-)
> 
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only existing low pages
  2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only existing low pages Pavel Tatashin
@ 2017-08-11  8:07   ` Michal Hocko
  2017-08-11 15:24     ` Pasha Tatashin
  2017-08-14 13:55   ` Michal Hocko
  1 sibling, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11  8:07 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:
> Struct pages are initialized by going through __init_single_page(). Since
> the existing physical memory in memblock is represented in memblock.memory
> list, struct page for every page from this list goes through
> __init_single_page().

By a page _from_ this list you mean struct pages backing the physical
memory of the memblock lists?
 
> The second memblock list, memblock.reserved, manages the allocated memory:
> the memory that won't be available to the kernel allocator. So, every page
> from this list goes through reserve_bootmem_region(), where certain struct
> page fields are set, the assumption being that the struct pages have been
> initialized beforehand.
> 
> In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
> memblock.memory might start at a later PFN. For example, in QEMU,
> e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
> so PFN 0 is not on memblock.memory (and hence isn't initialized via
> __init_single_page) but is on memblock.reserved (and hence we set fields in
> the uninitialized struct page).
> 
> Currently, the struct page memory is always zeroed during allocation,
> which prevents this problem from being detected. But, if some asserts
> provided by CONFIG_DEBUG_VM_PGFLAGS are tightened, this problem may become
> visible in existing kernels.
> 
> In this patchset we will stop zeroing struct page memory during allocation.
> Therefore, this bug must be fixed in order to avoid random assert failures
> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> 
> The fix is to reserve memory from the first existing PFN.

Hmm, I assume this is a result of some assert triggering, right? Which
one? Why don't we need the same treatment for architectures other than x86?

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

I guess that the review happened in-house. I do not want to question its
value, but it is rather strange not to hear the specific review comments,
which might be useful in general, and moreover to not even include those
people on the CC list so they are aware of the follow-up discussion.

> ---
>  arch/x86/kernel/setup.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3486d0498800..489cdc141bcb 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
>  
>  static void __init trim_low_memory_range(void)
>  {
> -	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> +	unsigned long min_pfn = find_min_pfn_with_active_regions();
> +	phys_addr_t base = min_pfn << PAGE_SHIFT;
> +
> +	memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
>  }
>  	
>  /*
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 02/15] x86/mm: setting fields in deferred pages
  2017-08-07 20:38 ` [v6 02/15] x86/mm: setting fields in deferred pages Pavel Tatashin
@ 2017-08-11  9:02   ` Michal Hocko
  2017-08-11 15:39     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11  9:02 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

[CC Mel - the full series is here
http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatashin@oracle.com]

On Mon 07-08-17 16:38:36, Pavel Tatashin wrote:
> Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed prior to first
> initializing struct pages by going through __init_single_page().
> 
> With deferred struct page feature enabled there is a case where we set some
> fields prior to initializing:
> 
>         mem_init() {
>                 register_page_bootmem_info();
>                 free_all_bootmem();
>                 ...
>         }
> 
> When register_page_bootmem_info() is called, only non-deferred struct pages
> are initialized. But this function goes through some reserved pages which
> might be part of the deferred range, and thus are not yet initialized.
> 
>   mem_init
>    register_page_bootmem_info
>     register_page_bootmem_info_node
>      get_page_bootmem
>       .. setting fields here ..
>       such as: page->freelist = (void *)type;
> 
> We end up with a similar issue as in the previous patch, where currently
> we do not observe a problem because memory is zeroed. But if flag asserts
> are changed we can start hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.

I have to confess that this part of the early struct page initialization
is not my strongest point and I always have to re-read the code from
scratch, but I really do not understand what you are trying to achieve
here.

AFAIU register_page_bootmem_info_node is only about struct pages backing
pgdat, usemap and memmap. Those should be in reserved memblocks, and we do
not initialize those at later times; they are not relevant to the deferred
initialization as your changelog suggests, so the ordering with
get_page_bootmem shouldn't matter. Or am I missing something here?
 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  arch/x86/mm/init_64.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 136422d7d539..1e863baec847 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1165,12 +1165,17 @@ void __init mem_init(void)
>  
>  	/* clear_bss() already clear the empty_zero_page */
>  
> -	register_page_bootmem_info();
> -
>  	/* this will put all memory onto the freelists */
>  	free_all_bootmem();
>  	after_bootmem = 1;
>  
> +	/* Must be done after boot memory is put on freelist, because here we
> +	 * might set fields in deferred struct pages that have not yet been
> +	 * initialized, and free_all_bootmem() initializes all the reserved
> +	 * deferred pages for us.
> +	 */
> +	register_page_bootmem_info();
> +
>  	/* Register memory areas for /proc/kcore */
>  	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
>  			 PAGE_SIZE, KCORE_OTHER);
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-07 20:38 ` [v6 04/15] mm: discard memblock data later Pavel Tatashin
@ 2017-08-11  9:32   ` Michal Hocko
  2017-08-11  9:50     ` Mel Gorman
                       ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-11  9:32 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

[CC Mel]

On Mon 07-08-17 16:38:38, Pavel Tatashin wrote:
> There is an existing use-after-free bug when deferred struct pages are
> enabled:
> 
> The memblock_add() allocates memory for the memory array if more than
> 128 entries are needed.  See comment in e820__memblock_setup():
> 
>   * The bootstrap memblock region count maximum is 128 entries
>   * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
>   * than that - so allow memblock resizing.
> 
> This memblock memory is freed here:
>         free_low_memory_core_early()
> 
> We access the freed memblock.memory later in boot when deferred pages are
> initialized in this path:
> 
>         deferred_init_memmap()
>                 for_each_mem_pfn_range()
>                   __next_mem_pfn_range()
>                     type = &memblock.memory;

Yes you seem to be right.
>
> One possible explanation for why this use-after-free hasn't been hit
> before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded
> at least on systems where deferred struct pages were enabled.

Yeah this sounds like the case.
 
> Another reason why we want this problem fixed in this patch series is
> that, in the next patch, we will need to access memblock.reserved from
> deferred_init_memmap().
> 

I guess this goes all the way down to 
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

Considering that some HW might behave strangely and this would be rather
hard to debug I would be tempted to mark this for stable. It should also
be merged separately from the rest of the series.

I have just one nit below
Acked-by: Michal Hocko <mhocko@suse.com>

[...]
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 2cb25fe4452c..bf14aea6ab70 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
>  }
>  
>  #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK

pull this ifdef inside memblock_discard, and then you do not need another
one in page_alloc_init_late
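
i.e. something like this in the header (untested):

	#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
	void memblock_discard(void);
	#else
	static inline void memblock_discard(void)
	{
	}
	#endif

so that page_alloc_init_late() can call memblock_discard()
unconditionally.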

[...]
> +/**
> + * Discard memory and reserved arrays if they were allocated
> + */
> +void __init memblock_discard(void)
>  {

here

> -	if (memblock.memory.regions == memblock_memory_init_regions)
> -		return 0;
> +	phys_addr_t addr, size;
>  
> -	*addr = __pa(memblock.memory.regions);
> +	if (memblock.reserved.regions != memblock_reserved_init_regions) {
> +		addr = __pa(memblock.reserved.regions);
> +		size = PAGE_ALIGN(sizeof(struct memblock_region) *
> +				  memblock.reserved.max);
> +		__memblock_free_late(addr, size);
> +	}
>  
> -	return PAGE_ALIGN(sizeof(struct memblock_region) *
> -			  memblock.memory.max);
> +	if (memblock.memory.regions != memblock_memory_init_regions) {
> +		addr = __pa(memblock.memory.regions);
> +		size = PAGE_ALIGN(sizeof(struct memblock_region) *
> +				  memblock.memory.max);
> +		__memblock_free_late(addr, size);
> +	}
>  }
> -
>  #endif
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fc32aa81f359..63d16c185736 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1584,6 +1584,10 @@ void __init page_alloc_init_late(void)
>  	/* Reinit limits that are based on free pages after the kernel is up */
>  	files_maxfiles_init();
>  #endif
> +#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
> +	/* Discard memblock private memory */
> +	memblock_discard();
> +#endif
>  
>  	for_each_populated_zone(zone)
>  		set_zone_contiguous(zone);
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't access uninitialized struct pages
  2017-08-07 20:38 ` [v6 05/15] mm: don't access uninitialized struct pages Pavel Tatashin
@ 2017-08-11  9:37   ` Michal Hocko
  2017-08-11 15:55     ` Pasha Tatashin
  2017-08-15  9:33   ` Michal Hocko
  1 sibling, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11  9:37 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
> In deferred_init_memmap() where all deferred struct pages are initialized
> we have a check like this:
> 
>     if (page->flags) {
>             VM_BUG_ON(page_zone(page) != zone);
>             goto free_range;
>     }
> 
> This way we are checking if the current deferred page has already been
> initialized. It works because memory for struct pages has been zeroed, and
> the only way flags can be non-zero is if the page went through
> __init_single_page() before. But once we change the current behavior and
> won't zero the memory in the memblock allocator, we cannot trust anything
> inside "struct page"es until they are initialized. This patch fixes this.
> 
> This patch defines a new accessor memblock_get_reserved_pfn_range()
> which returns successive ranges of reserved PFNs.  deferred_init_memmap()
> calls it to determine if a PFN and its struct page has already been
> initialized.

Why don't we simply check the pfn against pgdat->first_deferred_pfn?

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  include/linux/memblock.h |  3 +++
>  mm/memblock.c            | 54 ++++++++++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c          | 11 +++++++++-
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7e7bf3..b6a2a610f5e1 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -320,6 +320,9 @@ int memblock_is_map_memory(phys_addr_t addr);
>  int memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
>  bool memblock_is_reserved(phys_addr_t addr);
>  bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
> +void memblock_get_reserved_pfn_range(unsigned long pfn,
> +				     unsigned long *pfn_start,
> +				     unsigned long *pfn_end);
>  
>  extern void __memblock_dump_all(void);
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index bf14aea6ab70..08f449acfdd1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1580,7 +1580,13 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
>  	memblock_cap_memory_range(0, max_addr);
>  }
>  
> -static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
> +/**
> + * Return index in regions array if addr is within the region. Otherwise
> + * return -1. If -1 is returned and *next_idx is not %NULL, sets it to the
> + * next region index or -1 if there is none.
> + */
> +static int __init_memblock memblock_search(struct memblock_type *type,
> +					   phys_addr_t addr, int *next_idx)
>  {
>  	unsigned int left = 0, right = type->cnt;
>  
> @@ -1595,22 +1601,26 @@ static int __init_memblock memblock_search(struct memblock_type *type, phys_addr
>  		else
>  			return mid;
>  	} while (left < right);
> +
> +	if (next_idx)
> +		*next_idx = (right == type->cnt) ? -1 : right;
> +
>  	return -1;
>  }
>  
>  bool __init memblock_is_reserved(phys_addr_t addr)
>  {
> -	return memblock_search(&memblock.reserved, addr) != -1;
> +	return memblock_search(&memblock.reserved, addr, NULL) != -1;
>  }
>  
>  bool __init_memblock memblock_is_memory(phys_addr_t addr)
>  {
> -	return memblock_search(&memblock.memory, addr) != -1;
> +	return memblock_search(&memblock.memory, addr, NULL) != -1;
>  }
>  
>  int __init_memblock memblock_is_map_memory(phys_addr_t addr)
>  {
> -	int i = memblock_search(&memblock.memory, addr);
> +	int i = memblock_search(&memblock.memory, addr, NULL);
>  
>  	if (i == -1)
>  		return false;
> @@ -1622,7 +1632,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>  			 unsigned long *start_pfn, unsigned long *end_pfn)
>  {
>  	struct memblock_type *type = &memblock.memory;
> -	int mid = memblock_search(type, PFN_PHYS(pfn));
> +	int mid = memblock_search(type, PFN_PHYS(pfn), NULL);
>  
>  	if (mid == -1)
>  		return -1;
> @@ -1646,7 +1656,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>   */
>  int __init_memblock memblock_is_region_memory(phys_addr_t base, phys_addr_t size)
>  {
> -	int idx = memblock_search(&memblock.memory, base);
> +	int idx = memblock_search(&memblock.memory, base, NULL);
>  	phys_addr_t end = base + memblock_cap_size(base, &size);
>  
>  	if (idx == -1)
> @@ -1655,6 +1665,38 @@ int __init_memblock memblock_is_region_memory(phys_addr_t base, phys_addr_t size
>  		 memblock.memory.regions[idx].size) >= end;
>  }
>  
> +/**
> + * memblock_get_reserved_pfn_range - search for the next reserved region
> + *
> + * @pfn: start searching from this pfn.
> + *
> + * RETURNS:
> + * [start_pfn, end_pfn), where start_pfn >= pfn. If none is found,
> + * start_pfn and end_pfn are both set to ULONG_MAX.
> + */
> +void __init_memblock memblock_get_reserved_pfn_range(unsigned long pfn,
> +						     unsigned long *start_pfn,
> +						     unsigned long *end_pfn)
> +{
> +	struct memblock_type *type = &memblock.reserved;
> +	int next_idx, idx;
> +
> +	idx = memblock_search(type, PFN_PHYS(pfn), &next_idx);
> +	if (idx == -1 && next_idx == -1) {
> +		*start_pfn = ULONG_MAX;
> +		*end_pfn = ULONG_MAX;
> +		return;
> +	}
> +
> +	if (idx == -1) {
> +		idx = next_idx;
> +		*start_pfn = PFN_DOWN(type->regions[idx].base);
> +	} else {
> +		*start_pfn = pfn;
> +	}
> +	*end_pfn = PFN_DOWN(type->regions[idx].base + type->regions[idx].size);
> +}
> +
>  /**
>   * memblock_is_region_reserved - check if a region intersects reserved memory
>   * @base: base of region to check
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 63d16c185736..983de0a8047b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1447,6 +1447,7 @@ static int __init deferred_init_memmap(void *data)
>  	pg_data_t *pgdat = data;
>  	int nid = pgdat->node_id;
>  	struct mminit_pfnnid_cache nid_init_state = { };
> +	unsigned long resv_start_pfn = 0, resv_end_pfn = 0;
>  	unsigned long start = jiffies;
>  	unsigned long nr_pages = 0;
>  	unsigned long walk_start, walk_end;
> @@ -1491,6 +1492,10 @@ static int __init deferred_init_memmap(void *data)
>  			pfn = zone->zone_start_pfn;
>  
>  		for (; pfn < end_pfn; pfn++) {
> +			if (pfn >= resv_end_pfn)
> +				memblock_get_reserved_pfn_range(pfn,
> +								&resv_start_pfn,
> +								&resv_end_pfn);
>  			if (!pfn_valid_within(pfn))
>  				goto free_range;
>  
> @@ -1524,7 +1529,11 @@ static int __init deferred_init_memmap(void *data)
>  				cond_resched();
>  			}
>  
> -			if (page->flags) {
> +			/*
> +			 * Check if this page has already been initialized due
> +			 * to being reserved during boot in memblock.
> +			 */
> +			if (pfn >= resv_start_pfn) {
>  				VM_BUG_ON(page_zone(page) != zone);
>  				goto free_range;
>  			}
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11  9:32   ` Michal Hocko
@ 2017-08-11  9:50     ` Mel Gorman
  2017-08-11 15:49     ` Pasha Tatashin
  2017-08-11 19:00     ` Pasha Tatashin
  2 siblings, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-08-11  9:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pavel Tatashin, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, linux-arm-kernel, x86, kasan-dev, borntraeger,
	heiko.carstens, davem, willy, ard.biesheuvel, will.deacon,
	catalin.marinas, sam, Mel Gorman

On Fri, Aug 11, 2017 at 11:32:49AM +0200, Michal Hocko wrote:
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> > Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> Considering that some HW might behave strangely and this would be rather
> hard to debug I would be tempted to mark this for stable. It should also
> be merged separately from the rest of the series.
> 
> I have just one nit below
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Agreed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw
  2017-08-07 20:38 ` [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
@ 2017-08-11 12:39   ` Michal Hocko
  2017-08-11 15:58     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 12:39 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:
> A new variant of memblock_virt_alloc_* allocations:
> memblock_virt_alloc_try_nid_raw()
>     - Does not zero the allocated memory
>     - Does not panic if request cannot be satisfied

OK, this looks good but I would not introduce memblock_virt_alloc_raw
here because we do not have any users. Please move that to "mm: optimize
early system hash allocations" which actually uses the API. It would be
easier to review it that way.

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

other than that
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/bootmem.h | 27 +++++++++++++++++++++++++
>  mm/memblock.c           | 53 ++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 73 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index e223d91b6439..ea30b3987282 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
>  #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
>  
>  /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
> +void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
> +				      phys_addr_t min_addr,
> +				      phys_addr_t max_addr, int nid);
>  void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
>  		phys_addr_t align, phys_addr_t min_addr,
>  		phys_addr_t max_addr, int nid);
> @@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
>  					    NUMA_NO_NODE);
>  }
>  
> +static inline void * __init memblock_virt_alloc_raw(
> +					phys_addr_t size,  phys_addr_t align)
> +{
> +	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
> +					    BOOTMEM_ALLOC_ACCESSIBLE,
> +					    NUMA_NO_NODE);
> +}
> +
>  static inline void * __init memblock_virt_alloc_nopanic(
>  					phys_addr_t size, phys_addr_t align)
>  {
> @@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
>  	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
>  }
>  
> +static inline void * __init memblock_virt_alloc_raw(
> +					phys_addr_t size,  phys_addr_t align)
> +{
> +	if (!align)
> +		align = SMP_CACHE_BYTES;
> +	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
> +}
> +
>  static inline void * __init memblock_virt_alloc_nopanic(
>  					phys_addr_t size, phys_addr_t align)
>  {
> @@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
>  					  min_addr);
>  }
>  
> +static inline void * __init memblock_virt_alloc_try_nid_raw(
> +			phys_addr_t size, phys_addr_t align,
> +			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
> +{
> +	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
> +				min_addr, max_addr);
> +}
> +
>  static inline void * __init memblock_virt_alloc_try_nid_nopanic(
>  			phys_addr_t size, phys_addr_t align,
>  			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 08f449acfdd1..3fbf3bcb52d9 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
>  	return NULL;
>  done:
>  	ptr = phys_to_virt(alloc);
> -	memset(ptr, 0, size);
>  
>  	/*
>  	 * The min_count is set to 0 so that bootmem allocated blocks
> @@ -1340,6 +1339,38 @@ static void * __init memblock_virt_alloc_internal(
>  	return ptr;
>  }
>  
> +/**
> + * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
> + * memory and without panicking
> + * @size: size of memory block to be allocated in bytes
> + * @align: alignment of the region and block's size
> + * @min_addr: the lower bound of the memory region from where the allocation
> + *	  is preferred (phys address)
> + * @max_addr: the upper bound of the memory region from where the allocation
> + *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
> + *	      allocate only from memory limited by memblock.current_limit value
> + * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
> + *
> + * Public function, provides additional debug information (including caller
> + * info), if enabled. Does not zero allocated memory, does not panic if request
> + * cannot be satisfied.
> + *
> + * RETURNS:
> + * Virtual address of allocated memory block on success, NULL on failure.
> + */
> +void * __init memblock_virt_alloc_try_nid_raw(
> +			phys_addr_t size, phys_addr_t align,
> +			phys_addr_t min_addr, phys_addr_t max_addr,
> +			int nid)
> +{
> +	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
> +		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
> +		     (u64)max_addr, (void *)_RET_IP_);
> +
> +	return memblock_virt_alloc_internal(size, align,
> +					    min_addr, max_addr, nid);
> +}
> +
>  /**
>   * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
>   * @size: size of memory block to be allocated in bytes
> @@ -1351,8 +1382,8 @@ static void * __init memblock_virt_alloc_internal(
>   *	      allocate only from memory limited by memblock.current_limit value
>   * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
>   *
> - * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
> - * additional debug information (including caller info), if enabled.
> + * Public function, provides additional debug information (including caller
> + * info), if enabled. This function zeroes the allocated memory.
>   *
>   * RETURNS:
>   * Virtual address of allocated memory block on success, NULL on failure.
> @@ -1362,11 +1393,17 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
>  				phys_addr_t min_addr, phys_addr_t max_addr,
>  				int nid)
>  {
> +	void *ptr;
> +
>  	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
>  		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
>  		     (u64)max_addr, (void *)_RET_IP_);
> -	return memblock_virt_alloc_internal(size, align, min_addr,
> -					     max_addr, nid);
> +
> +	ptr = memblock_virt_alloc_internal(size, align,
> +					   min_addr, max_addr, nid);
> +	if (ptr)
> +		memset(ptr, 0, size);
> +	return ptr;
>  }
>  
>  /**
> @@ -1380,7 +1417,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
>   *	      allocate only from memory limited by memblock.current_limit value
>   * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
>   *
> - * Public panicking version of _memblock_virt_alloc_try_nid_nopanic()
> + * Public panicking version of memblock_virt_alloc_try_nid_nopanic()
>   * which provides debug information (including caller info), if enabled,
>   * and panics if the request can not be satisfied.
>   *
> @@ -1399,8 +1436,10 @@ void * __init memblock_virt_alloc_try_nid(
>  		     (u64)max_addr, (void *)_RET_IP_);
>  	ptr = memblock_virt_alloc_internal(size, align,
>  					   min_addr, max_addr, nid);
> -	if (ptr)
> +	if (ptr) {
> +		memset(ptr, 0, size);
>  		return ptr;
> +	}
>  
>  	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
>  	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 08/15] mm: zero struct pages during initialization
  2017-08-07 20:38 ` [v6 08/15] mm: zero struct pages during initialization Pavel Tatashin
@ 2017-08-11 12:50   ` Michal Hocko
  2017-08-11 16:03     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 12:50 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:42, Pavel Tatashin wrote:
> Add struct page zeroing as a part of initialization of other fields in
> __init_single_page().

I believe this deserves a much more detailed explanation of why this is
safe. What actually prevents any pfn walker from seeing an uninitialized
struct page? Please make your assumptions explicit in the commit log so
that we can check them independently.

Also, this is done with some purpose, which is performance, right? You
have mentioned that in the cover letter, but if somebody is going to read
through git logs this wouldn't be obvious from the specific commit.
So add that information here as well. Especially the numbers will be
interesting.

As a sidenote, this will need some more followups for memory hotplug
after my recent changes which are not merged yet but I will take care of
that.

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

After the relevant information is added feel free add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm.h | 9 +++++++++
>  mm/page_alloc.c    | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5e8569..183ac5e733db 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -93,6 +93,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_forbids_zeropage(X)	(0)
>  #endif
>  
> +/*
> + * On some architectures it is expensive to call memset() for small sizes.
> + * Those architectures should provide their own implementation of "struct page"
> + * zeroing by defining this macro in <asm/pgtable.h>.
> + */
> +#ifndef mm_zero_struct_page
> +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> +#endif
> +
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
>   * per mm struct. Users can overwrite this number by sysctl but there is a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 983de0a8047b..4d32c1fa4c6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1168,6 +1168,7 @@ static void free_one_page(struct zone *zone,
>  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>  				unsigned long zone, int nid)
>  {
> +	mm_zero_struct_page(page);
>  	set_page_links(page, zone, nid, pfn);
>  	init_page_count(page);
>  	page_mapcount_reset(page);
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 09/15] sparc64: optimized struct page zeroing
  2017-08-07 20:38 ` [v6 09/15] sparc64: optimized struct page zeroing Pavel Tatashin
@ 2017-08-11 12:53   ` Michal Hocko
  2017-08-11 16:04     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 12:53 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:43, Pavel Tatashin wrote:
> Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
> calling memset(). We do eight to ten regular stores, based on the size of
> struct page. The compiler optimizes out the conditions of the switch()
> statement.

Again, this doesn't explain why we need this. You have mentioned those
reasons in some previous emails but be explicit here please.

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  arch/sparc/include/asm/pgtable_64.h | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 6fbd931f0570..cee5cc7ccc51 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr)	(mem_map_zero)
>  
> +/* This macro must be updated when the size of struct page grows above 80
> + * or shrinks below 64.
> + * The idea is that the compiler optimizes out the switch() statement and
> + * only leaves clrx instructions
> + */
> +#define	mm_zero_struct_page(pp) do {					\
> +	unsigned long *_pp = (void *)(pp);				\
> +									\
> +	 /* Check that struct page is either 64, 72, or 80 bytes */	\
> +	BUILD_BUG_ON(sizeof(struct page) & 7);				\
> +	BUILD_BUG_ON(sizeof(struct page) < 64);				\
> +	BUILD_BUG_ON(sizeof(struct page) > 80);				\
> +									\
> +	switch (sizeof(struct page)) {					\
> +	case 80:							\
> +		_pp[9] = 0;	/* fallthrough */			\
> +	case 72:							\
> +		_pp[8] = 0;	/* fallthrough */			\
> +	default:							\
> +		_pp[7] = 0;						\
> +		_pp[6] = 0;						\
> +		_pp[5] = 0;						\
> +		_pp[4] = 0;						\
> +		_pp[3] = 0;						\
> +		_pp[2] = 0;						\
> +		_pp[1] = 0;						\
> +		_pp[0] = 0;						\
> +	}								\
> +} while (0)
> +
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>   * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 13/15] mm: stop zeroing memory during allocation in vmemmap
  2017-08-07 20:38 ` [v6 13/15] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
@ 2017-08-11 13:04   ` Michal Hocko
  2017-08-11 16:11     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 13:04 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:47, Pavel Tatashin wrote:
> Replace the allocators in sparse-vmemmap with the non-zeroing version, so
> we will get the performance improvement by zeroing the memory in parallel
> when struct pages are initialized.

First of all, this should probably be merged with the previous patch.
Then I think vmemmap_alloc_block would be better split up into
__vmemmap_alloc_block, which doesn't zero, and vmemmap_alloc_block, which
does; that would reduce the memset call sites and make the interface
slightly more robust.
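
Something along these lines (untested, and the N_HIGH_MEMORY special
case is omitted for brevity):

	static void * __meminit __vmemmap_alloc_block(unsigned long size,
						      int node)
	{
		/* If the main allocator is up use that, fallback to bootmem. */
		if (slab_is_available()) {
			struct page *page = alloc_pages_node(node,
					GFP_KERNEL | __GFP_RETRY_MAYFAIL,
					get_order(size));

			return page ? page_address(page) : NULL;
		}
		return __earlyonly_bootmem_alloc(node, size, size,
						 __pa(MAX_DMA_ADDRESS));
	}

	void * __meminit vmemmap_alloc_block(unsigned long size, int node)
	{
		void *p = __vmemmap_alloc_block(size, node);

		if (p)
			memset(p, 0, size);
		return p;
	}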
 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  mm/sparse-vmemmap.c | 6 +++---
>  mm/sparse.c         | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index d40c721ab19f..3b646b5ce1b6 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
>  				unsigned long align,
>  				unsigned long goal)
>  {
> -	return memblock_virt_alloc_try_nid(size, align, goal,
> +	return memblock_virt_alloc_try_nid_raw(size, align, goal,
>  					    BOOTMEM_ALLOC_ACCESSIBLE, node);
>  }
>  
> @@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
>  
>  		if (node_state(node, N_HIGH_MEMORY))
>  			page = alloc_pages_node(
> -				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> +				node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
>  				get_order(size));
>  		else
>  			page = alloc_pages(
> -				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> +				GFP_KERNEL | __GFP_RETRY_MAYFAIL,
>  				get_order(size));
>  		if (page)
>  			return page_address(page);
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7b4be3fd5cac..0e315766ad11 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -441,9 +441,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>  	}
>  
>  	size = PAGE_ALIGN(size);
> -	map = memblock_virt_alloc_try_nid(size * map_count,
> -					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> -					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
> +	map = memblock_virt_alloc_try_nid_raw(size * map_count,
> +					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> +					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
>  	if (map) {
>  		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
>  			if (!present_section_nr(pnum))
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 14/15] mm: optimize early system hash allocations
  2017-08-07 20:38 ` [v6 14/15] mm: optimize early system hash allocations Pavel Tatashin
@ 2017-08-11 13:05   ` Michal Hocko
  2017-08-11 16:13     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 13:05 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:48, Pavel Tatashin wrote:
> Clients can call alloc_large_system_hash() with the flag HASH_ZERO to
> specify that the memory allocated for the system hash needs to be zeroed;
> otherwise the memory does not need to be zeroed, and the client will
> initialize it.
> 
> If the memory does not need to be zeroed, call the new
> memblock_virt_alloc_raw() interface, and thus improve boot performance.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

OK, but as mentioned in the previous patch, add memblock_virt_alloc_raw
in this patch.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4d32c1fa4c6c..000806298dfb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7354,18 +7354,17 @@ void *__init alloc_large_system_hash(const char *tablename,
>  
>  	log2qty = ilog2(numentries);
>  
> -	/*
> -	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
> -	 * currently not used when HASH_EARLY is specified.
> -	 */
>  	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
>  	do {
>  		size = bucketsize << log2qty;
> -		if (flags & HASH_EARLY)
> -			table = memblock_virt_alloc_nopanic(size, 0);
> -		else if (hashdist)
> +		if (flags & HASH_EARLY) {
> +			if (flags & HASH_ZERO)
> +				table = memblock_virt_alloc_nopanic(size, 0);
> +			else
> +				table = memblock_virt_alloc_raw(size, 0);
> +		} else if (hashdist) {
>  			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
> -		else {
> +		} else {
>  			/*
>  			 * If bucketsize is not a power-of-two, we may free
>  			 * some pages at the end of hash table which
> -- 
> 2.14.0
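
For illustration, a sketch of a HASH_EARLY caller after this change,
modeled on the inode cache hash setup in fs/inode.c (the individual call
sites are not part of this patch).  With HASH_ZERO set, the early path
takes memblock_virt_alloc_nopanic(); dropping HASH_ZERO would route the
request to the raw, non-zeroing memblock_virt_alloc_raw():

	static __initdata unsigned long ihash_entries;
	static unsigned int i_hash_mask __read_mostly;
	static unsigned int i_hash_shift __read_mostly;
	static struct hlist_head *inode_hashtable __read_mostly;

	void __init inode_init_early(void)
	{
		/* HASH_EARLY | HASH_ZERO: allocated at early boot,
		 * zeroed by the memblock allocator on our behalf. */
		inode_hashtable =
			alloc_large_system_hash("Inode-cache",
						sizeof(struct hlist_head),
						ihash_entries,
						14,
						HASH_EARLY | HASH_ZERO,
						&i_hash_shift,
						&i_hash_mask,
						0,
						0);
	}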

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 15/15] mm: debug for raw alloctor
  2017-08-07 20:38 ` [v6 15/15] mm: debug for raw alloctor Pavel Tatashin
@ 2017-08-11 13:08   ` Michal Hocko
  2017-08-11 16:18     ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 13:08 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 07-08-17 16:38:49, Pavel Tatashin wrote:
> When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
> returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
> places expect zeroed memory.

Please fold this into the patch which introduces
memblock_virt_alloc_try_nid_raw. I am not sure CONFIG_DEBUG_VM is the
best config because that tends to be enabled quite often. Maybe
CONFIG_MEMBLOCK_DEBUG? Or even make it kernel command line parameter?

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  mm/memblock.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 3fbf3bcb52d9..29fcb1dd8a81 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1363,12 +1363,19 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  			phys_addr_t min_addr, phys_addr_t max_addr,
>  			int nid)
>  {
> +	void *ptr;
> +
>  	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
>  		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
>  		     (u64)max_addr, (void *)_RET_IP_);
>  
> -	return memblock_virt_alloc_internal(size, align,
> -					    min_addr, max_addr, nid);
> +	ptr = memblock_virt_alloc_internal(size, align,
> +					   min_addr, max_addr, nid);
> +#ifdef CONFIG_DEBUG_VM
> +	if (ptr && size > 0)
> +		memset(ptr, 0xff, size);
> +#endif
> +	return ptr;
>  }
>  
>  /**
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 00/15] complete deferred page initialization
  2017-08-11  7:58 ` [v6 00/15] complete deferred page initialization Michal Hocko
@ 2017-08-11 15:13   ` Pasha Tatashin
  2017-08-11 15:22     ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On 08/11/2017 03:58 AM, Michal Hocko wrote:
> [I am sorry I didn't get to your previous versions]

Thank you for reviewing this work. I will address your comments and 
send out new patches.

>>
>> In this work we do the following:
>> - Never read access struct page until it was initialized
> 
> How is this enforced? What about pfn walkers? E.g. page_ext
> initialization code (page owner in particular)

This is hard to enforce 100%. But because we have a patch in this 
series that sets all memory allocated by 
memblock_virt_alloc_try_nid_raw() to ones when debug options are enabled, 
and because Linux has a good set of asserts in place that check struct 
pages for sanity (especially the ones enabled by 
CONFIG_DEBUG_VM_PGFLAGS), I was able to find many places in Linux which 
accessed struct pages before __init_single_page() was performed, and fix 
them. Most of these places are hit only when the deferred struct page 
initialization code is enabled.

> 
>> - Never set any fields in struct pages before they are initialized
>> - Zero struct page at the beginning of struct page initialization
> 
> Please give us a more high-level description of how your reimplementation
> works and how is the patchset organized. I will go through those patches
> but it is always good to give an overview in the cover letter to make
> the review easier.

Ok, will add more explanation to the cover letter.

>> Single threaded struct page init: 7.6s/T improvement
>> Deferred struct page init: 10.2s/T improvement
> 
> What are before and after numbers and how have you measured them.

When I send out this series the next time I will include before vs. 
after numbers on the machine I tested, including links to dmesg output.

I used my early boot timestamps for x86 and sparc to measure the data. 
Early boot timestamps for sparc are already part of mainline; the x86 
patches are out for review: https://lkml.org/lkml/2017/8/10/946 (should 
have changed the subject line there :) ).

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 00/15] complete deferred page initialization
  2017-08-11 15:13   ` Pasha Tatashin
@ 2017-08-11 15:22     ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 15:22 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 11:13:07, Pasha Tatashin wrote:
> On 08/11/2017 03:58 AM, Michal Hocko wrote:
> >[I am sorry I didn't get to your previous versions]
> 
> Thank you for reviewing this work. I will address your comments, and
> send-out a new patches.
> 
> >>
> >>In this work we do the following:
> >>- Never read access struct page until it was initialized
> >
> >How is this enforced? What about pfn walkers? E.g. page_ext
> >initialization code (page owner in particular)
> 
> This is hard to enforce 100%. But because we have a patch in this series
> that sets all memory allocated by memblock_virt_alloc_try_nid_raw()
> to ones when debug options are enabled, and because Linux has a good set of
> asserts in place that check struct pages for sanity (especially the ones
> enabled by CONFIG_DEBUG_VM_PGFLAGS), I was able to
> find many places in Linux which accessed struct pages before
> __init_single_page() was performed, and fix them. Most of these places are
> hit only when the deferred struct page initialization code is enabled.

Yes, I am very well aware of how hard is this to guarantee. I was merely
pointing out that the changelog should be more verbose about your
testing and assumptions so that we can revalidate them.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only exiting low pages
  2017-08-11  8:07   ` Michal Hocko
@ 2017-08-11 15:24     ` Pasha Tatashin
  2017-08-14 11:40       ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> Struct pages are initialized by going through __init_single_page(). Since
>> the existing physical memory in memblock is represented in memblock.memory
>> list, struct page for every page from this list goes through
>> __init_single_page().
> 
> By a page _from_ this list you mean struct pages backing the physical
> memory of the memblock lists?

Correct: "for every page from this list...", for every page represented 
by this list the struct page is initialized through __init_single_page()

>> In this patchset we will stop zeroing struct page memory during allocation.
>> Therefore, this bug must be fixed in order to avoid random assert failures
>> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
>>
>> The fix is to reserve memory from the first existing PFN.
> 
> Hmm, I assume this is a result of some assert triggering, right? Which
> one? Why don't we need the same treatment for other than x86 arch?

Correct, the pgflags asserts were triggered when we were setting 
reserved flags in the struct page for PFN 0, which was never initialized 
through __init_single_page(). The reason they were triggered is that 
we set all uninitialized memory to ones in one of the debug patches.
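
As a sketch, the fix ("reserve memory from the first existing PFN")
amounts to anchoring the low reservation in arch/x86/kernel/setup.c at
the first PFN memblock actually knows about, instead of at physical
address 0 (simplified; reserve_low is the existing reservelow= amount):

	static void __init trim_low_memory_range(void)
	{
		/* PFN 0 may not be in memblock.memory (e.g. under QEMU,
		 * where e820__memblock_setup() can start at PFN 1), so
		 * reserve starting from the first existing PFN. */
		unsigned long min_pfn = find_min_pfn_with_active_regions();
		phys_addr_t base = min_pfn << PAGE_SHIFT;

		memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
	}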

>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> I guess that the review happened inhouse. I do not want to question its
> value but it is rather strange to not hear the specific review comments
> which might be useful in general and moreover even not include those
> people on the CC list so they are aware of the follow up discussion.

I will bring this up with my colleagues to figure out how to handle this 
better in the future. I will also CC the reviewers when I send out the 
updated patch series.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 02/15] x86/mm: setting fields in deferred pages
  2017-08-11  9:02   ` Michal Hocko
@ 2017-08-11 15:39     ` Pasha Tatashin
  2017-08-14 11:43       ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

> AFAIU register_page_bootmem_info_node is only about struct pages backing
> pgdat, usemap and memmap. Those should be in reserved memblocks and we
> do not initialize those at later times, they are not relevant to the
> deferred initialization as your changelog suggests so the ordering with
> get_page_bootmem shouldn't matter. Or am I missing something here?

The pages for pgdat, usemap, and memmap are part of memblock.reserved, 
and thus get initialized when free_all_bootmem() is called.

So, we have something like this in mem_init()

register_page_bootmem_info
  register_page_bootmem_info_node
   get_page_bootmem
    .. setting fields here ..
    such as: page->freelist = (void *)type;

free_all_bootmem()
  free_low_memory_core_early()
   for_each_reserved_mem_region()
    reserve_bootmem_region()
     init_reserved_page() <- Only if this is deferred reserved page
      __init_single_pfn()
       __init_single_page()
           memset(0) <-- Lose the set fields here!

memblock does not know about deferred pages, and can be requested to 
allocate physical pages anywhere. So, the reserved pages in memblock can 
be in both the non-deferred and the deferred parts of memory.

Without deferred pages enabled, by the time register_page_bootmem_info() 
is called every page went through __init_single_page(), but with 
deferred pages enabled, there is scenario where fields can be set before 
pages go through __init_single_page(). This patch fixes it.
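
In other words, the fix is an ordering change in the x86 mem_init() 
path, roughly (a sketch with unrelated setup omitted):

	void __init mem_init(void)
	{
		/* Must run first: walks memblock.reserved and, for
		 * deferred reserved pages, runs __init_single_page(),
		 * whose memset() would wipe any fields set earlier. */
		free_all_bootmem();

		/* Safe only now: get_page_bootmem() sets fields such as
		 * page->freelist, and nothing zeroes them afterwards. */
		register_page_bootmem_info();
	}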

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11  9:32   ` Michal Hocko
  2017-08-11  9:50     ` Mel Gorman
@ 2017-08-11 15:49     ` Pasha Tatashin
  2017-08-11 16:04       ` Michal Hocko
  2017-08-11 19:00     ` Pasha Tatashin
  2 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

> I guess this goes all the way down to
> Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")

I will add this to the patch.

>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> Considering that some HW might behave strangely and this would be rather
> hard to debug I would be tempted to mark this for stable. It should also
> be merged separately from the rest of the series.
> 
> I have just one nit below
> Acked-by: Michal Hocko <mhocko@suse.com>

I will address your comment, and send out a new patch. Should I send it 
out separately from the series or should I keep it inside?

Also, before I send out a new patch, I will need to root cause and 
resolve the problem found by kernel test robot <fengguang.wu@intel.com>, 
which was bisected down to this patch.

[  156.659400] BUG: Bad page state in process swapper  pfn:03147
[  156.660051] page:ffff88001ed8a1c0 count:0 mapcount:-127 mapping: 
     (null) index:0x1
[  156.660917] flags: 0x0()
[  156.661198] raw: 0000000000000000 0000000000000000 0000000000000001 
00000000ffffff80
[  156.662006] raw: ffff88001f4a8120 ffff88001ed85ce0 0000000000000000 
0000000000000000
[  156.662811] page dumped because: nonzero mapcount
[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.13.0-rc3-00220-g1aad694 #1
[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[  156.665129] Call Trace:
[  156.665422]  dump_stack+0x1e/0x20
[  156.665802]  bad_page+0x122/0x148

I was not able to reproduce this problem, even though I used their qemu 
script and config. But I am getting the following panic with both base and fix:

[  115.763259] VFS: Cannot open root device "ram0" or 
unknown-block(0,0): error -6
[  115.764511] Please append a correct "root=" boot option; here are the 
available partitions:
[  115.765816] Kernel panic - not syncing: VFS: Unable to mount root fs 
on unknown-block(0,0)
[  115.767124] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.13.0-rc4_pt_memset6-00033-g7e65200b1473 #7
[  115.768506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[  115.770368] Call Trace:
[  115.770771]  dump_stack+0x1e/0x20
[  115.771310]  panic+0xf8/0x2bc
[  115.771792]  mount_block_root+0x3bb/0x441
[  115.772437]  ? do_early_param+0xc5/0xc5
[  115.773051]  ? do_early_param+0xc5/0xc5
[  115.773683]  mount_root+0x7c/0x7f
[  115.774243]  prepare_namespace+0x194/0x1d1
[  115.774898]  kernel_init_freeable+0x1c8/0x1df
[  115.775575]  ? rest_init+0x13f/0x13f
[  115.776153]  kernel_init+0x14/0x142
[  115.776711]  ? rest_init+0x13f/0x13f
[  115.777285]  ret_from_fork+0x2a/0x40
[  115.777864] Kernel Offset: disabled

Their config has CONFIG_BLK_DEV_RAM disabled, but the qemu script has 
root=/dev/ram0, so I enabled dev_ram, but I am still getting a panic when 
root is mounted, in both base and fix.

Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-11  9:37   ` Michal Hocko
@ 2017-08-11 15:55     ` Pasha Tatashin
  2017-08-14 11:47       ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On 08/11/2017 05:37 AM, Michal Hocko wrote:
> On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
>> In deferred_init_memmap() where all deferred struct pages are initialized
>> we have a check like this:
>>
>>      if (page->flags) {
>>              VM_BUG_ON(page_zone(page) != zone);
>>              goto free_range;
>>      }
>>
>> This way we are checking if the current deferred page has already been
>> initialized. It works, because memory for struct pages has been zeroed, and
>> the only way flags are not zero is if it went through __init_single_page()
>> before.  But, once we change the current behavior and won't zero the memory
>> in the memblock allocator, we cannot trust anything inside struct pages
>> until they are initialized. This patch fixes this.
>>
>> This patch defines a new accessor memblock_get_reserved_pfn_range()
>> which returns successive ranges of reserved PFNs.  deferred_init_memmap()
>> calls it to determine if a PFN and its struct page has already been
>> initialized.
> 
> Why don't we simply check the pfn against pgdat->first_deferred_pfn?

Because we are initializing deferred pages, and all of them have a pfn 
greater than pgdat->first_deferred_pfn. However, some of the deferred 
pages were already initialized, if they were reserved, in this path:

mem_init()
  free_all_bootmem()
   free_low_memory_core_early()
    for_each_reserved_mem_region()
     reserve_bootmem_region()
      init_reserved_page() <- if this is deferred reserved page
       __init_single_pfn()
        __init_single_page()

So, currently, we are using the value of page->flags to figure out if 
this page has been initialized while being part of the deferred range, 
but this is not going to work for this project, as we do not zero the 
memory that is backing the struct pages, and so the value of page->flags 
can be anything.
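
A sketch of the difference (the accessor is the one the patch
introduces; its exact signature here is an assumption based on the
commit message quoted above):

	unsigned long resv_start, resv_end;

	/* Old check, valid only while struct page memory arrives
	 * pre-zeroed: */
	if (page->flags) {
		VM_BUG_ON(page_zone(page) != zone);
		goto free_range;
	}

	/* Replacement (sketch): ask memblock which PFNs are reserved
	 * instead of trusting uninitialized struct page contents. */
	memblock_get_reserved_pfn_range(pfn, &resv_start, &resv_end);
	if (pfn >= resv_start && pfn < resv_end)
		goto free_range;	/* struct page already initialized */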

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw
  2017-08-11 12:39   ` Michal Hocko
@ 2017-08-11 15:58     ` Pasha Tatashin
  2017-08-11 16:06       ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On 08/11/2017 08:39 AM, Michal Hocko wrote:
> On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:
>> A new variant of memblock_virt_alloc_* allocations:
>> memblock_virt_alloc_try_nid_raw()
>>      - Does not zero the allocated memory
>>      - Does not panic if request cannot be satisfied
> 
> OK, this looks good but I would not introduce memblock_virt_alloc_raw
> here because we do not have any users. Please move that to "mm: optimize
> early system hash allocations" which actually uses the API. It would be
> easier to review it that way.
> 
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> other than that
> Acked-by: Michal Hocko <mhocko@suse.com>

Sure, I could do this, but as I understood from Dave Miller's earlier 
comments, we should do one logical change at a time: introduce the 
API in one patch, use it in another. So, this is how I tried to organize 
this patch set. Is this assumption incorrect?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 08/15] mm: zero struct pages during initialization
  2017-08-11 12:50   ` Michal Hocko
@ 2017-08-11 16:03     ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

> I believe this deserves a much more detailed explanation of why this is safe.
> What actually prevents any pfn walker from seeing an uninitialized
> struct page? Please make your assumptions explicit in the commit log so
> that we can check them independently.

There is nothing that prevents pfn walkers from walking over any struct 
pages, deferred and non-deferred. However, during boot, before deferred 
pages are initialized, we have just a few places that do that, and all of 
those cases are fixed in this patchset.

> Also this is done with some purpose, which is performance, right? You
> have mentioned that in the cover letter but if somebody is going to read
> through git logs this wouldn't be obvious from the specific commit.
> So add that information here as well. Especially numbers will be
> interesting.

I will add more performance data to this patch comment.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11 15:49     ` Pasha Tatashin
@ 2017-08-11 16:04       ` Michal Hocko
  2017-08-11 16:22         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 16:04 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

On Fri 11-08-17 11:49:15, Pasha Tatashin wrote:
> >I guess this goes all the way down to
> >Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
> 
> I will add this to the patch.
> 
> >>Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> >>Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> >>Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> >>Reviewed-by: Bob Picco <bob.picco@oracle.com>
> >
> >Considering that some HW might behave strangely and this would be rather
> >hard to debug I would be tempted to mark this for stable. It should also
> >be merged separately from the rest of the series.
> >
> >I have just one nit below
> >Acked-by: Michal Hocko <mhocko@suse.com>
> 
> I will address your comment, and send out a new patch. Should I send it out
> separately from the series or should I keep it inside?

I would post it separatelly. It doesn't depend on the rest.

> Also, before I send out a new patch, I will need to root cause and resolve
> the problem found by kernel test robot <fengguang.wu@intel.com>, which was
> bisected down to this patch.
> 
> [  156.659400] BUG: Bad page state in process swapper  pfn:03147
> [  156.660051] page:ffff88001ed8a1c0 count:0 mapcount:-127 mapping:
> (null) index:0x1
> [  156.660917] flags: 0x0()
> [  156.661198] raw: 0000000000000000 0000000000000000 0000000000000001
> 00000000ffffff80
> [  156.662006] raw: ffff88001f4a8120 ffff88001ed85ce0 0000000000000000
> 0000000000000000
> [  156.662811] page dumped because: nonzero mapcount
> [  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
> 4.13.0-rc3-00220-g1aad694 #1
> [  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.9.3-20161025_171302-gandalf 04/01/2014
> [  156.665129] Call Trace:
> [  156.665422]  dump_stack+0x1e/0x20
> [  156.665802]  bad_page+0x122/0x148

Was the report related to this patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 09/15] sparc64: optimized struct page zeroing
  2017-08-11 12:53   ` Michal Hocko
@ 2017-08-11 16:04     ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
>> calling memset(). We do eight to ten regular stores based on the size of
>> struct page. The compiler optimizes out the conditions of the switch() statement.
> 
> Again, this doesn't explain why we need this. You have mentioned those
> reasons in some previous emails but be explicit here please.
> 

I will add performance data to this patch as well.

Thank you,
Pasha
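
The tail of this macro is quoted at the very top of this page; for
context, a sketch of its full shape (the sizes in the switch and the
BUILD_BUG_ON bounds are assumptions consistent with that tail):

	#define	mm_zero_struct_page(pp) do {				\
		unsigned long *_pp = (void *)(pp);			\
									\
		/* struct page must be 8-byte aligned and between	\
		 * 64 and 80 bytes; the compiler resolves the switch	\
		 * at build time, leaving only the needed stores. */	\
		BUILD_BUG_ON(sizeof(struct page) & 7);			\
		BUILD_BUG_ON(sizeof(struct page) < 64);			\
		BUILD_BUG_ON(sizeof(struct page) > 80);			\
									\
		switch (sizeof(struct page)) {				\
		case 80:						\
			_pp[9] = 0;	/* fallthrough */		\
		case 72:						\
			_pp[8] = 0;	/* fallthrough */		\
		default:						\
			_pp[7] = 0;					\
			_pp[6] = 0;					\
			_pp[5] = 0;					\
			_pp[4] = 0;					\
			_pp[3] = 0;					\
			_pp[2] = 0;					\
			_pp[1] = 0;					\
			_pp[0] = 0;					\
		}							\
	} while (0)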

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw
  2017-08-11 15:58     ` Pasha Tatashin
@ 2017-08-11 16:06       ` Michal Hocko
  2017-08-11 16:24         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-11 16:06 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 11:58:46, Pasha Tatashin wrote:
> On 08/11/2017 08:39 AM, Michal Hocko wrote:
> >On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:
> >>A new variant of memblock_virt_alloc_* allocations:
> >>memblock_virt_alloc_try_nid_raw()
> >>     - Does not zero the allocated memory
> >>     - Does not panic if request cannot be satisfied
> >
> >OK, this looks good but I would not introduce memblock_virt_alloc_raw
> >here because we do not have any users. Please move that to "mm: optimize
> >early system hash allocations" which actually uses the API. It would be
> >easier to review it that way.
> >
> >>Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> >>Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> >>Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> >>Reviewed-by: Bob Picco <bob.picco@oracle.com>
> >
> >other than that
> >Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Sure, I could do this, but as I understood from earlier Dave Miller's
> comments, we should do one logical change at a time. Hence, introduce API in
> one patch use it in another. So, this is how I tried to organize this patch
> set. Is this assumption incorrect?

Well, it really depends. If the patch is really small then adding a new
API along with users is easier to review and backport because you have a
clear view of the usage. I believe this is the case here. But if others
feel otherwise I will not object.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 13/15] mm: stop zeroing memory during allocation in vmemmap
  2017-08-11 13:04   ` Michal Hocko
@ 2017-08-11 16:11     ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On 08/11/2017 09:04 AM, Michal Hocko wrote:
> On Mon 07-08-17 16:38:47, Pavel Tatashin wrote:
>> Replace allocators in sparse-vmemmap to use the non-zeroing version, so
>> we will get the performance improvement of zeroing the memory in parallel
>> when struct pages are zeroed.
> 
> First of all this should probably be merged with the previous patch. Then
> I think vmemmap_alloc_block would be better split up into
> __vmemmap_alloc_block, which doesn't zero, and vmemmap_alloc_block, which
> does zero; that would reduce the memset call sites and make it a
> slightly more robust interface.

Ok, I will add a vmemmap_alloc_block_zero() call, and merge this and the 
previous patches together.
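
Such a helper could look like this (a sketch; the name is from the reply
above, the body is an assumption):

	static void * __meminit vmemmap_alloc_block_zero(unsigned long size,
							 int node)
	{
		void *p = vmemmap_alloc_block(size, node);

		if (!p)
			return NULL;
		/* Page-table pages still need to be zeroed even when
		 * the struct page path switches to raw allocations. */
		memset(p, 0, size);

		return p;
	}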

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 14/15] mm: optimize early system hash allocations
  2017-08-11 13:05   ` Michal Hocko
@ 2017-08-11 16:13     ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> Clients can call alloc_large_system_hash() with the flag HASH_ZERO to
>> specify that the memory allocated for the system hash needs to be zeroed;
>> otherwise the memory does not need to be zeroed, and the client will
>> initialize it.
>>
>> If the memory does not need to be zeroed, call the new
>> memblock_virt_alloc_raw() interface, and thus improve boot performance.
>>
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> OK, but as mentioned in the previous patch, add memblock_virt_alloc_raw
> in this patch.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Ok I will merge them.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 15/15] mm: debug for raw alloctor
  2017-08-11 13:08   ` Michal Hocko
@ 2017-08-11 16:18     ` Pasha Tatashin
  2017-08-14 11:50       ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
>> returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
>> places expect zeroed memory.
> 
> Please fold this into the patch which introduces
> memblock_virt_alloc_try_nid_raw.

OK

> I am not sure CONFIG_DEBUG_VM is the
> best config because that tends to be enabled quite often. Maybe
> CONFIG_MEMBLOCK_DEBUG? Or even make it a kernel command line parameter?
> 

Initially, I did not want to make it CONFIG_MEMBLOCK_DEBUG because we 
really benefit from this debugging code when VM debug is enabled, and 
especially from the struct page debugging asserts, which also depend on 
CONFIG_DEBUG_VM.

However, now thinking about it, I will change it to 
CONFIG_MEMBLOCK_DEBUG, and let users decide what other debugging configs 
need to be enabled, as this is also OK.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11 16:04       ` Michal Hocko
@ 2017-08-11 16:22         ` Pasha Tatashin
  2017-08-14 11:36           ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

>> I will address your comment, and send out a new patch. Should I send it out
>> separately from the series or should I keep it inside?
> 
> I would post it separatelly. It doesn't depend on the rest.

OK, I will post it separately. No, it does not depend on the rest, but 
the rest depends on this. So, I am not sure how to enforce that this 
comes before the rest.

> 
> >>Also, before I send out a new patch, I will need to root cause and resolve
> >>the problem found by kernel test robot <fengguang.wu@intel.com>, which was
> >>bisected down to this patch.
>>
>> [  156.659400] BUG: Bad page state in process swapper  pfn:03147
>> [  156.660051] page:ffff88001ed8a1c0 count:0 mapcount:-127 mapping:
>> (null) index:0x1
>> [  156.660917] flags: 0x0()
>> [  156.661198] raw: 0000000000000000 0000000000000000 0000000000000001
>> 00000000ffffff80
>> [  156.662006] raw: ffff88001f4a8120 ffff88001ed85ce0 0000000000000000
>> 0000000000000000
>> [  156.662811] page dumped because: nonzero mapcount
>> [  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
>> 4.13.0-rc3-00220-g1aad694 #1
>> [  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.9.3-20161025_171302-gandalf 04/01/2014
>> [  156.665129] Call Trace:
>> [  156.665422]  dump_stack+0x1e/0x20
>> [  156.665802]  bad_page+0x122/0x148
> 
> Was the report related with this patch?

Yes, they said that the problem was bisected down to this patch. Do you 
know if there is a way to submit a patch to this test robot?

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw
  2017-08-11 16:06       ` Michal Hocko
@ 2017-08-11 16:24         ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 16:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> Sure, I could do this, but as I understood from earlier Dave Miller's
>> comments, we should do one logical change at a time. Hence, introduce API in
>> one patch use it in another. So, this is how I tried to organize this patch
>> set. Is this assumption incorrect?
> 
> Well, it really depends. If the patch is really small then adding a new
> API along with users is easier to review and backport because you have a
> clear view of the usage. I believe this is the case here. But if others
> feel otherwise I will not object.

I will merge them.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11  9:32   ` Michal Hocko
  2017-08-11  9:50     ` Mel Gorman
  2017-08-11 15:49     ` Pasha Tatashin
@ 2017-08-11 19:00     ` Pasha Tatashin
  2017-08-14 11:34       ` Michal Hocko
  2 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-11 19:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

Hi Michal,

This suggestion won't work, because there are arches without memblock 
support: tile, sh...

So, I would still need to have:

#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs 
in the nobootmem header file. In either case it would become messier than 
it is right now.

Pasha

> I have just one nit below
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> [...]
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 2cb25fe4452c..bf14aea6ab70 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
>>   }
>>   
>>   #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
> 
> pull this ifdef inside memblock_discard and you do not have an another
> one in page_alloc_init_late
> 
> [...]
>> +/**
>> + * Discard memory and reserved arrays if they were allocated
>> + */
>> +void __init memblock_discard(void)
>>   {
> 
> here
> 
>> -	if (memblock.memory.regions == memblock_memory_init_regions)
>> -		return 0;
>> +	phys_addr_t addr, size;
>>   
>> -	*addr = __pa(memblock.memory.regions);
>> +	if (memblock.reserved.regions != memblock_reserved_init_regions) {
>> +		addr = __pa(memblock.reserved.regions);
>> +		size = PAGE_ALIGN(sizeof(struct memblock_region) *
>> +				  memblock.reserved.max);
>> +		__memblock_free_late(addr, size);
>> +	}
>>   
>> -	return PAGE_ALIGN(sizeof(struct memblock_region) *
>> -			  memblock.memory.max);
>> +	if (memblock.memory.regions != memblock_memory_init_regions) {
>> +		addr = __pa(memblock.memory.regions);
>> +		size = PAGE_ALIGN(sizeof(struct memblock_region) *
>> +				  memblock.memory.max);
>> +		__memblock_free_late(addr, size);
>> +	}
>>   }
>> -
>>   #endif

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11 19:00     ` Pasha Tatashin
@ 2017-08-14 11:34       ` Michal Hocko
  2017-08-14 13:39         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

On Fri 11-08-17 15:00:47, Pasha Tatashin wrote:
> Hi Michal,
> 
> This suggestion won't work, because there are arches without memblock
> support: tile, sh...
> 
> So, I would still need to have:
> 
> #ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
> the nobootmem header file.

This is the standard way to do this. And it is usually preferred over
proliferating ifdefs in the code.
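
The stub pattern referred to looks roughly like this (a sketch; whether
the guard lives in memblock.h or the nobootmem header is the open
question above):

	#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
	void __init memblock_discard(void);
	#else
	static inline void memblock_discard(void)
	{
		/* No-op when memblock data is never discarded, so
		 * callers such as page_alloc need no #ifdef. */
	}
	#endif
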
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-11 16:22         ` Pasha Tatashin
@ 2017-08-14 11:36           ` Michal Hocko
  2017-08-14 13:35             ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:36 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

On Fri 11-08-17 12:22:52, Pasha Tatashin wrote:
> >>I will address your comment, and send out a new patch. Should I send it out
> >>separately from the series or should I keep it inside?
> >
> >I would post it separatelly. It doesn't depend on the rest.
> 
> OK, I will post it separately. No, it does not depend on the rest, but the
> rest depends on this. So, I am not sure how to enforce that this comes
> before the rest.

Andrew will take care of that. Just make it explicit that some of the
patches depend on an earlier work when reposting.
 
> >>Also, before I send out a new patch, I will need to root cause and resolve
> >>the problem found by kernel test robot <fengguang.wu@intel.com>, which was
> >>bisected down to this patch.
> >>
> >>[  156.659400] BUG: Bad page state in process swapper  pfn:03147
> >>[  156.660051] page:ffff88001ed8a1c0 count:0 mapcount:-127 mapping:
> >>(null) index:0x1
> >>[  156.660917] flags: 0x0()
> >>[  156.661198] raw: 0000000000000000 0000000000000000 0000000000000001
> >>00000000ffffff80
> >>[  156.662006] raw: ffff88001f4a8120 ffff88001ed85ce0 0000000000000000
> >>0000000000000000
> >>[  156.662811] page dumped because: nonzero mapcount
> >>[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
> >>4.13.0-rc3-00220-g1aad694 #1
> >>[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >>1.9.3-20161025_171302-gandalf 04/01/2014
> >>[  156.665129] Call Trace:
> >>[  156.665422]  dump_stack+0x1e/0x20
> >>[  156.665802]  bad_page+0x122/0x148
> >
> >Was the report related to this patch?
> 
> Yes, they said that the problem was bisected down to this patch. Do you know
> if there is a way to submit a patch to this test robot?

You can ask them for retesting with an updated patch by replying to
their report. Anyway, I fail to see how the change could lead to this
problem.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only exiting low pages
  2017-08-11 15:24     ` Pasha Tatashin
@ 2017-08-14 11:40       ` Michal Hocko
  2017-08-14 13:30         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:40 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 11:24:55, Pasha Tatashin wrote:
[...]
> >>In this patchset we will stop zeroing struct page memory during allocation.
> >>Therefore, this bug must be fixed in order to avoid random assert failures
> >>caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> >>
> >>The fix is to reserve memory from the first existing PFN.
> >
> >Hmm, I assume this is a result of some assert triggering, right? Which
> >one? Why don't we need the same treatment for other than x86 arch?
> 
> Correct, the pgflags asserts were triggered when we were setting reserved
> flags in the struct page for PFN 0, which was never initialized through
> __init_single_page(). The reason they were triggered is that we set all
> uninitialized memory to ones in one of the debug patches.

And why don't we need the same treatment for other architectures?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 02/15] x86/mm: setting fields in deferred pages
  2017-08-11 15:39     ` Pasha Tatashin
@ 2017-08-14 11:43       ` Michal Hocko
  2017-08-14 13:32         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:43 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 11:39:41, Pasha Tatashin wrote:
> >AFAIU register_page_bootmem_info_node is only about struct pages backing
> >pgdat, usemap and memmap. Those should be in reserved memblocks and we
> >do not initialize those at later times, they are not relevant to the
> >deferred initialization as your changelog suggests so the ordering with
> >get_page_bootmem shouldn't matter. Or am I missing something here?
> 
> The pages for pgdat, usemap, and memmap are part of memblock.reserved, and
> thus get initialized when free_all_bootmem() is called.
> 
> So, we have something like this in mem_init()
> 
> register_page_bootmem_info
>  register_page_bootmem_info_node
>   get_page_bootmem
>    .. setting fields here ..
>    such as: page->freelist = (void *)type;
> 
> free_all_bootmem()
>  free_low_memory_core_early()
>   for_each_reserved_mem_region()
>    reserve_bootmem_region()
>     init_reserved_page() <- Only if this is deferred reserved page
>      __init_single_pfn()
>       __init_single_page()
>           memset(0) <-- Lose the set fields here!

OK, I have missed that part. Please make it explicit in the changelog.
It is quite easy to get lost in the deep call chains.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-11 15:55     ` Pasha Tatashin
@ 2017-08-14 11:47       ` Michal Hocko
  2017-08-14 13:51         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:47 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 11:55:39, Pasha Tatashin wrote:
> On 08/11/2017 05:37 AM, Michal Hocko wrote:
> >On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
> >>In deferred_init_memmap() where all deferred struct pages are initialized
> >>we have a check like this:
> >>
> >>     if (page->flags) {
> >>             VM_BUG_ON(page_zone(page) != zone);
> >>             goto free_range;
> >>     }
> >>
> >>This way we are checking if the current deferred page has already been
> >>initialized. It works, because memory for struct pages has been zeroed, and
> >>the only way flags are not zero is if it went through __init_single_page()
> >>before.  But, once we change the current behavior and won't zero the memory
> >>in the memblock allocator, we cannot trust anything inside struct pages
> >>until they are initialized. This patch fixes this.
> >>
> >>This patch defines a new accessor memblock_get_reserved_pfn_range()
> >>which returns successive ranges of reserved PFNs.  deferred_init_memmap()
> >>calls it to determine if a PFN and its struct page has already been
> >>initialized.
> >
> >Why don't we simply check the pfn against pgdat->first_deferred_pfn?
> 
> Because we are initializing deferred pages, and all of them have a pfn greater
> than pgdat->first_deferred_pfn. However, some of the deferred pages were
> already initialized, if they were reserved, in this path:
> 
> mem_init()
>  free_all_bootmem()
>   free_low_memory_core_early()
>    for_each_reserved_mem_region()
>     reserve_bootmem_region()
>      init_reserved_page() <- if this is deferred reserved page
>       __init_single_pfn()
>        __init_single_page()
> 
> So, currently, we are using the value of page->flags to figure out if this
> page has been initialized while being part of the deferred range, but this is
> not going to work for this project, as we do not zero the memory that is
> backing the struct pages, and so the value of page->flags can be anything.

True, this is the initialization part I've missed in one of the previous
patches already. Would it be possible to only iterate over !reserved
memory blocks instead? Now that we discard all the metadata later it
should be quite easy to do for_each_memblock_type, no?
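
memblock does have a related iterator, for_each_free_mem_range(), which
walks ranges present in memblock.memory but absent from
memblock.reserved; whether it fits this early-boot use is debated below,
but as a sketch (deferred_init_range() is a hypothetical helper, and nid
comes from the surrounding deferred-init context):

	u64 i;
	phys_addr_t start, end;

	/* Reserved PFNs were already initialized earlier via
	 * reserve_bootmem_region(), so only walk the free ranges. */
	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &start, &end, NULL)
		deferred_init_range(nid, PFN_UP(start), PFN_DOWN(end));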

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 15/15] mm: debug for raw alloctor
  2017-08-11 16:18     ` Pasha Tatashin
@ 2017-08-14 11:50       ` Michal Hocko
  2017-08-14 14:01         ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 11:50 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Fri 11-08-17 12:18:24, Pasha Tatashin wrote:
> >>When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
> >>returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
> >>places expect zeroed memory.
> >
> >Please fold this into the patch which introduces
> >memblock_virt_alloc_try_nid_raw.
> 
> OK
> 
> > I am not sure CONFIG_DEBUG_VM is the
> >best config because that tends to be enabled quite often. Maybe
> >CONFIG_MEMBLOCK_DEBUG? Or even make it a kernel command line parameter?
> >
> 
> Initially, I did not want to make it CONFIG_MEMBLOCK_DEBUG because we really
> benefit from this debugging code when VM debug is enabled, and especially
> from the struct page debugging asserts, which also depend on CONFIG_DEBUG_VM.
> 
> However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
> and let users decide what other debugging configs need to be enabled, as
> this is also OK.

Actually the more I think about it the more I am convinced that a kernel
boot parameter would be better because it doesn't need the kernel to be
recompiled, and it is a single branch in a not-so-hot path.
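
A sketch of the boot-parameter variant (all names here are hypothetical,
not an existing kernel option):

	static bool memblock_poison_raw __initdata;

	static int __init setup_memblock_poison(char *str)
	{
		memblock_poison_raw = true;
		return 0;
	}
	early_param("memblock_poison", setup_memblock_poison);

	/* ...and in memblock_virt_alloc_try_nid_raw(): */
	ptr = memblock_virt_alloc_internal(size, align,
					   min_addr, max_addr, nid);
	if (ptr && size > 0 && memblock_poison_raw)
		memset(ptr, 0xff, size);	/* single runtime branch */
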
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only exiting low pages
  2017-08-14 11:40       ` Michal Hocko
@ 2017-08-14 13:30         ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 13:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> Correct, the pgflags asserts were triggered when we were setting reserved
>> flags in the struct page for PFN 0, which was never initialized through
>> __init_single_page(). The reason they were triggered is that we set all
>> uninitialized memory to ones in one of the debug patches.
> 
> And why don't we need the same treatment for other architectures?
> 

I have not seen similar issues on other architectures. At least this low 
memory reserve is x86 specific for BIOS purposes:

Documentation/admin-guide/kernel-parameters.txt
3624	reservelow=	[X86]
3625			Format: nn[K]
3626			Set the amount of memory to reserve for BIOS at
3627			the bottom of the address space.

If there are similar cases with other architectures, they will be caught 
by the last patch in this series, where all allocated memory is set to 
ones, and page flags asserts will be triggered. I have boot-tested on 
SPARC, ARM, and x86.

Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 02/15] x86/mm: setting fields in deferred pages
  2017-08-14 11:43       ` Michal Hocko
@ 2017-08-14 13:32         ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam



On 08/14/2017 07:43 AM, Michal Hocko wrote:
>> register_page_bootmem_info
>>   register_page_bootmem_info_node
>>    get_page_bootmem
>>     .. setting fields here ..
>>     such as: page->freelist = (void *)type;
>>
>> free_all_bootmem()
>>   free_low_memory_core_early()
>>    for_each_reserved_mem_region()
>>     reserve_bootmem_region()
>>      init_reserved_page() <- Only if this is deferred reserved page
>>       __init_single_pfn()
>>        __init_single_page()
>>            memset(0) <-- Lose the set fields here!
> OK, I have missed that part. Please make it explicit in the changelog.
> It is quite easy to get lost in the deep call chains.

Ok, will update comment.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-14 11:36           ` Michal Hocko
@ 2017-08-14 13:35             ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 13:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

>> OK, I will post it separately. No, it does not depend on the rest, but the
>> rest depends on this. So, I am not sure how to enforce that this comes
>> before the rest.
> 
> Andrew will take care of that. Just make it explicit that some of the
> patches depend on an earlier work when reposting.

Ok.

>> Yes, they said that the problem was bisected down to this patch. Do you know
>> if there is a way to submit a patch to this test robot?
> 
> You can ask them for retesting with an updated patch by replying to
> their report. Anyway, I fail to see how the change could lead to this
> problem.

I have already done that. Anyway, I think it is unrelated. I have used 
their scripts to test the patch alone, with the number of elements in the 
memblock array reduced down to 4. I verified that my freeing code is 
called, and never hit the problem that they reported.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-14 11:34       ` Michal Hocko
@ 2017-08-14 13:39         ` Pasha Tatashin
  2017-08-14 13:42           ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 13:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

>> #ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
>> the nobootmem header file.
> 
> This is the standard way to do this. And it is usually preferred over
> proliferating ifdefs in the code.

Hi Michal,

As you suggested, I sent out this patch separately. If you feel 
strongly that this should be updated to have stubs for platforms that 
do not implement memblock, please send a reply to that e-mail, so those 
who do not follow this thread will see it. Otherwise, I can leave it as 
is; the page_alloc file already has a number of memblock-related ifdefs, 
all of which can be cleaned out once every platform implements it (is 
that even achievable?)

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 04/15] mm: discard memblock data later
  2017-08-14 13:39         ` Pasha Tatashin
@ 2017-08-14 13:42           ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 13:42 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

On Mon 14-08-17 09:39:17, Pasha Tatashin wrote:
> >>#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
> >>the nobootmem header file.
> >
> >This is the standard way to do this. And it is usually preferred over
> >proliferating ifdefs in the code.
> 
> Hi Michal,
> 
> As you suggested, I sent out this patch separately. If you feel strongly
> that this should be updated to have stubs for platforms that do not
> implement memblock, please send a reply to that e-mail, so those who do not
> follow this thread will see it. Otherwise, I can leave it as is; the
> page_alloc file already has a number of memblock-related ifdefs, all of which
> can be cleaned out once every platform implements it (is that even achievable?)

I do not think this needs a repost just for this. This was merely a
JFYI; in case you need to repost for other reasons, just
update that as well. But nothing really earth-shattering.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-14 11:47       ` Michal Hocko
@ 2017-08-14 13:51         ` Pasha Tatashin
  2017-08-17 15:28           ` Pasha Tatashin
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> mem_init()
>>   free_all_bootmem()
>>    free_low_memory_core_early()
>>     for_each_reserved_mem_region()
>>      reserve_bootmem_region()
>>       init_reserved_page() <- if this is deferred reserved page
>>        __init_single_pfn()
>>         __init_single_page()
>>
>> So, currently, we are using the value of page->flags to figure out if this
>> page has been initialized while being part of the deferred range, but this is
>> not going to work for this project, as we do not zero the memory that is
>> backing the struct pages, and so the value of page->flags can be anything.
> 
> True, this is the initialization part I've missed in one of the previous
> patches already. Would it be possible to only iterate over !reserved
> memory blocks instead? Now that we discard all the metadata later it
> should be quite easy to do for_each_memblock_type, no?

Hi Michal,

That is a clever suggestion: add a new iterator that goes through 
unreserved existing memory. I do not think such an iterator is available 
yet, so it would need to be implemented, using an approach similar to 
the callback I have added.

However, there is a different reason, why I took this current approach.

Daniel Jordan is working on a ktask support:
https://lkml.org/lkml/2017/7/14/666

He and I discussed how to multi-thread struct page initialization 
within memory nodes using ktasks. Having this callback interface makes 
that multi-threading quite easy, improving boot performance further: 
with his prototype we saw 4-6x improvements (using 4-8 threads per 
node), reducing the total time it takes to initialize all struct pages 
on machines with terabytes of memory to less than one second.
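
As a purely hypothetical illustration of why the callback shape helps
(the ktask API is not shown, and deferred_init_range() is a made-up
helper name):

struct deferred_chunk {
	int nid;
	unsigned long spfn;	/* first PFN of this chunk */
	unsigned long epfn;	/* one past the last PFN of this chunk */
};

/* Each thread receives an arbitrary PFN sub-range of one node. */
static int init_chunk(void *arg)
{
	struct deferred_chunk *c = arg;

	/* The callback under discussion, applied to one sub-range. */
	return deferred_init_range(c->nid, c->spfn, c->epfn);
}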

Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only exiting low pages
  2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only exiting low pages Pavel Tatashin
  2017-08-11  8:07   ` Michal Hocko
@ 2017-08-14 13:55   ` Michal Hocko
  2017-08-17 15:37     ` Pasha Tatashin
  1 sibling, 1 reply; 70+ messages in thread
From: Michal Hocko @ 2017-08-14 13:55 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	H. Peter Anvin

Let's CC Hpa on this one. I am still not sure it is correct. The full
series is here
http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatashin@oracle.com

On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:
> Struct pages are initialized by going through __init_single_page(). Since
> the existing physical memory in memblock is represented in memblock.memory
> list, struct page for every page from this list goes through
> __init_single_page().
> 
> The second memblock list, memblock.reserved, manages the allocated memory:
> the memory that won't be available to the kernel allocator. So, every page
> from this list goes through reserve_bootmem_region(), where certain struct
> page fields are set, the assumption being that the struct pages have been
> initialized beforehand.
> 
> In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
> memblock.memory might start at a later PFN. For example, in QEMU,
> e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
> so PFN 0 is not on memblock.memory (and hence isn't initialized via
> __init_single_page) but is on memblock.reserved (and hence we set fields in
> the uninitialized struct page).
> 
> Currently, the struct page memory is always zeroed during allocation,
> which prevents this problem from being detected. But, if some asserts
> provided by CONFIG_DEBUG_VM_PGFLAGS are tightened, this problem may become
> visible in existing kernels.
> 
> In this patchset we will stop zeroing struct page memory during allocation.
> Therefore, this bug must be fixed in order to avoid random assert failures
> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> 
> The fix is to reserve memory from the first existing PFN.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  arch/x86/kernel/setup.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3486d0498800..489cdc141bcb 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
>  
>  static void __init trim_low_memory_range(void)
>  {
> -	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> +	unsigned long min_pfn = find_min_pfn_with_active_regions();
> +	phys_addr_t base = min_pfn << PAGE_SHIFT;
> +
> +	memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
>  }
>  	
>  /*
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 15/15] mm: debug for raw alloctor
  2017-08-14 11:50       ` Michal Hocko
@ 2017-08-14 14:01         ` Pasha Tatashin
  2017-08-15  9:36           ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-14 14:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

>> However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
>> and let users decide what other debugging configs need to be enabled, as
>> this is also OK.
> 
> Actually the more I think about it the more I am convinced that a kernel
> boot parameter would be better because it doesn't need the kernel to be
> recompiled and it is a single branch in not so hot path.

The main reason I do not like a kernel parameter is that automated test 
suites for every platform would need to be updated to include this new 
parameter in order to test it.

Yet, I think it is important at least initially to test it on every 
platform unconditionally when certain debug configs are enabled.

This patch series allows the boot allocator to return uninitialized 
memory, a behavior Linux never had before. But way too often, firmware 
explicitly zeroes all the memory before starting the OS, so it would 
be hard to debug issues that might only be seen during kinit type of 
reboots.
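
For concreteness, a minimal sketch of the compile-time variant under
discussion; the CONFIG_MEMBLOCK_DEBUG guard and the 0xff pattern are
assumptions here, not the final patch:

void * __init memblock_virt_alloc_try_nid_raw(phys_addr_t size,
		phys_addr_t align, phys_addr_t min_addr,
		phys_addr_t max_addr, int nid)
{
	void *ptr = memblock_virt_alloc_internal(size, align,
						 min_addr, max_addr, nid);

#ifdef CONFIG_MEMBLOCK_DEBUG
	/* Poison instead of zeroing, so code that still relies on the
	 * old "boot memory is zeroed" behavior fails loudly, even when
	 * firmware pre-zeroes all of RAM. */
	if (ptr && size)
		memset(ptr, 0xff, size);
#endif
	return ptr;
}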

In the future, when memory sizes increase to the point that this memset 
becomes unacceptable even on debug kernels, it can always be removed; 
but at least by then we will know that the code has been tested for 
many years.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-07 20:38 ` [v6 05/15] mm: don't accessed uninitialized struct pages Pavel Tatashin
  2017-08-11  9:37   ` Michal Hocko
@ 2017-08-15  9:33   ` Michal Hocko
  1 sibling, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-15  9:33 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	Mel Gorman

[CC Mel - the original patch was
http://lkml.kernel.org/r/1502138329-123460-6-git-send-email-pasha.tatashin@oracle.com]

On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
> In deferred_init_memmap() where all deferred struct pages are initialized
> we have a check like this:
> 
>     if (page->flags) {
>             VM_BUG_ON(page_zone(page) != zone);
>             goto free_range;
>     }
> 
> This way we are checking if the current deferred page has already been
> initialized. It works because the memory for struct pages has been zeroed,
> and the only way the flags can be non-zero is if the page went through
> __init_single_page() before.  But once we change the current behavior and
> stop zeroing the memory in the memblock allocator, we cannot trust anything
> inside struct pages until they are initialized. This patch fixes this.
> 
> This patch defines a new accessor memblock_get_reserved_pfn_range()
> which returns successive ranges of reserved PFNs.  deferred_init_memmap()
> calls it to determine if a PFN and its struct page has already been
> initialized.

Maybe I am missing something but how can we see reserved ranges here
when for_each_mem_pfn_range iterates over memblock.memory?
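
(For reference, the walk in question is over memblock.memory only;
simplified, with the zone and free-list handling of that era elided:)

unsigned long spfn, epfn, pfn;
int i;

for_each_mem_pfn_range(i, nid, &spfn, &epfn, NULL) {
	for (pfn = spfn; pfn < epfn; pfn++) {
		/* the page->flags check under discussion sat here,
		 * operating on pfn_to_page(pfn) */
	}
}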

The loop is rather complex but I am wondering whether the page->flags
check is needed at all. We shouldn't have duplicated memblocks covering
the same pfn ranges so we cannot initialize the same range multiple
times, right? Reserved ranges are excluded altogether so how exactly can
we see an initialized struct page? In other words, why this simply
doesn't work?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 90e331e4c077..987a340a5bed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1524,11 +1524,6 @@ static int __init deferred_init_memmap(void *data)
 				cond_resched();
 			}
 
-			if (page->flags) {
-				VM_BUG_ON(page_zone(page) != zone);
-				goto free_range;
-			}
-
 			__init_single_page(page, pfn, zid, nid);
 			if (!free_base_page) {
 				free_base_page = page;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [v6 15/15] mm: debug for raw alloctor
  2017-08-14 14:01         ` Pasha Tatashin
@ 2017-08-15  9:36           ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-15  9:36 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Mon 14-08-17 10:01:52, Pasha Tatashin wrote:
> >>However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
> >>and let users decide what other debugging configs need to be enabled, as
> >>this is also OK.
> >
> >Actually the more I think about it the more I am convinced that a kernel
> >boot parameter would be better because it doesn't need the kernel to be
> >recompiled and it is a single branch in not so hot path.
> 
> The main reason I do not like a kernel parameter is that automated test
> suites for every platform would need to be updated to include this new
> parameter in order to test it.

How does this differ from the enabling a config option and building a
separate kernel?

My primary point about the kernel option was to have something available
for debugging without recompiling the kernel, which is more tedious than
simply adding one option to the kernel command line.
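
A minimal sketch of the boot-parameter alternative Michal describes; the
parameter name "memblock_debug_poison" is made up for illustration:

static bool memblock_poison_enabled __initdata;

static int __init memblock_poison_setup(char *str)
{
	memblock_poison_enabled = true;
	return 0;
}
early_param("memblock_debug_poison", memblock_poison_setup);

/* ... and in the raw allocator, the single branch mentioned above: */
if (memblock_poison_enabled && ptr && size)
	memset(ptr, 0xff, size);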
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-14 13:51         ` Pasha Tatashin
@ 2017-08-17 15:28           ` Pasha Tatashin
  2017-08-17 15:43             ` Michal Hocko
  0 siblings, 1 reply; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-17 15:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

Hi Michal,

I've been looking through this code again, and I think your suggestion 
will work. I did not realize this iterator already exists:

for_each_free_mem_range() basically iterates through (memory && !reserved)

This is exactly what we need here. So, I will update this patch to use 
this iterator, which will simplify it.
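
Roughly how that iterator is used, as a sketch of the shape only (zid
and nid as in deferred_init_memmap(); all other handling omitted):

u64 i;
phys_addr_t start, end;
unsigned long pfn;

/* (memory && !reserved): exactly the ranges whose struct pages the
 * deferred path must initialize. */
for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end)
	for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
		__init_single_page(pfn_to_page(pfn), pfn, zid, nid);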

Pasha

On 08/14/2017 09:51 AM, Pasha Tatashin wrote:
>>> mem_init()
>>>   free_all_bootmem()
>>>    free_low_memory_core_early()
>>>     for_each_reserved_mem_region()
>>>      reserve_bootmem_region()
>>>       init_reserved_page() <- if this is deferred reserved page
>>>        __init_single_pfn()
>>>         __init_single_page()
>>>
>>> So, currently, we are using the value of page->flags to figure out if
>>> this page has been initialized while being part of a deferred page, but
>>> this is not going to work for this project, as we do not zero the memory
>>> that is backing the struct pages, and since the value of page->flags can
>>> be anything.
>>
>> True, this is the initialization part I've missed in one of the previous
>> patches already. Would it be possible to only iterate over !reserved
>> memory blocks instead? Now that we discard all the metadata later it
>> should be quite easy to do for_each_memblock_type, no?
> 
> Hi Michal,
> 
> That is a clever suggestion: add a new iterator that goes through
> unreserved existing memory. I do not think such an iterator is available
> yet, so it would need to be implemented, using an approach similar to
> the callback I have added.
> 
> However, there is a different reason, why I took this current approach.
> 
> Daniel Jordan is working on a ktask support:
> https://lkml.org/lkml/2017/7/14/666
> 
> He and I discussed how to multi-thread struct page initialization
> within memory nodes using ktasks. Having this callback interface makes
> that multi-threading quite easy, improving boot performance further:
> with his prototype we saw 4-6x improvements (using 4-8 threads per
> node), reducing the total time it takes to initialize all struct pages
> on machines with terabytes of memory to less than one second.
> 
> Pasha

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 01/15] x86/mm: reserve only exiting low pages
  2017-08-14 13:55   ` Michal Hocko
@ 2017-08-17 15:37     ` Pasha Tatashin
  0 siblings, 0 replies; 70+ messages in thread
From: Pasha Tatashin @ 2017-08-17 15:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam,
	H. Peter Anvin

Hi Michal,

While working on a bug that was reported to me by the "kernel test robot":

  unable to handle kernel NULL pointer dereference at           (null)

I found that page_to_pfn() on that configuration was looking for a 
section number inside the flags field of "struct page". So, reserved but 
unavailable memory should have its "struct page" zeroed.
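
For context: on CONFIG_SPARSEMEM without vmemmap, the section number is
decoded straight out of page->flags, roughly:

static inline unsigned long page_to_section(const struct page *page)
{
	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}

page_to_pfn() then looks that section up via __nr_to_section(); with
random flags the lookup can return NULL, which matches the oops above.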

Therefore, I am going to remove this patch from my series, and instead 
add a new patch that iterates through the (reserved && !memory) 
memblocks and zeroes the struct pages for them: for that memory, the 
struct pages will never go through __init_single_page(), yet some fields 
might still be accessed. A sketch of that approach follows.
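
A minimal sketch of that new patch, built on the same machinery as
for_each_free_mem_range() but with the two lists swapped (a sketch under
those assumptions, not the final code):

phys_addr_t start, end;
u64 i;

/* Ranges in memblock.reserved that are not in memblock.memory:
 * their struct pages never reach __init_single_page(), so zero
 * them explicitly. */
for_each_mem_range(i, &memblock.reserved, &memblock.memory,
		   NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
	unsigned long pfn;

	for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
		memset(pfn_to_page(pfn), 0, sizeof(struct page));
}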

Pasha

On 08/14/2017 09:55 AM, Michal Hocko wrote:
> Let's CC Hpa on this one. I am still not sure it is correct. The full
> series is here
> http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatashin@oracle.com
> 
> On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:
>> Struct pages are initialized by going through __init_single_page(). Since
>> the existing physical memory in memblock is represented in memblock.memory
>> list, struct page for every page from this list goes through
>> __init_single_page().
>>
>> The second memblock list, memblock.reserved, manages the allocated memory:
>> the memory that won't be available to the kernel allocator. So, every page
>> from this list goes through reserve_bootmem_region(), where certain struct
>> page fields are set, the assumption being that the struct pages have been
>> initialized beforehand.
>>
>> In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
>> memblock.memory might start at a later PFN. For example, in QEMU,
>> e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
>> so PFN 0 is not on memblock.memory (and hence isn't initialized via
>> __init_single_page) but is on memblock.reserved (and hence we set fields in
>> the uninitialized struct page).
>>
>> Currently, the struct page memory is always zeroed during allocation,
>> which prevents this problem from being detected. But, if some asserts
>> provided by CONFIG_DEBUG_VM_PGFLAGS are tightened, this problem may become
>> visible in existing kernels.
>>
>> In this patchset we will stop zeroing struct page memory during allocation.
>> Therefore, this bug must be fixed in order to avoid random assert failures
>> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
>>
>> The fix is to reserve memory from the first existing PFN.
>>
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Reviewed-by: Bob Picco <bob.picco@oracle.com>
>> ---
>>   arch/x86/kernel/setup.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 3486d0498800..489cdc141bcb 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
>>   
>>   static void __init trim_low_memory_range(void)
>>   {
>> -	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
>> +	unsigned long min_pfn = find_min_pfn_with_active_regions();
>> +	phys_addr_t base = min_pfn << PAGE_SHIFT;
>> +
>> +	memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
>>   }
>>   	
>>   /*
>> -- 
>> 2.14.0
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [v6 05/15] mm: don't accessed uninitialized struct pages
  2017-08-17 15:28           ` Pasha Tatashin
@ 2017-08-17 15:43             ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-08-17 15:43 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, will.deacon, catalin.marinas, sam

On Thu 17-08-17 11:28:23, Pasha Tatashin wrote:
> Hi Michal,
> 
> I've been looking through this code again, and I think your suggestion will
> work. I did not realize this iterator already exists:
> 
> for_each_free_mem_range() basically iterates through (memory && !reserved)
> 
> This is exactly what we need here. So, I will update this patch to use this
> iterator, which will simplify it.

Please have a look at
http://lkml.kernel.org/r/20170815093306.GC29067@dhcp22.suse.cz
I believe we can simply drop the check altogether.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2017-08-17 15:43 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-07 20:38 [v6 00/15] complete deferred page initialization Pavel Tatashin
2017-08-07 20:38 ` [v6 01/15] x86/mm: reserve only exiting low pages Pavel Tatashin
2017-08-11  8:07   ` Michal Hocko
2017-08-11 15:24     ` Pasha Tatashin
2017-08-14 11:40       ` Michal Hocko
2017-08-14 13:30         ` Pasha Tatashin
2017-08-14 13:55   ` Michal Hocko
2017-08-17 15:37     ` Pasha Tatashin
2017-08-07 20:38 ` [v6 02/15] x86/mm: setting fields in deferred pages Pavel Tatashin
2017-08-11  9:02   ` Michal Hocko
2017-08-11 15:39     ` Pasha Tatashin
2017-08-14 11:43       ` Michal Hocko
2017-08-14 13:32         ` Pasha Tatashin
2017-08-07 20:38 ` [v6 03/15] sparc64/mm: " Pavel Tatashin
2017-08-07 20:38 ` [v6 04/15] mm: discard memblock data later Pavel Tatashin
2017-08-11  9:32   ` Michal Hocko
2017-08-11  9:50     ` Mel Gorman
2017-08-11 15:49     ` Pasha Tatashin
2017-08-11 16:04       ` Michal Hocko
2017-08-11 16:22         ` Pasha Tatashin
2017-08-14 11:36           ` Michal Hocko
2017-08-14 13:35             ` Pasha Tatashin
2017-08-11 19:00     ` Pasha Tatashin
2017-08-14 11:34       ` Michal Hocko
2017-08-14 13:39         ` Pasha Tatashin
2017-08-14 13:42           ` Michal Hocko
2017-08-07 20:38 ` [v6 05/15] mm: don't accessed uninitialized struct pages Pavel Tatashin
2017-08-11  9:37   ` Michal Hocko
2017-08-11 15:55     ` Pasha Tatashin
2017-08-14 11:47       ` Michal Hocko
2017-08-14 13:51         ` Pasha Tatashin
2017-08-17 15:28           ` Pasha Tatashin
2017-08-17 15:43             ` Michal Hocko
2017-08-15  9:33   ` Michal Hocko
2017-08-07 20:38 ` [v6 06/15] sparc64: simplify vmemmap_populate Pavel Tatashin
2017-08-07 20:38 ` [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
2017-08-11 12:39   ` Michal Hocko
2017-08-11 15:58     ` Pasha Tatashin
2017-08-11 16:06       ` Michal Hocko
2017-08-11 16:24         ` Pasha Tatashin
2017-08-07 20:38 ` [v6 08/15] mm: zero struct pages during initialization Pavel Tatashin
2017-08-11 12:50   ` Michal Hocko
2017-08-11 16:03     ` Pasha Tatashin
2017-08-07 20:38 ` [v6 09/15] sparc64: optimized struct page zeroing Pavel Tatashin
2017-08-11 12:53   ` Michal Hocko
2017-08-11 16:04     ` Pasha Tatashin
2017-08-07 20:38 ` [v6 10/15] x86/kasan: explicitly zero kasan shadow memory Pavel Tatashin
2017-08-07 20:38 ` [v6 11/15] arm64/kasan: " Pavel Tatashin
2017-08-08  9:07   ` Will Deacon
2017-08-08 11:49     ` Pasha Tatashin
2017-08-08 12:30       ` Will Deacon
2017-08-08 12:49         ` Pasha Tatashin
2017-08-08 13:15       ` David Laight
2017-08-08 13:30         ` Pasha Tatashin
2017-08-07 20:38 ` [v6 12/15] mm: explicitly zero pagetable memory Pavel Tatashin
2017-08-07 20:38 ` [v6 13/15] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
2017-08-11 13:04   ` Michal Hocko
2017-08-11 16:11     ` Pasha Tatashin
2017-08-07 20:38 ` [v6 14/15] mm: optimize early system hash allocations Pavel Tatashin
2017-08-11 13:05   ` Michal Hocko
2017-08-11 16:13     ` Pasha Tatashin
2017-08-07 20:38 ` [v6 15/15] mm: debug for raw alloctor Pavel Tatashin
2017-08-11 13:08   ` Michal Hocko
2017-08-11 16:18     ` Pasha Tatashin
2017-08-14 11:50       ` Michal Hocko
2017-08-14 14:01         ` Pasha Tatashin
2017-08-15  9:36           ` Michal Hocko
2017-08-11  7:58 ` [v6 00/15] complete deferred page initialization Michal Hocko
2017-08-11 15:13   ` Pasha Tatashin
2017-08-11 15:22     ` Michal Hocko
