linux-mm.kvack.org archive mirror
* [PATCH v9 00/12] complete deferred page initialization
@ 2017-09-20 20:17 Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 01/12] x86/mm: setting fields in deferred pages Pavel Tatashin
                   ` (11 more replies)
  0 siblings, 12 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Changelog:
v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compile issue on the tile architecture, reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed a bug reported by the kbuild test robot; new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only existing low pages
  It is no longer needed because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug where
  the page variable was not reset when zone holes are present
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found a few more issues, which are fixed in
  this iteration
- Added page roundup/rounddown to the x86 and arm zeroing routines to zero
  the whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before it is initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
    * Split changes to platforms into 4 patches
    * Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86, which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all CPUs have been started, so
it can be done in parallel.
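
For context, the deferred path works by spawning one "pgdatinit" kthread
per memory node once the scheduler is up, and waiting for all of them to
finish; roughly (condensed from mm/page_alloc.c of this era):

void __init page_alloc_init_late(void)
{
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
	int nid;

	/* One thread per node with memory */
	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
	for_each_node_state(nid, N_MEMORY)
		kthread_run(deferred_init_memmap, NODE_DATA(nid),
			    "pgdatinit%d", nid);

	/* Block until all are initialised */
	wait_for_completion(&pgdat_init_all_done_comp);
#endif
}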

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and that zeroing is done early in boot by a single thread only.  Also, we
access that memory and set flags before the struct pages are initialized.
All of this is fixed in this patchset.

In this work we do the following:
- Never read a struct page until it has been initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization
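
The key piece is moving the zeroing into __init_single_page() itself, where
the memset() also acts as a prefetch for the field stores that immediately
follow it. A condensed sketch of the end result (the actual changes are in
patches 05, 06, and 12 below):

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
				unsigned long zone, int nid)
{
	mm_zero_struct_page(page);	/* was: bulk memset() at allocation time */
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	/* ... remaining fields ... */
}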


==========================================================================
Performance improvements on an x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%
==========================================================================
SPARC M6 3600 MHz with 15T of memory
                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%
==========================================================================
Raw dmesg output with timestamps:
x86 base no deferred:    https://hastebin.com/ofunepurit.scala
x86 base deferred:       https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred:     https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:      https://hastebin.com/xadinobutu.go

Pavel Tatashin (12):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  mm/kasan: kasan specific map populate function
  x86/kasan: use kasan_map_populate()
  arm64/kasan: use kasan_map_populate()
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/include/asm/pgtable.h    |   3 +
 arch/arm64/mm/kasan_init.c          |  12 +--
 arch/sparc/include/asm/pgtable_64.h |  30 ++++++
 arch/sparc/mm/init_64.c             |  31 +++---
 arch/x86/mm/init_64.c               |   9 +-
 arch/x86/mm/kasan_init_64.c         |   8 +-
 include/linux/bootmem.h             |  27 +++++
 include/linux/kasan.h               |   2 +
 include/linux/memblock.h            |  16 +++
 include/linux/mm.h                  |  26 +++++
 mm/kasan/kasan_init.c               |  67 ++++++++++++
 mm/memblock.c                       |  60 +++++++++--
 mm/page_alloc.c                     | 207 ++++++++++++++++++++----------------
 mm/sparse-vmemmap.c                 |  15 ++-
 mm/sparse.c                         |   6 +-
 15 files changed, 380 insertions(+), 139 deletions(-)

-- 
2.14.1


* [PATCH v9 01/12] x86/mm: setting fields in deferred pages
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 12:26   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 02/12] sparc64/mm: " Pavel Tatashin
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never set prior to their first
initialization in __init_single_page().

With the deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info() that are subsequently clobbered right after in
free_all_bootmem():

        mem_init() {
                register_page_bootmem_info();
                free_all_bootmem();
                ...
        }

When register_page_bootmem_info() is called, only non-deferred struct pages
are initialized. But this function goes through some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
    register_page_bootmem_info_node
     get_page_bootmem
      .. setting fields here ..
      such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
    for_each_reserved_mem_region()
     reserve_bootmem_region()
      init_reserved_page() <- Only if this is deferred reserved page
       __init_single_pfn()
        __init_single_page()
            memset(0) <-- Lose the set fields here

We end up in a situation where we currently do not observe a problem,
because the memory is explicitly zeroed. But if the flag assertions are
changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 arch/x86/mm/init_64.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..30fe22558720 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,17 @@ void __init mem_init(void)
 
 	/* clear_bss() already clear the empty_zero_page */
 
-	register_page_bootmem_info();
-
 	/* this will put all memory onto the freelists */
 	free_all_bootmem();
 	after_bootmem = 1;
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/* Register memory areas for /proc/kcore */
 	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 			 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.1


* [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 01/12] x86/mm: setting fields in deferred pages Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 12:28   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 03/12] mm: deferred_init_memmap improvements Pavel Tatashin
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never set prior to their first
initialization in __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initialization:

mem_init() {
     register_page_bootmem_info();
     free_all_bootmem();
     ...
}

When register_page_bootmem_info() is called, only non-deferred struct pages
are initialized. But this function goes through some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
    __init_single_pfn()
     __init_single_page()
      memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem because the memory is zeroed. But if the flag
assertions are changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/mm/init_64.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..310c6754bcaa 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,15 @@ void __init mem_init(void)
 {
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-	register_page_bootmem_info();
 	free_all_bootmem();
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/*
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
-- 
2.14.1


* [PATCH v9 03/12] mm: deferred_init_memmap improvements
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 01/12] x86/mm: setting fields in deferred pages Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 02/12] sparc64/mm: " Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 12:57   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 04/12] sparc64: simplify vmemmap_populate Pavel Tatashin
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

This patch fixes two issues in deferred_init_memmap():

=====
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
	VM_BUG_ON(page_zone(page) != zone);
	goto free_range;
}

This way we are checking whether the current deferred page has already been
initialized. It works because the memory for struct pages has been zeroed,
and the only way flags can be non-zero is if the page went through
__init_single_page() before.  But once we change the current behavior and
no longer zero the memory in the memblock allocator, we cannot trust
anything inside struct pages until they are initialized. This patch fixes
this.

deferred_init_memmap() is rewritten to loop through only the free memory
ranges provided by memblock.

=====
This patch fixes another existing issue on systems that have holes in
zones, i.e., when CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
	goto free_range;

Note: 'page' is not set to NULL and is not incremented, but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with this
kind of hole, Linux would get memory corruption. This issue is fixed by
defining a new macro (DEFERRED_FREE in the diff below) that performs all
the necessary operations when we free the current set of pages.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 mm/page_alloc.c | 161 +++++++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..d132c801d2c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-					unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+				       unsigned long nr_pages)
 {
-	int i;
+	struct page *page;
+	unsigned long i;
 
-	if (!page)
+	if (!nr_pages)
 		return;
 
+	page = pfn_to_page(pfn);
+
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,82 @@ static inline void __init pgdat_init_report_one_done(void)
 		complete(&pgdat_init_all_done_comp);
 }
 
+#define DEFERRED_FREE(nr_free, free_base_pfn, page)			\
+({									\
+	unsigned long nr = (nr_free);					\
+									\
+	deferred_free_range((free_base_pfn), (nr));			\
+	(free_base_pfn) = 0;						\
+	(nr_free) = 0;							\
+	page = NULL;							\
+	nr;								\
+})
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+					 unsigned long end_pfn)
+{
+	struct mminit_pfnnid_cache nid_init_state = { };
+	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	unsigned long free_base_pfn = 0;
+	unsigned long nr_pages = 0;
+	unsigned long nr_free = 0;
+	struct page *page = NULL;
+
+	for (; pfn < end_pfn; pfn++) {
+		/*
+		 * First we check if pfn is valid on architectures where it is
+		 * possible to have holes within pageblock_nr_pages. On systems
+		 * where it is not possible, this function is optimized out.
+		 *
+		 * Then, we check if a current large page is valid by only
+		 * checking the validity of the head pfn.
+		 *
+		 * meminit_pfn_in_nid is checked on systems where pfns can
+		 * interleave within a node: a pfn is between start and end
+		 * of a node, but does not belong to this memory node.
+		 *
+		 * Finally, we minimize pfn page lookups and scheduler checks by
+		 * performing it only once every pageblock_nr_pages.
+		 */
+		if (!pfn_valid_within(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (page && (pfn & nr_pgmask)) {
+			page++;
+			__init_single_page(page, pfn, zid, nid);
+			nr_free++;
+		} else {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+			page = pfn_to_page(pfn);
+			__init_single_page(page, pfn, zid, nid);
+			free_base_pfn = pfn;
+			nr_free = 1;
+			cond_resched();
+		}
+	}
+	/* Free the last block of pages to allocator */
+	nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+
+	return nr_pages;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	int nid = pgdat->node_id;
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long start = jiffies;
 	unsigned long nr_pages = 0;
-	unsigned long walk_start, walk_end;
-	int i, zid;
+	unsigned long spfn, epfn;
+	phys_addr_t spa, epa;
+	int zid;
 	struct zone *zone;
 	unsigned long first_init_pfn = pgdat->first_deferred_pfn;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	u64 i;
 
 	if (first_init_pfn == ULONG_MAX) {
 		pgdat_init_report_one_done();
@@ -1477,83 +1543,12 @@ static int __init deferred_init_memmap(void *data)
 		if (first_init_pfn < zone_end_pfn(zone))
 			break;
 	}
+	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
 
-	for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
-		unsigned long pfn, end_pfn;
-		struct page *page = NULL;
-		struct page *free_base_page = NULL;
-		unsigned long free_base_pfn = 0;
-		int nr_to_free = 0;
-
-		end_pfn = min(walk_end, zone_end_pfn(zone));
-		pfn = first_init_pfn;
-		if (pfn < walk_start)
-			pfn = walk_start;
-		if (pfn < zone->zone_start_pfn)
-			pfn = zone->zone_start_pfn;
-
-		for (; pfn < end_pfn; pfn++) {
-			if (!pfn_valid_within(pfn))
-				goto free_range;
-
-			/*
-			 * Ensure pfn_valid is checked every
-			 * pageblock_nr_pages for memory holes
-			 */
-			if ((pfn & (pageblock_nr_pages - 1)) == 0) {
-				if (!pfn_valid(pfn)) {
-					page = NULL;
-					goto free_range;
-				}
-			}
-
-			if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
-				page = NULL;
-				goto free_range;
-			}
-
-			/* Minimise pfn page lookups and scheduler checks */
-			if (page && (pfn & (pageblock_nr_pages - 1)) != 0) {
-				page++;
-			} else {
-				nr_pages += nr_to_free;
-				deferred_free_range(free_base_page,
-						free_base_pfn, nr_to_free);
-				free_base_page = NULL;
-				free_base_pfn = nr_to_free = 0;
-
-				page = pfn_to_page(pfn);
-				cond_resched();
-			}
-
-			if (page->flags) {
-				VM_BUG_ON(page_zone(page) != zone);
-				goto free_range;
-			}
-
-			__init_single_page(page, pfn, zid, nid);
-			if (!free_base_page) {
-				free_base_page = page;
-				free_base_pfn = pfn;
-				nr_to_free = 0;
-			}
-			nr_to_free++;
-
-			/* Where possible, batch up pages for a single free */
-			continue;
-free_range:
-			/* Free the current block of pages to allocator */
-			nr_pages += nr_to_free;
-			deferred_free_range(free_base_page, free_base_pfn,
-								nr_to_free);
-			free_base_page = NULL;
-			free_base_pfn = nr_to_free = 0;
-		}
-		/* Free the last block of pages to allocator */
-		nr_pages += nr_to_free;
-		deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
-
-		first_init_pfn = max(end_pfn, first_init_pfn);
+	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+		nr_pages += deferred_init_range(nid, zid, spfn, epfn);
 	}
 
 	/* Sanity check that the next zone really is unpopulated */
-- 
2.14.1


* [PATCH v9 04/12] sparc64: simplify vmemmap_populate
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (2 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 03/12] mm: deferred_init_memmap improvements Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 12:59   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 05/12] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Remove duplicated code by using the common functions
vmemmap_pud_populate() and vmemmap_pgd_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/mm/init_64.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 310c6754bcaa..99aea4d15a5f 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2651,30 +2651,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.1


* [PATCH v9 05/12] mm: defining memblock_virt_alloc_try_nid_raw
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (3 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 04/12] sparc64: simplify vmemmap_populate Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 06/12] mm: zero struct pages during initialization Pavel Tatashin
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

* A new variant of the memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
    - Does not zero the allocated memory
    - Does not panic if the request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for a system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface, and thus improve boot performance.
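
As an illustration of the flag in use: a sketch modeled on the early dentry
hash table setup in fs/dcache.c, which wants zeroed memory and therefore
passes HASH_ZERO together with HASH_EARLY:

dentry_hashtable =
	alloc_large_system_hash("Dentry cache",
				sizeof(struct hlist_bl_head),
				dhash_entries,
				13,
				HASH_EARLY | HASH_ZERO,
				&d_hash_shift,
				&d_hash_mask,
				0,
				0);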

* debug for the raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/bootmem.h | 27 ++++++++++++++++++++++
 mm/memblock.c           | 60 +++++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c         | 15 ++++++-------
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
 					    NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+					    BOOTMEM_ALLOC_ACCESSIBLE,
+					    NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
 	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	if (!align)
+		align = SMP_CACHE_BYTES;
+	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
 					  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+				min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t size, phys_addr_t align,
 			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr,
+			int nid)
+{
+	void *ptr;
+
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, (void *)_RET_IP_);
+
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+#ifdef CONFIG_DEBUG_VM
+	if (ptr && size > 0)
+		memset(ptr, 0xff, size);
+#endif
+	return ptr;
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1351,8 +1389,8 @@ static void * __init memblock_virt_alloc_internal(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
- * additional debug information (including caller info), if enabled.
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. This function zeroes the allocated memory.
  *
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
@@ -1362,11 +1400,17 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 				phys_addr_t min_addr, phys_addr_t max_addr,
 				int nid)
 {
+	void *ptr;
+
 	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+	if (ptr)
+		memset(ptr, 0, size);
+	return ptr;
 }
 
 /**
@@ -1380,7 +1424,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public panicking version of _memblock_virt_alloc_try_nid_nopanic()
+ * Public panicking version of memblock_virt_alloc_try_nid_nopanic()
  * which provides debug information (including caller info), if enabled,
  * and panics if the request can not be satisfied.
  *
@@ -1399,8 +1443,10 @@ void * __init memblock_virt_alloc_try_nid(
 		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
 					   min_addr, max_addr, nid);
-	if (ptr)
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d132c801d2c1..a8dbd405ed94 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7299,18 +7299,17 @@ void *__init alloc_large_system_hash(const char *tablename,
 
 	log2qty = ilog2(numentries);
 
-	/*
-	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
-	 * currently not used when HASH_EARLY is specified.
-	 */
 	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
 		size = bucketsize << log2qty;
-		if (flags & HASH_EARLY)
-			table = memblock_virt_alloc_nopanic(size, 0);
-		else if (hashdist)
+		if (flags & HASH_EARLY) {
+			if (flags & HASH_ZERO)
+				table = memblock_virt_alloc_nopanic(size, 0);
+			else
+				table = memblock_virt_alloc_raw(size, 0);
+		} else if (hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
-		else {
+		} else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
 			 * some pages at the end of hash table which
-- 
2.14.1


* [PATCH v9 06/12] mm: zero struct pages during initialization
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (4 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 05/12] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 13:08   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 07/12] sparc64: optimized struct page zeroing Pavel Tatashin
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This is single thread performance, collected on: Intel(R) Xeon(R) CPU
E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                        BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init() is where the memory for struct pages is zeroed, and the
zeroing part is moved later in this patch into __init_single_page(), which
is called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h | 9 +++++++++
 mm/page_alloc.c    | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..50b74d628243 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8dbd405ed94..4b630ee91430 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
-- 
2.14.1


* [PATCH v9 07/12] sparc64: optimized struct page zeroing
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (5 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 06/12] mm: zero struct pages during initialization Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 08/12] mm: zero reserved and unavailable struct pages Pavel Tatashin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, based on the size of
struct page. The compiler optimizes out the conditions of the switch()
statement.

SPARC-M6 with 15T of memory, single thread performance:

                               BASE            FIX  OPTIMIZED_FIX
        bootmem_init   28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
                      --------------------------------------------
Total                 231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note: about two seconds in this function is fixed time; it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or shrinks below 64.
+ * The idea is that the compiler optimizes out the switch() statement, and
+ * only leaves clrx instructions.
+ */
+#define	mm_zero_struct_page(pp) do {					\
+	unsigned long *_pp = (void *)(pp);				\
+									\
+	 /* Check that struct page is either 64, 72, or 80 bytes */	\
+	BUILD_BUG_ON(sizeof(struct page) & 7);				\
+	BUILD_BUG_ON(sizeof(struct page) < 64);				\
+	BUILD_BUG_ON(sizeof(struct page) > 80);				\
+									\
+	switch (sizeof(struct page)) {					\
+	case 80:							\
+		_pp[9] = 0;	/* fallthrough */			\
+	case 72:							\
+		_pp[8] = 0;	/* fallthrough */			\
+	default:							\
+		_pp[7] = 0;						\
+		_pp[6] = 0;						\
+		_pp[5] = 0;						\
+		_pp[4] = 0;						\
+		_pp[3] = 0;						\
+		_pp[2] = 0;						\
+		_pp[1] = 0;						\
+		_pp[0] = 0;						\
+	}								\
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.1


* [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (6 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 07/12] sparc64: optimized struct page zeroing Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 13:18   ` Michal Hocko
  2017-09-20 20:17 ` [PATCH v9 09/12] mm/kasan: kasan specific map populate function Pavel Tatashin
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Some memory is reserved but unavailable: it is not present in
memblock.memory (because it is not backed by physical pages), but it is
present in memblock.reserved. Such memory has backing struct pages, but
they are not initialized by going through __init_single_page().

In some cases these struct pages are accessed even though they do not
contain any data. One example: page_to_pfn() might access page->flags, if
that is where the section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).
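
For reference, with SECTION_IN_PAGE_FLAGS the pfn is computed from the
section number stored in page->flags, roughly as in
include/asm-generic/memory_model.h:

#define __page_to_pfn(pg)					\
({	const struct page *__pg = (pg);				\
	int __sec = page_to_section(__pg);	/* reads page->flags */	\
	(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec)));	\
})

So calling page_to_pfn() on a never-initialized struct page reads garbage
flags and can resolve to the wrong section.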

Since struct pages are zeroed in __init_single_page(), and not at
allocation time, we must zero such struct pages explicitly.

The patch adds a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

which iterates through the (reserved && !memory) ranges, and we zero the
struct pages explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       |  6 ++++++
 mm/page_alloc.c          | 30 ++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..bdd4268f9323 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, the flags
+ * and nid arguments do not make sense and thus are not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50b74d628243..a7bba4ce79ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2010,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b630ee91430..1d38d391dffd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6202,6 +6202,34 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+	pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6625,6 +6653,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6788,6 +6817,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
-- 
2.14.1


* [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (7 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 08/12] mm: zero reserved and unavailable struct pages Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 14:48   ` Mark Rutland
  2017-09-20 20:17 ` [PATCH v9 10/12] x86/kasan: use kasan_map_populate() Pavel Tatashin
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

Because of the current project, vmemmap memory won't be zeroed during
allocation, but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/include/asm/pgtable.h |  3 ++
 include/linux/kasan.h            |  2 ++
 mm/kasan/kasan_init.c            | 67 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bc4e92337d16..d89713f04354 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -381,6 +381,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 				 PUD_TYPE_TABLE)
 #endif
 
+#define pmd_large(pmd)		pmd_sect(pmd)
+#define pud_large(pud)		pud_sect(pud)
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	*pmdp = pmd;
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index a5c7046f26b4..7e13df1722c2 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -78,6 +78,8 @@ size_t kasan_metadata_size(struct kmem_cache *cache);
 
 bool kasan_save_enable_multi_shot(void);
 void kasan_restore_multi_shot(bool enabled);
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+				 int node);
 
 #else /* CONFIG_KASAN */
 
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 554e4c0f23a2..57a973f05f63 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -197,3 +197,70 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
 		zero_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
+
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+				 int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = p4d_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_large(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_large(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
-- 
2.14.1


* [PATCH v9 10/12] x86/kasan: use kasan_map_populate()
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (8 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 09/12] mm/kasan: kasan specific map populate function Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 11/12] arm64/kasan: " Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
  11 siblings, 0 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..2db95efd208e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -23,7 +23,7 @@ static int __init map_range(struct range *range)
 	start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
 	end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-	return vmemmap_populate(start, end, NUMA_NO_NODE);
+	return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +136,9 @@ void __init kasan_init(void)
 		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
 		kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-	vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-			(unsigned long)kasan_mem_to_shadow(_end),
-			NUMA_NO_NODE);
+	kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+			   (unsigned long)kasan_mem_to_shadow(_end),
+			   NUMA_NO_NODE);
 
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
-- 
2.14.1


* [PATCH v9 11/12] arm64/kasan: use kasan_map_populate()
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (9 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 10/12] x86/kasan: use kasan_map_populate() Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-09-20 20:17 ` [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
  11 siblings, 0 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..b6e92cfa3ea3 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -161,11 +161,11 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
-	 * vmemmap_populate() has populated the shadow region that covers the
+	 * kasan_map_populate() has populated the shadow region that covers the
 	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 	 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +191,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				(unsigned long)kasan_mem_to_shadow(end),
-				pfn_to_nid(virt_to_pfn(start)));
+		kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+				   (unsigned long)kasan_mem_to_shadow(end),
+				   pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap
  2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
                   ` (10 preceding siblings ...)
  2017-09-20 20:17 ` [PATCH v9 11/12] arm64/kasan: " Pavel Tatashin
@ 2017-09-20 20:17 ` Pavel Tatashin
  2017-10-03 13:19   ` Michal Hocko
  11 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-09-20 20:17 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zeroed by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing versions.
This way, we get a performance improvement by zeroing the memory in
parallel when struct pages are initialized.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/mm.h  | 11 +++++++++++
 mm/sparse-vmemmap.c | 15 +++++++--------
 mm/sparse.c         |  6 +++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7bba4ce79ba..25848764570f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 01/12] x86/mm: setting fields in deferred pages
  2017-09-20 20:17 ` [PATCH v9 01/12] x86/mm: setting fields in deferred pages Pavel Tatashin
@ 2017-10-03 12:26   ` Michal Hocko
  2017-10-03 15:07     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 12:26 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:03, Pavel Tatashin wrote:
> Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed before the
> struct pages are first initialized by going through __init_single_page().
> 
> With deferred struct page feature enabled, however, we set fields in
> register_page_bootmem_info that are subsequently clobbered right after in
> free_all_bootmem:
> 
>         mem_init() {
>                 register_page_bootmem_info();
>                 free_all_bootmem();
>                 ...
>         }
> 
> When register_page_bootmem_info() is called, only non-deferred struct pages
> are initialized. But this function goes through some reserved pages which
> might be part of the deferred range, and thus are not yet initialized.
> 
>   mem_init
>    register_page_bootmem_info
>     register_page_bootmem_info_node
>      get_page_bootmem
>       .. setting fields here ..
>       such as: page->freelist = (void *)type;
> 
>   free_all_bootmem()
>    free_low_memory_core_early()
>     for_each_reserved_mem_region()
>      reserve_bootmem_region()
>       init_reserved_page() <- Only if this is deferred reserved page
>        __init_single_pfn()
>         __init_single_page()
>             memset(0) <-- Lose the set fields here
> 
> We end up with a latent issue: currently we do not observe a problem, as
> the memory is explicitly zeroed. But if flag asserts are changed, we can
> start hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.

Thanks for extending the changelog. This is more informative now.
 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

I hope I haven't missed anything but it looks good to me.

Acked-by: Michal Hocko <mhocko@suse.com>

one nit below
> ---
>  arch/x86/mm/init_64.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5ea1c3c2636e..30fe22558720 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1182,12 +1182,17 @@ void __init mem_init(void)
>  
>  	/* clear_bss() already clear the empty_zero_page */
>  
> -	register_page_bootmem_info();
> -
>  	/* this will put all memory onto the freelists */
>  	free_all_bootmem();
>  	after_bootmem = 1;
>  
> +	/* Must be done after boot memory is put on freelist, because here we

standard code style is to do
	/*
	 * text starts here

> +	 * might set fields in deferred struct pages that have not yet been
> +	 * initialized, and free_all_bootmem() initializes all the reserved
> +	 * deferred pages for us.
> +	 */
> +	register_page_bootmem_info();
> +
>  	/* Register memory areas for /proc/kcore */
>  	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
>  			 PAGE_SIZE, KCORE_OTHER);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages
  2017-09-20 20:17 ` [PATCH v9 02/12] sparc64/mm: " Pavel Tatashin
@ 2017-10-03 12:28   ` Michal Hocko
  2017-10-03 15:10     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 12:28 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:04, Pavel Tatashin wrote:
> Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed before the
> struct pages are first initialized by going through __init_single_page().
> 
> With deferred struct page feature enabled there is a case where we set some
> fields prior to initializing:
> 
> mem_init() {
>      register_page_bootmem_info();
>      free_all_bootmem();
>      ...
> }
> 
> When register_page_bootmem_info() is called, only non-deferred struct pages
> are initialized. But this function goes through some reserved pages which
> might be part of the deferred range, and thus are not yet initialized.
> 
> mem_init
> register_page_bootmem_info
> register_page_bootmem_info_node
>  get_page_bootmem
>   .. setting fields here ..
>   such as: page->freelist = (void *)type;
> 
> free_all_bootmem()
> free_low_memory_core_early()
>  for_each_reserved_mem_region()
>   reserve_bootmem_region()
>    init_reserved_page() <- Only if this is deferred reserved page
>     __init_single_pfn()
>      __init_single_page()
>       memset(0) <-- Lose the set fields here
> 
> We end up with a similar issue as in the previous patch: currently we do
> not observe a problem, as the memory is zeroed. But if flag asserts are
> changed, we can start hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> Acked-by: David S. Miller <davem@davemloft.net>

As you separated the x86 and sparc patches doing essentially the same
thing, I assume David is going to take this patch?

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  arch/sparc/mm/init_64.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 6034569e2c0d..310c6754bcaa 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2548,9 +2548,15 @@ void __init mem_init(void)
>  {
>  	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
>  
> -	register_page_bootmem_info();
>  	free_all_bootmem();
>  
> +	/* Must be done after boot memory is put on freelist, because here we
> +	 * might set fields in deferred struct pages that have not yet been
> +	 * initialized, and free_all_bootmem() initializes all the reserved
> +	 * deferred pages for us.
> +	 */
> +	register_page_bootmem_info();
> +
>  	/*
>  	 * Set up the zero page, mark it reserved, so that page count
>  	 * is not manipulated when freeing the page from user ptes.
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements
  2017-09-20 20:17 ` [PATCH v9 03/12] mm: deferred_init_memmap improvements Pavel Tatashin
@ 2017-10-03 12:57   ` Michal Hocko
  2017-10-03 15:15     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 12:57 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:05, Pavel Tatashin wrote:
> This patch fixes two issues in deferred_init_memmap().
> 
> =====
> In deferred_init_memmap() where all deferred struct pages are initialized
> we have a check like this:
> 
> if (page->flags) {
> 	VM_BUG_ON(page_zone(page) != zone);
> 	goto free_range;
> }
> 
> This way we are checking if the current deferred page has already been
> initialized. It works because the memory for struct pages has been zeroed,
> and the only way the flags can be non-zero is if the page went through
> __init_single_page() before.  But once we change the current behavior and
> no longer zero the memory in the memblock allocator, we cannot trust
> anything inside "struct page"es until they are initialized. This patch
> fixes this.
> 
> deferred_init_memmap() is re-written to loop only through the free memory
> ranges provided by memblock.

Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit about how the new code works.

I like how the resulting code is more compact and smaller.
for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.
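
Something along these lines, perhaps (an untested sketch, reusing the
variables from your new deferred_init_range(); the extra "init" flag is
mine):

        bool init = false;

        for (; pfn < end_pfn; pfn++) {
                if (!pfn_valid_within(pfn))
                        goto free_range;
                if (!(pfn & nr_pgmask) && !pfn_valid(pfn))
                        goto free_range;
                if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
                        goto free_range;
                if (page && (pfn & nr_pgmask)) {
                        /* continue the current batch */
                        page++;
                        __init_single_page(page, pfn, zid, nid);
                        nr_free++;
                        continue;
                }
                init = true;    /* flush, then start a new batch below */
free_range:
                /* free the current batch and reset the counters */
                nr_pages += nr_free;
                deferred_free_range(free_base_pfn, nr_free);
                free_base_pfn = 0;
                nr_free = 0;
                page = NULL;
                if (init) {
                        page = pfn_to_page(pfn);
                        __init_single_page(page, pfn, zid, nid);
                        free_base_pfn = pfn;
                        nr_free = 1;
                        cond_resched();
                        init = false;
                }
        }
        /* free the last batch */
        nr_pages += nr_free;
        deferred_free_range(free_base_pfn, nr_free);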

> =====
> This patch fixes another existing issue on systems that have holes in
> zones i.e CONFIG_HOLES_IN_ZONE is defined.
> 
> In for_each_mem_pfn_range() we have code like this:
> 
> if (!pfn_valid_within(pfn)
> 	goto free_range;
> 
> Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
> This means that if deferred struct pages are enabled on systems with these
> kinds of holes, Linux would get memory corruption. I have fixed this issue by
> defining a new macro that performs all the necessary operations when we
> free the current set of pages.

please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  mm/page_alloc.c | 161 +++++++++++++++++++++++++++-----------------------------
>  1 file changed, 78 insertions(+), 83 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c841af88836a..d132c801d2c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
>  }
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> -static void __init deferred_free_range(struct page *page,
> -					unsigned long pfn, int nr_pages)
> +static void __init deferred_free_range(unsigned long pfn,
> +				       unsigned long nr_pages)
>  {
> -	int i;
> +	struct page *page;
> +	unsigned long i;
>  
> -	if (!page)
> +	if (!nr_pages)
>  		return;
>  
> +	page = pfn_to_page(pfn);
> +
>  	/* Free a large naturally-aligned chunk if possible */
>  	if (nr_pages == pageblock_nr_pages &&
>  	    (pfn & (pageblock_nr_pages - 1)) == 0) {
> @@ -1443,19 +1446,82 @@ static inline void __init pgdat_init_report_one_done(void)
>  		complete(&pgdat_init_all_done_comp);
>  }
>  
> +#define DEFERRED_FREE(nr_free, free_base_pfn, page)			\
> +({									\
> +	unsigned long nr = (nr_free);					\
> +									\
> +	deferred_free_range((free_base_pfn), (nr));			\
> +	(free_base_pfn) = 0;						\
> +	(nr_free) = 0;							\
> +	page = NULL;							\
> +	nr;								\
> +})
> +
> +static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
> +					 unsigned long end_pfn)
> +{
> +	struct mminit_pfnnid_cache nid_init_state = { };
> +	unsigned long nr_pgmask = pageblock_nr_pages - 1;
> +	unsigned long free_base_pfn = 0;
> +	unsigned long nr_pages = 0;
> +	unsigned long nr_free = 0;
> +	struct page *page = NULL;
> +
> +	for (; pfn < end_pfn; pfn++) {
> +		/*
> +		 * First we check if pfn is valid on architectures where it is
> +		 * possible to have holes within pageblock_nr_pages. On systems
> +		 * where it is not possible, this function is optimized out.
> +		 *
> +		 * Then, we check if a current large page is valid by only
> +		 * checking the validity of the head pfn.
> +		 *
> +		 * meminit_pfn_in_nid is checked on systems where pfns can
> +		 * interleave within a node: a pfn is between start and end
> +		 * of a node, but does not belong to this memory node.
> +		 *
> +		 * Finally, we minimize pfn page lookups and scheduler checks by
> +		 * performing it only once every pageblock_nr_pages.
> +		 */
> +		if (!pfn_valid_within(pfn)) {
> +			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
> +		} else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
> +			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
> +		} else if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
> +			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
> +		} else if (page && (pfn & nr_pgmask)) {
> +			page++;
> +			__init_single_page(page, pfn, zid, nid);
> +			nr_free++;
> +		} else {
> +			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
> +			page = pfn_to_page(pfn);
> +			__init_single_page(page, pfn, zid, nid);
> +			free_base_pfn = pfn;
> +			nr_free = 1;
> +			cond_resched();
> +		}
> +	}
> +	/* Free the last block of pages to allocator */
> +	nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
> +
> +	return nr_pages;
> +}
> +
>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>  	pg_data_t *pgdat = data;
>  	int nid = pgdat->node_id;
> -	struct mminit_pfnnid_cache nid_init_state = { };
>  	unsigned long start = jiffies;
>  	unsigned long nr_pages = 0;
> -	unsigned long walk_start, walk_end;
> -	int i, zid;
> +	unsigned long spfn, epfn;
> +	phys_addr_t spa, epa;
> +	int zid;
>  	struct zone *zone;
>  	unsigned long first_init_pfn = pgdat->first_deferred_pfn;
>  	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	u64 i;
>  
>  	if (first_init_pfn == ULONG_MAX) {
>  		pgdat_init_report_one_done();
> @@ -1477,83 +1543,12 @@ static int __init deferred_init_memmap(void *data)
>  		if (first_init_pfn < zone_end_pfn(zone))
>  			break;
>  	}
> +	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
>  
> -	for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
> -		unsigned long pfn, end_pfn;
> -		struct page *page = NULL;
> -		struct page *free_base_page = NULL;
> -		unsigned long free_base_pfn = 0;
> -		int nr_to_free = 0;
> -
> -		end_pfn = min(walk_end, zone_end_pfn(zone));
> -		pfn = first_init_pfn;
> -		if (pfn < walk_start)
> -			pfn = walk_start;
> -		if (pfn < zone->zone_start_pfn)
> -			pfn = zone->zone_start_pfn;
> -
> -		for (; pfn < end_pfn; pfn++) {
> -			if (!pfn_valid_within(pfn))
> -				goto free_range;
> -
> -			/*
> -			 * Ensure pfn_valid is checked every
> -			 * pageblock_nr_pages for memory holes
> -			 */
> -			if ((pfn & (pageblock_nr_pages - 1)) == 0) {
> -				if (!pfn_valid(pfn)) {
> -					page = NULL;
> -					goto free_range;
> -				}
> -			}
> -
> -			if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
> -				page = NULL;
> -				goto free_range;
> -			}
> -
> -			/* Minimise pfn page lookups and scheduler checks */
> -			if (page && (pfn & (pageblock_nr_pages - 1)) != 0) {
> -				page++;
> -			} else {
> -				nr_pages += nr_to_free;
> -				deferred_free_range(free_base_page,
> -						free_base_pfn, nr_to_free);
> -				free_base_page = NULL;
> -				free_base_pfn = nr_to_free = 0;
> -
> -				page = pfn_to_page(pfn);
> -				cond_resched();
> -			}
> -
> -			if (page->flags) {
> -				VM_BUG_ON(page_zone(page) != zone);
> -				goto free_range;
> -			}
> -
> -			__init_single_page(page, pfn, zid, nid);
> -			if (!free_base_page) {
> -				free_base_page = page;
> -				free_base_pfn = pfn;
> -				nr_to_free = 0;
> -			}
> -			nr_to_free++;
> -
> -			/* Where possible, batch up pages for a single free */
> -			continue;
> -free_range:
> -			/* Free the current block of pages to allocator */
> -			nr_pages += nr_to_free;
> -			deferred_free_range(free_base_page, free_base_pfn,
> -								nr_to_free);
> -			free_base_page = NULL;
> -			free_base_pfn = nr_to_free = 0;
> -		}
> -		/* Free the last block of pages to allocator */
> -		nr_pages += nr_to_free;
> -		deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
> -
> -		first_init_pfn = max(end_pfn, first_init_pfn);
> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> +		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> +		nr_pages += deferred_init_range(nid, zid, spfn, epfn);
>  	}
>  
>  	/* Sanity check that the next zone really is unpopulated */
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 04/12] sparc64: simplify vmemmap_populate
  2017-09-20 20:17 ` [PATCH v9 04/12] sparc64: simplify vmemmap_populate Pavel Tatashin
@ 2017-10-03 12:59   ` Michal Hocko
  2017-10-03 15:20     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 12:59 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:06, Pavel Tatashin wrote:
> Remove duplicating code by using common functions
> vmemmap_pud_populate and vmemmap_pgd_populate.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> Acked-by: David S. Miller <davem@davemloft.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  arch/sparc/mm/init_64.c | 23 ++++++-----------------
>  1 file changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 310c6754bcaa..99aea4d15a5f 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2651,30 +2651,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
>  	vstart = vstart & PMD_MASK;
>  	vend = ALIGN(vend, PMD_SIZE);
>  	for (; vstart < vend; vstart += PMD_SIZE) {
> -		pgd_t *pgd = pgd_offset_k(vstart);
> +		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
>  		unsigned long pte;
>  		pud_t *pud;
>  		pmd_t *pmd;
>  
> -		if (pgd_none(*pgd)) {
> -			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
> +		if (!pgd)
> +			return -ENOMEM;
>  
> -			if (!new)
> -				return -ENOMEM;
> -			pgd_populate(&init_mm, pgd, new);
> -		}
> -
> -		pud = pud_offset(pgd, vstart);
> -		if (pud_none(*pud)) {
> -			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
> -
> -			if (!new)
> -				return -ENOMEM;
> -			pud_populate(&init_mm, pud, new);
> -		}
> +		pud = vmemmap_pud_populate(pgd, vstart, node);
> +		if (!pud)
> +			return -ENOMEM;
>  
>  		pmd = pmd_offset(pud, vstart);
> -
>  		pte = pmd_val(*pmd);
>  		if (!(pte & _PAGE_VALID)) {
>  			void *block = vmemmap_alloc_block(PMD_SIZE, node);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 06/12] mm: zero struct pages during initialization
  2017-09-20 20:17 ` [PATCH v9 06/12] mm: zero struct pages during initialization Pavel Tatashin
@ 2017-10-03 13:08   ` Michal Hocko
  2017-10-03 15:22     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 13:08 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
> Add struct page zeroing as a part of initialization of other fields in
> __init_single_page().
> 
> This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
> v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
> 
>                         BASE            FIX
> sparse_init     11.244671836s   0.007199623s
> zone_sizes_init  4.879775891s   8.355182299s
>                   --------------------------
> Total           16.124447727s   8.362381922s

Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these numbers depend on the last patch in the series?

> sparse_init is where memory for struct pages is zeroed, and the zeroing
> part is moved later in this patch into __init_single_page(), which is
> called from zone_sizes_init().
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mm.h | 9 +++++++++
>  mm/page_alloc.c    | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f8c10d336e42..50b74d628243 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_forbids_zeropage(X)	(0)
>  #endif
>  
> +/*
> + * On some architectures it is expensive to call memset() for small sizes.
> + * Those architectures should provide their own implementation of "struct page"
> + * zeroing by defining this macro in <asm/pgtable.h>.
> + */
> +#ifndef mm_zero_struct_page
> +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> +#endif
> +
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
>   * per mm struct. Users can overwrite this number by sysctl but there is a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a8dbd405ed94..4b630ee91430 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
>  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>  				unsigned long zone, int nid)
>  {
> +	mm_zero_struct_page(page);
>  	set_page_links(page, zone, nid, pfn);
>  	init_page_count(page);
>  	page_mapcount_reset(page);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-09-20 20:17 ` [PATCH v9 08/12] mm: zero reserved and unavailable struct pages Pavel Tatashin
@ 2017-10-03 13:18   ` Michal Hocko
  2017-10-03 15:29     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 13:18 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> Some memory is reserved but unavailable: not present in memblock.memory
> (because not backed by physical pages), but present in memblock.reserved.
> Such memory has backing struct pages, but they are not initialized by going
> through __init_single_page().

Could you be more specific about where such memory is reserved?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap
  2017-09-20 20:17 ` [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
@ 2017-10-03 13:19   ` Michal Hocko
  2017-10-03 15:34     ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-03 13:19 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:
> vmemmap_alloc_block() will no longer zero the block, so zero memory
> at its call sites for everything except struct pages.  Struct page memory
> is zeroed by struct page initialization.
> 
> Replace the allocators in sparse-vmemmap with the non-zeroing versions.
> This way, we get a performance improvement by zeroing the memory in
> parallel when struct pages are initialized.

Is it possible to merge this patch with http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatashin@oracle.com

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>
> ---
>  include/linux/mm.h  | 11 +++++++++++
>  mm/sparse-vmemmap.c | 15 +++++++--------
>  mm/sparse.c         |  6 +++---
>  3 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7bba4ce79ba..25848764570f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
>  	return __vmemmap_alloc_block_buf(size, node, NULL);
>  }
>  
> +static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
> +{
> +	void *p = vmemmap_alloc_block(size, node);
> +
> +	if (!p)
> +		return NULL;
> +	memset(p, 0, size);
> +
> +	return p;
> +}
> +
>  void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
>  int vmemmap_populate_basepages(unsigned long start, unsigned long end,
>  			       int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index d1a39b8051e0..c2f5654e7c9d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
>  				unsigned long align,
>  				unsigned long goal)
>  {
> -	return memblock_virt_alloc_try_nid(size, align, goal,
> +	return memblock_virt_alloc_try_nid_raw(size, align, goal,
>  					    BOOTMEM_ALLOC_ACCESSIBLE, node);
>  }
>  
> @@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
>  	if (slab_is_available()) {
>  		struct page *page;
>  
> -		page = alloc_pages_node(node,
> -			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> -			get_order(size));
> +		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> +					get_order(size));
>  		if (page)
>  			return page_address(page);
>  		return NULL;
> @@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>  {
>  	pmd_t *pmd = pmd_offset(pud, addr);
>  	if (pmd_none(*pmd)) {
> -		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> +		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>  		if (!p)
>  			return NULL;
>  		pmd_populate_kernel(&init_mm, pmd, p);
> @@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
>  {
>  	pud_t *pud = pud_offset(p4d, addr);
>  	if (pud_none(*pud)) {
> -		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> +		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>  		if (!p)
>  			return NULL;
>  		pud_populate(&init_mm, pud, p);
> @@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
>  {
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	if (p4d_none(*p4d)) {
> -		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> +		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>  		if (!p)
>  			return NULL;
>  		p4d_populate(&init_mm, p4d, p);
> @@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  {
>  	pgd_t *pgd = pgd_offset_k(addr);
>  	if (pgd_none(*pgd)) {
> -		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> +		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>  		if (!p)
>  			return NULL;
>  		pgd_populate(&init_mm, pgd, p);
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 83b3bf6461af..d22f51bb7c79 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>  	}
>  
>  	size = PAGE_ALIGN(size);
> -	map = memblock_virt_alloc_try_nid(size * map_count,
> -					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> -					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
> +	map = memblock_virt_alloc_try_nid_raw(size * map_count,
> +					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> +					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
>  	if (map) {
>  		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
>  			if (!present_section_nr(pnum))
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-09-20 20:17 ` [PATCH v9 09/12] mm/kasan: kasan specific map populate function Pavel Tatashin
@ 2017-10-03 14:48   ` Mark Rutland
  2017-10-03 15:04     ` Pasha Tatashin
  2017-10-09 17:13     ` Will Deacon
  0 siblings, 2 replies; 52+ messages in thread
From: Mark Rutland @ 2017-10-03 14:48 UTC (permalink / raw)
  To: Pavel Tatashin, will.deacon, catalin.marinas
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, sam, mgorman,
	steven.sistare, daniel.m.jordan, bob.picco

Hi Pavel,

On Wed, Sep 20, 2017 at 04:17:11PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But, that interface is intended for struct pages use.
> 
> As part of this patch series, vmemmap memory won't be zeroed during
> allocation, but kasan expects that memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.

Thanks for putting this together.

I've given this a spin on arm64, and can confirm that it works.

Given that this involves redundant walking of page tables, I still think
it'd be preferable to have some common *_populate() helper that took a
gfp argument, but I guess it's not the end of the world.
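
For illustration, the kind of thing I have in mind (purely hypothetical,
no such interface exists today):

        /*
         * Hypothetical: populate [start, end) honouring the given gfp
         * flags, so callers could pass __GFP_ZERO and avoid a second
         * walk of the page tables.
         */
        int vmemmap_populate_gfp(unsigned long start, unsigned long end,
                                 int node, gfp_t gfp_mask);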

I'll leave it to Will and Catalin to say whether they're happy with the
page table walking and the new p{u,m}d_large() helpers added to arm64.

Thanks,
Mark.

> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  3 ++
>  include/linux/kasan.h            |  2 ++
>  mm/kasan/kasan_init.c            | 67 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 72 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index bc4e92337d16..d89713f04354 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -381,6 +381,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
>  				 PUD_TYPE_TABLE)
>  #endif
>  
> +#define pmd_large(pmd)		pmd_sect(pmd)
> +#define pud_large(pud)		pud_sect(pud)
> +
>  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
>  	*pmdp = pmd;
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index a5c7046f26b4..7e13df1722c2 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -78,6 +78,8 @@ size_t kasan_metadata_size(struct kmem_cache *cache);
>  
>  bool kasan_save_enable_multi_shot(void);
>  void kasan_restore_multi_shot(bool enabled);
> +int __meminit kasan_map_populate(unsigned long start, unsigned long end,
> +				 int node);
>  
>  #else /* CONFIG_KASAN */
>  
> diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
> index 554e4c0f23a2..57a973f05f63 100644
> --- a/mm/kasan/kasan_init.c
> +++ b/mm/kasan/kasan_init.c
> @@ -197,3 +197,70 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
>  		zero_p4d_populate(pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
> +
> +/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
> +int __meminit kasan_map_populate(unsigned long start, unsigned long end,
> +				 int node)
> +{
> +	unsigned long addr, pfn, next;
> +	unsigned long long size;
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	int ret;
> +
> +	ret = vmemmap_populate(start, end, node);
> +	/*
> +	 * We might have partially populated memory, so check for no entries,
> +	 * and zero only those that actually exist.
> +	 */
> +	for (addr = start; addr < end; addr = next) {
> +		pgd = pgd_offset_k(addr);
> +		if (pgd_none(*pgd)) {
> +			next = pgd_addr_end(addr, end);
> +			continue;
> +		}
> +
> +		p4d = p4d_offset(pgd, addr);
> +		if (p4d_none(*p4d)) {
> +			next = p4d_addr_end(addr, end);
> +			continue;
> +		}
> +
> +		pud = pud_offset(p4d, addr);
> +		if (pud_none(*pud)) {
> +			next = pud_addr_end(addr, end);
> +			continue;
> +		}
> +		if (pud_large(*pud)) {
> +			/* This is PUD size page */
> +			next = pud_addr_end(addr, end);
> +			size = PUD_SIZE;
> +			pfn = pud_pfn(*pud);
> +		} else {
> +			pmd = pmd_offset(pud, addr);
> +			if (pmd_none(*pmd)) {
> +				next = pmd_addr_end(addr, end);
> +				continue;
> +			}
> +			if (pmd_large(*pmd)) {
> +				/* This is PMD size page */
> +				next = pmd_addr_end(addr, end);
> +				size = PMD_SIZE;
> +				pfn = pmd_pfn(*pmd);
> +			} else {
> +				pte = pte_offset_kernel(pmd, addr);
> +				next = addr + PAGE_SIZE;
> +				if (pte_none(*pte))
> +					continue;
> +				/* This is base size page */
> +				size = PAGE_SIZE;
> +				pfn = pte_pfn(*pte);
> +			}
> +		}
> +		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
> +	}
> +	return ret;
> +}
> -- 
> 2.14.1
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-03 14:48   ` Mark Rutland
@ 2017-10-03 15:04     ` Pasha Tatashin
  2017-10-09 17:13     ` Will Deacon
  1 sibling, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:04 UTC (permalink / raw)
  To: Mark Rutland, will.deacon, catalin.marinas
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, sam, mgorman,
	steven.sistare, daniel.m.jordan, bob.picco

Hi Mark,

I considered using a new *populate() function for the shadow without using
vmemmap_populate(), but that makes things unnecessarily complicated:
vmemmap_populate() has built-in:

1. large page support
2. device memory support
3. node locality support
4. several config based variants on different platforms

All of that will cause the code to simply be duplicated on each platform
if we want to support it in kasan.

We could limit ourselves to supporting only base pages by using something
like vmemmap_populate_basepages(), but that would be a step backward.
Kasan benefits from using large pages now; why remove that?

So, the solution I provide is to walk the page table right after the
memory is mapped. Since we are using the actual page table, it is
guaranteed that we will not miss any mapped memory, and the walk lives in
common code, which makes things smaller and nicer.

Thank you,
Pasha

On 10/03/2017 10:48 AM, Mark Rutland wrote:
> 
> I've given this a spin on arm64, and can confirm that it works.
> 
> Given that this involves redundant walking of page tables, I still think
> it'd be preferable to have some common *_populate() helper that took a
> gfp argument, but I guess it's not the end of the world.
> 
> I'll leave it to Will and Catalin to say whether they're happy with the
> page table walking and the new p{u,m}d_large() helpers added to arm64.
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 01/12] x86/mm: setting fields in deferred pages
  2017-10-03 12:26   ` Michal Hocko
@ 2017-10-03 15:07     ` Pasha Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Michal,

> 
> I hope I haven't missed anything but it looks good to me.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you for your review.

> 
> one nit below
>> ---
>>   arch/x86/mm/init_64.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 5ea1c3c2636e..30fe22558720 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1182,12 +1182,17 @@ void __init mem_init(void)
>>   
>>   	/* clear_bss() already clear the empty_zero_page */
>>   
>> -	register_page_bootmem_info();
>> -
>>   	/* this will put all memory onto the freelists */
>>   	free_all_bootmem();
>>   	after_bootmem = 1;
>>   
>> +	/* Must be done after boot memory is put on freelist, because here we
> 
> standard code style is to do
> 	/*
> 	 * text starts here

OK, will change for both patch 1 and 2.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages
  2017-10-03 12:28   ` Michal Hocko
@ 2017-10-03 15:10     ` Pasha Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

> 
> As you separated the x86 and sparc patches doing essentially the same
> thing, I assume David is going to take this patch?

Correct. I noticed that platform-specific changes are usually done in
separate patches even if they are small. Dave already Acked this patch,
so I do not think it should be separated from the rest of the patches
when this project goes into the mm tree.

> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you,
Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements
  2017-10-03 12:57   ` Michal Hocko
@ 2017-10-03 15:15     ` Pasha Tatashin
  2017-10-03 16:01       ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Michal,

> 
> Please be explicit that this is possible only because we discard
> memblock data later after 3010f876500f ("mm: discard memblock data
> later"). Also be more explicit about how the new code works.

OK

> 
> I like how the resulting code is more compact and smaller.

That was the goal :)

> for_each_free_mem_range also looks more appropriate but I really detest
> the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
> section. I know this is not an art but manipulating variables from
> macros is more error prone and much more ugly IMHO.

Sure, I can re-arrange the code to use a goto section. The function won't
be as small, and if the compiler is not smart enough, we might end up with
more branches than my current code has.

> 
> please do not use macros. Btw. this deserves its own fix. I suspect that
> no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
> purely from the review point of view it should be its own patch.

Sure, I will submit this patch separately from the rest of the project.
In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we
should make sure it works with as many configs as possible.

Thank you,
Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 04/12] sparc64: simplify vmemmap_populate
  2017-10-03 12:59   ` Michal Hocko
@ 2017-10-03 15:20     ` Pasha Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco


> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you,
Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 06/12] mm: zero struct pages during initialization
  2017-10-03 13:08   ` Michal Hocko
@ 2017-10-03 15:22     ` Pasha Tatashin
  2017-10-04  8:45       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco



On 10/03/2017 09:08 AM, Michal Hocko wrote:
> On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
>> Add struct page zeroing as a part of initialization of other fields in
>> __init_single_page().
>>
>> This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
>> v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
>>
>>                          BASE            FIX
>> sparse_init     11.244671836s   0.007199623s
>> zone_sizes_init  4.879775891s   8.355182299s
>>                    --------------------------
>> Total           16.124447727s   8.362381922s
> 
> Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
> anymore, right? So these numbers depend on the last patch in the series?

Correct, without the last patch the sparse_init time won't change.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-03 13:18   ` Michal Hocko
@ 2017-10-03 15:29     ` Pasha Tatashin
  2017-10-04  8:56       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On 10/03/2017 09:18 AM, Michal Hocko wrote:
> On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
>> Some memory is reserved but unavailable: not present in memblock.memory
>> (because not backed by physical pages), but present in memblock.reserved.
>> Such memory has backing struct pages, but they are not initialized by going
>> through __init_single_page().
> 
> Could you be more specific about where such memory is reserved?
> 

I know of one example: trim_low_memory_range() unconditionally reserves
from pfn 0, but e820__memblock_setup() might provide the existing memory
only from pfn 1 (e.g. under KVM).

But there could be more, based on this comment from linux/page-flags.h:

 * PG_reserved is set for special pages, which can never be swapped out.
 * Some of them might not even exist (eg empty_bad_page)...
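
To make the fix concrete, here is a rough sketch of the idea (not the
actual patch; the helper name is made up, and it relies on the
mm_zero_struct_page() helper introduced earlier in this series):

        /*
         * Zero struct pages for ranges that sit in memblock.reserved but
         * not in memblock.memory, since __init_single_page() never sees
         * them.
         */
        static void __init zero_unavail_struct_pages(void)
        {
                phys_addr_t start, end;
                unsigned long pfn;
                u64 i;

                for_each_reserved_mem_region(i, &start, &end) {
                        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                                if (!pfn_valid(pfn) ||
                                    memblock_is_memory(PFN_PHYS(pfn)))
                                        continue;
                                mm_zero_struct_page(pfn_to_page(pfn));
                        }
                }
        }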

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap
  2017-10-03 13:19   ` Michal Hocko
@ 2017-10-03 15:34     ` Pasha Tatashin
  2017-10-03 20:26       ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 15:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On 10/03/2017 09:19 AM, Michal Hocko wrote:
> On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:
>> vmemmap_alloc_block() will no longer zero the block, so zero memory
>> at its call sites for everything except struct pages.  Struct page memory
>> is zeroed by struct page initialization.
>>
>> Replace the allocators in sparse-vmemmap with the non-zeroing versions.
>> This way, we get a performance improvement by zeroing the memory in
>> parallel when struct pages are initialized.
> 
> Is it possible to merge this patch with http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatashin@oracle.com

Yes, I will do that. It would also require re-arranging the
[PATCH v9 07/12] "sparc64: optimized struct page zeroing"
optimization to come after this patch.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements
  2017-10-03 15:15     ` Pasha Tatashin
@ 2017-10-03 16:01       ` Pasha Tatashin
  2017-10-04  8:48         ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 16:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Michal,

Are you OK if I replace the DEFERRED_FREE() macro with a function like this:

/*
  * Helper for deferred_init_range, free the given range, and reset the
  * counters
  */
static inline unsigned long __def_free(unsigned long *nr_free,
                                        unsigned long *free_base_pfn,
                                        struct page **page)
{
         unsigned long nr = *nr_free;

         deferred_free_range(*free_base_pfn, nr);
         *free_base_pfn = 0;
         *nr_free = 0;
         *page = NULL;

         return nr;
}

Since it is inline, and we operate on non-volatile counters, the 
compiler will be smart enough to remove all the unnecessary 
dereferences. As a plus, we won't be adding any new branches, and the 
code is still going to stay compact.
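
A hypothetical call site, just to illustrate how the helper would be
used inside the pfn walk (the variable names and control flow are
assumptions, not quoted from the patch):

		if (page && pfn == free_base_pfn + nr_free) {
			/* pfn is contiguous: extend the pending range */
			nr_free++;
			page++;
		} else {
			/* discontiguity: flush what we have, restart */
			nr_pages += __def_free(&nr_free, &free_base_pfn,
					       &page);
			free_base_pfn = pfn;
			nr_free = 1;
			page = pfn_to_page(pfn);
		}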

Pasha

On 10/03/2017 11:15 AM, Pasha Tatashin wrote:
> Hi Michal,
> 
>>
>> Please be explicit that this is possible only because we discard
>> memblock data later after 3010f876500f ("mm: discard memblock data
>> later"). Also be more explicit how the new code works.
> 
> OK
> 
>>
>> I like how the resulting code is more compact and smaller.
> 
> That was the goal :)
> 
>> for_each_free_mem_range also looks more appropriate but I really detest
>> the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
>> section. I know this is not an art but manipulating variables from
>> macros is more error prone and much more ugly IMHO.
> 
> Sure, I can re-arrange to have a goto section. The function won't be as 
> small, and if the compiler is not smart enough we might end up with 
> more branches than my current code has.
> 
>>
>> please do not use macros. Btw. this deserves its own fix. I suspect that
>> no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
>> purely from the review point of view it should be its own patch.
> 
> Sure, I will submit this patch separately from the rest of the project. 
> In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we 
> should make sure it is working with as many configs as possible.
> 
> Thank you,
> Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap
  2017-10-03 15:34     ` Pasha Tatashin
@ 2017-10-03 20:26       ` Pasha Tatashin
  2017-10-04  8:45         ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-03 20:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Michal,

I decided not to merge these two patches, because in addition to the 
sparc optimization move, we have these dependencies:

mm: zero reserved and unavailable struct pages

must be before

mm: stop zeroing memory during allocation in vmemmap.

Otherwise, we can end up with struct pages that are not zeroed properly.

However, the first patch depends on
mm: zero struct pages during initialization
as it uses mm_zero_struct_page().
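
For clarity, mm_zero_struct_page() is presumably just a thin wrapper of
this shape (a sketch; the generic fallback assumed here would be what
the sparc64 patch overrides with an optimized store sequence):

#ifndef mm_zero_struct_page
#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
#endif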

Pasha


On 10/03/2017 11:34 AM, Pasha Tatashin wrote:
> On 10/03/2017 09:19 AM, Michal Hocko wrote:
>> On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:
>>> vmemmap_alloc_block() will no longer zero the block, so zero memory
>>> at its call sites for everything except struct pages.  Struct page
>>> memory is zero'd by struct page initialization.
>>>
>>> Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
>>> we will get the performance improvement by zeroing the memory in
>>> parallel when struct pages are zeroed.
>>
>> Is it possible to merge this patch with 
>> http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatashin@oracle.com
> 
> Yes, I will do that. It would also require re-arranging the
> [PATCH v9 07/12] sparc64: optimized struct page zeroing
> optimization to come after this patch.
> 
> Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap
  2017-10-03 20:26       ` Pasha Tatashin
@ 2017-10-04  8:45         ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-04  8:45 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue 03-10-17 16:26:51, Pasha Tatashin wrote:
> Hi Michal,
> 
> I decided not to merge these two patches, because in addition to the sparc
> optimization move, we have these dependencies:

optimizations can and should go on top of the core patch.

> mm: zero reserved and unavailable struct pages
> 
> must be before
> 
> mm: stop zeroing memory during allocation in vmemmap.
> 
> Otherwise, we can end up with struct pages that are not zeroed properly.

Right, and you can deal with it easily. Just introduce the
mm_zero_struct_page earlier along with its user in "stop zeroing ..."

I think that moving the zeroing in one go is more reasonable than
adding it to __init_single_page with misleading numbers and later
dropping the zeroing from the memmap path.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 06/12] mm: zero struct pages during initialization
  2017-10-03 15:22     ` Pasha Tatashin
@ 2017-10-04  8:45       ` Michal Hocko
  2017-10-04 12:26         ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-04  8:45 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue 03-10-17 11:22:35, Pasha Tatashin wrote:
> 
> 
> On 10/03/2017 09:08 AM, Michal Hocko wrote:
> > On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
> > > Add struct page zeroing as a part of initialization of other fields in
> > > __init_single_page().
> > > 
> > > This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
> > > v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
> > > 
> > >                          BASE            FIX
> > > sparse_init     11.244671836s   0.007199623s
> > > zone_sizes_init  4.879775891s   8.355182299s
> > >                    --------------------------
> > > Total           16.124447727s   8.362381922s
> > 
> > Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
> > anymore, right? So these numbers depend on the last patch in the series?
> 
> Correct, without the last patch sparse_init time won't change.

Then this is just misleading.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements
  2017-10-03 16:01       ` Pasha Tatashin
@ 2017-10-04  8:48         ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-04  8:48 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue 03-10-17 12:01:08, Pasha Tatashin wrote:
> Hi Michal,
> 
> Are you OK if I replace the DEFERRED_FREE() macro with a function like this:
> 
> /*
>  * Helper for deferred_init_range, free the given range, and reset the
>  * counters
>  */
> static inline unsigned long __def_free(unsigned long *nr_free,
>                                        unsigned long *free_base_pfn,
>                                        struct page **page)
> {
>         unsigned long nr = *nr_free;
> 
>         deferred_free_range(*free_base_pfn, nr);
>         *free_base_pfn = 0;
>         *nr_free = 0;
>         *page = NULL;
> 
>         return nr;
> }
> 
> Since it is inline, and we operate on non-volatile counters, the compiler will
> be smart enough to remove all the unnecessary dereferences. As a plus, we
> won't be adding any new branches, and the code is still going to stay
> compact.

OK. It is a bit clunky but we are holding too much state there. I
haven't checked whether that can be simplified but this can always be
done later.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-03 15:29     ` Pasha Tatashin
@ 2017-10-04  8:56       ` Michal Hocko
  2017-10-04 12:40         ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-04  8:56 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue 03-10-17 11:29:16, Pasha Tatashin wrote:
> On 10/03/2017 09:18 AM, Michal Hocko wrote:
> > On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> > > Some memory is reserved but unavailable: not present in memblock.memory
> > > (because not backed by physical pages), but present in memblock.reserved.
> > > Such memory has backing struct pages, but they are not initialized by going
> > > through __init_single_page().
> > 
> > Could you be more specific where is such a memory reserved?
> > 
> 
> I know of one example: trim_low_memory_range() unconditionally reserves from
> pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
> 1 (i.e. KVM).

Then just initialize struct pages for that mapping right there where a
special API is used.

> But, there could be more based on this comment from linux/page-flags.h:
> 
>  19  * PG_reserved is set for special pages, which can never be swapped out. Some
>  20  * of them might not even exist (eg empty_bad_page)...

I have no idea what empty_bad_page is but a quick grep shows that this is
never used. I might be wrong here but if somebody is reserving memory
in a special way then we should handle the initialization right there.
E.g. create an API for special memblock reservations.
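
A hypothetical shape for such an API (both the name and the init helper
are invented here purely for illustration; memblock_reserve() itself is
the existing call):

/* reserve a range and initialize its struct pages right away */
static int __init memblock_reserve_special(phys_addr_t base,
					   phys_addr_t size)
{
	int ret = memblock_reserve(base, size);

	if (!ret)
		init_struct_pages_for_range(base, size); /* hypothetical */
	return ret;
}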
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 06/12] mm: zero struct pages during initialization
  2017-10-04  8:45       ` Michal Hocko
@ 2017-10-04 12:26         ` Pasha Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-04 12:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On 10/04/2017 04:45 AM, Michal Hocko wrote:
> On Tue 03-10-17 11:22:35, Pasha Tatashin wrote:
>>
>>
>> On 10/03/2017 09:08 AM, Michal Hocko wrote:
>>> On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
>>>> Add struct page zeroing as a part of initialization of other fields in
>>>> __init_single_page().
>>>>
>>>> This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
>>>> v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
>>>>
>>>>                           BASE            FIX
>>>> sparse_init     11.244671836s   0.007199623s
>>>> zone_sizes_init  4.879775891s   8.355182299s
>>>>                     --------------------------
>>>> Total           16.124447727s   8.362381922s
>>>
>>> Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
>>> anymore, right? So these numbers depend on the last patch in the series?
>>
>> Correct, without the last patch sparse_init time won't change.
> 
> Then this is just misleading.
> 

OK, I will re-arrange patches the way you suggested earlier.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-04  8:56       ` Michal Hocko
@ 2017-10-04 12:40         ` Pasha Tatashin
  2017-10-04 12:57           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-04 12:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

>>> Could you be more specific where is such a memory reserved?
>>>
>>
>> I know of one example: trim_low_memory_range() unconditionally reserves from
>> pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
>> 1 (i.e. KVM).
> 
> Then just initialize struct pages for that mapping right there where a
> special API is used.
> 
>> But, there could be more based on this comment from linux/page-flags.h:
>>
>>   19  * PG_reserved is set for special pages, which can never be swapped out. Some
>>   20  * of them might not even exist (eg empty_bad_page)...
> 
> I have no idea what empty_bad_page is but a quick grep shows that this is
> never used. I might be wrong here but if somebody is reserving memory
> in a special way then we should handle the initialization right there.
> E.g. create an API for special memblock reservations.
> 

Hi Michal,

The reservations happen before struct pages are allocated and mapped. 
So, it is not always possible to do it at call sites.

Previously, I have solved this problem like this:

https://patchwork.kernel.org/patch/9886163

But, I was not too happy with that approach, so I replaced it with the 
current approach, as it is more generic and solves similar issues if 
they happen in other places. Also, the comment in page-flags got me 
scared that there are probably other places, perhaps on other 
architectures, that can have a similar issue.

In addition, I did not like my solution: I was simply shrinking the low 
reservation from [0 - reserve_low) to [min_pfn - reserve_low). But if 
min_pfn > reserve_low, can we skip the low reservation entirely? I was 
not sure.

The current approach notifies us if there are such pages, and we can 
fix/remove them in the future without crashing kernel in the meantime.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-04 12:40         ` Pasha Tatashin
@ 2017-10-04 12:57           ` Michal Hocko
  2017-10-04 13:28             ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-04 12:57 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 04-10-17 08:40:11, Pasha Tatashin wrote:
> > > > Could you be more specific where is such a memory reserved?
> > > > 
> > > 
> > > I know of one example: trim_low_memory_range() unconditionally reserves from
> > > pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
> > > 1 (i.e. KVM).
> > 
> > Then just initialize struct pages for that mapping right there where a
> > special API is used.
> > 
> > > But, there could be more based on this comment from linux/page-flags.h:
> > > 
> > >   19  * PG_reserved is set for special pages, which can never be swapped out. Some
> > >   20  * of them might not even exist (eg empty_bad_page)...
> > 
> > I have no idea what empty_bad_page is but a quick grep shows that this is
> > never used. I might be wrong here but if somebody is reserving memory
> > in a special way then we should handle the initialization right there.
> > E.g. create an API for special memblock reservations.
> > 
> 
> Hi Michal,
> 
> The reservations happen before struct pages are allocated and mapped. So, it
> is not always possible to do it at call sites.

OK, I didn't realize that.
 
> Previously, I have solved this problem like this:
> 
> https://patchwork.kernel.org/patch/9886163
> 
> But, I was not too happy with that approach, so I replaced it with the
> current approach, as it is more generic and solves similar issues if they
> happen in other places. Also, the comment in page-flags got me scared that
> there are probably other places, perhaps on other architectures, that can
> have a similar issue.

I believe the comment is just stale. I have looked into empty_bad_page
and it is just a relic. I plan to post a patch soon.

> In addition, I did not like my solution: I was simply shrinking the low
> reservation from [0 - reserve_low) to [min_pfn - reserve_low). But if
> min_pfn > reserve_low, can we skip the low reservation entirely? I was not
> sure.
> 
> The current approach notifies us if there are such pages, and we can
> fix/remove them in the future without crashing kernel in the meantime.

I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone. I am worried that this patch adds code which
is not really used and it will just stay that way forever because
nobody will dare to change it as it is too obscure and not explained
very well. trim_low_memory_range is a good example of this. Why do we
even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not be in the allocator
in the first place, no?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-04 12:57           ` Michal Hocko
@ 2017-10-04 13:28             ` Pasha Tatashin
  2017-10-04 14:04               ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-04 13:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco


> I am not really familiar with the trim_low_memory_range code path. I am
> not even sure we have to care about it because nobody should be walking
> pfns outside of any zone.

According to the commit comments, the first 4K belongs to the BIOS, so 
I think the memory exists but the BIOS may or may not report it to 
Linux. So, reserve it to make sure we never touch it.

> I am worried that this patch adds code which
> is not really used and it will just stay that way forever because
> nobody will dare to change it as it is too obscure and not explained
> very well.

I could explain my code better. Perhaps add more comments, and explain 
when it can be removed?

> trim_low_memory_range is a good example of this. Why do we
> even reserve this range from the memory block allocator? The memory
> shouldn't be backed by any real memory and thus not be in the allocator
> in the first place, no?
> 

Since it is not enforced in memblock that everything in the reserved 
list must be part of the memory list, we can have it, and we need to 
make sure the kernel does not panic. Otherwise, it is very hard to 
detect such bugs.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-04 13:28             ` Pasha Tatashin
@ 2017-10-04 14:04               ` Michal Hocko
  2017-10-04 15:08                 ` Pasha Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-04 14:04 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:
> 
> > I am not really familiar with the trim_low_memory_range code path. I am
> > not even sure we have to care about it because nobody should be walking
> > pfns outside of any zone.
> 
> According to the commit comments, the first 4K belongs to the BIOS, so I think
> the memory exists but the BIOS may or may not report it to Linux. So, reserve
> it to make sure we never touch it.

Yes and that memory should be outside of any zones, no?

> > I am worried that this patch adds code which
> > is not really used and it will just stay that way forever because
> > nobody will dare to change it as it is too obscure and not explained
> > very well.
> 
> I could explain my code better. Perhaps add more comments, and explain
> when it can be removed?

More explanation would be definitely helpful

> > trim_low_memory_range is a good example of this. Why do we
> > even reserve this range from the memory block allocator? The memory
> > shouldn't be backed by any real memory and thus not be in the allocator
> > in the first place, no?
> > 
> 
> Since it is not enforced in memblock that everything in the reserved list must
> be part of the memory list, we can have it, and we need to make sure the
> kernel does not panic. Otherwise, it is very hard to detect such bugs.

So, should we report such a memblock reservation API (ab)use to the log?
Are you actually sure that trim_low_memory_range is doing a sane and
really needed thing? In other words, do we have a zone which contains
these no-memory-backed pfns?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages
  2017-10-04 14:04               ` Michal Hocko
@ 2017-10-04 15:08                 ` Pasha Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pasha Tatashin @ 2017-10-04 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco



On 10/04/2017 10:04 AM, Michal Hocko wrote:
> On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:
>>
>>> I am not really familiar with the trim_low_memory_range code path. I am
>>> not even sure we have to care about it because nobody should be walking
>>> pfns outside of any zone.
>>
>> According to the commit comments, the first 4K belongs to the BIOS, so I think
>> the memory exists but the BIOS may or may not report it to Linux. So, reserve
>> it to make sure we never touch it.
> 
> Yes and that memory should be outside of any zones, no?

I am not totally sure; I think some x86 expert could help us here. But, 
in either case, this issue can be fixed separately from the rest of the 
series.

> 
>>> I am worried that this patch adds code which
>>> is not really used and it will just stay that way forever because
>>> nobody will dare to change it as it is too obscure and not explained
>>> very well.
>>
>> I could explain my code better. Perhaps add more comments, and explain
>> when it can be removed?
> 
> More explanation would be definitely helpful
> 
>>> trim_low_memory_range is a good example of this. Why do we
>>> even reserve this range from the memory block allocator? The memory
>>> shouldn't be backed by any real memory and thus not be in the allocator
>>> in the first place, no?
>>>
>>
>> Since it is not enforced in memblock that everything in the reserved list must
>> be part of the memory list, we can have it, and we need to make sure the
>> kernel does not panic. Otherwise, it is very hard to detect such bugs.
> 
> So, should we report such a memblock reservation API (ab)use to the log?
> Are you actually sure that trim_low_memory_range is doing a sane and
> really needed thing? In other words, do we have a zone which contains
> these no-memory-backed pfns?
> 

And, this patch reports it already:

+	pr_info("Reserved but unavailable: %lld pages", pgcnt);

I could add a comment above this print call, explaining that such 
memory is probably bogus and must be studied/fixed. Also, note that 
this code can be removed once memblock is changed to allow reserving 
only memory that is backed by physical memory, i.e. in the "memory" list.

Pasha


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-03 14:48   ` Mark Rutland
  2017-10-03 15:04     ` Pasha Tatashin
@ 2017-10-09 17:13     ` Will Deacon
  2017-10-09 17:51       ` Pavel Tatashin
  1 sibling, 1 reply; 52+ messages in thread
From: Will Deacon @ 2017-10-09 17:13 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Pavel Tatashin, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	ard.biesheuvel, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue, Oct 03, 2017 at 03:48:46PM +0100, Mark Rutland wrote:
> On Wed, Sep 20, 2017 at 04:17:11PM -0400, Pavel Tatashin wrote:
> > During early boot, kasan uses vmemmap_populate() to establish its shadow
> > memory. But, that interface is intended for struct pages use.
> > 
> > Because of the current project, vmemmap won't be zeroed during allocation,
> > but kasan expects that memory to be zeroed. We are adding a new
> > kasan_map_populate() function to resolve this difference.
> 
> Thanks for putting this together.
> 
> I've given this a spin on arm64, and can confirm that it works.
> 
> Given that this involves redundant walking of page tables, I still think
> it'd be preferable to have some common *_populate() helper that took a
> gfp argument, but I guess it's not the end of the world.
> 
> I'll leave it to Will and Catalin to say whether they're happy with the
> page table walking and the new p{u,m}d_large() helpers added to arm64.

To be honest, it just looks completely backwards to me; we're walking the
page tables we created earlier on so that we can figure out what needs to
be zeroed for KASAN. We already had that information before, hence my
preference to allow propagation of GFP flags to vmemmap_alloc_block when
it's needed. I know that's not popular for some reason, but is walking the
page tables really better?

Will


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 17:13     ` Will Deacon
@ 2017-10-09 17:51       ` Pavel Tatashin
  2017-10-09 18:14         ` Michal Hocko
  2017-10-09 18:22         ` Will Deacon
  0 siblings, 2 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-10-09 17:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

I can go back to that approach, if Michal is OK with it. But, that would
mean that I would need to touch every single architecture that
implements vmemmap_populate(), and also pass flags at least through
these functions on every architecture (some have more than one,
selected by configs):

vmemmap_populate()
vmemmap_populate_basepages()
vmemmap_populate_hugepages()
vmemmap_pte_populate()
__vmemmap_alloc_block_buf()
alloc_block_buf()
vmemmap_alloc_block()

IMO, while I understand that it looks strange that we must walk the page
table after creating it, it is a better approach: more enclosed, as it
affects kasan only, and more universal, as it is in common code. We are
also somewhat late in the review process, which means we would again
need to get ACKs from the maintainers of other arches.

Pavel

On Mon, Oct 9, 2017 at 1:13 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Oct 03, 2017 at 03:48:46PM +0100, Mark Rutland wrote:
>> On Wed, Sep 20, 2017 at 04:17:11PM -0400, Pavel Tatashin wrote:
>> > During early boot, kasan uses vmemmap_populate() to establish its shadow
>> > memory. But, that interface is intended for struct pages use.
>> >
>> > Because of the current project, vmemmap won't be zeroed during allocation,
>> > but kasan expects that memory to be zeroed. We are adding a new
>> > kasan_map_populate() function to resolve this difference.
>>
>> Thanks for putting this together.
>>
>> I've given this a spin on arm64, and can confirm that it works.
>>
>> Given that this involves redundant walking of page tables, I still think
>> it'd be preferable to have some common *_populate() helper that took a
>> gfp argument, but I guess it's not the end of the world.
>>
>> I'll leave it to Will and Catalin to say whether they're happy with the
>> page table walking and the new p{u,m}d_large() helpers added to arm64.
>
> To be honest, it just looks completely backwards to me; we're walking the
> page tables we created earlier on so that we can figure out what needs to
> be zeroed for KASAN. We already had that information before, hence my
> preference to allow propagation of GFP flags to vmemmap_alloc_block when
> it's needed. I know that's not popular for some reason, but is walking the
> page tables really better?
>
> Will
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 17:51       ` Pavel Tatashin
@ 2017-10-09 18:14         ` Michal Hocko
  2017-10-09 18:48           ` Will Deacon
  2017-10-09 18:22         ` Will Deacon
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09 18:14 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Will Deacon, Mark Rutland, catalin.marinas, linux-kernel,
	sparclinux, linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel,
	x86, kasan-dev, borntraeger, heiko.carstens, davem, willy,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

On Mon 09-10-17 13:51:47, Pavel Tatashin wrote:
> Hi Will,
> 
> I can go back to that approach, if Michal is OK with it. But, that would
> mean that I would need to touch every single architecture that
> implements vmemmap_populate(), and also pass flags at least through
> these functions on every architecture (some have more than one,
> selected by configs):
> 
> vmemmap_populate()
> vmemmap_populate_basepages()
> vmemmap_populate_hugepages()
> vmemmap_pte_populate()
> __vmemmap_alloc_block_buf()
> alloc_block_buf()
> vmemmap_alloc_block()
> 
> IMO, while I understand that it looks strange that we must walk the page
> table after creating it, it is a better approach: more enclosed, as it
> affects kasan only, and more universal, as it is in common code.

While I understand that the gfp mask approach might look better at first
sight, this is by no means a general purpose API, so I would rather be
pragmatic and have a smaller code footprint than a more general
interface. Kasan is pretty much a special case, and doing a one-time,
two-pass initialization is imho acceptable. If this turns out to be
impractical in the future then let's fix it up, but right now I would
rather go the simpler path.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 17:51       ` Pavel Tatashin
  2017-10-09 18:14         ` Michal Hocko
@ 2017-10-09 18:22         ` Will Deacon
  2017-10-09 18:42           ` Pavel Tatashin
  1 sibling, 1 reply; 52+ messages in thread
From: Will Deacon @ 2017-10-09 18:22 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Mon, Oct 09, 2017 at 01:51:47PM -0400, Pavel Tatashin wrote:
> I can go back to that approach, if Michal is OK with it. But, that would
> mean that I would need to touch every single architecture that
> implements vmemmap_populate(), and also pass flags at least through
> these functions on every architecture (some have more than one,
> selected by configs):
> 
> vmemmap_populate()
> vmemmap_populate_basepages()
> vmemmap_populate_hugepages()
> vmemmap_pte_populate()
> __vmemmap_alloc_block_buf()
> alloc_block_buf()
> vmemmap_alloc_block()

As an interim step, why not introduce something like
vmemmap_alloc_block_flags and make the page-table walking opt-out for
architectures that don't want it? Then we can just pass __GFP_ZERO from
our vmemmap_populate where necessary and other architectures can do the
page-table walking dance if they prefer.
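
A sketch of what that interim helper could look like (the name follows
the suggestion above; the body is an assumption, loosely modeled on
vmemmap_alloc_block() and the _raw memblock allocator added earlier in
this series, not a posted patch):

void * __meminit vmemmap_alloc_block_flags(unsigned long size, int node,
					   gfp_t gfp_mask)
{
	void *ptr;

	if (slab_is_available()) {
		struct page *page = alloc_pages_node(node, gfp_mask,
						     get_order(size));
		return page ? page_address(page) : NULL;
	}

	/* early boot: memblock path; honor __GFP_ZERO by hand */
	ptr = memblock_virt_alloc_try_nid_raw(size, size,
					      __pa(MAX_DMA_ADDRESS),
					      BOOTMEM_ALLOC_ACCESSIBLE,
					      node);
	if (ptr && (gfp_mask & __GFP_ZERO))
		memset(ptr, 0, size);
	return ptr;
}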

>> IMO, while I understand that it looks strange that we must walk the page
>> table after creating it, it is a better approach: more enclosed, as it
>> affects kasan only, and more universal, as it is in common code.

I don't buy the more universal aspect, but I appreciate it's subjective.
Frankly, I'd just sooner not have core code walking early page tables if
it can be avoided, and it doesn't look hard to avoid it in this case.
The fact that you're having to add pmd_large and pud_large, which are
otherwise unused in mm/, is an indication that this isn't quite right imo.

Will


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 18:22         ` Will Deacon
@ 2017-10-09 18:42           ` Pavel Tatashin
  2017-10-09 18:48             ` Will Deacon
  0 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-10-09 18:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

In addition to what Michal wrote:

> As an interim step, why not introduce something like
> vmemmap_alloc_block_flags and make the page-table walking opt-out for
> architectures that don't want it? Then we can just pass __GFP_ZERO from
> our vmemmap_populate where necessary and other architectures can do the
> page-table walking dance if they prefer.

I do not see the benefit: implementing this approach means that we
would need to implement two table walks instead of one, one for x86 and
another for ARM, as these two architectures support kasan. Also, this
would become a requirement for any future architecture that wants to
add kasan support to add this page table walk implementation.

>> IMO, while I understand that it looks strange that we must walk the page
>> table after creating it, it is a better approach: more enclosed, as it
>> affects kasan only, and more universal, as it is in common code.
>
> I don't buy the more universal aspect, but I appreciate it's subjective.
> Frankly, I'd just sooner not have core code walking early page tables if
> it can be avoided, and it doesn't look hard to avoid it in this case.
> The fact that you're having to add pmd_large and pud_large, which are
> otherwise unused in mm/, is an indication that this isn't quite right imo.

+#define pmd_large(pmd)         pmd_sect(pmd)
+#define pud_large(pud)         pud_sect(pud)

it is just a naming difference: ARM64 calls them pmd_sect; common mm and
other arches call them pmd_large/pud_large. Even ARM has these defines in

arm/include/asm/pgtable-3level.h
arm/include/asm/pgtable-2level.h

Pavel


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 18:14         ` Michal Hocko
@ 2017-10-09 18:48           ` Will Deacon
  0 siblings, 0 replies; 52+ messages in thread
From: Will Deacon @ 2017-10-09 18:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pavel Tatashin, Mark Rutland, catalin.marinas, linux-kernel,
	sparclinux, linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel,
	x86, kasan-dev, borntraeger, heiko.carstens, davem, willy,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

On Mon, Oct 09, 2017 at 08:14:33PM +0200, Michal Hocko wrote:
> On Mon 09-10-17 13:51:47, Pavel Tatashin wrote:
> > I can go back to that approach, if Michal is OK with it. But, that would
> > mean that I would need to touch every single architecture that
> > implements vmemmap_populate(), and also pass flags at least through
> > these functions on every architecture (some have more than one,
> > selected by configs):
> > 
> > vmemmap_populate()
> > vmemmap_populate_basepages()
> > vmemmap_populate_hugepages()
> > vmemmap_pte_populate()
> > __vmemmap_alloc_block_buf()
> > alloc_block_buf()
> > vmemmap_alloc_block()
> > 
> > IMO, while I understand that it looks strange that we must walk the page
> > table after creating it, it is a better approach: more enclosed, as it
> > affects kasan only, and more universal, as it is in common code.
> 
> While I understand that gfp mask approach might look better at first
> sight this is by no means a general purpose API so I would rather be
> pragmatic and have a smaller code footprint than a more general
> interface. Kasan is pretty much a special case and doing a one time
> initialization 2 pass thing is imho acceptable. If this turns out to be
> impractical in future then let's fix it up but right now I would rather
> go a simpler path.

I think the simpler path for arm64 is really to say when we want the memory
zeroing as opposed to exposing pmd_large/pud_large macros. Those are likely
to grow more users too, but are difficult to use correctly as we have things
like contiguous ptes that map to a granule smaller than a pmd.

I proposed an alternative solution to Pavel already, but it could be made
less general purpose by marking the function __meminit and only having it
do anything if KASAN is compiled in.

Will


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 18:42           ` Pavel Tatashin
@ 2017-10-09 18:48             ` Will Deacon
  2017-10-09 18:59               ` Pavel Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Will Deacon @ 2017-10-09 18:48 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	ard.biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

On Mon, Oct 09, 2017 at 02:42:32PM -0400, Pavel Tatashin wrote:
> Hi Will,
> 
> In addition to what Michal wrote:
> 
> > As an interim step, why not introduce something like
> > vmemmap_alloc_block_flags and make the page-table walking opt-out for
> > architectures that don't want it? Then we can just pass __GFP_ZERO from
> > our vmemmap_populate where necessary and other architectures can do the
> > page-table walking dance if they prefer.
> 
> I do not see the benefit: implementing this approach means that we
> would need to implement two table walks instead of one, one for x86 and
> another for ARM, as these two architectures support kasan. Also, this
> would become a requirement for any future architecture that wants to
> add kasan support to add this page table walk implementation.

We have two table walks even with your patch series applied afaict: one in
our definition of vmemmap_populate (arch/arm64/mm/mmu.c) and this one
in the core code.

> >> IMO, while I understand that it looks strange that we must walk the page
> >> table after creating it, it is a better approach: more enclosed, as it
> >> affects kasan only, and more universal, as it is in common code.
> >
> > I don't buy the more universal aspect, but I appreciate it's subjective.
> > Frankly, I'd just sooner not have core code walking early page tables if
> > it can be avoided, and it doesn't look hard to avoid it in this case.
> > The fact that you're having to add pmd_large and pud_large, which are
> > otherwise unused in mm/, is an indication that this isn't quite right imo.
> 
> +#define pmd_large(pmd)         pmd_sect(pmd)
> +#define pud_large(pud)         pud_sect(pud)
>
> it is just a naming difference: ARM64 calls them pmd_sect; common mm and
> other arches call them pmd_large/pud_large. Even ARM has these defines in
> 
> arm/include/asm/pgtable-3level.h
> arm/include/asm/pgtable-2level.h

My worry is that these are actually highly arch-specific, but will likely
grow more users in mm/ that assume things for all architectures that aren't
necessarily valid.

Will


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 18:48             ` Will Deacon
@ 2017-10-09 18:59               ` Pavel Tatashin
  2017-10-09 19:02                 ` Will Deacon
  0 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-10-09 18:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	Ard Biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

> We have two table walks even with your patch series applied afaict: one in
> our definition of vmemmap_populate (arch/arm64/mm/mmu.c) and this one
> in the core code.

I meant implementing two new page table walkers, not two walks at runtime.

> My worry is that these are actually highly arch-specific, but will likely
> grow more users in mm/ that assume things for all architectures that aren't
> necessarily valid.

I see, how about moving the new kasan_map_populate() implementation into
arch-dependent code:

arch/x86/mm/kasan_init_64.c
arch/arm64/mm/kasan_init.c

This way we won't need to add pmd_large()/pud_large() macros for arm64?
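
For example, the arm64 variant could plausibly be as simple as this (a
sketch of the idea being discussed, not a posted patch; note the caveat
below about tmp_pg_dir vs. swapper_pg_dir ordering for the memset):

void __init kasan_map_populate(unsigned long start, unsigned long end,
			       int node)
{
	/* build the shadow mapping with the existing populate path */
	if (vmemmap_populate(start, end, node))
		return;
	/* then zero the shadow explicitly, instead of re-walking the
	 * page tables from common code */
	memset((void *)start, 0, end - start);
}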

Pavel


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 18:59               ` Pavel Tatashin
@ 2017-10-09 19:02                 ` Will Deacon
  2017-10-09 19:07                   ` Pavel Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Will Deacon @ 2017-10-09 19:02 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	Ard Biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Mon, Oct 09, 2017 at 02:59:09PM -0400, Pavel Tatashin wrote:
> > We have two table walks even with your patch series applied afaict: one in
> > our definition of vmemmap_populate (arch/arm64/mm/mmu.c) and this one
> > in the core code.
> 
> I meant implementing two new page table walkers, not two walks at runtime.

Ok, but I'm still missing why you think that is needed. What would be the
second page table walker that needs implementing?

> > My worry is that these are actually highly arch-specific, but will likely
> > grow more users in mm/ that assume things for all architectures that aren't
> > necessarily valid.
> 
> I see, how about moving the new kasan_map_populate() implementation into
> arch-dependent code:
> 
> arch/x86/mm/kasan_init_64.c
> arch/arm64/mm/kasan_init.c
> 
> This way we won't need to add pmd_large()/pud_large() macros for arm64?

I guess we could implement that on arm64 using our current vmemmap_populate
logic and an explicit memset.

Will


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 19:02                 ` Will Deacon
@ 2017-10-09 19:07                   ` Pavel Tatashin
  2017-10-09 19:57                     ` Pavel Tatashin
  0 siblings, 1 reply; 52+ messages in thread
From: Pavel Tatashin @ 2017-10-09 19:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	Ard Biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

>
> Ok, but I'm still missing why you think that is needed. What would be the
> second page table walker that needs implementing?
>
> I guess we could implement that on arm64 using our current vmemmap_populate
> logic and an explicit memset.
>

Hi Will,

What do you mean by explicit memset()? We can't simply memset() from
start to end without doing the page table walk, because at the time
kasan is calling vmemmap_populate() we have a tmp_pg_dir instead of
swapper_pg_dir.

We could do the explicit memset() after
cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); but again, this was in
one of my previous implementations, and I was asked to replace that.

Pavel


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function
  2017-10-09 19:07                   ` Pavel Tatashin
@ 2017-10-09 19:57                     ` Pavel Tatashin
  0 siblings, 0 replies; 52+ messages in thread
From: Pavel Tatashin @ 2017-10-09 19:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, catalin.marinas, linux-kernel, sparclinux,
	linux-mm, linuxppc-dev, linux-s390, linux-arm-kernel, x86,
	kasan-dev, borntraeger, heiko.carstens, davem, willy, mhocko,
	Ard Biesheuvel, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

>> I guess we could implement that on arm64 using our current vmemmap_populate
>> logic and an explicit memset.

Hi Will,

I will send out a new patch series with x86/arm64 versions of
kasan_map_populate(), so you can take a look and see whether this
approach is acceptable.

Thank you,
Pavel


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2017-10-09 19:57 UTC | newest]

Thread overview: 52+ messages
2017-09-20 20:17 [PATCH v9 00/12] complete deferred page initialization Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 01/12] x86/mm: setting fields in deferred pages Pavel Tatashin
2017-10-03 12:26   ` Michal Hocko
2017-10-03 15:07     ` Pasha Tatashin
2017-09-20 20:17 ` [PATCH v9 02/12] sparc64/mm: " Pavel Tatashin
2017-10-03 12:28   ` Michal Hocko
2017-10-03 15:10     ` Pasha Tatashin
2017-09-20 20:17 ` [PATCH v9 03/12] mm: deferred_init_memmap improvements Pavel Tatashin
2017-10-03 12:57   ` Michal Hocko
2017-10-03 15:15     ` Pasha Tatashin
2017-10-03 16:01       ` Pasha Tatashin
2017-10-04  8:48         ` Michal Hocko
2017-09-20 20:17 ` [PATCH v9 04/12] sparc64: simplify vmemmap_populate Pavel Tatashin
2017-10-03 12:59   ` Michal Hocko
2017-10-03 15:20     ` Pasha Tatashin
2017-09-20 20:17 ` [PATCH v9 05/12] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 06/12] mm: zero struct pages during initialization Pavel Tatashin
2017-10-03 13:08   ` Michal Hocko
2017-10-03 15:22     ` Pasha Tatashin
2017-10-04  8:45       ` Michal Hocko
2017-10-04 12:26         ` Pasha Tatashin
2017-09-20 20:17 ` [PATCH v9 07/12] sparc64: optimized struct page zeroing Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 08/12] mm: zero reserved and unavailable struct pages Pavel Tatashin
2017-10-03 13:18   ` Michal Hocko
2017-10-03 15:29     ` Pasha Tatashin
2017-10-04  8:56       ` Michal Hocko
2017-10-04 12:40         ` Pasha Tatashin
2017-10-04 12:57           ` Michal Hocko
2017-10-04 13:28             ` Pasha Tatashin
2017-10-04 14:04               ` Michal Hocko
2017-10-04 15:08                 ` Pasha Tatashin
2017-09-20 20:17 ` [PATCH v9 09/12] mm/kasan: kasan specific map populate function Pavel Tatashin
2017-10-03 14:48   ` Mark Rutland
2017-10-03 15:04     ` Pasha Tatashin
2017-10-09 17:13     ` Will Deacon
2017-10-09 17:51       ` Pavel Tatashin
2017-10-09 18:14         ` Michal Hocko
2017-10-09 18:48           ` Will Deacon
2017-10-09 18:22         ` Will Deacon
2017-10-09 18:42           ` Pavel Tatashin
2017-10-09 18:48             ` Will Deacon
2017-10-09 18:59               ` Pavel Tatashin
2017-10-09 19:02                 ` Will Deacon
2017-10-09 19:07                   ` Pavel Tatashin
2017-10-09 19:57                     ` Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 10/12] x86/kasan: use kasan_map_populate() Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 11/12] arm64/kasan: " Pavel Tatashin
2017-09-20 20:17 ` [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
2017-10-03 13:19   ` Michal Hocko
2017-10-03 15:34     ` Pasha Tatashin
2017-10-03 20:26       ` Pasha Tatashin
2017-10-04  8:45         ` Michal Hocko
