* [PATCH v11 0/9] complete deferred page initialization
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Changelog:
v11 - v10
- Moved the kasan_map_populate() implementation from common code into
  arch-specific code, as discussed with Will Deacon. We no longer need
  "mm/kasan: kasan specific map populate function", so only 9 patches
  are left.

v10 - v9
- Addressed new comments from Michal Hocko.
- Sent "mm: deferred_init_memmap improvements" as a separate patch as
  it is also fixing existing problem.
- Merged "mm: stop zeroing memory during allocation in vmemmap" with
  "mm: zero struct pages during initialization".
- Added more comments to "mm: zero reserved and unavailable struct pages"

v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  the kasan implementation. Added a new function, kasan_map_populate(),
  which zeroes the allocated and mapped memory.

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compile issue on the tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed a bug reported by the kbuild test robot in the new patch:
  mm: zero reserved and unavailable struct pages
- Removed the patch
  x86/mm: reserve only exiting low pages
  as it is no longer needed because of the previous fix
- Rewrote deferred_init_memmap(); found and fixed an existing bug where the
  page variable is not reset when zone holes are present.
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested the ARM64 code in qemu and found a few more issues, which I fixed
  in this iteration
- Added page roundup/rounddown to the x86 and arm zeroing routines to zero
  the whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations
v4 - v3
- Rewrote the code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before it is initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
    * Split the platform changes into 4 patches
    * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of keeping
  memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read a struct page until it has been initialized
- Never set any fields in struct pages before they are initialized
- Zero each struct page at the beginning of its initialization (see the
  short sketch below)
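
A minimal sketch of the zeroing in the last item, assuming it lands at the
top of __init_single_page() (the exact helper used by the patches may
differ):

	static void __meminit __init_single_page(struct page *page,
			unsigned long pfn, unsigned long zone, int nid)
	{
		/* Clear the whole struct page before any field is set. */
		memset(page, 0, sizeof(struct page));
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		page_mapcount_reset(page);
		...
	}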


==========================================================================
Performance improvements on an x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%
==========================================================================
SPARC M6 3600 MHz with 15T of memory
                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%
==========================================================================
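
(The SPEED UP column appears to be computed as base/fix - 1; for example,
for the SPARC deferred case, 237.534603 / 182.103003 - 1 = ~30.44%.)
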
Raw dmesg output with timestamps:
x86 base no deferred:    https://hastebin.com/ofunepurit.scala
x86 base deferred:       https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred:     https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:      https://hastebin.com/xadinobutu.go

Pavel Tatashin (9):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero reserved and unavailable struct pages
  x86/kasan: add and use kasan_map_populate()
  arm64/kasan: add and use kasan_map_populate()
  mm: stop zeroing memory during allocation in vmemmap
  sparc64: optimized struct page zeroing

 arch/arm64/mm/kasan_init.c          | 72 ++++++++++++++++++++++++++++++++---
 arch/sparc/include/asm/pgtable_64.h | 30 +++++++++++++++
 arch/sparc/mm/init_64.c             | 32 +++++++---------
 arch/x86/mm/init_64.c               | 10 ++++-
 arch/x86/mm/kasan_init_64.c         | 75 +++++++++++++++++++++++++++++++++++--
 include/linux/bootmem.h             | 27 +++++++++++++
 include/linux/memblock.h            | 16 ++++++++
 include/linux/mm.h                  | 26 +++++++++++++
 mm/memblock.c                       | 60 +++++++++++++++++++++++++----
 mm/page_alloc.c                     | 54 ++++++++++++++++++++++----
 mm/sparse-vmemmap.c                 | 15 ++++----
 mm/sparse.c                         |  6 +--
 12 files changed, 367 insertions(+), 56 deletions(-)

-- 
2.14.2

* [PATCH v11 1/9] x86/mm: setting fields in deferred pages
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed before the pages
are first initialized in __init_single_page().

With the deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info() that are subsequently clobbered right after in
free_all_bootmem():

        mem_init() {
                register_page_bootmem_info();
                free_all_bootmem();
                ...
        }

When register_page_bootmem_info() is called, only non-deferred struct pages
are initialized. But this function goes through some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
    register_page_bootmem_info_node
     get_page_bootmem
      .. setting fields here ..
      such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
    for_each_reserved_mem_region()
     reserve_bootmem_region()
      init_reserved_page() <- Only if this is deferred reserved page
       __init_single_pfn()
        __init_single_page()
            memset(0) <-- Lose the set fields here
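
For reference, the "setting fields" step in the first trace happens in
get_page_bootmem(), which at the time of this series looks roughly like
this (a sketch for illustration, not part of this patch):

	void get_page_bootmem(unsigned long info, struct page *page,
			      unsigned long type)
	{
		page->freelist = (void *)type;	/* written before init */
		SetPagePrivate(page);
		set_page_private(page, info);
		page_ref_inc(page);
	}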

We end up in a state where we currently do not observe a problem, because
the memory is explicitly zeroed. But if flag asserts are changed, we can
start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/mm/init_64.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..8822523fdcd7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,18 @@ void __init mem_init(void)
 
 	/* clear_bss() already clear the empty_zero_page */
 
-	register_page_bootmem_info();
-
 	/* this will put all memory onto the freelists */
 	free_all_bootmem();
 	after_bootmem = 1;
 
+	/*
+	 * Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/* Register memory areas for /proc/kcore */
 	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 			 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.2

* [PATCH v11 2/9] sparc64/mm: setting fields in deferred pages
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed before the pages
are first initialized in __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initialization:

mem_init() {
     register_page_bootmem_info();
     free_all_bootmem();
     ...
}

When register_page_bootmem_info() is called, only non-deferred struct pages
are initialized. But this function goes through some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
    __init_single_pfn()
     __init_single_page()
      memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem because the memory is zeroed. But if flag asserts
are changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to swap the order of the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/sparc/mm/init_64.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..caed495544e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,16 @@ void __init mem_init(void)
 {
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-	register_page_bootmem_info();
 	free_all_bootmem();
 
+	/*
+	 * Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/*
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
-- 
2.14.2

* [PATCH v11 3/9] sparc64: simplify vmemmap_populate
  2017-10-09 22:19 ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Remove duplicated code by using the common functions
vmemmap_pud_populate() and vmemmap_pgd_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/sparc/mm/init_64.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index caed495544e9..6839db3ffe1d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2652,30 +2652,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.2

* [PATCH v11 4/9] mm: defining memblock_virt_alloc_try_nid_raw
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
    - Does not zero the allocated memory
    - Does not panic if request cannot be satisfied

* optimize early system hash allocations

Callers of alloc_large_system_hash() pass the HASH_ZERO flag to indicate
that the memory allocated for the system hash must be zeroed; otherwise the
caller initializes the memory itself and no zeroing is needed.

When the memory does not need to be zeroed, use the new
memblock_virt_alloc_raw() interface, which improves boot performance.
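
As a minimal illustration (hypothetical caller and variables, not part of
this patch), the choice between the two boot-time interfaces looks like:

	/* Caller needs zeroed memory: keep using the zeroing variant. */
	buf = memblock_virt_alloc(size, 0);

	/* Caller will overwrite every byte itself: use the raw variant and
	 * skip the redundant memset() during boot.
	 */
	buf = memblock_virt_alloc_raw(size, 0);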

* debug for the raw allocator

When CONFIG_DEBUG_VM is enabled, this patch fills the memory returned by
memblock_virt_alloc_try_nid_raw() with 0xff, to catch any place that
wrongly expects zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/bootmem.h | 27 ++++++++++++++++++++++
 mm/memblock.c           | 60 +++++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c         | 15 ++++++-------
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
 					    NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+					    BOOTMEM_ALLOC_ACCESSIBLE,
+					    NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
 	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size,  phys_addr_t align)
+{
+	if (!align)
+		align = SMP_CACHE_BYTES;
+	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
 					  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+				min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t size, phys_addr_t align,
 			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr,
+			int nid)
+{
+	void *ptr;
+
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, (void *)_RET_IP_);
+
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+#ifdef CONFIG_DEBUG_VM
+	if (ptr && size > 0)
+		memset(ptr, 0xff, size);
+#endif
+	return ptr;
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1351,8 +1389,8 @@ static void * __init memblock_virt_alloc_internal(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
- * additional debug information (including caller info), if enabled.
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. This function zeroes the allocated memory.
  *
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
@@ -1362,11 +1400,17 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 				phys_addr_t min_addr, phys_addr_t max_addr,
 				int nid)
 {
+	void *ptr;
+
 	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+
+	ptr = memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid);
+	if (ptr)
+		memset(ptr, 0, size);
+	return ptr;
 }
 
 /**
@@ -1380,7 +1424,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public panicking version of _memblock_virt_alloc_try_nid_nopanic()
+ * Public panicking version of memblock_virt_alloc_try_nid_nopanic()
  * which provides debug information (including caller info), if enabled,
  * and panics if the request can not be satisfied.
  *
@@ -1399,8 +1443,10 @@ void * __init memblock_virt_alloc_try_nid(
 		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
 					   min_addr, max_addr, nid);
-	if (ptr)
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdbd14829fd3..20b0bace2235 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7307,18 +7307,17 @@ void *__init alloc_large_system_hash(const char *tablename,
 
 	log2qty = ilog2(numentries);
 
-	/*
-	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
-	 * currently not used when HASH_EARLY is specified.
-	 */
 	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
 		size = bucketsize << log2qty;
-		if (flags & HASH_EARLY)
-			table = memblock_virt_alloc_nopanic(size, 0);
-		else if (hashdist)
+		if (flags & HASH_EARLY) {
+			if (flags & HASH_ZERO)
+				table = memblock_virt_alloc_nopanic(size, 0);
+			else
+				table = memblock_virt_alloc_raw(size, 0);
+		} else if (hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
-		else {
+		} else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
 			 * some pages at the end of hash table which
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory starting from pfn 1 (e.g. under KVM).

Since struct pages are now zeroed in __init_single_page() rather than at
allocation time, we must zero such struct pages explicitly.

The patch adds a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

which walks the reserved && !memory ranges; for each such range we zero the
corresponding struct pages by calling mm_zero_struct_page().
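
Purely as a hypothetical illustration, an architecture for which a small
memset() is expensive could override the generic macro from its
<asm/pgtable.h> along these lines:

	/* Hypothetical arch override; the generic fallback in <linux/mm.h>
	 * is memset(page, 0, sizeof(struct page)).
	 */
	static inline void arch_mm_zero_struct_page(struct page *page)
	{
		unsigned long *p = (unsigned long *)page;
		unsigned int i;

		BUILD_BUG_ON(sizeof(struct page) % sizeof(unsigned long));
		for (i = 0; i < sizeof(struct page) / sizeof(unsigned long); i++)
			p[i] = 0;	/* word-sized stores instead of memset() */
	}
	#define mm_zero_struct_page(pp)	arch_mm_zero_struct_page(pp)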

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       | 15 +++++++++++++++
 mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+
+	/*
+	 * Struct pages that do not have backing memory. This could be because
+	 * firmware is using some of this memory, or for some other reasons.
+	 * Once memblock is changed so such behaviour is not allowed: i.e.
+	 * list of "reserved" memory must be a subset of list of "memory", then
+	 * this code can be removed.
+	 */
+	pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the exiting
memory from pfn 1 (i.e. KVM).

Since, struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       | 15 +++++++++++++++
 mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+
+	/*
+	 * Struct pages that do not have backing memory. This could be because
+	 * firmware is using some of this memory, or for some other reasons.
+	 * Once memblock is changed so such behaviour is not allowed: i.e.
+	 * list of "reserved" memory must be a subset of list of "memory", then
+	 * this code can be removed.
+	 */
+	pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the exiting
memory from pfn 1 (i.e. KVM).

Since, struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       | 15 +++++++++++++++
 mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+
+	/*
+	 * Struct pages that do not have backing memory. This could be because
+	 * firmware is using some of this memory, or for some other reason.
+	 * Once memblock is changed so that such behaviour is not allowed, i.e.
+	 * the list of "reserved" memory is a subset of the list of "memory",
+	 * this code can be removed.
+	 */
+	pr_info("Reserved but unavailable: %llu pages\n", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even though they do not
contain any data. For example, page_to_pfn() might access page->flags if
that is where the section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory starting from pfn 1 (e.g. under KVM).

Since struct pages are zeroed in __init_single_page() rather than at
allocation time, we must zero such struct pages explicitly.

The patch adds a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

which walks over the reserved && !memory ranges, and we zero the struct
pages in those ranges explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       | 15 +++++++++++++++
 mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus are not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But there are some
+ * struct pages which are reserved in the memblock allocator and whose fields
+ * may be accessed (for example, page_to_pfn() on some configurations accesses
+ * page->flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+
+	/*
+	 * Struct pages that do not have backing memory. This could be because
+	 * firmware is using some of this memory, or for some other reason.
+	 * Once memblock is changed so that such behaviour is not allowed, i.e.
+	 * the list of "reserved" memory is a subset of the list of "memory",
+	 * this code can be removed.
+	 */
+	pr_info("Reserved but unavailable: %llu pages\n", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 6/9] x86/kasan: add and use kasan_map_populate()
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().
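
For reference, the shadow addresses being mapped here come from the usual
generic-KASAN translation, one shadow byte per eight bytes of memory. A
sketch of the helper as it is commonly defined (shown only to make the
address arithmetic explicit; this patch does not change it):

	static inline void *kasan_mem_to_shadow(const void *addr)
	{
		return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
			+ KASAN_SHADOW_OFFSET;
	}

With KASAN_SHADOW_SCALE_SHIFT == 3, populating the shadow for a region
means allocating, mapping, and zeroing about 1/8th of that region's size,
which is why the cost of the zeroing matters here.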

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = p4d_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_large(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_large(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
 	start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
 	end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-	return vmemmap_populate(start, end, NUMA_NO_NODE);
+	return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
 		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
 		kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-	vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-			(unsigned long)kasan_mem_to_shadow(_end),
-			NUMA_NO_NODE);
+	kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+			   (unsigned long)kasan_mem_to_shadow(_end),
+			   NUMA_NO_NODE);
 
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 6/9] x86/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = p4d_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_large(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_large(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
 	start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
 	end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-	return vmemmap_populate(start, end, NUMA_NO_NODE);
+	return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
 		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
 		kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-	vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-			(unsigned long)kasan_mem_to_shadow(_end),
-			NUMA_NO_NODE);
+	kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+			   (unsigned long)kasan_mem_to_shadow(_end),
+			   NUMA_NO_NODE);
 
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 6/9] x86/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = p4d_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_large(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_large(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
 	start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
 	end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-	return vmemmap_populate(start, end, NUMA_NO_NODE);
+	return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
 		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
 		kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-	vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-			(unsigned long)kasan_mem_to_shadow(_end),
-			NUMA_NO_NODE);
+	kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+			   (unsigned long)kasan_mem_to_shadow(_end),
+			   NUMA_NO_NODE);
 
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 6/9] x86/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = p4d_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_large(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_large(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
 	start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
 	end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-	return vmemmap_populate(start, end, NUMA_NO_NODE);
+	return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
 		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
 		kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-	vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-			(unsigned long)kasan_mem_to_shadow(_end),
-			NUMA_NO_NODE);
+	kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+			   (unsigned long)kasan_mem_to_shadow(_end),
+			   NUMA_NO_NODE);
 
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().
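
The helper added below is intentionally kept structurally identical to the
x86 version from the previous patch; the only differences are in the
arch-specific parts of the page-table walk: there is no separate p4d step
here, and block (section) mappings are detected with pud_sect()/pmd_sect()
rather than pud_large()/pmd_large().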

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_sect(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_sect(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
-	 * vmemmap_populate() has populated the shadow region that covers the
+	 * kasan_map_populate() has populated the shadow region that covers the
 	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 	 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				(unsigned long)kasan_mem_to_shadow(end),
-				pfn_to_nid(virt_to_pfn(start)));
+		kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+				   (unsigned long)kasan_mem_to_shadow(end),
+				   pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_sect(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_sect(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
-	 * vmemmap_populate() has populated the shadow region that covers the
+	 * kasan_map_populate() has populated the shadow region that covers the
 	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 	 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				(unsigned long)kasan_mem_to_shadow(end),
-				pfn_to_nid(virt_to_pfn(start)));
+		kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+				   (unsigned long)kasan_mem_to_shadow(end),
+				   pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_sect(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_sect(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
-	 * vmemmap_populate() has populated the shadow region that covers the
+	 * kasan_map_populate() has populated the shadow region that covers the
 	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 	 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				(unsigned long)kasan_mem_to_shadow(end),
-				pfn_to_nid(virt_to_pfn(start)));
+		kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+				   (unsigned long)kasan_mem_to_shadow(end),
+				   pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
while kasan expects its shadow memory to be zeroed. Resolve the difference
by adding a new kasan_map_populate() interface that allocates and maps the
kasan shadow memory and also zeroes it for us, and by using it instead of
vmemmap_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+					int node)
+{
+	unsigned long addr, pfn, next;
+	unsigned long long size;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int ret;
+
+	ret = vmemmap_populate(start, end, node);
+	/*
+	 * We might have partially populated memory, so check for no entries,
+	 * and zero only those that actually exist.
+	 */
+	for (addr = start; addr < end; addr = next) {
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = pgd_addr_end(addr, end);
+			continue;
+		}
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = pud_addr_end(addr, end);
+			continue;
+		}
+		if (pud_sect(*pud)) {
+			/* This is PUD size page */
+			next = pud_addr_end(addr, end);
+			size = PUD_SIZE;
+			pfn = pud_pfn(*pud);
+		} else {
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd)) {
+				next = pmd_addr_end(addr, end);
+				continue;
+			}
+			if (pmd_sect(*pmd)) {
+				/* This is PMD size page */
+				next = pmd_addr_end(addr, end);
+				size = PMD_SIZE;
+				pfn = pmd_pfn(*pmd);
+			} else {
+				pte = pte_offset_kernel(pmd, addr);
+				next = addr + PAGE_SIZE;
+				if (pte_none(*pte))
+					continue;
+				/* This is base size page */
+				size = PAGE_SIZE;
+				pfn = pte_pfn(*pte);
+			}
+		}
+		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+	}
+	return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
-	 * vmemmap_populate() has populated the shadow region that covers the
+	 * kasan_map_populate() has populated the shadow region that covers the
 	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 	 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				(unsigned long)kasan_mem_to_shadow(end),
-				pfn_to_nid(virt_to_pfn(start)));
+		kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+				   (unsigned long)kasan_mem_to_shadow(end),
+				   pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

vmemmap_alloc_block() will no longer zero the block, so zero memory at its
call sites for everything except struct pages. Struct page memory is zeroed
by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing variants, so
that we get the performance improvement of zeroing the memory in parallel
as struct pages are initialized.

Add struct page zeroing as part of the initialization of the other fields
in __init_single_page().

This single-thread performance data was collected on: Intel(R) Xeon(R) CPU
E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init() is where the memory for struct pages used to be zeroed; this
patch moves that zeroing into __init_single_page(), which is called from
zone_sizes_init().
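
To put the numbers above in perspective: the total for this single-threaded
path drops from 16.12s to 8.36s, roughly a 1.9x improvement, even before
deferred struct page initialization spreads the remaining work across cpus.
The time does not disappear entirely because the memset of 268400646 struct
pages (about 16GB of memmap, assuming a 64-byte struct page) simply moves
from sparse_init() into __init_single_page(), which is why zone_sizes_init()
grows from 4.88s to 8.36s.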

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h  | 11 +++++++++++
 mm/page_alloc.c     |  1 +
 mm/sparse-vmemmap.c | 15 +++++++--------
 mm/sparse.c         |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

vmemmap_alloc_block() will no longer zero the block, so zero memory at its
call sites for everything except struct pages. Struct page memory is zeroed
by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing variants, so
that we get the performance improvement of zeroing the memory in parallel
as struct pages are initialized.

Add struct page zeroing as part of the initialization of the other fields
in __init_single_page().

This single-thread performance data was collected on: Intel(R) Xeon(R) CPU
E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init() is where the memory for struct pages used to be zeroed; this
patch moves that zeroing into __init_single_page(), which is called from
zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h  | 11 +++++++++++
 mm/page_alloc.c     |  1 +
 mm/sparse-vmemmap.c | 15 +++++++--------
 mm/sparse.c         |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

vmemmap_alloc_block() will no longer zero the block, so zero memory at its
call sites for everything except struct pages. Struct page memory is zeroed
by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing variants, so
that we get the performance improvement of zeroing the memory in parallel
as struct pages are initialized.

Add struct page zeroing as part of the initialization of the other fields
in __init_single_page().

This single-thread performance data was collected on: Intel(R) Xeon(R) CPU
E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init() is where the memory for struct pages used to be zeroed; this
patch moves that zeroing into __init_single_page(), which is called from
zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h  | 11 +++++++++++
 mm/page_alloc.c     |  1 +
 mm/sparse-vmemmap.c | 15 +++++++--------
 mm/sparse.c         |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap
@ 2017-10-09 22:19   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-arm-kernel

vmemmap_alloc_block() will no longer zero the block, so zero the memory
at its call sites for everything except struct pages.  Struct page memory
is zeroed by struct page initialization.

Replace the allocators in sparse-vmemmap with the non-zeroing versions, so
we get the performance improvement from zeroing the memory in parallel
when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

Single-thread performance, collected on an Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s

sparse_init() is where memory for struct pages was zeroed; this patch
moves the zeroing into __init_single_page(), which is called from
zone_sizes_init().
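
For reference, the generic mm_zero_struct_page() fallback that the new call
relies on (introduced earlier in this series) is a plain memset(); a minimal
sketch of the pattern:

	#ifndef mm_zero_struct_page
	#define mm_zero_struct_page(pp) \
		((void)memset((pp), 0, sizeof(struct page)))
	#endif

Architectures where a small memset() is expensive can override this macro
in <asm/pgtable.h>, as sparc64 does in the next patch.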

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h  | 11 +++++++++++
 mm/page_alloc.c     |  1 +
 mm/sparse-vmemmap.c | 15 +++++++--------
 mm/sparse.c         |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH v11 9/9] sparc64: optimized struct page zeroing
  2017-10-09 22:19 ` Pavel Tatashin
@ 2017-10-09 22:19   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-09 22:19 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, depending on the size
of struct page; the compiler optimizes away the switch() conditions.

SPARC-M6 with 15T of memory, single thread performance:

                               BASE            FIX  OPTIMIZED_FIX
        bootmem_init   28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
                      --------------------------------------------
Total                 231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note that about two seconds of this function is fixed overhead:
it does not increase as memory increases.
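
Because sizeof(struct page) is a compile-time constant, the switch() in the
macro collapses into straight-line stores. For a 64-byte struct page it
reduces to eight 8-byte stores, roughly the following sketch:

	/* Illustrative sketch: the default case of mm_zero_struct_page() */
	static inline void zero_struct_page_64(void *page)
	{
		unsigned long *w = page;

		w[7] = 0; w[6] = 0; w[5] = 0; w[4] = 0;
		w[3] = 0; w[2] = 0; w[1] = 0; w[0] = 0;
	}

A 72- or 80-byte struct page adds one or two more stores through the extra
case labels.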

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or shrinks below 64.
+ * The idea is that the compiler optimizes out the switch() statement and
+ * leaves only clrx instructions.
+ */
+#define	mm_zero_struct_page(pp) do {					\
+	unsigned long *_pp = (void *)(pp);				\
+									\
+	 /* Check that struct page is either 64, 72, or 80 bytes */	\
+	BUILD_BUG_ON(sizeof(struct page) & 7);				\
+	BUILD_BUG_ON(sizeof(struct page) < 64);				\
+	BUILD_BUG_ON(sizeof(struct page) > 80);				\
+									\
+	switch (sizeof(struct page)) {					\
+	case 80:							\
+		_pp[9] = 0;	/* fallthrough */			\
+	case 72:							\
+		_pp[8] = 0;	/* fallthrough */			\
+	default:							\
+		_pp[7] = 0;						\
+		_pp[6] = 0;						\
+		_pp[5] = 0;						\
+		_pp[4] = 0;						\
+		_pp[3] = 0;						\
+		_pp[2] = 0;						\
+		_pp[1] = 0;						\
+		_pp[0] = 0;						\
+	}								\
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
  2017-10-09 22:19   ` Pavel Tatashin
@ 2017-10-10 13:44     ` Michal Hocko
  -1 siblings, 0 replies; 115+ messages in thread
From: Michal Hocko @ 2017-10-10 13:44 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Mon 09-10-17 18:19:27, Pavel Tatashin wrote:
> Some memory is reserved but unavailable: not present in memblock.memory
> (because not backed by physical pages), but present in memblock.reserved.
> Such memory has backing struct pages, but they are not initialized by going
> through __init_single_page().
> 
> In some cases these struct pages are accessed even if they do not contain
> any data. One example is page_to_pfn() might access page->flags if this is
> where section information is stored (CONFIG_SPARSEMEM,
> SECTION_IN_PAGE_FLAGS).
> 
> One example of such memory: trim_low_memory_range() unconditionally
> reserves from pfn 0, but e820__memblock_setup() might provide the existing
> memory from pfn 1 (i.e. KVM).
> 
> Since struct pages are zeroed in __init_single_page(), and not during
> allocation time, we must zero such struct pages explicitly.
> 
> The patch involves adding a new memblock iterator:
> 	for_each_resv_unavail_range(i, p_start, p_end)
> 
> Which iterates through reserved && !memory lists, and we zero struct pages
> explicitly by calling mm_zero_struct_page().
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Bob Picco <bob.picco@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memblock.h | 16 ++++++++++++++++
>  include/linux/mm.h       | 15 +++++++++++++++
>  mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7e7bf3..ce8bfa5f3e9b 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
>  	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
>  			       nid, flags, p_start, p_end, p_nid)
>  
> +/**
> + * for_each_resv_unavail_range - iterate through reserved and unavailable memory
> + * @i: u64 used as loop variable
> + * @flags: pick from blocks based on memory attributes
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
> + * Available as soon as memblock is initialized.
> + * Note: because this memory does not belong to any physical node, flags and
> + * nid arguments do not make sense and thus not exported as arguments.
> + */
> +#define for_each_resv_unavail_range(i, p_start, p_end)			\
> +	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
> +			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
> +
>  static inline void memblock_set_region_flags(struct memblock_region *r,
>  					     unsigned long flags)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 065d99deb847..04c8b2e5aff4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_forbids_zeropage(X)	(0)
>  #endif
>  
> +/*
> + * On some architectures it is expensive to call memset() for small sizes.
> + * Those architectures should provide their own implementation of "struct page"
> + * zeroing by defining this macro in <asm/pgtable.h>.
> + */
> +#ifndef mm_zero_struct_page
> +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> +#endif
> +
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
>   * per mm struct. Users can overwrite this number by sysctl but there is a
> @@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
>  					struct mminit_pfnnid_cache *state);
>  #endif
>  
> +#ifdef CONFIG_HAVE_MEMBLOCK
> +void zero_resv_unavail(void);
> +#else
> +static inline void zero_resv_unavail(void) {}
> +#endif
> +
>  extern void set_dma_reserve(unsigned long new_dma_reserve);
>  extern void memmap_init_zone(unsigned long, int, unsigned long,
>  				unsigned long, enum memmap_context);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 20b0bace2235..5f0013bbbe9d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>  	free_area_init_core(pgdat);
>  }
>  
> +#ifdef CONFIG_HAVE_MEMBLOCK
> +/*
> + * Only struct pages that are backed by physical memory are zeroed and
> + * initialized by going through __init_single_page(). But, there are some
> + * struct pages which are reserved in memblock allocator and their fields
> + * may be accessed (for example page_to_pfn() on some configuration accesses
> + * flags). We must explicitly zero those struct pages.
> + */
> +void __paginginit zero_resv_unavail(void)
> +{
> +	phys_addr_t start, end;
> +	unsigned long pfn;
> +	u64 i, pgcnt;
> +
> +	/* Loop through ranges that are reserved, but do not have reported
> +	 * physical memory backing.
> +	 */
> +	pgcnt = 0;
> +	for_each_resv_unavail_range(i, &start, &end) {
> +		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
> +			mm_zero_struct_page(pfn_to_page(pfn));
> +			pgcnt++;
> +		}
> +	}
> +
> +	/*
> +	 * Struct pages that do not have backing memory. This could be because
> +	 * firmware is using some of this memory, or for some other reasons.
> +	 * Once memblock is changed so such behaviour is not allowed: i.e.
> +	 * list of "reserved" memory must be a subset of list of "memory", then
> +	 * this code can be removed.
> +	 */
> +	pr_info("Reserved but unavailable: %lld pages", pgcnt);
> +}
> +#endif /* CONFIG_HAVE_MEMBLOCK */
> +
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  
>  #if MAX_NUMNODES > 1
> @@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>  			node_set_state(nid, N_MEMORY);
>  		check_for_memory(pgdat, nid);
>  	}
> +	zero_resv_unavail();
>  }
>  
>  static int __init cmdline_parse_core(char *p, unsigned long *core)
> @@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
>  {
>  	free_area_init_node(0, zones_size,
>  			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
> +	zero_resv_unavail();
>  }
>  
>  static int page_alloc_cpu_dead(unsigned int cpu)
> -- 
> 2.14.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
  2017-10-10 13:44     ` Michal Hocko
@ 2017-10-10 14:09       ` Michal Hocko
  -1 siblings, 0 replies; 115+ messages in thread
From: Michal Hocko @ 2017-10-10 14:09 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

On Tue 10-10-17 15:44:41, Michal Hocko wrote:
> On Mon 09-10-17 18:19:27, Pavel Tatashin wrote:
> > Some memory is reserved but unavailable: not present in memblock.memory
> > (because not backed by physical pages), but present in memblock.reserved.
> > Such memory has backing struct pages, but they are not initialized by going
> > through __init_single_page().
> > 
> > In some cases these struct pages are accessed even if they do not contain
> > any data. One example is page_to_pfn() might access page->flags if this is
> > where section information is stored (CONFIG_SPARSEMEM,
> > SECTION_IN_PAGE_FLAGS).
> > 
> > One example of such memory: trim_low_memory_range() unconditionally
> > reserves from pfn 0, but e820__memblock_setup() might provide the existing
> > memory from pfn 1 (i.e. KVM).

Btw. I would add your example from http://lkml.kernel.org/r/bcf24369-ac37-cedd-a264-3396fb5cf39e@oracle.com
to the changelog.
 
> > Since struct pages are zeroed in __init_single_page(), and not during
> > allocation time, we must zero such struct pages explicitly.
> > 
> > The patch involves adding a new memblock iterator:
> > 	for_each_resv_unavail_range(i, p_start, p_end)
> > 
> > Which iterates through reserved && !memory lists, and we zero struct pages
> > explicitly by calling mm_zero_struct_page().
> > 
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
> > Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Reviewed-by: Bob Picco <bob.picco@oracle.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> > ---
> >  include/linux/memblock.h | 16 ++++++++++++++++
> >  include/linux/mm.h       | 15 +++++++++++++++
> >  mm/page_alloc.c          | 38 ++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 69 insertions(+)
> > 
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index bae11c7e7bf3..ce8bfa5f3e9b 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
> >  	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
> >  			       nid, flags, p_start, p_end, p_nid)
> >  
> > +/**
> > + * for_each_resv_unavail_range - iterate through reserved and unavailable memory
> > + * @i: u64 used as loop variable
> > + * @flags: pick from blocks based on memory attributes
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + *
> > + * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
> > + * Available as soon as memblock is initialized.
> > + * Note: because this memory does not belong to any physical node, flags and
> > + * nid arguments do not make sense and thus not exported as arguments.
> > + */
> > +#define for_each_resv_unavail_range(i, p_start, p_end)			\
> > +	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
> > +			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
> > +
> >  static inline void memblock_set_region_flags(struct memblock_region *r,
> >  					     unsigned long flags)
> >  {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 065d99deb847..04c8b2e5aff4 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
> >  #define mm_forbids_zeropage(X)	(0)
> >  #endif
> >  
> > +/*
> > + * On some architectures it is expensive to call memset() for small sizes.
> > + * Those architectures should provide their own implementation of "struct page"
> > + * zeroing by defining this macro in <asm/pgtable.h>.
> > + */
> > +#ifndef mm_zero_struct_page
> > +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> > +#endif
> > +
> >  /*
> >   * Default maximum number of active map areas, this limits the number of vmas
> >   * per mm struct. Users can overwrite this number by sysctl but there is a
> > @@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
> >  					struct mminit_pfnnid_cache *state);
> >  #endif
> >  
> > +#ifdef CONFIG_HAVE_MEMBLOCK
> > +void zero_resv_unavail(void);
> > +#else
> > +static inline void zero_resv_unavail(void) {}
> > +#endif
> > +
> >  extern void set_dma_reserve(unsigned long new_dma_reserve);
> >  extern void memmap_init_zone(unsigned long, int, unsigned long,
> >  				unsigned long, enum memmap_context);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 20b0bace2235..5f0013bbbe9d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> >  	free_area_init_core(pgdat);
> >  }
> >  
> > +#ifdef CONFIG_HAVE_MEMBLOCK
> > +/*
> > + * Only struct pages that are backed by physical memory are zeroed and
> > + * initialized by going through __init_single_page(). But, there are some
> > + * struct pages which are reserved in memblock allocator and their fields
> > + * may be accessed (for example page_to_pfn() on some configuration accesses
> > + * flags). We must explicitly zero those struct pages.
> > + */
> > +void __paginginit zero_resv_unavail(void)
> > +{
> > +	phys_addr_t start, end;
> > +	unsigned long pfn;
> > +	u64 i, pgcnt;
> > +
> > +	/* Loop through ranges that are reserved, but do not have reported
> > +	 * physical memory backing.
> > +	 */
> > +	pgcnt = 0;
> > +	for_each_resv_unavail_range(i, &start, &end) {
> > +		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
> > +			mm_zero_struct_page(pfn_to_page(pfn));
> > +			pgcnt++;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Struct pages that do not have backing memory. This could be because
> > +	 * firmware is using some of this memory, or for some other reasons.
> > +	 * Once memblock is changed so such behaviour is not allowed: i.e.
> > +	 * list of "reserved" memory must be a subset of list of "memory", then
> > +	 * this code can be removed.
> > +	 */
> > +	pr_info("Reserved but unavailable: %lld pages", pgcnt);
> > +}
> > +#endif /* CONFIG_HAVE_MEMBLOCK */
> > +
> >  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  
> >  #if MAX_NUMNODES > 1
> > @@ -6632,6 +6668,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> >  			node_set_state(nid, N_MEMORY);
> >  		check_for_memory(pgdat, nid);
> >  	}
> > +	zero_resv_unavail();
> >  }
> >  
> >  static int __init cmdline_parse_core(char *p, unsigned long *core)
> > @@ -6795,6 +6832,7 @@ void __init free_area_init(unsigned long *zones_size)
> >  {
> >  	free_area_init_node(0, zones_size,
> >  			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
> > +	zero_resv_unavail();
> >  }
> >  
> >  static int page_alloc_cpu_dead(unsigned int cpu)
> > -- 
> > 2.14.2
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 0/9] complete deferred page initialization
  2017-10-09 22:19 ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-10 14:15   ` Michal Hocko
  -1 siblings, 0 replies; 115+ messages in thread
From: Michal Hocko @ 2017-10-10 14:15 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Btw. thanks for your persistence and willingness to go over all the
suggestions, which might not have been consistent between different
versions. I believe this is a general improvement in the early
initialization code. We no longer rely on implicit zeroing that just
happens to work by chance. The performance improvements are a bonus on
top.

Thanks, good work!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 5/9] mm: zero reserved and unavailable struct pages
  2017-10-10 14:09       ` Michal Hocko
  (?)
  (?)
@ 2017-10-10 14:30         ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-10 14:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Ard Biesheuvel, Mark Rutland, Will Deacon,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

> Btw. I would add your example from http://lkml.kernel.org/r/bcf24369-ac37-cedd-a264-3396fb5cf39e@oracle.com
> to the changelog
>

Will add, thank you for your review.

Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-09 22:19   ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-10 15:56     ` Will Deacon
  -1 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-10 15:56 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Mon, Oct 09, 2017 at 06:19:29PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But that interface is intended for populating struct pages.
> 
> With this series, vmemmap won't be zeroed during allocation, while kasan
> expects its shadow memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.
> 
> Therefore, we must use a new interface that allocates and maps kasan
> shadow memory and also zeroes it for us.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 6 deletions(-)

Thanks for doing this, although I still think we can do better: avoid the
additional walking code altogether and remove the dependence on vmemmap.
Rather than keep messing you about here (sorry about that), I've written
an arm64 patch for you to take on top of this series. Please take a look
below.

Cheers,

Will

--->8

From 36c6c7c06273d08348b47c1a182116b0a1df8363 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 10 Oct 2017 15:49:43 +0100
Subject: [PATCH] arm64: kasan: Avoid using vmemmap_populate to initialise
 shadow

The kasan shadow is currently mapped using vmemmap_populate since that
provides a semi-convenient way to map pages into swapper. However, since
that no longer zeroes the mapped pages, it is not suitable for kasan,
which requires that the shadow is zeroed in order to avoid false
positives.

This patch removes our reliance on vmemmap_populate and reuses the
existing kasan page table code, which is already required for creating
the early shadow.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig         |   2 +-
 arch/arm64/mm/kasan_init.c | 176 +++++++++++++++++++--------------------------
 2 files changed, 74 insertions(+), 104 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..888580b9036e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,7 +68,7 @@ config ARM64
 	select HAVE_ARCH_BITREVERSE
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
-	select HAVE_ARCH_KASAN if SPARSEMEM_VMEMMAP && !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
+	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index cb4af2951c90..b922826d9908 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) "kasan: " fmt
+#include <linux/bootmem.h>
 #include <linux/kasan.h>
 #include <linux/kernel.h>
 #include <linux/sched/task.h>
@@ -28,66 +29,6 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
-/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
-static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
-					int node)
-{
-	unsigned long addr, pfn, next;
-	unsigned long long size;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	int ret;
-
-	ret = vmemmap_populate(start, end, node);
-	/*
-	 * We might have partially populated memory, so check for no entries,
-	 * and zero only those that actually exist.
-	 */
-	for (addr = start; addr < end; addr = next) {
-		pgd = pgd_offset_k(addr);
-		if (pgd_none(*pgd)) {
-			next = pgd_addr_end(addr, end);
-			continue;
-		}
-
-		pud = pud_offset(pgd, addr);
-		if (pud_none(*pud)) {
-			next = pud_addr_end(addr, end);
-			continue;
-		}
-		if (pud_sect(*pud)) {
-			/* This is PUD size page */
-			next = pud_addr_end(addr, end);
-			size = PUD_SIZE;
-			pfn = pud_pfn(*pud);
-		} else {
-			pmd = pmd_offset(pud, addr);
-			if (pmd_none(*pmd)) {
-				next = pmd_addr_end(addr, end);
-				continue;
-			}
-			if (pmd_sect(*pmd)) {
-				/* This is PMD size page */
-				next = pmd_addr_end(addr, end);
-				size = PMD_SIZE;
-				pfn = pmd_pfn(*pmd);
-			} else {
-				pte = pte_offset_kernel(pmd, addr);
-				next = addr + PAGE_SIZE;
-				if (pte_none(*pte))
-					continue;
-				/* This is base size page */
-				size = PAGE_SIZE;
-				pfn = pte_pfn(*pte);
-			}
-		}
-		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
-	}
-	return ret;
-}
-
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -95,77 +36,117 @@ static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
-					unsigned long end)
+static phys_addr_t __init kasan_alloc_zeroed_page(int node)
 {
-	pte_t *pte;
-	unsigned long next;
+	void *p = memblock_virt_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
+					      __pa(MAX_DMA_ADDRESS),
+					      MEMBLOCK_ALLOC_ACCESSIBLE, node);
+	return __pa(p);
+}
 
-	if (pmd_none(*pmd))
-		__pmd_populate(pmd, __pa_symbol(kasan_zero_pte), PMD_TYPE_TABLE);
+static pte_t *__init kasan_pte_offset(pmd_t *pmd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pmd_none(*pmd)) {
+		phys_addr_t pte_phys = early ? __pa_symbol(kasan_zero_pte)
+					     : kasan_alloc_zeroed_page(node);
+		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pte_offset_kimg(pmd, addr)
+		     : pte_offset_kernel(pmd, addr);
+}
+
+static pmd_t *__init kasan_pmd_offset(pud_t *pud, unsigned long addr, int node,
+				      bool early)
+{
+	if (pud_none(*pud)) {
+		phys_addr_t pmd_phys = early ? __pa_symbol(kasan_zero_pmd)
+					     : kasan_alloc_zeroed_page(node);
+		__pud_populate(pud, pmd_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pmd_offset_kimg(pud, addr) : pmd_offset(pud, addr);
+}
+
+static pud_t *__init kasan_pud_offset(pgd_t *pgd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pgd_none(*pgd)) {
+		phys_addr_t pud_phys = early ? __pa_symbol(kasan_zero_pud)
+					     : kasan_alloc_zeroed_page(node);
+		__pgd_populate(pgd, pud_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pud_offset_kimg(pgd, addr) : pud_offset(pgd, addr);
+}
+
+static void __init kasan_pte_populate(pmd_t *pmd, unsigned long addr,
+				      unsigned long end, int node, bool early)
+{
+	unsigned long next;
+	pte_t *pte = kasan_pte_offset(pmd, addr, node, early);
 
-	pte = pte_offset_kimg(pmd, addr);
 	do {
+		phys_addr_t page_phys = early ? __pa_symbol(kasan_zero_page)
+					      : kasan_alloc_zeroed_page(node);
 		next = addr + PAGE_SIZE;
-		set_pte(pte, pfn_pte(sym_to_pfn(kasan_zero_page),
-					PAGE_KERNEL));
+		set_pte(pte, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (pte++, addr = next, addr != end && pte_none(*pte));
 }
 
-static void __init kasan_early_pmd_populate(pud_t *pud,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pmd_populate(pud_t *pud, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pmd_t *pmd;
 	unsigned long next;
+	pmd_t *pmd = kasan_pmd_offset(pud, addr, node, early);
 
-	if (pud_none(*pud))
-		__pud_populate(pud, __pa_symbol(kasan_zero_pmd), PMD_TYPE_TABLE);
-
-	pmd = pmd_offset_kimg(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		kasan_early_pte_populate(pmd, addr, next);
+		kasan_pte_populate(pmd, addr, next, node, early);
 	} while (pmd++, addr = next, addr != end && pmd_none(*pmd));
 }
 
-static void __init kasan_early_pud_populate(pgd_t *pgd,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pud_populate(pgd_t *pgd, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pud_t *pud;
 	unsigned long next;
+	pud_t *pud = kasan_pud_offset(pgd, addr, node, early);
 
-	if (pgd_none(*pgd))
-		__pgd_populate(pgd, __pa_symbol(kasan_zero_pud), PUD_TYPE_TABLE);
-
-	pud = pud_offset_kimg(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		kasan_early_pmd_populate(pud, addr, next);
+		kasan_pmd_populate(pud, addr, next, node, early);
 	} while (pud++, addr = next, addr != end && pud_none(*pud));
 }
 
-static void __init kasan_map_early_shadow(void)
+static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
+				      int node, bool early)
 {
-	unsigned long addr = KASAN_SHADOW_START;
-	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
 	pgd_t *pgd;
 
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		kasan_early_pud_populate(pgd, addr, next);
+		kasan_pud_populate(pgd, addr, next, node, early);
 	} while (pgd++, addr = next, addr != end);
 }
 
+/* The early shadow maps everything to a single page of zeroes */
 asmlinkage void __init kasan_early_init(void)
 {
 	BUILD_BUG_ON(KASAN_SHADOW_OFFSET != KASAN_SHADOW_END - (1UL << 61));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_START, PGDIR_SIZE));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
-	kasan_map_early_shadow();
+	kasan_pgd_populate(KASAN_SHADOW_START, KASAN_SHADOW_END, NUMA_NO_NODE,
+			   true);
+}
+
+/* Set up full kasan mappings, ensuring that the mapped pages are zeroed */
+static void __init kasan_map_populate(unsigned long start, unsigned long end,
+				      int node)
+{
+	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
 }
 
 /*
@@ -224,17 +205,6 @@ void __init kasan_init(void)
 	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
 			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
-	/*
-	 * kasan_map_populate() has populated the shadow region that covers the
-	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
-	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
-	 * kasan_populate_zero_shadow() from replacing the page table entries
-	 * (PMD or PTE) at the edges of the shadow region for the kernel
-	 * image.
-	 */
-	kimg_shadow_start = round_down(kimg_shadow_start, SWAPPER_BLOCK_SIZE);
-	kimg_shadow_end = round_up(kimg_shadow_end, SWAPPER_BLOCK_SIZE);
-
 	kasan_populate_zero_shadow((void *)KASAN_SHADOW_START,
 				   (void *)mod_shadow_start);
 	kasan_populate_zero_shadow((void *)kimg_shadow_end,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 115+ messages in thread
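
To see why the shadow pages handed out by kasan_alloc_zeroed_page() above must
come up zeroed, here is a simplified user-space model of generic KASAN shadow
semantics: one shadow byte covers an 8-byte granule, and a value of 0 means the
whole granule is addressable. The array-backed shadow, the helper names and the
sizes are assumptions for this sketch only (real KASAN also encodes partially
addressable granules and redzones), but it shows how stale, non-zero shadow
memory would immediately produce false positives.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KASAN_SHADOW_SCALE_SHIFT	3	/* one shadow byte per 8 bytes */
#define MEM_SIZE			64
#define SHADOW_SIZE			(MEM_SIZE >> KASAN_SHADOW_SCALE_SHIFT)

static int8_t shadow[SHADOW_SIZE];

/* Simplified check: report any access whose shadow byte is non-zero. */
static int byte_is_poisoned(unsigned long offset)
{
	return shadow[offset >> KASAN_SHADOW_SCALE_SHIFT] != 0;
}

int main(void)
{
	/* Pretend the shadow page came out of the allocator with stale data. */
	memset(shadow, 0x41, sizeof(shadow));
	printf("stale shadow:  offset 0 reported? %d\n", byte_is_poisoned(0));

	/* Zeroing the shadow, as kasan_alloc_zeroed_page() does, avoids that. */
	memset(shadow, 0, sizeof(shadow));
	printf("zeroed shadow: offset 0 reported? %d\n", byte_is_poisoned(0));

	return 0;
}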

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-10 15:56     ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-10 15:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Pavel,

On Mon, Oct 09, 2017 at 06:19:29PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But, that interface is intended for struct pages use.
> 
> Because of the current project, vmemmap won't be zeroed during allocation,
> but kasan expects that memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.
> 
> Therefore, we must use a new interface to allocate and map kasan shadow
> memory, that also zeroes memory for us.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 6 deletions(-)

Thanks for doing this, although I still think we can do better and avoid the
additional walking code altogether, as well as removing the dependence on
vmemmap. Rather than keep messing you about here (sorry about that), I've
written an arm64 patch for you to take on top of this series. Please take
a look below.

Cheers,

Will

--->8

From 36c6c7c06273d08348b47c1a182116b0a1df8363 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 10 Oct 2017 15:49:43 +0100
Subject: [PATCH] arm64: kasan: Avoid using vmemmap_populate to initialise
 shadow

The kasan shadow is currently mapped using vmemmap_populate since that
provides a semi-convenient way to map pages into swapper. However, since
that no longer zeroes the mapped pages, it is not suitable for kasan,
which requires that the shadow is zeroed in order to avoid false
positives.

This patch removes our reliance on vmemmap_populate and reuses the
existing kasan page table code, which is already required for creating
the early shadow.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig         |   2 +-
 arch/arm64/mm/kasan_init.c | 176 +++++++++++++++++++--------------------------
 2 files changed, 74 insertions(+), 104 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..888580b9036e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,7 +68,7 @@ config ARM64
 	select HAVE_ARCH_BITREVERSE
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
-	select HAVE_ARCH_KASAN if SPARSEMEM_VMEMMAP && !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
+	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index cb4af2951c90..b922826d9908 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) "kasan: " fmt
+#include <linux/bootmem.h>
 #include <linux/kasan.h>
 #include <linux/kernel.h>
 #include <linux/sched/task.h>
@@ -28,66 +29,6 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
-/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
-static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
-					int node)
-{
-	unsigned long addr, pfn, next;
-	unsigned long long size;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	int ret;
-
-	ret = vmemmap_populate(start, end, node);
-	/*
-	 * We might have partially populated memory, so check for no entries,
-	 * and zero only those that actually exist.
-	 */
-	for (addr = start; addr < end; addr = next) {
-		pgd = pgd_offset_k(addr);
-		if (pgd_none(*pgd)) {
-			next = pgd_addr_end(addr, end);
-			continue;
-		}
-
-		pud = pud_offset(pgd, addr);
-		if (pud_none(*pud)) {
-			next = pud_addr_end(addr, end);
-			continue;
-		}
-		if (pud_sect(*pud)) {
-			/* This is PUD size page */
-			next = pud_addr_end(addr, end);
-			size = PUD_SIZE;
-			pfn = pud_pfn(*pud);
-		} else {
-			pmd = pmd_offset(pud, addr);
-			if (pmd_none(*pmd)) {
-				next = pmd_addr_end(addr, end);
-				continue;
-			}
-			if (pmd_sect(*pmd)) {
-				/* This is PMD size page */
-				next = pmd_addr_end(addr, end);
-				size = PMD_SIZE;
-				pfn = pmd_pfn(*pmd);
-			} else {
-				pte = pte_offset_kernel(pmd, addr);
-				next = addr + PAGE_SIZE;
-				if (pte_none(*pte))
-					continue;
-				/* This is base size page */
-				size = PAGE_SIZE;
-				pfn = pte_pfn(*pte);
-			}
-		}
-		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
-	}
-	return ret;
-}
-
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -95,77 +36,117 @@ static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
-					unsigned long end)
+static phys_addr_t __init kasan_alloc_zeroed_page(int node)
 {
-	pte_t *pte;
-	unsigned long next;
+	void *p = memblock_virt_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
+					      __pa(MAX_DMA_ADDRESS),
+					      MEMBLOCK_ALLOC_ACCESSIBLE, node);
+	return __pa(p);
+}
 
-	if (pmd_none(*pmd))
-		__pmd_populate(pmd, __pa_symbol(kasan_zero_pte), PMD_TYPE_TABLE);
+static pte_t *__init kasan_pte_offset(pmd_t *pmd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pmd_none(*pmd)) {
+		phys_addr_t pte_phys = early ? __pa_symbol(kasan_zero_pte)
+					     : kasan_alloc_zeroed_page(node);
+		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pte_offset_kimg(pmd, addr)
+		     : pte_offset_kernel(pmd, addr);
+}
+
+static pmd_t *__init kasan_pmd_offset(pud_t *pud, unsigned long addr, int node,
+				      bool early)
+{
+	if (pud_none(*pud)) {
+		phys_addr_t pmd_phys = early ? __pa_symbol(kasan_zero_pmd)
+					     : kasan_alloc_zeroed_page(node);
+		__pud_populate(pud, pmd_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pmd_offset_kimg(pud, addr) : pmd_offset(pud, addr);
+}
+
+static pud_t *__init kasan_pud_offset(pgd_t *pgd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pgd_none(*pgd)) {
+		phys_addr_t pud_phys = early ? __pa_symbol(kasan_zero_pud)
+					     : kasan_alloc_zeroed_page(node);
+		__pgd_populate(pgd, pud_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pud_offset_kimg(pgd, addr) : pud_offset(pgd, addr);
+}
+
+static void __init kasan_pte_populate(pmd_t *pmd, unsigned long addr,
+				      unsigned long end, int node, bool early)
+{
+	unsigned long next;
+	pte_t *pte = kasan_pte_offset(pmd, addr, node, early);
 
-	pte = pte_offset_kimg(pmd, addr);
 	do {
+		phys_addr_t page_phys = early ? __pa_symbol(kasan_zero_page)
+					      : kasan_alloc_zeroed_page(node);
 		next = addr + PAGE_SIZE;
-		set_pte(pte, pfn_pte(sym_to_pfn(kasan_zero_page),
-					PAGE_KERNEL));
+		set_pte(pte, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (pte++, addr = next, addr != end && pte_none(*pte));
 }
 
-static void __init kasan_early_pmd_populate(pud_t *pud,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pmd_populate(pud_t *pud, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pmd_t *pmd;
 	unsigned long next;
+	pmd_t *pmd = kasan_pmd_offset(pud, addr, node, early);
 
-	if (pud_none(*pud))
-		__pud_populate(pud, __pa_symbol(kasan_zero_pmd), PMD_TYPE_TABLE);
-
-	pmd = pmd_offset_kimg(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		kasan_early_pte_populate(pmd, addr, next);
+		kasan_pte_populate(pmd, addr, next, node, early);
 	} while (pmd++, addr = next, addr != end && pmd_none(*pmd));
 }
 
-static void __init kasan_early_pud_populate(pgd_t *pgd,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pud_populate(pgd_t *pgd, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pud_t *pud;
 	unsigned long next;
+	pud_t *pud = kasan_pud_offset(pgd, addr, node, early);
 
-	if (pgd_none(*pgd))
-		__pgd_populate(pgd, __pa_symbol(kasan_zero_pud), PUD_TYPE_TABLE);
-
-	pud = pud_offset_kimg(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		kasan_early_pmd_populate(pud, addr, next);
+		kasan_pmd_populate(pud, addr, next, node, early);
 	} while (pud++, addr = next, addr != end && pud_none(*pud));
 }
 
-static void __init kasan_map_early_shadow(void)
+static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
+				      int node, bool early)
 {
-	unsigned long addr = KASAN_SHADOW_START;
-	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
 	pgd_t *pgd;
 
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		kasan_early_pud_populate(pgd, addr, next);
+		kasan_pud_populate(pgd, addr, next, node, early);
 	} while (pgd++, addr = next, addr != end);
 }
 
+/* The early shadow maps everything to a single page of zeroes */
 asmlinkage void __init kasan_early_init(void)
 {
 	BUILD_BUG_ON(KASAN_SHADOW_OFFSET != KASAN_SHADOW_END - (1UL << 61));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_START, PGDIR_SIZE));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
-	kasan_map_early_shadow();
+	kasan_pgd_populate(KASAN_SHADOW_START, KASAN_SHADOW_END, NUMA_NO_NODE,
+			   true);
+}
+
+/* Set up full kasan mappings, ensuring that the mapped pages are zeroed */
+static void __init kasan_map_populate(unsigned long start, unsigned long end,
+				      int node)
+{
+	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
 }
 
 /*
@@ -224,17 +205,6 @@ void __init kasan_init(void)
 	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
 			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
-	/*
-	 * kasan_map_populate() has populated the shadow region that covers the
-	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
-	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
-	 * kasan_populate_zero_shadow() from replacing the page table entries
-	 * (PMD or PTE) at the edges of the shadow region for the kernel
-	 * image.
-	 */
-	kimg_shadow_start = round_down(kimg_shadow_start, SWAPPER_BLOCK_SIZE);
-	kimg_shadow_end = round_up(kimg_shadow_end, SWAPPER_BLOCK_SIZE);
-
 	kasan_populate_zero_shadow((void *)KASAN_SHADOW_START,
 				   (void *)mod_shadow_start);
 	kasan_populate_zero_shadow((void *)kimg_shadow_end,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-10 15:56     ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-10 15:56 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, mhocko, ard.biesheuvel, mark.rutland,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Mon, Oct 09, 2017 at 06:19:29PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But, that interface is intended for struct pages use.
> 
> Because of the current project, vmemmap won't be zeroed during allocation,
> but kasan expects that memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.
> 
> Therefore, we must use a new interface to allocate and map kasan shadow
> memory, that also zeroes memory for us.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 6 deletions(-)

Thanks for doing this, although I still think we can do better and avoid the
additional walking code altogether, as well as removing the dependence on
vmemmap. Rather than keep messing you about here (sorry about that), I've
written an arm64 patch for you to take on top of this series. Please take
a look below.

Cheers,

Will

--->8

^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-10 15:56     ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-10 15:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Pavel,

On Mon, Oct 09, 2017 at 06:19:29PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But, that interface is intended for struct pages use.
> 
> Because of the current project, vmemmap won't be zeroed during allocation,
> but kasan expects that memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.
> 
> Therefore, we must use a new interface to allocate and map kasan shadow
> memory, that also zeroes memory for us.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm64/mm/kasan_init.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 6 deletions(-)

Thanks for doing this, although I still think we can do better and avoid the
additional walking code altogether, as well as removing the dependence on
vmemmap. Rather than keep messing you about here (sorry about that), I've
written an arm64 patch for you to take on top of this series. Please take
a look below.

Cheers,

Will

--->8

>From 36c6c7c06273d08348b47c1a182116b0a1df8363 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 10 Oct 2017 15:49:43 +0100
Subject: [PATCH] arm64: kasan: Avoid using vmemmap_populate to initialise
 shadow

The kasan shadow is currently mapped using vmemmap_populate since that
provides a semi-convenient way to map pages into swapper. However, since
that no longer zeroes the mapped pages, it is not suitable for kasan,
which requires that the shadow is zeroed in order to avoid false
positives.

This patch removes our reliance on vmemmap_populate and reuses the
existing kasan page table code, which is already required for creating
the early shadow.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig         |   2 +-
 arch/arm64/mm/kasan_init.c | 176 +++++++++++++++++++--------------------------
 2 files changed, 74 insertions(+), 104 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..888580b9036e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,7 +68,7 @@ config ARM64
 	select HAVE_ARCH_BITREVERSE
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
-	select HAVE_ARCH_KASAN if SPARSEMEM_VMEMMAP && !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
+	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index cb4af2951c90..b922826d9908 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) "kasan: " fmt
+#include <linux/bootmem.h>
 #include <linux/kasan.h>
 #include <linux/kernel.h>
 #include <linux/sched/task.h>
@@ -28,66 +29,6 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
-/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
-static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
-					int node)
-{
-	unsigned long addr, pfn, next;
-	unsigned long long size;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	int ret;
-
-	ret = vmemmap_populate(start, end, node);
-	/*
-	 * We might have partially populated memory, so check for no entries,
-	 * and zero only those that actually exist.
-	 */
-	for (addr = start; addr < end; addr = next) {
-		pgd = pgd_offset_k(addr);
-		if (pgd_none(*pgd)) {
-			next = pgd_addr_end(addr, end);
-			continue;
-		}
-
-		pud = pud_offset(pgd, addr);
-		if (pud_none(*pud)) {
-			next = pud_addr_end(addr, end);
-			continue;
-		}
-		if (pud_sect(*pud)) {
-			/* This is PUD size page */
-			next = pud_addr_end(addr, end);
-			size = PUD_SIZE;
-			pfn = pud_pfn(*pud);
-		} else {
-			pmd = pmd_offset(pud, addr);
-			if (pmd_none(*pmd)) {
-				next = pmd_addr_end(addr, end);
-				continue;
-			}
-			if (pmd_sect(*pmd)) {
-				/* This is PMD size page */
-				next = pmd_addr_end(addr, end);
-				size = PMD_SIZE;
-				pfn = pmd_pfn(*pmd);
-			} else {
-				pte = pte_offset_kernel(pmd, addr);
-				next = addr + PAGE_SIZE;
-				if (pte_none(*pte))
-					continue;
-				/* This is base size page */
-				size = PAGE_SIZE;
-				pfn = pte_pfn(*pte);
-			}
-		}
-		memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
-	}
-	return ret;
-}
-
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -95,77 +36,117 @@ static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
-					unsigned long end)
+static phys_addr_t __init kasan_alloc_zeroed_page(int node)
 {
-	pte_t *pte;
-	unsigned long next;
+	void *p = memblock_virt_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
+					      __pa(MAX_DMA_ADDRESS),
+					      MEMBLOCK_ALLOC_ACCESSIBLE, node);
+	return __pa(p);
+}
 
-	if (pmd_none(*pmd))
-		__pmd_populate(pmd, __pa_symbol(kasan_zero_pte), PMD_TYPE_TABLE);
+static pte_t *__init kasan_pte_offset(pmd_t *pmd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pmd_none(*pmd)) {
+		phys_addr_t pte_phys = early ? __pa_symbol(kasan_zero_pte)
+					     : kasan_alloc_zeroed_page(node);
+		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pte_offset_kimg(pmd, addr)
+		     : pte_offset_kernel(pmd, addr);
+}
+
+static pmd_t *__init kasan_pmd_offset(pud_t *pud, unsigned long addr, int node,
+				      bool early)
+{
+	if (pud_none(*pud)) {
+		phys_addr_t pmd_phys = early ? __pa_symbol(kasan_zero_pmd)
+					     : kasan_alloc_zeroed_page(node);
+		__pud_populate(pud, pmd_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pmd_offset_kimg(pud, addr) : pmd_offset(pud, addr);
+}
+
+static pud_t *__init kasan_pud_offset(pgd_t *pgd, unsigned long addr, int node,
+				      bool early)
+{
+	if (pgd_none(*pgd)) {
+		phys_addr_t pud_phys = early ? __pa_symbol(kasan_zero_pud)
+					     : kasan_alloc_zeroed_page(node);
+		__pgd_populate(pgd, pud_phys, PMD_TYPE_TABLE);
+	}
+
+	return early ? pud_offset_kimg(pgd, addr) : pud_offset(pgd, addr);
+}
+
+static void __init kasan_pte_populate(pmd_t *pmd, unsigned long addr,
+				      unsigned long end, int node, bool early)
+{
+	unsigned long next;
+	pte_t *pte = kasan_pte_offset(pmd, addr, node, early);
 
-	pte = pte_offset_kimg(pmd, addr);
 	do {
+		phys_addr_t page_phys = early ? __pa_symbol(kasan_zero_page)
+					      : kasan_alloc_zeroed_page(node);
 		next = addr + PAGE_SIZE;
-		set_pte(pte, pfn_pte(sym_to_pfn(kasan_zero_page),
-					PAGE_KERNEL));
+		set_pte(pte, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (pte++, addr = next, addr != end && pte_none(*pte));
 }
 
-static void __init kasan_early_pmd_populate(pud_t *pud,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pmd_populate(pud_t *pud, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pmd_t *pmd;
 	unsigned long next;
+	pmd_t *pmd = kasan_pmd_offset(pud, addr, node, early);
 
-	if (pud_none(*pud))
-		__pud_populate(pud, __pa_symbol(kasan_zero_pmd), PMD_TYPE_TABLE);
-
-	pmd = pmd_offset_kimg(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		kasan_early_pte_populate(pmd, addr, next);
+		kasan_pte_populate(pmd, addr, next, node, early);
 	} while (pmd++, addr = next, addr != end && pmd_none(*pmd));
 }
 
-static void __init kasan_early_pud_populate(pgd_t *pgd,
-					unsigned long addr,
-					unsigned long end)
+static void __init kasan_pud_populate(pgd_t *pgd, unsigned long addr,
+				      unsigned long end, int node, bool early)
 {
-	pud_t *pud;
 	unsigned long next;
+	pud_t *pud = kasan_pud_offset(pgd, addr, node, early);
 
-	if (pgd_none(*pgd))
-		__pgd_populate(pgd, __pa_symbol(kasan_zero_pud), PUD_TYPE_TABLE);
-
-	pud = pud_offset_kimg(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		kasan_early_pmd_populate(pud, addr, next);
+		kasan_pmd_populate(pud, addr, next, node, early);
 	} while (pud++, addr = next, addr != end && pud_none(*pud));
 }
 
-static void __init kasan_map_early_shadow(void)
+static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
+				      int node, bool early)
 {
-	unsigned long addr = KASAN_SHADOW_START;
-	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
 	pgd_t *pgd;
 
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		kasan_early_pud_populate(pgd, addr, next);
+		kasan_pud_populate(pgd, addr, next, node, early);
 	} while (pgd++, addr = next, addr != end);
 }
 
+/* The early shadow maps everything to a single page of zeroes */
 asmlinkage void __init kasan_early_init(void)
 {
 	BUILD_BUG_ON(KASAN_SHADOW_OFFSET != KASAN_SHADOW_END - (1UL << 61));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_START, PGDIR_SIZE));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
-	kasan_map_early_shadow();
+	kasan_pgd_populate(KASAN_SHADOW_START, KASAN_SHADOW_END, NUMA_NO_NODE,
+			   true);
+}
+
+/* Set up full kasan mappings, ensuring that the mapped pages are zeroed */
+static void __init kasan_map_populate(unsigned long start, unsigned long end,
+				      int node)
+{
+	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
 }
 
 /*
@@ -224,17 +205,6 @@ void __init kasan_init(void)
 	kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
 			   pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
-	/*
-	 * kasan_map_populate() has populated the shadow region that covers the
-	 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
-	 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
-	 * kasan_populate_zero_shadow() from replacing the page table entries
-	 * (PMD or PTE) at the edges of the shadow region for the kernel
-	 * image.
-	 */
-	kimg_shadow_start = round_down(kimg_shadow_start, SWAPPER_BLOCK_SIZE);
-	kimg_shadow_end = round_up(kimg_shadow_end, SWAPPER_BLOCK_SIZE);
-
 	kasan_populate_zero_shadow((void *)KASAN_SHADOW_START,
 				   (void *)mod_shadow_start);
 	kasan_populate_zero_shadow((void *)kimg_shadow_end,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-10 15:56     ` Will Deacon
@ 2017-10-10 17:07       ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-10 17:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

Thank you for doing this work. How would you like to proceed?

- If you are OK with my series being accepted as-is, so that your patch can be
added later on top, I think I need an ack from you for the kasan changes.
- Otherwise, I can replace 4267aaf1d279 ("arm64/kasan: add and use
kasan_map_populate()") in my series with the code from your patch.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-10 17:07       ` Pavel Tatashin
@ 2017-10-10 17:10         ` Will Deacon
  -1 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-10 17:10 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Tue, Oct 10, 2017 at 01:07:35PM -0400, Pavel Tatashin wrote:
> Thank you for doing this work. How would you like to proceed?
> 
> - If you are OK with my series being accepted as-is, so that your patch can be
> added later on top, I think I need an ack from you for the kasan changes.
> - Otherwise, I can replace 4267aaf1d279 ("arm64/kasan: add and use
> kasan_map_populate()") in my series with the code from your patch.

I was thinking that you could just add my patch to the end of your series
and have the whole lot go up like that. If you want to merge it with your
patch, I'm fine with that too.

Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 0/9] complete deferred page initialization
  2017-10-10 14:15   ` Michal Hocko
@ 2017-10-10 17:19     ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-10 17:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, ard.biesheuvel, mark.rutland, will.deacon,
	catalin.marinas, sam, mgorman, steven.sistare, daniel.m.jordan,
	bob.picco

I wanted to thank you, Michal, for spending the time to do in-depth 
reviews of every incremental change. Overall the series is in much 
better shape now because of your feedback.

Pavel

On 10/10/2017 10:15 AM, Michal Hocko wrote:
> Btw. thanks for your persistence and willingness to go over all the
> suggestions, which might not have been consistent between different
> versions. I believe this is a general improvement in the early
> initialization code. We no longer rely on implicit zeroing which just
> happens to work by chance. The performance improvements are a bonus on
> top.
> 
> Thanks, good work!
> 

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-10 17:10         ` Will Deacon
@ 2017-10-10 17:41           ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-10 17:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

Ok, I will add your patch at the end of my series.

Thank you,
Pavel

>
> I was thinking that you could just add my patch to the end of your series
> and have the whole lot go up like that. If you want to merge it with your
> patch, I'm fine with that too.
>
> Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-10 17:41           ` Pavel Tatashin
@ 2017-10-13 14:10             ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 14:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Will,

I have a couple of concerns about your patch:

One of the reasons (and actually, the main reason) why I preferred to
keep vmemmap_populate() instead of implementing kasan's own variant
(which, btw, can be done in common code similarly to
vmemmap_populate_basepages()) is that vmemmap_populate() uses large
pages when available. I think it is a considerable downgrade to go
back to base pages when we already have large page support available
to us.

The kasan shadow tree is large: it is up to 1/8th of system memory, so
even on moderate-size servers the shadow tree is going to be multiple
gigabytes.
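
To put rough numbers on that (illustration only; the helper below is made
up, but the 1:8 ratio is the generic KASAN shadow scale):

static inline unsigned long kasan_shadow_size(unsigned long mem_bytes)
{
	/* e.g. 256GB of RAM -> 32GB of shadow, which is 8M base pages
	 * but only 16K PMD-sized (2MB with 4K pages) blocks */
	return mem_bytes >> 3;
}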

The second concern is that there is an existing bug associated with
your patch that I am not sure how to solve:

Try building your patch with CONFIG_DEBUG_VM. This config makes
memblock_virt_alloc_try_nid_raw() do a memset(0xff) on all allocated
memory.
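
Roughly, the debug behaviour in question is (sketch, not the exact hunk
from the series):

void * __init memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
					      phys_addr_t min_addr, phys_addr_t max_addr,
					      int nid)
{
	void *ptr = memblock_virt_alloc_internal(size, align,
						 min_addr, max_addr, nid);

#ifdef CONFIG_DEBUG_VM
	if (ptr && size > 0)
		memset(ptr, 0xff, size);	/* poison instead of zeroing */
#endif
	return ptr;
}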

I am getting the following panic during boot:

[    0.012637] pid_max: default: 32768 minimum: 301
[    0.016037] Security Framework initialized
[    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.055337] Unable to handle kernel paging request at virtual
address ffff0400010065af
[    0.055422] Mem abort info:
[    0.055518]   Exception class = DABT (current EL), IL = 32 bits
[    0.055579]   SET = 0, FnV = 0
[    0.055640]   EA = 0, S1PTW = 0
[    0.055699] Data abort info:
[    0.055762]   ISV = 0, ISS = 0x00000007
[    0.055822]   CM = 0, WnR = 0
[    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
[    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
*pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
[    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[    0.056701] Modules linked in:
[    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
[    0.057001] Hardware name: linux,dummy-virt (DT)
[    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
[    0.057275] PC is at __asan_load8+0x34/0xb0
[    0.057375] LR is at __d_rehash+0xf0/0x240
[    0.057460] pc : [<ffff200008317d7c>] lr : [<ffff20000837e168>]
pstate: 60000045
[    0.057522] sp : ffff2000099c6a60
[    0.057590] x29: ffff2000099c6a60 x28: ffff2000099d9010
[    0.057733] x27: 0000000000000004 x26: ffff200008031000
[    0.057846] x25: ffff2000099d9000 x24: ffff800003c06410
[    0.057954] x23: 00000000000003af x22: ffff800003c06400
[    0.058065] x21: 1fffe40001338d5a x20: ffff200008032d78
[    0.058175] x19: ffff800003c06408 x18: 0000000000000000
[    0.058311] x17: 0000000000000009 x16: 0000000000007fff
[    0.058417] x15: 000000000000002a x14: ffff2000080ef374
[    0.058528] x13: ffff200008126648 x12: ffff200008411a7c
[    0.058638] x11: ffff200008392358 x10: ffff200008392184
[    0.058770] x9 : ffff20000835aad8 x8 : ffff200009850e90
[    0.058883] x7 : ffff20000904b23c x6 : 00000000f2f2f200
[    0.058990] x5 : 0000000000000000 x4 : ffff200008032d78
[    0.059097] x3 : 0000000000000000 x2 : dfff200000000000
[    0.059206] x1 : 0000000000000007 x0 : 1fffe400010065af
[    0.059372] Process swapper/0 (pid: 0, stack limit = 0xffff2000099c0000)
[    0.059442] Call trace:
[    0.059603] Exception stack(0xffff2000099c6920 to 0xffff2000099c6a60)
[    0.059771] 6920: 1fffe400010065af 0000000000000007
dfff200000000000 0000000000000000
[    0.059877] 6940: ffff200008032d78 0000000000000000
00000000f2f2f200 ffff20000904b23c
[    0.059973] 6960: ffff200009850e90 ffff20000835aad8
ffff200008392184 ffff200008392358
[    0.060066] 6980: ffff200008411a7c ffff200008126648
ffff2000080ef374 000000000000002a
[    0.060154] 69a0: 0000000000007fff 0000000000000009
0000000000000000 ffff800003c06408
[    0.060246] 69c0: ffff200008032d78 1fffe40001338d5a
ffff800003c06400 00000000000003af
[    0.060338] 69e0: ffff800003c06410 ffff2000099d9000
ffff200008031000 0000000000000004
[    0.060432] 6a00: ffff2000099d9010 ffff2000099c6a60
ffff20000837e168 ffff2000099c6a60
[    0.060525] 6a20: ffff200008317d7c 0000000060000045
ffff200008392358 ffff200008411a7c
[    0.060620] 6a40: ffffffffffffffff ffff2000080ef374
ffff2000099c6a60 ffff200008317d7c
[    0.060762] [<ffff200008317d7c>] __asan_load8+0x34/0xb0
[    0.060856] [<ffff20000837e168>] __d_rehash+0xf0/0x240
[    0.060944] [<ffff20000837fb80>] d_add+0x288/0x3f0
[    0.061041] [<ffff200008420db8>] proc_setup_self+0x110/0x198
[    0.061139] [<ffff200008411594>] proc_fill_super+0x13c/0x198
[    0.061234] [<ffff200008359648>] mount_ns+0x98/0x148
[    0.061328] [<ffff2000084116ac>] proc_mount+0x5c/0x70
[    0.061422] [<ffff20000835aad8>] mount_fs+0x50/0x1a8
[    0.061515] [<ffff200008392184>] vfs_kern_mount.part.7+0x9c/0x218
[    0.061602] [<ffff200008392358>] kern_mount_data+0x38/0x70
[    0.061699] [<ffff200008411a7c>] pid_ns_prepare_proc+0x24/0x50
[    0.061796] [<ffff200008126648>] alloc_pid+0x6e8/0x730
[    0.061891] [<ffff2000080ef374>] copy_process.isra.6.part.7+0x11cc/0x2cb8
[    0.061978] [<ffff2000080f1104>] _do_fork+0x14c/0x4c0
[    0.062065] [<ffff2000080f14c0>] kernel_thread+0x30/0x38
[    0.062156] [<ffff20000904b23c>] rest_init+0x34/0x108
[    0.062260] [<ffff200009850e90>] start_kernel+0x45c/0x48c
[    0.062458] Code: 540001e1 d343fc00 d2c40002 f2fbffe2 (38e26800)
[    0.063559] ---[ end trace 390c5d4fc6641888 ]---
[    0.064164] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.064438] ---[ end Kernel panic - not syncing: Attempted to kill
the idle task!


So, I've been trying to root cause it, and here is what I've got:

First, I went back to my version of kasan_map_populate() and replaced
vmemmap_populate() with vmemmap_populate_basepages(), which
behavior-wise made it very similar to your patch. After doing this I
got the same panic. So, I figured it must have something to do with
the difference that the regular vmemmap is mapped with a granularity
of SWAPPER_BLOCK_SIZE while the kasan shadow is mapped with a
granularity of PAGE_SIZE.
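
In other words, the experiment was roughly this (sketch, not the exact
diff I tested):

static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
					int node)
{
	int ret;

	/* was: ret = vmemmap_populate(start, end, node); */
	ret = vmemmap_populate_basepages(start, end, node);
	if (ret)
		return ret;

	/* ... then walk the new mappings and memset() them to zero,
	 * as the v11 version does after vmemmap_populate() ... */
	return 0;
}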

So, I made the following modification to your patch:

static void __init kasan_map_populate(unsigned long start, unsigned long end,
                                      int node)
{
+        start = round_down(start, SWAPPER_BLOCK_SIZE);
+       end = round_up(end, SWAPPER_BLOCK_SIZE);
        kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
}

This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
aligned. After this modification everything is working.  However, I
am not sure if this is a proper fix.
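
(To make the effect of that rounding concrete: with 4K pages
SWAPPER_BLOCK_SIZE is the 2MB section size, so for example

	round_down(0xffff040000a00123UL, SWAPPER_BLOCK_SIZE) == 0xffff040000a00000UL
	round_up  (0xffff040000bfff00UL, SWAPPER_BLOCK_SIZE) == 0xffff040000c00000UL

i.e. the populated shadow always starts and ends on a whole 2MB block,
the same granularity vmemmap_populate() used to map. The example
addresses are made up.)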

I feel this patch requires more work, and I am troubled by the use of
base pages instead of large pages.

Thank you,
Pavel

On Tue, Oct 10, 2017 at 1:41 PM, Pavel Tatashin
<pasha.tatashin@oracle.com> wrote:
> Hi Will,
>
> Ok, I will add your patch at the end of my series.
>
> Thank you,
> Pavel
>
>>
>> I was thinking that you could just add my patch to the end of your series
>> and have the whole lot go up like that. If you want to merge it with your
>> patch, I'm fine with that too.
>>
>> Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 14:10             ` Pavel Tatashin
@ 2017-10-13 14:43               ` Will Deacon
  -1 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 14:43 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> I have a couple of concerns about your patch:
> 
> One of the reasons (and actually, the main reason) why I preferred to
> keep vmemmap_populate() instead of implementing kasan's own variant
> (which, btw, can be done in common code similarly to
> vmemmap_populate_basepages()) is that vmemmap_populate() uses large
> pages when available. I think it is a considerable downgrade to go
> back to base pages when we already have large page support available
> to us.

It shouldn't be difficult to use section mappings with my patch; I just
don't really see the need to try to optimise TLB pressure when you're
running with KASAN enabled, which already has something like a 3x slowdown
afaik. If it ends up being a big deal, we can always do that later, but
my main aim here is to divorce kasan from vmemmap because they should be
completely unrelated.
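
For reference, a rough sketch of what such a section-mapping optimisation
could later look like on top of the new helpers (illustrative only, not
part of the patch, and error handling is omitted):

static void __init kasan_pmd_populate_sect(pud_t *pud, unsigned long addr,
					   unsigned long end, int node)
{
	unsigned long next;
	pmd_t *pmd = kasan_pmd_offset(pud, addr, node, false);

	do {
		next = pmd_addr_end(addr, end);
		if (IS_ALIGNED(addr, PMD_SIZE) && next - addr == PMD_SIZE) {
			/* whole PMD covered: back it with one naturally
			 * aligned, already zeroed PMD_SIZE allocation and
			 * install a section mapping */
			void *p = memblock_virt_alloc_try_nid(PMD_SIZE, PMD_SIZE,
					__pa(MAX_DMA_ADDRESS),
					MEMBLOCK_ALLOC_ACCESSIBLE, node);

			pmd_set_huge(pmd, __pa(p), PAGE_KERNEL);
		} else {
			/* partial PMD: fall back to base pages */
			kasan_pte_populate(pmd, addr, next, node, false);
		}
	} while (pmd++, addr = next, addr != end);
}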

> The kasan shadow tree is large: it is up to 1/8th of system memory, so
> even on moderate-size servers the shadow tree is going to be multiple
> gigabytes.
> 
> The second concern is that there is an existing bug associated with
> your patch that I am not sure how to solve:
> 
> Try building your patch with CONFIG_DEBUG_VM. This config makes
> memblock_virt_alloc_try_nid_raw() do a memset(0xff) on all allocated
> memory.
> 
> I am getting the following panic during boot:
> 
> [    0.012637] pid_max: default: 32768 minimum: 301
> [    0.016037] Security Framework initialized
> [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.055337] Unable to handle kernel paging request at virtual
> address ffff0400010065af
> [    0.055422] Mem abort info:
> [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> [    0.055579]   SET = 0, FnV = 0
> [    0.055640]   EA = 0, S1PTW = 0
> [    0.055699] Data abort info:
> [    0.055762]   ISV = 0, ISS = 0x00000007
> [    0.055822]   CM = 0, WnR = 0
> [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> [    0.056701] Modules linked in:
> [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> [    0.057001] Hardware name: linux,dummy-virt (DT)
> [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> [    0.057275] PC is at __asan_load8+0x34/0xb0
> [    0.057375] LR is at __d_rehash+0xf0/0x240

[...]

> So, I've been trying to root cause it, and here is what I've got:
> 
> First, I went back to my version of kasan_map_populate() and replaced
> vmemmap_populate() with vmemmap_populate_basepages(), which
> behavior-wise made it very similar to your patch. After doing this I
> got the same panic. So, I figured it must have something to do with
> the difference that the regular vmemmap is mapped with a granularity
> of SWAPPER_BLOCK_SIZE while the kasan shadow is mapped with a
> granularity of PAGE_SIZE.
> 
> So, I made the following modification to your patch:
> 
> static void __init kasan_map_populate(unsigned long start, unsigned long end,
>                                       int node)
> {
> +        start = round_down(start, SWAPPER_BLOCK_SIZE);
> +       end = round_up(end, SWAPPER_BLOCK_SIZE);
>         kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
> }
> 
> This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
> aligned. After this modification everything is working.  However, I
> am not sure if this is a proper fix.

This certainly doesn't sound right; mapping the shadow with pages shouldn't
lead to problems. I also can't seem to reproduce this myself -- could you
share your full .config and a pointer to the git tree that you're using,
please?

> I feel this patch requires more work, and I am troubled by the use of
> base pages instead of large pages.

I'm happy to try fixing this, because I think splitting up kasan and vmemmap
is the right thing to do here.

Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:43               ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 14:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Pavel,

On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> I have a couple of concerns about your patch:
> 
> One of the reasons (and actually, the main reason) why I preferred to
> keep vmemmap_populate() instead of implementing kasan's own variant,
> which btw can be done in common code similarly to
> vmemmap_populate_basepages(), is that vmemmap_populate() uses large
> pages when available. I think it is a considerable downgrade to go
> back to base pages, when we already have large page support available
> to us.

It shouldn't be difficult to use section mappings with my patch, I just
don't really see the need to try to optimise TLB pressure when you're
running with KASAN enabled which already has something like a 3x slowdown
afaik. If it ends up being a big deal, we can always do that later, but
my main aim here is to divorce kasan from vmemmap because they should be
completely unrelated.

> The kasan shadow tree is large, up to 1/8th of system memory, so
> even on moderate-size servers the shadow tree is going to be multiple
> gigabytes.
> 
> The second concern is that there is an existing bug associated with
> your patch that I am not sure how to solve:
> 
> Try building your patch with CONFIG_DEBUG_VM. This config makes
> memblock_virt_alloc_try_nid_raw() do memset(0xff) on all allocated
> memory.
> 
> I am getting the following panic during boot:
> 
> [    0.012637] pid_max: default: 32768 minimum: 301
> [    0.016037] Security Framework initialized
> [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.055337] Unable to handle kernel paging request at virtual
> address ffff0400010065af
> [    0.055422] Mem abort info:
> [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> [    0.055579]   SET = 0, FnV = 0
> [    0.055640]   EA = 0, S1PTW = 0
> [    0.055699] Data abort info:
> [    0.055762]   ISV = 0, ISS = 0x00000007
> [    0.055822]   CM = 0, WnR = 0
> [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> [    0.056701] Modules linked in:
> [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> [    0.057001] Hardware name: linux,dummy-virt (DT)
> [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> [    0.057275] PC is at __asan_load8+0x34/0xb0
> [    0.057375] LR is at __d_rehash+0xf0/0x240

[...]

> So, I've been trying to root cause it, and here is what I've got:
> 
> First, I went back to my version of kasan_map_populate() and replaced
> vmemmap_populate() with vmemmap_populate_basepages(), which
> behavior-wise made it very similar to your patch. After doing this I
> got the same panic. So, I figured it must have something to do with
> the difference that regular vmemmap is allocated with a granularity of
> SWAPPER_BLOCK_SIZE while kasan uses a granularity of PAGE_SIZE.
> 
> So, I made the following modification to your patch:
> 
> static void __init kasan_map_populate(unsigned long start, unsigned long end,
>                                       int node)
> {
> +        start = round_down(start, SWAPPER_BLOCK_SIZE);
> +       end = round_up(end, SWAPPER_BLOCK_SIZE);
>         kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
> }
> 
> This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
> aligned. After this modification everything is working.  However, I
> am not sure if this is a proper fix.

This certainly doesn't sound right; mapping the shadow with pages shouldn't
lead to problems. I also can't seem to reproduce this myself -- could you
share your full .config and a pointer to the git tree that you're using,
please?

> I feel this patch requires more work, and I am troubled by using
> base pages instead of large pages.

I'm happy to try fixing this, because I think splitting up kasan and vmemmap
is the right thing to do here.

Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:43               ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 14:43 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> I have a couple concerns about your patch:
> 
> One of the reasons (and actually, the main reason) why I preferred to
> keep vmemmap_populate() instead of implementing kasan's own variant,
> which btw can be done in common code similarly to
> vmemmap_populate_basepages(), is that vmemmap_populate() uses large
> pages when available. I think it is a considerable downgrade to go
> back to base pages, when we already have large page support available
> to us.

It shouldn't be difficult to use section mappings with my patch, I just
don't really see the need to try to optimise TLB pressure when you're
running with KASAN enabled which already has something like a 3x slowdown
afaik. If it ends up being a big deal, we can always do that later, but
my main aim here is to divorce kasan from vmemmap because they should be
completely unrelated.

> The kasan shadow tree is large, up to 1/8th of system memory, so
> even on moderate-size servers the shadow tree is going to be multiple
> gigabytes.
> 
> The second concern is that there is an existing bug associated with
> your patch that I am not sure how to solve:
> 
> Try building your patch with CONFIG_DEBUG_VM. This config makes
> memblock_virt_alloc_try_nid_raw() do memset(0xff) on all allocated
> memory.
> 
> I am getting the following panic during boot:
> 
> [    0.012637] pid_max: default: 32768 minimum: 301
> [    0.016037] Security Framework initialized
> [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.055337] Unable to handle kernel paging request at virtual
> address ffff0400010065af
> [    0.055422] Mem abort info:
> [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> [    0.055579]   SET = 0, FnV = 0
> [    0.055640]   EA = 0, S1PTW = 0
> [    0.055699] Data abort info:
> [    0.055762]   ISV = 0, ISS = 0x00000007
> [    0.055822]   CM = 0, WnR = 0
> [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> [    0.056701] Modules linked in:
> [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> [    0.057001] Hardware name: linux,dummy-virt (DT)
> [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> [    0.057275] PC is at __asan_load8+0x34/0xb0
> [    0.057375] LR is at __d_rehash+0xf0/0x240

[...]

> So, I've been trying to root cause it, and here is what I've got:
> 
> First, I went back to my version of kasan_map_populate() and replaced
> vmemmap_populate() with vmemmap_populate_basepages(), which
> behavior-wise made it very similar to your patch. After doing this I
> got the same panic. So, I figured it must have something to do with
> the difference that regular vmemmap is allocated with a granularity of
> SWAPPER_BLOCK_SIZE while kasan uses a granularity of PAGE_SIZE.
> 
> So, I made the following modification to your patch:
> 
> static void __init kasan_map_populate(unsigned long start, unsigned long end,
>                                       int node)
> {
> +        start = round_down(start, SWAPPER_BLOCK_SIZE);
> +       end = round_up(end, SWAPPER_BLOCK_SIZE);
>         kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
> }
> 
> This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
> aligned. After this modification everything is working.  However, I
> am not sure if this is a proper fix.

This certainly doesn't sound right; mapping the shadow with pages shouldn't
lead to problems. I also can't seem to reproduce this myself -- could you
share your full .config and a pointer to the git tree that you're using,
please?

> I feel this patch requires more work, and I am troubled by using
> base pages instead of large pages.

I'm happy to try fixing this, because I think splitting up kasan and vmemmap
is the right thing to do here.

Will


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:43               ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 14:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Pavel,

On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> I have a couple concerns about your patch:
> 
> One of the reasons (and actually, the main reason) why I preferred to
> keep vmemmap_populate() instead of implementing kasan's own variant,
> which btw can be done in common code similarly to
> vmemmap_populate_basepages(), is that vmemmap_populate() uses large
> pages when available. I think it is a considerable downgrade to go
> back to base pages, when we already have large page support available
> to us.

It shouldn't be difficult to use section mappings with my patch, I just
don't really see the need to try to optimise TLB pressure when you're
running with KASAN enabled which already has something like a 3x slowdown
afaik. If it ends up being a big deal, we can always do that later, but
my main aim here is to divorce kasan from vmemmap because they should be
completely unrelated.

> The kasan shadow tree is large, up to 1/8th of system memory, so
> even on moderate-size servers the shadow tree is going to be multiple
> gigabytes.
> 
> The second concern is that there is an existing bug associated with
> your patch that I am not sure how to solve:
> 
> Try building your patch with CONFIG_DEBUG_VM. This config makes
> memblock_virt_alloc_try_nid_raw() do memset(0xff) on all allocated
> memory.
> 
> I am getting the following panic during boot:
> 
> [    0.012637] pid_max: default: 32768 minimum: 301
> [    0.016037] Security Framework initialized
> [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> [    0.055337] Unable to handle kernel paging request at virtual
> address ffff0400010065af
> [    0.055422] Mem abort info:
> [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> [    0.055579]   SET = 0, FnV = 0
> [    0.055640]   EA = 0, S1PTW = 0
> [    0.055699] Data abort info:
> [    0.055762]   ISV = 0, ISS = 0x00000007
> [    0.055822]   CM = 0, WnR = 0
> [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> [    0.056701] Modules linked in:
> [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> [    0.057001] Hardware name: linux,dummy-virt (DT)
> [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> [    0.057275] PC is at __asan_load8+0x34/0xb0
> [    0.057375] LR is at __d_rehash+0xf0/0x240

[...]

> So, I've been trying to root cause it, and here is what I've got:
> 
> First, I went back to my version of kasan_map_populate() and replaced
> vmemmap_populate() with vmemmap_populate_basepages(), which
> behavior-wise made it very similar to your patch. After doing this I
> got the same panic. So, I figured it must have something to do with
> the difference that regular vmemmap is allocated with a granularity of
> SWAPPER_BLOCK_SIZE while kasan uses a granularity of PAGE_SIZE.
> 
> So, I made the following modification to your patch:
> 
> static void __init kasan_map_populate(unsigned long start, unsigned long end,
>                                       int node)
> {
> +        start = round_down(start, SWAPPER_BLOCK_SIZE);
> +       end = round_up(end, SWAPPER_BLOCK_SIZE);
>         kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
> }
> 
> This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
> aligned. After this modification everything is working.  However, I
> am not sure if this is a proper fix.

This certainly doesn't sound right; mapping the shadow with pages shouldn't
lead to problems. I also can't seem to reproduce this myself -- could you
share your full .config and a pointer to the git tree that you're using,
please?

> I feel this patch requires more work, and I am troubled by using
> base pages instead of large pages.

I'm happy to try fixing this, because I think splitting up kasan and vmemmap
is the right thing to do here.

Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 14:43               ` Will Deacon
  (?)
  (?)
@ 2017-10-13 14:56                 ` Mark Rutland
  -1 siblings, 0 replies; 115+ messages in thread
From: Mark Rutland @ 2017-10-13 14:56 UTC (permalink / raw)
  To: Will Deacon, Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, catalin.marinas, sam,
	mgorman, Steve Sistare, daniel.m.jordan, bob.picco

Hi,

On Fri, Oct 13, 2017 at 03:43:19PM +0100, Will Deacon wrote:
> On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> > I am getting the following panic during boot:
> > 
> > [    0.012637] pid_max: default: 32768 minimum: 301
> > [    0.016037] Security Framework initialized
> > [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> > [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> > [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.055337] Unable to handle kernel paging request at virtual
> > address ffff0400010065af
> > [    0.055422] Mem abort info:
> > [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> > [    0.055579]   SET = 0, FnV = 0
> > [    0.055640]   EA = 0, S1PTW = 0
> > [    0.055699] Data abort info:
> > [    0.055762]   ISV = 0, ISS = 0x00000007
> > [    0.055822]   CM = 0, WnR = 0
> > [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> > [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> > *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> > [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> > [    0.056701] Modules linked in:
> > [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> > [    0.057001] Hardware name: linux,dummy-virt (DT)
> > [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> > [    0.057275] PC is at __asan_load8+0x34/0xb0
> > [    0.057375] LR is at __d_rehash+0xf0/0x240

Do you know what your physical memory layout looks like? 

Knowing that would tell us where shadow memory *should* be.

Can you share the command line you're using to launch the VM?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:56                 ` Mark Rutland
  0 siblings, 0 replies; 115+ messages in thread
From: Mark Rutland @ 2017-10-13 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Oct 13, 2017 at 03:43:19PM +0100, Will Deacon wrote:
> On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> > I am getting the following panic during boot:
> > 
> > [    0.012637] pid_max: default: 32768 minimum: 301
> > [    0.016037] Security Framework initialized
> > [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> > [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> > [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.055337] Unable to handle kernel paging request at virtual
> > address ffff0400010065af
> > [    0.055422] Mem abort info:
> > [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> > [    0.055579]   SET = 0, FnV = 0
> > [    0.055640]   EA = 0, S1PTW = 0
> > [    0.055699] Data abort info:
> > [    0.055762]   ISV = 0, ISS = 0x00000007
> > [    0.055822]   CM = 0, WnR = 0
> > [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> > [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> > *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> > [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> > [    0.056701] Modules linked in:
> > [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> > [    0.057001] Hardware name: linux,dummy-virt (DT)
> > [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> > [    0.057275] PC is at __asan_load8+0x34/0xb0
> > [    0.057375] LR is at __d_rehash+0xf0/0x240

Do you know what your physical memory layout looks like? 

Knowing that would tell us where shadow memory *should* be.

Can you share the command line you're using to launch the VM?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:56                 ` Mark Rutland
  0 siblings, 0 replies; 115+ messages in thread
From: Mark Rutland @ 2017-10-13 14:56 UTC (permalink / raw)
  To: Will Deacon, Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, catalin.marinas, sam,
	mgorman, Steve Sistare, daniel.m.jordan, bob.picco

Hi,

On Fri, Oct 13, 2017 at 03:43:19PM +0100, Will Deacon wrote:
> On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> > I am getting the following panic during boot:
> > 
> > [    0.012637] pid_max: default: 32768 minimum: 301
> > [    0.016037] Security Framework initialized
> > [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> > [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> > [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.055337] Unable to handle kernel paging request at virtual
> > address ffff0400010065af
> > [    0.055422] Mem abort info:
> > [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> > [    0.055579]   SET = 0, FnV = 0
> > [    0.055640]   EA = 0, S1PTW = 0
> > [    0.055699] Data abort info:
> > [    0.055762]   ISV = 0, ISS = 0x00000007
> > [    0.055822]   CM = 0, WnR = 0
> > [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> > [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> > *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> > [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> > [    0.056701] Modules linked in:
> > [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> > [    0.057001] Hardware name: linux,dummy-virt (DT)
> > [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> > [    0.057275] PC is at __asan_load8+0x34/0xb0
> > [    0.057375] LR is at __d_rehash+0xf0/0x240

Do you know what your physical memory layout looks like? 

Knowing that would tell us where shadow memory *should* be.

Can you share the command line you're using to launch the VM?

Thanks,
Mark.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 14:56                 ` Mark Rutland
  0 siblings, 0 replies; 115+ messages in thread
From: Mark Rutland @ 2017-10-13 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Oct 13, 2017 at 03:43:19PM +0100, Will Deacon wrote:
> On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> > I am getting the following panic during boot:
> > 
> > [    0.012637] pid_max: default: 32768 minimum: 301
> > [    0.016037] Security Framework initialized
> > [    0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> > [    0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> > [    0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [    0.055337] Unable to handle kernel paging request at virtual
> > address ffff0400010065af
> > [    0.055422] Mem abort info:
> > [    0.055518]   Exception class = DABT (current EL), IL = 32 bits
> > [    0.055579]   SET = 0, FnV = 0
> > [    0.055640]   EA = 0, S1PTW = 0
> > [    0.055699] Data abort info:
> > [    0.055762]   ISV = 0, ISS = 0x00000007
> > [    0.055822]   CM = 0, WnR = 0
> > [    0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff20000a8f4000
> > [    0.056047] [ffff0400010065af] *pgd=0000000046fe7003,
> > *pud=0000000046fe6003, *pmd=0000000046fe5003, *pte=0000000000000000
> > [    0.056436] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> > [    0.056701] Modules linked in:
> > [    0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> > [    0.057001] Hardware name: linux,dummy-virt (DT)
> > [    0.057084] task: ffff2000099d9000 task.stack: ffff2000099c0000
> > [    0.057275] PC is at __asan_load8+0x34/0xb0
> > [    0.057375] LR is at __d_rehash+0xf0/0x240

Do you know what your physical memory layout looks like? 

Knowing that would tell us where shadow memory *should* be.

Can you share the command line you're using to launch the VM?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 14:56                 ` Mark Rutland
  (?)
  (?)
@ 2017-10-13 15:02                   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:02 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, linux-arm-kernel, x86, kasan-dev, borntraeger,
	heiko.carstens, davem, willy, Michal Hocko, Ard Biesheuvel,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

> Do you know what your physical memory layout looks like?

[    0.000000] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     kasan   : 0xffff000000000000 - 0xffff200000000000
( 32768 GB)
[    0.000000]     modules : 0xffff200000000000 - 0xffff200008000000
(   128 MB)
[    0.000000]     vmalloc : 0xffff200008000000 - 0xffff7dffbfff0000
( 96254 GB)
[    0.000000]       .text : 0xffff200008080000 - 0xffff200009070000
( 16320 KB)
[    0.000000]     .rodata : 0xffff200009070000 - 0xffff200009850000
(  8064 KB)
[    0.000000]       .init : 0xffff200009850000 - 0xffff2000099c0000
(  1472 KB)
[    0.000000]       .data : 0xffff2000099c0000 - 0xffff20000a04f200
(  6717 KB)
[    0.000000]        .bss : 0xffff20000a04f200 - 0xffff20000a8f09e0
(  8838 KB)
[    0.000000]     fixed   : 0xffff7dfffe7fd000 - 0xffff7dfffec00000
(  4108 KB)
[    0.000000]     PCI I/O : 0xffff7dfffee00000 - 0xffff7dffffe00000
(    16 MB)
[    0.000000]     vmemmap : 0xffff7e0000000000 - 0xffff800000000000
(  2048 GB maximum)
[    0.000000]               0xffff7e0000000000 - 0xffff7e0000200000
(     2 MB actual)
[    0.000000]     memory  : 0xffff800000000000 - 0xffff800008000000
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using the launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from a connected gdb session via the lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S
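
As a reference point for where the shadow for a given address should
land relative to the kasan region in the layout above: generic KASAN
translates an address to its shadow by shifting it right by
KASAN_SHADOW_SCALE_SHIFT (3, which is where the 1/8th-of-memory shadow
size comes from) and adding KASAN_SHADOW_OFFSET. The sketch below is
only illustrative; the offset used here is an assumption derived from
the region boundaries printed above, while the real value comes from
the kernel configuration.

#include <stdio.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3
/* Assumed: offset chosen so that the shadow of the whole 2^48 VA space
 * ends at 0xffff200000000000, the top of the kasan region above. */
#define KASAN_SHADOW_OFFSET \
	(0xffff200000000000ULL - (1ULL << (64 - KASAN_SHADOW_SCALE_SHIFT)))

/* Forward translation, analogous to kasan_mem_to_shadow(). */
static uint64_t mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

int main(void)
{
	/* Start of the "memory" (linear map) region from the layout above. */
	uint64_t linear_start = 0xffff800000000000ULL;

	printf("shadow(%#llx) = %#llx\n",
	       (unsigned long long)linear_start,
	       (unsigned long long)mem_to_shadow(linear_start));
	return 0;
}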

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:02                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:02 UTC (permalink / raw)
  To: linux-arm-kernel

> Do you know what your physical memory layout looks like?

[    0.000000] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     kasan   : 0xffff000000000000 - 0xffff200000000000
( 32768 GB)
[    0.000000]     modules : 0xffff200000000000 - 0xffff200008000000
(   128 MB)
[    0.000000]     vmalloc : 0xffff200008000000 - 0xffff7dffbfff0000
( 96254 GB)
[    0.000000]       .text : 0xffff200008080000 - 0xffff200009070000
( 16320 KB)
[    0.000000]     .rodata : 0xffff200009070000 - 0xffff200009850000
(  8064 KB)
[    0.000000]       .init : 0xffff200009850000 - 0xffff2000099c0000
(  1472 KB)
[    0.000000]       .data : 0xffff2000099c0000 - 0xffff20000a04f200
(  6717 KB)
[    0.000000]        .bss : 0xffff20000a04f200 - 0xffff20000a8f09e0
(  8838 KB)
[    0.000000]     fixed   : 0xffff7dfffe7fd000 - 0xffff7dfffec00000
(  4108 KB)
[    0.000000]     PCI I/O : 0xffff7dfffee00000 - 0xffff7dffffe00000
(    16 MB)
[    0.000000]     vmemmap : 0xffff7e0000000000 - 0xffff800000000000
(  2048 GB maximum)
[    0.000000]               0xffff7e0000000000 - 0xffff7e0000200000
(     2 MB actual)
[    0.000000]     memory  : 0xffff800000000000 - 0xffff800008000000
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using the launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from a connected gdb session via the lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:02                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:02 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, linux-arm-kernel, x86, kasan-dev, borntraeger,
	heiko.carstens, davem, willy, Michal Hocko, Ard Biesheuvel,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

> Do you know what your physical memory layout looks like?

[    0.000000] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     kasan   : 0xffff000000000000 - 0xffff200000000000
( 32768 GB)
[    0.000000]     modules : 0xffff200000000000 - 0xffff200008000000
(   128 MB)
[    0.000000]     vmalloc : 0xffff200008000000 - 0xffff7dffbfff0000
( 96254 GB)
[    0.000000]       .text : 0xffff200008080000 - 0xffff200009070000
( 16320 KB)
[    0.000000]     .rodata : 0xffff200009070000 - 0xffff200009850000
(  8064 KB)
[    0.000000]       .init : 0xffff200009850000 - 0xffff2000099c0000
(  1472 KB)
[    0.000000]       .data : 0xffff2000099c0000 - 0xffff20000a04f200
(  6717 KB)
[    0.000000]        .bss : 0xffff20000a04f200 - 0xffff20000a8f09e0
(  8838 KB)
[    0.000000]     fixed   : 0xffff7dfffe7fd000 - 0xffff7dfffec00000
(  4108 KB)
[    0.000000]     PCI I/O : 0xffff7dfffee00000 - 0xffff7dffffe00000
(    16 MB)
[    0.000000]     vmemmap : 0xffff7e0000000000 - 0xffff800000000000
(  2048 GB maximum)
[    0.000000]               0xffff7e0000000000 - 0xffff7e0000200000
(     2 MB actual)
[    0.000000]     memory  : 0xffff800000000000 - 0xffff800008000000
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using the launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from a connected gdb session via the lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:02                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:02 UTC (permalink / raw)
  To: linux-arm-kernel

> Do you know what your physical memory layout looks like?

[    0.000000] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     kasan   : 0xffff000000000000 - 0xffff200000000000
( 32768 GB)
[    0.000000]     modules : 0xffff200000000000 - 0xffff200008000000
(   128 MB)
[    0.000000]     vmalloc : 0xffff200008000000 - 0xffff7dffbfff0000
( 96254 GB)
[    0.000000]       .text : 0xffff200008080000 - 0xffff200009070000
( 16320 KB)
[    0.000000]     .rodata : 0xffff200009070000 - 0xffff200009850000
(  8064 KB)
[    0.000000]       .init : 0xffff200009850000 - 0xffff2000099c0000
(  1472 KB)
[    0.000000]       .data : 0xffff2000099c0000 - 0xffff20000a04f200
(  6717 KB)
[    0.000000]        .bss : 0xffff20000a04f200 - 0xffff20000a8f09e0
(  8838 KB)
[    0.000000]     fixed   : 0xffff7dfffe7fd000 - 0xffff7dfffec00000
(  4108 KB)
[    0.000000]     PCI I/O : 0xffff7dfffee00000 - 0xffff7dffffe00000
(    16 MB)
[    0.000000]     vmemmap : 0xffff7e0000000000 - 0xffff800000000000
(  2048 GB maximum)
[    0.000000]               0xffff7e0000000000 - 0xffff7e0000200000
(     2 MB actual)
[    0.000000]     memory  : 0xffff800000000000 - 0xffff800008000000
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using the launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from a connected gdb session via the lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 14:43               ` Will Deacon
  (?)
@ 2017-10-13 15:09                 ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

> It shouldn't be difficult to use section mappings with my patch, I just
> don't really see the need to try to optimise TLB pressure when you're
> running with KASAN enabled which already has something like a 3x slowdown
> afaik. If it ends up being a big deal, we can always do that later, but
> my main aim here is to divorce kasan from vmemmap because they should be
> completely unrelated.

Yes, I understand that kasan makes the system slow, but my point is why
make it even slower? However, I am OK adding your patch to the series.
BTW, symmetric changes will be needed for x86 as well sometime later.

>
> This certainly doesn't sound right; mapping the shadow with pages shouldn't
> lead to problems. I also can't seem to reproduce this myself -- could you
> share your full .config and a pointer to the git tree that you're using,
> please?

Config is attached. I am using my patch series + your patch + today's
clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Also, in a separate e-mail I sent out the qemu arguments.

>
>> I feel, this patch requires more work, and I am troubled with using
>> base pages instead of large pages.
>
> I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> is the right thing to do here.

Thank you very much.

Pavel

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 36799 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:09                 ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:09 UTC (permalink / raw)
  To: linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

> It shouldn't be difficult to use section mappings with my patch, I just
> don't really see the need to try to optimise TLB pressure when you're
> running with KASAN enabled which already has something like a 3x slowdown
> afaik. If it ends up being a big deal, we can always do that later, but
> my main aim here is to divorce kasan from vmemmap because they should be
> completely unrelated.

Yes, I understand that kasan makes the system slow, but my point is why
make it even slower? However, I am OK adding your patch to the series.
BTW, symmetric changes will be needed for x86 as well sometime later.

>
> This certainly doesn't sound right; mapping the shadow with pages shouldn't
> lead to problems. I also can't seem to reproduce this myself -- could you
> share your full .config and a pointer to the git tree that you're using,
> please?

Config is attached. I am using my patch series + your patch + today's
clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Also, in a separate e-mail I sent out the qemu arguments.

>
>> I feel, this patch requires more work, and I am troubled with using
>> base pages instead of large pages.
>
> I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> is the right thing to do here.

Thank you very much.

Pavel

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 36799 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:09                 ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:09 UTC (permalink / raw)
  To: linux-arm-kernel

> It shouldn't be difficult to use section mappings with my patch, I just
> don't really see the need to try to optimise TLB pressure when you're
> running with KASAN enabled which already has something like a 3x slowdown
> afaik. If it ends up being a big deal, we can always do that later, but
> my main aim here is to divorce kasan from vmemmap because they should be
> completely unrelated.

Yes, I understand that kasan makes the system slow, but my point is why
make it even slower? However, I am OK adding your patch to the series.
BTW, symmetric changes will be needed for x86 as well sometime later.

>
> This certainly doesn't sound right; mapping the shadow with pages shouldn't
> lead to problems. I also can't seem to reproduce this myself -- could you
> share your full .config and a pointer to the git tree that you're using,
> please?

Config is attached. I am using my patch series + your patch + today's
clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Also, in a separate e-mail I sent out the qemu arguments.

>
>> I feel, this patch requires more work, and I am troubled with using
>> base pages instead of large pages.
>
> I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> is the right thing to do here.

Thank you very much.

Pavel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.gz
Type: application/x-gzip
Size: 36799 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20171013/6c7f5e6d/attachment-0001.bin>

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 15:09                 ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-13 15:34                   ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Here is a simplified qemu command:

qemu-system-aarch64 \
      -display none \
      -kernel ./arch/arm64/boot/Image  \
      -M virt -cpu cortex-a57 -s -S

In a separate terminal, start the arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x0000000040000000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel
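
In case lx-dmesg shows up as an undefined command: it is provided by
the kernel's in-tree GDB helper scripts, so the kernel needs
CONFIG_GDB_SCRIPTS=y and gdb needs to load vmlinux-gdb.py from the
build tree, roughly like this (the build-directory path below is a
placeholder):

(gdb) add-auto-load-safe-path /path/to/kernel-build
(gdb) source ./scripts/gdb/vmlinux-gdb.py
(gdb) apropos lx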

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:34                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

Here is a simplified qemu command:

qemu-system-aarch64 \
      -display none \
      -kernel ./arch/arm64/boot/Image  \
      -M virt -cpu cortex-a57 -s -S

In a separate terminal, start the arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x0000000040000000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:34                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Here is a simplified qemu command:

qemu-system-aarch64 \
      -display none \
      -kernel ./arch/arm64/boot/Image  \
      -M virt -cpu cortex-a57 -s -S

In a separate terminal, start the arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x0000000040000000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:34                   ` Pavel Tatashin
  0 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

Here is a simplified qemu command:

qemu-system-aarch64 \
      -display none \
      -kernel ./arch/arm64/boot/Image  \
      -M virt -cpu cortex-a57 -s -S

In a separate terminal, start the arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x0000000040000000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 15:09                 ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-13 15:44                   ` Will Deacon
  -1 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 15:44 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Fri, Oct 13, 2017 at 11:09:41AM -0400, Pavel Tatashin wrote:
> > It shouldn't be difficult to use section mappings with my patch, I just
> > don't really see the need to try to optimise TLB pressure when you're
> > running with KASAN enabled which already has something like a 3x slowdown
> > afaik. If it ends up being a big deal, we can always do that later, but
> > my main aim here is to divorce kasan from vmemmap because they should be
> > completely unrelated.
> 
> Yes, I understand that kasan makes system slow, but my point is why
> make it even slower? However, I am OK adding your patch to the series,
> BTW, symmetric changes will be needed for x86 as well sometime later.
> 
> >
> > This certainly doesn't sound right; mapping the shadow with pages shouldn't
> > lead to problems. I also can't seem to reproduce this myself -- could you
> > share your full .config and a pointer to the git tree that you're using,
> > please?
> 
> Config is attached. I am using my patch series + your patch + today's
> clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Great, I hit the same problem with your .config. It might actually be
CONFIG_DEBUG_MEMORY_INIT which does it.

> Also, in a separate e-mail i sent out the qemu arguments.
> 
> >
> >> I feel, this patch requires more work, and I am troubled with using
> >> base pages instead of large pages.
> >
> > I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> > is the right thing to do here.
> 
> Thank you very much.

Thanks for sharing the .config and tree. It looks like the problem is that
kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
them up in kasan_map_populate, they remain unaligned when passed to
kasan_populate_zero_shadow, which confuses the loop termination conditions
in e.g. zero_pte_populate and the shadow isn't configured properly.

Fixup diff below; please merge in with my original patch.

Will

--->8

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b922826d9908..207b1acb823a 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -146,7 +146,7 @@ asmlinkage void __init kasan_early_init(void)
 static void __init kasan_map_populate(unsigned long start, unsigned long end,
				      int node)
 {
-	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
+	kasan_pgd_populate(start, end, node, false);
 }
 
 /*
@@ -183,8 +183,8 @@ void __init kasan_init(void)
	struct memblock_region *reg;
	int i;
 
-	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text);
-	kimg_shadow_end = (u64)kasan_mem_to_shadow(_end);
+	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text) & PAGE_MASK;
+	kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(_end));
 
	mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
	mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);
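
To make the failure mode above concrete, here is a minimal userspace
model of a page-granular population loop. It is not the kernel's
zero_pte_populate, just an illustration of how an unaligned end leaves
the tail of the range without shadow:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

/* Count how many whole pages a loop of the form
 * "while (addr + PAGE_SIZE <= end)" would populate for [start, end). */
static unsigned long populate(uint64_t start, uint64_t end)
{
	unsigned long pages = 0;
	uint64_t addr;

	for (addr = start; addr + PAGE_SIZE <= end; addr += PAGE_SIZE)
		pages++;
	return pages;
}

int main(void)
{
	/* Made-up bounds that are not page-aligned. */
	uint64_t start = 0x1000100;
	uint64_t end   = 0x1004180;

	/* Unaligned bounds populate fewer pages than the range touches,
	 * while aligning them first covers the whole range (4 vs 5 here). */
	printf("unaligned bounds: %lu pages\n", populate(start, end));
	printf("aligned bounds:   %lu pages\n",
	       populate(start & PAGE_MASK, PAGE_ALIGN(end)));
	return 0;
}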

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:44                   ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 15:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Pavel,

On Fri, Oct 13, 2017 at 11:09:41AM -0400, Pavel Tatashin wrote:
> > It shouldn't be difficult to use section mappings with my patch, I just
> > don't really see the need to try to optimise TLB pressure when you're
> > running with KASAN enabled which already has something like a 3x slowdown
> > afaik. If it ends up being a big deal, we can always do that later, but
> > my main aim here is to divorce kasan from vmemmap because they should be
> > completely unrelated.
> 
> Yes, I understand that kasan makes system slow, but my point is why
> make it even slower? However, I am OK adding your patch to the series,
> BTW, symmetric changes will be needed for x86 as well sometime later.
> 
> >
> > This certainly doesn't sound right; mapping the shadow with pages shouldn't
> > lead to problems. I also can't seem to reproduce this myself -- could you
> > share your full .config and a pointer to the git tree that you're using,
> > please?
> 
> Config is attached. I am using my patch series + your patch + today's
> clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Great, I hit the same problem with your .config. It might actually be
CONFIG_DEBUG_MEMORY_INIT which does it.

> Also, in a separate e-mail i sent out the qemu arguments.
> 
> >
> >> I feel, this patch requires more work, and I am troubled with using
> >> base pages instead of large pages.
> >
> > I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> > is the right thing to do here.
> 
> Thank you very much.

Thanks for sharing the .config and tree. It looks like the problem is that
kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
them up in kasan_map_populate, they remain unaligned when passed to
kasan_populate_zero_shadow, which confuses the loop termination conditions
in e.g. zero_pte_populate and the shadow isn't configured properly.

Fixup diff below; please merge in with my original patch.

Will

--->8

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b922826d9908..207b1acb823a 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -146,7 +146,7 @@ asmlinkage void __init kasan_early_init(void)
 static void __init kasan_map_populate(unsigned long start, unsigned long end,
				      int node)
 {
-	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
+	kasan_pgd_populate(start, end, node, false);
 }
 
 /*
@@ -183,8 +183,8 @@ void __init kasan_init(void)
	struct memblock_region *reg;
	int i;
 
-	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text);
-	kimg_shadow_end = (u64)kasan_mem_to_shadow(_end);
+	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text) & PAGE_MASK;
+	kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(_end));
 
	mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
	mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);



^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
@ 2017-10-13 15:44                   ` Will Deacon
  0 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 15:44 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

Hi Pavel,

On Fri, Oct 13, 2017 at 11:09:41AM -0400, Pavel Tatashin wrote:
> > It shouldn't be difficult to use section mappings with my patch, I just
> > don't really see the need to try to optimise TLB pressure when you're
> > running with KASAN enabled which already has something like a 3x slowdown
> > afaik. If it ends up being a big deal, we can always do that later, but
> > my main aim here is to divorce kasan from vmemmap because they should be
> > completely unrelated.
> 
> Yes, I understand that kasan makes system slow, but my point is why
> make it even slower? However, I am OK adding your patch to the series,
> BTW, symmetric changes will be needed for x86 as well sometime later.
> 
> >
> > This certainly doesn't sound right; mapping the shadow with pages shouldn't
> > lead to problems. I also can't seem to reproduce this myself -- could you
> > share your full .config and a pointer to the git tree that you're using,
> > please?
> 
> Config is attached. I am using my patch series + your patch + today's
> clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Great, I hit the same problem with your .config. It might actually be
CONFIG_DEBUG_MEMORY_INIT which does it.

> Also, in a separate e-mail i sent out the qemu arguments.
> 
> >
> >> I feel, this patch requires more work, and I am troubled with using
> >> base pages instead of large pages.
> >
> > I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> > is the right thing to do here.
> 
> Thank you very much.

Thanks for sharing the .config and tree. It looks like the problem is that
kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
them up in kasan_map_populate, they remain unaligned when passed to
kasan_populate_zero_shadow, which confuses the loop termination conditions
in e.g. zero_pte_populate and the shadow isn't configured properly.

Fixup diff below; please merge in with my original patch.

Will

--->8

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b922826d9908..207b1acb823a 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -146,7 +146,7 @@ asmlinkage void __init kasan_early_init(void)
 static void __init kasan_map_populate(unsigned long start, unsigned long end,
				      int node)
 {
-	kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
+	kasan_pgd_populate(start, end, node, false);
 }
 
 /*
@@ -183,8 +183,8 @@ void __init kasan_init(void)
	struct memblock_region *reg;
	int i;
 
-	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text);
-	kimg_shadow_end = (u64)kasan_mem_to_shadow(_end);
+	kimg_shadow_start = (u64)kasan_mem_to_shadow(_text) & PAGE_MASK;
+	kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(_end));
 
	mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
	mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 15:44                   ` Will Deacon
  (?)
  (?)
@ 2017-10-13 15:54                     ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 15:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

> Thanks for sharing the .config and tree. It looks like the problem is that
> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
> them up in kasan_map_populate, they remain unaligned when passed to
> kasan_populate_zero_shadow, which confuses the loop termination conditions
> in e.g. zero_pte_populate and the shadow isn't configured properly.

This makes sense, thank you. I will insert these changes into your
patch and send out a new series soon after sanity-checking it.

Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 15:54                     ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-13 16:00                       ` Pavel Tatashin
  -1 siblings, 0 replies; 115+ messages in thread
From: Pavel Tatashin @ 2017-10-13 16:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

BTW, don't we need the same alignments inside the for_each_memblock() loop?

How about changing kasan_map_populate() to accept regular VA start and end
addresses, and convert them internally after aligning to PAGE_SIZE?
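
Something along these lines, for example (just a sketch of the idea, not a
tested patch):

	/*
	 * Sketch: let the caller pass linear-map VAs and do the shadow
	 * conversion plus page alignment inside the helper itself.
	 */
	static void __init kasan_map_populate(unsigned long va_start,
					      unsigned long va_end, int node)
	{
		unsigned long shadow_start, shadow_end;

		shadow_start = (unsigned long)kasan_mem_to_shadow((void *)va_start);
		shadow_end = (unsigned long)kasan_mem_to_shadow((void *)va_end);

		kasan_pgd_populate(shadow_start & PAGE_MASK,
				   PAGE_ALIGN(shadow_end), node, false);
	}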

Thank you,
Pavel


On Fri, Oct 13, 2017 at 11:54 AM, Pavel Tatashin
<pasha.tatashin@oracle.com> wrote:
>> Thanks for sharing the .config and tree. It looks like the problem is that
>> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
>> them up in kasan_map_populate, they remain unaligned when passed to
>> kasan_populate_zero_shadow, which confuses the loop termination conditions
>> in e.g. zero_pte_populate and the shadow isn't configured properly.
>
> This makes sense. Thank you. I will insert these changes into your
> patch, and send out a new series soon after sanity checking it.
>
> Pavel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()
  2017-10-13 16:00                       ` Pavel Tatashin
  (?)
  (?)
@ 2017-10-13 16:18                         ` Will Deacon
  -1 siblings, 0 replies; 115+ messages in thread
From: Will Deacon @ 2017-10-13 16:18 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	linux-arm-kernel, x86, kasan-dev, borntraeger, heiko.carstens,
	davem, willy, Michal Hocko, Ard Biesheuvel, Mark Rutland,
	catalin.marinas, sam, mgorman, Steve Sistare, daniel.m.jordan,
	bob.picco

On Fri, Oct 13, 2017 at 12:00:27PM -0400, Pavel Tatashin wrote:
> BTW, don't we need the same alignments inside the for_each_memblock() loop?

Hmm, yes actually, given that we shift them right for the shadow address.
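
The shadow translation is roughly the following (paraphrasing
include/linux/kasan.h from memory), so even page-aligned memblock bounds give
shadow addresses that are only PAGE_SIZE >> KASAN_SHADOW_SCALE_SHIFT aligned:

	/* Rough shape of the VA -> shadow translation. */
	static inline void *kasan_mem_to_shadow(const void *addr)
	{
		return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
			+ KASAN_SHADOW_OFFSET;
	}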

> How about changing kasan_map_populate() to accept regular VA start and end
> addresses, and convert them internally after aligning to PAGE_SIZE?

That's what my original patch did, but it doesn't help on its own because
kasan_populate_zero_shadow would need the same change.

Will

^ permalink raw reply	[flat|nested] 115+ messages in thread

end of thread, other threads:[~2017-10-13 16:18 UTC | newest]

Thread overview: 115+ messages
2017-10-09 22:19 [PATCH v11 0/9] complete deferred page initialization Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 1/9] x86/mm: setting fields in deferred pages Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 2/9] sparc64/mm: " Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 3/9] sparc64: simplify vmemmap_populate Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 4/9] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 5/9] mm: zero reserved and unavailable struct pages Pavel Tatashin
2017-10-10 13:44   ` Michal Hocko
2017-10-10 14:09     ` Michal Hocko
2017-10-10 14:30       ` Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 6/9] x86/kasan: add and use kasan_map_populate() Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 7/9] arm64/kasan: " Pavel Tatashin
2017-10-10 15:56   ` Will Deacon
2017-10-10 17:07     ` Pavel Tatashin
2017-10-10 17:10       ` Will Deacon
2017-10-10 17:41         ` Pavel Tatashin
2017-10-13 14:10           ` Pavel Tatashin
2017-10-13 14:43             ` Will Deacon
2017-10-13 14:56               ` Mark Rutland
2017-10-13 15:02                 ` Pavel Tatashin
2017-10-13 15:09               ` Pavel Tatashin
2017-10-13 15:34                 ` Pavel Tatashin
2017-10-13 15:44                 ` Will Deacon
2017-10-13 15:54                   ` Pavel Tatashin
2017-10-13 16:00                     ` Pavel Tatashin
2017-10-13 16:18                       ` Will Deacon
2017-10-09 22:19 ` [PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap Pavel Tatashin
2017-10-09 22:19 ` [PATCH v11 9/9] sparc64: optimized struct page zeroing Pavel Tatashin
2017-10-10 14:15 ` [PATCH v11 0/9] complete deferred page initialization Michal Hocko
2017-10-10 17:19   ` Pavel Tatashin
