* [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
@ 2021-08-05 19:02 Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 01/15] arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso Zi Yan
                   ` (17 more replies)
  0 siblings, 18 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for a kernel boot time adjustable MAX_ORDER, so that
users can change the largest size of pages obtained from the buddy allocator. It also
removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that the
buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.

Motivation
===

This enables the kernel to allocate 1GB pages and is necessary for my ongoing work
on adding support for 1GB PUD THP[1]. This is also the conclusion I reached after
some discussion with David Hildenbrand on what methods should be used for
allocating gigantic pages[2], since other approaches like using the CMA allocator or
alloc_contig_pages() are regarded as suboptimal.

This also avoids increasing SECTION_SIZE_BITS when increasing MAX_ORDER, since a
larger SECTION_SIZE_BITS also enlarges the memory hotadd/hotremove chunk size
(for example, raising x86_64's default of 27 to 30 would grow hotplug chunks from
128MB to 1GB), making memory management harder for VMs.

In addition, making MAX_ORDER a kernel boot time parameter lets users adjust the
buddy allocator for their own needs without recompiling the kernel, so that one can
still keep a small MAX_ORDER when there is no need to allocate gigantic pages like
1GB PUD THPs.
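
As an illustration (the exact boot parameter name is introduced in Patch 15; the
name below is only an example), booting a 4KB-page x86_64 kernel with something
like:

	buddy_alloc_max_order=16

would let the buddy allocator hand out blocks of up to 2^15 pages (128MB), while
the default MAX_ORDER of 11 keeps the usual 4MB limit.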

Background
===

At the moment, the kernel imposes the restriction that MAX_ORDER - 1 + PAGE_SHIFT
must not exceed SECTION_SIZE_BITS. This prevents the buddy allocator from merging
pages across memory sections: the memory map (struct page array) is only guaranteed
to be contiguous within a section, so code like page++ could fail across a section
boundary. This is not an issue when SPARSEMEM_VMEMMAP is set, since all struct
pages are then virtually contiguous. In addition, as long as the buddy allocator
checks PFN validity during buddy page merging (done in Patch 3), pages allocated
from the buddy allocator can still be manipulated by code like page++.
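
As a concrete example, assume the x86_64 defaults of PAGE_SHIFT = 12 and
SECTION_SIZE_BITS = 27 (128MB sections). The restriction then caps MAX_ORDER at
27 - 12 + 1 = 16, i.e. allocations of at most 2^15 pages = 128MB. A 1GB page is an
order-18 allocation (2^18 * 4KB), which needs MAX_ORDER >= 19, so either the
restriction has to be lifted (what this patchset does) or SECTION_SIZE_BITS has to
grow, which the Motivation section argues against.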


Description
===

I tested the patchset on both x86_64 and ARM64 with 4KB, 16KB, and 64KB base
pages. The systems booted and the ltp mm test suite finished without issue. Memory
hotplug also worked in my x86_64 tests. It definitely needs more tests and reviews
for other architectures.

Regarding the concern about performance degradation when MAX_ORDER is increased, I
did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on
x86_64 machines and saw no performance difference[3].

Patch 1 excludes the MAX_ORDER check from 32bit vdso compilation. The check uses
the irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The
exclusion does not weaken the check for 32bit kernels, since the check is still
performed when other kernel components are compiled.

Patch 2 gives FORCE_MAX_ZONEORDER a better name.

Patch 3 restores the pfn_valid_within() check for the case where the buddy
allocator can merge pages across memory sections. The check was removed when ARM64
got rid of holes within zones, but holes can appear in zones again after this
patchset.

Patches 4-11 convert uses of MAX_ORDER to SECTION_SIZE_BITS or its derivative
constants in places that use MAX_ORDER as a boundary check for physically
contiguous pages, where SECTION_SIZE_BITS should be used instead. After this
patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS and such code could break.
I separated the changes into different patches for easier review and can merge them
into a single one if that works better.

Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic
ones if possible and prepares for making MAX_ORDER a boot time parameter.

Patch 14 adds a new MIN_MAX_ORDER constant to replace the soon-to-be-dynamic
MAX_ORDER in places where converting a static array to a dynamic one would be a
hassle and is not necessary, i.e., ARM64 hypervisor page allocation and SLAB.

Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.
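
As a rough sketch of the mechanism (illustrative only: the actual parameter name,
variable, and bounds checking are defined in Patch 15):

	#include <linux/init.h>
	#include <linux/kernel.h>

	/* sketch: the maximum order becomes a variable seeded from Kconfig */
	unsigned int buddy_alloc_max_order = CONFIG_SET_MAX_ORDER;

	static int __init set_buddy_alloc_max_order(char *str)
	{
		unsigned int order;

		if (kstrtouint(str, 10, &order))
			return -EINVAL;
		/* never drop below the compile-time floor added in Patch 14 */
		if (order < MIN_MAX_ORDER)
			order = MIN_MAX_ORDER;
		buddy_alloc_max_order = order;
		return 0;
	}
	early_param("buddy_alloc_max_order", set_buddy_alloc_max_order);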


Any suggestion and/or comment is welcome. Thanks.


TODO
===

1. Redo the performance comparison tests using this patchset to understand the
   performance implication of changing MAX_ORDER.


Zi Yan (15):
  arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit
    vdso.
  arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER
  mm: check pfn validity when buddy allocator can merge pages across mem
    sections.
  mm: prevent pageblock size being larger than section size.
  mm/memory_hotplug: online pages at section size.
  mm: use PAGES_PER_SECTION instead for mem_map_offset/next().
  mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity
  fs: proc: use PAGES_PER_SECTION for page offline checking period.
  virtio: virtio_mem: use PAGES_PER_SECTION instead of
    MAX_ORDER_NR_PAGES
  virtio: virtio_balloon: use PAGES_PER_SECTION instead of
    MAX_ORDER_NR_PAGES.
  mm/page_reporting: report pages at section size instead of MAX_ORDER.
  mm: Make MAX_ORDER of buddy allocator configurable via Kconfig
    SET_MAX_ORDER.
  mm: convert MAX_ORDER sized static arrays to dynamic ones.
  mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time
    constant.
  mm: make MAX_ORDER a kernel boot time parameter.

 .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
 .../admin-guide/kernel-parameters.txt         |   5 +
 arch/Kconfig                                  |   4 +
 arch/arc/Kconfig                              |   2 +-
 arch/arm/Kconfig                              |   2 +-
 arch/arm/configs/imx_v6_v7_defconfig          |   2 +-
 arch/arm/configs/milbeaut_m10v_defconfig      |   2 +-
 arch/arm/configs/oxnas_v6_defconfig           |   2 +-
 arch/arm/configs/sama7_defconfig              |   2 +-
 arch/arm64/Kconfig                            |   2 +-
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |   2 +-
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          |   3 +-
 arch/csky/Kconfig                             |   2 +-
 arch/ia64/Kconfig                             |   2 +-
 arch/ia64/include/asm/sparsemem.h             |   6 +-
 arch/m68k/Kconfig.cpu                         |   2 +-
 arch/mips/Kconfig                             |   2 +-
 arch/nios2/Kconfig                            |   2 +-
 arch/powerpc/Kconfig                          |   2 +-
 arch/powerpc/configs/85xx/ge_imp3a_defconfig  |   2 +-
 arch/powerpc/configs/fsl-emb-nonhw.config     |   2 +-
 arch/sh/configs/ecovec24_defconfig            |   2 +-
 arch/sh/mm/Kconfig                            |   2 +-
 arch/sparc/Kconfig                            |   2 +-
 arch/x86/entry/vdso/Makefile                  |   1 +
 arch/xtensa/Kconfig                           |   2 +-
 drivers/gpu/drm/ttm/ttm_device.c              |   7 +-
 drivers/gpu/drm/ttm/ttm_pool.c                |  58 +++++++++-
 drivers/virtio/virtio_balloon.c               |   2 +-
 drivers/virtio/virtio_mem.c                   |  12 +-
 fs/proc/kcore.c                               |   2 +-
 include/drm/ttm/ttm_pool.h                    |   4 +-
 include/linux/memory_hotplug.h                |   1 +
 include/linux/mmzone.h                        |  56 ++++++++-
 include/linux/pageblock-flags.h               |   7 +-
 include/linux/slab.h                          |   8 +-
 mm/Kconfig                                    |  16 +++
 mm/compaction.c                               |  20 ++--
 mm/hugetlb.c                                  |   2 +-
 mm/internal.h                                 |   4 +-
 mm/memory_hotplug.c                           |  18 ++-
 mm/page_alloc.c                               | 108 ++++++++++++++++--
 mm/page_isolation.c                           |   7 +-
 mm/page_owner.c                               |  14 ++-
 mm/page_reporting.c                           |   3 +-
 mm/slab.c                                     |   2 +-
 mm/slub.c                                     |   7 +-
 mm/vmscan.c                                   |   1 -
 48 files changed, 334 insertions(+), 86 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 01/15] arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 02/15] arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER Zi Yan
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Andy Lutomirski, x86

From: Zi Yan <ziy@nvidia.com>

For x86_64, the 32bit vdso is compiled for compatibility reasons, and the 32bit
SECTION_SIZE_BITS value is used during its compilation. This causes a compile-time
error when MAX_ORDER is increased in the 64bit environment, even though the value
is fine for the 64bit SECTION_SIZE_BITS. Remove the check during 32bit vdso
compilation. The check still exists during the compilation of other kernel
components.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 arch/x86/entry/vdso/Makefile | 1 +
 include/linux/mmzone.h       | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 05c4abc2fdfd..cad339136ed1 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -156,6 +156,7 @@ KBUILD_CFLAGS_32 += -fno-stack-protector
 KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls)
 KBUILD_CFLAGS_32 += -fno-omit-frame-pointer
 KBUILD_CFLAGS_32 += -DDISABLE_BRANCH_PROFILING
+KBUILD_CFLAGS_32 += -DNO_MAX_ORDER_CHECK
 
 ifdef CONFIG_RETPOLINE
 ifneq ($(RETPOLINE_VDSO_CFLAGS),)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..c1d914a72489 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1245,9 +1245,12 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define SECTION_BLOCKFLAGS_BITS \
 	((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
 
+/* NO_MAX_ORDER_CHECK when compiling x64 32bit VDSO for 64bit system */
+#ifndef NO_MAX_ORDER_CHECK
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
+#endif /* NO_MAX_ORDER_CHECK */
 
 static inline unsigned long pfn_to_section_nr(unsigned long pfn)
 {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 02/15] arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 01/15] arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 03/15] mm: check pfn validity when buddy allocator can merge pages across mem sections Zi Yan
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Vineet Gupta, Shawn Guo, Catalin Marinas, Guo Ren,
	Geert Uytterhoeven, Thomas Bogendoerfer, Ley Foon Tan,
	Michael Ellerman, Yoshinori Sato, David S. Miller, Chris Zankel,
	linux-snps-arc, linux-arm-kernel, linux-oxnas, linux-csky,
	linux-ia64, linux-m68k, linux-mips, linuxppc-dev, linux-sh,
	sparclinux, linux-xtensa

From: Zi Yan <ziy@nvidia.com>

This Kconfig option is used by individual architectures to set their desired
MAX_ORDER. Rename it to reflect its actual use.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-oxnas@groups.io
Cc: linux-csky@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/arc/Kconfig                             | 2 +-
 arch/arm/Kconfig                             | 2 +-
 arch/arm/configs/imx_v6_v7_defconfig         | 2 +-
 arch/arm/configs/milbeaut_m10v_defconfig     | 2 +-
 arch/arm/configs/oxnas_v6_defconfig          | 2 +-
 arch/arm/configs/sama7_defconfig             | 2 +-
 arch/arm64/Kconfig                           | 2 +-
 arch/csky/Kconfig                            | 2 +-
 arch/ia64/Kconfig                            | 2 +-
 arch/ia64/include/asm/sparsemem.h            | 6 +++---
 arch/m68k/Kconfig.cpu                        | 2 +-
 arch/mips/Kconfig                            | 2 +-
 arch/nios2/Kconfig                           | 2 +-
 arch/powerpc/Kconfig                         | 2 +-
 arch/powerpc/configs/85xx/ge_imp3a_defconfig | 2 +-
 arch/powerpc/configs/fsl-emb-nonhw.config    | 2 +-
 arch/sh/configs/ecovec24_defconfig           | 2 +-
 arch/sh/mm/Kconfig                           | 2 +-
 arch/sparc/Kconfig                           | 2 +-
 arch/xtensa/Kconfig                          | 2 +-
 include/linux/mmzone.h                       | 4 ++--
 21 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index b5bf68e74732..923ea4c31e59 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -583,7 +583,7 @@ config ARC_BUILTIN_DTB_NAME
 
 endmenu	 # "ARC Architecture Configuration"
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	default "12" if ARC_HUGEPAGE_16M
 	default "11"
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2fb7012c3246..286854318fe5 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1523,7 +1523,7 @@ config ARM_MODULE_PLTS
 	  Disabling this is usually safe for small single-platform
 	  configurations. If unsure, say y.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	default "12" if SOC_AM33XX
 	default "9" if SA1111
diff --git a/arch/arm/configs/imx_v6_v7_defconfig b/arch/arm/configs/imx_v6_v7_defconfig
index 079fcd8d1d11..802310d3ebf5 100644
--- a/arch/arm/configs/imx_v6_v7_defconfig
+++ b/arch/arm/configs/imx_v6_v7_defconfig
@@ -34,7 +34,7 @@ CONFIG_PCI_IMX6=y
 CONFIG_SMP=y
 CONFIG_ARM_PSCI=y
 CONFIG_HIGHMEM=y
-CONFIG_FORCE_MAX_ZONEORDER=14
+CONFIG_ARCH_FORCE_MAX_ORDER=14
 CONFIG_CMDLINE="noinitrd console=ttymxc0,115200"
 CONFIG_KEXEC=y
 CONFIG_CPU_FREQ=y
diff --git a/arch/arm/configs/milbeaut_m10v_defconfig b/arch/arm/configs/milbeaut_m10v_defconfig
index 7c07f9893a0f..06967243f74d 100644
--- a/arch/arm/configs/milbeaut_m10v_defconfig
+++ b/arch/arm/configs/milbeaut_m10v_defconfig
@@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y
 # CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set
 # CONFIG_ARM_PATCH_IDIV is not set
 CONFIG_HIGHMEM=y
-CONFIG_FORCE_MAX_ZONEORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=12
 CONFIG_SECCOMP=y
 CONFIG_KEXEC=y
 CONFIG_EFI=y
diff --git a/arch/arm/configs/oxnas_v6_defconfig b/arch/arm/configs/oxnas_v6_defconfig
index cae0db6b4eaf..df8462272446 100644
--- a/arch/arm/configs/oxnas_v6_defconfig
+++ b/arch/arm/configs/oxnas_v6_defconfig
@@ -17,7 +17,7 @@ CONFIG_MACH_OX820=y
 CONFIG_SMP=y
 CONFIG_NR_CPUS=16
 CONFIG_CMA=y
-CONFIG_FORCE_MAX_ZONEORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=12
 CONFIG_SECCOMP=y
 CONFIG_ARM_APPENDED_DTB=y
 CONFIG_ARM_ATAG_DTB_COMPAT=y
diff --git a/arch/arm/configs/sama7_defconfig b/arch/arm/configs/sama7_defconfig
index 938aae4bd80b..f8683b87cb27 100644
--- a/arch/arm/configs/sama7_defconfig
+++ b/arch/arm/configs/sama7_defconfig
@@ -22,7 +22,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y
 # CONFIG_CACHE_L2X0 is not set
 # CONFIG_ARM_PATCH_IDIV is not set
 # CONFIG_CPU_SW_DOMAIN_PAN is not set
-CONFIG_FORCE_MAX_ZONEORDER=15
+CONFIG_ARCH_FORCE_MAX_ORDER=15
 CONFIG_UACCESS_WITH_MEMCPY=y
 # CONFIG_ATAGS is not set
 CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b5b13a932561..972d81f6bb2c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1152,7 +1152,7 @@ config XEN
 	help
 	  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM64.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int
 	default "14" if ARM64_64K_PAGES
 	default "12" if ARM64_16K_PAGES
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 2716f6395ba7..0fd2333226b7 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -307,7 +307,7 @@ config HIGHMEM
 	select KMAP_LOCAL
 	default y
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	default "11"
 
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 4993c7ac7ff6..d1bd010ee7a5 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -200,7 +200,7 @@ config IA64_CYCLONE
 	  Say Y here to enable support for IBM EXA Cyclone time source.
 	  If you're unsure, answer N.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "MAX_ORDER (11 - 17)"  if !HUGETLB_PAGE
 	range 11 17  if !HUGETLB_PAGE
 	default "17" if HUGETLB_PAGE
diff --git a/arch/ia64/include/asm/sparsemem.h b/arch/ia64/include/asm/sparsemem.h
index 42ed5248fae9..84e8ce387b69 100644
--- a/arch/ia64/include/asm/sparsemem.h
+++ b/arch/ia64/include/asm/sparsemem.h
@@ -11,10 +11,10 @@
 
 #define SECTION_SIZE_BITS	(30)
 #define MAX_PHYSMEM_BITS	(50)
-#ifdef CONFIG_FORCE_MAX_ZONEORDER
-#if ((CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
+#ifdef CONFIG_ARCH_FORCE_MAX_ORDER
+#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
 #undef SECTION_SIZE_BITS
-#define SECTION_SIZE_BITS (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
+#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT)
 #endif
 #endif
 
diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index 29e946394fdb..c39157f3dd87 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -408,7 +408,7 @@ config SINGLE_MEMORY_CHUNK
 	  order" to save memory that could be wasted for unused memory map.
 	  Say N if not sure.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order" if ADVANCED
 	depends on !SINGLE_MEMORY_CHUNK
 	default "11"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 6dfb27d531dd..0998f671ed7e 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2284,7 +2284,7 @@ config PAGE_SIZE_64KB
 
 endchoice
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
 	default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index c24955c81c92..e5c300198e1f 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -50,7 +50,7 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	range 9 20
 	default "11"
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2e213ec6ec05..5edb48b363c4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -810,7 +810,7 @@ config DATA_SHIFT
 	  in that case. If PIN_TLB is selected, it must be aligned to 8M as
 	  8M pages will be pinned.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	range 8 9 if PPC64 && PPC_64K_PAGES
 	default "9" if PPC64 && PPC_64K_PAGES
diff --git a/arch/powerpc/configs/85xx/ge_imp3a_defconfig b/arch/powerpc/configs/85xx/ge_imp3a_defconfig
index f29c166998af..e7672c186325 100644
--- a/arch/powerpc/configs/85xx/ge_imp3a_defconfig
+++ b/arch/powerpc/configs/85xx/ge_imp3a_defconfig
@@ -30,7 +30,7 @@ CONFIG_PREEMPT=y
 # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
 CONFIG_BINFMT_MISC=m
 CONFIG_MATH_EMULATION=y
-CONFIG_FORCE_MAX_ZONEORDER=17
+CONFIG_ARCH_FORCE_MAX_ORDER=17
 CONFIG_PCI=y
 CONFIG_PCIEPORTBUS=y
 CONFIG_PCI_MSI=y
diff --git a/arch/powerpc/configs/fsl-emb-nonhw.config b/arch/powerpc/configs/fsl-emb-nonhw.config
index df37efed0aec..521e7a530888 100644
--- a/arch/powerpc/configs/fsl-emb-nonhw.config
+++ b/arch/powerpc/configs/fsl-emb-nonhw.config
@@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y
 CONFIG_FONT_8x16=y
 CONFIG_FONT_8x8=y
 CONFIG_FONTS=y
-CONFIG_FORCE_MAX_ZONEORDER=13
+CONFIG_ARCH_FORCE_MAX_ORDER=13
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAME_WARN=1024
 CONFIG_FTL=y
diff --git a/arch/sh/configs/ecovec24_defconfig b/arch/sh/configs/ecovec24_defconfig
index 03cb916819fa..4c09ca308d43 100644
--- a/arch/sh/configs/ecovec24_defconfig
+++ b/arch/sh/configs/ecovec24_defconfig
@@ -8,7 +8,7 @@ CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 # CONFIG_BLK_DEV_BSG is not set
 CONFIG_CPU_SUBTYPE_SH7724=y
-CONFIG_FORCE_MAX_ZONEORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=12
 CONFIG_MEMORY_SIZE=0x10000000
 CONFIG_FLATMEM_MANUAL=y
 CONFIG_SH_ECOVEC=y
diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index ba569cfb4368..411fdc0901f7 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -18,7 +18,7 @@ config PAGE_OFFSET
 	default "0x80000000" if MMU
 	default "0x00000000"
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	range 9 64 if PAGE_SIZE_16KB
 	default "9" if PAGE_SIZE_16KB
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index f0c0f955e169..2206c99612ed 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -273,7 +273,7 @@ config ARCH_SPARSEMEM_ENABLE
 config ARCH_SPARSEMEM_DEFAULT
 	def_bool y if SPARC64
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	default "13"
 	help
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index b843902ad9fd..90784aa68cf8 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -709,7 +709,7 @@ config HIGHMEM
 
 	  If unsure, say Y.
 
-config FORCE_MAX_ZONEORDER
+config ARCH_FORCE_MAX_ORDER
 	int "Maximum zone order"
 	default "11"
 	help
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1d914a72489..98e3297b9e09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,10 +24,10 @@
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
-#ifndef CONFIG_FORCE_MAX_ZONEORDER
+#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
 #define MAX_ORDER 11
 #else
-#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
+#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
 #endif
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 03/15] mm: check pfn validity when buddy allocator can merge pages across mem sections.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 01/15] arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 02/15] arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 04/15] mm: prevent pageblock size being larger than section size Zi Yan
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When MAX_ORDER - 1 + PAGE_SHIFT > SECTION_SIZE_BITS, it is possible to have holes
in memory zones. Use pfn_valid() to check for holes during buddy page merging and
physical frame scanning.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 13 +++++++++++++
 mm/compaction.c        | 20 +++++++++++++-------
 mm/memory_hotplug.c    |  7 +++++++
 mm/page_alloc.c        | 26 ++++++++++++++++++++++++--
 mm/page_isolation.c    |  7 ++++++-
 mm/page_owner.c        | 14 +++++++++++++-
 6 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 98e3297b9e09..04f790ed81b7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1535,6 +1535,19 @@ void sparse_init(void);
 #define subsection_map_init(_pfn, _nr_pages) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
+/*
+ * If it is possible to have holes within a MAX_ORDER_NR_PAGES when
+ * MAX_ORDER_NR_PAGES crosses multiple memory sections, then we
+ * need to check pfn validity within each MAX_ORDER_NR_PAGES block.
+ * pfn_valid_within() should be used in this case; we optimise this away
+ * when we have no holes within a MAX_ORDER_NR_PAGES block.
+ */
+#if ((MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
+#define pfn_valid_within(pfn) pfn_valid(pfn)
+#else
+#define pfn_valid_within(pfn) (1)
+#endif
+
 #endif /* !__GENERATING_BOUNDS.H */
 #endif /* !__ASSEMBLY__ */
 #endif /* _LINUX_MMZONE_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index fbc60f964c38..dda640d51b70 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -306,14 +306,16 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
 	 * is necessary for the block to be a migration source/target.
 	 */
 	do {
-		if (check_source && PageLRU(page)) {
-			clear_pageblock_skip(page);
-			return true;
-		}
+		if (pfn_valid_within(pfn)) {
+			if (check_source && PageLRU(page)) {
+				clear_pageblock_skip(page);
+				return true;
+			}
 
-		if (check_target && PageBuddy(page)) {
-			clear_pageblock_skip(page);
-			return true;
+			if (check_target && PageBuddy(page)) {
+				clear_pageblock_skip(page);
+				return true;
+			}
 		}
 
 		page += (1 << PAGE_ALLOC_COSTLY_ORDER);
@@ -583,6 +585,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 			break;
 
 		nr_scanned++;
+		if (!pfn_valid_within(blockpfn))
+			goto isolate_fail;
 
 		/*
 		 * For compound pages such as THP and hugetlbfs, we can save
@@ -881,6 +885,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			cond_resched();
 		}
 
+		if (!pfn_valid_within(low_pfn))
+			goto isolate_fail;
 		nr_scanned++;
 
 		page = pfn_to_page(low_pfn);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 632cd832aef6..85029994a494 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1617,6 +1617,13 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 			continue;
 		for (; pfn < sec_end_pfn && pfn < end_pfn;
 		     pfn += MAX_ORDER_NR_PAGES) {
+			int i = 0;
+
+			while ((i < MAX_ORDER_NR_PAGES) &&
+				!pfn_valid_within(pfn + i))
+				i++;
+			if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
+				continue;
 			/* Check if we got outside of the zone */
 			if (zone && !zone_spans_pfn(zone, pfn))
 				return NULL;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 416859e94f86..e4657009fd4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -594,6 +594,8 @@ static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 
 static int page_is_consistent(struct zone *zone, struct page *page)
 {
+	if (!pfn_valid_within(page_to_pfn(page)))
+		return 0;
 	if (zone != page_zone(page))
 		return 0;
 
@@ -1023,12 +1025,16 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 	if (order >= MAX_ORDER - 2)
 		return false;
 
+	if (!pfn_valid_within(buddy_pfn))
+		return false;
+
 	combined_pfn = buddy_pfn & pfn;
 	higher_page = page + (combined_pfn - pfn);
 	buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
 	higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 
-	return page_is_buddy(higher_page, higher_buddy, order + 1);
+	return pfn_valid_within(buddy_pfn) &&
+	       page_is_buddy(higher_page, higher_buddy, order + 1);
 }
 
 /*
@@ -1089,6 +1095,8 @@ static inline void __free_one_page(struct page *page,
 		buddy_pfn = __find_buddy_pfn(pfn, order);
 		buddy = page + (buddy_pfn - pfn);
 
+		if (!pfn_valid_within(buddy_pfn))
+			goto done_merging;
 		if (!page_is_buddy(page, buddy, order))
 			goto done_merging;
 		/*
@@ -1118,6 +1126,9 @@ static inline void __free_one_page(struct page *page,
 
 			buddy_pfn = __find_buddy_pfn(pfn, order);
 			buddy = page + (buddy_pfn - pfn);
+
+			if (!pfn_valid_within(buddy_pfn))
+				goto done_merging;
 			buddy_mt = get_pageblock_migratetype(buddy);
 
 			if (migratetype != buddy_mt
@@ -1746,7 +1757,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 /*
  * Check that the whole (or subset of) a pageblock given by the interval of
  * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
- * with the migration of free compaction scanner.
+ * with the migration of free compaction scanner. The scanners then need to use
+ * only pfn_valid_within() check for holes within pageblocks.
  *
  * Return struct page pointer of start_pfn, or NULL if checks were not passed.
  *
@@ -1862,6 +1874,8 @@ static inline void __init pgdat_init_report_one_done(void)
  */
 static inline bool __init deferred_pfn_valid(unsigned long pfn)
 {
+	if (!pfn_valid_within(pfn))
+		return false;
 	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
 		return false;
 	return true;
@@ -2508,6 +2522,11 @@ static int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 
 	for (pfn = start_pfn; pfn <= end_pfn;) {
+		if (!pfn_valid_within(pfn)) {
+			pfn++;
+			continue;
+		}
+
 		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			/*
@@ -8825,6 +8844,9 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 	}
 
 	for (; iter < pageblock_nr_pages - offset; iter++) {
+		if (!pfn_valid_within(pfn + iter))
+			continue;
+
 		page = pfn_to_page(pfn + iter);
 
 		/*
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 471e3a13b541..bddf788f45bf 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -93,7 +93,8 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 			buddy_pfn = __find_buddy_pfn(pfn, order);
 			buddy = page + (buddy_pfn - pfn);
 
-			if (!is_migrate_isolate_page(buddy)) {
+			if (pfn_valid_within(buddy_pfn) &&
+			    !is_migrate_isolate_page(buddy)) {
 				__isolate_free_page(page, order);
 				isolated_page = true;
 			}
@@ -249,6 +250,10 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 	struct page *page;
 
 	while (pfn < end_pfn) {
+		if (!pfn_valid_within(pfn)) {
+			pfn++;
+			continue;
+		}
 		page = pfn_to_page(pfn);
 		if (PageBuddy(page))
 			/*
diff --git a/mm/page_owner.c b/mm/page_owner.c
index d24ed221357c..23bfb074ca3f 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -276,6 +276,9 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 		pageblock_mt = get_pageblock_migratetype(page);
 
 		for (; pfn < block_end_pfn; pfn++) {
+			if (!pfn_valid_within(pfn))
+				continue;
+
 			/* The pageblock is online, no need to recheck. */
 			page = pfn_to_page(pfn);
 
@@ -476,6 +479,10 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 			continue;
 		}
 
+		/* Check for holes within a MAX_ORDER area */
+		if (!pfn_valid_within(pfn))
+			continue;
+
 		page = pfn_to_page(pfn);
 		if (PageBuddy(page)) {
 			unsigned long freepage_order = buddy_order_unsafe(page);
@@ -553,9 +560,14 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
 		for (; pfn < block_end_pfn; pfn++) {
-			struct page *page = pfn_to_page(pfn);
+			struct page *page;
 			struct page_ext *page_ext;
 
+			if (!pfn_valid_within(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
 			if (page_zone(page) != zone)
 				continue;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 04/15] mm: prevent pageblock size being larger than section size.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (2 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 03/15] mm: check pfn validity when buddy allocator can merge pages across mem sections Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 05/15] mm/memory_hotplug: online pages at " Zi Yan
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Wei Yang

From: Zi Yan <ziy@nvidia.com>

Only physical pages within a section are guaranteed to be contiguous, and so far a
pageblock can only group contiguous physical pages by design. Set pageblock_order
properly to prevent a pageblock from going beyond the section size.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/pageblock-flags.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..4277b4267767 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -44,8 +44,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order		(MAX_ORDER-1)
+/*
+ * If huge pages are not used, group by MAX_ORDER_NR_PAGES or
+ * PAGES_PER_SECTION when MAX_ORDER_NR_PAGES is larger.
+ */
+#define pageblock_order		(min(PFN_SECTION_SHIFT, MAX_ORDER-1))
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 05/15] mm/memory_hotplug: online pages at section size.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (3 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 04/15] mm: prevent pageblock size being larger than section size Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 06/15] mm: use PAGES_PER_SECTION instead for mem_map_offset/next() Zi Yan
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When CONFIG_SET_MAX_ORDER is used, MAX_ORDER can be larger than the section size,
and holes can then appear within a MAX_ORDER-sized chunk of hotplugged memory. Use
PFN_SECTION_SHIFT (the order of the section size) to limit the chunk size used when
onlining hotplugged pages.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memory_hotplug.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 85029994a494..91ca751ac20c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -601,16 +601,16 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 	unsigned long pfn;
 
 	/*
-	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
-	 * decide to not expose all pages to the buddy (e.g., expose them
+	 * Online the pages in PFN_SECTION_SHIFT aligned chunks. The callback
+	 * might decide to not expose all pages to the buddy (e.g., expose them
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
 	 * When using memmap_on_memory, the range might not be aligned to
-	 * MAX_ORDER_NR_PAGES - 1, but pageblock aligned. __ffs() will detect
-	 * this and the first chunk to online will be pageblock_nr_pages.
+	 * PAGES_PER_SECTION - 1, but section size aligned. __ffs() will detect
+	 * this and the first chunk to online will be PAGES_PER_SECTION.
 	 */
 	for (pfn = start_pfn; pfn < end_pfn;) {
-		int order = min(MAX_ORDER - 1UL, __ffs(pfn));
+		int order = min_t(unsigned long, PFN_SECTION_SHIFT, __ffs(pfn));
 
 		(*online_page_callback)(pfn_to_page(pfn), order);
 		pfn += (1UL << order);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 06/15] mm: use PAGES_PER_SECTION instead for mem_map_offset/next().
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (4 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 05/15] mm/memory_hotplug: online pages at " Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 07/15] mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity Zi Yan
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

mem_map is only guaranteed to be virtually contiguous within a
section, so mem_map_offset/next() uses pfn to handle mem_map
discontiguity. Use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES to
describe this condition more precisely.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/internal.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b1001ebeb286..4ca52c696902 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -485,7 +485,7 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
  */
 static inline struct page *mem_map_offset(struct page *base, int offset)
 {
-	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
+	if (unlikely(offset >= PAGES_PER_SECTION))
 		return nth_page(base, offset);
 	return base + offset;
 }
@@ -497,7 +497,7 @@ static inline struct page *mem_map_offset(struct page *base, int offset)
 static inline struct page *mem_map_next(struct page *iter,
 						struct page *base, int offset)
 {
-	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
+	if (unlikely((offset & (PAGES_PER_SECTION - 1)) == 0)) {
 		unsigned long pfn = page_to_pfn(base) + offset;
 		if (!pfn_valid(pfn))
 			return NULL;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 07/15] mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (5 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 06/15] mm: use PAGES_PER_SECTION instead for mem_map_offset/next() Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:02 ` [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period Zi Yan
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

mem_map is only guaranteed to be virtually contiguous within a section.
Use PAGES_PER_SECTION to check the condition properly.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce79d76c42ce..7f78203d6feb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1784,7 +1784,7 @@ pgoff_t hugetlb_basepage_index(struct page *page)
 	pgoff_t index = page_index(page_head);
 	unsigned long compound_idx;
 
-	if (compound_order(page_head) >= MAX_ORDER)
+	if (compound_order(page_head) >= PAGES_PER_SECTION)
 		compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
 	else
 		compound_idx = page - page_head;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (6 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 07/15] mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-07 10:32   ` Mike Rapoport
  2021-08-05 19:02 ` [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES Zi Yan
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Mike Rapoport, Oscar Salvador, Ying Chen, Feng Zhou,
	linux-fsdevel

From: Zi Yan <ziy@nvidia.com>

It keeps the existing behavior after MAX_ORDER is increased beyond
a section size.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Ying Chen <chenying.kernel@bytedance.com>
Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 fs/proc/kcore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 3f148759a5fd..77b7ba48fb44 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -486,7 +486,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			}
 		}
 
-		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
+		if (page_offline_frozen++ % PAGES_PER_SECTION == 0) {
 			page_offline_thaw();
 			cond_resched();
 			page_offline_freeze();
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (7 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-09  7:35   ` David Hildenbrand
  2021-08-05 19:02 ` [RFC PATCH 10/15] virtio: virtio_balloon: " Zi Yan
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Michael S. Tsirkin, Jason Wang, virtualization

From: Zi Yan <ziy@nvidia.com>

It keeps the existing behavior when MAX_ORDER grows beyond a section
size.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/virtio/virtio_mem.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 19036922f7ef..bab5a81fa796 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1105,11 +1105,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
  */
 static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 {
-	const unsigned long max_nr_pages = MAX_ORDER_NR_PAGES;
+	const unsigned long max_nr_pages = PAGES_PER_SECTION;
 	unsigned long i;
 
 	/*
-	 * We are always called at least with MAX_ORDER_NR_PAGES
+	 * We are always called at least with PAGES_PER_SECTION
 	 * granularity/alignment (e.g., the way subblocks work). All pages
 	 * inside such a block are alike.
 	 */
@@ -1125,7 +1125,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 		if (PageDirty(page)) {
 			virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
 						      false);
-			generic_online_page(page, MAX_ORDER - 1);
+			generic_online_page(page, PAGES_PER_SECTION - 1);
 		} else {
 			virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
 						      true);
@@ -1228,7 +1228,7 @@ static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
 		if (vm->in_sbm) {
 			/*
 			 * We exploit here that subblocks have at least
-			 * MAX_ORDER_NR_PAGES size/alignment - so we cannot
+			 * PAGES_PER_SECTION size/alignment - so we cannot
 			 * cross subblocks within one call.
 			 */
 			id = virtio_mem_phys_to_mb_id(addr);
@@ -2438,14 +2438,14 @@ static int virtio_mem_init(struct virtio_mem *vm)
 				      VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
 	/*
-	 * We want subblocks to span at least MAX_ORDER_NR_PAGES and
+	 * We want subblocks to span at least PAGES_PER_SECTION and
 	 * pageblock_nr_pages pages. This:
 	 * - Simplifies our page onlining code (virtio_mem_online_page_cb)
 	 *   and fake page onlining code (virtio_mem_fake_online).
 	 * - Is required for now for alloc_contig_range() to work reliably -
 	 *   it doesn't properly handle smaller granularity on ZONE_NORMAL.
 	 */
-	sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
+	sb_size = max_t(uint64_t, PAGES_PER_SECTION,
 			pageblock_nr_pages) * PAGE_SIZE;
 	sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 10/15] virtio: virtio_balloon: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (8 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-09  7:42   ` David Hildenbrand
  2021-08-05 19:02 ` [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER Zi Yan
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Michael S. Tsirkin, Jason Wang, virtualization

From: Zi Yan <ziy@nvidia.com>

It keeps the existing behavior when MAX_ORDER grows beyond a section.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/virtio/virtio_balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 47dce91f788c..de8d0355d827 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -36,7 +36,7 @@
 #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
 					     __GFP_NOMEMALLOC)
 /* The order of free page blocks to report to host */
-#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1)
+#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (PFN_SECTION_SHIFT - 1)
 /* The size of a free page block in bytes */
 #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
 	(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (9 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 10/15] virtio: virtio_balloon: " Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-09  7:25   ` David Hildenbrand
  2021-08-09 14:08   ` Alexander Duyck
  2021-08-05 19:02 ` [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER Zi Yan
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

page_reporting_order was set to MAX_ORDER, which is never larger than a memory
section. An upcoming change will make MAX_ORDER larger than a memory section, so
set page_reporting_order to PFN_SECTION_SHIFT to match the existing size
assumption.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_reporting.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 382958eef8a9..dc4a2d699862 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,7 +11,8 @@
 #include "page_reporting.h"
 #include "internal.h"
 
-unsigned int page_reporting_order = MAX_ORDER;
+/* Set page_reporting_order at section size */
+unsigned int page_reporting_order = PFN_SECTION_SHIFT;
 module_param(page_reporting_order, uint, 0644);
 MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (10 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-06 15:16   ` Vlastimil Babka
  2021-08-05 19:02 ` [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones Zi Yan
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Kees Cook, Peter Zijlstra, Nicholas Piggin, Thomas Gleixner

From: Zi Yan <ziy@nvidia.com>

With SPARSEMEM_VMEMMAP, all struct pages are virtually contiguous, thus the kernel
can manipulate arbitrarily large pages. By checking PFN validity during the buddy
page merging process, all free pages in the buddy allocator's free areas have
contiguous PFNs even if the system has several physically discontiguous memory
sections. With these two conditions, it is OK to remove the restriction that
MAX_ORDER - 1 + PAGE_SHIFT must not exceed SECTION_SIZE_BITS and to change
MAX_ORDER freely.

Add SET_MAX_ORDER to allow MAX_ORDER adjustment when the arch does not set its own
MAX_ORDER via ARCH_FORCE_MAX_ORDER. Make it depend on SPARSEMEM_VMEMMAP, in which
case MAX_ORDER is not limited by SECTION_SIZE_BITS.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/Kconfig           |  4 ++++
 include/linux/mmzone.h | 14 +++++++++++++-
 mm/Kconfig             | 16 ++++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 01a3f8048cb7..40bd222adeb8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,6 +11,10 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
+config ARCH_FORCE_MAX_ORDER
+    int
+    default "0"
+
 config CRASH_CORE
 	bool
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 04f790ed81b7..322b995942e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -25,10 +25,14 @@
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
+#ifdef CONFIG_SET_MAX_ORDER
+#define MAX_ORDER CONFIG_SET_MAX_ORDER
+#else
 #define MAX_ORDER 11
+#endif /* CONFIG_SET_MAX_ORDER */
 #else
 #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
-#endif
+#endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
 /*
@@ -1245,12 +1249,20 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define SECTION_BLOCKFLAGS_BITS \
 	((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
 
+/*
+ * The MAX_ORDER check is not necessary when CONFIG_SET_MAX_ORDER is set, since
+ * it depends on CONFIG_SPARSEMEM_VMEMMAP, where all struct page are virtually
+ * contiguous, thus > section size pages can be allocated and manipulated
+ * without worrying about non-contiguous struct page.
+ */
+#ifndef CONFIG_SET_MAX_ORDER
 /* NO_MAX_ORDER_CHECK when compiling x64 32bit VDSO for 64bit system */
 #ifndef NO_MAX_ORDER_CHECK
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
 #endif /* NO_MAX_ORDER_CHECK */
+#endif /* CONFIG_SET_MAX_ORDER*/
 
 static inline unsigned long pfn_to_section_nr(unsigned long pfn)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 1f9bd3371765..3a030b439501 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -89,6 +89,22 @@ config SPARSEMEM_VMEMMAP
 	  pfn_to_page and page_to_pfn operations.  This is the most
 	  efficient option when sufficient kernel resources are available.
 
+config SET_MAX_ORDER
+	int "Set maximum order of buddy allocator"
+    depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
+	range 11 255
+	default "11"
+	help
+	  The kernel memory allocator divides physically contiguous memory
+	  blocks into "zones", where each zone is a power of two number of
+	  pages.  This option selects the largest power of two that the kernel
+	  keeps in the memory allocator.  If you need to allocate very large
+	  blocks of physically contiguous memory, then you may need to
+	  increase this value.
+
+	  This config option is actually maximum order plus one. For example,
+	  a value of 11 means that the largest free memory block is 2^10 pages.
+
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (11 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-05 19:16   ` Christian König
  2021-08-05 19:02 ` [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant Zi Yan
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Dave Young, Jonathan Corbet, Christian Koenig, David Airlie,
	kexec, linux-doc, dri-devel

From: Zi Yan <ziy@nvidia.com>

This prepares for the upcoming changes to make MAX_ORDER a boot time
parameter instead of compilation time constant. All static arrays with
MAX_ORDER size are converted to pointers and their memory is allocated
at runtime.

The free_area array in struct zone is allocated using memblock_alloc_node()
at boot time and using kzalloc() when memory is hot-added.

MAX_ORDER in the arm64 nVHE code is independent of the kernel buddy allocator,
so use CONFIG_FORCE_MAX_ZONEORDER there instead.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 .../admin-guide/kdump/vmcoreinfo.rst          |  2 +-
 drivers/gpu/drm/ttm/ttm_device.c              |  7 ++-
 drivers/gpu/drm/ttm/ttm_pool.c                | 58 +++++++++++++++++--
 include/drm/ttm/ttm_pool.h                    |  4 +-
 include/linux/memory_hotplug.h                |  1 +
 include/linux/mmzone.h                        |  2 +-
 mm/memory_hotplug.c                           |  1 +
 mm/page_alloc.c                               | 48 ++++++++++++---
 8 files changed, 104 insertions(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 3861a25faae1..1c9449b9458f 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
 
-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area whose length is MAX_ORDER.
 The free_list represents a linked list of free page blocks.
 
 (list_head, next|prev)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 74e3b460132b..7d994c03fbd0 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -94,7 +94,9 @@ static int ttm_global_init(void)
 		>> PAGE_SHIFT;
 	num_dma32 = min(num_dma32, 2UL << (30 - PAGE_SHIFT));
 
-	ttm_pool_mgr_init(num_pages);
+	ret = ttm_pool_mgr_init(num_pages);
+	if (ret)
+		goto out;
 	ttm_tt_mgr_init(num_pages, num_dma32);
 
 	glob->dummy_read_page = alloc_page(__GFP_ZERO | GFP_DMA32);
@@ -216,7 +218,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
 	bdev->funcs = funcs;
 
 	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, use_dma_alloc, use_dma32);
+	if (ttm_pool_init(&bdev->pool, dev, use_dma_alloc, use_dma32))
+		return -ENOMEM;
 
 	bdev->vma_manager = vma_manager;
 	INIT_DELAYED_WORK(&bdev->wq, ttm_device_delayed_workqueue);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index cb38b1a17b09..ae20c80f14a4 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -64,11 +64,11 @@ module_param(page_pool_size, ulong, 0644);
 
 static atomic_long_t allocated_pages;
 
-static struct ttm_pool_type global_write_combined[MAX_ORDER];
-static struct ttm_pool_type global_uncached[MAX_ORDER];
+static struct ttm_pool_type *global_write_combined;
+static struct ttm_pool_type *global_uncached;
 
-static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER];
-static struct ttm_pool_type global_dma32_uncached[MAX_ORDER];
+static struct ttm_pool_type *global_dma32_write_combined;
+static struct ttm_pool_type *global_dma32_uncached;
 
 static struct mutex shrinker_lock;
 static struct list_head shrinker_list;
@@ -493,8 +493,10 @@ EXPORT_SYMBOL(ttm_pool_free);
  * @use_dma32: true if GFP_DMA32 should be used
  *
  * Initialize the pool and its pool types.
+ *
+ * Returns: 0 on success, negative error code otherwise
  */
-void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
+int ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 		   bool use_dma_alloc, bool use_dma32)
 {
 	unsigned int i, j;
@@ -506,11 +508,30 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 	pool->use_dma32 = use_dma32;
 
 	if (use_dma_alloc) {
-		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
+		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+			pool->caching[i].orders =
+				kzalloc(sizeof(struct ttm_pool_type) * MAX_ORDER,
+					GFP_KERNEL);
+			if (!pool->caching[i].orders) {
+				/* unwind the caching types set up so far */
+				goto failed;
+			}
 			for (j = 0; j < MAX_ORDER; ++j)
 				ttm_pool_type_init(&pool->caching[i].orders[j],
 						   pool, i, j);
+
+		}
+		return 0;
+
+failed:
+		while (i--) {
+			for (j = 0; j < MAX_ORDER; ++j)
+				ttm_pool_type_fini(&pool->caching[i].orders[j]);
+			kfree(pool->caching[i].orders);
+		}
+		return -ENOMEM;
 	}
+	return 0;
 }
 
 /**
@@ -696,6 +717,31 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	mutex_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
+	if (!global_write_combined) {
+		global_write_combined = kzalloc(sizeof(struct ttm_pool_type) * MAX_ORDER,
+						GFP_KERNEL);
+		if (!global_write_combined)
+			return -ENOMEM;
+	}
+	if (!global_uncached) {
+		global_uncached = kzalloc(sizeof(struct ttm_pool_type) * MAX_ORDER,
+					  GFP_KERNEL);
+		if (!global_uncached)
+			return -ENOMEM;
+	}
+	if (!global_dma32_write_combined) {
+		global_dma32_write_combined = kzalloc(sizeof(struct ttm_pool_type) * MAX_ORDER,
+						      GFP_KERNEL);
+		if (!global_dma32_write_combined)
+			return -ENOMEM;
+	}
+	if (!global_dma32_uncached) {
+		global_dma32_uncached = kzalloc(sizeof(struct ttm_pool_type) * MAX_ORDER,
+						GFP_KERNEL);
+		if (!global_dma32_uncached)
+			return -ENOMEM;
+	}
+
 	for (i = 0; i < MAX_ORDER; ++i) {
 		ttm_pool_type_init(&global_write_combined[i], NULL,
 				   ttm_write_combined, i);
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 4321728bdd11..5c09e3cf63ce 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -71,7 +71,7 @@ struct ttm_pool {
 	bool use_dma32;
 
 	struct {
-		struct ttm_pool_type orders[MAX_ORDER];
+		struct ttm_pool_type *orders;
 	} caching[TTM_NUM_CACHING_TYPES];
 };
 
@@ -79,7 +79,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 		   struct ttm_operation_ctx *ctx);
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
-void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
+int ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 		   bool use_dma_alloc, bool use_dma32);
 void ttm_pool_fini(struct ttm_pool *pool);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 97f874a60607..c16aa66db61e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -326,6 +326,7 @@ extern void clear_zone_contiguous(struct zone *zone);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void __ref free_area_init_core_hotplug(int nid);
+extern void __ref free_area_deinit_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory_resource(int nid, struct resource *resource,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 322b995942e5..09aafc05aef4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -609,7 +609,7 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	/* free areas of different sizes */
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	*free_area;
 
 	/* zone flags, see below */
 	unsigned long		flags;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 91ca751ac20c..4ce20b6482aa 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1239,6 +1239,7 @@ static void rollback_node_hotadd(int nid)
 
 	arch_refresh_nodedata(nid, NULL);
 	free_percpu(pgdat->per_cpu_nodestats);
+	free_area_deinit_core_hotplug(nid);
 	arch_free_nodedata(pgdat);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4657009fd4f..bfa6962f7615 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6053,11 +6053,21 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 	for_each_populated_zone(zone) {
 		unsigned int order;
-		unsigned long nr[MAX_ORDER], flags, total = 0;
-		unsigned char types[MAX_ORDER];
+		unsigned long *nr, flags, total = 0;
+		unsigned char *types;
 
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
 			continue;
+
+		nr = kmalloc(sizeof(unsigned long) * MAX_ORDER, GFP_KERNEL);
+		if (!nr)
+			goto skip_zone;
+		types = kmalloc(sizeof(unsigned char) * MAX_ORDER, GFP_KERNEL);
+		if (!types) {
+			kfree(nr);
+			goto skip_zone;
+		}
+
 		show_node(zone);
 		printk(KERN_CONT "%s: ", zone->name);
 
@@ -6083,8 +6093,11 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 				show_migration_types(types[order]);
 		}
 		printk(KERN_CONT "= %lukB\n", K(total));
-	}
 
+		kfree(nr);
+		kfree(types);
+	}
+skip_zone:
 	hugetlb_show_meminfo();
 
 	printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
@@ -7429,8 +7442,8 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	lruvec_init(&pgdat->__lruvec);
 }
 
-static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
-							unsigned long remaining_pages)
+static void __init zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
+							unsigned long remaining_pages, bool hotplug)
 {
 	atomic_long_set(&zone->managed_pages, remaining_pages);
 	zone_set_nid(zone, nid);
@@ -7439,6 +7452,16 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
 	spin_lock_init(&zone->lock);
 	zone_seqlock_init(zone);
 	zone_pcp_init(zone);
+	if (hotplug)
+		zone->free_area =
+			kzalloc_node(sizeof(struct free_area) * MAX_ORDER,
+				     GFP_KERNEL, nid);
+	else
+		zone->free_area =
+			memblock_alloc_node(sizeof(struct free_area) * MAX_ORDER,
+					    sizeof(struct free_area), nid);
+	BUG_ON(!zone->free_area);
+
 }
 
 /*
@@ -7456,7 +7479,18 @@ void __ref free_area_init_core_hotplug(int nid)
 
 	pgdat_init_internals(pgdat);
 	for (z = 0; z < MAX_NR_ZONES; z++)
-		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
+		zone_init_internals(&pgdat->node_zones[z], z, nid, 0, true);
+}
+
+void __ref free_area_deinit_core_hotplug(int nid)
+{
+	enum zone_type z;
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		kfree(pgdat->node_zones[z].free_area);
+		pgdat->node_zones[z].free_area = NULL;
+	}
 }
 #endif
 
@@ -7519,7 +7553,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
 		 * when the bootmem allocator frees pages into the buddy system.
 		 * And all highmem pages will be managed by the buddy system.
 		 */
-		zone_init_internals(zone, j, nid, freesize);
+		zone_init_internals(zone, j, nid, freesize, false);
 
 		if (!size)
 			continue;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (12 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-08  8:23   ` Mike Rapoport
  2021-08-05 19:02 ` [RFC PATCH 15/15] mm: make MAX_ORDER a kernel boot time parameter Zi Yan
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Marc Zyngier, Catalin Marinas, Christoph Lameter, Quentin Perret,
	linux-arm-kernel, kvmarm

From: Zi Yan <ziy@nvidia.com>

For the remaining MAX_ORDER users (described below), there is either no need,
or it is too much hassle, to convert their static arrays to dynamic ones. Add
MIN_MAX_ORDER to serve as a compile time constant in place of MAX_ORDER.

The ARM64 hypervisor maintains its own free page list and does not import
any core kernel symbols, so the soon-to-be runtime variable MAX_ORDER is not
accessible in ARM64 hypervisor code. There is also no need to allocate very
large pages there.

In SLAB/SLOB/SLUB, the 2-D array kmalloc_caches uses MAX_ORDER in its second
dimension. It is too much hassle to allocate memory for kmalloc_caches
before any proper memory allocator is set up.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Quentin Perret <qperret@google.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
 arch/arm64/kvm/hyp/nvhe/page_alloc.c  | 3 ++-
 include/linux/mmzone.h                | 3 +++
 include/linux/slab.h                  | 8 ++++----
 mm/slab.c                             | 2 +-
 mm/slub.c                             | 7 ++++---
 6 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
index fb0f523d1492..c774b4a98336 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -16,7 +16,7 @@ struct hyp_pool {
 	 * API at EL2.
 	 */
 	hyp_spinlock_t lock;
-	struct list_head free_area[MAX_ORDER];
+	struct list_head free_area[MIN_MAX_ORDER];
 	phys_addr_t range_start;
 	phys_addr_t range_end;
 	unsigned short max_order;
diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
index 41fc25bdfb34..a1cc1b648de0 100644
--- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c
+++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
@@ -226,7 +226,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
 	int i;
 
 	hyp_spin_lock_init(&pool->lock);
-	pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
+
+	pool->max_order = min(MIN_MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
 	for (i = 0; i < pool->max_order; i++)
 		INIT_LIST_HEAD(&pool->free_area[i]);
 	pool->range_start = phys;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 09aafc05aef4..379dada82d4b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -27,11 +27,14 @@
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
 #ifdef CONFIG_SET_MAX_ORDER
 #define MAX_ORDER CONFIG_SET_MAX_ORDER
+#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
 #else
 #define MAX_ORDER 11
+#define MIN_MAX_ORDER MAX_ORDER
 #endif /* CONFIG_SET_MAX_ORDER */
 #else
 #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
+#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
 #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2c0d80cca6b8..d8747c158db6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -244,8 +244,8 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
  * to do various tricks to work around compiler limitations in order to
  * ensure proper constant folding.
  */
-#define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
-				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
+#define KMALLOC_SHIFT_HIGH	((MIN_MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
+				(MIN_MAX_ORDER + PAGE_SHIFT - 1) : 25)
 #define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
 #ifndef KMALLOC_SHIFT_LOW
 #define KMALLOC_SHIFT_LOW	5
@@ -258,7 +258,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
  * (PAGE_SIZE*2).  Larger requests are passed to the page allocator.
  */
 #define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
-#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
+#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
 #ifndef KMALLOC_SHIFT_LOW
 #define KMALLOC_SHIFT_LOW	3
 #endif
@@ -271,7 +271,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
  * be allocated from the same page.
  */
 #define KMALLOC_SHIFT_HIGH	PAGE_SHIFT
-#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
+#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
 #ifndef KMALLOC_SHIFT_LOW
 #define KMALLOC_SHIFT_LOW	3
 #endif
diff --git a/mm/slab.c b/mm/slab.c
index d0f725637663..0041de8ec0e9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -466,7 +466,7 @@ static int __init slab_max_order_setup(char *str)
 {
 	get_option(&str, &slab_max_order);
 	slab_max_order = slab_max_order < 0 ? 0 :
-				min(slab_max_order, MAX_ORDER - 1);
+				min(slab_max_order, MIN_MAX_ORDER - 1);
 	slab_max_order_set = true;
 
 	return 1;
diff --git a/mm/slub.c b/mm/slub.c
index b6c5205252eb..228e4a77c678 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3564,8 +3564,9 @@ static inline int calculate_order(unsigned int size)
 	/*
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
-	order = slab_order(size, 1, MAX_ORDER, 1);
-	if (order < MAX_ORDER)
+
+	order = slab_order(size, 1, MIN_MAX_ORDER, 1);
+	if (order < MIN_MAX_ORDER)
 		return order;
 	return -ENOSYS;
 }
@@ -4079,7 +4080,7 @@ __setup("slub_min_order=", setup_slub_min_order);
 static int __init setup_slub_max_order(char *str)
 {
 	get_option(&str, (int *)&slub_max_order);
-	slub_max_order = min(slub_max_order, (unsigned int)MAX_ORDER - 1);
+	slub_max_order = min(slub_max_order, (unsigned int)MIN_MAX_ORDER - 1);
 
 	return 1;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 15/15] mm: make MAX_ORDER a kernel boot time parameter.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (13 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant Zi Yan
@ 2021-08-05 19:02 ` Zi Yan
  2021-08-06 15:36 ` [RFC PATCH 00/15] Make MAX_ORDER adjustable as " Vlastimil Babka
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel, Zi Yan,
	Jonathan Corbet, Paul E. McKenney, Randy Dunlap, Thomas Gleixner,
	linux-doc

From: Zi Yan <ziy@nvidia.com>

With the new buddy_alloc_max_order boot parameter, users can specify a larger
MAX_ORDER than the one set by CONFIG_ARCH_FORCE_MAX_ORDER or CONFIG_SET_MAX_ORDER.
It can be set to any value >= CONFIG_ARCH_FORCE_MAX_ORDER or CONFIG_SET_MAX_ORDER,
but < 256 (limited by vmscan scan_control and the per-cpu free page lists).
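
As an illustration (the numbers are only an example, assuming 4KB base pages):
the default MAX_ORDER of 11 allows buddy blocks of up to 2^10 pages (4MB),
while booting with

	buddy_alloc_max_order=19

allows buddy blocks of up to 2^18 pages (1GB).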

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-doc@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 .../admin-guide/kernel-parameters.txt         |  5 +++
 include/linux/mmzone.h                        | 23 +++++++++++--
 mm/page_alloc.c                               | 34 ++++++++++++++++++-
 mm/vmscan.c                                   |  1 -
 4 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5c59a5fb17c3..a37141aa28ae 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -465,6 +465,11 @@
 	bttv.pll=	See Documentation/admin-guide/media/bttv.rst
 	bttv.tuner=
 
+	buddy_alloc_max_order=	[KNL] This parameter adjusts the size of the largest
+			pages that can be allocated from the kernel buddy allocator.
+			The largest page size is 2^(buddy_alloc_max_order - 1) * PAGE_SIZE.
+			Format: integer
+
 	bulk_remove=off	[PPC]  This parameter disables the use of the pSeries
 			firmware feature for flushing multiple hpte entries
 			at a time.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 379dada82d4b..9ca4d59722a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -26,14 +26,25 @@
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
 #ifdef CONFIG_SET_MAX_ORDER
-#define MAX_ORDER CONFIG_SET_MAX_ORDER
+/* Defined in mm/page_alloc.c */
+extern int buddy_alloc_max_order;
+
+#define MAX_ORDER buddy_alloc_max_order
 #define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
 #else
 #define MAX_ORDER 11
 #define MIN_MAX_ORDER MAX_ORDER
 #endif /* CONFIG_SET_MAX_ORDER */
 #else
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+/* Defined in mm/page_alloc.c */
+extern int buddy_alloc_max_order;
+
+#define MAX_ORDER buddy_alloc_max_order
+#else
 #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
+#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 #define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
 #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
@@ -1557,8 +1568,14 @@ void sparse_init(void);
  * pfn_valid_within() should be used in this case; we optimise this away
  * when we have no holes within a MAX_ORDER_NR_PAGES block.
  */
-#if ((MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
-#define pfn_valid_within(pfn) pfn_valid(pfn)
+#if defined(CONFIG_ARCH_FORCE_MAX_ORDER) || defined(CONFIG_SET_MAX_ORDER)
+static inline bool pfn_valid_within(unsigned long pfn)
+{
+	if ((MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
+		return pfn_valid(pfn);
+
+	return true;
+}
 #else
 #define pfn_valid_within(pfn) (1)
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfa6962f7615..ea6f8d85a4cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1487,7 +1487,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		order = pindex_to_order(pindex);
-		BUILD_BUG_ON(MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH));
 		do {
 			page = list_last_entry(list, struct page, lru);
 			/* must delete to avoid corrupting pcp list */
@@ -9508,3 +9507,36 @@ bool take_page_off_buddy(struct page *page)
 	return ret;
 }
 #endif
+
+#if (defined(CONFIG_ARCH_FORCE_MAX_ORDER) && defined(CONFIG_SPARSEMEM_VMEMMAP)) \
+	|| defined(CONFIG_SET_MAX_ORDER)
+int buddy_alloc_max_order = MIN_MAX_ORDER;
+EXPORT_SYMBOL(buddy_alloc_max_order);
+
+static int __init buddy_alloc_set(char *val)
+{
+	int ret;
+	unsigned long max_order;
+
+	ret = kstrtoul(val, 10, &max_order);
+
+	if (ret < 0)
+		return -EINVAL;
+
+	/*
+	 * max_order is also limited at below locations:
+	 * 1. scan_control in mm/vmscan.c uses s8 field for order, max_order cannot
+	 * be bigger than S8_MAX before the field is changed.
+	 * 2. free_pcppages_bulk has max_order upper limit.
+	 */
+	if (max_order > MIN_MAX_ORDER && max_order < S8_MAX &&
+	    max_order < (1<<NR_PCP_ORDER_WIDTH))
+		buddy_alloc_max_order = max_order;
+	else
+		buddy_alloc_max_order = MIN_MAX_ORDER;
+
+	return 0;
+}
+
+early_param("buddy_alloc_max_order", buddy_alloc_set);
+#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 403a175a720f..9a3963c6166e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3610,7 +3610,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * scan_control uses s8 fields for order, priority, and reclaim_idx.
 	 * Confirm they are large enough for max values.
 	 */
-	BUILD_BUG_ON(MAX_ORDER > S8_MAX);
 	BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
 	BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones.
  2021-08-05 19:02 ` [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones Zi Yan
@ 2021-08-05 19:16   ` Christian König
  2021-08-05 19:58     ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-08-05 19:16 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Dave Young, Jonathan Corbet, David Airlie, kexec, linux-doc,
	dri-devel

Am 05.08.21 um 21:02 schrieb Zi Yan:
> From: Zi Yan <ziy@nvidia.com>
>
> This prepares for the upcoming changes to make MAX_ORDER a boot time
> parameter instead of compilation time constant. All static arrays with
> MAX_ORDER size are converted to pointers and their memory is allocated
> at runtime.

Well, in general I strongly suggest not to use the pattern
kmalloc(sizeof(some struct) * MAX_ORDER, ...); instead use kmalloc_array,
kcalloc, etc.

Then when an array is embedded at the end of a structure you can use a
trailing array and the struct_size() macro to determine the allocation size.
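
For instance, a sketch of that suggestion (the container struct below is made
up purely to illustrate struct_size(); it is not an existing TTM type):

	/* allocate the per-order array without open-coding the multiplication */
	global_write_combined = kcalloc(MAX_ORDER, sizeof(*global_write_combined),
					GFP_KERNEL);

	/* hypothetical container with a trailing (flexible) array member */
	struct ttm_order_pools {
		unsigned int nr_orders;
		struct ttm_pool_type pools[];
	};

	struct ttm_order_pools *p;

	p = kzalloc(struct_size(p, pools, MAX_ORDER), GFP_KERNEL);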

Additional to that separating the patch into changes for TTM to make the 
maximum allocation order independent from MAX_ORDER would be rather good 
to have I think.

Regards,
Christian.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones.
  2021-08-05 19:16   ` Christian König
@ 2021-08-05 19:58     ` Zi Yan
  2021-08-06  9:37       ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-05 19:58 UTC (permalink / raw)
  To: Christian König
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Dave Young, Jonathan Corbet, David Airlie, kexec,
	linux-doc, dri-devel

On 5 Aug 2021, at 15:16, Christian König wrote:

> Am 05.08.21 um 21:02 schrieb Zi Yan:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> This prepares for the upcoming changes to make MAX_ORDER a boot time
>> parameter instead of compilation time constant. All static arrays with
>> MAX_ORDER size are converted to pointers and their memory is allocated
>> at runtime.
>
> Well, in general I strongly suggest not to use the pattern kmalloc(sizeof(some struct) * MAX_ORDER, ...); instead use kmalloc_array, kcalloc, etc.
>
> Then when an array is embedded at the end of a structure you can use a trailing array and the struct_size() macro to determine the allocation size.

Sure. Will fix it.

>
> Additional to that separating the patch into changes for TTM to make the maximum allocation order independent from MAX_ORDER would be rather good to have I think.

Can you elaborate a little bit more on “make the maximum allocation order independent from MAX_ORDER”? From what I understand of ttm_pool_alloc(), it tries to get num_pages pages by allocating as large pages as possible, starting from MAX_ORDER. What is the rationale behind this algorithm? Why not just call alloc_page(order=0) num_pages times? Is it meant to reduce the number of calls to alloc_page? The allocated pages do not need to get as high as MAX_ORDER, is that the case? If yes, I probably can keep ttm pool as static arrays with length of MIN_MAX_ORDER, which I introduce in Patch 14 as the lower bound of boot time parameter MAX_ORDER. Let me know your thoughts.

Thanks.

—
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones.
  2021-08-05 19:58     ` Zi Yan
@ 2021-08-06  9:37       ` Christian König
  2021-08-06 14:00         ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-08-06  9:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Dave Young, Jonathan Corbet, David Airlie, kexec,
	linux-doc, dri-devel

Am 05.08.21 um 21:58 schrieb Zi Yan:
> On 5 Aug 2021, at 15:16, Christian König wrote:
>
>> Am 05.08.21 um 21:02 schrieb Zi Yan:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> This prepares for the upcoming changes to make MAX_ORDER a boot time
>>> parameter instead of compilation time constant. All static arrays with
>>> MAX_ORDER size are converted to pointers and their memory is allocated
>>> at runtime.
>> Well, in general I strongly suggest not to use the pattern kmalloc(sizeof(some struct) * MAX_ORDER, ...); instead use kmalloc_array, kcalloc, etc.
>>
>> Then when an array is embedded at the end of a structure you can use a trailing array and the struct_size() macro to determine the allocation size.
> Sure. Will fix it.
>
>> Additional to that separating the patch into changes for TTM to make the maximum allocation order independent from MAX_ORDER would be rather good to have I think.
> Can you elaborate a little bit more on “make the maximum allocation order independent from MAX_ORDER”?

My idea was that you have a single patch to give the maximum order as 
parameter to the TTM pool.

>  From what I understand of ttm_pool_alloc(), it tries to get num_pages pages by allocating as large pages as possible, starting from MAX_ORDER. What is the rationale behind this algorithm? Why not just call alloc_page(order=0) num_pages times?

What we do here is essentially transparent huge pages for GPUs.

In opposite to CPU which can usually only use fixed sizes like 4KiB, 
2MiB, 1GiB at least AMD GPUs can use any power of two.

So it makes sense to allocate big pages as much as possible and only 
fallback to 4KiB pages when necessary.
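
Very roughly, the strategy looks like this (a simplified sketch, not the
actual ttm_pool_alloc() code):

	/*
	 * Satisfy a request of num_pages pages, trying the largest usable
	 * order first and falling back to smaller orders on failure.
	 */
	static int alloc_highest_order_first(gfp_t gfp, unsigned long num_pages,
					     unsigned int max_order,
					     struct list_head *allocated)
	{
		while (num_pages) {
			unsigned int order = min_t(unsigned int, max_order,
						   __fls(num_pages));
			struct page *page = alloc_pages(gfp, order);

			if (!page) {
				if (!max_order)
					return -ENOMEM;
				max_order--;	/* fall back to a smaller order */
				continue;
			}
			list_add(&page->lru, allocated);
			num_pages -= 1UL << order;
		}
		return 0;
	}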

Down side is that we potentially exhaust larger orders for other use cases.

> Is it meant to reduce the number of calls to alloc_page?

That is a nice to have side effect, but the performance improvement for 
the TLB is the main reason here.

> The allocated pages do not need to get as high as MAX_ORDER, is that the case?

Actually we would really like to have pages as large as 1GiB for best 
TLB utilization :)

> If yes, I probably can keep ttm pool as static arrays with length of MIN_MAX_ORDER, which I introduce in Patch 14 as the lower bound of boot time parameter MAX_ORDER. Let me know your thoughts.

Well you could do something like MAX_MAX_ORDER with a value of at least 
19 (1GiB with 4KiB pages IIRC). And then limit to the real available 
max_order when you make that a run time option.

Since the array elements consist only of a linked list and a few extra
parameters / pointers we probably won't need more than a page or two for
those.

Regards,
Christian.

>
> Thanks.
>
> —
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones.
  2021-08-06  9:37       ` Christian König
@ 2021-08-06 14:00         ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-06 14:00 UTC (permalink / raw)
  To: Christian König
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Dave Young, Jonathan Corbet, David Airlie, kexec,
	linux-doc, dri-devel

On 6 Aug 2021, at 5:37, Christian König wrote:

> Am 05.08.21 um 21:58 schrieb Zi Yan:
>> On 5 Aug 2021, at 15:16, Christian König wrote:
>>
>>> Am 05.08.21 um 21:02 schrieb Zi Yan:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> This prepares for the upcoming changes to make MAX_ORDER a boot time
>>>> parameter instead of compilation time constant. All static arrays with
>>>> MAX_ORDER size are converted to pointers and their memory is allocated
>>>> at runtime.
>>> Well, in general I strongly suggest not to use the pattern kmalloc(sizeof(some struct) * MAX_ORDER, ...); instead use kmalloc_array, kcalloc, etc.
>>>
>>> Then when an array is embedded at the end of a structure you can use a trailing array and the struct_size() macro to determine the allocation size.
>> Sure. Will fix it.
>>
>>> Additional to that separating the patch into changes for TTM to make the maximum allocation order independent from MAX_ORDER would be rather good to have I think.
>> Can you elaborate a little bit more on “make the maximum allocation order independent from MAX_ORDER”?
>
> My idea was that you have a single patch to give the maximum order as parameter to the TTM pool.

Got it. No problem.
>
>>  From what I understand of ttm_pool_alloc(), it tries to get num_pages pages by allocating as large pages as possible, starting from MAX_ORDER. What is the rationale behind this algorithm? Why not just call alloc_page(order=0) num_pages times?
>
> What we do here is essentially transparent huge pages for GPUs.
>
> In opposite to CPU which can usually only use fixed sizes like 4KiB, 2MiB, 1GiB at least AMD GPUs can use any power of two.

FYI, Matthew Wilcox’s memory folio patchset adds support for arbitrary THP sizes[1]. You might want to check it out. :)

>
> So it makes sense to allocate big pages as much as possible and only fallback to 4KiB pages when necessary.
>
> Down side is that we potentially exhaust larger orders for other use cases.
>
>> Is it meant to reduce the number of calls to alloc_page?
>
> That is a nice to have side effect, but the performance improvement for the TLB is the main reason here.
>
>> The allocated pages do not need to get as high as MAX_ORDER, is that the case?
>
> Actually we would really like to have pages as large as 1GiB for best TLB utilization :)
>
>> If yes, I probably can keep ttm pool as static arrays with length of MIN_MAX_ORDER, which I introduce in Patch 14 as the lower bound of boot time parameter MAX_ORDER. Let me know your thoughts.
>
> Well you could do something like MAX_MAX_ORDER with a value of at least 19 (1GiB with 4KiB pages IIRC). And then limit to the real available max_order when you make that a run time option.
>
> Since the array elements consist only of a linked list and a few extra parameters / pointers we probably won't need more than a page or two for those.

Thanks for your explanation. Now I understand that ttm_pool does want pages as large as possible for performance reasons. I will keep the existing code, so that ttm_pool can get the largest pages from the buddy allocator. I will separate the TTM changes into a single patch as you suggested.


[1] https://lore.kernel.org/linux-mm/20210715033704.692967-1-willy@infradead.org

—
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.
  2021-08-05 19:02 ` [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER Zi Yan
@ 2021-08-06 15:16   ` Vlastimil Babka
  2021-08-06 15:23     ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2021-08-06 15:16 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Kees Cook, Peter Zijlstra,
	Nicholas Piggin, Thomas Gleixner

On 8/5/21 9:02 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> +config SET_MAX_ORDER
> +	int "Set maximum order of buddy allocator"
> +    depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
> +	range 11 255
> +	default "11"
> +	help
> +	  The kernel memory allocator divides physically contiguous memory
> +	  blocks into "zones", where each zone is a power of two number of
> +	  pages.  This option selects the largest power of two that the kernel
> +	  keeps in the memory allocator.  If you need to allocate very large
> +	  blocks of physically contiguous memory, then you may need to
> +	  increase this value.
> +
> +	  This config option is actually maximum order plus one. For example,
> +	  a value of 11 means that the largest free memory block is 2^10 pages.

It's enough that it's confusing for the devs, we could spare the users and add
+1 to the value they specify :)

>  config HAVE_MEMBLOCK_PHYS_MAP
>  	bool
>  
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.
  2021-08-06 15:16   ` Vlastimil Babka
@ 2021-08-06 15:23     ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-06 15:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Kees Cook, Peter Zijlstra, Nicholas Piggin, Thomas Gleixner

On 6 Aug 2021, at 11:16, Vlastimil Babka wrote:

> On 8/5/21 9:02 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>> +config SET_MAX_ORDER
>> +	int "Set maximum order of buddy allocator"
>> +    depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
>> +	range 11 255
>> +	default "11"
>> +	help
>> +	  The kernel memory allocator divides physically contiguous memory
>> +	  blocks into "zones", where each zone is a power of two number of
>> +	  pages.  This option selects the largest power of two that the kernel
>> +	  keeps in the memory allocator.  If you need to allocate very large
>> +	  blocks of physically contiguous memory, then you may need to
>> +	  increase this value.
>> +
>> +	  This config option is actually maximum order plus one. For example,
>> +	  a value of 11 means that the largest free memory block is 2^10 pages.
>
> It's enough that it's confusing for the devs, we could spare the users and add
> +1 to the value they specify :)

Sure. I will change the existing ARCH_FORCE_MAX_ORDER too, otherwise people might
get confused by two different MAX_ORDERs, since this Kconfig option only appears
when ARCH_FORCE_MAX_ORDER is not specified.


—
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (14 preceding siblings ...)
  2021-08-05 19:02 ` [RFC PATCH 15/15] mm: make MAX_ORDER a kernel boot time parameter Zi Yan
@ 2021-08-06 15:36 ` Vlastimil Babka
  2021-08-06 16:16   ` David Hildenbrand
  2021-08-06 16:32 ` Vlastimil Babka
  2021-08-09  7:41 ` David Hildenbrand
  17 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2021-08-06 15:36 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Mike Rapoport

On 8/5/21 9:02 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>

> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> pages across memory sections. The check was removed when ARM64 gets rid of holes
> in zones, but holes can appear in zones again after this patchset.

To me that's most unwelcome resurrection. I kinda missed it was going away and
now I can't even rejoice? I assume the systems that will be bumping max_order
have a lot of memory. Are they going to have many holes? What if we just
sacrificed the memory that would have a hole and don't add it to buddy at all?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 15:36 ` [RFC PATCH 00/15] Make MAX_ORDER adjustable as " Vlastimil Babka
@ 2021-08-06 16:16   ` David Hildenbrand
  2021-08-06 16:54     ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: David Hildenbrand @ 2021-08-06 16:16 UTC (permalink / raw)
  To: Vlastimil Babka, Zi Yan, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Mike Rapoport

On 06.08.21 17:36, Vlastimil Babka wrote:
> On 8/5/21 9:02 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
> 
>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>> in zones, but holes can appear in zones again after this patchset.
> 
> To me that's most unwelcome resurrection. I kinda missed it was going away and
> now I can't even rejoice? I assume the systems that will be bumping max_order
> have a lot of memory. Are they going to have many holes? What if we just
> sacrificed the memory that would have a hole and don't add it to buddy at all?

I think the old implementation was just horrible and the description we 
have here still suffers from that old crap: "but holes can appear in 
zones again". No, it's not related to holes in zones at all. We can have 
MAX_ORDER -1 pages that are partially a hole.

And to be precise, "hole" here means "there is no memmap" and not "there 
is a hole but it has a valid memmap".

But IIRC, under SPARSEMEM we now always have a complete memmap for
complete memory sections (when talking about system RAM; ZONE_DEVICE is
different but we don't really care for now I think).

So instead of introducing what we had before, I think we should look 
into something that doesn't confuse each person that stumbles over it 
out there. What does pfn_valid_within() even mean in the new context? 
pfn_valid() is most probably no longer what we really want, as we're 
dealing with multiple sections that might be online or offline; in the 
old world, this was different, as a MAX_ORDER -1 page was completely 
contained in a memory section that was either online or offline.

I'd imagine something that expresses something different in the context 
of sparsemem:

"Some page orders, such as MAX_ORDER -1, might span multiple memory 
sections. Each memory section has a completely valid memmap if online. 
Memory sections might either be completely online or completely offline. 
pfn_to_online_page() might succeed on one part of a MAX_ORDER - 1 page, 
but not on another part. But it will certainly be consistent within one 
memory section."

Further, as we know that MAX_ORDER -1 and memory sections are a power of 
two, we can actually do a binary search to identify boundaries, instead 
of having to check each and every page in the range.
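
As one rough sketch of how the power-of-two alignment helps (the helper below
is hypothetical; it only relies on the existing PAGES_PER_SECTION and
pfn_to_online_page()):

	/*
	 * A MAX_ORDER - 1 block and a memory section are both powers of two,
	 * so probing one pfn per section is enough; there is no need to look
	 * at every pfn in the range.
	 */
	static bool order_block_fully_online(unsigned long start_pfn,
					     unsigned int order)
	{
		unsigned long pfn, end_pfn = start_pfn + (1UL << order);

		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
			if (!pfn_to_online_page(pfn))
				return false;
		return true;
	}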

Is what I describe the actual reason why we introduce pfn_valid_within() 
? (and might better introduce something new, with a better fitting name?)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (15 preceding siblings ...)
  2021-08-06 15:36 ` [RFC PATCH 00/15] Make MAX_ORDER adjustable as " Vlastimil Babka
@ 2021-08-06 16:32 ` Vlastimil Babka
  2021-08-06 17:19   ` Zi Yan
  2021-08-09  7:41 ` David Hildenbrand
  17 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2021-08-06 16:32 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel

On 8/5/21 9:02 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
> user can change the largest size of pages obtained from buddy allocator. It also
> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
> 
> Motivation
> ===
> 
> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
> after some discussion with David Hildenbrand on what methods should be used for
> allocating gigantic pages[2], since other approaches like using CMA allocator or
> alloc_contig_pages() are regarded as suboptimal.

I see references [1] [2] etc but no list of links at the end, did you
forgot to include?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 16:16   ` David Hildenbrand
@ 2021-08-06 16:54     ` Vlastimil Babka
  2021-08-06 17:08       ` David Hildenbrand
  2021-08-08  7:41       ` Mike Rapoport
  0 siblings, 2 replies; 50+ messages in thread
From: Vlastimil Babka @ 2021-08-06 16:54 UTC (permalink / raw)
  To: David Hildenbrand, Zi Yan, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Mike Rapoport

On 8/6/21 6:16 PM, David Hildenbrand wrote:
> On 06.08.21 17:36, Vlastimil Babka wrote:
>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>
>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>> in zones, but holes can appear in zones again after this patchset.
>>
>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>> now I can't even rejoice? I assume the systems that will be bumping max_order
>> have a lot of memory. Are they going to have many holes? What if we just
>> sacrificed the memory that would have a hole and don't add it to buddy at all?
> 
> I think the old implementation was just horrible and the description we have
> here still suffers from that old crap: "but holes can appear in zones again".
> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
> that are partially a hole.
> 
> And to be precise, "hole" here means "there is no memmap" and not "there is a
> hole but it has a valid memmap".

Yes.

> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
> don't really care for now I think).
> 
> So instead of introducing what we had before, I think we should look into
> something that doesn't confuse each person that stumbles over it out there. What
> does pfn_valid_within() even mean in the new context? pfn_valid() is most
> probably no longer what we really want, as we're dealing with multiple sections
> that might be online or offline; in the old world, this was different, as a
> MAX_ORDER -1 page was completely contained in a memory section that was either
> online or offline.
> 
> I'd imagine something that expresses something different in the context of
> sparsemem:
> 
> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
> Each memory section has a completely valid memmap if online. Memory sections
> might either be completely online or completely offline. pfn_to_online_page()
> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
> it will certainly be consistent within one memory section."
> 
> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
> can actually do a binary search to identify boundaries, instead of having to
> check each and every page in the range.
> 
> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
> might better introduce something new, with a better fitting name?)

What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
we'd call it) into __free_one_page() for performance reasons, and also to
various pfn scanners (compaction) for performance and "I must not forget to
check this, or do I?" confusion reasons. It would be really great if we could
keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
achieve that:

1. we create memmap for MAX_ORDER blocks, pages in sections not online are
marked as reserved or some other state that allows us to do checks such as "is
there a buddy? no" without accessing a missing memmap
2. smaller blocks than MAX_ORDER are not released to buddy allocator

I think 1 would be more work, but less wasteful in the end?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 16:54     ` Vlastimil Babka
@ 2021-08-06 17:08       ` David Hildenbrand
  2021-08-06 18:24         ` Zi Yan
  2021-08-08  7:41       ` Mike Rapoport
  1 sibling, 1 reply; 50+ messages in thread
From: David Hildenbrand @ 2021-08-06 17:08 UTC (permalink / raw)
  To: Vlastimil Babka, Zi Yan, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Mike Rapoport

On 06.08.21 18:54, Vlastimil Babka wrote:
> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>> in zones, but holes can appear in zones again after this patchset.
>>>
>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>> have a lot of memory. Are they going to have many holes? What if we just
>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>
>> I think the old implementation was just horrible and the description we have
>> here still suffers from that old crap: "but holes can appear in zones again".
>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>> that are partially a hole.
>>
>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>> hole but it has a valid memmap".
> 
> Yes.
> 
>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>> don't really care for now I think).
>>
>> So instead of introducing what we had before, I think we should look into
>> something that doesn't confuse each person that stumbles over it out there. What
>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>> probably no longer what we really want, as we're dealing with multiple sections
>> that might be online or offline; in the old world, this was different, as a
>> MAX_ORDER -1 page was completely contained in a memory section that was either
>> online or offline.
>>
>> I'd imagine something that expresses something different in the context of
>> sparsemem:
>>
>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>> Each memory section has a completely valid memmap if online. Memory sections
>> might either be completely online or completely offline. pfn_to_online_page()
>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>> it will certainly be consistent within one memory section."
>>
>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>> can actually do a binary search to identify boundaries, instead of having to
>> check each and every page in the range.
>>
>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>> might better introduce something new, with a better fitting name?)
> 
> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
> we'd call it) into __free_one_page() for performance reasons, and also to
> various pfn scanners (compaction) for performance and "I must not forget to
> check this, or do I?" confusion reasons. It would be really great if we could
> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
> achieve that:
> 
> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
> marked as reserved or some other state that allows us to do checks such as "is
> there a buddy? no" without accessing a missing memmap
> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
> 
> I think 1 would be more work, but less wasteful in the end?

It will end up seriously messing with memory hot(un)plug. It's not 
sufficient if there is a memmap (pfn_valid()), it has to be online 
(pfn_to_online_page()) to actually have a meaning.

So you'd have to  allocate a memmap for all such memory sections, 
initialize it to all PG_Reserved ("memory hole") and mark these memory 
sections online. Further, you need memory block devices that are 
initialized and online.

So far so good, although wasteful. What happens if someone hotplugs a 
memory block that doesn't span a complete MAX_ORDER -1 page? Broken.


The only "workaround" would be requiring that MAX_ORDER - 1 cannot be 
bigger than memory blocks (memory_block_size_bytes()). The memory block 
size determines our hot(un)plug granularity and can (on some archs 
already) be determined at runtime. As both (MAX_ORDER and 
memory_block_size_bytes) would be determined at runtime, for example, by 
an admin explicitly requesting it, this might be feasible.
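
The check itself would be trivial. An untested sketch, relying only on 
the existing memory_block_size_bytes() (the helper name is made up):

static bool max_order_fits_memory_block(unsigned int max_order)
{
        /*
         * A MAX_ORDER - 1 page must not be larger than the hot(un)plug
         * granularity, otherwise a single hot(un)plugged memory block
         * could carve up a buddy page.
         */
        return ((u64)PAGE_SIZE << (max_order - 1)) <= memory_block_size_bytes();
}

and the requested boot-time max_order would simply be rejected (or 
clamped) when this returns false.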


Memory hot(un)plug / onlining / offlining would most probably work 
naturally (although the hot(un)plug granularity is then limited to e.g., 
1GiB memory blocks). But if that's what an admin requests on the command 
line, so be it.

What might need some thought, though, is having overlapping 
sections/such memory blocks with devmem. Sub-section hotadd has to 
continue working unless we want to break some PMEM devices seriously.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 16:32 ` Vlastimil Babka
@ 2021-08-06 17:19   ` Zi Yan
  2021-08-06 20:27     ` Hugh Dickins
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-06 17:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

On 6 Aug 2021, at 12:32, Vlastimil Babka wrote:

> On 8/5/21 9:02 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
>> user can change the largest size of pages obtained from buddy allocator. It also
>> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
>> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
>> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
>>
>> Motivation
>> ===
>>
>> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
>> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
>> after some discussion with David Hildenbrand on what methods should be used for
>> allocating gigantic pages[2], since other approaches like using CMA allocator or
>> alloc_contig_pages() are regarded as suboptimal.
>
> I see references [1] [2] etc but no list of links at the end, did you
> forgot to include?
Ah, I missed that. Sorry.

Here are the links:

[1] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/e132fdd9-65af-1cad-8a6e-71844ebfe6a2@redhat.com/
[3] https://lore.kernel.org/linux-mm/289DA3C0-9AE5-4992-A35A-C13FCE4D8544@nvidia.com/

BTW, [3] is the performance test I did to compare MAX_ORDER=11 and MAX_ORDER=20,
which showed no performance impact of increasing MAX_ORDER.

In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
This patchset is the first step, enabling kernel to allocate 1GB pages, so that
user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
is still limited by memory section size. As a result, I will explore solutions
like having additional larger pageblocks (up to MAX_ORDER) to counter memory
fragmentation. I will discover what else needs to be solved as I gradually improve
1GB PUD THP support.

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 17:08       ` David Hildenbrand
@ 2021-08-06 18:24         ` Zi Yan
  2021-08-09  7:20           ` David Hildenbrand
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-06 18:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, linux-mm, Matthew Wilcox, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 5914 bytes --]

On 6 Aug 2021, at 13:08, David Hildenbrand wrote:

> On 06.08.21 18:54, Vlastimil Babka wrote:
>> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>>> in zones, but holes can appear in zones again after this patchset.
>>>>
>>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>>> have a lot of memory. Are they going to have many holes? What if we just
>>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>>
>>> I think the old implementation was just horrible and the description we have
>>> here still suffers from that old crap: "but holes can appear in zones again".
>>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>>> that are partially a hole.
>>>
>>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>>> hole but it has a valid memmap".
>>
>> Yes.
>>
>>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>>> don't really care for now I think).
>>>
>>> So instead of introducing what we had before, I think we should look into
>>> something that doesn't confuse each person that stumbles over it out there. What
>>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>>> probably no longer what we really want, as we're dealing with multiple sections
>>> that might be online or offline; in the old world, this was different, as a
>>> MAX_ORDER -1 page was completely contained in a memory section that was either
>>> online or offline.
>>>
>>> I'd imagine something that expresses something different in the context of
>>> sparsemem:
>>>
>>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>>> Each memory section has a completely valid memmap if online. Memory sections
>>> might either be completely online or completely offline. pfn_to_online_page()
>>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>>> it will certainly be consistent within one memory section."
>>>
>>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>>> can actually do a binary search to identify boundaries, instead of having to
>>> check each and every page in the range.
>>>
>>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>>> might better introduce something new, with a better fitting name?)
>>
>> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
>> we'd call it) into __free_one_page() for performance reasons, and also to
>> various pfn scanners (compaction) for performance and "I must not forget to
>> check this, or do I?" confusion reasons. It would be really great if we could
>> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
>> achieve that:
>>
>> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
>> marked as reserved or some other state that allows us to do checks such as "is
>> there a buddy? no" without accessing a missing memmap
>> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
>>
>> I think 1 would be more work, but less wasteful in the end?
>
> It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning.
>
> So you'd have to  allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online.
>
> So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken.
>
>
> The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible.
>
>
> Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it.
>
> What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously.

Thanks a lot for your valuable inputs!

Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allows memmapping only present PFNs, right? Also it requires much more intrusive changes, which might not be accepted easily.

I will look into the cost of the added pfn checks and try to optimize them. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone. Maybe I just need to store and check the PFN range of a zone instead of checking memory section validity, and update the zone PFN range during memory hot(un)plug. For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right?
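
Roughly what I have in mind (an untested sketch; it assumes non-present PFNs really only occur outside the zone span, and buddy_pfn_exists() is just a made-up name) is to reuse the existing zone_spans_pfn() helper in the merge path:

static inline bool buddy_pfn_exists(struct zone *zone, unsigned long pfn)
{
        /* Cheap span check instead of a per-section memmap lookup. */
        return zone_spans_pfn(zone, pfn);
}

so that the buddy merging loop would bail out with something like "if (!buddy_pfn_exists(zone, buddy_pfn)) break;".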


—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 17:19   ` Zi Yan
@ 2021-08-06 20:27     ` Hugh Dickins
  2021-08-06 21:26       ` Zi Yan
  2021-08-07  1:10       ` Matthew Wilcox
  0 siblings, 2 replies; 50+ messages in thread
From: Hugh Dickins @ 2021-08-06 20:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: Vlastimil Babka, David Hildenbrand, linux-mm, Matthew Wilcox,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel

On Fri, 6 Aug 2021, Zi Yan wrote:
> 
> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> is still limited by memory section size. As a result, I will explore solutions
> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> fragmentation. I will discover what else needs to be solved as I gradually improve
> 1GB PUD THP support.

Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
continue to give us more than enough trouble.  Complicating the kernel's
mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
understand that it's an appealing personal project; but for the sake of
all the rest of us, please leave 1GB huge pages to hugetlbfs (until
the day when we are all using 2MB base pages).

Hugh

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 20:27     ` Hugh Dickins
@ 2021-08-06 21:26       ` Zi Yan
  2021-08-09  4:04         ` Hugh Dickins
  2021-08-07  1:10       ` Matthew Wilcox
  1 sibling, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-06 21:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, David Hildenbrand, linux-mm, Matthew Wilcox,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Roman Gushchin

[-- Attachment #1: Type: text/plain, Size: 2862 bytes --]

On 6 Aug 2021, at 16:27, Hugh Dickins wrote:

> On Fri, 6 Aug 2021, Zi Yan wrote:
>>
>> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
>> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
>> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
>> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
>> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
>> is still limited by memory section size. As a result, I will explore solutions
>> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
>> fragmentation. I will discover what else needs to be solved as I gradually improve
>> 1GB PUD THP support.
>
> Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> continue to give us more than enough trouble.  Complicating the kernel's
> mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> understand that it's an appealing personal project; but for the sake of
> of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> the day when we are all using 2MB base pages).

I do not agree with you. 2MB THP provides good performance, while letting us
keep using 4KB base pages. The 2MB THP implementation is the price we pay
to get the performance. This patchset removes the tie between MAX_ORDER
and section size to allow >2MB page allocation, which is useful in many
places. 1GB THP is one of the users. Gigantic pages also improve
device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
1GB pages[1], which I just learnt). Also could you point out which part
of my patchset complicates kernel’s mm? I could try to simplify it if
possible.

In addition, I am not sure hugetlbfs is the way to go. THP is managed by
core mm, whereas hugetlbfs has its own code for memory management.
As hugetlbfs gets popular, more core mm functionalities have been
replicated and added to hugetlbfs codebase. It is not a good tradeoff
either. One of the reasons I work on 1GB THP is that Roman from Facebook
explicitly mentioned they want to use THP in place of hugetlbfs[2].

I think it might be more constructive to point out the existing issues
in THP so that we can improve the code together. BTW, I am also working
on simplifying THP code like generalizing THP split[3] and planning to
simplify page table manipulation code by reviving Kirill’s idea[4].

[1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/
[2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[3] https://lwn.net/Articles/837928/
[4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 20:27     ` Hugh Dickins
  2021-08-06 21:26       ` Zi Yan
@ 2021-08-07  1:10       ` Matthew Wilcox
  2021-08-07 21:23         ` Matthew Wilcox
  2021-08-09  4:29         ` Hugh Dickins
  1 sibling, 2 replies; 50+ messages in thread
From: Matthew Wilcox @ 2021-08-07  1:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zi Yan, Vlastimil Babka, David Hildenbrand, linux-mm,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel

On Fri, Aug 06, 2021 at 01:27:27PM -0700, Hugh Dickins wrote:
> On Fri, 6 Aug 2021, Zi Yan wrote:
> > 
> > In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> > This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> > user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> > alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> > fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> > is still limited by memory section size. As a result, I will explore solutions
> > like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> > fragmentation. I will discover what else needs to be solved as I gradually improve
> > 1GB PUD THP support.
> 
> Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> continue to give us more than enough trouble.  Complicating the kernel's
> mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> understand that it's an appealing personal project; but for the sake of
> of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> the day when we are all using 2MB base pages).

I respect your opinion, Hugh.  You, more than most of us, have spent an
inordinate amount of time debugging huge page related issues.  I also
share your misgivings about the potential performance improvements for
1GB pages.  They're too big for all but the most unusual of special cases.
This hasn't been helped by the scarce number of 1GB TLB entries in Intel
CPUs until very recently (and even those are hard to come by today).
I do not think they are of interest for the page cache (as I'm fond of
observing, if you have 7GB/s storage (eg the Samsung 980 Pro), you can
take seven page faults per second).

I am, however, of the opinion that 2MB pages give us so much trouble
because they're so very special.  Few people exercise those code paths and
it's easy to break them without noticing.  This is partly why I want to
do arbitrary-order pages.  If everybody is running with compound pages
all the time, we'll see the corner cases often, and people other than
Hugh, Kirill and Mike will be able to work on them.

Now, I'm not planning on working on arbitrary-order anonymous
pages myself.  I think I have enough to deal with in the page cache &
filesystems.  But I'm happy to help out when I can be useful.  I think
256kB pages are probably optimal at the moment for file-backed memory,
so I'm not planning on exploring the space above PMD_ORDER myself.
But there have already been some important areas of collaboration between
the 1GB effort and the folio effort.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period.
  2021-08-05 19:02 ` [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period Zi Yan
@ 2021-08-07 10:32   ` Mike Rapoport
  2021-08-09 15:45     ` [RFC PATCH 08/15] " Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Mike Rapoport @ 2021-08-07 10:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Oscar Salvador, Ying Chen, Feng Zhou,
	linux-fsdevel

On Thu, Aug 05, 2021 at 03:02:46PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It keeps the existing behavior after MAX_ORDER is increased beyond
> a section size.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Ying Chen <chenying.kernel@bytedance.com>
> Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  fs/proc/kcore.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 3f148759a5fd..77b7ba48fb44 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -486,7 +486,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>  			}
>  		}
>  
> -		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> +		if (page_offline_frozen++ % PAGES_PER_SECTION == 0) {

The behavior changes here. E.g. with default configuration on x86 instead
of cond_resched() every 2M we get cond_resched() every 128M.

I'm not saying it's wrong but at least it deserves an explanation why.

>  			page_offline_thaw();
>  			cond_resched();
>  			page_offline_freeze();
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-07  1:10       ` Matthew Wilcox
@ 2021-08-07 21:23         ` Matthew Wilcox
  2021-08-09  4:29         ` Hugh Dickins
  1 sibling, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2021-08-07 21:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zi Yan, Vlastimil Babka, David Hildenbrand, linux-mm,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel

On Sat, Aug 07, 2021 at 02:10:55AM +0100, Matthew Wilcox wrote:
> This hasn't been helped by the scarce number of 1GB TLB entries in Intel
> CPUs until very recently (and even those are hard to come by today).

Minor correction to this.  I just fixed x86info to work on my Core
i7-1165G7 (Tiger Lake, launched about a year ago) and discovered it has:

 L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries
 L1 Load Only TLB: 1GB pages, fully associative, 8 entries
 L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries

My prior laptop (i7-7500U, Kaby Lake, 2016) has only 4x 1GB TLB entries at
the L1 level, and no support for L2 1GB pages.  So this speaks to Intel
finally taking performance of 1GB TLB entries seriously.  Perhaps more
seriously than they need to for a laptop with 16GB of memory!  There are
Xeon-W versions of this CPU, so I imagine that there are versions which
can support 1TB or more memory.

I still think that 1GB pages are too big for anything but specialist
use cases, but those do exist.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 16:54     ` Vlastimil Babka
  2021-08-06 17:08       ` David Hildenbrand
@ 2021-08-08  7:41       ` Mike Rapoport
  1 sibling, 0 replies; 50+ messages in thread
From: Mike Rapoport @ 2021-08-08  7:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Zi Yan, linux-mm, Matthew Wilcox,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Mike Rapoport

On Fri, Aug 06, 2021 at 06:54:14PM +0200, Vlastimil Babka wrote:
> On 8/6/21 6:16 PM, David Hildenbrand wrote:
> > On 06.08.21 17:36, Vlastimil Babka wrote:
> >> On 8/5/21 9:02 PM, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>
> >>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> >>> pages across memory sections. The check was removed when ARM64 gets rid of holes
> >>> in zones, but holes can appear in zones again after this patchset.
> >>
> >> To me that's most unwelcome resurrection. I kinda missed it was going away and
> >> now I can't even rejoice? I assume the systems that will be bumping max_order
> >> have a lot of memory. Are they going to have many holes? What if we just
> >> sacrificed the memory that would have a hole and don't add it to buddy at all?
> > 
> > I think the old implementation was just horrible and the description we have
> > here still suffers from that old crap: "but holes can appear in zones again".
> > No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
> > that are partially a hole.
> > 
> > And to be precise, "hole" here means "there is no memmap" and not "there is a
> > hole but it has a valid memmap".
> 
> Yes.
> 
> > But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
> > memory sections (when talking about system RAM, ZONE_DEVICE is different but we
> > don't really care for now I think).
> > 
> > So instead of introducing what we had before, I think we should look into
> > something that doesn't confuse each person that stumbles over it out there. What
> > does pfn_valid_within() even mean in the new context? pfn_valid() is most
> > probably no longer what we really want, as we're dealing with multiple sections
> > that might be online or offline; in the old world, this was different, as a
> > MAX_ORDER -1 page was completely contained in a memory section that was either
> > online or offline.
> > 
> > I'd imagine something that expresses something different in the context of
> > sparsemem:
> > 
> > "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
> > Each memory section has a completely valid memmap if online. Memory sections
> > might either be completely online or completely offline. pfn_to_online_page()
> > might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
> > it will certainly be consistent within one memory section."
> > 
> > Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
> > can actually do a binary search to identify boundaries, instead of having to
> > check each and every page in the range.
> > 
> > Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
> > might better introduce something new, with a better fitting name?)
> 
> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
> we'd call it) into __free_one_page() for performance reasons, and also to
> various pfn scanners (compaction) for performance and "I must not forget to
> check this, or do I?" confusion reasons. It would be really great if we could
> keep a guarantee that memmap exists for MAX_ORDER blocks.

Maybe I'm missing something, but what if we use different constants to
define the maximal allocation order the buddy allocator can handle and the
maximal size of a reasonable physically contiguous chunk?
I've skimmed through the uses of MAX_ORDER and it seems that many of those
could use, say, pageblock_order of several megabytes rather than MAX_ORDER.

If we only use MAX_ORDER to denote the maximal allocation order and keep
PFN walkers using the smaller pageblock order, then we'll be able to have a
valid memory map in all cases.
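
A pfn scanner would then look roughly like the untested sketch below (it 
assumes a pageblock never crosses a section and that zone_start_pfn is 
pageblock aligned; scan_zone_pageblocks() is just a made-up name):

static void scan_zone_pageblocks(struct zone *zone)
{
        unsigned long pfn = zone->zone_start_pfn;
        const unsigned long end_pfn = zone_end_pfn(zone);

        for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
                struct page *page = pfn_to_online_page(pfn);

                /* One check covers the memmap of the whole pageblock. */
                if (!page)
                        continue;

                /* ... process pageblock_nr_pages pages starting at page ... */
        }
}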

Then we can work out how to make migration/compaction etc to process 1G
pages.

> I see two ways to achieve that:
> 
> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
> marked as reserved or some other state that allows us to do checks such as "is
> there a buddy? no" without accessing a missing memmap
> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
> 
> I think 1 would be more work, but less wasteful in the end?

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.
  2021-08-05 19:02 ` [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant Zi Yan
@ 2021-08-08  8:23   ` Mike Rapoport
  2021-08-09 15:35     ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Mike Rapoport @ 2021-08-08  8:23 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Marc Zyngier, Catalin Marinas, Christoph Lameter,
	Quentin Perret, linux-arm-kernel, kvmarm

On Thu, Aug 05, 2021 at 03:02:52PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> For other MAX_ORDER uses (described below), there is no need or too much
> hassle to convert certain static array to dynamic ones. Add
> MIN_MAX_ORDER to serve as compile time constant in place of MAX_ORDER.
> 
> ARM64 hypervisor maintains its own free page list and does not import
> any core kernel symbols, so soon-to-be runtime variable MAX_ORDER is not
> accessible in ARM64 hypervisor code. Also there is no need to allocating
> very large pages.
> 
> In SLAB/SLOB/SLUB, 2-D array kmalloc_caches uses MAX_ORDER in its second
> dimension. It is too much hassle to allocate memory for kmalloc_caches
> before any proper memory allocator is set up.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Quentin Perret <qperret@google.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: kvmarm@lists.cs.columbia.edu
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
>  arch/arm64/kvm/hyp/nvhe/page_alloc.c  | 3 ++-
>  include/linux/mmzone.h                | 3 +++
>  include/linux/slab.h                  | 8 ++++----
>  mm/slab.c                             | 2 +-
>  mm/slub.c                             | 7 ++++---
>  6 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> index fb0f523d1492..c774b4a98336 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> @@ -16,7 +16,7 @@ struct hyp_pool {
>  	 * API at EL2.
>  	 */
>  	hyp_spinlock_t lock;
> -	struct list_head free_area[MAX_ORDER];
> +	struct list_head free_area[MIN_MAX_ORDER];
>  	phys_addr_t range_start;
>  	phys_addr_t range_end;
>  	unsigned short max_order;
> diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> index 41fc25bdfb34..a1cc1b648de0 100644
> --- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> @@ -226,7 +226,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
>  	int i;
>  
>  	hyp_spin_lock_init(&pool->lock);
> -	pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
> +
> +	pool->max_order = min(MIN_MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
>  	for (i = 0; i < pool->max_order; i++)
>  		INIT_LIST_HEAD(&pool->free_area[i]);
>  	pool->range_start = phys;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 09aafc05aef4..379dada82d4b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -27,11 +27,14 @@
>  #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
>  #ifdef CONFIG_SET_MAX_ORDER
>  #define MAX_ORDER CONFIG_SET_MAX_ORDER
> +#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
>  #else
>  #define MAX_ORDER 11
> +#define MIN_MAX_ORDER MAX_ORDER
>  #endif /* CONFIG_SET_MAX_ORDER */
>  #else
>  #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
> +#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
>  #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
>  #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

The end result of this #ifdef explosion looks entirely unreadable:

/* Free memory management - zoned buddy allocator.  */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
#ifdef CONFIG_SET_MAX_ORDER
/* Defined in mm/page_alloc.c */
extern int buddy_alloc_max_order;

#define MAX_ORDER buddy_alloc_max_order
#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
#else
#define MAX_ORDER 11
#define MIN_MAX_ORDER MAX_ORDER
#endif /* CONFIG_SET_MAX_ORDER */
#else

#ifdef CONFIG_SPARSEMEM_VMEMMAP
/* Defined in mm/page_alloc.c */
extern int buddy_alloc_max_order;

#define MAX_ORDER buddy_alloc_max_order
#else
#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif /* CONFIG_ARCH_FORCE_MAX_ORDER */

> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 2c0d80cca6b8..d8747c158db6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -244,8 +244,8 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>   * to do various tricks to work around compiler limitations in order to
>   * ensure proper constant folding.
>   */
> -#define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
> -				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
> +#define KMALLOC_SHIFT_HIGH	((MIN_MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
> +				(MIN_MAX_ORDER + PAGE_SHIFT - 1) : 25)
>  #define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
>  #ifndef KMALLOC_SHIFT_LOW
>  #define KMALLOC_SHIFT_LOW	5
> @@ -258,7 +258,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>   * (PAGE_SIZE*2).  Larger requests are passed to the page allocator.
>   */
>  #define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
> -#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
> +#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
>  #ifndef KMALLOC_SHIFT_LOW
>  #define KMALLOC_SHIFT_LOW	3
>  #endif
> @@ -271,7 +271,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>   * be allocated from the same page.
>   */
>  #define KMALLOC_SHIFT_HIGH	PAGE_SHIFT
> -#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
> +#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
>  #ifndef KMALLOC_SHIFT_LOW
>  #define KMALLOC_SHIFT_LOW	3
>  #endif
> diff --git a/mm/slab.c b/mm/slab.c
> index d0f725637663..0041de8ec0e9 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -466,7 +466,7 @@ static int __init slab_max_order_setup(char *str)
>  {
>  	get_option(&str, &slab_max_order);
>  	slab_max_order = slab_max_order < 0 ? 0 :
> -				min(slab_max_order, MAX_ORDER - 1);
> +				min(slab_max_order, MIN_MAX_ORDER - 1);
>  	slab_max_order_set = true;
>  
>  	return 1;
> diff --git a/mm/slub.c b/mm/slub.c
> index b6c5205252eb..228e4a77c678 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3564,8 +3564,9 @@ static inline int calculate_order(unsigned int size)
>  	/*
>  	 * Doh this slab cannot be placed using slub_max_order.
>  	 */
> -	order = slab_order(size, 1, MAX_ORDER, 1);
> -	if (order < MAX_ORDER)
> +
> +	order = slab_order(size, 1, MIN_MAX_ORDER, 1);
> +	if (order < MIN_MAX_ORDER)
>  		return order;
>  	return -ENOSYS;
>  }
> @@ -4079,7 +4080,7 @@ __setup("slub_min_order=", setup_slub_min_order);
>  static int __init setup_slub_max_order(char *str)
>  {
>  	get_option(&str, (int *)&slub_max_order);
> -	slub_max_order = min(slub_max_order, (unsigned int)MAX_ORDER - 1);
> +	slub_max_order = min(slub_max_order, (unsigned int)MIN_MAX_ORDER - 1);
>  
>  	return 1;
>  }
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 21:26       ` Zi Yan
@ 2021-08-09  4:04         ` Hugh Dickins
  0 siblings, 0 replies; 50+ messages in thread
From: Hugh Dickins @ 2021-08-09  4:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Vlastimil Babka, David Hildenbrand, linux-mm,
	Matthew Wilcox, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel, Roman Gushchin

[-- Attachment #1: Type: text/plain, Size: 3857 bytes --]

On Fri, 6 Aug 2021, Zi Yan wrote:
> On 6 Aug 2021, at 16:27, Hugh Dickins wrote:
> > On Fri, 6 Aug 2021, Zi Yan wrote:
> >>
> >> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> >> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> >> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> >> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> >> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> >> is still limited by memory section size. As a result, I will explore solutions
> >> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> >> fragmentation. I will discover what else needs to be solved as I gradually improve
> >> 1GB PUD THP support.
> >
> > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> > continue to give us more than enough trouble.  Complicating the kernel's
> > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> > understand that it's an appealing personal project; but for the sake of
> > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> > the day when we are all using 2MB base pages).
> 
> I do not agree with you. 2MB THP provides good performance, while letting us
> keep using 4KB base pages. The 2MB THP implementation is the price we pay
> to get the performance. This patchset removes the tie between MAX_ORDER
> and section size to allow >2MB page allocation, which is useful in many
> places. 1GB THP is one of the users. Gigantic pages also improve
> device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
> 1GB pages[1], which I just learnt). Also could you point out which part
> of my patchset complicates kernel’s mm? I could try to simplify it if
> possible.
> 
> In addition, I am not sure hugetlbfs is the way to go. THP is managed by
> core mm, whereas hugetlbfs has its own code for memory management.
> As hugetlbfs gets popular, more core mm functionalities have been
> replicated and added to hugetlbfs codebase. It is not a good tradeoff
> either. One of the reasons I work on 1GB THP is that Roman from Facebook
> explicitly mentioned they want to use THP in place of hugetlbfs[2].
> 
> I think it might be more constructive to point out the existing issues
> in THP so that we can improve the code together. BTW, I am also working
> on simplifying THP code like generalizing THP split[3] and planning to
> simplify page table manipulation code by reviving Kirill’s idea[4].

You may have good reasons for working on huge PUD entry support;
and perhaps we have different understandings of "THP".

Fragmentation: that's what horrifies me about 1GB THP.

The dark side of THP is compaction.  People have put in a lot of effort
to get compaction working as well as it currently does, but getting 512
adjacent 4k pages is not easy.  Getting 512*512 adjacent 4k pages is
very much harder.  Please put in the work on compaction before you
attempt to support 1GB THP.

Related fears: unexpected latencies; unacceptable variance between runs;
frequent rebooting of machines to get back to an unfragmented state;
page table code that most of us will never be in a position to test.

Sorry, no, I'm not reading your patches: that's not personal, it's
just that I've more than enough to do already, and must make choices.

Hugh

> 
> [1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/
> [2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
> [3] https://lwn.net/Articles/837928/
> [4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/
> 
> —
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-07  1:10       ` Matthew Wilcox
  2021-08-07 21:23         ` Matthew Wilcox
@ 2021-08-09  4:29         ` Hugh Dickins
  2021-08-09 11:22           ` Matthew Wilcox
  1 sibling, 1 reply; 50+ messages in thread
From: Hugh Dickins @ 2021-08-09  4:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Zi Yan, Vlastimil Babka, David Hildenbrand,
	linux-mm, Kirill A . Shutemov, Mike Kravetz, Michal Hocko,
	John Hubbard, linux-kernel

On Sat, 7 Aug 2021, Matthew Wilcox wrote:
> 
> I am, however, of the opinion that 2MB pages give us so much trouble
> because they're so very special.  Few people exercise those code paths and
> it's easy to break them without noticing.  This is partly why I want to
> do arbitrary-order pages.  If everybody is running with compound pages
> all the time, we'll see the corner cases often, and people other than
> Hugh, Kirill and Mike will be able to work on them.

I don't entirely agree.  I'm all for your use of compound pages in page
cache, but don't think its problems are representative of the problems
in aiming for a PMD (or PUD) bar, with the weird page table transitions
we expect of "THP" there.

I haven't checked: is your use of compound pages in page cache still
limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures?  When
the others could just as well use compound pages in page cache too.

Hugh

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-06 18:24         ` Zi Yan
@ 2021-08-09  7:20           ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2021-08-09  7:20 UTC (permalink / raw)
  To: Zi Yan
  Cc: Vlastimil Babka, linux-mm, Matthew Wilcox, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Mike Rapoport

On 06.08.21 20:24, Zi Yan wrote:
> On 6 Aug 2021, at 13:08, David Hildenbrand wrote:
> 
>> On 06.08.21 18:54, Vlastimil Babka wrote:
>>> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>>>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>>>> in zones, but holes can appear in zones again after this patchset.
>>>>>
>>>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>>>> have a lot of memory. Are they going to have many holes? What if we just
>>>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>>>
>>>> I think the old implementation was just horrible and the description we have
>>>> here still suffers from that old crap: "but holes can appear in zones again".
>>>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>>>> that are partially a hole.
>>>>
>>>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>>>> hole but it has a valid memmap".
>>>
>>> Yes.
>>>
>>>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>>>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>>>> don't really care for now I think).
>>>>
>>>> So instead of introducing what we had before, I think we should look into
>>>> something that doesn't confuse each person that stumbles over it out there. What
>>>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>>>> probably no longer what we really want, as we're dealing with multiple sections
>>>> that might be online or offline; in the old world, this was different, as a
>>>> MAX_ORDER -1 page was completely contained in a memory section that was either
>>>> online or offline.
>>>>
>>>> I'd imagine something that expresses something different in the context of
>>>> sparsemem:
>>>>
>>>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>>>> Each memory section has a completely valid memmap if online. Memory sections
>>>> might either be completely online or completely offline. pfn_to_online_page()
>>>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>>>> it will certainly be consistent within one memory section."
>>>>
>>>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>>>> can actually do a binary search to identify boundaries, instead of having to
>>>> check each and every page in the range.
>>>>
>>>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>>>> might better introduce something new, with a better fitting name?)
>>>
>>> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
>>> we'd call it) into __free_one_page() for performance reasons, and also to
>>> various pfn scanners (compaction) for performance and "I must not forget to
>>> check this, or do I?" confusion reasons. It would be really great if we could
>>> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
>>> achieve that:
>>>
>>> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
>>> marked as reserved or some other state that allows us to do checks such as "is
>>> there a buddy? no" without accessing a missing memmap
>>> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
>>>
>>> I think 1 would be more work, but less wasteful in the end?
>>
>> It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning.
>>
>> So you'd have to  allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online.
>>
>> So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken.
>>
>>
>> The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible.
>>
>>
>> Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it.
>>
>> What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously.
> 
> Thanks a lot for your valuable inputs!
> 
> Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allow only memmapping present PFNs, right?

Not really. It will only be suboptimal for corner cases.

Except for devmem corner cases, we already always populate the memmap 
for complete memory sections. Now, we would populate the memmap for all 
memory sections spanning a MAX_ORDER - 1 page, if bigger than a section.

Will it matter in practice? I doubt it.

I consider 1 GiB allocations only relevant for really big machines. 
There, we don't really expect to have a lot of random memory holes. On a 
1 TiB machine, with 1 GiB memory blocks and 1 GiB max_order - 1, you 
don't expect to have a completely fragmented memory layout such that 
allocating additional memmap for some memory sections really makes a 
difference.

> Also it requires a lot more intrusive changes, which might not be accepted easily.

I guess it should require quite minimal changes in contrast to what you 
propose. What we would have to do is

a) Check that the configured MAX_ORDER - 1 is effectively not bigger 
than the memory block size

b) Initialize all sections spanning a MAX_ORDER - 1 during boot, we 
won't even have to mess with memory blocks et al.

All that's required is parsing/processing early parameters in the right 
order.

That sounds much less intrusive than what you propose. Actually, 
I think what you propose would be an optimization of that approach.


> I will look into the cost of the added pfn checks and try to optimize it. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone, maybe I just need to store and check PFN range of a zone instead of checking memory section validity and modify the zone PFN range during memory hot(un)plug. For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right?

I think we can have quite some crazy "sparse" layouts where you can have 
random holes within a zone, not only at the beginning/end.

Offline pages can be identified using pfn_to_online_page(). You must not 
touch their memmap, not even to check for PageBuddy(). PG_offline is a 
special case where pfn_to_online_page() returns true and the memmap is 
valid, however, the pages are logically offline and  might get logically 
onlined later -- primarily used in virtualized environments, for 
example, with memory ballooning.

You can treat PG_offline pages like they are online, they just are 
accounted differently (!managed) and shouldn't be touched; but 
otherwise, they are just like any other allocated page.
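
IOW, a pfn walker would distinguish the cases roughly like this untested 
snippet (the helper name is made up):

static bool pfn_scannable(unsigned long pfn)
{
        struct page *page = pfn_to_online_page(pfn);

        if (!page)
                return false;   /* offline section: memmap must not be touched */

        if (PageOffline(page))
                return false;   /* memmap valid, but logically offline (e.g., ballooned) */

        return true;            /* an ordinary online page; PageBuddy() etc. are fine */
}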

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-05 19:02 ` [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER Zi Yan
@ 2021-08-09  7:25   ` David Hildenbrand
  2021-08-09 14:12     ` Alexander Duyck
  2021-08-09 14:08   ` Alexander Duyck
  1 sibling, 1 reply; 50+ messages in thread
From: David Hildenbrand @ 2021-08-09  7:25 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Alexander Duyck
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel

On 05.08.21 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> page_reporting_order was set to MAX_ORDER, which is always smaller than
> a memory section size. An upcoming change will make MAX_ORDER larger
> than a memory section size. Set page_reporting_order to
> PFN_SECTION_SHIFT to match existing size assumption.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   mm/page_reporting.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 382958eef8a9..dc4a2d699862 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -11,7 +11,8 @@
>   #include "page_reporting.h"
>   #include "internal.h"
>   
> -unsigned int page_reporting_order = MAX_ORDER;
> +/* Set page_reporting_order at section size */
> +unsigned int page_reporting_order = PFN_SECTION_SHIFT;
>   module_param(page_reporting_order, uint, 0644);
>   MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
>   
> 

If you look closely, this is only a placeholder and will get overwritten 
in page_reporting_register(). I don't recall why we have the module 
parameter at all. Most probably, to adjust the reporting order after we 
already registered a user. Can't we just initialize that to 0 ?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES
  2021-08-05 19:02 ` [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES Zi Yan
@ 2021-08-09  7:35   ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2021-08-09  7:35 UTC (permalink / raw)
  To: Zi Yan, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Michael S. Tsirkin, Jason Wang, virtualization

On 05.08.21 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It keeps the existing behavior when MAX_ORDER grows beyond a section
> size.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   drivers/virtio/virtio_mem.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 19036922f7ef..bab5a81fa796 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1105,11 +1105,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
>    */
>   static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
>   {
> -	const unsigned long max_nr_pages = MAX_ORDER_NR_PAGES;
> +	const unsigned long max_nr_pages = PAGES_PER_SECTION;
>   	unsigned long i;
>   
>   	/*
> -	 * We are always called at least with MAX_ORDER_NR_PAGES
> +	 * We are always called at least with PAGES_PER_SECTION
>   	 * granularity/alignment (e.g., the way subblocks work). All pages
>   	 * inside such a block are alike.
>   	 */
> @@ -1125,7 +1125,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
>   		if (PageDirty(page)) {
>   			virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
>   						      false);
> -			generic_online_page(page, MAX_ORDER - 1);
> +			generic_online_page(page, PAGES_PER_SECTION - 1);
>   		} else {
>   			virtio_mem_clear_fake_offline(pfn + i, max_nr_pages,
>   						      true);
> @@ -1228,7 +1228,7 @@ static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
>   		if (vm->in_sbm) {
>   			/*
>   			 * We exploit here that subblocks have at least
> -			 * MAX_ORDER_NR_PAGES size/alignment - so we cannot
> +			 * PAGES_PER_SECTION size/alignment - so we cannot
>   			 * cross subblocks within one call.
>   			 */
>   			id = virtio_mem_phys_to_mb_id(addr);
> @@ -2438,14 +2438,14 @@ static int virtio_mem_init(struct virtio_mem *vm)
>   				      VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
>   
>   	/*
> -	 * We want subblocks to span at least MAX_ORDER_NR_PAGES and
> +	 * We want subblocks to span at least PAGES_PER_SECTION and
>   	 * pageblock_nr_pages pages. This:
>   	 * - Simplifies our page onlining code (virtio_mem_online_page_cb)
>   	 *   and fake page onlining code (virtio_mem_fake_online).
>   	 * - Is required for now for alloc_contig_range() to work reliably -
>   	 *   it doesn't properly handle smaller granularity on ZONE_NORMAL.
>   	 */
> -	sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
> +	sb_size = max_t(uint64_t, PAGES_PER_SECTION,
>   			pageblock_nr_pages) * PAGE_SIZE;
>   	sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
>   
> 

This is very much completely broken and destroys most of the purpose of 
virtio-mem. It is even broken once MAX_ORDER exceeds a single memory 
section, I think.

Whatever you do, keep virtio-mem working *as is* unless someone 
explicitly sets MAX_ORDER on the command line to something bigger.


virtio-mem will require some minor adjustments once MAX_ORDER_NR_PAGES 
exceeds the memory section size -- the functionality will, however, 
be heavily degraded once you increase MAX_ORDER_NR_PAGES in any way 
(again, which is fine if it's explicitly done by an admin on the command 
line).

As mentioned somewhere else already, we'll have to tackle 
alloc_contig_range() to properly deal with pageblock_order granularity, 
then we can rework virtio-mem code to be based on that instead of 
MAX_ORDER - 1.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
                   ` (16 preceding siblings ...)
  2021-08-06 16:32 ` Vlastimil Babka
@ 2021-08-09  7:41 ` David Hildenbrand
  17 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2021-08-09  7:41 UTC (permalink / raw)
  To: Zi Yan, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel

On 05.08.21 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
> user can change the largest size of pages obtained from buddy allocator. It also
> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
> 
> Motivation
> ===
> 
> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
> after some discussion with David Hildenbrand on what methods should be used for
> allocating gigantic pages[2], since other approaches like using CMA allocator or
> alloc_contig_pages() are regarded as suboptimal.
> 
> This also prevents increasing SECTION_SIZE_BITS when increasing MAX_ORDER, since
> increasing SECTION_SIZE_BITS is not desirable as memory hotadd/hotremove chunk
> size will be increased as well, causing memory management difficulty for VMs.
> 
> In addition, make MAX_ORDER a kernel boot time parameter can enable user to
> adjust buddy allocator without recompiling the kernel for their own needs, so
> that one can still have a small MAX_ORDER if he/she does not need to allocate
> gigantic pages like 1GB PUD THPs.
> 
> Background
> ===
> 
> At the moment, kernel imposes MAX_ORDER - 1 + PAGE_SHFIT < SECTION_SIZE_BITS
> restriction. This prevents buddy allocator merging pages across memory sections,
> as PFNs might not be contiguous and code like page++ would fail. But this would
> not be an issue when SPARSEMEM_VMEMMAP is set, since all struct page are
> virtually contiguous. In addition, as long as buddy allocator checks the PFN
> validity during buddy page merging (done in Patch 3), pages allocated from
> buddy allocator can be manipulated by code like page++.
> 
> 
> Description
> ===
> 
> I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base
> pages. The systems boot and ltp mm test suite finished without issue. Also
> memory hotplug worked on x86_64 when I tested. It definitely needs more tests
> and reviews for other architectures.
> 
> In terms of the concerns on performance degradation if MAX_ORDER is increased,
> I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on
> x86_64 machines and saw no performance difference[3].
> 
> Patch 1 excludes MAX_ORDER check from 32bit vdso compilation. The check uses
> irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The
> exclusion does not break the check in 32bit kernel, since the check will still
> be performed during other kernel component compilation.
> 
> Patch 2 gives FORCE_MAX_ZONEORDER a better name.
> 
> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> pages across memory sections. The check was removed when ARM64 gets rid of holes
> in zones, but holes can appear in zones again after this patchset.
> 
> Patch 4-11 convert the use of MAX_ORDER to SECTION_SIZE_BITS or its derivative
> constants, since these code places use MAX_ORDER as boundary check for
> physically contiguous pages, where SECTION_SIZE_BITS should be used. After this
> patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS, the code can break.
> I separate changes to different patches for easy review and can merge them into
> a single one if that works better.
> 
> Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
> when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.
> 
> Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic
> ones if possible and prepares for making MAX_ORDER a boot time parameter.
> 
> Patch 14 adds a new MIN_MAX_ORDER constant to replace soon-to-be-dynamic
> MAX_ORDER for places where converting static array to dynamic one is causing
> hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB.
> 
> Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.
> 
> 
> Any suggestion and/or comment is welcome. Thanks.
> 
> 
> TODO
> ===
> 
> 1. Redo the performance comparison tests using this patchset to understand the
>     performance implication of changing MAX_ORDER.


2. Make alloc_contig_range() cleanly deal with pageblock_order instead 
of MAX_ORDER - 1 to not force the minimal CMA area size/alignment to be 
e.g., 1 GiB instead of 4 MiB and to keep virtio-mem working as expected.

For virtio-mem, short term this would mean disallowing initialization when an 
incompatible setup (MAX_ORDER_NR_PAGES > PAGES_PER_SECTION) is detected 
and bailing out loudly so that the admin has to fix that on the command line. 
I have optimizing alloc_contig_range() on my todo list, to get rid of 
the MAX_ORDER - 1 dependency in virtio-mem; but I have no idea when I'll 
have time to work on that.
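
As a sketch of that short-term bail-out (illustrative only, placed early 
in virtio_mem_init(), not an actual patch):

        if (MAX_ORDER_NR_PAGES > PAGES_PER_SECTION) {
                dev_err(&vm->vdev->dev,
                        "MAX_ORDER too large for virtio-mem, please adjust it on the kernel command line\n");
                return -EINVAL;
        }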


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 10/15] virtio: virtio_balloon: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES.
  2021-08-05 19:02 ` [RFC PATCH 10/15] virtio: virtio_balloon: " Zi Yan
@ 2021-08-09  7:42   ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2021-08-09  7:42 UTC (permalink / raw)
  To: Zi Yan, linux-mm
  Cc: Matthew Wilcox, Vlastimil Babka, Kirill A . Shutemov,
	Mike Kravetz, Michal Hocko, John Hubbard, linux-kernel,
	Michael S. Tsirkin, Jason Wang, virtualization

On 05.08.21 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It keeps the existing behavior when MAX_ORDER grows beyond a section.
> 

... but it breaks/changes existing behavior if MAX_ORDER is smaller than 
a section?

> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   drivers/virtio/virtio_balloon.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 47dce91f788c..de8d0355d827 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -36,7 +36,7 @@
>   #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
>   					     __GFP_NOMEMALLOC)
>   /* The order of free page blocks to report to host */
> -#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1)
> +#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (PFN_SECTION_SHIFT - 1)
>   /* The size of a free page block in bytes */
>   #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
>   	(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
  2021-08-09  4:29         ` Hugh Dickins
@ 2021-08-09 11:22           ` Matthew Wilcox
  0 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2021-08-09 11:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zi Yan, Vlastimil Babka, David Hildenbrand, linux-mm,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel

On Sun, Aug 08, 2021 at 09:29:29PM -0700, Hugh Dickins wrote:
> On Sat, 7 Aug 2021, Matthew Wilcox wrote:
> > 
> > I am, however, of the opinion that 2MB pages give us so much trouble
> > because they're so very special.  Few people exercise those code paths and
> > it's easy to break them without noticing.  This is partly why I want to
> > do arbitrary-order pages.  If everybody is running with compound pages
> > all the time, we'll see the corner cases often, and people other than
> > Hugh, Kirill and Mike will be able to work on them.
> 
> I don't entirely agree.  I'm all for your use of compound pages in page
> cache, but don't think its problems are representative of the problems
> in aiming for a PMD (or PUD) bar, with the weird page table transitions
> we expect of "THP" there.
> 
> I haven't checked: is your use of compound pages in page cache still
> limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures?  When
> the others could just as well use compound pages in page cache too.

It is no longer gated by whether TRANSPARENT_HUGEPAGE is enabled or not.
That's a relatively recent change, and I can't say that I've tested it.
I'll give it a try today on a PA-RISC system I've booted recently.

One of the followup pieces of work that I hope somebody other than
myself will undertake is using 64KB PTEs on a 4KB PAGE_SIZE ARM/POWER
machine if the stars align.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-05 19:02 ` [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER Zi Yan
  2021-08-09  7:25   ` David Hildenbrand
@ 2021-08-09 14:08   ` Alexander Duyck
  1 sibling, 0 replies; 50+ messages in thread
From: Alexander Duyck @ 2021-08-09 14:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	LKML

On Thu, Aug 5, 2021 at 12:04 PM Zi Yan <zi.yan@sent.com> wrote:
>
> From: Zi Yan <ziy@nvidia.com>
>
> page_reporting_order was set to MAX_ORDER, which is always smaller than
> a memory section size. An upcoming change will make MAX_ORDER larger
> than a memory section size. Set page_reporting_order to
> PFN_SECTION_SHIFT to match existing size assumption.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/page_reporting.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 382958eef8a9..dc4a2d699862 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -11,7 +11,8 @@
>  #include "page_reporting.h"
>  #include "internal.h"
>
> -unsigned int page_reporting_order = MAX_ORDER;
> +/* Set page_reporting_order at section size */
> +unsigned int page_reporting_order = PFN_SECTION_SHIFT;
>  module_param(page_reporting_order, uint, 0644);
>  MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");

The MAX_ORDER assumption is correct. The general idea with this being
set to MAX_ORDER is that the processing runs from page_reporting_order
up to MAX_ORDER. So with it set to MAX_ORDER, page reporting is
disabled.
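
Schematically (report_pages_of_order() is a made-up stand-in for the
per-order work, not an actual function):

        for (order = page_reporting_order; order < MAX_ORDER; order++)
                report_pages_of_order(zone, order);

so with page_reporting_order == MAX_ORDER the loop body never runs.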

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-09  7:25   ` David Hildenbrand
@ 2021-08-09 14:12     ` Alexander Duyck
  2021-08-09 15:08       ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Alexander Duyck @ 2021-08-09 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	LKML

On Mon, Aug 9, 2021 at 12:25 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.08.21 21:02, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> >
> > page_reporting_order was set to MAX_ORDER, which is always smaller than
> > a memory section size. An upcoming change will make MAX_ORDER larger
> > than a memory section size. Set page_reporting_order to
> > PFN_SECTION_SHIFT to match existing size assumption.
> >
> > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >   mm/page_reporting.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > index 382958eef8a9..dc4a2d699862 100644
> > --- a/mm/page_reporting.c
> > +++ b/mm/page_reporting.c
> > @@ -11,7 +11,8 @@
> >   #include "page_reporting.h"
> >   #include "internal.h"
> >
> > -unsigned int page_reporting_order = MAX_ORDER;
> > +/* Set page_reporting_order at section size */
> > +unsigned int page_reporting_order = PFN_SECTION_SHIFT;
> >   module_param(page_reporting_order, uint, 0644);
> >   MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
> >
> >
>
> If you look closely, this is only a placeholder and will get overwritten
> in page_reporting_register(). I don't recall why we have the module
> parameter at all. Most probably, to adjust the reporting order after we
> already registered a user. Can't we just initialize that to 0 ?

Yeah, it is pretty much there for debugging in the event that we are
on an architecture that is misconfigured.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-09 14:12     ` Alexander Duyck
@ 2021-08-09 15:08       ` Zi Yan
  2021-08-09 16:51         ` Alexander Duyck
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2021-08-09 15:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	LKML

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On 9 Aug 2021, at 10:12, Alexander Duyck wrote:

> On Mon, Aug 9, 2021 at 12:25 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.08.21 21:02, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> page_reporting_order was set to MAX_ORDER, which is always smaller than
>>> a memory section size. An upcoming change will make MAX_ORDER larger
>>> than a memory section size. Set page_reporting_order to
>>> PFN_SECTION_SHIFT to match existing size assumption.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: linux-mm@kvack.org
>>> Cc: linux-kernel@vger.kernel.org
>>> ---
>>>   mm/page_reporting.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>>> index 382958eef8a9..dc4a2d699862 100644
>>> --- a/mm/page_reporting.c
>>> +++ b/mm/page_reporting.c
>>> @@ -11,7 +11,8 @@
>>>   #include "page_reporting.h"
>>>   #include "internal.h"
>>>
>>> -unsigned int page_reporting_order = MAX_ORDER;
>>> +/* Set page_reporting_order at section size */
>>> +unsigned int page_reporting_order = PFN_SECTION_SHIFT;
>>>   module_param(page_reporting_order, uint, 0644);
>>>   MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
>>>
>>>
>>
>> If you look closely, this is only a placeholder and will get overwritten
>> in page_reporting_register(). I don't recall why we have the module
>> parameter at all. Most probably, to adjust the reporting order after we
>> already registered a user. Can't we just initialize that to 0 ?
>
> Yeah, it is pretty much there for debugging in the event that we are
> on an architecture that is misconfigured.

MAX_ORDER is changed to a boot time variable in Patch 15, so it cannot be used
to initialize page_reporting_order after that.

Thanks for David’s explanation. I will initialize page_reporting_order to 0
and fix the commit message.

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.
  2021-08-08  8:23   ` Mike Rapoport
@ 2021-08-09 15:35     ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-09 15:35 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Marc Zyngier, Catalin Marinas, Christoph Lameter,
	Quentin Perret, linux-arm-kernel, kvmarm

[-- Attachment #1: Type: text/plain, Size: 7917 bytes --]

On 8 Aug 2021, at 4:23, Mike Rapoport wrote:

> On Thu, Aug 05, 2021 at 03:02:52PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> For other MAX_ORDER uses (described below), there is no need or too much
>> hassle to convert certain static array to dynamic ones. Add
>> MIN_MAX_ORDER to serve as compile time constant in place of MAX_ORDER.
>>
>> ARM64 hypervisor maintains its own free page list and does not import
>> any core kernel symbols, so soon-to-be runtime variable MAX_ORDER is not
>> accessible in ARM64 hypervisor code. Also there is no need to allocate
>> very large pages.
>>
>> In SLAB/SLOB/SLUB, 2-D array kmalloc_caches uses MAX_ORDER in its second
>> dimension. It is too much hassle to allocate memory for kmalloc_caches
>> before any proper memory allocator is set up.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Quentin Perret <qperret@google.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: kvmarm@lists.cs.columbia.edu
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
>>  arch/arm64/kvm/hyp/nvhe/page_alloc.c  | 3 ++-
>>  include/linux/mmzone.h                | 3 +++
>>  include/linux/slab.h                  | 8 ++++----
>>  mm/slab.c                             | 2 +-
>>  mm/slub.c                             | 7 ++++---
>>  6 files changed, 15 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
>> index fb0f523d1492..c774b4a98336 100644
>> --- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
>> +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
>> @@ -16,7 +16,7 @@ struct hyp_pool {
>>  	 * API at EL2.
>>  	 */
>>  	hyp_spinlock_t lock;
>> -	struct list_head free_area[MAX_ORDER];
>> +	struct list_head free_area[MIN_MAX_ORDER];
>>  	phys_addr_t range_start;
>>  	phys_addr_t range_end;
>>  	unsigned short max_order;
>> diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
>> index 41fc25bdfb34..a1cc1b648de0 100644
>> --- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c
>> +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
>> @@ -226,7 +226,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
>>  	int i;
>>
>>  	hyp_spin_lock_init(&pool->lock);
>> -	pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
>> +
>> +	pool->max_order = min(MIN_MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
>>  	for (i = 0; i < pool->max_order; i++)
>>  		INIT_LIST_HEAD(&pool->free_area[i]);
>>  	pool->range_start = phys;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 09aafc05aef4..379dada82d4b 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -27,11 +27,14 @@
>>  #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
>>  #ifdef CONFIG_SET_MAX_ORDER
>>  #define MAX_ORDER CONFIG_SET_MAX_ORDER
>> +#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
>>  #else
>>  #define MAX_ORDER 11
>> +#define MIN_MAX_ORDER MAX_ORDER
>>  #endif /* CONFIG_SET_MAX_ORDER */
>>  #else
>>  #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
>> +#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
>>  #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
>>  #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
>
> The end result of this #ifdef explosion looks entirely unreadable:

Just to clarify the use of #ifdef:

MAX_ORDER is changed to a boot time variable and it cannot be used
for static array declaration, so I added MIN_MAX_ORDER.

At the moment, there are three cases of setting MAX_ORDER and
MIN_MAX_ORDER:

1. CONFIG_ARCH_FORCE_MAX_ORDER tells the MAX_ORDER a certain arch
needs at compile time.
2. CONFIG_SET_MAX_ORDER allows changing MAX_ORDER when an arch
does not set its MAX_ORDER and SPARSEMEM_VMEMMAP is set.
3. With !SPARSEMEM_VMEMMAP and no CONFIG_ARCH_FORCE_MAX_ORDER,
MAX_ORDER is set to 11 by default.

I agree the code is hard to read, I will clean this up.


>
> /* Free memory management - zoned buddy allocator.  */
> #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
> #ifdef CONFIG_SET_MAX_ORDER
> /* Defined in mm/page_alloc.c */
> extern int buddy_alloc_max_order;
>
> #define MAX_ORDER buddy_alloc_max_order
> #define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
> #else
> #define MAX_ORDER 11
> #define MIN_MAX_ORDER MAX_ORDER
> #endif /* CONFIG_SET_MAX_ORDER */
> #else
>
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> /* Defined in mm/page_alloc.c */
> extern int buddy_alloc_max_order;
>
> #define MAX_ORDER buddy_alloc_max_order
> #else
> #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
> #endif /* CONFIG_SPARSEMEM_VMEMMAP */
> #define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
> #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 2c0d80cca6b8..d8747c158db6 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -244,8 +244,8 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>>   * to do various tricks to work around compiler limitations in order to
>>   * ensure proper constant folding.
>>   */
>> -#define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
>> -				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
>> +#define KMALLOC_SHIFT_HIGH	((MIN_MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
>> +				(MIN_MAX_ORDER + PAGE_SHIFT - 1) : 25)
>>  #define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
>>  #ifndef KMALLOC_SHIFT_LOW
>>  #define KMALLOC_SHIFT_LOW	5
>> @@ -258,7 +258,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>>   * (PAGE_SIZE*2).  Larger requests are passed to the page allocator.
>>   */
>>  #define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
>> -#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
>> +#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
>>  #ifndef KMALLOC_SHIFT_LOW
>>  #define KMALLOC_SHIFT_LOW	3
>>  #endif
>> @@ -271,7 +271,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>>   * be allocated from the same page.
>>   */
>>  #define KMALLOC_SHIFT_HIGH	PAGE_SHIFT
>> -#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
>> +#define KMALLOC_SHIFT_MAX	(MIN_MAX_ORDER + PAGE_SHIFT - 1)
>>  #ifndef KMALLOC_SHIFT_LOW
>>  #define KMALLOC_SHIFT_LOW	3
>>  #endif
>> diff --git a/mm/slab.c b/mm/slab.c
>> index d0f725637663..0041de8ec0e9 100644
>> --- a/mm/slab.c
>> +++ b/mm/slab.c
>> @@ -466,7 +466,7 @@ static int __init slab_max_order_setup(char *str)
>>  {
>>  	get_option(&str, &slab_max_order);
>>  	slab_max_order = slab_max_order < 0 ? 0 :
>> -				min(slab_max_order, MAX_ORDER - 1);
>> +				min(slab_max_order, MIN_MAX_ORDER - 1);
>>  	slab_max_order_set = true;
>>
>>  	return 1;
>> diff --git a/mm/slub.c b/mm/slub.c
>> index b6c5205252eb..228e4a77c678 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3564,8 +3564,9 @@ static inline int calculate_order(unsigned int size)
>>  	/*
>>  	 * Doh this slab cannot be placed using slub_max_order.
>>  	 */
>> -	order = slab_order(size, 1, MAX_ORDER, 1);
>> -	if (order < MAX_ORDER)
>> +
>> +	order = slab_order(size, 1, MIN_MAX_ORDER, 1);
>> +	if (order < MIN_MAX_ORDER)
>>  		return order;
>>  	return -ENOSYS;
>>  }
>> @@ -4079,7 +4080,7 @@ __setup("slub_min_order=", setup_slub_min_order);
>>  static int __init setup_slub_max_order(char *str)
>>  {
>>  	get_option(&str, (int *)&slub_max_order);
>> -	slub_max_order = min(slub_max_order, (unsigned int)MAX_ORDER - 1);
>> +	slub_max_order = min(slub_max_order, (unsigned int)MIN_MAX_ORDER - 1);
>>
>>  	return 1;
>>  }
>> -- 
>> 2.30.2
>>
>
> -- 
> Sincerely yours,
> Mike.


—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 08/15] proc: use PAGES_PER_SECTION for page offline checking period.
  2021-08-07 10:32   ` Mike Rapoport
@ 2021-08-09 15:45     ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2021-08-09 15:45 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	linux-kernel, Oscar Salvador, Ying Chen, Feng Zhou,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On 7 Aug 2021, at 6:32, Mike Rapoport wrote:

> On Thu, Aug 05, 2021 at 03:02:46PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It keeps the existing behavior after MAX_ORDER is increased beyond
>> a section size.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Ying Chen <chenying.kernel@bytedance.com>
>> Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  fs/proc/kcore.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>> index 3f148759a5fd..77b7ba48fb44 100644
>> --- a/fs/proc/kcore.c
>> +++ b/fs/proc/kcore.c
>> @@ -486,7 +486,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>  			}
>>  		}
>>
>> -		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
>> +		if (page_offline_frozen++ % PAGES_PER_SECTION == 0) {
>
> The behavior changes here. E.g. with default configuration on x86 instead
> of cond_resched() every 2M we get cond_resched() every 128M.
>
> I'm not saying it's wrong but at least it deserves an explanation why.

Sure. I will also think about whether I should use PAGES_PER_SECTION or pageblock_nr_pages
to replace MAX_ORDER in this and other patches. pageblock_nr_pages will be unchanged,
so at least on x86_64, using pageblock_nr_pages would not change code behavior.

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER.
  2021-08-09 15:08       ` Zi Yan
@ 2021-08-09 16:51         ` Alexander Duyck
  0 siblings, 0 replies; 50+ messages in thread
From: Alexander Duyck @ 2021-08-09 16:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Matthew Wilcox, Vlastimil Babka,
	Kirill A . Shutemov, Mike Kravetz, Michal Hocko, John Hubbard,
	LKML

On Mon, Aug 9, 2021 at 8:08 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 9 Aug 2021, at 10:12, Alexander Duyck wrote:
>
> > On Mon, Aug 9, 2021 at 12:25 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 05.08.21 21:02, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>>
> >>> page_reporting_order was set to MAX_ORDER, which is always smaller than
> >>> a memory section size. An upcoming change will make MAX_ORDER larger
> >>> than a memory section size. Set page_reporting_order to
> >>> PFN_SECTION_SHIFT to match existing size assumption.
> >>>
> >>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>> Cc: David Hildenbrand <david@redhat.com>
> >>> Cc: linux-mm@kvack.org
> >>> Cc: linux-kernel@vger.kernel.org
> >>> ---
> >>>   mm/page_reporting.c | 3 ++-
> >>>   1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> >>> index 382958eef8a9..dc4a2d699862 100644
> >>> --- a/mm/page_reporting.c
> >>> +++ b/mm/page_reporting.c
> >>> @@ -11,7 +11,8 @@
> >>>   #include "page_reporting.h"
> >>>   #include "internal.h"
> >>>
> >>> -unsigned int page_reporting_order = MAX_ORDER;
> >>> +/* Set page_reporting_order at section size */
> >>> +unsigned int page_reporting_order = PFN_SECTION_SHIFT;
> >>>   module_param(page_reporting_order, uint, 0644);
> >>>   MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");
> >>>
> >>>
> >>
> >> If you look closely, this is only a placeholder and will get overwritten
> >> in page_reporting_register(). I don't recall why we have the module
> >> parameter at all. Most probably, to adjust the reporting order after we
> >> already registered a user. Can't we just initialize that to 0 ?
> >
> > Yeah, it is pretty much there for debugging in the event that we are
> > on an architecture that is misconfigured.
>
> MAX_ORDER is changed to a boot time variable in Patch 15, thus cannot be used
> for page_reporting_order initialization after that.
>
> Thanks for David’s explanation. I will initialize page_reporting_order to 0
> and fix the commit message.

Rather than 0 it might be better to use (unsigned)-1 as that would
prevent page reporting from being able to run until the value is
overwritten.
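
E.g. (sketch):

        /* Keep reporting disabled until page_reporting_register() sets it. */
        unsigned int page_reporting_order = (unsigned)-1;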

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-08-09 16:51 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 19:02 [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter Zi Yan
2021-08-05 19:02 ` [RFC PATCH 01/15] arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit vdso Zi Yan
2021-08-05 19:02 ` [RFC PATCH 02/15] arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER Zi Yan
2021-08-05 19:02 ` [RFC PATCH 03/15] mm: check pfn validity when buddy allocator can merge pages across mem sections Zi Yan
2021-08-05 19:02 ` [RFC PATCH 04/15] mm: prevent pageblock size being larger than section size Zi Yan
2021-08-05 19:02 ` [RFC PATCH 05/15] mm/memory_hotplug: online pages at " Zi Yan
2021-08-05 19:02 ` [RFC PATCH 06/15] mm: use PAGES_PER_SECTION instead for mem_map_offset/next() Zi Yan
2021-08-05 19:02 ` [RFC PATCH 07/15] mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity Zi Yan
2021-08-05 19:02 ` [RFC PATCH 08/15] fs: proc: use PAGES_PER_SECTION for page offline checking period Zi Yan
2021-08-07 10:32   ` Mike Rapoport
2021-08-09 15:45     ` [RFC PATCH 08/15] " Zi Yan
2021-08-05 19:02 ` [RFC PATCH 09/15] virtio: virtio_mem: use PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES Zi Yan
2021-08-09  7:35   ` David Hildenbrand
2021-08-05 19:02 ` [RFC PATCH 10/15] virtio: virtio_balloon: " Zi Yan
2021-08-09  7:42   ` David Hildenbrand
2021-08-05 19:02 ` [RFC PATCH 11/15] mm/page_reporting: report pages at section size instead of MAX_ORDER Zi Yan
2021-08-09  7:25   ` David Hildenbrand
2021-08-09 14:12     ` Alexander Duyck
2021-08-09 15:08       ` Zi Yan
2021-08-09 16:51         ` Alexander Duyck
2021-08-09 14:08   ` Alexander Duyck
2021-08-05 19:02 ` [RFC PATCH 12/15] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER Zi Yan
2021-08-06 15:16   ` Vlastimil Babka
2021-08-06 15:23     ` Zi Yan
2021-08-05 19:02 ` [RFC PATCH 13/15] mm: convert MAX_ORDER sized static arrays to dynamic ones Zi Yan
2021-08-05 19:16   ` Christian König
2021-08-05 19:58     ` Zi Yan
2021-08-06  9:37       ` Christian König
2021-08-06 14:00         ` Zi Yan
2021-08-05 19:02 ` [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant Zi Yan
2021-08-08  8:23   ` Mike Rapoport
2021-08-09 15:35     ` Zi Yan
2021-08-05 19:02 ` [RFC PATCH 15/15] mm: make MAX_ORDER a kernel boot time parameter Zi Yan
2021-08-06 15:36 ` [RFC PATCH 00/15] Make MAX_ORDER adjustable as " Vlastimil Babka
2021-08-06 16:16   ` David Hildenbrand
2021-08-06 16:54     ` Vlastimil Babka
2021-08-06 17:08       ` David Hildenbrand
2021-08-06 18:24         ` Zi Yan
2021-08-09  7:20           ` David Hildenbrand
2021-08-08  7:41       ` Mike Rapoport
2021-08-06 16:32 ` Vlastimil Babka
2021-08-06 17:19   ` Zi Yan
2021-08-06 20:27     ` Hugh Dickins
2021-08-06 21:26       ` Zi Yan
2021-08-09  4:04         ` Hugh Dickins
2021-08-07  1:10       ` Matthew Wilcox
2021-08-07 21:23         ` Matthew Wilcox
2021-08-09  4:29         ` Hugh Dickins
2021-08-09 11:22           ` Matthew Wilcox
2021-08-09  7:41 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).