* [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
@ 2016-01-28  6:19 ` Dan Williams
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2016-01-28  6:19 UTC (permalink / raw)
  To: akpm
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, Mel Gorman,
	Mark, Joonsoo Kim, Sudip Mukherjee

ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e. 2 bits in page->flags.  When adding a zone this equation
still needs to be satisfied:

    SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
	  <= BITS_PER_LONG - NR_PAGEFLAGS

ZONE_DEVICE currently tries to satisfy this equation by requiring that
ZONE_DMA be disabled, but this is untenable given generic kernels want
to support ZONE_DEVICE and ZONE_DMA simultaneously.  ZONE_CMA would like
to increase the amount of memory covered per section, but that limits
the minimum granularity at which consecutive memory ranges can be added
via devm_memremap_pages().
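
As a rough illustration of the constraint above, the following sketch adds
up the field widths and reports how many bits are left.  The struct and
helper names are made up for this example, and any values plugged in are
placeholders rather than the widths of a real configuration; the actual
layout logic in include/linux/page-flags-layout.h is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Field widths competing for space in page->flags (all values in bits). */
struct flags_layout {
	int bits_per_long;     /* 32 or 64 */
	int nr_pageflags;      /* bits reserved for the PG_* flags */
	int sections_width;    /* 0 when the section is not stored in page->flags */
	int zones_width;
	int nodes_shift;
	int last_cpupid_shift; /* 0 when the field is not used */
};

/* Bits left over after every field is placed; negative means "does not fit". */
static int spare_bits(const struct flags_layout *l)
{
	assert(l->bits_per_long == 32 || l->bits_per_long == 64);
	return l->bits_per_long - l->nr_pageflags -
	       (l->sections_width + l->zones_width +
		l->nodes_shift + l->last_cpupid_shift);
}

/* The inequality from the changelog, expressed as a predicate. */
static bool layout_fits(const struct flags_layout *l)
{
	return spare_bits(l) >= 0;
}
```

With a layout that is exactly full, growing zones_width from 2 to 3 makes
layout_fits() fail until one bit is given back elsewhere, for example by
lowering nodes_shift by one.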

The trade-off of what is acceptable to sacrifice depends heavily on the
platform.  For example, ZONE_CMA is targeted for 32-bit platforms where
page->flags is constrained, but those platforms likely do not care about
the minimum granularity of memory hotplug.  A big iron machine with 1024
NUMA nodes can likely sacrifice ZONE_DMA where a general purpose
distribution kernel cannot.

CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected
when the number of configured zones exceeds 4.  It documents the
configuration symbols and definitions that get modified when ZONES_WIDTH
is greater than 2.

For now, it steals a bit from NODES_SHIFT.  Later on it can be used to
document the definitions that get modified when a 32-bit configuration
wants more zone bits.

Note that GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build.  We need to be careful to only build the table for zones that
have a corresponding gfp_t flag.  GFP_ZONES_SHIFT is introduced for this
purpose.  This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.
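
To make the GFP_ZONE_TABLE constraint concrete: the table packs 16 entries
into a single long, so the per-entry width is what decides whether it fits.
The helper below is only a sketch with an invented name; it mirrors the
preprocessor check in the patch as a runtime function so it can be tried
outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* One table entry per combination of the four zone-selecting GFP bits. */
#define GFP_TABLE_ENTRIES 16

/* True when 16 entries of gfp_zones_shift bits each fit into one long. */
static bool gfp_table_fits(int gfp_zones_shift, int bits_per_long)
{
	assert(gfp_zones_shift > 0 && bits_per_long > 0);
	return GFP_TABLE_ENTRIES * gfp_zones_shift <= bits_per_long;
}
```

A 2-bit entry fills a 32-bit long exactly (16 * 2 = 32), while a 3-bit entry
needs 48 bits and only fits on 64-bit, which is why the table width has to
stay at 2 for the 32-bit portion of the build.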

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-by: Mark <markk@clara.co.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig                  |    6 ++++--
 include/linux/gfp.h               |   33 ++++++++++++++++++++-------------
 include/linux/page-flags-layout.h |    2 ++
 mm/Kconfig                        |    7 +++++--
 4 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 330e738ccfc1..9dfc52eb3976 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1409,8 +1409,10 @@ config NUMA_EMU
 
 config NODES_SHIFT
 	int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
-	range 1 10
-	default "10" if MAXSMP
+	range 1 10 if !NR_ZONES_EXTENDED
+	range 1 9 if NR_ZONES_EXTENDED
+	default "10" if MAXSMP && !NR_ZONES_EXTENDED
+	default "9" if MAXSMP && NR_ZONES_EXTENDED
 	default "6" if X86_64
 	default "3"
 	depends on NEED_MULTIPLE_NODES
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 28ad5f6494b0..5979c2c80140 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -329,22 +329,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
  *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
  *
- * ZONES_SHIFT must be <= 2 on 32 bit platforms.
+ * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
  */
 
-#if 16 * ZONES_SHIFT > BITS_PER_LONG
-#error ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
+#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
+/* ZONE_DEVICE is not a valid GFP zone specifier */
+#define GFP_ZONES_SHIFT 2
+#else
+#define GFP_ZONES_SHIFT ZONES_SHIFT
+#endif
+
+#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
+#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
 #endif
 
 #define GFP_ZONE_TABLE ( \
-	(ZONE_NORMAL << 0 * ZONES_SHIFT)				      \
-	| (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT)			      \
-	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT)		      \
-	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT)		      \
-	| (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT)			      \
-	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT)	      \
-	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT)   \
-	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT)   \
+	(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)					\
+	| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)			\
+	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)		\
+	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)		      	\
+	| (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)			\
+	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)	\
+	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)	\
+	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)	\
 )
 
 /*
@@ -369,8 +376,8 @@ static inline enum zone_type gfp_zone(gfp_t flags)
 	enum zone_type z;
 	int bit = (__force int) (flags & GFP_ZONEMASK);
 
-	z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
-					 ((1 << ZONES_SHIFT) - 1);
+	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
+					 ((1 << GFP_ZONES_SHIFT) - 1);
 	VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
 	return z;
 }
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index da523661500a..77b078c103b2 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -17,6 +17,8 @@
 #define ZONES_SHIFT 1
 #elif MAX_NR_ZONES <= 4
 #define ZONES_SHIFT 2
+#elif MAX_NR_ZONES <= 8
+#define ZONES_SHIFT 3
 #else
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 97a4e06b15c0..cb5377624df3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -651,8 +651,6 @@ config IDLE_PAGE_TRACKING
 
 config ZONE_DEVICE
 	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
-	default !ZONE_DMA
-	depends on !ZONE_DMA
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on X86_64 #arch_add_memory() comprehends device memory
@@ -666,5 +664,10 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config NR_ZONES_EXTENDED
+	bool
+	default n if !64BIT
+	default y if ZONE_DEVICE && ZONE_DMA && ZONE_DMA32
+
 config FRAME_VECTOR
 	bool

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-01-28  6:19 ` Dan Williams
@ 2016-02-02  5:42   ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2016-02-02  5:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, Mel Gorman,
	Mark, Joonsoo Kim, Sudip Mukherjee

On Wed, 27 Jan 2016 22:19:14 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
> mm zones that are bumping up against the current maximum limit of 4
> zones, i.e. 2 bits in page->flags.  When adding a zone this equation
> still needs to be satisfied:
> 
>     SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
> 	  <= BITS_PER_LONG - NR_PAGEFLAGS
> 
> ZONE_DEVICE currently tries to satisfy this equation by requiring that
> ZONE_DMA be disabled, but this is untenable given generic kernels want
> to support ZONE_DEVICE and ZONE_DMA simultaneously.  ZONE_CMA would like
> to increase the amount of memory covered per section, but that limits
> the minimum granularity at which consecutive memory ranges can be added
> via devm_memremap_pages().
> 
> The trade-off of what is acceptable to sacrifice depends heavily on the
> platform.  For example, ZONE_CMA is targeted for 32-bit platforms where
> page->flags is constrained, but those platforms likely do not care about
> the minimum granularity of memory hotplug.  A big iron machine with 1024
> NUMA nodes can likely sacrifice ZONE_DMA where a general purpose
> distribution kernel cannot.
> 
> CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected
> when the number of configured zones exceeds 4.  It documents the
> configuration symbols and definitions that get modified when ZONES_WIDTH
> is greater than 2.
> 
> For now, it steals a bit from NODES_SHIFT.  Later on it can be used to
> document the definitions that get modified when a 32-bit configuration
> wants more zone bits.

So if you want ZONE_DMA, you're limited to 512 NUMA nodes?

That seems reasonable.

> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1409,8 +1409,10 @@ config NUMA_EMU
>  
>  config NODES_SHIFT
>  	int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
> -	range 1 10
> -	default "10" if MAXSMP
> +	range 1 10 if !NR_ZONES_EXTENDED
> +	range 1 9 if NR_ZONES_EXTENDED
> +	default "10" if MAXSMP && !NR_ZONES_EXTENDED
> +	default "9" if MAXSMP && NR_ZONES_EXTENDED
>  	default "6" if X86_64
>  	default "3"
>  	depends on NEED_MULTIPLE_NODES
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 28ad5f6494b0..5979c2c80140 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -329,22 +329,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>   *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
>   *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
>   *
> - * ZONES_SHIFT must be <= 2 on 32 bit platforms.
> + * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
>   */
>  
> -#if 16 * ZONES_SHIFT > BITS_PER_LONG
> -#error ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
> +#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> +/* ZONE_DEVICE is not a valid GFP zone specifier */
> +#define GFP_ZONES_SHIFT 2
> +#else
> +#define GFP_ZONES_SHIFT ZONES_SHIFT
> +#endif
> +
> +#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
> +#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
>  #endif
>  
>  #define GFP_ZONE_TABLE ( \
> -	(ZONE_NORMAL << 0 * ZONES_SHIFT)				      \
> -	| (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT)			      \
> -	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT)		      \
> -	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT)		      \
> -	| (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT)			      \
> -	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT)	      \
> -	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT)   \
> -	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT)   \
> +	(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)					\
> +	| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)			\
> +	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)		\
> +	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)		      	\
> +	| (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)			\
> +	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)	\
> +	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)	\
> +	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)	\
>  )

Geeze.  Congrats on decrypting this stuff.  I hope.  Do you think it's
possible to comprehensibly document it all for the next poor soul who
ventures into it?

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-02-02  5:42   ` Andrew Morton
@ 2016-02-07  6:10     ` Dan Williams
  -1 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2016-02-07  6:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Dave Hansen, linux-kernel, Linux MM, Mel Gorman,
	Mark, Joonsoo Kim, Sudip Mukherjee

On Mon, Feb 1, 2016 at 9:42 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 27 Jan 2016 22:19:14 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

>>  #define GFP_ZONE_TABLE ( \
>> -     (ZONE_NORMAL << 0 * ZONES_SHIFT)                                      \
>> -     | (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT)                          \
>> -     | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT)                  \
>> -     | (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT)                      \
>> -     | (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT)                       \
>> -     | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT)       \
>> -     | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT)   \
>> -     | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT)   \
>> +     (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)                                    \
>> +     | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)                        \
>> +     | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)                \
>> +     | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)                    \
>> +     | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)                     \
>> +     | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)     \
>> +     | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
>> +     | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
>>  )
>
> Geeze.  Congrats on decrypting this stuff.  I hope.  Do you think it's
> possible to comprehensibly document it all for the next poor soul who
> ventures into it?
>

It is documented, just not included in the diff context.  At least the
existing documentation was enough for me to decipher that my changes
were doing the right thing:

/*
 * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
 * zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT
 * bits long and there are 16 of them to cover all possible combinations of
 * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM.
 *
 * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
 * But GFP_MOVABLE is not only a zone specifier but also an allocation
 * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
 * Only 1 bit of the lowest 3 bits (DMA,DMA32,HIGHMEM) can be set to "1".
 *
 *       bit       result
 *       =================
 *       0x0    => NORMAL
 *       0x1    => DMA or NORMAL
 *       0x2    => HIGHMEM or NORMAL
 *       0x3    => BAD (DMA+HIGHMEM)
 *       0x4    => DMA32 or DMA or NORMAL
 *       0x5    => BAD (DMA+DMA32)
 *       0x6    => BAD (HIGHMEM+DMA32)
 *       0x7    => BAD (HIGHMEM+DMA32+DMA)
 *       0x8    => NORMAL (MOVABLE+0)
 *       0x9    => DMA or NORMAL (MOVABLE+DMA)
 *       0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
 *       0xb    => BAD (MOVABLE+HIGHMEM+DMA)
 *       0xc    => DMA32 (MOVABLE+DMA32)
 *       0xd    => BAD (MOVABLE+DMA32+DMA)
 *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
 *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
 *
 * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
 */
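
For anyone who wants to poke at the lookup outside the kernel, here is a
minimal userspace sketch of the same packed-table idea.  The zone numbering,
the F_* names and lookup_zone() are invented for the example and assume a
64-bit configuration without HIGHMEM; only the bit values and the BAD
combinations are taken from the table in the comment above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zone numbering for a 64-bit config without HIGHMEM. */
enum zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE };

/* Zone-selecting bits, matching the "bit" column of the comment. */
#define F_DMA      0x1u
#define F_HIGHMEM  0x2u
#define F_DMA32    0x4u
#define F_MOVABLE  0x8u

#define SHIFT 2 /* bits per table entry; four zones fit in 2 bits */

/* Place zone z in the table slot selected by flag combination idx. */
#define ENTRY(z, idx) ((uint64_t)(z) << ((idx) * SHIFT))

/* Without HIGHMEM, a HIGHMEM request resolves to NORMAL. */
static const uint64_t zone_table =
	  ENTRY(Z_NORMAL,  0)
	| ENTRY(Z_DMA,     F_DMA)
	| ENTRY(Z_NORMAL,  F_HIGHMEM)
	| ENTRY(Z_DMA32,   F_DMA32)
	| ENTRY(Z_NORMAL,  F_MOVABLE)
	| ENTRY(Z_DMA,     F_MOVABLE | F_DMA)
	| ENTRY(Z_MOVABLE, F_MOVABLE | F_HIGHMEM)
	| ENTRY(Z_DMA32,   F_MOVABLE | F_DMA32);

/* One bit per invalid combination: 0x3, 0x5, 0x6, 0x7, 0xb, 0xd, 0xe, 0xf. */
#define BAD_MASK ((1u << 0x3) | (1u << 0x5) | (1u << 0x6) | (1u << 0x7) | \
		  (1u << 0xb) | (1u << 0xd) | (1u << 0xe) | (1u << 0xf))

/* Shift the selected entry down to bit 0 and mask off its neighbours. */
static enum zone lookup_zone(unsigned int flags)
{
	unsigned int idx = flags & 0xfu;

	assert(!((BAD_MASK >> idx) & 1));
	return (enum zone)((zone_table >> (idx * SHIFT)) & ((1u << SHIFT) - 1));
}
```

The point of the sketch is that the entry width (SHIFT here) only has to
cover the zones that a gfp flag can actually select, independent of how many
zones exist in total.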

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-02-02  5:42   ` Andrew Morton
@ 2016-02-29 12:33     ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2016-02-29 12:33 UTC (permalink / raw)
  To: Andrew Morton, Dan Williams
  Cc: Rik van Riel, Dave Hansen, linux-kernel, linux-mm, Mel Gorman,
	Mark, Joonsoo Kim, Sudip Mukherjee

On 02/02/2016 06:42 AM, Andrew Morton wrote:
> On Wed, 27 Jan 2016 22:19:14 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
>> mm zones that are bumping up against the current maximum limit of 4
>> zones, i.e. 2 bits in page->flags.  When adding a zone this equation
>> still needs to be satisfied:
>>
>>      SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
>> 	  <= BITS_PER_LONG - NR_PAGEFLAGS
>>
>> ZONE_DEVICE currently tries to satisfy this equation by requiring that
>> ZONE_DMA be disabled, but this is untenable given generic kernels want
>> to support ZONE_DEVICE and ZONE_DMA simultaneously.  ZONE_CMA would like
>> to increase the amount of memory covered per section, but that limits
>> the minimum granularity at which consecutive memory ranges can be added
>> via devm_memremap_pages().
>>
>> The trade-off of what is acceptable to sacrifice depends heavily on the
>> platform.  For example, ZONE_CMA is targeted for 32-bit platforms where
>> page->flags is constrained, but those platforms likely do not care about
>> the minimum granularity of memory hotplug.  A big iron machine with 1024
>> numa nodes can likely sacrifice ZONE_DMA where a general purpose
>> distribution kernel can not.
>>
>> CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected
>> when the number of configured zones exceeds 4.  It documents the
>> configuration symbols and definitions that get modified when ZONES_WIDTH
>> is greater than 2.
>>
>> For now, it steals a bit from NODES_SHIFT.  Later on it can be used to
>> document the definitions that get modified when a 32-bit configuration
>> wants more zone bits.
>
> So if you want ZONE_DMA, you're limited to 512 NUMA nodes?
>
> That seems reasonable.

Sorry for the late reply, but it seems that with !SPARSEMEM, or with
SPARSEMEM_VMEMMAP, reducing NUMA nodes isn't even necessary, because
SECTIONS_WIDTH is zero (see the diagrams in linux/page-flags-layout.h).
In my brief tests with a 4.4-based kernel with SPARSEMEM_VMEMMAP, it seems
that with 1024 NUMA nodes and 8192 CPUs, there are still 7 bits left
(i.e. 6 with CONFIG_NR_ZONES_EXTENDED).
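
The arithmetic behind that count can be sketched as follows.  The struct and
function names are hypothetical, and NR_PAGEFLAGS = 24 is an assumption
chosen only so that the numbers reproduce the 7 spare bits mentioned above;
the real value depends on kernel version and config.  LAST_CPUPID_SHIFT is
modelled as 8 PID bits plus log2(NR_CPUS) CPU bits, which is how
page-flags-layout.h appears to define it.

```c
#include <assert.h>

/* Hypothetical container for the widths in the page->flags equation. */
struct flags_layout {
    unsigned bits_per_long;
    unsigned nr_pageflags;      /* assumed value; config dependent */
    unsigned sections_width;
    unsigned zones_width;
    unsigned nodes_shift;
    unsigned last_cpupid_shift;
};

/* Smallest n such that (1 << n) >= count. */
unsigned bits_for(unsigned count)
{
    unsigned n = 0;

    assert(count >= 1 && count <= (1u << 30));
    while ((1u << n) < count)
        n++;
    return n;
}

/* Spare bits in page->flags; negative when the layout does not fit. */
int spare_bits(const struct flags_layout *l)
{
    int used = (int)(l->sections_width + l->zones_width +
                     l->nodes_shift + l->last_cpupid_shift);

    return (int)(l->bits_per_long - l->nr_pageflags) - used;
}

/* The configuration described above: 64-bit, 1024 nodes, 8192 CPUs. */
int example_spare(unsigned zones_width)
{
    struct flags_layout l = {
        .bits_per_long = 64,
        .nr_pageflags = 24,                       /* assumption */
        .sections_width = 0,                      /* SPARSEMEM_VMEMMAP */
        .zones_width = zones_width,
        .nodes_shift = bits_for(1024),            /* 10 */
        .last_cpupid_shift = 8 + bits_for(8192),  /* 8 + 13 */
    };

    return spare_bits(&l);
}
```

Under these assumptions example_spare(2) gives 7 and example_spare(3) gives
6, i.e. one extra zone bit costs exactly one spare bit.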

At the risk of making this even more complex, could the limit also
depend on CONFIG_SPARSEMEM/VMEMMAP to reflect that somehow?

Or does it even make sense to limit the Kconfig choice like this? The same
reduction of bits could be achieved in multiple ways. Fewer CPUs means a
smaller LAST_CPUPID_SHIFT. Disabling NUMA_BALANCING means LAST_CPUPID_SHIFT=0.

Perhaps it would be better (in case things don't fit) to show what uses
how many bits and which config options can be tuned to make it fit?
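
A rough sketch of what such a breakdown could look like is below.  It is
plain userspace C, not a proposal for the actual warning; the struct, the
function names and the example widths are all hypothetical, and the config
options listed per field are only the ones mentioned in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One consumer of page->flags bits (hypothetical structure). */
struct field_use {
    const char *name;   /* page->flags field */
    unsigned bits;      /* width in the current configuration */
    const char *knobs;  /* config options that change the width */
};

/* Writes one line per field into buf and returns the total bits used. */
unsigned report_fields(const struct field_use *f, size_t n,
                       char *buf, size_t len)
{
    unsigned total = 0;
    size_t off = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        total += f[i].bits;
        if (off < len) {
            int w = snprintf(buf + off, len - off, "%-18s %2u bits  (%s)\n",
                             f[i].name, f[i].bits, f[i].knobs);
            if (w > 0)
                off += (size_t)w;
        }
    }
    return total;
}

/* Example widths: VMEMMAP, 5 zones, 1024 nodes, 8192 CPUs (assumed). */
unsigned report_example(char *buf, size_t len)
{
    static const struct field_use fields[] = {
        { "SECTIONS_WIDTH",     0, "SPARSEMEM_VMEMMAP" },
        { "ZONES_WIDTH",        3, "ZONE_DMA, ZONE_DMA32, ZONE_DEVICE" },
        { "NODES_SHIFT",       10, "NODES_SHIFT" },
        { "LAST_CPUPID_SHIFT", 21, "NR_CPUS, NUMA_BALANCING" },
    };

    return report_fields(fields, sizeof(fields) / sizeof(fields[0]),
                         buf, len);
}
```

The caller can compare the returned total against BITS_PER_LONG minus
NR_PAGEFLAGS and print the buffer only when the layout does not fit.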

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-02-29 12:33     ` Vlastimil Babka
@ 2016-02-29 17:55       ` Dan Williams
  -1 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2016-02-29 17:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>
>> On Wed, 27 Jan 2016 22:19:14 -0800 Dan Williams <dan.j.williams@intel.com>
>> wrote:
>>
>>> ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
>>> mm zones that are bumping up against the current maximum limit of 4
>>> zones, i.e. 2 bits in page->flags.  When adding a zone this equation
>>> still needs to be satisfied:
>>>
>>>      SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
>>>           <= BITS_PER_LONG - NR_PAGEFLAGS
>>>
>>> ZONE_DEVICE currently tries to satisfy this equation by requiring that
>>> ZONE_DMA be disabled, but this is untenable given generic kernels want
>>> to support ZONE_DEVICE and ZONE_DMA simultaneously.  ZONE_CMA would like
>>> to increase the amount of memory covered per section, but that limits
>>> the minimum granularity at which consecutive memory ranges can be added
>>> via devm_memremap_pages().
>>>
>>> The trade-off of what is acceptable to sacrifice depends heavily on the
>>> platform.  For example, ZONE_CMA is targeted for 32-bit platforms where
>>> page->flags is constrained, but those platforms likely do not care about
>>> the minimum granularity of memory hotplug.  A big iron machine with 1024
>>> numa nodes can likely sacrifice ZONE_DMA where a general purpose
>>> distribution kernel can not.
>>>
>>> CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected
>>> when the number of configured zones exceeds 4.  It documents the
>>> configuration symbols and definitions that get modified when ZONES_WIDTH
>>> is greater than 2.
>>>
>>> For now, it steals a bit from NODES_SHIFT.  Later on it can be used to
>>> document the definitions that get modified when a 32-bit configuration
>>> wants more zone bits.
>>
>>
>> So if you want ZONE_DMA, you're limited to 512 NUMA nodes?
>>
>> That seems reasonable.
>
>
> Sorry for the late reply, but it seems that with !SPARSEMEM, or with
> SPARSEMEM_VMEMMAP, reducing NUMA nodes isn't even necessary, because
> SECTIONS_WIDTH is zero (see the diagrams in linux/page-flags-layout.h). In
> my brief tests with a 4.4-based kernel with SPARSEMEM_VMEMMAP, it seems that
> with 1024 NUMA nodes and 8192 CPUs, there are still 7 bits left (i.e. 6 with
> CONFIG_NR_ZONES_EXTENDED).
>
> At the risk of making this even more complex, could the limit also depend
> on CONFIG_SPARSEMEM/VMEMMAP to reflect that somehow?

In this case it's already part of the equation because:

config ZONE_DEVICE
       depends on MEMORY_HOTPLUG
       depends on MEMORY_HOTREMOVE

...and those in turn depend on SPARSEMEM.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-02-29 17:55       ` Dan Williams
@ 2016-03-01  0:06         ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2016-03-01  0:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On 29.2.2016 18:55, Dan Williams wrote:
> On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>> So if you want ZONE_DMA, you're limited to 512 NUMA nodes?
>>>
>>> That seems reasonable.
>>
>>
>> Sorry for the late reply, but it seems that with !SPARSEMEM, or with
>> SPARSEMEM_VMEMMAP, reducing NUMA nodes isn't even necessary, because
>> SECTIONS_WIDTH is zero (see the diagrams in linux/page-flags-layout.h). In
>> my brief tests with a 4.4-based kernel with SPARSEMEM_VMEMMAP, it seems that
>> with 1024 NUMA nodes and 8192 CPUs, there are still 7 bits left (i.e. 6 with
>> CONFIG_NR_ZONES_EXTENDED).
>>
>> At the risk of making this even more complex, could the limit also depend
>> on CONFIG_SPARSEMEM/VMEMMAP to reflect that somehow?
> 
> In this case it's already part of the equation because:
> 
> config ZONE_DEVICE
>        depends on MEMORY_HOTPLUG
>        depends on MEMORY_HOTREMOVE
> 
> ...and those in turn depend on SPARSEMEM.

Fine, but then SPARSEMEM_VMEMMAP should still be an available subvariant of
SPARSEMEM with SECTIONS_WIDTH=0.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-03-01  0:06         ` Vlastimil Babka
@ 2016-03-01  2:06           ` Dan Williams
  -1 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2016-03-01  2:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On Mon, Feb 29, 2016 at 4:06 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 29.2.2016 18:55, Dan Williams wrote:
>> On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>>> So if you want ZONE_DMA, you're limited to 512 NUMA nodes?
>>>>
>>>> That seems reasonable.
>>>
>>>
>>> Sorry for the late reply, but it seems that with !SPARSEMEM, or with
>>> SPARSEMEM_VMEMMAP, reducing NUMA nodes isn't even necessary, because
>>> SECTIONS_WIDTH is zero (see the diagrams in linux/page-flags-layout.h). In
>>> my brief tests with a 4.4-based kernel with SPARSEMEM_VMEMMAP, it seems that
>>> with 1024 NUMA nodes and 8192 CPUs, there are still 7 bits left (i.e. 6 with
>>> CONFIG_NR_ZONES_EXTENDED).
>>>
>>> At the risk of making this even more complex, could the limit also depend
>>> on CONFIG_SPARSEMEM/VMEMMAP to reflect that somehow?
>>
>> In this case it's already part of the equation because:
>>
>> config ZONE_DEVICE
>>        depends on MEMORY_HOTPLUG
>>        depends on MEMORY_HOTREMOVE
>>
>> ...and those in turn depend on SPARSEMEM.
>
> Fine, but then SPARSEMEM_VMEMMAP should still be an available subvariant of
> SPARSEMEM with SECTIONS_WIDTH=0.

It should be, but not for the ZONE_DEVICE case.  ZONE_DEVICE depends
on x86_64 which means ZONE_DEVICE also implies SPARSEMEM_VMEMMAP
since:

config ARCH_SPARSEMEM_ENABLE
       def_bool y
       depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
       select SPARSEMEM_STATIC if X86_32
       select SPARSEMEM_VMEMMAP_ENABLE if X86_64

Now, if a future patch wants to reclaim page flags space for other
usages outside of ZONE_DEVICE it can do the work to handle the
SPARSEMEM_VMEMMAP=n case.  I don't see a reason to fold that
distinction into the current patch given the current constraints.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-03-01  2:06           ` Dan Williams
@ 2016-03-01  8:31             ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2016-03-01  8:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On 03/01/2016 03:06 AM, Dan Williams wrote:
> On Mon, Feb 29, 2016 at 4:06 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 29.2.2016 18:55, Dan Williams wrote:
>>> On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>>
>>> In this case it's already part of the equation because:
>>>
>>> config ZONE_DEVICE
>>>        depends on MEMORY_HOTPLUG
>>>        depends on MEMORY_HOTREMOVE
>>>
>>> ...and those in turn depend on SPARSEMEM.
>>
>> Fine, but then SPARSEMEM_VMEMMAP should still be an available subvariant of
>> SPARSEMEM with SECTIONS_WIDTH=0.
>
> It should be, but not for the ZONE_DEVICE case.  ZONE_DEVICE depends
> on x86_64 which means ZONE_DEVICE also implies SPARSEMEM_VMEMMAP
> since:
>
> config ARCH_SPARSEMEM_ENABLE
>         def_bool y
>         depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
>         select SPARSEMEM_STATIC if X86_32
>         select SPARSEMEM_VMEMMAP_ENABLE if X86_64
>
> Now, if a future patch wants to reclaim page flags space for other
> usages outside of ZONE_DEVICE it can do the work to handle the
> SPARSEMEM_VMEMMAP=n case.  I don't see a reason to fold that
> distinction into the current patch given the current constraints.

OK, so that IIUC shows that x86_64 should always be fine without decreasing
the range for NODES_SHIFT? That's basically my point - since there's a
configuration where things don't fit (32bit?), the patch broadly decreases
the range for NODES_SHIFT for everyone, right?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-03-01  8:31             ` Vlastimil Babka
@ 2016-03-01 23:43               ` Dan Williams
  -1 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2016-03-01 23:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On Tue, Mar 1, 2016 at 12:31 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 03/01/2016 03:06 AM, Dan Williams wrote:
>>
>> On Mon, Feb 29, 2016 at 4:06 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>
>>> On 29.2.2016 18:55, Dan Williams wrote:
>>>>
>>>> On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>>
>>>>> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>>>
>>>>
>>>> In this case it's already part of the equation because:
>>>>
>>>> config ZONE_DEVICE
>>>>        depends on MEMORY_HOTPLUG
>>>>        depends on MEMORY_HOTREMOVE
>>>>
>>>> ...and those in turn depend on SPARSEMEM.
>>>
>>>
>>> Fine, but then SPARSEMEM_VMEMMAP should still be an available subvariant
>>> of SPARSEMEM with SECTIONS_WIDTH=0.
>>
>>
>> It should be, but not for the ZONE_DEVICE case.  ZONE_DEVICE depends
>> on x86_64 which means ZONE_DEVICE also implies SPARSEMEM_VMEMMAP
>> since:
>>
>> config ARCH_SPARSEMEM_ENABLE
>>         def_bool y
>>         depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
>>         select SPARSEMEM_STATIC if X86_32
>>         select SPARSEMEM_VMEMMAP_ENABLE if X86_64
>>
>> Now, if a future patch wants to reclaim page flags space for other
>> usages outside of ZONE_DEVICE it can do the work to handle the
>> SPARSEMEM_VMEMMAP=n case.  I don't see a reason to fold that
>> distinction into the current patch given the current constraints.
>
>
> OK, so that IIUC shows that x86_64 should always be fine without decreasing
> the range for NODES_SHIFT? That's basically my point - since there's a
> configuration where things don't fit (32bit?), the patch broadly decreases
> the range for NODES_SHIFT for everyone, right?

So I went hunting for the x86_64 config that sent me off in this
direction in the first place, but I can't reproduce it.  I'm indeed
able to fit ZONE_DEVICE + ZONE_DMA + NODES_SHIFT(10) without
overflowing page flags.  Maybe we reduced some page->flags usage
between 4.3 and 4.5 and I missed it?
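
For reference, the zone count to ZONES_WIDTH mapping at play here can be
sketched as below.  The helper names are hypothetical, and the zone list
(ZONE_NORMAL and ZONE_MOVABLE always present, the others optional) is an
assumption about a typical configuration rather than something taken from
the patch.

```c
#include <assert.h>

/* Bits needed to encode a zone index for nr_zones configured zones. */
unsigned zones_width(unsigned nr_zones)
{
    assert(nr_zones >= 1 && nr_zones <= 8);
    if (nr_zones < 2)
        return 0;
    if (nr_zones <= 2)
        return 1;
    if (nr_zones <= 4)
        return 2;
    return 3;   /* 5..8 zones: the CONFIG_NR_ZONES_EXTENDED case */
}

/* Hypothetical helper: count zones from yes/no config choices. */
unsigned count_zones(int dma, int dma32, int highmem, int device)
{
    /* ZONE_NORMAL and ZONE_MOVABLE are assumed to be always present. */
    return 2u + !!dma + !!dma32 + !!highmem + !!device;
}
```

With ZONE_DMA, ZONE_DMA32 and ZONE_DEVICE all enabled this gives 5 zones and
3 bits; dropping ZONE_DMA brings it back to 4 zones and 2 bits.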

In any event, you're right we can indeed fit ZONE_DEVICE into the
current MAXSMP definition.  I'll respin the patch.

Thanks for probing on this!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] mm: CONFIG_NR_ZONES_EXTENDED
  2016-03-01 23:43               ` Dan Williams
@ 2016-03-02  8:10                 ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2016-03-02  8:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Rik van Riel, Dave Hansen, linux-kernel, Linux MM,
	Mel Gorman, Mark, Joonsoo Kim, Sudip Mukherjee

On 03/02/2016 12:43 AM, Dan Williams wrote:
> On Tue, Mar 1, 2016 at 12:31 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 03/01/2016 03:06 AM, Dan Williams wrote:
>>>
>>> On Mon, Feb 29, 2016 at 4:06 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>
>>>> On 29.2.2016 18:55, Dan Williams wrote:
>>>>>
>>>>> On Mon, Feb 29, 2016 at 4:33 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>>>
>>>>>> On 02/02/2016 06:42 AM, Andrew Morton wrote:
>>>>>
>>>>>
>>>>> In this case it's already part of the equation because:
>>>>>
>>>>> config ZONE_DEVICE
>>>>>         depends on MEMORY_HOTPLUG
>>>>>         depends on MEMORY_HOTREMOVE
>>>>>
>>>>> ...and those in turn depend on SPARSEMEM.
>>>>
>>>>
>>>> Fine, but then SPARSEMEM_VMEMMAP should still be an available subvariant
>>>> of SPARSEMEM with SECTIONS_WIDTH=0.
>>>
>>>
>>> It should be, but not for the ZONE_DEVICE case.  ZONE_DEVICE depends
>>> on x86_64 which means ZONE_DEVICE also implies SPARSEMEM_VMEMMAP
>>> since:
>>>
>>> config ARCH_SPARSEMEM_ENABLE
>>>          def_bool y
>>>          depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
>>>          select SPARSEMEM_STATIC if X86_32
>>>          select SPARSEMEM_VMEMMAP_ENABLE if X86_64
>>>
>>> Now, if a future patch wants to reclaim page flags space for other
>>> usages outside of ZONE_DEVICE it can do the work to handle the
>>> SPARSEMEM_VMEMMAP=n case.  I don't see a reason to fold that
>>> distinction into the current patch given the current constraints.
>>
>>
>> OK, so that IIUC shows that x86_64 should always be fine without decreasing
>> the range for NODES_SHIFT? That's basically my point - since there's a
>> configuration where things don't fit (32bit?), the patch broadly decreases
>> the range for NODES_SHIFT for everyone, right?
>
> So I went hunting for the x86_64 config that sent me off in this
> direction in the first place, but I can't reproduce it.  I'm indeed
> able to fit ZONE_DEVICE + ZONE_DMA + NODES_SHIFT(10) without
> overflowing page flags.  Maybe we reduced some page->flags usage
> between 4.3 and 4.5 and I missed it?

Oh, I think I see it now. SPARSEMEM_VMEMMAP_ENABLE only *allows enabling*
CONFIG_SPARSEMEM_VMEMMAP; it doesn't force it:

config SPARSEMEM_VMEMMAP
         bool "Sparse Memory virtual memmap"
         depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
         default y
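
To make the "allows but doesn't force" distinction concrete, here is a
minimal sketch in C of how a prompted Kconfig bool ends up with its value.
The struct and function names are hypothetical and exist only for this
illustration; this is a simplified model, not the actual Kconfig
implementation.

```c
#include <stdbool.h>

/* Hypothetical model of a prompted Kconfig bool such as SPARSEMEM_VMEMMAP. */
struct kconfig_bool {
	bool deps_met;   /* every "depends on" symbol evaluates to y */
	bool user_set;   /* the user answered the prompt explicitly */
	bool user_value; /* the explicit answer, meaningful only if user_set */
	bool def;        /* the value given by "default" */
};

/*
 * Unmet dependencies force the symbol to n. Met dependencies only make
 * the symbol available: an explicit user answer wins over the default.
 */
static bool kconfig_bool_value(const struct kconfig_bool *s)
{
	if (!s->deps_met)
		return false;
	return s->user_set ? s->user_value : s->def;
}
```

In this model, SPARSEMEM_VMEMMAP with its dependencies met and "default y"
comes out as y unless the user explicitly answers n, which is the case
where SECTIONS_WIDTH is no longer 0.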

> In any event, you're right we can indeed fit ZONE_DEVICE into the
> current MAXSMP definition.  I'll respin the patch.

But I still believe that your respin is better than this variant.
We shouldn't broadly limit the range in one of the options, when there
are multiple options affecting the usage of bits. There's a warning if
the overall configuration is "too large", which could potentially be more
detailed. But we never said configuring the kernel is trivial ;-)

Also in this case the "default y" for SPARSEMEM_VMEMMAP should prevent
surprises when one enables ZONE_DEVICE through nvdimm and doesn't fiddle
with the low-level details. As long as it takes multiple explicit choices
differing from defaults to get to the warning, I'd say we are fine.
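
The bit budget being discussed can be sketched as a small C check of the
inequality from the patch description (SECTIONS_WIDTH + ZONES_WIDTH +
NODES_SHIFT + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS). The
struct and helper names below are hypothetical, and the real kernel headers
resolve this at preprocessing time with fallbacks that this sketch does not
model.

```c
#include <stdbool.h>

/* Hypothetical container for the widths that compete for page->flags. */
struct pageflags_layout {
	int bits_per_long;     /* BITS_PER_LONG */
	int nr_pageflags;      /* NR_PAGEFLAGS */
	int sections_width;    /* SECTIONS_WIDTH, 0 with SPARSEMEM_VMEMMAP */
	int zones_width;       /* ZONES_WIDTH */
	int nodes_shift;       /* NODES_SHIFT */
	int last_cpupid_shift; /* LAST_CPUPID_SHIFT */
};

/* Total number of bits the non-flag fields ask for. */
static int pageflags_bits_needed(const struct pageflags_layout *l)
{
	return l->sections_width + l->zones_width +
	       l->nodes_shift + l->last_cpupid_shift;
}

/* True when the fields fit in the bits left over after the page flags. */
static bool pageflags_fit(const struct pageflags_layout *l)
{
	return pageflags_bits_needed(l) <= l->bits_per_long - l->nr_pageflags;
}
```

With made-up illustrative numbers (not taken from a real .config) of
BITS_PER_LONG=64, NR_PAGEFLAGS=22, ZONES_WIDTH=3, NODES_SHIFT=10 and
LAST_CPUPID_SHIFT=21, a SECTIONS_WIDTH of 0 needs 34 bits and fits in the
42 available, while an assumed SECTIONS_WIDTH of 19 needs 53 bits and does
not.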

> Thanks for probing on this!
>

^ permalink raw reply	[flat|nested] 20+ messages in thread
