* [PATCH v3 0/4] Zeroing hash tables in allocator
From: Pavel Tatashin @ 2017-03-02  5:33 UTC
  To: linux-mm, sparclinux, linux-fsdevel

Changes:
v2 -> v3: Added a new patch for adaptive hash table scaling, as
suggested by Andi Kleen
v1 -> v2: Reverted NG4memcpy() changes

Pavel Tatashin (4):
  sparc64: NG4 memset 32 bits overflow
  mm: Zeroing hash tables in allocator
  mm: Updated callers to use HASH_ZERO flag
  mm: Adaptive hash table scaling

 arch/sparc/lib/NG4memset.S          |   26 +++++++++++++-------------
 fs/dcache.c                         |   18 ++++--------------
 fs/inode.c                          |   14 ++------------
 fs/namespace.c                      |   10 ++--------
 include/linux/bootmem.h             |    2 ++
 kernel/locking/qspinlock_paravirt.h |    3 ++-
 kernel/pid.c                        |    7 ++-----
 mm/page_alloc.c                     |   31 ++++++++++++++++++++++++++++---
 8 files changed, 55 insertions(+), 56 deletions(-)


* [PATCH v3 1/4] sparc64: NG4 memset 32 bits overflow
From: Pavel Tatashin @ 2017-03-02  5:33 UTC
  To: linux-mm, sparclinux, linux-fsdevel

Early in boot, Linux patches memset and memcpy to branch to
platform-optimized versions of these routines. The NG4 (Niagara 4)
versions are currently used on all platforms starting from T4. Recently,
M7-optimized routines were added into UEK4, but not into mainline yet.
So, even with the M7-optimized routines, the NG4 versions are still
going to be used on T4, T5, M5, and M6 processors.

While investigating how to improve the initialization time of
dentry_hashtable, which is 8G in size on an M6 ldom with 7T of main
memory, I noticed that memset() does not reset all the memory in this
array. After studying the code, I realized that the NG4memset() branches
test the %icc condition codes instead of %xcc, so if the length value
does not fit in 32 bits, which is the case for an 8G array, these
routines fail to work properly.

The fix is to replace all %icc with %xcc in these routines. (An
alternative is to use %ncc, but that is misleading, as the code already
contains sparcv9-only instructions and cannot be compiled for 32-bit.)

It is important to fix this bug because even an older T4-4 can have 2T
of memory, and the kernel contains data structures proportional to
memory size that can grow larger than 4G. The failure of memset() is
silent, and the resulting corruption is hard to detect.
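
To make the failure mode concrete, here is a minimal user-space C sketch
(illustration only, not part of the patch) of what a 32-bit
condition-code test sees for an 8G length:

#include <stdio.h>

int main(void)
{
	unsigned long len = 8UL << 30;		/* 8G: needs 34 bits */
	unsigned int low32 = (unsigned int)len;	/* what an %icc test sees */

	/*
	 * %xcc tests the full 64-bit result, %icc only the low 32 bits.
	 * For an 8G length the low 32 bits are zero, so an %icc-based
	 * "remaining length is zero" branch is taken immediately and
	 * most of the buffer is left untouched.
	 */
	printf("64-bit length: %lu, low 32 bits: %u\n", len, low32);
	return 0;
}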

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 arch/sparc/lib/NG4memset.S |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/lib/NG4memset.S b/arch/sparc/lib/NG4memset.S
index 41da4bd..e7c2e70 100644
--- a/arch/sparc/lib/NG4memset.S
+++ b/arch/sparc/lib/NG4memset.S
@@ -13,14 +13,14 @@
 	.globl		NG4memset
 NG4memset:
 	andcc		%o1, 0xff, %o4
-	be,pt		%icc, 1f
+	be,pt		%xcc, 1f
 	 mov		%o2, %o1
 	sllx		%o4, 8, %g1
 	or		%g1, %o4, %o2
 	sllx		%o2, 16, %g1
 	or		%g1, %o2, %o2
 	sllx		%o2, 32, %g1
-	ba,pt		%icc, 1f
+	ba,pt		%xcc, 1f
 	 or		%g1, %o2, %o4
 	.size		NG4memset,.-NG4memset
 
@@ -29,7 +29,7 @@ NG4memset:
 NG4bzero:
 	clr		%o4
 1:	cmp		%o1, 16
-	ble		%icc, .Ltiny
+	ble		%xcc, .Ltiny
 	 mov		%o0, %o3
 	sub		%g0, %o0, %g1
 	and		%g1, 0x7, %g1
@@ -37,7 +37,7 @@ NG4bzero:
 	 sub		%o1, %g1, %o1
 1:	stb		%o4, [%o0 + 0x00]
 	subcc		%g1, 1, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 1, %o0
 .Laligned8:
 	cmp		%o1, 64 + (64 - 8)
@@ -48,7 +48,7 @@ NG4bzero:
 	 sub		%o1, %g1, %o1
 1:	stx		%o4, [%o0 + 0x00]
 	subcc		%g1, 8, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x8, %o0
 .Laligned64:
 	andn		%o1, 64 - 1, %g1
@@ -58,30 +58,30 @@ NG4bzero:
 1:	stxa		%o4, [%o0 + %g0] ASI_BLK_INIT_QUAD_LDD_P
 	subcc		%g1, 0x40, %g1
 	stxa		%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x40, %o0
 .Lpostloop:
 	cmp		%o1, 8
-	bl,pn		%icc, .Ltiny
+	bl,pn		%xcc, .Ltiny
 	 membar		#StoreStore|#StoreLoad
 .Lmedium:
 	andn		%o1, 0x7, %g1
 	sub		%o1, %g1, %o1
 1:	stx		%o4, [%o0 + 0x00]
 	subcc		%g1, 0x8, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x08, %o0
 	andcc		%o1, 0x4, %g1
-	be,pt		%icc, .Ltiny
+	be,pt		%xcc, .Ltiny
 	 sub		%o1, %g1, %o1
 	stw		%o4, [%o0 + 0x00]
 	add		%o0, 0x4, %o0
 .Ltiny:
 	cmp		%o1, 0
-	be,pn		%icc, .Lexit
+	be,pn		%xcc, .Lexit
 1:	 subcc		%o1, 1, %o1
 	stb		%o4, [%o0 + 0x00]
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 1, %o0
 .Lexit:
 	retl
@@ -99,7 +99,7 @@ NG4bzero:
 	stxa		%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
 	stxa		%o4, [%o0 + %g3] ASI_BLK_INIT_QUAD_LDD_P
 	stxa		%o4, [%o0 + %o5] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x30, %o0
-	ba,a,pt		%icc, .Lpostloop
+	ba,a,pt		%xcc, .Lpostloop
 	.size		NG4bzero,.-NG4bzero
-- 
1.7.1


* [PATCH v3 2/4] mm: Zeroing hash tables in allocator
From: Pavel Tatashin @ 2017-03-02  5:33 UTC
  To: linux-mm, sparclinux, linux-fsdevel

Add a new flag, HASH_ZERO, which when provided guarantees that the hash
table returned by alloc_large_system_hash() is zeroed. In most cases
that is what the caller needs. Use the page-level allocator's __GFP_ZERO
flag to zero the memory; it uses memset(), which is an efficient way to
zero memory and is optimized for most platforms.
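
As an illustration, a hypothetical caller (the example_* names below are
made up, not from this series) can then drop its manual initialization
loop, since a zeroed hlist_head is a valid empty list:

	example_hashtable = alloc_large_system_hash("Example-cache",
					sizeof(struct hlist_head),
					example_entries, /* 0 = auto-size */
					14,              /* default scale */
					HASH_ZERO,       /* returned zeroed */
					&example_shift,
					&example_mask,
					0,
					0);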

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 include/linux/bootmem.h |    1 +
 mm/page_alloc.c         |   12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d..e223d91 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -358,6 +358,7 @@ static inline void __init memblock_free_late(
 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
+#define HASH_ZERO	0x00000004	/* Zero allocated hash table */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a7a6aac..1b0f7a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7142,6 +7142,7 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;
+	gfp_t gfp_flags;
 
 	/* allow the kernel cmdline to have a say */
 	if (!numentries) {
@@ -7186,12 +7187,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 
 	log2qty = ilog2(numentries);
 
+	/*
+	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
+	 * currently not used when HASH_EARLY is specified.
+	 */
+	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
 		size = bucketsize << log2qty;
 		if (flags & HASH_EARLY)
 			table = memblock_virt_alloc_nopanic(size, 0);
 		else if (hashdist)
-			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
+			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
 		else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
@@ -7199,8 +7205,8 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 			 * alloc_pages_exact() automatically does
 			 */
 			if (get_order(size) < MAX_ORDER) {
-				table = alloc_pages_exact(size, GFP_ATOMIC);
-				kmemleak_alloc(table, size, 1, GFP_ATOMIC);
+				table = alloc_pages_exact(size, gfp_flags);
+				kmemleak_alloc(table, size, 1, gfp_flags);
 			}
 		}
 	} while (!table && size > PAGE_SIZE && --log2qty);
-- 
1.7.1


* [PATCH v3 3/4] mm: Updated callers to use HASH_ZERO flag
From: Pavel Tatashin @ 2017-03-02  5:33 UTC
  To: linux-mm, sparclinux, linux-fsdevel

Update the dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove the initialization loops after the allocations.
In places where HASH_EARLY was used, such as __pv_init_lock_hash(), a
zeroed hash table was already assumed, because memblock zeroes the
memory.
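
Zeroing is sufficient here because an empty hlist head is all zeroes;
for reference, the kernel's stock definitions are:

	struct hlist_head {
		struct hlist_node *first;
	};

	#define INIT_HLIST_HEAD(ptr) ((ptr)->first = NULL)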

CPU: SPARC M6, Memory: 7T
Before fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 11.798s

After fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.245s

After fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.244s

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 fs/dcache.c                         |   18 ++++--------------
 fs/inode.c                          |   14 ++------------
 fs/namespace.c                      |   10 ++--------
 kernel/locking/qspinlock_paravirt.h |    3 ++-
 kernel/pid.c                        |    7 ++-----
 5 files changed, 12 insertions(+), 40 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 95d71ed..363502f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3548,8 +3548,6 @@ static int __init set_dhash_entries(char *str)
 
 static void __init dcache_init_early(void)
 {
-	unsigned int loop;
-
 	/* If hashes are distributed across NUMA nodes, defer
 	 * hash allocation until vmalloc space is available.
 	 */
@@ -3561,24 +3559,19 @@ static void __init dcache_init_early(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_EARLY,
+					HASH_EARLY | HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << d_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(dentry_hashtable + loop);
 }
 
 static void __init dcache_init(void)
 {
-	unsigned int loop;
-
-	/* 
+	/*
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
+	 * of the dcache.
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
@@ -3592,14 +3585,11 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					0,
+					HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << d_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(dentry_hashtable + loop);
 }
 
 /* SLAB cache for __getname() consumers */
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd..1b15a7c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1916,8 +1916,6 @@ static int __init set_ihash_entries(char *str)
  */
 void __init inode_init_early(void)
 {
-	unsigned int loop;
-
 	/* If hashes are distributed across NUMA nodes, defer
 	 * hash allocation until vmalloc space is available.
 	 */
@@ -1929,20 +1927,15 @@ void __init inode_init_early(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_EARLY,
+					HASH_EARLY | HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
 void __init inode_init(void)
 {
-	unsigned int loop;
-
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
@@ -1960,14 +1953,11 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					0,
+					HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/namespace.c b/fs/namespace.c
index 8bfad42..275e6e2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3238,7 +3238,6 @@ static void __init init_mount_tree(void)
 
 void __init mnt_init(void)
 {
-	unsigned u;
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
@@ -3247,22 +3246,17 @@ void __init mnt_init(void)
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
 				mhash_entries, 19,
-				0,
+				HASH_ZERO,
 				&m_hash_shift, &m_hash_mask, 0, 0);
 	mountpoint_hashtable = alloc_large_system_hash("Mountpoint-cache",
 				sizeof(struct hlist_head),
 				mphash_entries, 19,
-				0,
+				HASH_ZERO,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
 
 	if (!mount_hashtable || !mountpoint_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
-	for (u = 0; u <= m_hash_mask; u++)
-		INIT_HLIST_HEAD(&mount_hashtable[u]);
-	for (u = 0; u <= mp_hash_mask; u++)
-		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
-
 	kernfs_init();
 
 	err = sysfs_init();
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..4ccfcaa 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -193,7 +193,8 @@ void __init __pv_init_lock_hash(void)
 	 */
 	pv_lock_hash = alloc_large_system_hash("PV qspinlock",
 					       sizeof(struct pv_hash_entry),
-					       pv_hash_size, 0, HASH_EARLY,
+					       pv_hash_size, 0,
+					       HASH_EARLY | HASH_ZERO,
 					       &pv_lock_hash_bits, NULL,
 					       pv_hash_size, pv_hash_size);
 }
diff --git a/kernel/pid.c b/kernel/pid.c
index 0291804..013e023 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -572,16 +572,13 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
  */
 void __init pidhash_init(void)
 {
-	unsigned int i, pidhash_size;
+	unsigned int pidhash_size;
 
 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
-					   HASH_EARLY | HASH_SMALL,
+					   HASH_EARLY | HASH_SMALL | HASH_ZERO,
 					   &pidhash_shift, NULL,
 					   0, 4096);
 	pidhash_size = 1U << pidhash_shift;
-
-	for (i = 0; i < pidhash_size; i++)
-		INIT_HLIST_HEAD(&pid_hash[i]);
 }
 
 void __init pidmap_init(void)
-- 
1.7.1


* [PATCH v3 4/4] mm: Adaptive hash table scaling
From: Pavel Tatashin @ 2017-03-02  5:33 UTC
  To: linux-mm, sparclinux, linux-fsdevel

Allow hash tables to scale with memory, but at a slower pace: when
HASH_ADAPT is provided, every time memory quadruples the sizes of hash
tables only double instead of quadrupling as well. This algorithm starts
working only when memory size reaches a certain point, currently set
to 64G.

This is an example of the dentry hash table size, before and after the
fix, for various memory configurations:

MEMORY	   SCALE	 HASH_SIZE
	old	new	old	new
    8G	 13	 13      8M      8M
   16G	 13	 13     16M     16M
   32G	 13	 13     32M     32M
   64G	 13	 13     64M     64M
  128G	 13	 14    128M     64M
  256G	 13	 14    256M    128M
  512G	 13	 15    512M    128M
 1024G	 13	 15   1024M    256M
 2048G	 13	 16   2048M    256M
 4096G	 13	 16   4096M    512M
 8192G	 13	 17   8192M    512M
16384G	 13	 17  16384M   1024M
32768G	 13	 18  32768M   1024M
65536G	 13	 18  65536M   2048M
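
The new scale values above follow from the loop added below; as a
stand-alone C sketch (assuming 4K pages, so PAGE_SHIFT is 12, and with
numentries being roughly the number of pages of memory):

	#define EX_PAGE_SHIFT		12		/* assumed 4K pages */
	#define ADAPT_SCALE_BASE	(64UL << 30)	/* 64G */
	#define ADAPT_SCALE_SHIFT	2		/* per 4x of memory */
	#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> EX_PAGE_SHIFT)

	/*
	 * Each time memory quadruples beyond 64G, scale grows by one.
	 * Since the bucket count is numentries >> (scale - PAGE_SHIFT),
	 * the table then only doubles instead of quadrupling.
	 */
	static unsigned long adapt_scale(unsigned long numentries,
					 unsigned long scale)
	{
		unsigned long adapt;

		for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
		     adapt <<= ADAPT_SCALE_SHIFT)
			scale++;
		return scale;
	}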

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 fs/dcache.c             |    2 +-
 fs/inode.c              |    2 +-
 include/linux/bootmem.h |    1 +
 mm/page_alloc.c         |   19 +++++++++++++++++++
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 363502f..808ea99 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3585,7 +3585,7 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_ZERO,
+					HASH_ZERO | HASH_ADAPT,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
diff --git a/fs/inode.c b/fs/inode.c
index 1b15a7c..32c8ee4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1953,7 +1953,7 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_ZERO,
+					HASH_ZERO | HASH_ADAPT,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91..dbaf312 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -359,6 +359,7 @@ static inline void __init memblock_free_late(
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
 #define HASH_ZERO	0x00000004	/* Zero allocated hash table */
+#define	HASH_ADAPT	0x00000008	/* Adaptive scale for large memory */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b0f7a4..608055e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7124,6 +7124,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 #endif
 
 /*
+ * Adaptive scale is meant to reduce sizes of hash tables on large memory
+ * machines. As memory size is increased the scale is also increased but at
+ * slower pace.  Starting from ADAPT_SCALE_BASE (64G), every time memory
+ * quadruples the scale is increased by one, which means the size of hash table
+ * only doubles, instead of quadrupling as well.
+ */
+#define ADAPT_SCALE_BASE	(64ul << 30)
+#define ADAPT_SCALE_SHIFT	2
+#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)
+
+/*
  * allocate a large system hash table from bootmem
  * - it is assumed that the hash table must contain an exact power-of-2
  *   quantity of entries
@@ -7154,6 +7165,14 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 		if (PAGE_SHIFT < 20)
 			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
+		if (flags & HASH_ADAPT) {
+			unsigned long adapt;
+
+			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
+			     adapt <<= ADAPT_SCALE_SHIFT)
+				scale++;
+		}
+
 		/* limit to 1 bucket per 2^scale bytes of low memory */
 		if (scale > PAGE_SHIFT)
 			numentries >>= (scale - PAGE_SHIFT);
-- 
1.7.1


* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
From: Andrew Morton @ 2017-03-03 23:32 UTC
  To: Pavel Tatashin; +Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro

On Thu,  2 Mar 2017 00:33:45 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> Allow hash tables to scale with memory, but at a slower pace: when
> HASH_ADAPT is provided, every time memory quadruples the sizes of hash
> tables only double instead of quadrupling as well. This algorithm starts
> working only when memory size reaches a certain point, currently set
> to 64G.
>
> This is an example of the dentry hash table size, before and after the
> fix, for various memory configurations:
> 
> MEMORY	   SCALE	 HASH_SIZE
> 	old	new	old	new
>     8G	 13	 13      8M      8M
>    16G	 13	 13     16M     16M
>    32G	 13	 13     32M     32M
>    64G	 13	 13     64M     64M
>   128G	 13	 14    128M     64M
>   256G	 13	 14    256M    128M
>   512G	 13	 15    512M    128M
>  1024G	 13	 15   1024M    256M
>  2048G	 13	 16   2048M    256M
>  4096G	 13	 16   4096M    512M
>  8192G	 13	 17   8192M    512M
> 16384G	 13	 17  16384M   1024M
> 32768G	 13	 18  32768M   1024M
> 65536G	 13	 18  65536M   2048M

OK, but what are the runtime effects?  Presumably some workloads will
slow down a bit.  How much? How do we know that this is a worthwhile
tradeoff?

If the effect of this change is "undetectable" then those hash tables
are simply too large, and additional tuning is needed, yes?


* Re: [PATCH v3 1/4] sparc64: NG4 memset 32 bits overflow
From: Andrew Morton @ 2017-03-03 23:34 UTC
  To: Pavel Tatashin; +Cc: linux-mm, sparclinux, linux-fsdevel, David Miller

On Thu,  2 Mar 2017 00:33:42 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> Early in boot, Linux patches memset and memcpy to branch to
> platform-optimized versions of these routines. The NG4 (Niagara 4)
> versions are currently used on all platforms starting from T4. Recently,
> M7-optimized routines were added into UEK4, but not into mainline yet.
> So, even with the M7-optimized routines, the NG4 versions are still
> going to be used on T4, T5, M5, and M6 processors.
>
> While investigating how to improve the initialization time of
> dentry_hashtable, which is 8G in size on an M6 ldom with 7T of main
> memory, I noticed that memset() does not reset all the memory in this
> array. After studying the code, I realized that the NG4memset() branches
> test the %icc condition codes instead of %xcc, so if the length value
> does not fit in 32 bits, which is the case for an 8G array, these
> routines fail to work properly.
>
> The fix is to replace all %icc with %xcc in these routines. (An
> alternative is to use %ncc, but that is misleading, as the code already
> contains sparcv9-only instructions and cannot be compiled for 32-bit.)
>
> It is important to fix this bug because even an older T4-4 can have 2T
> of memory, and the kernel contains data structures proportional to
> memory size that can grow larger than 4G. The failure of memset() is
> silent, and the resulting corruption is hard to detect.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Babu Moger <babu.moger@oracle.com>

It sounds like this fix should be backported into -stable kernels?  If
so, which version(s)?

Also, what are the user-visible runtime effects of this change?


* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
From: Michal Hocko @ 2017-04-26 20:11 UTC
  To: Andrew Morton
  Cc: Pavel Tatashin, linux-mm, sparclinux, linux-fsdevel, Al Viro

On Fri 03-03-17 15:32:47, Andrew Morton wrote:
> On Thu,  2 Mar 2017 00:33:45 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
> 
> > Allow hash tables to scale with memory, but at a slower pace: when
> > HASH_ADAPT is provided, every time memory quadruples the sizes of hash
> > tables only double instead of quadrupling as well. This algorithm starts
> > working only when memory size reaches a certain point, currently set
> > to 64G.
> >
> > This is an example of the dentry hash table size, before and after the
> > fix, for various memory configurations:
> > 
> > MEMORY	   SCALE	 HASH_SIZE
> > 	old	new	old	new
> >     8G	 13	 13      8M      8M
> >    16G	 13	 13     16M     16M
> >    32G	 13	 13     32M     32M
> >    64G	 13	 13     64M     64M
> >   128G	 13	 14    128M     64M
> >   256G	 13	 14    256M    128M
> >   512G	 13	 15    512M    128M
> >  1024G	 13	 15   1024M    256M
> >  2048G	 13	 16   2048M    256M
> >  4096G	 13	 16   4096M    512M
> >  8192G	 13	 17   8192M    512M
> > 16384G	 13	 17  16384M   1024M
> > 32768G	 13	 18  32768M   1024M
> > 65536G	 13	 18  65536M   2048M
> 
> OK, but what are the runtime effects?  Presumably some workloads will
> slow down a bit.  How much? How do we know that this is a worthwhile
> tradeoff?
> 
> If the effect of this change is "undetectable" then those hash tables
> are simply too large, and additional tuning is needed, yes?

I am playing with a 3TB machine and have hit the following:
[    0.961309] Dentry cache hash table entries: 536870912 (order: 20, 4294967296 bytes)
[    2.300012] vmalloc: allocation failure, allocated 1383612416 of 2147487744 bytes
[    2.307473] swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
[    2.315101] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          4.4.49-hotplug19-default #1
[    2.324017] Hardware name: Huawei 9008/IT91SMUB, BIOS BLXSV607 04/17/2017
[    2.330775]  ffffffff8101aba5 ffffffff8130efa0 ffffffff81863f48 ffffffff81c03e40
[    2.338201]  ffffffff8118c9a2 02080020fff00300 ffffffff81863f48 ffffffff81c03de0
[    2.345628]  0000000000000018 ffffffff81c03e50 ffffffff81c03df8 ffffffff811d28e6
[    2.353056] Call Trace:
[    2.355507]  [<ffffffff81019a99>] dump_trace+0x59/0x310
[    2.360710]  [<ffffffff81019e3a>] show_stack_log_lvl+0xea/0x170
[    2.366605]  [<ffffffff8101abc1>] show_stack+0x21/0x40
[    2.371723]  [<ffffffff8130efa0>] dump_stack+0x5c/0x7c
[    2.376842]  [<ffffffff8118c9a2>] warn_alloc_failed+0xe2/0x150
[    2.382655]  [<ffffffff811c2a10>] __vmalloc_node_range+0x240/0x280
[    2.388814]  [<ffffffff811c2a97>] __vmalloc+0x47/0x50
[    2.393851]  [<ffffffff81da02ae>] alloc_large_system_hash+0x189/0x25d
[    2.400264]  [<ffffffff81da7625>] inode_init+0x74/0xa3
[    2.405381]  [<ffffffff81da7483>] vfs_caches_init+0x59/0xe1
[    2.410930]  [<ffffffff81d6f070>] start_kernel+0x474/0x4d0
[    2.416392]  [<ffffffff81d6e719>] x86_64_start_kernel+0x147/0x156

Allocating 4G for a hash table is just ridiculous. The 512MB which this
patch should give looks much more reasonable, although I would argue it
is still a _lot_.

I cannot say I would be really happy about the chosen approach,
though. Why is HASH_ADAPT not implicit? Which hash table would need
gigabytes of memory and still benefit from it? Even if there is such an
example, it should use the explicit high_limit. I do not like this
opt-in because it is just too easy to miss it and hit the same issue
again. And in fact only a few users of alloc_large_system_hash use
the flag. E.g. why do {dcache,inode}_init_early not have the flag? I
am pretty sure that a physically contiguous hash table would be
better than vmalloc from the TLB point of view.

mount_hashtable and mountpoint_hashtable are another example. The other
users just have a reasonable max value. So can we do the following on
top of your commit? I think that we should rethink the scaling as well,
but I do not have a good answer for the maximum size, so let's just
start with a more reasonable API first.
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 808ea99062c2..363502faa328 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3585,7 +3585,7 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
diff --git a/fs/inode.c b/fs/inode.c
index a9caf53df446..b3c0731ec1fe 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1950,7 +1950,7 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index dbaf312b3317..e223d91b6439 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -359,7 +359,6 @@ extern void *alloc_large_system_hash(const char *tablename,
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
 #define HASH_ZERO	0x00000004	/* Zero allocated hash table */
-#define	HASH_ADAPT	0x00000008	/* Adaptive scale for large memory */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa752de84eef..3bf60669d200 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7226,7 +7226,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 		if (PAGE_SHIFT < 20)
 			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
-		if (flags & HASH_ADAPT) {
+		if (!high_limit) {
 			unsigned long adapt;
 
 			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
@ 2017-04-26 20:11       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2017-04-26 20:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Tatashin, linux-mm, sparclinux, linux-fsdevel, Al Viro

On Fri 03-03-17 15:32:47, Andrew Morton wrote:
> On Thu,  2 Mar 2017 00:33:45 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
> 
> > Allow hash tables to scale with memory but at slower pace, when HASH_ADAPT
> > is provided every time memory quadruples the sizes of hash tables will only
> > double instead of quadrupling as well. This algorithm starts working only
> > when memory size reaches a certain point, currently set to 64G.
> > 
> > This is example of dentry hash table size, before and after four various
> > memory configurations:
> > 
> > MEMORY	   SCALE	 HASH_SIZE
> > 	old	new	old	new
> >     8G	 13	 13      8M      8M
> >    16G	 13	 13     16M     16M
> >    32G	 13	 13     32M     32M
> >    64G	 13	 13     64M     64M
> >   128G	 13	 14    128M     64M
> >   256G	 13	 14    256M    128M
> >   512G	 13	 15    512M    128M
> >  1024G	 13	 15   1024M    256M
> >  2048G	 13	 16   2048M    256M
> >  4096G	 13	 16   4096M    512M
> >  8192G	 13	 17   8192M    512M
> > 16384G	 13	 17  16384M   1024M
> > 32768G	 13	 18  32768M   1024M
> > 65536G	 13	 18  65536M   2048M
> 
> OK, but what are the runtime effects?  Presumably some workloads will
> slow down a bit.  How much? How do we know that this is a worthwhile
> tradeoff?
> 
> If the effect of this change is "undetectable" then those hash tables
> are simply too large, and additional tuning is needed, yes?

I am playing with a 3TB and have hit the following
[    0.961309] Dentry cache hash table entries: 536870912 (order: 20, 4294967296 bytes)
[    2.300012] vmalloc: allocation failure, allocated 1383612416 of 2147487744 bytes
[    2.307473] swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
[    2.315101] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          4.4.49-hotplug19-default #1
[    2.324017] Hardware name: Huawei 9008/IT91SMUB, BIOS BLXSV607 04/17/2017
[    2.330775]  ffffffff8101aba5 ffffffff8130efa0 ffffffff81863f48 ffffffff81c03e40
[    2.338201]  ffffffff8118c9a2 02080020fff00300 ffffffff81863f48 ffffffff81c03de0
[    2.345628]  0000000000000018 ffffffff81c03e50 ffffffff81c03df8 ffffffff811d28e6
[    2.353056] Call Trace:
[    2.355507]  [<ffffffff81019a99>] dump_trace+0x59/0x310
[    2.360710]  [<ffffffff81019e3a>] show_stack_log_lvl+0xea/0x170
[    2.366605]  [<ffffffff8101abc1>] show_stack+0x21/0x40
[    2.371723]  [<ffffffff8130efa0>] dump_stack+0x5c/0x7c
[    2.376842]  [<ffffffff8118c9a2>] warn_alloc_failed+0xe2/0x150
[    2.382655]  [<ffffffff811c2a10>] __vmalloc_node_range+0x240/0x280
[    2.388814]  [<ffffffff811c2a97>] __vmalloc+0x47/0x50
[    2.393851]  [<ffffffff81da02ae>] alloc_large_system_hash+0x189/0x25d
[    2.400264]  [<ffffffff81da7625>] inode_init+0x74/0xa3
[    2.405381]  [<ffffffff81da7483>] vfs_caches_init+0x59/0xe1
[    2.410930]  [<ffffffff81d6f070>] start_kernel+0x474/0x4d0
[    2.416392]  [<ffffffff81d6e719>] x86_64_start_kernel+0x147/0x156

Allocating 4G for a hash table is just ridiculous. The 512MB which this
patch should give looks much more reasonable, although I would argue it
is still a _lot_.
I cannot say I would be really happy about the chosen approach, though.
Why is HASH_ADAPT not implicit? Which hash table would need gigabytes of
memory and still benefit from it? Even if there is such an example, it
should use an explicit high_limit. I do not like this opt-in because it
is just too easy to miss it and hit the same issue again. And in fact
only a few users of alloc_large_system_hash are using the flag. E.g. why
do {dcache,inode}_init_early not have the flag? I am pretty sure that a
physically contiguous hash table would be better than vmalloc from the
TLB point of view.

mount_hashtable and mountpoint_hashtable are another example. Other
users just have a reasonable max value. So can we do the following on
top of your commit? I think that we should rethink the scaling as well,
but I do not have a good answer for the maximum size, so let's just
start with a more reasonable API first.
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 808ea99062c2..363502faa328 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3585,7 +3585,7 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
diff --git a/fs/inode.c b/fs/inode.c
index a9caf53df446..b3c0731ec1fe 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1950,7 +1950,7 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index dbaf312b3317..e223d91b6439 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -359,7 +359,6 @@ extern void *alloc_large_system_hash(const char *tablename,
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
 #define HASH_ZERO	0x00000004	/* Zero allocated hash table */
-#define	HASH_ADAPT	0x00000008	/* Adaptive scale for large memory */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa752de84eef..3bf60669d200 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7226,7 +7226,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 		if (PAGE_SHIFT < 20)
 			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
-		if (flags & HASH_ADAPT) {
+		if (!high_limit) {
 			unsigned long adapt;
 
 			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 35+ messages in thread
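
[Aside on the mechanism under discussion: the loop body is cut off in the
quoted diff, but combined with the patch description (scaling starts at
64G, and the table only doubles per quadrupling of memory) the adaptive
branch of alloc_large_system_hash() plausibly reads as the sketch below.
This is a sketch, not verbatim upstream code: ADAPT_SCALE_BASE and
ADAPT_SCALE_SHIFT are assumed names here; only ADAPT_SCALE_NPAGES is
visible in the diff.]

#define ADAPT_SCALE_BASE	(64ul << 30)	/* adaptive scaling starts at 64G */
#define ADAPT_SCALE_SHIFT	2		/* one scale step per 4x of memory */
#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)

	if (!high_limit) {	/* with the change proposed above */
		unsigned long adapt;

		/*
		 * Each 4x of memory above 64G bumps scale by one.  The
		 * table ends up with one bucket per 2^scale bytes, so a
		 * quadrupling of memory only doubles the hash table.
		 */
		for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
		     adapt <<= ADAPT_SCALE_SHIFT)
			scale++;
	}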

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-04-26 20:11       ` Michal Hocko
@ 2017-05-02  8:04         ` Michal Hocko
  -1 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2017-05-02  8:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Tatashin, linux-mm, sparclinux, linux-fsdevel, Al Viro

Ping on this. Andrew, are you going to fold this or should I post a
separate patch?

[...]

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-04-26 20:11       ` Michal Hocko
@ 2017-05-04 18:23         ` Pasha Tatashin
  -1 siblings, 0 replies; 35+ messages in thread
From: Pasha Tatashin @ 2017-05-04 18:23 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton; +Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro

Hi Michal,

I do not really want to impose any hard limit, because I do not know
what it should be.

The owners of the subsystems that use these large hash tables should
make that call, and pass high_limit into alloc_large_system_hash() if
needed.

The previous growth rate was unacceptable because, in addition to
allocating large tables (which is acceptable relative to total system
memory size), we also need to zero them, and zeroing while only one CPU
is available was significantly increasing the boot time.

Now, on 32T the hash table is 1G instead of 32G, so the call finishes 32
times faster. While it is not a good idea to waste memory, both 1G and
32G are insignificant amounts of memory compared to the total memory of
such 32T systems (0.003% and 0.09% respectively).

Here is the boot log on a 32T system without this fix:
https://hastebin.com/muruzoveno.go

[  769.622359] Dentry cache hash table entries: 2147483648 (order: 21, 17179869184 bytes)
[  791.942136] Inode-cache hash table entries: 2147483648 (order: 21, 17179869184 bytes)
[  810.810745] Mount-cache hash table entries: 67108864 (order: 16, 536870912 bytes)
[  810.922322] Mountpoint-cache hash table entries: 67108864 (order: 16, 536870912 bytes)
[  812.125398] ftrace: allocating 20650 entries in 41 pages

Total time: 42.5s

With this fix (and some other fixes unrelated to this interval):
https://hastebin.com/buxucurawa.go

[   12.621164] Dentry cache hash table entries: 134217728 (order: 17, 1073741824 bytes)
[   12.869462] Inode-cache hash table entries: 67108864 (order: 16, 536870912 bytes)
[   13.101963] Mount-cache hash table entries: 67108864 (order: 16, 536870912 bytes)
[   13.331988] Mountpoint-cache hash table entries: 67108864 (order: 16, 536870912 bytes)
[   13.364661] ftrace: allocating 20650 entries in 41 pages

Total time: 0.76s.

So it scales well for 32T systems, and will scale well for the
foreseeable future without adding a hard ceiling.

Pasha

On 04/26/2017 04:11 PM, Michal Hocko wrote:
[...]


^ permalink raw reply	[flat|nested] 35+ messages in thread
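
[Aside: the size table quoted earlier and the log numbers above can be
cross-checked with a short userspace calculation. This is a sketch under
stated assumptions: 4K pages, 64-bit longs, the dentry hash base scale
of 13, 8-byte buckets, and the 64G starting point from the patch
description; the ADAPT_SCALE_* names are carried over from the sketch
after the first mail.]

#include <stdio.h>

#define PAGE_SHIFT		12		/* assumes 4K pages */
#define ADAPT_SCALE_BASE	(64ul << 30)
#define ADAPT_SCALE_SHIFT	2
#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)

int main(void)
{
	unsigned long mem_gb;

	for (mem_gb = 8; mem_gb <= 65536; mem_gb <<= 1) {
		/* memory size in pages */
		unsigned long numentries = (mem_gb << 30) >> PAGE_SHIFT;
		unsigned long adapt;
		int scale = 13;		/* dentry hash base scale */

		for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
		     adapt <<= ADAPT_SCALE_SHIFT)
			scale++;

		/* one bucket per 2^scale bytes, 8 bytes per bucket */
		numentries >>= (scale - PAGE_SHIFT);
		printf("%6luG  scale=%2d  hash=%5luM\n",
		       mem_gb, scale, (numentries * 8) >> 20);
	}
	return 0;
}

[This reproduces the "new" column of the table: e.g. 32768G gives scale
18 and a 1024M dentry hash, matching the 1073741824-byte table in the
boot log above.]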

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-04 18:23         ` Pasha Tatashin
@ 2017-05-04 18:28           ` Pasha Tatashin
  -1 siblings, 0 replies; 35+ messages in thread
From: Pasha Tatashin @ 2017-05-04 18:28 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton; +Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro

BTW, I am OK with your patch on top of this "Adaptive hash table" patch,
but I do not know from what high_limit HASH_ADAPT should kick in. Does
128M sound reasonable to you?

Pasha


On 05/04/2017 02:23 PM, Pasha Tatashin wrote:
[...]


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-04 18:23         ` Pasha Tatashin
@ 2017-05-05 13:29           ` Michal Hocko
  -1 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2017-05-05 13:29 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Andrew Morton, linux-mm, sparclinux, linux-fsdevel, Al Viro

On Thu 04-05-17 14:23:24, Pasha Tatashin wrote:
> Hi Michal,
> 
> I do not really want to impose any hard limit, because I do not know what it
> should be.
> 
> The owners of the subsystems that use these large hash tables should
> make that call, and pass high_limit into alloc_large_system_hash() if
> needed.

Some of them surely should. E.g. mount_hashtable and mountpoint_hashtable
really do not need a large hash AFAIU. On the other hand, it is somewhat
handy to scale the dentry and inode hashes according to the amount of
memory. But the scale factor should grow much more slowly than in the
current upstream implementation. As I've said, I do not want to judge
your scaling change. All I am saying is that making it explicit is just
_wrong_, because it a) doesn't cover all cases, just the two you have
noticed, and b) new users will most probably just copy&paste existing
users, so chances are they will introduce the same large hash tables
without a good reason. I would even say that users shouldn't care about
how the scaling is implemented. There is a way to limit it, and if no
limit is set then just do whatever is appropriate.

> 
> The previous growth rate was unacceptable because, in addition to
> allocating large tables (which is acceptable relative to total system
> memory size), we also need to zero them, and zeroing while only one CPU
> is available was significantly increasing the boot time.
> 
> Now, on 32T the hash table is 1G instead of 32G, so the call finishes 32
> times faster. While it is not a good idea to waste memory, both 1G and
> 32G are insignificant amounts of memory compared to the total memory of
> such 32T systems (0.003% and 0.09% respectively).

Try to think in terms of hashed objects: how many objects would we need
to hash? Also, this might not be a significant portion of the memory, but
it is still memory which could be used for other purposes.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread
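
[Aside: at the call sites the suggested rule is simple: pass a non-zero
high_limit (the last argument of alloc_large_system_hash()) when there is
a sensible upper bound, and 0 to get the adaptive scaling. Below is a
hypothetical sketch for the mount hash; the 8192-entry cap is an invented
number, and the real mnt_init() call, as discussed above, leaves
high_limit at 0.]

	/* hypothetical: bound the mount hash instead of scaling with memory */
	mount_hashtable = alloc_large_system_hash("Mount-cache",
				sizeof(struct hlist_head),
				mhash_entries, 19,
				HASH_ZERO,
				&m_hash_shift, &m_hash_mask,
				0,		/* low_limit */
				8192);		/* high_limit: invented cap */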

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-04 18:28           ` Pasha Tatashin
@ 2017-05-05 13:30             ` Michal Hocko
  -1 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2017-05-05 13:30 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Andrew Morton, linux-mm, sparclinux, linux-fsdevel, Al Viro

On Thu 04-05-17 14:28:51, Pasha Tatashin wrote:
> BTW, I am OK with your patch on top of this "Adaptive hash table" patch,
> but I do not know from what high_limit HASH_ADAPT should kick in. Does
> 128M sound reasonable to you?

For simplicity I would just use it unconditionally when no high_limit is
set. What would be the problem with that? If you look at the current
users (and new users do not emerge too often), most of them just want
_some_ scaling. The original one obviously doesn't scale to large
machines. Are you OK with folding my change into your patch, or do you
want me to send a separate patch? AFAIK Andrew hasn't posted this patch
to Linus yet.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-05 13:30             ` Michal Hocko
@ 2017-05-05 15:33               ` Pasha Tatashin
  -1 siblings, 0 replies; 35+ messages in thread
From: Pasha Tatashin @ 2017-05-05 15:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, sparclinux, linux-fsdevel, Al Viro



On 05/05/2017 09:30 AM, Michal Hocko wrote:
> On Thu 04-05-17 14:28:51, Pasha Tatashin wrote:
>> BTW, I am OK with your patch on top of this "Adaptive hash table" patch,
>> but I do not know from what high_limit HASH_ADAPT should kick in. Does
>> 128M sound reasonable to you?
> 
> For simplicity I would just use it unconditionally when no high_limit is
> set. What would be the problem with that?

Sure, that sounds good.

> If you look at the current users
> (and new users do not emerge too often), most of them just want
> _some_ scaling. The original one obviously doesn't scale to large
> machines. Are you OK with folding my change into your patch, or do you
> want me to send a separate patch? AFAIK Andrew hasn't posted this patch
> to Linus yet.

I would prefer a separate patch, because mine has been soaking in the mm
tree for a while now.

Thank you,
Pasha


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-05 15:33               ` Pasha Tatashin
@ 2017-05-09  9:46                 ` Michal Hocko
  -1 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2017-05-09  9:46 UTC (permalink / raw)
  To: Andrew Morton, Pasha Tatashin
  Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro

On Fri 05-05-17 11:33:36, Pasha Tatashin wrote:
[...]
> I would prefer a separate patch, because mine has been soaking in the mm
> tree for a while now.

OK, Andrew tends to fold follow-up fixes in his mm tree, but since you
prefer to have this as a separate patch: could you add this one on top,
Andrew? I believe the mnt hash tables need a _reasonable_ upper bound,
but that is for a separate patch.
--- 
From ac970fdb3e6f5f03a440fdbe6fe09460d99d3557 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 9 May 2017 11:34:59 +0200
Subject: [PATCH] mm: drop HASH_ADAPT

"mm: Adaptive hash table scaling" has introduced a new large hash table
automatic scaling because the previous implementation led to too large
hashes on TB systems. This is all nice and good but the patch assumes that
callers of alloc_large_system_hash will opt-in to use this new scaling.
This makes the API unnecessarily complicated and error prone. The only
thing that callers should care about is whether they have an upper
bound for the size or leave it to alloc_large_system_hash to decide (by
providing high_limit == 0).

As a quick code inspection shows there are users with high_limit == 0
which do not use the flag already e.g. {dcache,inode}_init_early or
mnt_init when creating mnt has tables. They certainly have no good
reason to use a different scaling because the [di]cache was the
motivation for introducing a different scaling in the first place (we
just do this attempt and use memblock). It is also hard to imagine why
we would mnt hash tables need larger hash tables.

Just drop the flag and use the scaling whenever there is no high_limit
specified.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/dcache.c             | 2 +-
 fs/inode.c              | 2 +-
 include/linux/bootmem.h | 1 -
 mm/page_alloc.c         | 2 +-
 4 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 808ea99062c2..363502faa328 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3585,7 +3585,7 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
diff --git a/fs/inode.c b/fs/inode.c
index 32c8ee454ef0..1b15a7cc78ce 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1953,7 +1953,7 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_ZERO | HASH_ADAPT,
+					HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index dbaf312b3317..e223d91b6439 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -359,7 +359,6 @@ extern void *alloc_large_system_hash(const char *tablename,
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
 #define HASH_ZERO	0x00000004	/* Zero allocated hash table */
-#define	HASH_ADAPT	0x00000008	/* Adaptive scale for large memory */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beb2827fd5de..3b840b998c05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7213,7 +7213,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 		if (PAGE_SHIFT < 20)
 			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
-		if (flags & HASH_ADAPT) {
+		if (!high_limit) {
 			unsigned long adapt;
 
 			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs
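
A minimal userspace sketch of the convention this patch establishes:
adaptive scaling is implied by high_limit == 0 instead of being requested
through a flag bit. The flag values mirror the bootmem.h hunk above; the
helper name and the caller values are illustrative stand-ins, not kernel
code.

#include <stdbool.h>
#include <stdio.h>

#define HASH_SMALL	0x00000002	/* sub-page allocation allowed */
#define HASH_ZERO	0x00000004	/* zero the allocated hash table */

/*
 * Post-patch rule: there is no HASH_ADAPT bit left to test.  A caller
 * that passes high_limit == 0 leaves the sizing policy to the
 * allocator, which then applies the adaptive scale.
 */
static bool uses_adaptive_scale(unsigned long flags, unsigned long high_limit)
{
	(void)flags;		/* flags no longer influence scaling */
	return high_limit == 0;
}

int main(void)
{
	/* dcache_init()-style caller: zeroed table, no upper bound. */
	printf("dcache-like caller: %d\n", uses_adaptive_scale(HASH_ZERO, 0));
	/* a caller with a hard cap implicitly opts out */
	printf("capped caller:      %d\n",
	       uses_adaptive_scale(HASH_ZERO, 1UL << 20));
	return 0;
}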


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-05-09  9:46                 ` Michal Hocko
@ 2017-05-09 13:07                   ` Pasha Tatashin
  -1 siblings, 0 replies; 35+ messages in thread
From: Pasha Tatashin @ 2017-05-09 13:07 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton; +Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro

Thank you Michal for making this change.

Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>

> 
> OK, Andrew tends to fold follow up fixes in his mm tree. But anyway, as
> you prefer to have this in a separate patch. Could you add this on top
> Andrew? I believe mnt hash tables need a _reasonable_ upper bound but
> that is for a separate patch I believe.
> ---
>  From ac970fdb3e6f5f03a440fdbe6fe09460d99d3557 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 9 May 2017 11:34:59 +0200
> Subject: [PATCH] mm: drop HASH_ADAPT
> 
> "mm: Adaptive hash table scaling" has introduced a new large hash table
> automatic scaling because the previous implementation led to too large
> hashes on TB systems. This is all nice and good but the patch assumes that
> callers of alloc_large_system_hash will opt-in to use this new scaling.
> This makes the API unnecessarily complicated and error prone. The only
> thing that callers should care about is whether they have an upper
> bound for the size or leave it to alloc_large_system_hash to decide (by
> providing high_limit == 0).
> 
> As a quick code inspection shows there are users with high_limit == 0
> which do not use the flag already e.g. {dcache,inode}_init_early or
> mnt_init when creating mnt has tables. They certainly have no good
> reason to use a different scaling because the [di]cache was the
> motivation for introducing a different scaling in the first place (we
> just do this attempt and use memblock). It is also hard to imagine why
> we would mnt hash tables need larger hash tables.
> 
> Just drop the flag and use the scaling whenever there is no high_limit
> specified.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   fs/dcache.c             | 2 +-
>   fs/inode.c              | 2 +-
>   include/linux/bootmem.h | 1 -
>   mm/page_alloc.c         | 2 +-
>   4 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 808ea99062c2..363502faa328 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3585,7 +3585,7 @@ static void __init dcache_init(void)
>   					sizeof(struct hlist_bl_head),
>   					dhash_entries,
>   					13,
> -					HASH_ZERO | HASH_ADAPT,
> +					HASH_ZERO,
>   					&d_hash_shift,
>   					&d_hash_mask,
>   					0,
> diff --git a/fs/inode.c b/fs/inode.c
> index 32c8ee454ef0..1b15a7cc78ce 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1953,7 +1953,7 @@ void __init inode_init(void)
>   					sizeof(struct hlist_head),
>   					ihash_entries,
>   					14,
> -					HASH_ZERO | HASH_ADAPT,
> +					HASH_ZERO,
>   					&i_hash_shift,
>   					&i_hash_mask,
>   					0,
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index dbaf312b3317..e223d91b6439 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -359,7 +359,6 @@ extern void *alloc_large_system_hash(const char *tablename,
>   #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
>   					 * shift passed via *_hash_shift */
>   #define HASH_ZERO	0x00000004	/* Zero allocated hash table */
> -#define	HASH_ADAPT	0x00000008	/* Adaptive scale for large memory */
>   
>   /* Only NUMA needs hash distribution. 64bit NUMA architectures have
>    * sufficient vmalloc space.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index beb2827fd5de..3b840b998c05 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7213,7 +7213,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>   		if (PAGE_SHIFT < 20)
>   			numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
>   
> -		if (flags & HASH_ADAPT) {
> +		if (!high_limit) {
>   			unsigned long adapt;
>   
>   			for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/4] mm: Adaptive hash table scaling
  2017-03-03 23:32     ` Andrew Morton
@ 2017-05-17 15:51       ` Pasha Tatashin
  -1 siblings, 0 replies; 35+ messages in thread
From: Pasha Tatashin @ 2017-05-17 15:51 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko; +Cc: linux-mm, sparclinux, linux-fsdevel, Al Viro



On 03/03/2017 06:32 PM, Andrew Morton wrote:
> On Thu,  2 Mar 2017 00:33:45 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
> 
>> Allow hash tables to scale with memory, but at a slower pace: when HASH_ADAPT
>> is provided, every time memory quadruples the sizes of hash tables only
>> double instead of quadrupling as well. This algorithm starts working only
>> once memory size reaches a certain point, currently set to 64G.
>>
>> This is an example of dentry hash table sizes, before and after, for various
>> memory configurations:
>>
>> MEMORY	   SCALE	 HASH_SIZE
>> 	old	new	old	new
>>      8G	 13	 13      8M      8M
>>     16G	 13	 13     16M     16M
>>     32G	 13	 13     32M     32M
>>     64G	 13	 13     64M     64M
>>    128G	 13	 14    128M     64M
>>    256G	 13	 14    256M    128M
>>    512G	 13	 15    512M    128M
>>   1024G	 13	 15   1024M    256M
>>   2048G	 13	 16   2048M    256M
>>   4096G	 13	 16   4096M    512M
>>   8192G	 13	 17   8192M    512M
>> 16384G	 13	 17  16384M   1024M
>> 32768G	 13	 18  32768M   1024M
>> 65536G	 13	 18  65536M   2048M
> 
> OK, but what are the runtime effects?  Presumably some workloads will
> slow down a bit.  How much? How do we know that this is a worthwhile
> tradeoff?
> 
> If the effect of this change is "undetectable" then those hash tables
> are simply too large, and additional tuning is needed, yes?
> 
Hi Andrew,

The effect of this change on runtime is undetectable, because filesystem
growth is not proportional to machine memory size, as is currently
assumed. The change affects only large-memory machines. Additional tuning
might be needed, but that can be done by the clients of the
kmem_cache_create interface, not the generic cache allocator itself.

Thank you,
Pasha
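
The scaling rule quoted above can be checked with a standalone program.
This is a simplified sketch: it assumes 4K pages, a 64-bit unsigned long,
and one hash entry per page of memory, which is enough to show the shift
growing by one each time memory quadruples past 64G. The ADAPT_SCALE_*
constants mirror the kernel patch; the rest is illustrative.

#include <stdio.h>

#define PAGE_SHIFT		12			/* assume 4K pages */
#define ADAPT_SCALE_BASE	(64UL << 30)		/* scaling starts at 64G */
#define ADAPT_SCALE_SHIFT	2			/* one step per quadrupling */
#define ADAPT_SCALE_NPAGES	(ADAPT_SCALE_BASE >> PAGE_SHIFT)

int main(void)
{
	unsigned long gb;

	for (gb = 8; gb <= 65536; gb <<= 1) {
		unsigned long numentries = (gb << 30) >> PAGE_SHIFT;
		unsigned long adapt, extra_shift = 0;

		/*
		 * Same loop shape as in alloc_large_system_hash(): each
		 * quadrupling of memory beyond 64G adds one to the scale,
		 * halving the table relative to linear growth.
		 */
		for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
		     adapt <<= ADAPT_SCALE_SHIFT)
			extra_shift++;

		printf("%6luG: table reduced by 2^%lu\n", gb, extra_shift);
	}
	return 0;
}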


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-05-17 15:51 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-02  5:33 [PATCH v3 0/4] Zeroing hash tables in allocator Pavel Tatashin
2017-03-02  5:33 ` [PATCH v3 1/4] sparc64: NG4 memset 32 bits overflow Pavel Tatashin
2017-03-03 23:34   ` Andrew Morton
2017-03-02  5:33 ` [PATCH v3 2/4] mm: Zeroing hash tables in allocator Pavel Tatashin
2017-03-02  5:33 ` [PATCH v3 3/4] mm: Updated callers to use HASH_ZERO flag Pavel Tatashin
2017-03-02  5:33 ` [PATCH v3 4/4] mm: Adaptive hash table scaling Pavel Tatashin
2017-03-03 23:32   ` Andrew Morton
2017-04-26 20:11     ` Michal Hocko
2017-05-02  8:04       ` Michal Hocko
2017-05-04 18:23       ` Pasha Tatashin
2017-05-04 18:28         ` Pasha Tatashin
2017-05-05 13:30           ` Michal Hocko
2017-05-05 15:33             ` Pasha Tatashin
2017-05-09  9:46               ` Michal Hocko
2017-05-09 13:07                 ` Pasha Tatashin
2017-05-05 13:29         ` Michal Hocko
2017-05-17 15:51     ` Pasha Tatashin
