* [PATCH v1 0/3] Zeroing hash tables in allocator
@ 2017-02-28 14:55 ` Pavel Tatashin
  0 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2017-02-28 14:55 UTC (permalink / raw)
  To: linux-mm, sparclinux

On large machines, hash tables can be many gigabytes in size, and it is
inefficient to zero them in a loop without platform-specific optimizations.

Using memset() provides a standard, platform-optimized way to zero the
memory.
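
For illustration, the series replaces per-bucket initialization loops with
a request for pre-zeroed memory. A minimal sketch based on the callers
updated in patch 3/3 (an empty hlist head is all-zero, so zeroed memory is
already fully initialized):

	/* Old pattern, removed in patch 3/3: initialize one bucket at a time. */
	for (loop = 0; loop < (1U << d_hash_shift); loop++)
		INIT_HLIST_BL_HEAD(dentry_hashtable + loop);

	/* New pattern: ask the page allocator for zeroed memory instead. */
	table = alloc_pages_exact(size, GFP_ATOMIC | __GFP_ZERO);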

Pavel Tatashin (3):
  sparc64: NG4 memset/memcpy 32 bits overflow
  mm: Zeroing hash tables in allocator
  mm: Updated callers to use HASH_ZERO flag

 arch/sparc/lib/NG4memcpy.S          |   71 ++++++++++++++++-------------------
 arch/sparc/lib/NG4memset.S          |   26 ++++++------
 fs/dcache.c                         |   18 ++-------
 fs/inode.c                          |   14 +------
 fs/namespace.c                      |   10 +----
 include/linux/bootmem.h             |    1 +
 kernel/locking/qspinlock_paravirt.h |    3 +-
 kernel/pid.c                        |    7 +--
 mm/page_alloc.c                     |   12 ++++-
 9 files changed, 67 insertions(+), 95 deletions(-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 14:55 ` Pavel Tatashin
@ 2017-02-28 14:55   ` Pavel Tatashin
  -1 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2017-02-28 14:55 UTC (permalink / raw)
  To: linux-mm, sparclinux

Early in boot, Linux patches memset() and memcpy() to branch to platform
optimized versions of these routines. The NG4 (Niagara 4) versions are
currently used on all platforms starting from T4. Recently, M7-optimized
routines were added to UEK4, but they are not in mainline yet. So even with
the M7-optimized routines, the NG4 versions will still be used on T4, T5,
M5, and M6 processors.

While investigating how to improve the initialization time of
dentry_hashtable, which is 8G long on an M6 ldom with 7T of main memory, I
noticed that memset() does not clear all the memory in this array. After
studying the code, I realized that the NG4memset() branches use the %icc
condition codes instead of %xcc, so if the length does not fit in 32 bits,
which is the case for an 8G array, these routines fail to work properly.

The fix is to replace all %icc with %xcc in these routines. (An
alternative is to use %ncc, but that is misleading, as the code already
contains sparcv9-only instructions and cannot be compiled on 32-bit.)

It is important to fix this bug because even an older T4-4 can have 2T of
memory, and the kernel contains large, memory-proportional data structures
that can exceed 4G in size. The memset() failure is silent, and the
resulting corruption is hard to detect.
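
To see why branching on %icc truncates the length, here is a minimal
user-space C sketch (not part of the patch; it only mimics the effect of
testing the low 32 bits of a 64-bit length):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t len = 8ULL << 30;         /* 8G, like dentry_hashtable */
		uint32_t icc_view = (uint32_t)len; /* what %icc effectively sees */

		printf("64-bit length: %llu\n", (unsigned long long)len);
		printf("32-bit view:   %u\n", icc_view); /* prints 0 */
		if (icc_view == 0)
			printf("the loop exits before clearing anything\n");
		return 0;
	}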

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 arch/sparc/lib/NG4memcpy.S |   71 ++++++++++++++++++++------------------------
 arch/sparc/lib/NG4memset.S |   26 ++++++++--------
 2 files changed, 45 insertions(+), 52 deletions(-)

diff --git a/arch/sparc/lib/NG4memcpy.S b/arch/sparc/lib/NG4memcpy.S
index 75bb93b..60ccb46 100644
--- a/arch/sparc/lib/NG4memcpy.S
+++ b/arch/sparc/lib/NG4memcpy.S
@@ -18,7 +18,7 @@
 #define FPU_ENTER			\
 	rd	%fprs, %o5;		\
 	andcc	%o5, FPRS_FEF, %g0;	\
-	be,a,pn	%icc, 999f;		\
+	be,a,pn	%xcc, 999f;		\
 	 wr	%g0, FPRS_FEF, %fprs;	\
 	999:
 
@@ -84,10 +84,6 @@
 #define PREAMBLE
 #endif
 
-#ifndef XCC
-#define XCC xcc
-#endif
-
 	.register	%g2,#scratch
 	.register	%g3,#scratch
 
@@ -252,19 +248,16 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 #ifdef MEMCPY_DEBUG
 	wr		%g0, 0x80, %asi
 #endif
-	srlx		%o2, 31, %g2
-	cmp		%g2, 0
-	tne		%XCC, 5
 	PREAMBLE
 	mov		%o0, %o3
 	brz,pn		%o2, .Lexit
 	 cmp		%o2, 3
-	ble,pn		%icc, .Ltiny
+	ble,pn		%xcc, .Ltiny
 	 cmp		%o2, 19
-	ble,pn		%icc, .Lsmall
+	ble,pn		%xcc, .Lsmall
 	 or		%o0, %o1, %g2
 	cmp		%o2, 128
-	bl,pn		%icc, .Lmedium
+	bl,pn		%xcc, .Lmedium
 	 nop
 
 .Llarge:/* len >= 0x80 */
@@ -279,7 +272,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	add		%o1, 1, %o1
 	subcc		%g1, 1, %g1
 	add		%o0, 1, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stb, %g2, %o0 - 0x01), NG4_retl_o2_plus_g1_plus_1)
 
 51:	LOAD(prefetch, %o1 + 0x040, #n_reads_strong)
@@ -295,7 +288,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	 * loop, or we require the alignaddr/faligndata variant.
 	 */
 	andcc		%o1, 0x7, %o5
-	bne,pn		%icc, .Llarge_src_unaligned
+	bne,pn		%xcc, .Llarge_src_unaligned
 	 sub		%g0, %o0, %g1
 
 	/* Legitimize the use of initializing stores by getting dest
@@ -309,7 +302,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	add		%o1, 8, %o1
 	subcc		%g1, 8, %g1
 	add		%o0, 8, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stx, %g2, %o0 - 0x08), NG4_retl_o2_plus_g1_plus_8)
 
 .Llarge_aligned:
@@ -343,16 +336,16 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	add		%o0, 0x08, %o0
 	EX_ST(STORE_INIT(GLOBAL_SPARE, %o0), NG4_retl_o2_plus_o4_plus_8)
 	add		%o0, 0x08, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 LOAD(prefetch, %o1 + 0x200, #n_reads_strong)
 
 	membar		#StoreLoad | #StoreStore
 
 	brz,pn		%o2, .Lexit
 	 cmp		%o2, 19
-	ble,pn		%icc, .Lsmall_unaligned
+	ble,pn		%xcc, .Lsmall_unaligned
 	 nop
-	ba,a,pt		%icc, .Lmedium_noprefetch
+	ba,a,pt		%xcc, .Lmedium_noprefetch
 
 .Lexit:	retl
 	 mov		EX_RETVAL(%o3), %o0
@@ -395,7 +388,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	EX_ST_FP(STORE(std, %f28, %o0 + 0x30), NG4_retl_o2_plus_o4_plus_16)
 	EX_ST_FP(STORE(std, %f30, %o0 + 0x38), NG4_retl_o2_plus_o4_plus_8)
 	add		%o0, 0x40, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 LOAD(prefetch, %g1 + 0x200, #n_reads_strong)
 #ifdef NON_USER_COPY
 	VISExitHalfFast
@@ -404,9 +397,9 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 #endif
 	brz,pn		%o2, .Lexit
 	 cmp		%o2, 19
-	ble,pn		%icc, .Lsmall_unaligned
+	ble,pn		%xcc, .Lsmall_unaligned
 	 nop
-	ba,a,pt		%icc, .Lmedium_unaligned
+	ba,a,pt		%xcc, .Lmedium_unaligned
 
 #ifdef NON_USER_COPY
 .Lmedium_vis_entry_fail:
@@ -415,11 +408,11 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 .Lmedium:
 	LOAD(prefetch, %o1 + 0x40, #n_reads_strong)
 	andcc		%g2, 0x7, %g0
-	bne,pn		%icc, .Lmedium_unaligned
+	bne,pn		%xcc, .Lmedium_unaligned
 	 nop
 .Lmedium_noprefetch:
 	andncc		%o2, 0x20 - 1, %o5
-	be,pn		%icc, 2f
+	be,pn		%xcc, 2f
 	 sub		%o2, %o5, %o2
 1:	EX_LD(LOAD(ldx, %o1 + 0x00, %g1), NG4_retl_o2_plus_o5)
 	EX_LD(LOAD(ldx, %o1 + 0x08, %g2), NG4_retl_o2_plus_o5)
@@ -431,29 +424,29 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	EX_ST(STORE(stx, %g2, %o0 + 0x08), NG4_retl_o2_plus_o5_plus_24)
 	EX_ST(STORE(stx, GLOBAL_SPARE, %o0 + 0x10), NG4_retl_o2_plus_o5_plus_24)
 	EX_ST(STORE(stx, %o4, %o0 + 0x18), NG4_retl_o2_plus_o5_plus_8)
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x20, %o0
 2:	andcc		%o2, 0x18, %o5
-	be,pt		%icc, 3f
+	be,pt		%xcc, 3f
 	 sub		%o2, %o5, %o2
 
 1:	EX_LD(LOAD(ldx, %o1 + 0x00, %g1), NG4_retl_o2_plus_o5)
 	add		%o1, 0x08, %o1
 	add		%o0, 0x08, %o0
 	subcc		%o5, 0x08, %o5
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stx, %g1, %o0 - 0x08), NG4_retl_o2_plus_o5_plus_8)
 3:	brz,pt		%o2, .Lexit
 	 cmp		%o2, 0x04
-	bl,pn		%icc, .Ltiny
+	bl,pn		%xcc, .Ltiny
 	 nop
 	EX_LD(LOAD(lduw, %o1 + 0x00, %g1), NG4_retl_o2)
 	add		%o1, 0x04, %o1
 	add		%o0, 0x04, %o0
 	subcc		%o2, 0x04, %o2
-	bne,pn		%icc, .Ltiny
+	bne,pn		%xcc, .Ltiny
 	 EX_ST(STORE(stw, %g1, %o0 - 0x04), NG4_retl_o2_plus_4)
-	ba,a,pt		%icc, .Lexit
+	ba,a,pt		%xcc, .Lexit
 .Lmedium_unaligned:
 	/* First get dest 8 byte aligned.  */
 	sub		%g0, %o0, %g1
@@ -465,7 +458,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	add		%o1, 1, %o1
 	subcc		%g1, 1, %g1
 	add		%o0, 1, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stb, %g2, %o0 - 0x01), NG4_retl_o2_plus_g1_plus_1)
 2:
 	and		%o1, 0x7, %g1
@@ -485,30 +478,30 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	or		GLOBAL_SPARE, %o4, GLOBAL_SPARE
 	EX_ST(STORE(stx, GLOBAL_SPARE, %o0 + 0x00), NG4_retl_o2_plus_o5_plus_8)
 	add		%o0, 0x08, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 sllx		%g3, %g1, %o4
 	srl		%g1, 3, %g1
 	add		%o1, %g1, %o1
 	brz,pn		%o2, .Lexit
 	 nop
-	ba,pt		%icc, .Lsmall_unaligned
+	ba,pt		%xcc, .Lsmall_unaligned
 
 .Ltiny:
 	EX_LD(LOAD(ldub, %o1 + 0x00, %g1), NG4_retl_o2)
 	subcc		%o2, 1, %o2
-	be,pn		%icc, .Lexit
+	be,pn		%xcc, .Lexit
 	 EX_ST(STORE(stb, %g1, %o0 + 0x00), NG4_retl_o2_plus_1)
 	EX_LD(LOAD(ldub, %o1 + 0x01, %g1), NG4_retl_o2)
 	subcc		%o2, 1, %o2
-	be,pn		%icc, .Lexit
+	be,pn		%xcc, .Lexit
 	 EX_ST(STORE(stb, %g1, %o0 + 0x01), NG4_retl_o2_plus_1)
 	EX_LD(LOAD(ldub, %o1 + 0x02, %g1), NG4_retl_o2)
-	ba,pt		%icc, .Lexit
+	ba,pt		%xcc, .Lexit
 	 EX_ST(STORE(stb, %g1, %o0 + 0x02), NG4_retl_o2)
 
 .Lsmall:
 	andcc		%g2, 0x3, %g0
-	bne,pn		%icc, .Lsmall_unaligned
+	bne,pn		%xcc, .Lsmall_unaligned
 	 andn		%o2, 0x4 - 1, %o5
 	sub		%o2, %o5, %o2
 1:
@@ -516,18 +509,18 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	add		%o1, 0x04, %o1
 	subcc		%o5, 0x04, %o5
 	add		%o0, 0x04, %o0
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stw, %g1, %o0 - 0x04), NG4_retl_o2_plus_o5_plus_4)
 	brz,pt		%o2, .Lexit
 	 nop
-	ba,a,pt		%icc, .Ltiny
+	ba,a,pt		%xcc, .Ltiny
 
 .Lsmall_unaligned:
 1:	EX_LD(LOAD(ldub, %o1 + 0x00, %g1), NG4_retl_o2)
 	add		%o1, 1, %o1
 	add		%o0, 1, %o0
 	subcc		%o2, 1, %o2
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 EX_ST(STORE(stb, %g1, %o0 - 0x01), NG4_retl_o2_plus_1)
-	ba,a,pt		%icc, .Lexit
+	ba,a,pt		%xcc, .Lexit
 	.size		FUNC_NAME, .-FUNC_NAME
diff --git a/arch/sparc/lib/NG4memset.S b/arch/sparc/lib/NG4memset.S
index 41da4bd..e7c2e70 100644
--- a/arch/sparc/lib/NG4memset.S
+++ b/arch/sparc/lib/NG4memset.S
@@ -13,14 +13,14 @@
 	.globl		NG4memset
 NG4memset:
 	andcc		%o1, 0xff, %o4
-	be,pt		%icc, 1f
+	be,pt		%xcc, 1f
 	 mov		%o2, %o1
 	sllx		%o4, 8, %g1
 	or		%g1, %o4, %o2
 	sllx		%o2, 16, %g1
 	or		%g1, %o2, %o2
 	sllx		%o2, 32, %g1
-	ba,pt		%icc, 1f
+	ba,pt		%xcc, 1f
 	 or		%g1, %o2, %o4
 	.size		NG4memset,.-NG4memset
 
@@ -29,7 +29,7 @@ NG4memset:
 NG4bzero:
 	clr		%o4
 1:	cmp		%o1, 16
-	ble		%icc, .Ltiny
+	ble		%xcc, .Ltiny
 	 mov		%o0, %o3
 	sub		%g0, %o0, %g1
 	and		%g1, 0x7, %g1
@@ -37,7 +37,7 @@ NG4bzero:
 	 sub		%o1, %g1, %o1
 1:	stb		%o4, [%o0 + 0x00]
 	subcc		%g1, 1, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 1, %o0
 .Laligned8:
 	cmp		%o1, 64 + (64 - 8)
@@ -48,7 +48,7 @@ NG4bzero:
 	 sub		%o1, %g1, %o1
 1:	stx		%o4, [%o0 + 0x00]
 	subcc		%g1, 8, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x8, %o0
 .Laligned64:
 	andn		%o1, 64 - 1, %g1
@@ -58,30 +58,30 @@ NG4bzero:
 1:	stxa		%o4, [%o0 + %g0] ASI_BLK_INIT_QUAD_LDD_P
 	subcc		%g1, 0x40, %g1
 	stxa		%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x40, %o0
 .Lpostloop:
 	cmp		%o1, 8
-	bl,pn		%icc, .Ltiny
+	bl,pn		%xcc, .Ltiny
 	 membar		#StoreStore|#StoreLoad
 .Lmedium:
 	andn		%o1, 0x7, %g1
 	sub		%o1, %g1, %o1
 1:	stx		%o4, [%o0 + 0x00]
 	subcc		%g1, 0x8, %g1
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x08, %o0
 	andcc		%o1, 0x4, %g1
-	be,pt		%icc, .Ltiny
+	be,pt		%xcc, .Ltiny
 	 sub		%o1, %g1, %o1
 	stw		%o4, [%o0 + 0x00]
 	add		%o0, 0x4, %o0
 .Ltiny:
 	cmp		%o1, 0
-	be,pn		%icc, .Lexit
+	be,pn		%xcc, .Lexit
 1:	 subcc		%o1, 1, %o1
 	stb		%o4, [%o0 + 0x00]
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 1, %o0
 .Lexit:
 	retl
@@ -99,7 +99,7 @@ NG4bzero:
 	stxa		%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
 	stxa		%o4, [%o0 + %g3] ASI_BLK_INIT_QUAD_LDD_P
 	stxa		%o4, [%o0 + %o5] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt		%icc, 1b
+	bne,pt		%xcc, 1b
 	 add		%o0, 0x30, %o0
-	ba,a,pt		%icc, .Lpostloop
+	ba,a,pt		%xcc, .Lpostloop
 	.size		NG4bzero,.-NG4bzero
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 2/3] mm: Zeroing hash tables in allocator
  2017-02-28 14:55 ` Pavel Tatashin
@ 2017-02-28 14:55   ` Pavel Tatashin
  -1 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2017-02-28 14:55 UTC (permalink / raw)
  To: linux-mm, sparclinux

Add a new flag, HASH_ZERO, which when provided guarantees that the hash
table returned by alloc_large_system_hash() is zeroed. In most cases that
is what the caller needs. Use the page-level allocator's __GFP_ZERO flag
to zero the memory; it uses memset(), which is an efficient,
platform-optimized way to zero memory.
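
For illustration, a caller that previously zeroed the table by hand can
now request zeroed memory directly. A minimal sketch; the table name,
entry count, and shift/mask variables here are hypothetical, but the
argument order matches the existing alloc_large_system_hash() callers:

	example_hashtable =
		alloc_large_system_hash("Example-cache",
					sizeof(struct hlist_head),
					ehash_entries,  /* suggested count */
					14,             /* default log2 */
					HASH_ZERO,      /* returned zeroed */
					&e_hash_shift,
					&e_hash_mask,
					0,
					0);
	/* No per-bucket INIT_HLIST_HEAD() loop is needed afterwards. */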

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 include/linux/bootmem.h |    1 +
 mm/page_alloc.c         |   12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d..e223d91 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -358,6 +358,7 @@ static inline void __init memblock_free_late(
 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
 					 * shift passed via *_hash_shift */
+#define HASH_ZERO	0x00000004	/* Zero allocated hash table */
 
 /* Only NUMA needs hash distribution. 64bit NUMA architectures have
  * sufficient vmalloc space.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a7a6aac..1b0f7a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7142,6 +7142,7 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;
+	gfp_t gfp_flags;
 
 	/* allow the kernel cmdline to have a say */
 	if (!numentries) {
@@ -7186,12 +7187,17 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 
 	log2qty = ilog2(numentries);
 
+	/*
+	 * memblock allocator returns zeroed memory already, so HASH_ZERO is
+	 * currently not used when HASH_EARLY is specified.
+	 */
+	gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
 	do {
 		size = bucketsize << log2qty;
 		if (flags & HASH_EARLY)
 			table = memblock_virt_alloc_nopanic(size, 0);
 		else if (hashdist)
-			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
+			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
 		else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
@@ -7199,8 +7205,8 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 			 * alloc_pages_exact() automatically does
 			 */
 			if (get_order(size) < MAX_ORDER) {
-				table = alloc_pages_exact(size, GFP_ATOMIC);
-				kmemleak_alloc(table, size, 1, GFP_ATOMIC);
+				table = alloc_pages_exact(size, gfp_flags);
+				kmemleak_alloc(table, size, 1, gfp_flags);
 			}
 		}
 	} while (!table && size > PAGE_SIZE && --log2qty);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 3/3] mm: Updated callers to use HASH_ZERO flag
  2017-02-28 14:55 ` Pavel Tatashin
@ 2017-02-28 14:55   ` Pavel Tatashin
  -1 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2017-02-28 14:55 UTC (permalink / raw)
  To: linux-mm, sparclinux

Update the dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove the initialization loops after the allocations.
In places where HASH_EARLY was used, such as __pv_init_lock_hash(), a
zeroed hash table was already assumed, because memblock zeroes the
memory.

CPU: SPARC M6, Memory: 7T
Before fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 11.798s

After fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.245s

After fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.244s

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
---
 fs/dcache.c                         |   18 ++++--------------
 fs/inode.c                          |   14 ++------------
 fs/namespace.c                      |   10 ++--------
 kernel/locking/qspinlock_paravirt.h |    3 ++-
 kernel/pid.c                        |    7 ++-----
 5 files changed, 12 insertions(+), 40 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 95d71ed..363502f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3548,8 +3548,6 @@ static int __init set_dhash_entries(char *str)
 
 static void __init dcache_init_early(void)
 {
-	unsigned int loop;
-
 	/* If hashes are distributed across NUMA nodes, defer
 	 * hash allocation until vmalloc space is available.
 	 */
@@ -3561,24 +3559,19 @@ static void __init dcache_init_early(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					HASH_EARLY,
+					HASH_EARLY | HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << d_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(dentry_hashtable + loop);
 }
 
 static void __init dcache_init(void)
 {
-	unsigned int loop;
-
-	/* 
+	/*
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
+	 * of the dcache.
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
@@ -3592,14 +3585,11 @@ static void __init dcache_init(void)
 					sizeof(struct hlist_bl_head),
 					dhash_entries,
 					13,
-					0,
+					HASH_ZERO,
 					&d_hash_shift,
 					&d_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << d_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(dentry_hashtable + loop);
 }
 
 /* SLAB cache for __getname() consumers */
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd..1b15a7c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1916,8 +1916,6 @@ static int __init set_ihash_entries(char *str)
  */
 void __init inode_init_early(void)
 {
-	unsigned int loop;
-
 	/* If hashes are distributed across NUMA nodes, defer
 	 * hash allocation until vmalloc space is available.
 	 */
@@ -1929,20 +1927,15 @@ void __init inode_init_early(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					HASH_EARLY,
+					HASH_EARLY | HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
 void __init inode_init(void)
 {
-	unsigned int loop;
-
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
@@ -1960,14 +1953,11 @@ void __init inode_init(void)
 					sizeof(struct hlist_head),
 					ihash_entries,
 					14,
-					0,
+					HASH_ZERO,
 					&i_hash_shift,
 					&i_hash_mask,
 					0,
 					0);
-
-	for (loop = 0; loop < (1U << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/namespace.c b/fs/namespace.c
index 8bfad42..275e6e2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3238,7 +3238,6 @@ static void __init init_mount_tree(void)
 
 void __init mnt_init(void)
 {
-	unsigned u;
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
@@ -3247,22 +3246,17 @@ void __init mnt_init(void)
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
 				mhash_entries, 19,
-				0,
+				HASH_ZERO,
 				&m_hash_shift, &m_hash_mask, 0, 0);
 	mountpoint_hashtable = alloc_large_system_hash("Mountpoint-cache",
 				sizeof(struct hlist_head),
 				mphash_entries, 19,
-				0,
+				HASH_ZERO,
 				&mp_hash_shift, &mp_hash_mask, 0, 0);
 
 	if (!mount_hashtable || !mountpoint_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
-	for (u = 0; u <= m_hash_mask; u++)
-		INIT_HLIST_HEAD(&mount_hashtable[u]);
-	for (u = 0; u <= mp_hash_mask; u++)
-		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
-
 	kernfs_init();
 
 	err = sysfs_init();
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..4ccfcaa 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -193,7 +193,8 @@ void __init __pv_init_lock_hash(void)
 	 */
 	pv_lock_hash = alloc_large_system_hash("PV qspinlock",
 					       sizeof(struct pv_hash_entry),
-					       pv_hash_size, 0, HASH_EARLY,
+					       pv_hash_size, 0,
+					       HASH_EARLY | HASH_ZERO,
 					       &pv_lock_hash_bits, NULL,
 					       pv_hash_size, pv_hash_size);
 }
diff --git a/kernel/pid.c b/kernel/pid.c
index 0291804..013e023 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -572,16 +572,13 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
  */
 void __init pidhash_init(void)
 {
-	unsigned int i, pidhash_size;
+	unsigned int pidhash_size;
 
 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
-					   HASH_EARLY | HASH_SMALL,
+					   HASH_EARLY | HASH_SMALL | HASH_ZERO,
 					   &pidhash_shift, NULL,
 					   0, 4096);
 	pidhash_size = 1U << pidhash_shift;
-
-	for (i = 0; i < pidhash_size; i++)
-		INIT_HLIST_HEAD(&pid_hash[i]);
 }
 
 void __init pidmap_init(void)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 14:55   ` Pavel Tatashin
@ 2017-02-28 15:12     ` David Miller
  -1 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2017-02-28 15:12 UTC (permalink / raw)
  To: pasha.tatashin; +Cc: linux-mm, sparclinux

From: Pavel Tatashin <pasha.tatashin@oracle.com>
Date: Tue, 28 Feb 2017 09:55:44 -0500

> @@ -252,19 +248,16 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
>  #ifdef MEMCPY_DEBUG
>  	wr		%g0, 0x80, %asi
>  #endif
> -	srlx		%o2, 31, %g2
> -	cmp		%g2, 0
> -	tne		%XCC, 5
>  	PREAMBLE
>  	mov		%o0, %o3
>  	brz,pn		%o2, .Lexit


This limitation was placed here intentionally, because huge values
are, 99% of the time, bugs and unintentional.

You will see that every assembler optimized memcpy on sparc64 has
this bug trap, not just NG4.

This is a very useful way to find bugs and length {over,under}flows.
Please do not remove it.

If you have to do 4GB or larger copies, do it in pieces or similar.

Thank you.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 15:12     ` David Miller
@ 2017-02-28 15:56       ` Pasha Tatashin
  -1 siblings, 0 replies; 18+ messages in thread
From: Pasha Tatashin @ 2017-02-28 15:56 UTC (permalink / raw)
  To: David Miller; +Cc: linux-mm, sparclinux

Hi Dave,

Thank you, I will reinstate the check in memcpy() to limit it to 2G.
Are you OK with keeping the %icc to %xcc change for consistency, or
should I revert it as well?

NG4memset() never had this length bound check, and it bit me when I was
testing the time it takes to zero large hash tables. Are you OK with
keeping the change in memset()?

Also, for consideration, machines are getting bigger, and 2G is becoming
very small compared to total memory sizes, so some algorithms can become
inefficient when they have to artificially limit memcpy()s to 2G chunks.

X6-8 scales up to 6T:
http://www.oracle.com/technetwork/database/exadata/exadata-x6-8-ds-2968796.pdf

SPARC M7-16 scales up to 16T:
http://www.oracle.com/us/products/servers-storage/sparc-m7-16-ds-2687045.pdf

2G is just 0.012% of the total memory size on M7-16.

Thank you,
Pasha

On 2017-02-28 10:12, David Miller wrote:
> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> Date: Tue, 28 Feb 2017 09:55:44 -0500
>
>> @@ -252,19 +248,16 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
>>  #ifdef MEMCPY_DEBUG
>>  	wr		%g0, 0x80, %asi
>>  #endif
>> -	srlx		%o2, 31, %g2
>> -	cmp		%g2, 0
>> -	tne		%XCC, 5
>>  	PREAMBLE
>>  	mov		%o0, %o3
>>  	brz,pn		%o2, .Lexit
>
>
> This limitation was placed here intentionally, because huge values
> are, 99% of the time, bugs and unintentional.
>
> You will see that every assembler optimized memcpy on sparc64 has
> this bug trap, not just NG4.
>
> This is a very useful way to find bugs and length {over,under}flows.
> Please do not remove it.
>
> If you have to do 4GB or larger copies, do it in pieces or similar.
>
> Thank you.
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 15:56       ` Pasha Tatashin
@ 2017-02-28 18:59         ` Matthew Wilcox
  -1 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2017-02-28 18:59 UTC (permalink / raw)
  To: Pasha Tatashin; +Cc: David Miller, linux-mm, sparclinux

On Tue, Feb 28, 2017 at 10:56:57AM -0500, Pasha Tatashin wrote:
> Also, for consideration, machines are getting bigger, and 2G is becoming
> very small compared to the memory sizes, so some algorithms can become
> inefficient when they have to artificially limit memcpy()s to 2G chunks.

... what algorithms are deemed "inefficient" when they take a break every
2 billion bytes to, oh, I don't know, check to see that a higher-priority
process doesn't want the CPU?

> X6-8 scales up to 6T:
> http://www.oracle.com/technetwork/database/exadata/exadata-x6-8-ds-2968796.pdf
> 
> SPARC M7-16 scales up to 16T:
> http://www.oracle.com/us/products/servers-storage/sparc-m7-16-ds-2687045.pdf
> 
> 2G is just 0.012% of the total memory size on M7-16.

Right, so suppose you're copying half the memory to the other half of
memory.  Let's suppose it takes a hundred extra instructions every 2GB to
check that nobody else wants the CPU and dive back into the memcpy code.
That's 800,000 additional instructions.  Which even on a SPARC CPU is
going to execute in less than 0.001 second.  CPU memory bandwidth is
on the order of 100GB/s, so the overall memcpy is going to take about
160 seconds.

You'd have far more joy dividing the work up into 2GB chunks and
distributing the work to N CPU packages (... not hardware threads
...) than you would trying to save a millisecond by allowing the CPU to
copy more than 2GB at a time.
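
A minimal sketch of the chunked approach (plain C; the 2GB chunk size
follows the suggestion above, and the yield point is a hypothetical
placeholder, e.g. cond_resched() in kernel context):

	#include <stddef.h>
	#include <string.h>

	#define CHUNK (2UL << 30)	/* 2GB per piece */

	/* Copy len bytes in pieces so other work can run in between. */
	static void chunked_copy(void *dst, const void *src, size_t len)
	{
		while (len) {
			size_t n = len < CHUNK ? len : CHUNK;

			memcpy(dst, src, n);
			dst = (char *)dst + n;
			src = (const char *)src + n;
			len -= n;
			/* yield_point();  hypothetical, e.g. cond_resched() */
		}
	}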

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 18:59         ` Matthew Wilcox
@ 2017-02-28 19:34           ` Pasha Tatashin
  -1 siblings, 0 replies; 18+ messages in thread
From: Pasha Tatashin @ 2017-02-28 19:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Miller, linux-mm, sparclinux

Hi Matthew,

Thank you for your comments, my replies below:

On 02/28/2017 01:59 PM, Matthew Wilcox wrote:
> ... what algorithms are deemed "inefficient" when they take a break every
> 2 billion bytes to, oh, I don't know, check to see that a higher-priority
> process doesn't want the CPU?

I do not see NG4memcpy() disabling interrupts, so there should not be
any issue with letting higher-priority processes interrupt and do their
work. And, as I said, my point was mostly for consideration; I will
restore the 2G bound check in NG4memcpy().

> Right, so suppose you're copying half the memory to the other half of
> memory.  Let's suppose it takes a hundred extra instructions every 2GB to
> check that nobody else wants the CPU and dive back into the memcpy code.
> That's 800,000 additional instructions.  Which even on a SPARC CPU is
> going to execute in less than 0.001 second.  CPU memory bandwidth is
> on the order of 100GB/s, so the overall memcpy is going to take about
> 160 seconds.

Sure, the computational overhead is minimal, but still adding and 
maintaining extra code to break-up a single memcpy() has its cost. For 
example: as far I as can tell x86 and powerpc memcpy()s do not have this 
limit, which means that an author of a driver would have to explicitly 
divide memcpy()s into 2G chunks only to work on SPARC (and know about 
this limit too!). If there is a driver that has a memory proportional 
data structure it is possible it will panic the kernel once such driver 
is attached on a larger memory machine.

Another example is memblock allocator that is currently unconditionally 
calls memset() to zero all the allocated memory without breaking it up 
into pieces, and when other CPUs are not yet available to split the work 
to speed it up.

So, if a large chunk of memory is allocated via the memblock allocator
(for example, when booted with the kernel parameter "hashdist=0"),
memset() will be called on 8G and 4G pieces of memory on a machine with
7T of memory, and that would panic if we added this bound limit to
memset() as well.
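
In simplified form, the boot-time path looks roughly like the sketch
below (early_alloc and boot_alloc_zeroed are hypothetical stand-in
names; the real code lives in mm/memblock.c):

    #include <linux/string.h>
    #include <linux/types.h>

    /* Hypothetical stand-in for the memblock allocation primitives. */
    extern void *early_alloc(phys_addr_t size);

    /* One CPU, no scheduler yet, and a single giant memset() over the
     * whole allocation -- there is nothing to chunk the work across. */
    static void *boot_alloc_zeroed(phys_addr_t size)
    {
        void *ptr = early_alloc(size);

        memset(ptr, 0, size);
        return ptr;
    }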

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow
  2017-02-28 19:34           ` Pasha Tatashin
@ 2017-02-28 19:58             ` Matthew Wilcox
  -1 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2017-02-28 19:58 UTC (permalink / raw)
  To: Pasha Tatashin; +Cc: David Miller, linux-mm, sparclinux

On Tue, Feb 28, 2017 at 02:34:17PM -0500, Pasha Tatashin wrote:
> Hi Matthew,
> 
> Thank you for your comments, my replies below:
> 
> On 02/28/2017 01:59 PM, Matthew Wilcox wrote:
> > ... what algorithms are deemed "inefficient" when they take a break every
> > 2 billion bytes to, ohidon'tknow, check to see that a higher priority
> > process doesn't want the CPU?
> 
> I do not see NG4memcpy() disabling interrupts, so there should not be any
> issue with letting higher-priority processes interrupt it and do their
> work. And, as I said, my point was mostly for consideration; I will revert
> the bound check in NG4memcpy() to the 2G limit.

That's not how it works in Linux.  Unless you've configured your kernel
with PREEMPT, threads are not preempted while they're inside the kernel.
See cond_resched() in include/linux/sched.h.
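
For reference, the usual shape of that pattern for a long copy would be
something like the sketch below (big_memcpy is a hypothetical name, not
an existing kernel function):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/sizes.h>
    #include <linux/string.h>

    /* Hypothetical helper: copy a huge region without hogging the CPU. */
    static void big_memcpy(void *dst, const void *src, size_t len)
    {
        while (len) {
            size_t n = min_t(size_t, len, SZ_2G);

            memcpy(dst, src, n);
            dst += n;       /* void * arithmetic: a gcc/kernel extension */
            src += n;
            len -= n;
            cond_resched(); /* yield if someone else needs the CPU */
        }
    }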

> > Right, so suppose you're copying half the memory to the other half of
> > memory.  Let's suppose it takes a hundred extra instructions every 2GB to
> > check that nobody else wants the CPU and dive back into the memcpy code.
> > That's 800,000 additional instructions.  Which even on a SPARC CPU is
> > going to execute in less than 0.001 second.  CPU memory bandwidth is
> > on the order of 100GB/s, so the overall memcpy is going to take about
> > 160 seconds.
> 
> Sure, the computational overhead is minimal, but adding and maintaining
> extra code to break up a single memcpy() still has a cost. For example: as
> far as I can tell, the x86 and powerpc memcpy()s do not have this limit,
> which means a driver author would have to explicitly divide memcpy()s into
> 2G chunks just to work on SPARC (and would have to know about the limit in
> the first place!). If a driver has a data structure whose size is
> proportional to memory, it could panic the kernel once that driver is
> loaded on a machine with more memory.

Ah, now that is a good point.  We should insert such a limit into all
the architecture-specific memcpy() implementations and the default
implementation in lib/.  This should not affect any drivers; it is
almost impossible to allocate 2GB of contiguous memory.  kmalloc() won't
do it, alloc_pages() won't do it.  vmalloc() will (maybe it shouldn't?),
but I have a hard time thinking of a good reason for a driver to
allocate that much memory.
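
As a sketch of what that could look like in the generic byte-copy
fallback in lib/string.c, with the WARN_ONCE() line being the
hypothetical addition (no such check was merged as part of this thread):

    void *memcpy(void *dest, const void *src, size_t count)
    {
        char *tmp = dest;
        const char *s = src;

        /* Hypothetical sanity limit: flag copies larger than 2GB. */
        WARN_ONCE(count > SZ_2G, "memcpy: suspicious length %zu\n", count);
        while (count--)
            *tmp++ = *s++;
        return dest;
    }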


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-02-28 19:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 14:55 [PATCH v1 0/3] Zeroing hash tables in allocator Pavel Tatashin
2017-02-28 14:55 ` [PATCH v1 1/3] sparc64: NG4 memset/memcpy 32 bits overflow Pavel Tatashin
2017-02-28 15:12   ` David Miller
2017-02-28 15:56     ` Pasha Tatashin
2017-02-28 18:59       ` Matthew Wilcox
2017-02-28 19:34         ` Pasha Tatashin
2017-02-28 19:58           ` Matthew Wilcox
2017-02-28 14:55 ` [PATCH v1 2/3] mm: Zeroing hash tables in allocator Pavel Tatashin
2017-02-28 14:55 ` [PATCH v1 3/3] mm: Updated callers to use HASH_ZERO flag Pavel Tatashin
