* [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
@ 2020-09-15 19:46 Scott Cheloha
  2020-09-16  7:39 ` David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Scott Cheloha @ 2020-09-15 19:46 UTC
  To: linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, Laurent Dufour, David Hildenbrand,
	Rick Lindsley

During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid.  In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup.  The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE)
or an invalid nid.
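
For reference, the lookup being skipped looks roughly like this, a
simplified sketch of hot_add_drconf_scn_to_nid() in
arch/powerpc/mm/numa.c (not verbatim):

	static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
	{
		unsigned long lmb_size = drmem_lmb_size();
		struct drmem_lmb *lmb;

		/* Linear scan over every LMB in the drconf range. */
		for_each_drmem_lmb(lmb) {
			if (lmb->flags & DRCONF_MEM_RESERVED ||
			    !(lmb->flags & DRCONF_MEM_ASSIGNED))
				continue;
			if (scn_addr < lmb->base_addr ||
			    scn_addr >= lmb->base_addr + lmb_size)
				continue;
			/* Found the LMB containing the address. */
			return of_drconf_to_nid_single(lmb);
		}
		return NUMA_NO_NODE;
	}

dlpar_add_lmb() already holds the very lmb pointer this loop goes
looking for.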

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
LMBs on an unpatched kernel took ~3.5 hours, while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

Patched (7065 seconds):
Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

The speedup grows more substantial when hot-adding LMBs at the end of
the drconf range, because the lookup being skipped is a linear search
across all of the LMBs.

To see the distinction, consider a smaller hot-add test on the same
LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
             4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
             2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
   445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
     8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
   300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
   258,091,488,691      instructions              #    0.58  insn per cycle
                                                  #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
    70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
     3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)

           105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )

Patched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
             4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
             2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
   442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
     8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
   299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
   252,731,168,193      instructions              #    0.57  insn per cycle
                                                  #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
    68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
     3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)

           104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )

This is consistent with expectations.  An add-by-count hot-add
operation adds LMBs greedily, so LMBs near the start of the drconf
range are considered first.  On an otherwise idle LPAR with this many
LMBs we would expect to find the LMBs we need near the start of the
drconf range, hence the smaller speedup.
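
For context, the add-by-count walk looks roughly like this, a minimal
sketch of dlpar_memory_add_by_count() with DRC acquisition, validation,
and rollback omitted (not verbatim):

	static int dlpar_memory_add_by_count(u32 lmbs_to_add)
	{
		struct drmem_lmb *lmb;
		u32 lmbs_added = 0;

		/* Greedy: claim free LMBs starting from the front. */
		for_each_drmem_lmb(lmb) {
			if (lmbs_added == lmbs_to_add)
				break;
			if (lmb->flags & DRCONF_MEM_ASSIGNED)
				continue;
			if (dlpar_add_lmb(lmb) == 0)
				lmbs_added++;
		}

		return lmbs_added == lmbs_to_add ? 0 : -EINVAL;
	}

LMBs near the front of the array are found quickly even by the linear
search, hence the small difference in this test.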

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
Changelog:

v1: https://lore.kernel.org/linuxppc-dev/20200910175637.2865160-1-cheloha@linux.ibm.com/

v2:
- Move prototype for of_drconf_to_nid_single() to topology.h.
  Requested by Michael Ellerman.

v3:
- Send the right patch.  v2 is from the wrong branch, my mistake.

 arch/powerpc/include/asm/topology.h             | 3 +++
 arch/powerpc/mm/numa.c                          | 2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 ++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..d15d9999bad6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -86,6 +86,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 #endif /* CONFIG_NUMA */
 
+struct drmem_lmb;
+extern int of_drconf_to_nid_single(struct drmem_lmb *);
+
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int find_and_online_cpu_nid(int cpu);
 #else
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..63507b47164d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -430,7 +430,7 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
+int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 {
 	struct assoc_arrays aa = { .arrays = NULL };
 	int default_nid = NUMA_NO_NODE;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac4..9a533acf8ad0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -611,8 +611,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	block_sz = memory_block_size_bytes();
 
-	/* Find the node id for this address. */
-	nid = memory_add_physaddr_to_nid(lmb->base_addr);
+	/* Find the node id for this LMB.  Fake one if necessary. */
+	nid = of_drconf_to_nid_single(lmb);
+	if (nid < 0 || !node_possible(nid))
+		nid = first_online_node;
 
 	/* Add the memory */
 	rc = __add_memory(nid, lmb->base_addr, block_sz);
-- 
2.24.1



* Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
  2020-09-15 19:46 [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup Scott Cheloha
@ 2020-09-16  7:39 ` David Hildenbrand
  2020-09-16 14:39   ` Scott Cheloha
  2020-09-16  8:22 ` Laurent Dufour
  2020-10-07  3:21 ` Michael Ellerman
  2 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2020-09-16  7:39 UTC
  To: Scott Cheloha, linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, Laurent Dufour, Rick Lindsley

On 15.09.20 21:46, Scott Cheloha wrote:
> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> to determine which node id (nid) to use when later calling __add_memory().
>
> [...]
>
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>


Hi Scott,

IIRC, ppc DLPAR does a single add_memory() for each LMB (16 MB). With
tons of LMBs, this will also make /proc/iomem explode in size (using a
list-based tree), making traversal significantly slower, e.g., on
insertions and system ram walks.

I was wondering if you would get another performance boost under ppc
when using MEMHP_MERGE_RESOURCE [1]. AFAIK, the resource boundaries are
not of interest. No guarantees, might be worth a try.
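
A hedged sketch of what that could look like in dlpar_add_lmb(), using
the flag name and extended __add_memory() signature proposed in [1]
(both may still change):

	/* Merge this range into an adjacent System RAM resource
	 * instead of creating one /proc/iomem entry per LMB. */
	rc = __add_memory(nid, lmb->base_addr, block_sz,
			  MEMHP_MERGE_RESOURCE);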

Did you investigate what else makes memory hotplug that slow? (126000
LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...) Memory
block devices might still be a slowdown (although we have an xarray in
place now that takes care of most pain).

[1]
https://lore.kernel.org/linux-mm/20200911103459.10306-1-david@redhat.com/

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
  2020-09-15 19:46 [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup Scott Cheloha
  2020-09-16  7:39 ` David Hildenbrand
@ 2020-09-16  8:22 ` Laurent Dufour
  2020-10-07  3:21 ` Michael Ellerman
  2 siblings, 0 replies; 6+ messages in thread
From: Laurent Dufour @ 2020-09-16  8:22 UTC
  To: Scott Cheloha, linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, David Hildenbrand, Rick Lindsley

On 15/09/2020 at 21:46, Scott Cheloha wrote:
> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> to determine which node id (nid) to use when later calling __add_memory().
>
> [...]
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..d15d9999bad6 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -86,6 +86,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> 
>   #endif /* CONFIG_NUMA */
> 
> +struct drmem_lmb;
> +extern int of_drconf_to_nid_single(struct drmem_lmb *);

Hi Scott,

checkpatch.pl reports:

CHECK: extern prototypes should be avoided in .h files
#106: FILE: arch/powerpc/include/asm/topology.h:90:
+extern int of_drconf_to_nid_single(struct drmem_lmb *);

WARNING: function definition argument 'struct drmem_lmb *' should also have an identifier name
#106: FILE: arch/powerpc/include/asm/topology.h:90:
+extern int of_drconf_to_nid_single(struct drmem_lmb *);
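
i.e., something like this should satisfy both complaints (untested
sketch):

	int of_drconf_to_nid_single(struct drmem_lmb *lmb);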

cheers,
Laurent.




* Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
  2020-09-16  7:39 ` David Hildenbrand
@ 2020-09-16 14:39   ` Scott Cheloha
  2020-09-16 14:45     ` David Hildenbrand
  0 siblings, 1 reply; 6+ messages in thread
From: Scott Cheloha @ 2020-09-16 14:39 UTC
  To: David Hildenbrand
  Cc: Nathan Lynch, linuxppc-dev, Michal Suchanek, Laurent Dufour,
	Rick Lindsley

On Wed, Sep 16, 2020 at 09:39:53AM +0200, David Hildenbrand wrote:
> On 15.09.20 21:46, Scott Cheloha wrote:
> > During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> > to determine which node id (nid) to use when later calling __add_memory().
> >
> > [...]
> >
> > Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> 
> 
> Hi Scott,
> 
> IIRC, ppc DLPAR does a single add_memory() [...]

Yes.

> [...] for each LMB (16 MB).

The block size is set by the hypervisor.  The default is 256MB.  In
this test I had a block size of 256MB.

On multi-terabyte machines I would effectively always expect a block
size of 256MB.  16MB blocks are supported, but they are not the
default, so they are increasingly rare.
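
For the curious, the block size comes from the device tree; roughly, a
simplified sketch of pseries_memory_block_size() with the fallback
paths trimmed (not verbatim):

	static unsigned long pseries_memory_block_size(void)
	{
		struct device_node *np;
		u64 block_sz = MIN_MEMORY_BLOCK_SIZE;

		np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
		if (np) {
			/* The hypervisor advertises the LMB size here. */
			const __be64 *prop;

			prop = of_get_property(np, "ibm,lmb-size", NULL);
			if (prop)
				block_sz = be64_to_cpup(prop);
			of_node_put(np);
		}
		return block_sz;
	}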

> With tons of LMBs, this will also make /proc/iomem explode in size (using a
> list-based tree), making traversal significantly slower e.g., on
> insertions and system ram walks.
> 
> I was wondering if you would get another performance boost under ppc
> when using MEMHP_MERGE_RESOURCE [1]. AFAIK, the resource boundaries are
> not of interest. No guarantees, might be worth a try.

I'll give it a shot.

> Did you investigate what else makes memory hotplug that slow? (126000
> LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...)

It was ~31TB in 256MB blocks.  It's a worst-case test (add all
the memory), but I'm pretty happy with a 1.5 hour improvement :)

> Memory block devices might still be a slowdown (although we have an
> xarray in place now that takes care of most pain).

Memory block devices are no longer a hotspot.

Some of the slowdown is in the printk overhead.  We print a log for
every LMB.  It is very silly.  I intend to move those to a debug
priority, which should trivially speed things up.
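
Roughly the change I have in mind (the message text matches the logs
above, though the exact call site may differ):

	-	pr_info("Memory at %llx (drc index %x) was hot-added\n",
	-		lmb->base_addr, lmb->drc_index);
	+	pr_debug("Memory at %llx (drc index %x) was hot-added\n",
	+		 lmb->base_addr, lmb->drc_index);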

Otherwise I need to do more profiling.


* Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
  2020-09-16 14:39   ` Scott Cheloha
@ 2020-09-16 14:45     ` David Hildenbrand
  0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand @ 2020-09-16 14:45 UTC
  To: Scott Cheloha
  Cc: Nathan Lynch, linuxppc-dev, Michal Suchanek, Laurent Dufour,
	Rick Lindsley

On 16.09.20 16:39, Scott Cheloha wrote:
> On Wed, Sep 16, 2020 at 09:39:53AM +0200, David Hildenbrand wrote:
>> On 15.09.20 21:46, Scott Cheloha wrote:
>>> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
>>> to determine which node id (nid) to use when later calling __add_memory().
>>>
>>> [...]
>>>
>>> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
>>
>>
>> Hi Scott,
>>
>> IIRC, ppc DLPAR does a single add_memory() [...]
> 
> Yes.
> 
>> [...] for each LMB (16 MB).
> 
> The block size is set by the hypervisor.  The default is 256MB.  In
> this test I had a block size of 256MB.

Oh, I wasn't aware that it's configurable, thanks for pointing that out
(missed the custom memory_block_size_bytes() implementation).

I wonder how it works with pseries_remove_memblock(), which uses
MIN_MEMORY_BLOCK_SIZE with __remove_memory() - that will always BUG_ON
in try_remove_memory() via BUG_ON(check_hotplug_memory_range(start,
size)) when the size is < memory_block_size_bytes().

Maybe that's not called on such machines ...
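
For reference, the check in question, paraphrased from
check_hotplug_memory_range() in mm/memory_hotplug.c:

	static int check_hotplug_memory_range(u64 start, u64 size)
	{
		/* The range must be aligned to the block size. */
		if (!size ||
		    !IS_ALIGNED(start, memory_block_size_bytes()) ||
		    !IS_ALIGNED(size, memory_block_size_bytes()))
			return -EINVAL;
		return 0;
	}

With a 256MB block size, a 16MB MIN_MEMORY_BLOCK_SIZE range can't pass
the alignment check, so the BUG_ON would trigger if that path ran.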

> 
> On multi-terabyte machines I would effectively always expect a block
> size of 256MB.  16MB blocks are supported, but they are not the
> default, so they are increasingly rare.
>
>> With tons of LMBs, this will also make /proc/iomem explode in size (using a
>> list-based tree), making traversal significantly slower e.g., on
>> insertions and system ram walks.
>>
>> I was wondering if you would get another performance boost under ppc
>> when using MEMHP_MERGE_RESOURCE [1]. AFAIK, the resource boundaries are
>> not of interest. No guarantees, might be worth a try.
> 
> I'll give it a shot.
> 
>> Did you investigate what else makes memory hotplug that slow? (126000
>> LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...)
> 
> It was ~31TB in 256MB blocks.  It's a worst-case test (add all
> the memory), but I'm pretty happy with a 1.5 hour improvement :)

Yeah, definitely :)


-- 
Thanks,

David / dhildenb



* Re: [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
  2020-09-15 19:46 [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup Scott Cheloha
  2020-09-16  7:39 ` David Hildenbrand
  2020-09-16  8:22 ` Laurent Dufour
@ 2020-10-07  3:21 ` Michael Ellerman
  2 siblings, 0 replies; 6+ messages in thread
From: Michael Ellerman @ 2020-10-07  3:21 UTC
  To: Scott Cheloha, linuxppc-dev
  Cc: Laurent Dufour, Michal Suchanek, Rick Lindsley, Nathan Lynch,
	David Hildenbrand

On Tue, 15 Sep 2020 14:46:47 -0500, Scott Cheloha wrote:
> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> to determine which node id (nid) to use when later calling __add_memory().
> 
> This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
> appropriate nid for a given address by looking up the LMB containing the
> address and then passing that LMB to of_drconf_to_nid_single() to get the
> nid.  In dlpar_add_lmb() we get this address from the LMB itself.
> 
> [...]

Applied to powerpc/next.

[1/1] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
      https://git.kernel.org/powerpc/c/72cdd117c449896c707fc6cfe5b90978160697d0

cheers

