linux-kernel.vger.kernel.org archive mirror
* [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
@ 2020-01-28 22:11 Scott Cheloha
  2020-01-28 23:56 ` Nathan Lynch
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Scott Cheloha @ 2020-01-28 22:11 UTC (permalink / raw)
  To: linux-kernel, Michael Ellerman
  Cc: Nathan Fontenot, Nathan Lynch, Rick Lindsley, linuxppc-dev

LMB lookup is currently an O(n) linear search.  This scales poorly when
there are many LMBs.

If we cache each LMB by both its base address and its DRC index
in an xarray we can cut lookups to O(log n), greatly accelerating
drmem initialization and memory hotplug.

This patch introduces two xarrays of LMBs and fills them during
drmem initialization.  The patch also adds two interfaces for LMB
lookup.

The first interface, drmem_find_lmb_by_base_addr(), is employed in
hot_add_drconf_scn_to_nid() to replace a linear search.  This speeds up
memory_add_physaddr_to_nid(), which is called by lmb_set_nid(), an
interface used during drmem initialization and memory hotplug.

The second interface, drmem_find_lmb_by_drc_index(), is employed in
get_lmb_range() to replace a linear search.  This speeds up
dlpar_memory_add_by_ic() and dlpar_memory_remove_by_ic(), interfaces
used during memory hotplug.

These substitutions yield significant improvements:

1. A POWER9 VM with a maximum memory of 10TB and 256MB LMBs has
   40960 LMBs.  With this patch it completes drmem_init() ~1138ms
   faster.

Before:
[    0.542244] drmem: initializing drmem v1
[    1.768787] drmem: initialized 40960 LMBs

After:
[    0.543611] drmem: initializing drmem v1
[    0.631386] drmem: initialized 40960 LMBs

2. A POWER9 VM with a maximum memory of 4TB and 256MB LMBs has
   16384 LMBs.  Via the qemu monitor we can hot-add memory as
   virtual DIMMs; example monitor commands are shown after the
   log excerpts below.  Each DIMM is 256 LMBs.  With this patch
   we hot-add every possible LMB about 60 seconds faster.

Before:
[   17.422177] pseries-hotplug-mem: Attempting to hot-add 256 LMB(s) at index 80000100
[...]
[  167.285563] pseries-hotplug-mem: Memory at 3fff0000000 (drc index 80003fff) was hot-added

After:
[   14.753480] pseries-hotplug-mem: Attempting to hot-add 256 LMB(s) at index 80000100
[...]
[  103.934092] pseries-hotplug-mem: Memory at 3fff0000000 (drc index 80003fff) was hot-added
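
For reference, each hot-add step uses qemu monitor commands roughly
like the following (a sketch: the ids are arbitrary, the guest must
have been started with free memory slots and maxmem headroom, and one
64G DIMM covers 256 LMBs at 256MB each):

(qemu) object_add memory-backend-ram,id=mem1,size=64G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1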

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
These linear searches become a serious bottleneck as the machine
approaches 64TB.  There are just too many LMBs to use a linear
search.

On a 60TB machine we recently saw the following soft lockup during
drmem_init():

[   60.602386] watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [swapper/0:1]
[   60.602414] Modules linked in:
[   60.602417] Supported: No, Unreleased kernel
[   60.602423] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 5.3.18-2-default #1 SLE15-SP2 (unreleased)
[   60.602426] NIP:  c000000000095c0c LR: c000000000095bb0 CTR: 0000000000000000
[   60.602430] REGS: c00022c7fc497830 TRAP: 0901   Not tainted  (5.3.18-2-default)
[   60.602432] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000244  XER: 00000000
[   60.602442] CFAR: c000000000095c18 IRQMASK: 0 
               GPR00: c000000000095bb0 c00022c7fc497ac0 c00000000162cc00 c0003bffffff5e08 
               GPR04: 0000000000000000 c000000000ea539a 000000000000002f 0000000010000000 
               GPR08: c00007fc5f59ffb8 000014bc50000000 c00007fc5f1f1a30 c0000000014f2fb8 
               GPR12: 0000000000000000 c00000001e980600 
[   60.602464] NIP [c000000000095c0c] hot_add_scn_to_nid+0xbc/0x400
[   60.602467] LR [c000000000095bb0] hot_add_scn_to_nid+0x60/0x400
[   60.602470] Call Trace:
[   60.602473] [c00022c7fc497ac0] [c000000000095bb0] hot_add_scn_to_nid+0x60/0x400 (unreliable)
[   60.602478] [c00022c7fc497b20] [c00000000007a6a0] memory_add_physaddr_to_nid+0x20/0x60
[   60.602483] [c00022c7fc497b40] [c0000000010235a4] drmem_init+0x258/0x2d8
[   60.602485] [c00022c7fc497c10] [c000000000010694] do_one_initcall+0x64/0x300
[   60.602489] [c00022c7fc497ce0] [c0000000010144f8] kernel_init_freeable+0x2e8/0x3fc
[   60.602491] [c00022c7fc497db0] [c000000000010b0c] kernel_init+0x2c/0x160
[   60.602497] [c00022c7fc497e20] [c00000000000b960] ret_from_kernel_thread+0x5c/0x7c
[   60.602498] Instruction dump:
[   60.602501] 7d0a4214 7faa4040 419d0328 e92a0010 71290088 2fa90008 409e001c e92a0000 
[   60.602506] 7fbe4840 419c0010 7d274a14 7fbe4840 <419c00e4> 394a0018 7faa4040 409dffd0 

This patch should eliminate the drmem_init() bottleneck during boot.

One other important thing to note is that this only addresses part of
the slowdown during memory hotplug when there are many LMBs.  A far larger
part of it is caused by the linear memblock search in find_memory_block().
That problem is addressed with this patch:

https://lore.kernel.org/lkml/20200121231028.13699-1-cheloha@linux.ibm.com/

which is in linux-next.  The hotplug timing numbers I quote in the
commit message above were taken with that patch applied.

Without it, hotplug is even slower.

 arch/powerpc/include/asm/drmem.h              |  3 ++
 arch/powerpc/mm/drmem.c                       | 33 +++++++++++++++++++
 arch/powerpc/mm/numa.c                        | 30 +++++++----------
 .../platforms/pseries/hotplug-memory.c        | 11 ++-----
 4 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 3d76e1c388c2..a37cbe794cdd 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -88,6 +88,9 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 	return lmb->flags & DRMEM_LMB_RESERVED;
 }
 
+struct drmem_lmb *drmem_find_lmb_by_base_addr(unsigned long);
+struct drmem_lmb *drmem_find_lmb_by_drc_index(unsigned long);
+
 u64 drmem_lmb_memory_max(void);
 void __init walk_drmem_lmbs(struct device_node *dn,
 			void (*func)(struct drmem_lmb *, const __be32 **));
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 557d9080604d..7c464b0a256e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -11,9 +11,12 @@
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/memblock.h>
+#include <linux/xarray.h>
 #include <asm/prom.h>
 #include <asm/drmem.h>
 
+static DEFINE_XARRAY(drmem_lmb_base_addr);
+static DEFINE_XARRAY(drmem_lmb_drc_index);
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -25,6 +28,31 @@ u64 drmem_lmb_memory_max(void)
 	return last_lmb->base_addr + drmem_lmb_size();
 }
 
+struct drmem_lmb *drmem_find_lmb_by_base_addr(unsigned long base_addr)
+{
+	return xa_load(&drmem_lmb_base_addr, base_addr);
+}
+
+struct drmem_lmb *drmem_find_lmb_by_drc_index(unsigned long drc_index)
+{
+	return xa_load(&drmem_lmb_drc_index, drc_index);
+}
+
+static int drmem_lmb_cache_for_lookup(struct drmem_lmb *lmb)
+{
+	void *ret;
+
+	ret = xa_store(&drmem_lmb_base_addr, lmb->base_addr, lmb, GFP_KERNEL);
+	if (xa_err(ret))
+		return xa_err(ret);
+
+	ret = xa_store(&drmem_lmb_drc_index, lmb->drc_index, lmb, GFP_KERNEL);
+	if (xa_err(ret))
+		return xa_err(ret);
+
+	return 0;
+}
+
 static u32 drmem_lmb_flags(struct drmem_lmb *lmb)
 {
 	/*
@@ -364,6 +392,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
 
 	for_each_drmem_lmb(lmb) {
 		read_drconf_v1_cell(lmb, &prop);
+		if (drmem_lmb_cache_for_lookup(lmb) != 0)
+			return;
 		lmb_set_nid(lmb);
 	}
 }
@@ -411,6 +441,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
 			lmb->aa_index = dr_cell.aa_index;
 			lmb->flags = dr_cell.flags;
 
+			if (drmem_lmb_cache_for_lookup(lmb) != 0)
+				return;
+
 			lmb_set_nid(lmb);
 		}
 	}
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..23684d44549f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,27 +958,21 @@ early_param("topology_updates", early_topology_updates);
 static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
 {
 	struct drmem_lmb *lmb;
-	unsigned long lmb_size;
-	int nid = NUMA_NO_NODE;
-
-	lmb_size = drmem_lmb_size();
-
-	for_each_drmem_lmb(lmb) {
-		/* skip this block if it is reserved or not assigned to
-		 * this partition */
-		if ((lmb->flags & DRCONF_MEM_RESERVED)
-		    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
-			continue;
 
-		if ((scn_addr < lmb->base_addr)
-		    || (scn_addr >= (lmb->base_addr + lmb_size)))
-			continue;
+	lmb = drmem_find_lmb_by_base_addr(scn_addr);
+	if (lmb == NULL)
+		return NUMA_NO_NODE;
 
-		nid = of_drconf_to_nid_single(lmb);
-		break;
-	}
+	/*
+	 * We can't use it if it is reserved or not assigned to
+	 * this partition.
+	 */
+	if (lmb->flags & DRCONF_MEM_RESERVED)
+		return NUMA_NO_NODE;
+	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+		return NUMA_NO_NODE;
 
-	return nid;
+	return of_drconf_to_nid_single(lmb);
 }
 
 /*
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94d1943..29bd19831a9a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -222,17 +222,10 @@ static int get_lmb_range(u32 drc_index, int n_lmbs,
 			 struct drmem_lmb **start_lmb,
 			 struct drmem_lmb **end_lmb)
 {
-	struct drmem_lmb *lmb, *start, *end;
+	struct drmem_lmb *start, *end;
 	struct drmem_lmb *last_lmb;
 
-	start = NULL;
-	for_each_drmem_lmb(lmb) {
-		if (lmb->drc_index == drc_index) {
-			start = lmb;
-			break;
-		}
-	}
-
+	start = drmem_find_lmb_by_drc_index(drc_index);
 	if (!start)
 		return -EINVAL;
 
-- 
2.24.1



* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
@ 2020-01-28 23:56 ` Nathan Lynch
  2020-01-29 18:10   ` Scott Cheloha
  2020-02-04 16:19   ` Scott Cheloha
  2020-02-21 17:29 ` pseries: accelerate drmem and simplify hotplug with xarrays Scott Cheloha
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Nathan Lynch @ 2020-01-28 23:56 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: Nathan Fontenot, Rick Lindsley, linuxppc-dev, linux-kernel,
	Michael Ellerman

Scott Cheloha <cheloha@linux.ibm.com> writes:
> LMB lookup is currently an O(n) linear search.  This scales poorly when
> there are many LMBs.
>
> If we cache each LMB by both its base address and its DRC index
> in an xarray we can cut lookups to O(log n), greatly accelerating
> drmem initialization and memory hotplug.
>
> This patch introduces two xarrays of LMBs and fills them during
> drmem initialization.  The patch also adds two interfaces for LMB
> lookup.

Good but can you replace the array of LMBs altogether
(drmem_info->lmbs)? xarray allows iteration over the members if needed.
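
For illustration, iterating the members straight out of an xarray
might look like this (an untested sketch against the
drmem_lmb_drc_index xarray your patch adds):

	unsigned long index;
	struct drmem_lmb *lmb;

	xa_for_each(&drmem_lmb_drc_index, index, lmb)
		pr_debug("lmb at %llx, drc index %lx\n", lmb->base_addr, index);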


* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-28 23:56 ` Nathan Lynch
@ 2020-01-29 18:10   ` Scott Cheloha
  2020-01-30 16:09     ` Fontenot, Nathan
  2020-02-04 16:19   ` Scott Cheloha
  1 sibling, 1 reply; 15+ messages in thread
From: Scott Cheloha @ 2020-01-29 18:10 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Nathan Fontenot, Rick Lindsley, linuxppc-dev, linux-kernel,
	Michael Ellerman

On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
> Scott Cheloha <cheloha@linux.ibm.com> writes:
> > LMB lookup is currently an O(n) linear search.  This scales poorly when
> > there are many LMBs.
> >
> > If we cache each LMB by both its base address and its DRC index
> > in an xarray we can cut lookups to O(log n), greatly accelerating
> > drmem initialization and memory hotplug.
> >
> > This patch introduces two xarrays of LMBs and fills them during
> > drmem initialization.  The patch also adds two interfaces for LMB
> > lookup.
> 
> Good but can you replace the array of LMBs altogether
> (drmem_info->lmbs)? xarray allows iteration over the members if needed.

I don't think we can without potentially changing the current behavior.

The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance
linearly through the array from the LMB with the matching DRC index.

Iteration through the xarray via xa_for_each_start() will return LMBs
indexed with monotonically increasing DRC indices.

Are they equivalent?  Or can we have an LMB with a smaller DRC index
appear at a greater offset in the array?

If the following condition is possible:

	drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index

where i < j, then we have a possible behavior change because
xa_for_each_start() may not return a contiguous array slice.  It might
"leap backwards" in the array.  Or it might skip over a chunk of LMBs.

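For example (hypothetical values, not from any real device tree), if
we had:

	drmem_info->lmbs[0].drc_index = 0x80000002
	drmem_info->lmbs[1].drc_index = 0x80000001

then for_each_drmem_lmb() would visit 0x80000002 first, while

	xa_for_each_start(&drmem_lmb_drc_index, index, lmb, 0x80000001)

would visit 0x80000001 first: the xarray walks in index order, not in
array order.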

* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-29 18:10   ` Scott Cheloha
@ 2020-01-30 16:09     ` Fontenot, Nathan
  2020-02-03 20:13       ` Scott Cheloha
  0 siblings, 1 reply; 15+ messages in thread
From: Fontenot, Nathan @ 2020-01-30 16:09 UTC (permalink / raw)
  To: Scott Cheloha, Nathan Lynch
  Cc: Rick Lindsley, linuxppc-dev, linux-kernel, Michael Ellerman

On 1/29/2020 12:10 PM, Scott Cheloha wrote:
> On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
>> Scott Cheloha <cheloha@linux.ibm.com> writes:
>>> LMB lookup is currently an O(n) linear search.  This scales poorly when
>>> there are many LMBs.
>>>
>>> If we cache each LMB by both its base address and its DRC index
>>> in an xarray we can cut lookups to O(log n), greatly accelerating
>>> drmem initialization and memory hotplug.
>>>
>>> This patch introduces two xarrays of LMBs and fills them during
>>> drmem initialization.  The patch also adds two interfaces for LMB
>>> lookup.
>>
>> Good but can you replace the array of LMBs altogether
>> (drmem_info->lmbs)? xarray allows iteration over the members if needed.
> 
> I don't think we can without potentially changing the current behavior.
> 
> The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance
> linearly through the array from the LMB with the matching DRC index.
> 
> Iteration through the xarray via xa_for_each_start() will return LMBs
> indexed with monotonically increasing DRC indices.
>
> Are they equivalent?  Or can we have an LMB with a smaller DRC index
> appear at a greater offset in the array?
> 
> If the following condition is possible:
> 
> 	drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index
> 
> where i < j, then we have a possible behavior change because
> xa_for_each_start() may not return a contiguous array slice.  It might
> "leap backwards" in the array.  Or it might skip over a chunk of LMBs.
> 

The LMB array should list the LMBs in monotonically increasing DRC index
order. Note that this is set up based on the DT property, but I don't recall
ever seeing the DT specify LMBs out of order or non-contiguous.

I am not very familiar with xarrays but it appears this should be possible.

-Nathan


* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-30 16:09     ` Fontenot, Nathan
@ 2020-02-03 20:13       ` Scott Cheloha
  2020-02-05 14:33         ` Fontenot, Nathan
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Cheloha @ 2020-02-03 20:13 UTC (permalink / raw)
  To: Fontenot, Nathan, Nathan Lynch
  Cc: Rick Lindsley, linuxppc-dev, linux-kernel, Michael Ellerman

On Thu, Jan 30, 2020 at 10:09:32AM -0600, Fontenot, Nathan wrote:
> On 1/29/2020 12:10 PM, Scott Cheloha wrote:
> > On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
> >> Scott Cheloha <cheloha@linux.ibm.com> writes:
> >>> LMB lookup is currently an O(n) linear search.  This scales poorly when
> >>> there are many LMBs.
> >>>
> >>> If we cache each LMB by both its base address and its DRC index
> >>> in an xarray we can cut lookups to O(log n), greatly accelerating
> >>> drmem initialization and memory hotplug.
> >>>
> >>> This patch introduces two xarrays of LMBs and fills them during
> >>> drmem initialization.  The patch also adds two interfaces for LMB
> >>> lookup.
> >>
> >> Good but can you replace the array of LMBs altogether
> >> (drmem_info->lmbs)? xarray allows iteration over the members if needed.
> > 
> > I don't think we can without potentially changing the current behavior.
> > 
> > The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance
> > linearly through the array from the LMB with the matching DRC index.
> > 
> > Iteration through the xarray via xa_for_each_start() will return LMBs
> > indexed with monotonically increasing DRC indices.
> >
> > Are they equivalent?  Or can we have an LMB with a smaller DRC index
> > appear at a greater offset in the array?
> > 
> > If the following condition is possible:
> > 
> > 	drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index
> > 
> > where i < j, then we have a possible behavior change because
> > xa_for_each_start() may not return a contiguous array slice.  It might
> > "leap backwards" in the array.  Or it might skip over a chunk of LMBs.
> > 
> 
> The LMB array should list the LMBs in monotonically increasing DRC index
> order. Note that this is set up based on the DT property, but I don't recall
> ever seeing the DT specify LMBs out of order or non-contiguous.

Is that ordering guaranteed by the PAPR or some other spec or is that
just a convention?

Code like drmem_update_dt_v1() makes me very nervous:

static int drmem_update_dt_v1(struct device_node *memory,
                              struct property *prop)
{
        struct property *new_prop;
        struct of_drconf_cell_v1 *dr_cell;
        struct drmem_lmb *lmb;
        u32 *p;

        new_prop = clone_property(prop, prop->length);
        if (!new_prop)
                return -1;

        p = new_prop->value;
        *p++ = cpu_to_be32(drmem_info->n_lmbs);

        dr_cell = (struct of_drconf_cell_v1 *)p;

        for_each_drmem_lmb(lmb) {
                dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
                dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
                dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
                dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));

                dr_cell++;
        }

        of_update_property(memory, new_prop);
        return 0;
}

If for whatever reason the firmware has DRC indexes that aren't
monotonically increasing and we update a firmware property at the wrong
offset, I have no idea what would happen.

With the array we preserve the order.  Without it we might violate
some assumption the firmware has made.


* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-28 23:56 ` Nathan Lynch
  2020-01-29 18:10   ` Scott Cheloha
@ 2020-02-04 16:19   ` Scott Cheloha
  1 sibling, 0 replies; 15+ messages in thread
From: Scott Cheloha @ 2020-02-04 16:19 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Nathan Fontenot, Rick Lindsley, linuxppc-dev, linux-kernel,
	Michael Ellerman

On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
> Scott Cheloha <cheloha@linux.ibm.com> writes:
> > LMB lookup is currently an O(n) linear search.  This scales poorly when
> > there are many LMBs.
> >
> > If we cache each LMB by both its base address and its DRC index
> > in an xarray we can cut lookups to O(log n), greatly accelerating
> > drmem initialization and memory hotplug.
> >
> > This patch introduces two xarrays of LMBs and fills them during
> > drmem initialization.  The patch also adds two interfaces for LMB
> > lookup.
> 
> Good but can you replace the array of LMBs altogether
> (drmem_info->lmbs)? xarray allows iteration over the members if needed.

I would like to try to "solve one problem at a time".

We can fix the linear search performance scaling problems without
removing the array of LMBs.  As I've shown in my diff, we can do it
with minimal change to the existing code.

If it turns out that the PAPR guarantees the ordering of the memory
DRCs then in a subsequent patch (series) we can replace the LMB array
(__drmem_info.lmbs) with an xarray indexed by DRC and use e.g.
xa_for_each() in the hotplug code.


* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-02-03 20:13       ` Scott Cheloha
@ 2020-02-05 14:33         ` Fontenot, Nathan
  0 siblings, 0 replies; 15+ messages in thread
From: Fontenot, Nathan @ 2020-02-05 14:33 UTC (permalink / raw)
  To: Scott Cheloha, Nathan Lynch
  Cc: Rick Lindsley, linuxppc-dev, linux-kernel, Michael Ellerman

On 2/3/2020 2:13 PM, Scott Cheloha wrote:
> On Thu, Jan 30, 2020 at 10:09:32AM -0600, Fontenot, Nathan wrote:
>> On 1/29/2020 12:10 PM, Scott Cheloha wrote:
>>> On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
>>>> Scott Cheloha <cheloha@linux.ibm.com> writes:
>>>>> LMB lookup is currently an O(n) linear search.  This scales poorly when
>>>>> there are many LMBs.
>>>>>
>>>>> If we cache each LMB by both its base address and its DRC index
>>>>> in an xarray we can cut lookups to O(log n), greatly accelerating
>>>>> drmem initialization and memory hotplug.
>>>>>
>>>>> This patch introduces two xarrays of LMBs and fills them during
>>>>> drmem initialization.  The patch also adds two interfaces for LMB
>>>>> lookup.
>>>>
>>>> Good but can you replace the array of LMBs altogether
>>>> (drmem_info->lmbs)? xarray allows iteration over the members if needed.
>>>
>>> I don't think we can without potentially changing the current behavior.
>>>
>>> The current behavior in dlpar_memory_{add,remove}_by_ic() is to advance
>>> linearly through the array from the LMB with the matching DRC index.
>>>
>>> Iteration through the xarray via xa_for_each_start() will return LMBs
>>> indexed with monotonically increasing DRC indices.
>>>
>>> Are they equivalent?  Or can we have an LMB with a smaller DRC index
>>> appear at a greater offset in the array?
>>>
>>> If the following condition is possible:
>>>
>>> 	drmem_info->lmbs[i].drc_index > drmem_info->lmbs[j].drc_index
>>>
>>> where i < j, then we have a possible behavior change because
>>> xa_for_each_start() may not return a contiguous array slice.  It might
>>> "leap backwards" in the array.  Or it might skip over a chunk of LMBs.
>>>
>>
>> The LMB array should list the LMBs in monotonically increasing DRC index
>> order. Note that this is set up based on the DT property, but I don't recall
>> ever seeing the DT specify LMBs out of order or non-contiguous.
> 
> Is that ordering guaranteed by the PAPR or some other spec or is that
> just a convention?

From what I remember the PAPR does not specify that DRC indexes are guaranteed
to be contiguous. In past discussions with pHyp developers I was told that
they always generate contiguous DRC index values, but without a specification
in the PAPR that could always break.

-Nathan

> 
> Code like drmem_update_dt_v1() makes me very nervous:
> 
> static int drmem_update_dt_v1(struct device_node *memory,
>                               struct property *prop)
> {
>         struct property *new_prop;
>         struct of_drconf_cell_v1 *dr_cell;
>         struct drmem_lmb *lmb;
>         u32 *p;
> 
>         new_prop = clone_property(prop, prop->length);
>         if (!new_prop)
>                 return -1;
> 
>         p = new_prop->value;
>         *p++ = cpu_to_be32(drmem_info->n_lmbs);
> 
>         dr_cell = (struct of_drconf_cell_v1 *)p;
> 
>         for_each_drmem_lmb(lmb) {
>                 dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>                 dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>                 dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>                 dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
> 
>                 dr_cell++;
>         }
> 
>         of_update_property(memory, new_prop);
>         return 0;
> }
> 
> If for whatever reason the firmware has DRC indexes that aren't
> monotonically increasing and we update a firmware property at the wrong
> offset, I have no idea what would happen.
> 
> With the array we preserve the order.  Without it we might violate
> some assumption the firmware has made.
> 


* pseries: accelerate drmem and simplify hotplug with xarrays
  2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
  2020-01-28 23:56 ` Nathan Lynch
@ 2020-02-21 17:29 ` Scott Cheloha
  2020-02-21 17:29 ` [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray Scott Cheloha
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Scott Cheloha @ 2020-02-21 17:29 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
  Cc: Nathan Lynch, Rick Lindsley, David Hildenbrand, Michal Suchanek,
	Michal Hocko, Nathan Fontenot, linux-kernel

This patch series introduces two xarrays of LMBs in an effort to speed
up the drmem code and simplify the hotplug code on pseries machines.

The first patch introduces an xarray of LMBs indexed by physical
address.  xa_load() is then used to accelerate LMB lookup during
memory_add_physaddr_to_nid().  The interface is used during boot,
in drmem_init(), and during memory hot-add, in dlpar_add_lmb(), on
pseries machines.  Its linear LMB search is a serious bottleneck on
larger machines, one that xa_load() flattens nicely, as shown in the
before/after example.

The second patch introduces a second xarray of LMBs, this one
indexed by DRC index.  The xarray API is leveraged to replace
the custom LMB search/iteration/marking code currently used in
the pseries memory hotplug module.  The result is cleaner and,
again thanks to xa_load(), faster.

v1: One big patch.

v2 changes:
- Split up the big patch from v1 into a series.
- Provide a more dramatic example in patch 1/2 to emphasize the
  linear search bottleneck in memory_add_physaddr_to_nid().
- Expand the use of the xarray API in patch 2/2 to replace more
  custom code.



* [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray
  2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
  2020-01-28 23:56 ` Nathan Lynch
  2020-02-21 17:29 ` pseries: accelerate drmem and simplify hotplug with xarrays Scott Cheloha
@ 2020-02-21 17:29 ` Scott Cheloha
  2020-02-21 20:02   ` Nathan Lynch
  2020-07-22 23:00   ` Anton Blanchard
  2020-02-21 17:29 ` [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code Scott Cheloha
  2020-02-21 18:11 ` [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Nathan Lynch
  4 siblings, 2 replies; 15+ messages in thread
From: Scott Cheloha @ 2020-02-21 17:29 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
  Cc: Nathan Lynch, Rick Lindsley, David Hildenbrand, Michal Suchanek,
	Michal Hocko, Nathan Fontenot, linux-kernel

On PowerPC, memory_add_physaddr_to_nid() uses a linear search to find
an LMB matching the given address.  This scales very poorly when there
are many LMBs.  The poor scaling cripples drmem_init() during boot:
lmb_set_nid(), which calls memory_add_physaddr_to_nid(), is called for
each LMB.

If we index each LMB in an xarray by its base address we can achieve
O(log n) search during memory_add_physaddr_to_nid(), which scales much
better.

For example, in the lab we have a 64TB P9 machine with 256MB LMBs.
So, during drmem_init() we instantiate 249854 LMBs.  On a vanilla
kernel it completes drmem_init() in ~35 seconds with a soft lockup
trace.  On the patched kernel it completes drmem_init() in ~0.5
seconds.

Before:
[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
[   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
[   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
[   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
[   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
[   80.604431] GPR12: 0000000000000000 c00000001e8ec200
[   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
[   89.047390] drmem: 249854 LMB(s)

After:
[   53.424702] drmem: initializing drmem v2
[   53.898813] drmem: 249854 LMB(s)

lmb_set_nid() is called from dlpar_add_lmb(), so this patch will also
improve memory hot-add speeds on big machines.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h |  1 +
 arch/powerpc/mm/drmem.c          | 24 ++++++++++++++++++++++++
 arch/powerpc/mm/numa.c           | 29 ++++++++++-------------------
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 3d76e1c388c2..90a5a9ad872b 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -88,6 +88,7 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 	return lmb->flags & DRMEM_LMB_RESERVED;
 }
 
+struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr);
 u64 drmem_lmb_memory_max(void);
 void __init walk_drmem_lmbs(struct device_node *dn,
 			void (*func)(struct drmem_lmb *, const __be32 **));
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 44bfbdae920c..62cbe79e3860 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -11,12 +11,31 @@
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/memblock.h>
+#include <linux/xarray.h>
 #include <asm/prom.h>
 #include <asm/drmem.h>
 
+static DEFINE_XARRAY(drmem_lmb_xa_base_addr);
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
+static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb)
+{
+	void *ret;
+
+	ret = xa_store(&drmem_lmb_xa_base_addr, lmb->base_addr, lmb,
+		       GFP_KERNEL);
+	if (xa_is_err(ret))
+		return xa_err(ret);
+
+	return 0;
+}
+
+struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr)
+{
+	return xa_load(&drmem_lmb_xa_base_addr, base_addr);
+}
+
 u64 drmem_lmb_memory_max(void)
 {
 	struct drmem_lmb *last_lmb;
@@ -364,6 +383,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
 
 	for_each_drmem_lmb(lmb) {
 		read_drconf_v1_cell(lmb, &prop);
+		if (drmem_cache_lmb_for_lookup(lmb) != 0)
+			return;
 		lmb_set_nid(lmb);
 	}
 }
@@ -411,6 +432,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
 			lmb->aa_index = dr_cell.aa_index;
 			lmb->flags = dr_cell.flags;
 
+			if (drmem_cache_lmb_for_lookup(lmb) != 0)
+				return;
+
 			lmb_set_nid(lmb);
 		}
 	}
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3c7dec70cda0..0fd7963a991e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,27 +958,18 @@ early_param("topology_updates", early_topology_updates);
 static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
 {
 	struct drmem_lmb *lmb;
-	unsigned long lmb_size;
-	int nid = NUMA_NO_NODE;
-
-	lmb_size = drmem_lmb_size();
-
-	for_each_drmem_lmb(lmb) {
-		/* skip this block if it is reserved or not assigned to
-		 * this partition */
-		if ((lmb->flags & DRCONF_MEM_RESERVED)
-		    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
-			continue;
-
-		if ((scn_addr < lmb->base_addr)
-		    || (scn_addr >= (lmb->base_addr + lmb_size)))
-			continue;
 
-		nid = of_drconf_to_nid_single(lmb);
-		break;
-	}
+	lmb = drmem_find_lmb_by_base_addr(scn_addr);
+	if (lmb == NULL)
+		return NUMA_NO_NODE;
+
+	/* can't use this block if it is reserved or not assigned to
+	 * this partition */
+	if ((lmb->flags & DRCONF_MEM_RESERVED)
+	    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
+		return NUMA_NO_NODE;
 
-	return nid;
+	return of_drconf_to_nid_single(lmb);
 }
 
 /*
-- 
2.24.1



* [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code
  2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
                   ` (2 preceding siblings ...)
  2020-02-21 17:29 ` [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray Scott Cheloha
@ 2020-02-21 17:29 ` Scott Cheloha
  2020-02-21 20:03   ` Nathan Lynch
  2020-02-21 18:11 ` [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Nathan Lynch
  4 siblings, 1 reply; 15+ messages in thread
From: Scott Cheloha @ 2020-02-21 17:29 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
  Cc: Nathan Lynch, Rick Lindsley, David Hildenbrand, Michal Suchanek,
	Michal Hocko, Nathan Fontenot, linux-kernel

The xarray API has entry marking (xa_set_mark/xa_clear_mark) and a
range-based iterator (xa_for_each_range), so there is no need for the
pseries hotplug code to maintain its own implementation of these
features.

This patch introduces an xarray of drmem_lmb structures indexed by
each LMB's DRC index (drmem_lmb.drc_index).  The xarray is protected
by the hotplug lock.  LMBs are inserted into the xarray during
drmem_init() and looked up during hotplug operations.

Custom LMB search, iteration, and marking code is replaced with xarray
equivalents.  The result is more compact.  The code ought to run
faster, too: several linear searches have been replaced with xa_load(),
which runs in sub-linear time.
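
For example, the mark-then-commit-or-roll-back pattern used throughout
the module condenses to a sketch like this (counters and error
handling elided):

	/* Tentatively remove LMBs, remembering each one we removed. */
	xa_for_each(drmem_lmb_xa, index, lmb) {
		if (dlpar_remove_lmb(lmb))
			continue;
		xa_set_mark(drmem_lmb_xa, index, XA_MARK_0);
	}

	/* On failure, revisit only the marked LMBs to add them back. */
	xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
		dlpar_add_lmb(lmb);
		xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
	}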

The array of LMBs, drmem_info.lmbs[], is kept so that drmem_update_dt()
can write the LMBs back to the firmware in the same order they were
read in drmem_init().

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
 arch/powerpc/include/asm/drmem.h              |  26 +--
 arch/powerpc/mm/drmem.c                       |  38 ++--
 .../platforms/pseries/hotplug-memory.c        | 207 ++++++------------
 3 files changed, 92 insertions(+), 179 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 90a5a9ad872b..97e07eec7eda 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -26,13 +26,8 @@ struct drmem_lmb_info {
 
 extern struct drmem_lmb_info *drmem_info;
 
-#define for_each_drmem_lmb_in_range(lmb, start, end)		\
-	for ((lmb) = (start); (lmb) <= (end); (lmb)++)
-
-#define for_each_drmem_lmb(lmb)					\
-	for_each_drmem_lmb_in_range((lmb),			\
-		&drmem_info->lmbs[0],				\
-		&drmem_info->lmbs[drmem_info->n_lmbs - 1])
+struct xarray;
+extern struct xarray *drmem_lmb_xa;
 
 /*
  * The of_drconf_cell_v1 struct defines the layout of the LMB data
@@ -71,23 +66,6 @@ static inline u32 drmem_lmb_size(void)
 	return drmem_info->lmb_size;
 }
 
-#define DRMEM_LMB_RESERVED	0x80000000
-
-static inline void drmem_mark_lmb_reserved(struct drmem_lmb *lmb)
-{
-	lmb->flags |= DRMEM_LMB_RESERVED;
-}
-
-static inline void drmem_remove_lmb_reservation(struct drmem_lmb *lmb)
-{
-	lmb->flags &= ~DRMEM_LMB_RESERVED;
-}
-
-static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
-{
-	return lmb->flags & DRMEM_LMB_RESERVED;
-}
-
 struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr);
 u64 drmem_lmb_memory_max(void);
 void __init walk_drmem_lmbs(struct device_node *dn,
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 62cbe79e3860..013ab2689bd8 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -16,6 +16,8 @@
 #include <asm/drmem.h>
 
 static DEFINE_XARRAY(drmem_lmb_xa_base_addr);
+static DEFINE_XARRAY(drmem_lmb_xa_drc_index);
+struct xarray *drmem_lmb_xa = &drmem_lmb_xa_drc_index;
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -27,6 +29,10 @@ static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb)
 		       GFP_KERNEL);
 	if (xa_is_err(ret))
 		return xa_err(ret);
+	ret = xa_store(&drmem_lmb_xa_drc_index, lmb->drc_index, lmb,
+		       GFP_KERNEL);
+	if (xa_is_err(ret))
+		return xa_err(ret);
 
 	return 0;
 }
@@ -44,15 +50,6 @@ u64 drmem_lmb_memory_max(void)
 	return last_lmb->base_addr + drmem_lmb_size();
 }
 
-static u32 drmem_lmb_flags(struct drmem_lmb *lmb)
-{
-	/*
-	 * Return the value of the lmb flags field minus the reserved
-	 * bit used internally for hotplug processing.
-	 */
-	return lmb->flags & ~DRMEM_LMB_RESERVED;
-}
-
 static struct property *clone_property(struct property *prop, u32 prop_sz)
 {
 	struct property *new_prop;
@@ -84,6 +81,7 @@ static int drmem_update_dt_v1(struct device_node *memory,
 	struct of_drconf_cell_v1 *dr_cell;
 	struct drmem_lmb *lmb;
 	u32 *p;
+	int i;
 
 	new_prop = clone_property(prop, prop->length);
 	if (!new_prop)
@@ -94,11 +92,12 @@ static int drmem_update_dt_v1(struct device_node *memory,
 
 	dr_cell = (struct of_drconf_cell_v1 *)p;
 
-	for_each_drmem_lmb(lmb) {
+	for (i = 0; i < drmem_info->n_lmbs; i++) {
+		lmb = &drmem_info->lmbs[i];
 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
-		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
+		dr_cell->flags = cpu_to_be32(lmb->flags);
 
 		dr_cell++;
 	}
@@ -113,7 +112,7 @@ static void init_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
 	dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
 	dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
 	dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
-	dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
+	dr_cell->flags = cpu_to_be32(lmb->flags);
 }
 
 static int drmem_update_dt_v2(struct device_node *memory,
@@ -124,11 +123,13 @@ static int drmem_update_dt_v2(struct device_node *memory,
 	struct drmem_lmb *lmb, *prev_lmb;
 	u32 lmb_sets, prop_sz, seq_lmbs;
 	u32 *p;
+	int i;
 
 	/* First pass, determine how many LMB sets are needed. */
 	lmb_sets = 0;
 	prev_lmb = NULL;
-	for_each_drmem_lmb(lmb) {
+	for (i = 0; i < drmem_info->n_lmbs; i++) {
+		lmb = &drmem_info->lmbs[i];
 		if (!prev_lmb) {
 			prev_lmb = lmb;
 			lmb_sets++;
@@ -136,7 +137,7 @@ static int drmem_update_dt_v2(struct device_node *memory,
 		}
 
 		if (prev_lmb->aa_index != lmb->aa_index ||
-		    drmem_lmb_flags(prev_lmb) != drmem_lmb_flags(lmb))
+		    prev_lmb->flags != lmb->flags)
 			lmb_sets++;
 
 		prev_lmb = lmb;
@@ -155,7 +156,8 @@ static int drmem_update_dt_v2(struct device_node *memory,
 	/* Second pass, populate the LMB set data */
 	prev_lmb = NULL;
 	seq_lmbs = 0;
-	for_each_drmem_lmb(lmb) {
+	for (i = 0; i < drmem_info->n_lmbs; i++) {
+		lmb = &drmem_info->lmbs[i];
 		if (prev_lmb == NULL) {
 			/* Start of first LMB set */
 			prev_lmb = lmb;
@@ -165,7 +167,7 @@ static int drmem_update_dt_v2(struct device_node *memory,
 		}
 
 		if (prev_lmb->aa_index != lmb->aa_index ||
-		    drmem_lmb_flags(prev_lmb) != drmem_lmb_flags(lmb)) {
+		    prev_lmb->flags != lmb->flags) {
 			/* end of one set, start of another */
 			dr_cell->seq_lmbs = cpu_to_be32(seq_lmbs);
 			dr_cell++;
@@ -371,6 +373,7 @@ void __init walk_drmem_lmbs(struct device_node *dn,
 static void __init init_drmem_v1_lmbs(const __be32 *prop)
 {
 	struct drmem_lmb *lmb;
+	int i;
 
 	drmem_info->n_lmbs = of_read_number(prop++, 1);
 	if (drmem_info->n_lmbs == 0)
@@ -381,7 +384,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
 	if (!drmem_info->lmbs)
 		return;
 
-	for_each_drmem_lmb(lmb) {
+	for (i = 0; i < drmem_info->n_lmbs; i++) {
+		lmb = &drmem_info->lmbs[i];
 		read_drconf_v1_cell(lmb, &prop);
 		if (drmem_cache_lmb_for_lookup(lmb) != 0)
 			return;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index a4d40a3ceea3..61d4c3c1e0fd 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/slab.h>
+#include <linux/xarray.h>
 
 #include <asm/firmware.h>
 #include <asm/machdep.h>
@@ -218,35 +219,6 @@ static struct memory_block *lmb_to_memblock(struct drmem_lmb *lmb)
 	return mem_block;
 }
 
-static int get_lmb_range(u32 drc_index, int n_lmbs,
-			 struct drmem_lmb **start_lmb,
-			 struct drmem_lmb **end_lmb)
-{
-	struct drmem_lmb *lmb, *start, *end;
-	struct drmem_lmb *last_lmb;
-
-	start = NULL;
-	for_each_drmem_lmb(lmb) {
-		if (lmb->drc_index == drc_index) {
-			start = lmb;
-			break;
-		}
-	}
-
-	if (!start)
-		return -EINVAL;
-
-	end = &start[n_lmbs - 1];
-
-	last_lmb = &drmem_info->lmbs[drmem_info->n_lmbs - 1];
-	if (end > last_lmb)
-		return -EINVAL;
-
-	*start_lmb = start;
-	*end_lmb = end;
-	return 0;
-}
-
 static int dlpar_change_lmb_state(struct drmem_lmb *lmb, bool online)
 {
 	struct memory_block *mem_block;
@@ -403,6 +375,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 {
 	struct drmem_lmb *lmb;
+	unsigned long index;
 	int lmbs_removed = 0;
 	int lmbs_available = 0;
 	int rc;
@@ -413,7 +386,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 		return -EINVAL;
 
 	/* Validate that there are enough LMBs to satisfy the request */
-	for_each_drmem_lmb(lmb) {
+	xa_for_each(drmem_lmb_xa, index, lmb) {
 		if (lmb_is_removable(lmb))
 			lmbs_available++;
 
@@ -427,7 +400,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 		return -EINVAL;
 	}
 
-	for_each_drmem_lmb(lmb) {
+	xa_for_each(drmem_lmb_xa, index, lmb) {
 		rc = dlpar_remove_lmb(lmb);
 		if (rc)
 			continue;
@@ -435,7 +408,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 		/* Mark this lmb so we can add it later if all of the
 		 * requested LMBs cannot be removed.
 		 */
-		drmem_mark_lmb_reserved(lmb);
+		xa_set_mark(drmem_lmb_xa, index, XA_MARK_0);
 
 		lmbs_removed++;
 		if (lmbs_removed == lmbs_to_remove)
@@ -445,29 +418,23 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 	if (lmbs_removed != lmbs_to_remove) {
 		pr_err("Memory hot-remove failed, adding LMB's back\n");
 
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			rc = dlpar_add_lmb(lmb);
 			if (rc)
 				pr_err("Failed to add LMB back, drc index %x\n",
 				       lmb->drc_index);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 
 		rc = -EINVAL;
 	} else {
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			dlpar_release_drc(lmb->drc_index);
 			pr_info("Memory at %llx was hot-removed\n",
 				lmb->base_addr);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 		rc = 0;
 	}
@@ -478,25 +445,19 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 static int dlpar_memory_remove_by_index(u32 drc_index)
 {
 	struct drmem_lmb *lmb;
-	int lmb_found;
 	int rc;
 
 	pr_info("Attempting to hot-remove LMB, drc index %x\n", drc_index);
 
-	lmb_found = 0;
-	for_each_drmem_lmb(lmb) {
-		if (lmb->drc_index == drc_index) {
-			lmb_found = 1;
-			rc = dlpar_remove_lmb(lmb);
-			if (!rc)
-				dlpar_release_drc(lmb->drc_index);
-
-			break;
-		}
+	lmb = xa_load(drmem_lmb_xa, drc_index);
+	if (lmb == NULL) {
+		pr_info("cannot hot-remove LMB %x: not found\n", drc_index);
+		return -EINVAL;
 	}
 
-	if (!lmb_found)
-		rc = -EINVAL;
+	rc = dlpar_remove_lmb(lmb);
+	if (!rc)
+		dlpar_release_drc(lmb->drc_index);
 
 	if (rc)
 		pr_info("Failed to hot-remove memory at %llx\n",
@@ -510,27 +471,22 @@ static int dlpar_memory_remove_by_index(u32 drc_index)
 static int dlpar_memory_readd_by_index(u32 drc_index)
 {
 	struct drmem_lmb *lmb;
-	int lmb_found;
 	int rc;
 
 	pr_info("Attempting to update LMB, drc index %x\n", drc_index);
 
-	lmb_found = 0;
-	for_each_drmem_lmb(lmb) {
-		if (lmb->drc_index == drc_index) {
-			lmb_found = 1;
-			rc = dlpar_remove_lmb(lmb);
-			if (!rc) {
-				rc = dlpar_add_lmb(lmb);
-				if (rc)
-					dlpar_release_drc(lmb->drc_index);
-			}
-			break;
-		}
+	lmb = xa_load(drmem_lmb_xa, drc_index);
+	if (lmb == NULL) {
+		pr_info("cannot update LMB %x: not found\n", drc_index);
+		return -EINVAL;
 	}
 
-	if (!lmb_found)
-		rc = -EINVAL;
+	rc = dlpar_remove_lmb(lmb);
+	if (!rc) {
+		rc = dlpar_add_lmb(lmb);
+		if (rc)
+			dlpar_release_drc(lmb->drc_index);
+	}
 
 	if (rc)
 		pr_info("Failed to update memory at %llx\n",
@@ -543,22 +499,21 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
 
 static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 {
-	struct drmem_lmb *lmb, *start_lmb, *end_lmb;
+	struct drmem_lmb *lmb;
+	unsigned long first, index, last;
 	int lmbs_available = 0;
 	int rc;
 
 	pr_info("Attempting to hot-remove %u LMB(s) at %x\n",
 		lmbs_to_remove, drc_index);
 
-	if (lmbs_to_remove == 0)
-		return -EINVAL;
-
-	rc = get_lmb_range(drc_index, lmbs_to_remove, &start_lmb, &end_lmb);
-	if (rc)
+	if (lmbs_to_remove == 0 || drc_index > U32_MAX - lmbs_to_remove)
 		return -EINVAL;
+	first = drc_index;
+	last = drc_index + lmbs_to_remove - 1;
 
 	/* Validate that there are enough LMBs to satisfy the request */
-	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+	xa_for_each_range(drmem_lmb_xa, index, lmb, first, last) {
 		if (lmb->flags & DRCONF_MEM_RESERVED)
 			break;
 
@@ -568,7 +523,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 	if (lmbs_available < lmbs_to_remove)
 		return -EINVAL;
 
-	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+	xa_for_each_range(drmem_lmb_xa, index, lmb, first, last) {
 		if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
 			continue;
 
@@ -576,35 +531,29 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 		if (rc)
 			break;
 
-		drmem_mark_lmb_reserved(lmb);
+		xa_set_mark(drmem_lmb_xa, index, XA_MARK_0);
 	}
 
 	if (rc) {
 		pr_err("Memory indexed-count-remove failed, adding any removed LMBs\n");
 
 
-		for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			rc = dlpar_add_lmb(lmb);
 			if (rc)
 				pr_err("Failed to add LMB, drc index %x\n",
 				       lmb->drc_index);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 		rc = -EINVAL;
 	} else {
-		for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			dlpar_release_drc(lmb->drc_index);
 			pr_info("Memory at %llx (drc index %x) was hot-removed\n",
 				lmb->base_addr, lmb->drc_index);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 	}
 
@@ -687,6 +636,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 {
 	struct drmem_lmb *lmb;
+	unsigned long index;
 	int lmbs_available = 0;
 	int lmbs_added = 0;
 	int rc;
@@ -697,7 +647,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 		return -EINVAL;
 
 	/* Validate that there are enough LMBs to satisfy the request */
-	for_each_drmem_lmb(lmb) {
+	xa_for_each(drmem_lmb_xa, index, lmb) {
 		if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
 			lmbs_available++;
 
@@ -708,7 +658,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 	if (lmbs_available < lmbs_to_add)
 		return -EINVAL;
 
-	for_each_drmem_lmb(lmb) {
+	xa_for_each(drmem_lmb_xa, index, lmb) {
 		if (lmb->flags & DRCONF_MEM_ASSIGNED)
 			continue;
 
@@ -725,7 +675,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 		/* Mark this lmb so we can remove it later if all of the
 		 * requested LMBs cannot be added.
 		 */
-		drmem_mark_lmb_reserved(lmb);
+		xa_set_mark(drmem_lmb_xa, index, XA_MARK_0);
 
 		lmbs_added++;
 		if (lmbs_added == lmbs_to_add)
@@ -735,10 +685,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 	if (lmbs_added != lmbs_to_add) {
 		pr_err("Memory hot-add failed, removing any added LMBs\n");
 
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			rc = dlpar_remove_lmb(lmb);
 			if (rc)
 				pr_err("Failed to remove LMB, drc index %x\n",
@@ -746,17 +693,14 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 			else
 				dlpar_release_drc(lmb->drc_index);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 		rc = -EINVAL;
 	} else {
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			pr_info("Memory at %llx (drc index %x) was hot-added\n",
 				lmb->base_addr, lmb->drc_index);
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 		rc = 0;
 	}
@@ -767,27 +711,22 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 static int dlpar_memory_add_by_index(u32 drc_index)
 {
 	struct drmem_lmb *lmb;
-	int rc, lmb_found;
+	int rc;
 
 	pr_info("Attempting to hot-add LMB, drc index %x\n", drc_index);
 
-	lmb_found = 0;
-	for_each_drmem_lmb(lmb) {
-		if (lmb->drc_index == drc_index) {
-			lmb_found = 1;
-			rc = dlpar_acquire_drc(lmb->drc_index);
-			if (!rc) {
-				rc = dlpar_add_lmb(lmb);
-				if (rc)
-					dlpar_release_drc(lmb->drc_index);
-			}
-
-			break;
-		}
+	lmb = xa_load(drmem_lmb_xa, drc_index);
+	if (lmb == NULL) {
+		pr_info("cannot hot-add LMB %x: not found\n", drc_index);
+		return -EINVAL;
 	}
 
-	if (!lmb_found)
-		rc = -EINVAL;
+	rc = dlpar_acquire_drc(lmb->drc_index);
+	if (!rc) {
+		rc = dlpar_add_lmb(lmb);
+		if (rc)
+			dlpar_release_drc(lmb->drc_index);
+	}
 
 	if (rc)
 		pr_info("Failed to hot-add memory, drc index %x\n", drc_index);
@@ -800,22 +739,21 @@ static int dlpar_memory_add_by_index(u32 drc_index)
 
 static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 {
-	struct drmem_lmb *lmb, *start_lmb, *end_lmb;
+	struct drmem_lmb *lmb;
+	unsigned long first, index, last;
 	int lmbs_available = 0;
 	int rc;
 
 	pr_info("Attempting to hot-add %u LMB(s) at index %x\n",
 		lmbs_to_add, drc_index);
 
-	if (lmbs_to_add == 0)
-		return -EINVAL;
-
-	rc = get_lmb_range(drc_index, lmbs_to_add, &start_lmb, &end_lmb);
-	if (rc)
+	if (lmbs_to_add == 0 || drc_index > U32_MAX - lmbs_to_add)
 		return -EINVAL;
+	first = drc_index;
+	last = drc_index + lmbs_to_add - 1;
 
 	/* Validate that the LMBs in this range are not reserved */
-	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+	xa_for_each_range(drmem_lmb_xa, index, lmb, first, last) {
 		if (lmb->flags & DRCONF_MEM_RESERVED)
 			break;
 
@@ -825,7 +763,7 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 	if (lmbs_available < lmbs_to_add)
 		return -EINVAL;
 
-	for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+	xa_for_each_range(drmem_lmb_xa, index, lmb, first, last) {
 		if (lmb->flags & DRCONF_MEM_ASSIGNED)
 			continue;
 
@@ -839,16 +777,12 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 			break;
 		}
 
-		drmem_mark_lmb_reserved(lmb);
+		xa_set_mark(drmem_lmb_xa, index, XA_MARK_0);
 	}
 
 	if (rc) {
 		pr_err("Memory indexed-count-add failed, removing any added LMBs\n");
-
-		for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			rc = dlpar_remove_lmb(lmb);
 			if (rc)
 				pr_err("Failed to remove LMB, drc index %x\n",
@@ -856,17 +790,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
 			else
 				dlpar_release_drc(lmb->drc_index);
 
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 		rc = -EINVAL;
 	} else {
-		for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+		xa_for_each_marked(drmem_lmb_xa, index, lmb, XA_MARK_0) {
 			pr_info("Memory at %llx (drc index %x) was hot-added\n",
 				lmb->base_addr, lmb->drc_index);
-			drmem_remove_lmb_reservation(lmb);
+			xa_clear_mark(drmem_lmb_xa, index, XA_MARK_0);
 		}
 	}
 
-- 
2.24.1



* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
                   ` (3 preceding siblings ...)
  2020-02-21 17:29 ` [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code Scott Cheloha
@ 2020-02-21 18:11 ` Nathan Lynch
  2020-02-21 18:28   ` Nathan Lynch
  4 siblings, 1 reply; 15+ messages in thread
From: Nathan Lynch @ 2020-02-21 18:11 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: Nathan Fontenot, Rick Lindsley, linuxppc-dev, linux-kernel,
	Michael Ellerman

Hi Scott, I've owed you a follow-up on this for a couple weeks, sorry.

Beyond the issue of whether to remove the drmem_info->lmbs array, there
are some other things to address.

Scott Cheloha <cheloha@linux.ibm.com> writes:
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 3d76e1c388c2..a37cbe794cdd 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -88,6 +88,9 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
>  	return lmb->flags & DRMEM_LMB_RESERVED;
>  }
>  
> +struct drmem_lmb *drmem_find_lmb_by_base_addr(unsigned long);
> +struct drmem_lmb *drmem_find_lmb_by_drc_index(unsigned long);

Need identifiers for the function arguments here. checkpatch warns about
this. Also drc_index conventionally is handled as u32 in this code.
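
That is, something like (a sketch):

	struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr);
	struct drmem_lmb *drmem_find_lmb_by_drc_index(u32 drc_index);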


> +struct drmem_lmb *drmem_find_lmb_by_base_addr(unsigned long base_addr)
> +{
> +	return xa_load(&drmem_lmb_base_addr, base_addr);
> +}
> +
> +struct drmem_lmb *drmem_find_lmb_by_drc_index(unsigned long drc_index)
> +{
> +	return xa_load(&drmem_lmb_drc_index, drc_index);
> +}
> +
> +static int drmem_lmb_cache_for_lookup(struct drmem_lmb *lmb)

This is called only from __init functions, so it should be __init as well.


> +{
> +	void *ret;
> +
> +	ret = xa_store(&drmem_lmb_base_addr, lmb->base_addr, lmb, GFP_KERNEL);
> +	if (xa_err(ret))
> +		return xa_err(ret);
> +
> +	ret = xa_store(&drmem_lmb_drc_index, lmb->drc_index, lmb, GFP_KERNEL);
> +	if (xa_err(ret))
> +		return xa_err(ret);
> +
> +	return 0;
> +}
> +
>  static u32 drmem_lmb_flags(struct drmem_lmb *lmb)
>  {
>  	/*
> @@ -364,6 +392,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
>  
>  	for_each_drmem_lmb(lmb) {
>  		read_drconf_v1_cell(lmb, &prop);
> +		if (drmem_lmb_cache_for_lookup(lmb) != 0)
> +			return;
>  		lmb_set_nid(lmb);
>  	}

Failing to record an lmb in the caches shouldn't be cause for silently
aborting this initialization. Future lookups against the caches (should
the system even boot) may fail, but the drmem_lmbs will still be
initialized correctly.

I'd say just ignore (or perhaps log once) xa_store() failures as long as
this code only runs at boot.
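
E.g. (a sketch):

	ret = xa_store(&drmem_lmb_base_addr, lmb->base_addr, lmb, GFP_KERNEL);
	if (xa_err(ret))
		pr_warn_once("drmem: failed to cache LMB, lookups may be slow\n");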


> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 50d68d21ddcc..23684d44549f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,27 +958,21 @@ early_param("topology_updates", early_topology_updates);
>  static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
>  {
>  	struct drmem_lmb *lmb;
> -	unsigned long lmb_size;
> -	int nid = NUMA_NO_NODE;
> -
> -	lmb_size = drmem_lmb_size();
> -
> -	for_each_drmem_lmb(lmb) {
> -		/* skip this block if it is reserved or not assigned to
> -		 * this partition */
> -		if ((lmb->flags & DRCONF_MEM_RESERVED)
> -		    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
> -			continue;
>  
> -		if ((scn_addr < lmb->base_addr)
> -		    || (scn_addr >= (lmb->base_addr + lmb_size)))
> -			continue;
> +	lmb = drmem_find_lmb_by_base_addr(scn_addr);
> +	if (lmb == NULL)
> +		return NUMA_NO_NODE;
>  
> -		nid = of_drconf_to_nid_single(lmb);
> -		break;
> -	}
> +	/*
> +	 * We can't use it if it is reserved or not assigned to
> +	 * this partition.
> +	 */
> +	if (lmb->flags & DRCONF_MEM_RESERVED)
> +		return NUMA_NO_NODE;
> +	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
> +		return NUMA_NO_NODE;
>  
> -	return nid;
> +	return of_drconf_to_nid_single(lmb);
>  }
>  
>  /*
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index c126b94d1943..29bd19831a9a 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -222,17 +222,10 @@ static int get_lmb_range(u32 drc_index, int n_lmbs,
>  			 struct drmem_lmb **start_lmb,
>  			 struct drmem_lmb **end_lmb)
>  {
> -	struct drmem_lmb *lmb, *start, *end;
> +	struct drmem_lmb *start, *end;
>  	struct drmem_lmb *last_lmb;
>  
> -	start = NULL;
> -	for_each_drmem_lmb(lmb) {
> -		if (lmb->drc_index == drc_index) {
> -			start = lmb;
> -			break;
> -		}
> -	}
> -
> +	start = drmem_find_lmb_by_drc_index(drc_index);
>  	if (!start)
>  		return -EINVAL;

The changes to hot_add_drconf_scn_to_nid() and get_lmb_range() look
correct to me.


* Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup
  2020-02-21 18:11 ` [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Nathan Lynch
@ 2020-02-21 18:28   ` Nathan Lynch
  0 siblings, 0 replies; 15+ messages in thread
From: Nathan Lynch @ 2020-02-21 18:28 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: Nathan Fontenont, Rick Lindsley, linuxppc-dev, linux-kernel,
	Michael Ellerman

Nathan Lynch <nathanl@linux.ibm.com> writes:

> Hi Scott, I've owed you a follow-up on this for a couple weeks, sorry.

I see v2 was posted as I was writing this, so I'll follow up on that
thread.


* Re: [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray
  2020-02-21 17:29 ` [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray Scott Cheloha
@ 2020-02-21 20:02   ` Nathan Lynch
  2020-07-22 23:00   ` Anton Blanchard
  1 sibling, 0 replies; 15+ messages in thread
From: Nathan Lynch @ 2020-02-21 20:02 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: Rick Lindsley, David Hildenbrand, Michal Suchanek, Michal Hocko,
	Nathan Fontenont, linux-kernel, linuxppc-dev, Michael Ellerman

Hi Scott,

Scott Cheloha <cheloha@linux.ibm.com> writes:
> On PowerPC, memory_add_physaddr_to_nid() uses a linear search to find
> an LMB matching the given address.  This scales very poorly when there
> are many LMBs.  The poor scaling cripples drmem_init() during boot:
> lmb_set_nid(), which calls memory_add_physaddr_to_nid(), is called for
> each LMB.
>
> If we index each LMB in an xarray by its base address we can achieve
> O(log n) search during memory_add_physaddr_to_nid(), which scales much
> better.

Is O(log n) correct? I had thought lookup complexity depends on the
characteristics of the key.

Repeating some points from my comments on v1 below.


> +static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb)

This is called only from __init functions, so it should be __init as well.

> +{
> +	void *ret;
> +
> +	ret = xa_store(&drmem_lmb_xa_base_addr, lmb->base_addr, lmb,
> +		       GFP_KERNEL);
> +	if (xa_is_err(ret))
> +		return xa_err(ret);
> +
> +	return 0;
> +}

[...]

> @@ -364,6 +383,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
>  
>  	for_each_drmem_lmb(lmb) {
>  		read_drconf_v1_cell(lmb, &prop);
> +		if (drmem_cache_lmb_for_lookup(lmb) != 0)
> +			return;
>  		lmb_set_nid(lmb);
>  	}
>  }
> @@ -411,6 +432,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
>  			lmb->aa_index = dr_cell.aa_index;
>  			lmb->flags = dr_cell.flags;
>  
> +			if (drmem_cache_lmb_for_lookup(lmb) != 0)
> +				return;
> +

Failing to record an lmb in the cache shouldn't be cause for silently
aborting this initialization. Future lookups against the caches (should
the system even boot) may fail, but the drmem_lmbs will still be
initialized correctly.

I'd say just ignore (or perhaps log once) xa_store() failures as long as
this code only runs at boot.


> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c7dec70cda0..0fd7963a991e 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,27 +958,18 @@ early_param("topology_updates", early_topology_updates);
>  static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
>  {
>  	struct drmem_lmb *lmb;
> -	unsigned long lmb_size;
> -	int nid = NUMA_NO_NODE;
> -
> -	lmb_size = drmem_lmb_size();
> -
> -	for_each_drmem_lmb(lmb) {
> -		/* skip this block if it is reserved or not assigned to
> -		 * this partition */
> -		if ((lmb->flags & DRCONF_MEM_RESERVED)
> -		    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
> -			continue;
> -
> -		if ((scn_addr < lmb->base_addr)
> -		    || (scn_addr >= (lmb->base_addr + lmb_size)))
> -			continue;
>  
> -		nid = of_drconf_to_nid_single(lmb);
> -		break;
> -	}
> +	lmb = drmem_find_lmb_by_base_addr(scn_addr);

It occurs to me that the old behavior here will succeed in looking up a
drmem_lmb if scn_addr is within it, while the new behavior is that the
given address must match the starting address of the drmem_lmb. I think
this is probably OK though?
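
If the in-range behavior ever turns out to matter, I suppose it could be
recovered by rounding down before the lookup, assuming LMB base
addresses are lmb_size-aligned (untested):

	lmb = drmem_find_lmb_by_base_addr(ALIGN_DOWN(scn_addr, drmem_lmb_size()));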


> +	if (lmb == NULL)
> +		return NUMA_NO_NODE;
> +	
> +	/* can't use this block if it is reserved or not assigned to
> +	 * this partition */
> +	if ((lmb->flags & DRCONF_MEM_RESERVED)
> +	    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
> +		return NUMA_NO_NODE;
>  
> -	return nid;
> +	return of_drconf_to_nid_single(lmb);
>  }

Otherwise I have no concerns about the changes to
hot_add_drconf_scn_to_nid.

Other than the minor points I've made here, this looks like a tidy
self-contained change to fix drmem initialization latency.


* Re: [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code
  2020-02-21 17:29 ` [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code Scott Cheloha
@ 2020-02-21 20:03   ` Nathan Lynch
  0 siblings, 0 replies; 15+ messages in thread
From: Nathan Lynch @ 2020-02-21 20:03 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: Rick Lindsley, David Hildenbrand, Michal Suchanek, Michal Hocko,
	Nathan Fontenont, linux-kernel, linuxppc-dev, Michael Ellerman

Hi Scott,

Scott Cheloha <cheloha@linux.ibm.com> writes:
> -#define for_each_drmem_lmb_in_range(lmb, start, end)		\
> -	for ((lmb) = (start); (lmb) <= (end); (lmb)++)
> -
> -#define for_each_drmem_lmb(lmb)					\
> -	for_each_drmem_lmb_in_range((lmb),			\
> -		&drmem_info->lmbs[0],				\
> -		&drmem_info->lmbs[drmem_info->n_lmbs - 1])

A couple things.

This will conflict with "powerpc/pseries: Avoid NULL pointer dereference
when drmem is unavailable" which is in linuxppc/next-test:
   
   https://patchwork.ozlabs.org/patch/1231904/

Regardless, I don't think trading the iterator macros for open-coded
loops improves the code:

> -	for_each_drmem_lmb(lmb) {
> +	for (i = 0; i < drmem_info->n_lmbs; i++) {
> +		lmb = &drmem_info->lmbs[i];
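
If the lmbs array does go away eventually, the macros could presumably
be reimplemented on top of the xarray rather than dropped. An untested
sketch (it does force an index variable on every caller):

	#define for_each_drmem_lmb(lmb, index)			\
		xa_for_each(drmem_lmb_xa, index, lmb)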

[...]

> +struct xarray;
> +extern struct xarray *drmem_lmb_xa;

drmem_lmb_xa should go in the drmem_info structure if you can't make it
static in drmem.c.
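
Roughly (assuming the existing fields stay as-is):

	struct drmem_lmb_info {
		struct drmem_lmb	*lmbs;
		int			n_lmbs;
		u32			lmb_size;
		struct xarray		lmb_xa;	/* instead of the global pointer */
	};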

>  
>  /*
>   * The of_drconf_cell_v1 struct defines the layout of the LMB data
> @@ -71,23 +66,6 @@ static inline u32 drmem_lmb_size(void)
>  	return drmem_info->lmb_size;
>  }
>  
> -#define DRMEM_LMB_RESERVED	0x80000000
> -
> -static inline void drmem_mark_lmb_reserved(struct drmem_lmb *lmb)
> -{
> -	lmb->flags |= DRMEM_LMB_RESERVED;
> -}
> -
> -static inline void drmem_remove_lmb_reservation(struct drmem_lmb *lmb)
> -{
> -	lmb->flags &= ~DRMEM_LMB_RESERVED;
> -}
> -
> -static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
> -{
> -	return lmb->flags & DRMEM_LMB_RESERVED;
> -}

The flag management is logically separate from the iterator changes, so
splitting that out would ease review.

Looking further... yes, this needs to be a series of smaller changes,
please.


* Re: [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray
  2020-02-21 17:29 ` [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray Scott Cheloha
  2020-02-21 20:02   ` Nathan Lynch
@ 2020-07-22 23:00   ` Anton Blanchard
  1 sibling, 0 replies; 15+ messages in thread
From: Anton Blanchard @ 2020-07-22 23:00 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: linuxppc-dev, Michael Ellerman, Nathan Lynch, Nathan Fontenont,
	Michal Hocko, Michal Suchanek, David Hildenbrand, linux-kernel,
	Rick Lindsley

Hi Scott,

I'm hitting this issue and Rick just pointed me at your patch. Any
chance we could get it upstream?

Thanks,
Anton

> On PowerPC, memory_add_physaddr_to_nid() uses a linear search to find
> an LMB matching the given address.  This scales very poorly when there
> are many LMBs.  The poor scaling cripples drmem_init() during boot:
> lmb_set_nid(), which calls memory_add_physaddr_to_nid(), is called for
> each LMB.
> 
> If we index each LMB in an xarray by its base address we can achieve
> O(log n) search during memory_add_physaddr_to_nid(), which scales much
> better.
> 
> For example, in the lab we have a 64TB P9 machine with 256MB LMBs.
> So, during drmem_init() we instantiate 249854 LMBs.  On a vanilla
> kernel it completes drmem_init() in ~35 seconds with a soft lockup
> trace.  On the patched kernel it completes drmem_init() in ~0.5
> seconds.
> 
> Before:
> [   53.721639] drmem: initializing drmem v2
> [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
> [   80.604377] Modules linked in:
> [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
> [   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
> [   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
> [   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
> [   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
> [   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
> [   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
> [   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
> [   80.604431] GPR12: 0000000000000000 c00000001e8ec200
> [   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
> [   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
> [   80.604492] Call Trace:
> [   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
> [   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
> [   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
> [   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
> [   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
> [   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
> [   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
> [   80.604567] Instruction dump:
> [   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
> [   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
> [   89.047390] drmem: 249854 LMB(s)
> 
> After:
> [   53.424702] drmem: initializing drmem v2
> [   53.898813] drmem: 249854 LMB(s)
> 
> lmb_set_nid() is called from dlpar_lmb_add() so this patch will also
> improve memory hot-add speeds on big machines.
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/drmem.h |  1 +
>  arch/powerpc/mm/drmem.c          | 24 ++++++++++++++++++++++++
>  arch/powerpc/mm/numa.c           | 29 ++++++++++-------------------
>  3 files changed, 35 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
> index 3d76e1c388c2..90a5a9ad872b 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -88,6 +88,7 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
>  	return lmb->flags & DRMEM_LMB_RESERVED;
>  }
>  
> +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr);
>  u64 drmem_lmb_memory_max(void);
>  void __init walk_drmem_lmbs(struct device_node *dn,
>  			void (*func)(struct drmem_lmb *, const __be32 **));
> diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
> index 44bfbdae920c..62cbe79e3860 100644
> --- a/arch/powerpc/mm/drmem.c
> +++ b/arch/powerpc/mm/drmem.c
> @@ -11,12 +11,31 @@
>  #include <linux/of.h>
>  #include <linux/of_fdt.h>
>  #include <linux/memblock.h>
> +#include <linux/xarray.h>
>  #include <asm/prom.h>
>  #include <asm/drmem.h>
>  
> +static DEFINE_XARRAY(drmem_lmb_xa_base_addr);
>  static struct drmem_lmb_info __drmem_info;
>  struct drmem_lmb_info *drmem_info = &__drmem_info;
>  
> +static int drmem_cache_lmb_for_lookup(struct drmem_lmb *lmb)
> +{
> +	void *ret;
> +
> +	ret = xa_store(&drmem_lmb_xa_base_addr, lmb->base_addr, lmb,
> +		       GFP_KERNEL);
> +	if (xa_is_err(ret))
> +		return xa_err(ret);
> +
> +	return 0;
> +}
> +
> +struct drmem_lmb *drmem_find_lmb_by_base_addr(u64 base_addr)
> +{
> +	return xa_load(&drmem_lmb_xa_base_addr, base_addr);
> +}
> +
>  u64 drmem_lmb_memory_max(void)
>  {
>  	struct drmem_lmb *last_lmb;
> @@ -364,6 +383,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
>  
>  	for_each_drmem_lmb(lmb) {
>  		read_drconf_v1_cell(lmb, &prop);
> +		if (drmem_cache_lmb_for_lookup(lmb) != 0)
> +			return;
>  		lmb_set_nid(lmb);
>  	}
>  }
> @@ -411,6 +432,9 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)
>  			lmb->aa_index = dr_cell.aa_index;
>  			lmb->flags = dr_cell.flags;
>  
> +			if (drmem_cache_lmb_for_lookup(lmb) != 0)
> +				return;
> +
>  			lmb_set_nid(lmb);
>  		}
>  	}
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c7dec70cda0..0fd7963a991e 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,27 +958,18 @@ early_param("topology_updates", early_topology_updates);
>  static int hot_add_drconf_scn_to_nid(unsigned long scn_addr)
>  {
>  	struct drmem_lmb *lmb;
> -	unsigned long lmb_size;
> -	int nid = NUMA_NO_NODE;
> -
> -	lmb_size = drmem_lmb_size();
> -
> -	for_each_drmem_lmb(lmb) {
> -		/* skip this block if it is reserved or not assigned to
> -		 * this partition */
> -		if ((lmb->flags & DRCONF_MEM_RESERVED)
> -		    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
> -			continue;
> -
> -		if ((scn_addr < lmb->base_addr)
> -		    || (scn_addr >= (lmb->base_addr + lmb_size)))
> -			continue;
>  
> -		nid = of_drconf_to_nid_single(lmb);
> -		break;
> -	}
> +	lmb = drmem_find_lmb_by_base_addr(scn_addr);
> +	if (lmb == NULL)
> +		return NUMA_NO_NODE;
> +	
> +	/* can't use this block if it is reserved or not assigned to
> +	 * this partition */
> +	if ((lmb->flags & DRCONF_MEM_RESERVED)
> +	    || !(lmb->flags & DRCONF_MEM_ASSIGNED))
> +		return NUMA_NO_NODE;
>  
> -	return nid;
> +	return of_drconf_to_nid_single(lmb);
>  }
>  
>  /*



end of thread

Thread overview: 15+ messages
2020-01-28 22:11 [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Scott Cheloha
2020-01-28 23:56 ` Nathan Lynch
2020-01-29 18:10   ` Scott Cheloha
2020-01-30 16:09     ` Fontenot, Nathan
2020-02-03 20:13       ` Scott Cheloha
2020-02-05 14:33         ` Fontenot, Nathan
2020-02-04 16:19   ` Scott Cheloha
2020-02-21 17:29 ` pseries: accelerate drmem and simplify hotplug with xarrays Scott Cheloha
2020-02-21 17:29 ` [PATCH v2 1/2] powerpc/drmem: accelerate memory_add_physaddr_to_nid() with LMB xarray Scott Cheloha
2020-02-21 20:02   ` Nathan Lynch
2020-07-22 23:00   ` Anton Blanchard
2020-02-21 17:29 ` [PATCH v2 2/2] pseries/hotplug-memory: leverage xarray API to simplify code Scott Cheloha
2020-02-21 20:03   ` Nathan Lynch
2020-02-21 18:11 ` [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup Nathan Lynch
2020-02-21 18:28   ` Nathan Lynch
