* [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
       [not found] <20191217193238-1-cheloha@linux.vnet.ibm.com>
@ 2020-01-09 21:19 ` Scott Cheloha
  2020-01-09 21:30   ` Michal Hocko
  2020-01-09 21:25 ` [PATCH v4] " Scott Cheloha
  1 sibling, 1 reply; 19+ messages in thread
From: Scott Cheloha @ 2020-01-09 21:19 UTC (permalink / raw)
  To: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	David Hildenbrand, Andrew Morton
  Cc: nathanl, ricklind, mhocko, Scott Cheloha

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache the memory blocks in
a radix tree.  With a radix tree cache in place both memory subsystem
initialization and memory hotplug run palpably faster on systems with a
large number of memory blocks.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
v2 incorporates suggestions from David Hildenbrand.

v3 changes:
  - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

  - Be conservative: don't use radix_tree_for_each_slot() in
    walk_memory_blocks() yet.  It introduces RCU which could
    change behavior.  Walking the tree "by hand" with
    find_memory_block_by_id() is slower but keeps the patch
    simple.
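
(Editorial sketch, not part of this patch: roughly what the deferred
iterator-based walk could look like inside drivers/base/memory.c once
the radix tree below exists.  The function name is hypothetical, the
start/size range handling of the real walk_memory_blocks() is omitted,
and the comment marks the func()-may-sleep problem that makes the RCU
conversion non-trivial.)

static int walk_memory_blocks_sketch(void *arg,
				     walk_memory_blocks_func_t func)
{
	struct radix_tree_iter iter;
	void __rcu **slot;
	int ret = 0;

	rcu_read_lock();
	radix_tree_for_each_slot(slot, &memory_blocks, &iter, 0) {
		struct memory_block *mem = radix_tree_deref_slot(slot);

		if (radix_tree_deref_retry(mem)) {
			slot = radix_tree_iter_retry(&iter);
			continue;
		}
		/*
		 * func() may sleep, so a real conversion could not simply
		 * hold rcu_read_lock() across this call; hence the
		 * conservative choice above.
		 */
		ret = func(mem, arg);
		if (ret)
			break;
	}
	rcu_read_unlock();
	return ret;
}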

v4 changes:
  - Rewrite commit message to explicitly note the time
    complexity improvements.

  - Provide anecdotal accounts of time-savings in the changelog
    (see below).

mhocko@suse.com has asked for additional details on time
savings, so here are some results I've collected when measuring
memory_dev_init() with/without the patch.
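
(The instrumentation itself is not shown in this thread; something like
the hypothetical helper below, called from memory_dev_init()'s
registration loop, is assumed to have produced the "added block N"
timestamps in example 2.)

static void __init report_block_added(unsigned long block_id)
{
	/* Hypothetical debug aid, not part of the patch. */
	if (block_id % 1024 == 0)
		pr_info("memory_dev_init: added block %lu\n", block_id);
}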

1. A 32GB POWER9 VM with 16MB memblocks has 2048 blocks:

# Unpatched
[    0.005121] adding memory block 0... ok
[...]
[    0.095230] adding memory block 1024... ok
[...]
[    0.304248] adding memory block 2047... ok
[    0.304508] added all memory blocks

# Patched
[    0.004701] adding memory block 0... ok
[...]
[    0.033383] adding memory block 1024... ok
[...]
[    0.061387] adding memory block 2047... ok
[    0.061414] added all memory blocks

   Unpatched, memory_dev_init() runs in about 0.299 seconds.  Patched,
   it runs in about 0.057 seconds.  Savings of .242 seconds, or nearly
   a quarter of a second.

2. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks:

# Unpatched
[   13.703907] memory_dev_init: adding blocks
[   13.703931] memory_dev_init: added block 0
[   13.762678] memory_dev_init: added block 1024
[   13.910359] memory_dev_init: added block 2048
[   14.146941] memory_dev_init: added block 3072
[...]
[  218.516235] memory_dev_init: added block 57344
[  229.310467] memory_dev_init: added block 58368
[  240.590857] memory_dev_init: added block 59392
[  252.351665] memory_dev_init: added block 60416
[...]
[ 2152.023248] memory_dev_init: added block 128000
[ 2196.464430] memory_dev_init: added block 129024
[ 2241.746515] memory_dev_init: added block 130048
[ 2287.406099] memory_dev_init: added all blocks

# Patched
[   13.696898] memory_dev_init: adding blocks
[   13.696920] memory_dev_init: added block 0
[   13.710966] memory_dev_init: added block 1024
[   13.724865] memory_dev_init: added block 2048
[   13.738802] memory_dev_init: added block 3072
[...]
[   14.520999] memory_dev_init: added block 57344
[   14.536355] memory_dev_init: added block 58368
[   14.551747] memory_dev_init: added block 59392
[   14.567128] memory_dev_init: added block 60416
[...]
[   15.595638] memory_dev_init: added block 126976
[   15.611761] memory_dev_init: added block 128000
[   15.627889] memory_dev_init: added block 129024
[   15.644048] memory_dev_init: added block 130048
[   15.660035] memory_dev_init: added all blocks

   Unpatched, memory_dev_init() runs in about 2275 seconds,
   or ~37 minutes.  Patched, memory_dev_init() runs in about
   1.97 seconds.  Savings of ~37 minutes.

   I did not actually measure walk_memory_blocks(), but during
   boot on this machine without the patch I got the following
   (abbreviated) traces:

[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

   Those traces disappeared with the patch, so I'm pretty sure
   this patch shaves ~30 minutes off of walk_memory_blocks()
   at boot.

Given the above results I think it is safe to say that this patch will
dramatically improve boot times on large POWER systems.
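
As a rough illustration of the complexity claim (assuming the default
64-slot radix tree nodes): with 131072 blocks, an unsorted-list search
averages about 131072 / 2 = 65536 device comparisons per lookup, whereas
a 64-way radix tree reaches any block id in at most three node descents,
since 64^3 = 262144 >= 131072.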

 drivers/base/memory.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8902930d5ef2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,6 +19,7 @@
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
 
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static RADIX_TREE(memory_blocks, GFP_KERNEL);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
 /* A reference for the returned memory block device is acquired. */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = radix_tree_lookup(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
 	unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,15 @@ int register_memory(struct memory_block *memory)
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -688,6 +696,8 @@ static void unregister_memory(struct memory_block *memory)
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
-- 
2.24.1



* [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
       [not found] <20191217193238-1-cheloha@linux.vnet.ibm.com>
  2020-01-09 21:19 ` [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup Scott Cheloha
@ 2020-01-09 21:25 ` Scott Cheloha
  2020-01-09 22:00   ` Andrew Morton
                     ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Scott Cheloha @ 2020-01-09 21:25 UTC (permalink / raw)
  To: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	David Hildenbrand, Andrew Morton
  Cc: nathanl, ricklind, mhocko, Scott Cheloha

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache the memory blocks in
a radix tree.  With a radix tree cache in place both memory subsystem
initialization and memory hotplug run palpably faster on systems with a
large number of memory blocks.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
(Sorry for the duplicate mail, forgot to mark this as v4)

v2 incorporates suggestions from David Hildenbrand.

v3 changes:
  - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

  - Be conservative: don't use radix_tree_for_each_slot() in
    walk_memory_blocks() yet.  It introduces RCU which could
    change behavior.  Walking the tree "by hand" with
    find_memory_block_by_id() is slower but keeps the patch
    simple.

v4 changes:
  - Rewrite commit message to explicitly note the time
    complexity improvements.

  - Provide anecdotal accounts of time-savings in the changelog
    (see below).

mhocko@suse.com has asked for additional details on time
savings, so here are some results I've collected when measuring
memory_dev_init() with/without the patch.

1. A 32GB POWER9 VM with 16MB memblocks has 2048 blocks:

# Unpatched
[    0.005121] adding memory block 0... ok
[...]
[    0.095230] adding memory block 1024... ok
[...]
[    0.304248] adding memory block 2047... ok
[    0.304508] added all memory blocks

# Patched
[    0.004701] adding memory block 0... ok
[...]
[    0.033383] adding memory block 1024... ok
[...]
[    0.061387] adding memory block 2047... ok
[    0.061414] added all memory blocks

   Unpatched, memory_dev_init() runs in about 0.299 seconds.  Patched,
   it runs in about 0.057 seconds.  Savings of .242 seconds, or nearly
   a quarter of a second.

2. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks:

# Unpatched
[   13.703907] memory_dev_init: adding blocks
[   13.703931] memory_dev_init: added block 0
[   13.762678] memory_dev_init: added block 1024
[   13.910359] memory_dev_init: added block 2048
[   14.146941] memory_dev_init: added block 3072
[...]
[  218.516235] memory_dev_init: added block 57344
[  229.310467] memory_dev_init: added block 58368
[  240.590857] memory_dev_init: added block 59392
[  252.351665] memory_dev_init: added block 60416
[...]
[ 2152.023248] memory_dev_init: added block 128000
[ 2196.464430] memory_dev_init: added block 129024
[ 2241.746515] memory_dev_init: added block 130048
[ 2287.406099] memory_dev_init: added all blocks

# Patched
[   13.696898] memory_dev_init: adding blocks
[   13.696920] memory_dev_init: added block 0
[   13.710966] memory_dev_init: added block 1024
[   13.724865] memory_dev_init: added block 2048
[   13.738802] memory_dev_init: added block 3072
[...]
[   14.520999] memory_dev_init: added block 57344
[   14.536355] memory_dev_init: added block 58368
[   14.551747] memory_dev_init: added block 59392
[   14.567128] memory_dev_init: added block 60416
[...]
[   15.595638] memory_dev_init: added block 126976
[   15.611761] memory_dev_init: added block 128000
[   15.627889] memory_dev_init: added block 129024
[   15.644048] memory_dev_init: added block 130048
[   15.660035] memory_dev_init: added all blocks

   Unpatched, memory_dev_init() runs in about 2275 seconds,
   or ~37 minutes.  Patched, memory_dev_init() runs in about
   1.97 seconds.  Savings of ~37 minutes.

   I did not actually measure walk_memory_blocks(), but during
   boot on this machine without the patch I got the following
   (abbreviated) traces:

[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

   Those traces disappeared with the patch, so I'm pretty sure
   this patch shaves ~30 minutes off of walk_memory_blocks()
   at boot.

Given the above results I think it is safe to say that this patch will
dramatically improve boot times on large POWER systems.

 drivers/base/memory.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8902930d5ef2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,6 +19,7 @@
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
 
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static RADIX_TREE(memory_blocks, GFP_KERNEL);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
 /* A reference for the returned memory block device is acquired. */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = radix_tree_lookup(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
 	unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,15 @@ int register_memory(struct memory_block *memory)
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -688,6 +696,8 @@ static void unregister_memory(struct memory_block *memory)
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
-- 
2.24.1



* Re: [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 21:19 ` [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup Scott Cheloha
@ 2020-01-09 21:30   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2020-01-09 21:30 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	David Hildenbrand, Andrew Morton, nathanl, ricklind,
	Scott Cheloha

On Thu 09-01-20 15:19:52, Scott Cheloha wrote:
> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
> 
> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> a radix tree.  With a radix tree cache in place both memory subsystem
> initialization and memory hotplug run palpably faster on systems with a
> large number of memory blocks.
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> v2 incorporates suggestions from David Hildenbrand.
> 
> v3 changes:
>   - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"
> 
>   - Be conservative: don't use radix_tree_for_each_slot() in
>     walk_memory_blocks() yet.  It introduces RCU which could
>     change behavior.  Walking the tree "by hand" with
>     find_memory_block_by_id() is slower but keeps the patch
>     simple.
> 
> v4 changes:
>   - Rewrite commit message to explicitly note the time
>     complexity improvements.
> 
>   - Provide anecdotal accounts of time-savings in the changelog
>     (see below).
> 
> mhocko@suse.com has asked for additional details on time
> savings, so here are some results I've collected when measuring
> memory_dev_init() with/without the patch.

This data should be part of the changelog. Thanks!

> 1. A 32GB POWER9 VM with 16MB memblocks has 2048 blocks:
> 
> # Unpatched
> [    0.005121] adding memory block 0... ok
> [...]
> [    0.095230] adding memory block 1024... ok
> [...]
> [    0.304248] adding memory block 2047... ok
> [    0.304508] added all memory blocks
> 
> # Patched
> [    0.004701] adding memory block 0... ok
> [...]
> [    0.033383] adding memory block 1024... ok
> [...]
> [    0.061387] adding memory block 2047... ok
> [    0.061414] added all memory blocks
> 
>    Unpatched, memory_dev_init() runs in about 0.299 seconds.  Patched,
>    it runs in about 0.057 seconds.  Savings of .242 seconds, or nearly
>    a quarter of a second.
> 
> 2. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks:
> 
> # Unpatched
> [   13.703907] memory_dev_init: adding blocks
> [   13.703931] memory_dev_init: added block 0
> [   13.762678] memory_dev_init: added block 1024
> [   13.910359] memory_dev_init: added block 2048
> [   14.146941] memory_dev_init: added block 3072
> [...]
> [  218.516235] memory_dev_init: added block 57344
> [  229.310467] memory_dev_init: added block 58368
> [  240.590857] memory_dev_init: added block 59392
> [  252.351665] memory_dev_init: added block 60416
> [...]
> [ 2152.023248] memory_dev_init: added block 128000
> [ 2196.464430] memory_dev_init: added block 129024
> [ 2241.746515] memory_dev_init: added block 130048
> [ 2287.406099] memory_dev_init: added all blocks
> 
> # Patched
> [   13.696898] memory_dev_init: adding blocks
> [   13.696920] memory_dev_init: added block 0
> [   13.710966] memory_dev_init: added block 1024
> [   13.724865] memory_dev_init: added block 2048
> [   13.738802] memory_dev_init: added block 3072
> [...]
> [   14.520999] memory_dev_init: added block 57344
> [   14.536355] memory_dev_init: added block 58368
> [   14.551747] memory_dev_init: added block 59392
> [   14.567128] memory_dev_init: added block 60416
> [...]
> [   15.595638] memory_dev_init: added block 126976
> [   15.611761] memory_dev_init: added block 128000
> [   15.627889] memory_dev_init: added block 129024
> [   15.644048] memory_dev_init: added block 130048
> [   15.660035] memory_dev_init: added all blocks
> 
>    Unpatched, memory_dev_init() runs in about 2275 seconds,
>    or ~37 minutes.  Patched, memory_dev_init() runs in about
>    1.97 seconds.  Savings of ~37 minutes.
> 
>    I did not actually measure walk_memory_blocks(), but during
>    boot on this machine without the patch I got the following
>    (abbreviated) traces:
> 
> [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> 
>    Those traces disappeared with the patch, so I'm pretty sure
>    this patch shaves ~30 minutes off of walk_memory_blocks()
>    at boot.
> 
> Given the above results I think it is safe to say that this patch will
> dramatically improve boot times on large POWER systems.
> 
>  drivers/base/memory.c | 36 +++++++++++++++++++++++-------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..8902930d5ef2 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -19,6 +19,7 @@
>  #include <linux/memory.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm.h>
> +#include <linux/radix-tree.h>
>  #include <linux/stat.h>
>  #include <linux/slab.h>
>  
> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>  	.offline = memory_subsys_offline,
>  };
>  
> +/*
> + * Memory blocks are cached in a local radix tree to avoid
> + * a costly linear search for the corresponding device on
> + * the subsystem bus.
> + */
> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> +
>  static BLOCKING_NOTIFIER_HEAD(memory_chain);
>  
>  int register_memory_notifier(struct notifier_block *nb)
> @@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>  /* A reference for the returned memory block device is acquired. */
>  static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>  {
> -	struct device *dev;
> +	struct memory_block *mem;
>  
> -	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
> -	return dev ? to_memory_block(dev) : NULL;
> +	mem = radix_tree_lookup(&memory_blocks, block_id);
> +	if (mem)
> +		get_device(&mem->dev);
> +	return mem;
>  }
>  
> -/*
> - * For now, we have a linear search to go find the appropriate
> - * memory_block corresponding to a particular phys_index. If
> - * this gets to be a real problem, we can always use a radix
> - * tree or something here.
> - *
> - * This could be made generic for all device subsystems.
> - */
>  struct memory_block *find_memory_block(struct mem_section *section)
>  {
>  	unsigned long block_id = base_memory_block_id(__section_nr(section));
> @@ -628,9 +630,15 @@ int register_memory(struct memory_block *memory)
>  	memory->dev.offline = memory->state == MEM_OFFLINE;
>  
>  	ret = device_register(&memory->dev);
> -	if (ret)
> +	if (ret) {
>  		put_device(&memory->dev);
> -
> +		return ret;
> +	}
> +	ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
> +	if (ret) {
> +		put_device(&memory->dev);
> +		device_unregister(&memory->dev);
> +	}
>  	return ret;
>  }
>  
> @@ -688,6 +696,8 @@ static void unregister_memory(struct memory_block *memory)
>  	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>  		return;
>  
> +	WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
> +
>  	/* drop the ref. we got via find_memory_block() */
>  	put_device(&memory->dev);
>  	device_unregister(&memory->dev);
> -- 
> 2.24.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 21:25 ` [PATCH v4] " Scott Cheloha
@ 2020-01-09 22:00   ` Andrew Morton
  2020-01-09 22:17     ` David Hildenbrand
  2020-01-15 19:09   ` David Hildenbrand
  2020-01-21 23:10   ` [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray " Scott Cheloha
  2 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2020-01-09 22:00 UTC (permalink / raw)
  To: Scott Cheloha
  Cc: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	David Hildenbrand, nathanl, ricklind, mhocko, Scott Cheloha

On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:

> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
> 
> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> a radix tree.  With a radix tree cache in place both memory subsystem
> initialization and memory hotplug run palpably faster on systems with a
> large number of memory blocks.
> 
> ...
>
> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>  	.offline = memory_subsys_offline,
>  };
>  
> +/*
> + * Memory blocks are cached in a local radix tree to avoid
> + * a costly linear search for the corresponding device on
> + * the subsystem bus.
> + */
> +static RADIX_TREE(memory_blocks, GFP_KERNEL);

What protects this tree from racy accesses?


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 22:00   ` Andrew Morton
@ 2020-01-09 22:17     ` David Hildenbrand
  2020-01-09 22:27       ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-09 22:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, David Hildenbrand, nathanl, ricklind, mhocko,
	Scott Cheloha



> Am 09.01.2020 um 23:00 schrieb Andrew Morton <akpm@linux-foundation.org>:
> 
> On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> 
>> Searching for a particular memory block by id is an O(n) operation
>> because each memory block's underlying device is kept in an unsorted
>> linked list on the subsystem bus.
>> 
>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>> a radix tree.  With a radix tree cache in place both memory subsystem
>> initialization and memory hotplug run palpably faster on systems with a
>> large number of memory blocks.
>> 
>> ...
>> 
>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>>    .offline = memory_subsys_offline,
>> };
>> 
>> +/*
>> + * Memory blocks are cached in a local radix tree to avoid
>> + * a costly linear search for the corresponding device on
>> + * the subsystem bus.
>> + */
>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> 
> What protects this tree from racy accesses?

I think the device hotplug lock currently (except during boot where no races can happen).


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 22:17     ` David Hildenbrand
@ 2020-01-09 22:27       ` Andrew Morton
  2020-01-09 22:35         ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2020-01-09 22:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, nathanl, ricklind, mhocko, Scott Cheloha

On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <david@redhat.com> wrote:

> 
> 
> > Am 09.01.2020 um 23:00 schrieb Andrew Morton <akpm@linux-foundation.org>:
> > 
> > On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> > 
> >> Searching for a particular memory block by id is an O(n) operation
> >> because each memory block's underlying device is kept in an unsorted
> >> linked list on the subsystem bus.
> >> 
> >> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> >> a radix tree.  With a radix tree cache in place both memory subsystem
> >> initialization and memory hotplug run palpably faster on systems with a
> >> large number of memory blocks.
> >> 
> >> ...
> >> 
> >> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> >>    .offline = memory_subsys_offline,
> >> };
> >> 
> >> +/*
> >> + * Memory blocks are cached in a local radix tree to avoid
> >> + * a costly linear search for the corresponding device on
> >> + * the subsystem bus.
> >> + */
> >> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> > 
> > What protects this tree from racy accesses?
> 
> I think the device hotplug lock currently (except during boot where no races can happen).
> 

So this?

--- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
+++ a/drivers/base/memory.c
@@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
  * Memory blocks are cached in a local radix tree to avoid
  * a costly linear search for the corresponding device on
  * the subsystem bus.
+ *
+ * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
+ * single-threadedness at boot time.
  */
 static RADIX_TREE(memory_blocks, GFP_KERNEL);
 

But are we sure this is all true?


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 22:27       ` Andrew Morton
@ 2020-01-09 22:35         ` David Hildenbrand
  2020-01-10  9:32           ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-09 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Scott Cheloha, linux-kernel,
	Rafael J. Wysocki, Greg Kroah-Hartman, nathanl, ricklind, mhocko,
	Scott Cheloha

> Am 09.01.2020 um 23:28 schrieb Andrew Morton <akpm@linux-foundation.org>:
> 
> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
>> 
>> 
>>>> Am 09.01.2020 um 23:00 schrieb Andrew Morton <akpm@linux-foundation.org>:
>>> 
>>> On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
>>> 
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>> 
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree.  With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>> 
>>>> ...
>>>> 
>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>>>>   .offline = memory_subsys_offline,
>>>> };
>>>> 
>>>> +/*
>>>> + * Memory blocks are cached in a local radix tree to avoid
>>>> + * a costly linear search for the corresponding device on
>>>> + * the subsystem bus.
>>>> + */
>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>> 
>>> What protects this tree from racy accesses?
>> 
>> I think the device hotplug lock currently (except during boot where no races can happen).
>> 
> 
> So this?
> 
> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
> +++ a/drivers/base/memory.c
> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
>  * Memory blocks are cached in a local radix tree to avoid
>  * a costly linear search for the corresponding device on
>  * the subsystem bus.
> + *
> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
> + * single-threadness at boot time.
>  */
> static RADIX_TREE(memory_blocks, GFP_KERNEL);
> 
> 
> But are we sure this is all true?

I think the device hotplug lock, not the memory hotplug lock. Will double check later.
> 



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 22:35         ` David Hildenbrand
@ 2020-01-10  9:32           ` David Hildenbrand
  2020-01-10 11:31             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-10  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, nathanl, ricklind, mhocko, Scott Cheloha

On 09.01.20 23:35, David Hildenbrand wrote:
>> Am 09.01.2020 um 23:28 schrieb Andrew Morton <akpm@linux-foundation.org>:
>>
>> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <david@redhat.com> wrote:
>>
>>>
>>>
>>>>> Am 09.01.2020 um 23:00 schrieb Andrew Morton <akpm@linux-foundation.org>:
>>>>
>>>> On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
>>>>
>>>>> Searching for a particular memory block by id is an O(n) operation
>>>>> because each memory block's underlying device is kept in an unsorted
>>>>> linked list on the subsystem bus.
>>>>>
>>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>>> a radix tree.  With a radix tree cache in place both memory subsystem
>>>>> initialization and memory hotplug run palpably faster on systems with a
>>>>> large number of memory blocks.
>>>>>
>>>>> ...
>>>>>
>>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>>>>>   .offline = memory_subsys_offline,
>>>>> };
>>>>>
>>>>> +/*
>>>>> + * Memory blocks are cached in a local radix tree to avoid
>>>>> + * a costly linear search for the corresponding device on
>>>>> + * the subsystem bus.
>>>>> + */
>>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>>>
>>>> What protects this tree from racy accesses?
>>>
>>> I think the device hotplug lock currently (except during boot where no races can happen).
>>>
>>
>> So this?
>>
>> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
>> +++ a/drivers/base/memory.c
>> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
>>  * Memory blocks are cached in a local radix tree to avoid
>>  * a costly linear search for the corresponding device on
>>  * the subsystem bus.
>> + *
>> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
>> + * single-threadedness at boot time.
>>  */
>> static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>
>>
>> But are we sure this is all true?
> 
> I think the device hotplug lock, not the memory hotplug lock. Will double check later.

So all writers either hold the device_hotplug_lock or run during boot.
Documented e.g., for memory_dev_init(), create_memory_block_devices(),
remove_memory_block_devices().

The readers are mainly
- find_memory_block()
 -> called via online_pages()/offline_pages() where we hold the
    device_hotplug_lock
 -> called from arch/powerpc/platforms/pseries/hotplug-memory.c,
    where we always hold the device_hotplug_lock
- walk_memory_blocks()
 -> Callers in drivers/acpi/acpi_memhotplug.c either hold the
    device_hotplug_lock or are called early during boot
 -> Callers in mm/memory_hotplug.c either hold the
    device_hotplug_lock or are called early during boot
 -> link_mem_sections() is called either early during boot or via
    add_memory_resource() (whereby all callers either hold the
    device_hotplug_lock or are called early during boot)
 -> Callers in arch/powerpc/platforms/powernv/memtrace.c hold the
    device_hotplug_lock

So we are fine.
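
A minimal sketch of the expected calling convention under these rules
(the callback name and body are hypothetical, for illustration only):

/* Hypothetical callback: count online blocks in a physical range. */
static int count_online_block(struct memory_block *mem, void *arg)
{
	unsigned long *nr = arg;

	if (mem->state == MEM_ONLINE)
		(*nr)++;
	return 0;
}

static unsigned long count_online_blocks(unsigned long start,
					 unsigned long size)
{
	unsigned long nr = 0;

	lock_device_hotplug();	/* readers take the device hotplug lock */
	walk_memory_blocks(start, size, &nr, count_online_block);
	unlock_device_hotplug();
	return nr;
}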

I suggest we document that expected behavior via

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8c8dc081597e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -585,6 +585,8 @@ static struct memory_block
*find_memory_block_by_id(unsigned long block_id)
  * tree or something here.
  *
  * This could be made generic for all device subsystems.
+ *
+ * Called under device_hotplug_lock.
  */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
@@ -837,6 +839,8 @@ void __init memory_dev_init(void)
  *
  * In case func() returns an error, walking is aborted and the error is
  * returned.
+ *
+ * Called under device_hotplug_lock.
  */
 int walk_memory_blocks(unsigned long start, unsigned long size,
                       void *arg, walk_memory_blocks_func_t func)


Please note that the memory hotplug lock is not safe on the reader side.
But also not on the writer side after
https://lkml.kernel.org/r/157863061737.2230556.3959730620803366776.stgit@dwillia2-desk3.amr.corp.intel.com


-- 
Thanks,

David / dhildenb



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-10  9:32           ` David Hildenbrand
@ 2020-01-10 11:31             ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2020-01-10 11:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, nathanl, ricklind, Scott Cheloha

On Fri 10-01-20 10:32:01, David Hildenbrand wrote:
> On 09.01.20 23:35, David Hildenbrand wrote:
> >> Am 09.01.2020 um 23:28 schrieb Andrew Morton <akpm@linux-foundation.org>:
> >>
> >> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>
> >>>
> >>>>> Am 09.01.2020 um 23:00 schrieb Andrew Morton <akpm@linux-foundation.org>:
> >>>>
> >>>> On Thu,  9 Jan 2020 15:25:16 -0600 Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> >>>>
> >>>>> Searching for a particular memory block by id is an O(n) operation
> >>>>> because each memory block's underlying device is kept in an unsorted
> >>>>> linked list on the subsystem bus.
> >>>>>
> >>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> >>>>> a radix tree.  With a radix tree cache in place both memory subsystem
> >>>>> initialization and memory hotplug run palpably faster on systems with a
> >>>>> large number of memory blocks.
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> >>>>>   .offline = memory_subsys_offline,
> >>>>> };
> >>>>>
> >>>>> +/*
> >>>>> + * Memory blocks are cached in a local radix tree to avoid
> >>>>> + * a costly linear search for the corresponding device on
> >>>>> + * the subsystem bus.
> >>>>> + */
> >>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> >>>>
> >>>> What protects this tree from racy accesses?
> >>>
> >>> I think the device hotplug lock currently (except during boot where no races can happen).
> >>>
> >>
> >> So this?
> >>
> >> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
> >> +++ a/drivers/base/memory.c
> >> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
> >>  * Memory blocks are cached in a local radix tree to avoid
> >>  * a costly linear search for the corresponding device on
> >>  * the subsystem bus.
> >> + *
> >> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
> >> + * single-threadedness at boot time.
> >>  */
> >> static RADIX_TREE(memory_blocks, GFP_KERNEL);
> >>
> >>
> >> But are we sure this is all true?
> > 
> > I think the device hotplug lock, not the memory hotplug lock. Will double check later.
> 
> So all writers either hold the device_hotplug_lock or run during boot.
> Documented e.g., for memory_dev_init(), create_memory_block_devices(),
> remove_memory_block_devices().
> 
> The readers are mainly
> - find_memory_block()
>  -> called via online_pages()/offline_pages() where we hold the
>     device_hotplug_lock
>  -> called from arch/powerpc/platforms/pseries/hotplug-memory.c,
>     where we always hold the device_hotplug_lock
> - walk_memory_blocks()
>  -> Callers in drivers/acpi/acpi_memhotplug.c either hold the
>     device_hotplug_lock or are called early during boot
>  -> Callers in mm/memory_hotplug.c either hold the
>     device_hotplug_lock or are called early during boot
>  -> link_mem_sections() is called either early during boot or via
>     add_memory_resource() (whereby all callers either hold the
>     device_hotplug_lock or are called early during boot)
>  -> Callers in arch/powerpc/platforms/powernv/memtrace.c hold the
>     device_hotplug_lock
> 
> So we are fine.
> 
> I suggest we document that expected behavior via

Thanks for documenting this! Adding a comment as you suggest makes
sense. Overall the locking expectations would be similar to
subsys_find_device_by_id(), which doesn't use any internal locking, so
the radix tree, which follows the lifetime of those objects, should be
compatible with the current implementation (no new races, at least).

> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..8c8dc081597e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -585,6 +585,8 @@ static struct memory_block
> *find_memory_block_by_id(unsigned long block_id)
>   * tree or something here.
>   *
>   * This could be made generic for all device subsystems.
> + *
> + * Called under device_hotplug_lock.
>   */
>  struct memory_block *find_memory_block(struct mem_section *section)
>  {
> @@ -837,6 +839,8 @@ void __init memory_dev_init(void)
>   *
>   * In case func() returns an error, walking is aborted and the error is
>   * returned.
> + *
> + * Called under device_hotplug_lock.
>   */
>  int walk_memory_blocks(unsigned long start, unsigned long size,
>                        void *arg, walk_memory_blocks_func_t func)
> 
> 
> Please note that the memory hotplug lock is not safe on the reader side.
> But also not on the writer side after
> https://lkml.kernel.org/r/157863061737.2230556.3959730620803366776.stgit@dwillia2-desk3.amr.corp.intel.com
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-09 21:25 ` [PATCH v4] " Scott Cheloha
  2020-01-09 22:00   ` Andrew Morton
@ 2020-01-15 19:09   ` David Hildenbrand
  2020-01-16 15:22     ` Michal Hocko
  2020-01-21 23:10   ` [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray " Scott Cheloha
  2 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-15 19:09 UTC (permalink / raw)
  To: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton
  Cc: nathanl, ricklind, mhocko, Scott Cheloha, Donald Dutile

On 09.01.20 22:25, Scott Cheloha wrote:
> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
> 
> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> a radix tree.  With a radix tree cache in place both memory subsystem
> initialization and memory hotplug run palpably faster on systems with a
> large number of memory blocks.
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Soooo,

I just learned that radix trees are nowadays only a wrapper for xarray
(for quite a while already!), and that the xarray interface shall be
used in new code.

include/linux/radix-tree.h:

/* Keep unconverted code working */
#define radix_tree_root		xarray
[...]

Do we want to convert this code before sending it to Linus'? (resend
this patch, or a fixup on top)

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-15 19:09   ` David Hildenbrand
@ 2020-01-16 15:22     ` Michal Hocko
  2020-01-16 15:28       ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-01-16 15:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha, Donald Dutile

On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
> On 09.01.20 22:25, Scott Cheloha wrote:
> > Searching for a particular memory block by id is an O(n) operation
> > because each memory block's underlying device is kept in an unsorted
> > linked list on the subsystem bus.
> > 
> > We can cut the lookup cost to O(log n) if we cache the memory blocks in
> > a radix tree.  With a radix tree cache in place both memory subsystem
> > initialization and memory hotplug run palpably faster on systems with a
> > large number of memory blocks.
> > 
> > Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Soooo,
> 
> I just learned that radix trees are nowadays only a wrapper for xarray
> (for quite a while already!), and that the xarray interface shall be
> used in new code.

Good point. I somehow didn't realize this would add more work for a
later code refactoring. The mapping should be pretty straightforward.

Thanks for noticing!

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-16 15:22     ` Michal Hocko
@ 2020-01-16 15:28       ` David Hildenbrand
  2020-01-16 16:17         ` David Hildenbrand
  2020-01-16 17:17         ` Don Dutile
  0 siblings, 2 replies; 19+ messages in thread
From: David Hildenbrand @ 2020-01-16 15:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha, Donald Dutile

On 16.01.20 16:22, Michal Hocko wrote:
> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>> On 09.01.20 22:25, Scott Cheloha wrote:
>>> Searching for a particular memory block by id is an O(n) operation
>>> because each memory block's underlying device is kept in an unsorted
>>> linked list on the subsystem bus.
>>>
>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>> a radix tree.  With a radix tree cache in place both memory subsystem
>>> initialization and memory hotplug run palpably faster on systems with a
>>> large number of memory blocks.
>>>
>>> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>
>> Soooo,
>>
>> I just learned that radix trees are nowadays only a wrapper for xarray
>> (for quite a while already!), and that the xarray interface shall be
>> used in new code.
> 
> Good point. I somehow didn't realize this would add more work for a
> later code refactoring. The mapping should be pretty straightforward.

Yes it is. @Scott, care to send a fixup that does the mapping?

> 
> Thanks for noticing!

Don noticed it, so thanks to him :)


-- 
Thanks,

David / dhildenb



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-16 15:28       ` David Hildenbrand
@ 2020-01-16 16:17         ` David Hildenbrand
  2020-01-17  9:35           ` Michal Hocko
  2020-01-16 17:17         ` Don Dutile
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-16 16:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha, Donald Dutile

On 16.01.20 16:28, David Hildenbrand wrote:
> On 16.01.20 16:22, Michal Hocko wrote:
>> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>>> On 09.01.20 22:25, Scott Cheloha wrote:
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>>
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree.  With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>>
>>>> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>
>>> Soooo,
>>>
>>> I just learned that radix trees are nowadays only a wrapper for xarray
>>> (for quite a while already!), and that the xarray interface shall be
>>> used in new code.
>>
>> Good point. I somehow didn't realize this would add more work for a
>> later code refactoring. The mapping should be pretty straightforward.
> 
> Yes it is. @Scott, care to send a fixup that does the mapping?

Never having used an xarray, I gave it a quick shot. The following
should do the trick:


diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c6d288fad493..c75dec35de43 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,7 +19,7 @@
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
 
@@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
 };
 
 /*
- * Memory blocks are cached in a local radix tree to avoid
+ * Memory blocks are cached in a local xarray to avoid
  * a costly linear search for the corresponding device on
  * the subsystem bus.
  */
-static RADIX_TREE(memory_blocks, GFP_KERNEL);
+static DEFINE_XARRAY(memory_blocks);
 
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
@@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
        struct memory_block *mem;
 
-       mem = radix_tree_lookup(&memory_blocks, block_id);
+       mem = xa_load(&memory_blocks, block_id);
        if (mem)
                get_device(&mem->dev);
        return mem;
@@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
                put_device(&memory->dev);
                return ret;
        }
-       ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+       ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+                             GFP_KERNEL));
        if (ret) {
                put_device(&memory->dev);
                device_unregister(&memory->dev);
@@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
        if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
                return;
 
-       WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+       WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
 
        /* drop the ref. we got via find_memory_block() */
        put_device(&memory->dev);


-- 
Thanks,

David / dhildenb



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-16 15:28       ` David Hildenbrand
  2020-01-16 16:17         ` David Hildenbrand
@ 2020-01-16 17:17         ` Don Dutile
  1 sibling, 0 replies; 19+ messages in thread
From: Don Dutile @ 2020-01-16 17:17 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha

On 1/16/20 10:28 AM, David Hildenbrand wrote:
> On 16.01.20 16:22, Michal Hocko wrote:
>> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>>> On 09.01.20 22:25, Scott Cheloha wrote:
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>>
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree.  With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>>
>>>> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>
>>> Soooo,
>>>
>>> I just learned that radix trees are nowadays only a wrapper for xarray
>>> (for quite a while already!), and that the xarray interface shall be
>>> used in new code.
>>
>> Good point. I somehow didn't realize this would add more work for a
>> later code refactoring. The mapping should be pretty straightforward.
> 
> Yes it is. @Scott, care to send a fixup that does the mapping?
> 
>>
>> Thanks for noticing!
> 
> Don noticed it, so thanks to him :)
> 
Backporting XArray to RHEL-8 while continuing to support the radix-tree API in RHEL-8 (due to RHEL rules) was a challenge; it wasn't stable/bug-free everywhere, etc.  Hence my 'sensitivity' to seeing new radix-tree code upstream.

> 



* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-16 16:17         ` David Hildenbrand
@ 2020-01-17  9:35           ` Michal Hocko
  2020-01-20  9:15             ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2020-01-17  9:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha, Donald Dutile

On Thu 16-01-20 17:17:54, David Hildenbrand wrote:
[...]
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c6d288fad493..c75dec35de43 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -19,7 +19,7 @@
>  #include <linux/memory.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm.h>
> -#include <linux/radix-tree.h>
> +#include <linux/xarray.h>
>  #include <linux/stat.h>
>  #include <linux/slab.h>
>  
> @@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
>  };
>  
>  /*
> - * Memory blocks are cached in a local radix tree to avoid
> + * Memory blocks are cached in a local xarray to avoid
>   * a costly linear search for the corresponding device on
>   * the subsystem bus.
>   */
> -static RADIX_TREE(memory_blocks, GFP_KERNEL);
> +static DEFINE_XARRAY(memory_blocks);
>  
>  static BLOCKING_NOTIFIER_HEAD(memory_chain);
>  
> @@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>  {
>         struct memory_block *mem;
>  
> -       mem = radix_tree_lookup(&memory_blocks, block_id);
> +       mem = xa_load(&memory_blocks, block_id);
>         if (mem)
>                 get_device(&mem->dev);
>         return mem;
> @@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
>                 put_device(&memory->dev);
>                 return ret;
>         }
> -       ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
> +       ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
> +                             GFP_KERNEL));
>         if (ret) {
>                 put_device(&memory->dev);
>                 device_unregister(&memory->dev);
> @@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
>         if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>                 return;
>  
> -       WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
> +       WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
>  
>         /* drop the ref. we got via find_memory_block() */
>         put_device(&memory->dev);

OK, this looks sensible. xa_store shouldn't ever replace an existing
device as we do the lookup beforehand, so good. We might need to
reorganize the code if we want to drop the lookup, though.
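
One way to drop that lookup later (just a sketch, not something posted
in this thread): let the xarray detect duplicates itself by using
xa_insert(), which returns -EBUSY if an entry already exists at the
index and -ENOMEM on allocation failure, e.g. in register_memory():

	/* ... device fields set up as today ... */
	ret = device_register(&memory->dev);
	if (ret) {
		put_device(&memory->dev);
		return ret;
	}
	/*
	 * Unlike xa_store(), xa_insert() refuses to overwrite an
	 * existing entry, so no prior find_memory_block_by_id() is
	 * needed to guard against duplicate block ids.
	 */
	ret = xa_insert(&memory_blocks, memory->dev.id, memory, GFP_KERNEL);
	if (ret) {
		put_device(&memory->dev);
		device_unregister(&memory->dev);
	}
	return ret;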

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-17  9:35           ` Michal Hocko
@ 2020-01-20  9:15             ` David Hildenbrand
  2020-01-21 12:30               ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2020-01-20  9:15 UTC (permalink / raw)
  To: Michal Hocko, Scott Cheloha
  Cc: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	Andrew Morton, nathanl, ricklind, Scott Cheloha, Donald Dutile

On 17.01.20 10:35, Michal Hocko wrote:
> On Thu 16-01-20 17:17:54, David Hildenbrand wrote:
> [...]
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index c6d288fad493..c75dec35de43 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -19,7 +19,7 @@
>>  #include <linux/memory.h>
>>  #include <linux/memory_hotplug.h>
>>  #include <linux/mm.h>
>> -#include <linux/radix-tree.h>
>> +#include <linux/xarray.h>
>>  #include <linux/stat.h>
>>  #include <linux/slab.h>
>>  
>> @@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
>>  };
>>  
>>  /*
>> - * Memory blocks are cached in a local radix tree to avoid
>> + * Memory blocks are cached in a local xarray to avoid
>>   * a costly linear search for the corresponding device on
>>   * the subsystem bus.
>>   */
>> -static RADIX_TREE(memory_blocks, GFP_KERNEL);
>> +static DEFINE_XARRAY(memory_blocks);
>>  
>>  static BLOCKING_NOTIFIER_HEAD(memory_chain);
>>  
>> @@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>  {
>>         struct memory_block *mem;
>>  
>> -       mem = radix_tree_lookup(&memory_blocks, block_id);
>> +       mem = xa_load(&memory_blocks, block_id);
>>         if (mem)
>>                 get_device(&mem->dev);
>>         return mem;
>> @@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
>>                 put_device(&memory->dev);
>>                 return ret;
>>         }
>> -       ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
>> +       ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
>> +                             GFP_KERNEL));
>>         if (ret) {
>>                 put_device(&memory->dev);
>>                 device_unregister(&memory->dev);
>> @@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
>>         if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>>                 return;
>>  
>> -       WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
>> +       WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
>>  
>>         /* drop the ref. we got via find_memory_block() */
>>         put_device(&memory->dev);
> 
> OK, this looks sensible. xa_store shouldn't ever replace an existing
> device as we do the lookup beforehand, so good. We might need to
> reorganize the code if we want to drop the lookup, though.

Ping Scott. Will you resend a cleanup like this as a proper patch or
shall I?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
  2020-01-20  9:15             ` David Hildenbrand
@ 2020-01-21 12:30               ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2020-01-21 12:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton, nathanl, ricklind,
	Scott Cheloha, Donald Dutile

On Mon 20-01-20 10:15:32, David Hildenbrand wrote:
[...]
> Ping Scott. Will you resend a cleanup like this as a proper patch or
> shall I?

Maybe just fold it into the original patch which is already sitting in
mmotm?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  2020-01-09 21:25 ` [PATCH v4] " Scott Cheloha
  2020-01-09 22:00   ` Andrew Morton
  2020-01-15 19:09   ` David Hildenbrand
@ 2020-01-21 23:10   ` Scott Cheloha
  2020-01-22 10:43     ` David Hildenbrand
  2 siblings, 1 reply; 19+ messages in thread
From: Scott Cheloha @ 2020-01-21 23:10 UTC (permalink / raw)
  To: linux-kernel, Rafael J. Wysocki, Greg Kroah-Hartman,
	David Hildenbrand, Andrew Morton
  Cc: nathanl, ricklind, mhocko, Scott Cheloha

From: Scott Cheloha <cheloha@linux.vnet.ibm.com>

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block
in an xarray.  This time complexity improvement is significant on
systems with many memory blocks.  For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
   change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
   completes ~12ms faster.

Before:
[    0.005042] memory_dev_init: adding memory blocks
[    0.021591] memory_dev_init: added memory blocks
[    0.022699] walk_memory_blocks: walking memory blocks
[    0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[    0.005057] memory_dev_init: adding memory blocks
[    0.009415] memory_dev_init: added memory blocks
[    0.010519] walk_memory_blocks: walking memory blocks
[    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
   this change memory_dev_init() completes ~88ms faster and
   walk_memory_blocks() completes ~87ms faster.

Before:
[    0.252246] memory_dev_init: adding memory blocks
[    0.395469] memory_dev_init: added memory blocks
[    0.409413] walk_memory_blocks: walking memory blocks
[    0.433028] walk_memory_blocks: walked memory blocks 0-511
[    0.433094] walk_memory_blocks: walking memory blocks
[    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[    0.245063] memory_dev_init: adding memory blocks
[    0.299539] memory_dev_init: added memory blocks
[    0.313609] walk_memory_blocks: walking memory blocks
[    0.315287] walk_memory_blocks: walked memory blocks 0-511
[    0.315349] walk_memory_blocks: walking memory blocks
[    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
   this change we complete memory_dev_init() ~37 minutes faster and
   walk_memory_blocks() at least ~30 minutes faster.  The exact timing
   for walk_memory_blocks() is missing, though I observed that the
   soft lockups in walk_memory_blocks() disappeared with the change,
   suggesting that lower bound.

Before:
[   13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[   13.696898] memory_dev_init: adding blocks
[   15.660035] memory_dev_init: added all blocks
(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks.  A sparse xarray has a small footprint, and for the
smallest numbers of memory blocks an O(log n) lookup is at worst
negligibly slower than the O(n) search it replaces.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
   change memory_dev_init() completes ~300us faster and walk_memory_blocks()
   completes in about the same time.  The improvement is pretty close to noise.

Before:
[    0.224752] memory_dev_init: adding memory blocks
[    0.227116] memory_dev_init: added memory blocks
[    0.227183] walk_memory_blocks: walking memory blocks
[    0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[    0.224911] memory_dev_init: adding memory blocks
[    0.226935] memory_dev_init: added memory blocks
[    0.227089] walk_memory_blocks: walking memory blocks
[    0.227089] walk_memory_blocks: walked memory blocks 0-131

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
v2 incorporates suggestions from David Hildenbrand.

v3 changes:
  - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

  - Be conservative: don't use radix_tree_for_each_slot() in
    walk_memory_blocks() yet.  It introduces RCU which could
    change behavior.  Walking the tree "by hand" with
    find_memory_block_by_id() is slower but keeps the patch
    simple.  (A sketch of what such a conversion could look
    like follows after this changelog.)

v4 changes:
  - Rewrite commit message to explicitly note the time
    complexity improvements.

  - Provide anecdotal accounts of time-savings in changelog

v5 changes:
  - Switch from the radix_tree API to the xarray API to conform
    to current kernel preferences.

  - Move the time savings accounts into the commit message itself.
    Remeasure performance changes on the machines I had available.

    It should be noted that I was not able to get time on the 32TB
    machine to remeasure the improvements for v5.  The quoted traces
    are from v4 of the patch.  However, the xarray API is used to
    implement the radix_tree API, so I expect the performance changes
    will be identical.

    I did test v5 of the patch on the other machines mentioned in the
    commit message to ensure there were no regressions.
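
Not part of this patch, just for illustration: below is a sketch of
what the deferred walk_memory_blocks() conversion mentioned under the
v3 changes could look like with xa_for_each_start().  The function and
helper names match drivers/base/memory.c as of this series; the loop
itself is only a sketch and still leaves the RCU/locking question from
the v3 note open.

int walk_memory_blocks(unsigned long start, unsigned long size,
		       void *arg, walk_memory_blocks_func_t func)
{
	const unsigned long start_block_id = phys_to_block_id(start);
	const unsigned long end_block_id = phys_to_block_id(start + size - 1);
	struct memory_block *mem;
	unsigned long block_id;
	int ret = 0;

	if (!size)
		return 0;

	/*
	 * xa_for_each_start() visits only populated entries, so sparse
	 * ranges are skipped instead of probing every possible block id.
	 * It relies on memory hotplug locking to keep the entries stable
	 * while func() runs.
	 */
	xa_for_each_start(&memory_blocks, block_id, mem, start_block_id) {
		if (block_id > end_block_id)
			break;
		get_device(&mem->dev);
		ret = func(mem, arg);
		put_device(&mem->dev);
		if (ret)
			break;
	}
	return ret;
}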

 drivers/base/memory.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..2178d3e6d063 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -21,6 +21,7 @@
 #include <linux/mm.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
+#include <linux/xarray.h>
 
 #include <linux/atomic.h>
 #include <linux/uaccess.h>
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local xarray to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static DEFINE_XARRAY(memory_blocks);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
 /* A reference for the returned memory block device is acquired. */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = xa_load(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
 	unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,16 @@ int register_memory(struct memory_block *memory)
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+			      GFP_KERNEL));
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -688,6 +697,8 @@ static void unregister_memory(struct memory_block *memory)
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  2020-01-21 23:10   ` [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray " Scott Cheloha
@ 2020-01-22 10:43     ` David Hildenbrand
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2020-01-22 10:43 UTC (permalink / raw)
  To: Scott Cheloha, linux-kernel, Rafael J. Wysocki,
	Greg Kroah-Hartman, Andrew Morton
  Cc: nathanl, ricklind, mhocko, Scott Cheloha

On 22.01.20 00:10, Scott Cheloha wrote:
> From: Scott Cheloha <cheloha@linux.vnet.ibm.com>
> 
> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
> 
> We can cut the lookup cost to O(log n) if we cache each memory block
> in an xarray.  This time complexity improvement is significant on
> systems with many memory blocks.  For example:
> 
> 1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
>    change  memory_dev_init() completes ~12ms faster and walk_memory_blocks()
>    completes ~12ms faster.
> 
> Before:
> [    0.005042] memory_dev_init: adding memory blocks
> [    0.021591] memory_dev_init: added memory blocks
> [    0.022699] walk_memory_blocks: walking memory blocks
> [    0.038730] walk_memory_blocks: walked memory blocks 0-511
> 
> After:
> [    0.005057] memory_dev_init: adding memory blocks
> [    0.009415] memory_dev_init: added memory blocks
> [    0.010519] walk_memory_blocks: walking memory blocks
> [    0.014135] walk_memory_blocks: walked memory blocks 0-511
> 
> 2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
>    this change memory_dev_init() completes ~88ms faster and
>    walk_memory_blocks() completes ~87ms faster.
> 
> Before:
> [    0.252246] memory_dev_init: adding memory blocks
> [    0.395469] memory_dev_init: added memory blocks
> [    0.409413] walk_memory_blocks: walking memory blocks
> [    0.433028] walk_memory_blocks: walked memory blocks 0-511
> [    0.433094] walk_memory_blocks: walking memory blocks
> [    0.500244] walk_memory_blocks: walked memory blocks 131072-131583
> 
> After:
> [    0.245063] memory_dev_init: adding memory blocks
> [    0.299539] memory_dev_init: added memory blocks
> [    0.313609] walk_memory_blocks: walking memory blocks
> [    0.315287] walk_memory_blocks: walked memory blocks 0-511
> [    0.315349] walk_memory_blocks: walking memory blocks
> [    0.316988] walk_memory_blocks: walked memory blocks 131072-131583
> 
> 3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
>    this change we complete memory_dev_init() ~37 minutes faster and
>    walk_memory_blocks() at least ~30 minutes faster.  The exact timing
>    for walk_memory_blocks() is  missing, though I observed that the
>    soft lockups in walk_memory_blocks() disappeared with the change,
>    suggesting that lower bound.
> 
> Before:
> [   13.703907] memory_dev_init: adding blocks
> [ 2287.406099] memory_dev_init: added all blocks
> [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> 
> After:
> [   13.696898] memory_dev_init: adding blocks
> [   15.660035] memory_dev_init: added all blocks
> (the walk_memory_blocks traces disappear)
> 
> There should be no significant negative impact for machines with few
> memory blocks.  A sparse xarray has a small footprint and an O(log n)
> lookup is negligibly slower than an O(n) lookup for only the smallest
> number of memory blocks.
> 
> 1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
>    change memory_dev_init() completes ~300us faster and walk_memory_blocks()
>    completes no faster or slower.  The improvement is pretty close to noise.
> 
> Before:
> [    0.224752] memory_dev_init: adding memory blocks
> [    0.227116] memory_dev_init: added memory blocks
> [    0.227183] walk_memory_blocks: walking memory blocks
> [    0.227183] walk_memory_blocks: walked memory blocks 0-131
> 
> After:
> [    0.224911] memory_dev_init: adding memory blocks
> [    0.226935] memory_dev_init: added memory blocks
> [    0.227089] walk_memory_blocks: walking memory blocks
> [    0.227089] walk_memory_blocks: walked memory blocks 0-131
> 
> Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> v2 incorporates suggestions from David Hildenbrand.
> 
> v3 changes:
>   - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"
> 
>   - Be conservative: don't use radix_tree_for_each_slot() in
>     walk_memory_blocks() yet.  It introduces RCU which could
>     change behavior.  Walking the tree "by hand" with
>     find_memory_block_by_id() is slower but keeps the patch
>     simple.
> 
> v4 changes:
>   - Rewrite commit message to explicitly note the time
>     complexity improvements.
> 
>   - Provide anecdotal accounts of time-savings in changelog
> 
> v5 changes:
>   - Switch from the radix_tree API to the xarray API to conform
>     to current kernel preferences.
> 
>   - Move the time savings accounts into the commit message itself.
>     Remeasure performance changes on the machines I had available.
> 
>     It should be noted that I was not able to get time on the 32TB
>     machine to remeasure the improvements for v5.  The quoted traces
>     are from v4 of the patch.  However, the xarray API is used to
>     implement the radix_tree API, so I expect the performance changes
>     will be identical.
> 
>     I did test v5 of the patch on the other machines mentioned in the
>     commit message to ensure there were no regressions.
> 
>  drivers/base/memory.c | 37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..2178d3e6d063 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -21,6 +21,7 @@
>  #include <linux/mm.h>
>  #include <linux/stat.h>
>  #include <linux/slab.h>
> +#include <linux/xarray.h>
>  
>  #include <linux/atomic.h>
>  #include <linux/uaccess.h>
> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>  	.offline = memory_subsys_offline,
>  };
>  
> +/*
> + * Memory blocks are cached in a local xarray to avoid
> + * a costly linear search for the corresponding device on
> + * the subsystem bus.
> + */
> +static DEFINE_XARRAY(memory_blocks);
> +
>  static BLOCKING_NOTIFIER_HEAD(memory_chain);
>  
>  int register_memory_notifier(struct notifier_block *nb)
> @@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>  /* A reference for the returned memory block device is acquired. */
>  static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>  {
> -	struct device *dev;
> +	struct memory_block *mem;
>  
> -	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
> -	return dev ? to_memory_block(dev) : NULL;
> +	mem = xa_load(&memory_blocks, block_id);
> +	if (mem)
> +		get_device(&mem->dev);
> +	return mem;
>  }
>  
> -/*
> - * For now, we have a linear search to go find the appropriate
> - * memory_block corresponding to a particular phys_index. If
> - * this gets to be a real problem, we can always use a radix
> - * tree or something here.
> - *
> - * This could be made generic for all device subsystems.
> - */
>  struct memory_block *find_memory_block(struct mem_section *section)
>  {
>  	unsigned long block_id = base_memory_block_id(__section_nr(section));
> @@ -628,9 +630,16 @@ int register_memory(struct memory_block *memory)
>  	memory->dev.offline = memory->state == MEM_OFFLINE;
>  
>  	ret = device_register(&memory->dev);
> -	if (ret)
> +	if (ret) {
>  		put_device(&memory->dev);
> -
> +		return ret;
> +	}
> +	ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
> +			      GFP_KERNEL));
> +	if (ret) {
> +		put_device(&memory->dev);
> +		device_unregister(&memory->dev);
> +	}
>  	return ret;
>  }
>  
> @@ -688,6 +697,8 @@ static void unregister_memory(struct memory_block *memory)
>  	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>  		return;
>  
> +	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
> +
>  	/* drop the ref. we got via find_memory_block() */
>  	put_device(&memory->dev);
>  	device_unregister(&memory->dev);
> 

I think only the device_hotplug_lock documentation fixup from me is
still missing. So this, as a replacement for the original patch, looks
good to me!

Thanks Scott!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-01-22 10:43 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191217193238-1-cheloha@linux.vnet.ibm.com>
2020-01-09 21:19 ` [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup Scott Cheloha
2020-01-09 21:30   ` Michal Hocko
2020-01-09 21:25 ` [PATCH v4] " Scott Cheloha
2020-01-09 22:00   ` Andrew Morton
2020-01-09 22:17     ` David Hildenbrand
2020-01-09 22:27       ` Andrew Morton
2020-01-09 22:35         ` David Hildenbrand
2020-01-10  9:32           ` David Hildenbrand
2020-01-10 11:31             ` Michal Hocko
2020-01-15 19:09   ` David Hildenbrand
2020-01-16 15:22     ` Michal Hocko
2020-01-16 15:28       ` David Hildenbrand
2020-01-16 16:17         ` David Hildenbrand
2020-01-17  9:35           ` Michal Hocko
2020-01-20  9:15             ` David Hildenbrand
2020-01-21 12:30               ` Michal Hocko
2020-01-16 17:17         ` Don Dutile
2020-01-21 23:10   ` [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray " Scott Cheloha
2020-01-22 10:43     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).