linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Scott Cheloha <cheloha@linux.vnet.ibm.com>
To: linux-kernel@vger.kernel.org,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: nathanl@linux.ibm.com, ricklind@linux.vnet.ibm.com,
	mhocko@suse.com, Scott Cheloha <cheloha@linux.ibm.com>
Subject: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup
Date: Thu,  9 Jan 2020 15:25:16 -0600	[thread overview]
Message-ID: <20200109212516.17849-1-cheloha@linux.vnet.ibm.com> (raw)
In-Reply-To: <20191217193238-1-cheloha@linux.vnet.ibm.com>

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache the memory blocks in
a radix tree.  With a radix tree cache in place both memory subsystem
initialization and memory hotplug run palpably faster on systems with a
large number of memory blocks.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
(Sorry for the duplicate mail, forgot to mark this as v4)

v2 incorporates suggestions from David Hildenbrand.

v3 changes:
  - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

  - Be conservative: don't use radix_tree_for_each_slot() in
    walk_memory_blocks() yet.  It introduces RCU which could
    change behavior.  Walking the tree "by hand" with
    find_memory_block_by_id() is slower but keeps the patch
    simple.

v4 changes:
  - Rewrite commit message to explicitly note the time
    complexity improvements.

  - Provide anecdotal accounts of time-savings in the changelog
    (see below).

mhocko@suse.com has asked for additional details on time
savings, so here are some results I've collected when measuring
memory_dev_init() with/without the patch.

1. A 32GB POWER9 VM with 16MB memblocks has 2048 blocks:

# Unpatched
[    0.005121] adding memory block 0... ok
[...]
[    0.095230] adding memory block 1024... ok
[...]
[    0.304248] adding memory block 2047... ok
[    0.304508] added all memory blocks

# Patched
[    0.004701] adding memory block 0... ok
[...]
[    0.033383] adding memory block 1024... ok
[...]
[    0.061387] adding memory block 2047... ok
[    0.061414] added all memory blocks

   Unpatched, memory_dev_init() runs in about 0.299 seconds.  Patched,
   it runs in about 0.057 seconds.  Savings of .242 seconds, or nearly
   a quarter of a second.

2. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks:

# Unpatched
[   13.703907] memory_dev_init: adding blocks
[   13.703931] memory_dev_init: added block 0
[   13.762678] memory_dev_init: added block 1024
[   13.910359] memory_dev_init: added block 2048
[   14.146941] memory_dev_init: added block 3072
[...]
[  218.516235] memory_dev_init: added block 57344
[  229.310467] memory_dev_init: added block 58368
[  240.590857] memory_dev_init: added block 59392
[  252.351665] memory_dev_init: added block 60416
[...]
[ 2152.023248] memory_dev_init: added block 128000
[ 2196.464430] memory_dev_init: added block 129024
[ 2241.746515] memory_dev_init: added block 130048
[ 2287.406099] memory_dev_init: added all blocks

# Patched
[   13.696898] memory_dev_init: adding blocks
[   13.696920] memory_dev_init: added block 0
[   13.710966] memory_dev_init: added block 1024
[   13.724865] memory_dev_init: added block 2048
[   13.738802] memory_dev_init: added block 3072
[...]
[   14.520999] memory_dev_init: added block 57344
[   14.536355] memory_dev_init: added block 58368
[   14.551747] memory_dev_init: added block 59392
[   14.567128] memory_dev_init: added block 60416
[...]
[   15.595638] memory_dev_init: added block 126976
[   15.611761] memory_dev_init: added block 128000
[   15.627889] memory_dev_init: added block 129024
[   15.644048] memory_dev_init: added block 130048
[   15.660035] memory_dev_init: added all blocks

   Unpatched, memory_dev_init() runs in about 2275 seconds,
   or ~37 minutes.  Patched, memory_dev_init() runs in about
   1.97 seconds.  Savings of ~37 minutes.

   I did not actually measure walk_memory_blocks(), but during
   boot on this machine without the patch I got the following
   (abbreviated) traces:

[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

   Those traces disappeared with the patch, so I'm pretty sure
   this patch shaves ~30 minutes off of walk_memory_blocks()
   at boot.

Given the above results I think it is safe to say that this patch will
dramatically improve boot times on large POWER systems.

 drivers/base/memory.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8902930d5ef2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,6 +19,7 @@
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
 
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static RADIX_TREE(memory_blocks, GFP_KERNEL);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
 /* A reference for the returned memory block device is acquired. */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = radix_tree_lookup(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
 	unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,15 @@ int register_memory(struct memory_block *memory)
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -688,6 +696,8 @@ static void unregister_memory(struct memory_block *memory)
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
-- 
2.24.1


  parent reply	other threads:[~2020-01-09 21:25 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20191217193238-1-cheloha@linux.vnet.ibm.com>
2020-01-09 21:19 ` [PATCH] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup Scott Cheloha
2020-01-09 21:30   ` Michal Hocko
2020-01-09 21:25 ` Scott Cheloha [this message]
2020-01-09 22:00   ` [PATCH v4] " Andrew Morton
2020-01-09 22:17     ` David Hildenbrand
2020-01-09 22:27       ` Andrew Morton
2020-01-09 22:35         ` David Hildenbrand
2020-01-10  9:32           ` David Hildenbrand
2020-01-10 11:31             ` Michal Hocko
2020-01-15 19:09   ` David Hildenbrand
2020-01-16 15:22     ` Michal Hocko
2020-01-16 15:28       ` David Hildenbrand
2020-01-16 16:17         ` David Hildenbrand
2020-01-17  9:35           ` Michal Hocko
2020-01-20  9:15             ` David Hildenbrand
2020-01-21 12:30               ` Michal Hocko
2020-01-16 17:17         ` Don Dutile
2020-01-21 23:10   ` [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray " Scott Cheloha
2020-01-22 10:43     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200109212516.17849-1-cheloha@linux.vnet.ibm.com \
    --to=cheloha@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=cheloha@linux.ibm.com \
    --cc=david@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=nathanl@linux.ibm.com \
    --cc=rafael@kernel.org \
    --cc=ricklind@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).