* [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
@ 2010-09-27 19:09 Nathan Fontenot
  2010-09-27 19:21 ` [PATCH 1/8] v2 Move find_memory_block() routine Nathan Fontenot
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

This set of patches removes the assumption that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems with large amounts of memory
(1+ TB) there are performance issues related to creating the large
number of sysfs directories.  For a powerpc machine with 1 TB of
memory we are creating 63,000+ directories.  This results in boot
times of around 45-50 minutes for systems with 1 TB of memory and
8 hours for systems with 2 TB of memory.  With this patch set applied
I am now seeing boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation.  Every time a
directory is created, a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list, so each new
directory takes progressively longer to create as the number of
existing directories grows.
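
To illustrate, the duplicate check behaves roughly like the standalone
sketch below (simplified, not the actual fs/sysfs code).  With the
63,000+ directories above that works out to roughly two billion string
compares over the course of a boot.

#include <string.h>

struct dirent_node {
	const char *name;
	struct dirent_node *sibling;	/* unsorted singly-linked sibling list */
};

/* Every directory creation walks all existing siblings looking for a
 * duplicate name, so creating N directories costs O(N^2) compares. */
static int name_exists(struct dirent_node *children, const char *name)
{
	struct dirent_node *d;

	for (d = children; d; d = d->sibling)
		if (strcmp(d->name, name) == 0)
			return 1;
	return 0;
}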

The approach taken by this patch set is to allow a single directory
in sysfs to span multiple memory sections.  This is controlled by an
optional architecture-defined function memory_block_size_bytes().
The default definition of this routine returns a memory block size
equal to the memory section size, which maintains the current layout
of sysfs memory directories; the view presented to userspace remains
the same as it is today.

For architectures that define their own version of this routine,
as is done for powerpc in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections.  The number of sections spanned would
depend on the value reported by memory_block_size_bytes().
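
As a worked example (the numbers here are assumptions for illustration:
a 16 MB section size and a 256 MB memory block size), each memoryXXX
directory would then cover 16 sections:

/* assumed values for illustration only */
#define SECTION_SIZE_BITS	24			/* 16 MB sections */
#define BLOCK_SIZE_BYTES	(256UL << 20)		/* size an arch might report */

static unsigned long sections_per_block =
	BLOCK_SIZE_BYTES / (1UL << SECTION_SIZE_BITS);	/* = 16 */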

In both cases a new file 'end_phys_index' is created in each
memoryXXX directory.  This file will contain the physical id
of the last memory section covered by the sysfs directory.  For
the default case, the value in 'end_phys_index' will be the same
as in the existing 'phys_index' file.

This version of the patch set includes an update to properly
report block_size_bytes, phys_index, and end_phys_index.  Additionally,
the patch that adds the end_phys_index sysfs file is now patch 5/8
instead of being patch 2/8 as in the previous version of the patches.

-Nathan Fontenot


* [PATCH 1/8] v2 Move find_memory_block() routine
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
@ 2010-09-27 19:21 ` Nathan Fontenot
  2010-09-27 19:22 ` [PATCH 2/8] v2 Add section count to memory_block struct Nathan Fontenot
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Move the find_memory_block() routine up to avoid needing a forward
declaration in subsequent patches.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |   62 +++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-21 11:59:24.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-21 12:32:45.000000000 -0500
@@ -435,6 +435,37 @@ int __weak arch_get_memory_phys_device(u
 	return 0;
 }
 
+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+	struct kobject *kobj;
+	struct sys_device *sysdev;
+	struct memory_block *mem;
+	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+	/*
+	 * This only works because we know that section == sysdev->id
+	 * slightly redundant with sysdev_register()
+	 */
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+	if (!kobj)
+		return NULL;
+
+	sysdev = container_of(kobj, struct sys_device, kobj);
+	mem = container_of(sysdev, struct memory_block, sysdev);
+
+	return mem;
+}
+
 static int add_memory_block(int nid, struct mem_section *section,
 			unsigned long state, enum mem_add_context context)
 {
@@ -468,37 +499,6 @@ static int add_memory_block(int nid, str
 	return ret;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
-	struct kobject *kobj;
-	struct sys_device *sysdev;
-	struct memory_block *mem;
-	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
-	/*
-	 * This only works because we know that section == sysdev->id
-	 * slightly redundant with sysdev_register()
-	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
-	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
-	if (!kobj)
-		return NULL;
-
-	sysdev = container_of(kobj, struct sys_device, kobj);
-	mem = container_of(sysdev, struct memory_block, sysdev);
-
-	return mem;
-}
-
 int remove_memory_block(unsigned long node_id, struct mem_section *section,
 		int phys_device)
 {


* [PATCH 2/8] v2 Add section count to memory_block struct
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
  2010-09-27 19:21 ` [PATCH 1/8] v2 Move find_memory_block() routine Nathan Fontenot
@ 2010-09-27 19:22 ` Nathan Fontenot
  2010-09-28  9:31   ` Robin Holt
  2010-09-27 19:23 ` [PATCH 3/8] v2 Add mutex for adding/removing memory blocks Nathan Fontenot
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Add a section count property to the memory_block struct to track the
number of memory sections that have been added to or removed from a
memory block.  This
allows us to know when the last memory section of a memory block has been
removed so we can remove the memory block.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c  |   16 ++++++++++------
 include/linux/memory.h |    3 +++
 2 files changed, 13 insertions(+), 6 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:17:20.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 09:31:35.000000000 -0500
@@ -478,6 +478,7 @@
 
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
+	atomic_inc(&mem->section_count);
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -505,12 +506,15 @@
 	struct memory_block *mem;
 
 	mem = find_memory_block(section);
-	unregister_mem_sect_under_nodes(mem);
-	mem_remove_simple_file(mem, phys_index);
-	mem_remove_simple_file(mem, state);
-	mem_remove_simple_file(mem, phys_device);
-	mem_remove_simple_file(mem, removable);
-	unregister_memory(mem, section);
+
+	if (atomic_dec_and_test(&mem->section_count)) {
+		unregister_mem_sect_under_nodes(mem);
+		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, state);
+		mem_remove_simple_file(mem, phys_device);
+		mem_remove_simple_file(mem, removable);
+		unregister_memory(mem, section);
+	}
 
 	return 0;
 }
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h	2010-09-27 09:17:20.000000000 -0500
+++ linux-next/include/linux/memory.h	2010-09-27 09:22:56.000000000 -0500
@@ -19,10 +19,13 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/mutex.h>
+#include <asm/atomic.h>
 
 struct memory_block {
 	unsigned long phys_index;
 	unsigned long state;
+	atomic_t section_count;
+
 	/*
 	 * This serializes all state change requests.  It isn't
 	 * held during creation because the control files are


* [PATCH 3/8] v2 Add mutex for adding/removing memory blocks
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
  2010-09-27 19:21 ` [PATCH 1/8] v2 Move find_memory_block() routine Nathan Fontenot
  2010-09-27 19:22 ` [PATCH 2/8] v2 Add section count to memory_block struct Nathan Fontenot
@ 2010-09-27 19:23 ` Nathan Fontenot
  2010-09-27 19:25 ` [PATCH 4/8] v2 Allow memory block to span multiple memory sections Nathan Fontenot
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Add a new mutex for use in adding and removing of memory blocks.  This
is needed to avoid any race conditions in which the same memory block could
be added and removed at the same time.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:31:35.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 09:31:57.000000000 -0500
@@ -27,6 +27,8 @@
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
 
+static DEFINE_MUTEX(mem_sysfs_mutex);
+
 #define MEMORY_CLASS_NAME	"memory"
 
 static struct sysdev_class memory_sysdev_class = {
@@ -476,6 +478,8 @@
 	if (!mem)
 		return -ENOMEM;
 
+	mutex_lock(&mem_sysfs_mutex);
+
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
 	atomic_inc(&mem->section_count);
@@ -497,6 +501,7 @@
 			ret = register_mem_sect_under_node(mem, nid);
 	}
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
 }
 
@@ -505,6 +510,7 @@
 {
 	struct memory_block *mem;
 
+	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
 
 	if (atomic_dec_and_test(&mem->section_count)) {
@@ -516,6 +522,7 @@
 		unregister_memory(mem, section);
 	}
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
 }


* [PATCH 4/8] v2 Allow memory block to span multiple memory sections
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (2 preceding siblings ...)
  2010-09-27 19:23 ` [PATCH 3/8] v2 Add mutex for adding/removing memory blocks Nathan Fontenot
@ 2010-09-27 19:25 ` Nathan Fontenot
  2010-09-27 23:55   ` Dave Hansen
  2010-09-28 12:48   ` Robin Holt
  2010-09-27 19:26 ` [PATCH 5/8] v2 Add end_phys_index file Nathan Fontenot
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Update the memory sysfs code so that each sysfs memory directory is
now considered a memory block that can span multiple memory sections.
The default size of each memory block is 1 << SECTION_SIZE_BITS bytes
(a single memory section) to maintain the current behavior of one
sysfs directory per memory section.

For architectures that want memory blocks to span multiple memory
sections, they need only define their own memory_block_size_bytes()
routine.
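
For example (illustrative only, 256 MB is an arbitrary value; the
powerpc/pseries version in patch 7/8 reads the real block size from
the device tree), an override could be as small as:

/* hypothetical arch override reporting 256 MB memory blocks */
u32 memory_block_size_bytes(void)
{
	return 256 << 20;
}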

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |  155 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 108 insertions(+), 47 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:31:57.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 13:50:18.000000000 -0500
@@ -30,6 +30,14 @@
 static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME	"memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return section_nr / sections_per_block;
+}
 
 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -84,28 +92,47 @@
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = __section_nr(section);
+	memory->sysdev.id = memory->phys_index / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
 }
 
 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
 	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-	BUG_ON(memory->sysdev.id != __section_nr(section));
 
 	/* drop the ref. we got in remove_memory_block() */
 	kobject_put(&memory->sysdev.kobj);
 	sysdev_unregister(&memory->sysdev);
 }
 
+u32 __weak memory_block_size_bytes(void)
+{
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static u32 get_memory_block_size(void)
+{
+	u32 block_sz;
+
+	block_sz = memory_block_size_bytes();
+
+	/* Validate blk_sz is a power of 2 and not less than section size */
+	if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+		WARN_ON(1);
+		block_sz = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return block_sz;
+}
+
 /*
  * use this as the physical section index that this memsection
  * uses.
@@ -116,7 +143,7 @@
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 }
 
 /*
@@ -125,13 +152,16 @@
 static ssize_t show_mem_removable(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
-	unsigned long start_pfn;
-	int ret;
+	unsigned long i, pfn;
+	int ret = 1;
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
 
-	start_pfn = section_nr_to_pfn(mem->phys_index);
-	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	for (i = 0; i < sections_per_block; i++) {
+		pfn = section_nr_to_pfn(mem->phys_index + i);
+		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+	}
+
 	return sprintf(buf, "%d\n", ret);
 }
 
@@ -184,17 +214,14 @@
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
 {
 	int i;
-	unsigned long psection;
 	unsigned long start_pfn, start_paddr;
 	struct page *first_page;
 	int ret;
-	int old_state = mem->state;
 
-	psection = mem->phys_index;
-	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
 	/*
 	 * The probe routines leave the pages reserved, just
@@ -207,8 +234,8 @@
 				continue;
 
 			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online? \n",
-				psection, i);
+				"not reserved, was it already online?\n",
+				phys_index, i);
 			return -EBUSY;
 		}
 	}
@@ -219,18 +246,13 @@
 			ret = online_pages(start_pfn, PAGES_PER_SECTION);
 			break;
 		case MEM_OFFLINE:
-			mem->state = MEM_GOING_OFFLINE;
 			start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
 			ret = remove_memory(start_paddr,
 					    PAGES_PER_SECTION << PAGE_SHIFT);
-			if (ret) {
-				mem->state = old_state;
-				break;
-			}
 			break;
 		default:
-			WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
-					__func__, mem, action, action);
+			WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
+			     "%ld\n", __func__, phys_index, action, action);
 			ret = -EINVAL;
 	}
 
@@ -240,7 +262,8 @@
 static int memory_block_change_state(struct memory_block *mem,
 		unsigned long to_state, unsigned long from_state_req)
 {
-	int ret = 0;
+	int i, ret = 0;
+
 	mutex_lock(&mem->state_mutex);
 
 	if (mem->state != from_state_req) {
@@ -248,8 +271,22 @@
 		goto out;
 	}
 
-	ret = memory_block_action(mem, to_state);
-	if (!ret)
+	if (to_state == MEM_OFFLINE)
+		mem->state = MEM_GOING_OFFLINE;
+
+	for (i = 0; i < sections_per_block; i++) {
+		ret = memory_section_action(mem->phys_index + i, to_state);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (i = 0; i < sections_per_block; i++)
+			memory_section_action(mem->phys_index + i,
+					      from_state_req);
+
+		mem->state = from_state_req;
+	} else
 		mem->state = to_state;
 
 out:
@@ -262,20 +299,15 @@
 		struct sysdev_attribute *attr, const char *buf, size_t count)
 {
 	struct memory_block *mem;
-	unsigned int phys_section_nr;
 	int ret = -EINVAL;
 
 	mem = container_of(dev, struct memory_block, sysdev);
-	phys_section_nr = mem->phys_index;
-
-	if (!present_section_nr(phys_section_nr))
-		goto out;
 
 	if (!strncmp(buf, "online", min((int)count, 6)))
 		ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 	else if(!strncmp(buf, "offline", min((int)count, 7)))
 		ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
-out:
+
 	if (ret)
 		return ret;
 	return count;
@@ -315,7 +347,7 @@
 print_block_size(struct sysdev_class *class, struct sysdev_class_attribute *attr,
 		 char *buf)
 {
-	return sprintf(buf, "%lx\n", (unsigned long)PAGES_PER_SECTION * PAGE_SIZE);
+	return sprintf(buf, "%x\n", get_memory_block_size());
 }
 
 static SYSDEV_CLASS_ATTR(block_size_bytes, 0444, print_block_size, NULL);
@@ -451,12 +483,13 @@
 	struct sys_device *sysdev;
 	struct memory_block *mem;
 	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+	int block_id = base_memory_block_id(__section_nr(section));
 
 	/*
 	 * This only works because we know that section == sysdev->id
 	 * slightly redundant with sysdev_register()
 	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);
 
 	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
 	if (!kobj)
@@ -468,26 +501,27 @@
 	return mem;
 }
 
-static int add_memory_block(int nid, struct mem_section *section,
-			unsigned long state, enum mem_add_context context)
+static int init_memory_block(struct memory_block **memory,
+			     struct mem_section *section, unsigned long state)
 {
-	struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	struct memory_block *mem;
 	unsigned long start_pfn;
+	int scn_nr;
 	int ret = 0;
 
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
 
-	mutex_lock(&mem_sysfs_mutex);
-
-	mem->phys_index = __section_nr(section);
+	scn_nr = __section_nr(section);
+	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
 	mem->state = state;
 	atomic_inc(&mem->section_count);
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
-	ret = register_memory(mem, section);
+	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
@@ -496,8 +530,29 @@
 		ret = mem_create_simple_file(mem, phys_device);
 	if (!ret)
 		ret = mem_create_simple_file(mem, removable);
+
+	*memory = mem;
+	return ret;
+}
+
+static int add_memory_section(int nid, struct mem_section *section,
+			unsigned long state, enum mem_add_context context)
+{
+	struct memory_block *mem;
+	int ret = 0;
+
+	mutex_lock(&mem_sysfs_mutex);
+
+	mem = find_memory_block(section);
+	if (mem) {
+		atomic_inc(&mem->section_count);
+		kobject_put(&mem->sysdev.kobj);
+	} else
+		ret = init_memory_block(&mem, section, state);
+
 	if (!ret) {
-		if (context == HOTPLUG)
+		if (context == HOTPLUG &&
+		    atomic_read(&mem->section_count) == sections_per_block)
 			ret = register_mem_sect_under_node(mem, nid);
 	}
 
@@ -519,8 +574,10 @@
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
-		unregister_memory(mem, section);
-	}
+		unregister_memory(mem);
+		kfree(mem);
+	} else
+		kobject_put(&mem->sysdev.kobj);
 
 	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
@@ -532,7 +589,7 @@
  */
 int register_new_memory(int nid, struct mem_section *section)
 {
-	return add_memory_block(nid, section, MEM_OFFLINE, HOTPLUG);
+	return add_memory_section(nid, section, MEM_OFFLINE, HOTPLUG);
 }
 
 int unregister_memory_section(struct mem_section *section)
@@ -551,12 +608,16 @@
 	unsigned int i;
 	int ret;
 	int err;
+	int block_sz;
 
 	memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
 	ret = sysdev_class_register(&memory_sysdev_class);
 	if (ret)
 		goto out;
 
+	block_sz = get_memory_block_size();
+	sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+
 	/*
 	 * Create entries for memory sections that were found
 	 * during boot and have been initialized
@@ -564,8 +625,8 @@
 	for (i = 0; i < NR_MEM_SECTIONS; i++) {
 		if (!present_section_nr(i))
 			continue;
-		err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
-				       BOOT);
+		err = add_memory_section(0, __nr_to_section(i), MEM_ONLINE,
+					 BOOT);
 		if (!ret)
 			ret = err;
 	}


* [PATCH 5/8] v2 Add end_phys_index file
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (3 preceding siblings ...)
  2010-09-27 19:25 ` [PATCH 4/8] v2 Allow memory block to span multiple memory sections Nathan Fontenot
@ 2010-09-27 19:26 ` Nathan Fontenot
  2010-09-27 19:27 ` [PATCH 6/8] v2 Update node sysfs code Nathan Fontenot
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Update the 'phys_index' property of a memory block by renaming the
memory_block struct field to 'start_phys_index', which holds the same
value as the current 'phys_index' property.  The attribute still
appears as 'phys_index' in sysfs; only the struct field name changes
to indicate start and end values.  This also adds an 'end_phys_index'
property to report the id of the last section in the memory block.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c  |   39 ++++++++++++++++++++++++++++++---------
 include/linux/memory.h |    3 ++-
 2 files changed, 32 insertions(+), 10 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-27 13:50:18.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 13:50:38.000000000 -0500
@@ -97,7 +97,7 @@
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = memory->phys_index / sections_per_block;
+	memory->sysdev.id = memory->start_phys_index / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
@@ -138,12 +138,26 @@
  * uses.
  */
 
-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
+	unsigned long phys_index;
+
+	phys_index = mem->start_phys_index / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct memory_block *mem =
+		container_of(dev, struct memory_block, sysdev);
+	unsigned long phys_index;
+
+	phys_index = mem->end_phys_index / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
 }
 
 /*
@@ -158,7 +172,7 @@
 		container_of(dev, struct memory_block, sysdev);
 
 	for (i = 0; i < sections_per_block; i++) {
-		pfn = section_nr_to_pfn(mem->phys_index + i);
+		pfn = section_nr_to_pfn(mem->start_phys_index + i);
 		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
 	}
 
@@ -275,14 +289,15 @@
 		mem->state = MEM_GOING_OFFLINE;
 
 	for (i = 0; i < sections_per_block; i++) {
-		ret = memory_section_action(mem->phys_index + i, to_state);
+		ret = memory_section_action(mem->start_phys_index + i,
+					    to_state);
 		if (ret)
 			break;
 	}
 
 	if (ret) {
 		for (i = 0; i < sections_per_block; i++)
-			memory_section_action(mem->phys_index + i,
+			memory_section_action(mem->start_phys_index + i,
 					      from_state_req);
 
 		mem->state = from_state_req;
@@ -330,7 +345,8 @@
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -514,17 +530,21 @@
 		return -ENOMEM;
 
 	scn_nr = __section_nr(section);
-	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
+	mem->start_phys_index =
+			base_memory_block_id(scn_nr) * sections_per_block;
+	mem->end_phys_index = mem->start_phys_index + sections_per_block - 1;
 	mem->state = state;
 	atomic_inc(&mem->section_count);
 	mutex_init(&mem->state_mutex);
-	start_pfn = section_nr_to_pfn(mem->phys_index);
+	start_pfn = section_nr_to_pfn(mem->start_phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
 	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
+		ret = mem_create_simple_file(mem, end_phys_index);
+	if (!ret)
 		ret = mem_create_simple_file(mem, state);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_device);
@@ -571,6 +591,7 @@
 	if (atomic_dec_and_test(&mem->section_count)) {
 		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h	2010-09-27 13:49:37.000000000 -0500
+++ linux-next/include/linux/memory.h	2010-09-27 13:50:38.000000000 -0500
@@ -22,7 +22,8 @@
 #include <asm/atomic.h>
 
 struct memory_block {
-	unsigned long phys_index;
+	unsigned long start_phys_index;
+	unsigned long end_phys_index;
 	unsigned long state;
 	atomic_t section_count;


* [PATCH 6/8] v2 Update node sysfs code
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (4 preceding siblings ...)
  2010-09-27 19:26 ` [PATCH 5/8] v2 Add end_phys_index file Nathan Fontenot
@ 2010-09-27 19:27 ` Nathan Fontenot
  2010-09-28  9:29   ` Robin Holt
  2010-09-27 19:28 ` [PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries Nathan Fontenot
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Update the node sysfs code to be aware of the new capability for a memory
block to contain multiple memory sections.  This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |    2 +-
 drivers/base/node.c   |   12 ++++++++----
 include/linux/node.h  |    6 ++++--
 3 files changed, 13 insertions(+), 7 deletions(-)

Index: linux-next/drivers/base/node.c
===================================================================
--- linux-next.orig/drivers/base/node.c	2010-09-27 13:49:36.000000000 -0500
+++ linux-next/drivers/base/node.c	2010-09-27 13:50:43.000000000 -0500
@@ -346,8 +346,10 @@
 		return -EFAULT;
 	if (!node_online(nid))
 		return 0;
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int page_nid;
 
@@ -371,7 +373,8 @@
 }
 
 /* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+				    unsigned long phys_index)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -383,7 +386,8 @@
 	if (!unlinked_nodes)
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+
+	sect_start_pfn = section_nr_to_pfn(phys_index);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-27 13:50:38.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 13:50:43.000000000 -0500
@@ -587,9 +587,9 @@
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
+	unregister_mem_sect_under_nodes(mem, __section_nr(section));
 
 	if (atomic_dec_and_test(&mem->section_count)) {
-		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
 		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
Index: linux-next/include/linux/node.h
===================================================================
--- linux-next.orig/include/linux/node.h	2010-09-27 13:49:36.000000000 -0500
+++ linux-next/include/linux/node.h	2010-09-27 13:50:43.000000000 -0500
@@ -44,7 +44,8 @@
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+					   unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -72,7 +73,8 @@
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+						  unsigned long phys_index)
 {
 	return 0;
 }


* [PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (5 preceding siblings ...)
  2010-09-27 19:27 ` [PATCH 6/8] v2 Update node sysfs code Nathan Fontenot
@ 2010-09-27 19:28 ` Nathan Fontenot
  2010-09-27 19:28 ` [PATCH 8/8] v2 Update memory hotplug documentation Nathan Fontenot
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:28 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Define a version of memory_block_size_bytes() for powerpc/pseries such that
a memory block spans an entire lmb.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/platforms/pseries/hotplug-memory.c |   66 +++++++++++++++++++-----
 1 file changed, 53 insertions(+), 13 deletions(-)

Index: linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-next.orig/arch/powerpc/platforms/pseries/hotplug-memory.c	2010-09-27 13:49:34.000000000 -0500
+++ linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c	2010-09-27 13:50:45.000000000 -0500
@@ -17,6 +17,54 @@
 #include <asm/pSeries_reconfig.h>
 #include <asm/sparsemem.h>
 
+static u32 get_memblock_size(void)
+{
+	struct device_node *np;
+	unsigned int memblock_size = 0;
+
+	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+	if (np) {
+		const unsigned long *size;
+
+		size = of_get_property(np, "ibm,lmb-size", NULL);
+		memblock_size = size ? *size : 0;
+
+		of_node_put(np);
+	} else {
+		unsigned int memzero_size = 0;
+		const unsigned int *regs;
+
+		np = of_find_node_by_path("/memory@0");
+		if (np) {
+			regs = of_get_property(np, "reg", NULL);
+			memzero_size = regs ? regs[3] : 0;
+			of_node_put(np);
+		}
+
+		if (memzero_size) {
+			/* We now know the size of memory@0, use this to find
+			 * the first memoryblock and get its size.
+			 */
+			char buf[64];
+
+			sprintf(buf, "/memory@%x", memzero_size);
+			np = of_find_node_by_path(buf);
+			if (np) {
+				regs = of_get_property(np, "reg", NULL);
+				memblock_size = regs ? regs[3] : 0;
+				of_node_put(np);
+			}
+		}
+	}
+
+	return memblock_size;
+}
+
+u32 memory_block_size_bytes(void)
+{
+	return get_memblock_size();
+}
+
 static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
 {
 	unsigned long start, start_pfn;
@@ -127,30 +175,22 @@
 
 static int pseries_drconf_memory(unsigned long *base, unsigned int action)
 {
-	struct device_node *np;
-	const unsigned long *lmb_size;
+	unsigned long memblock_size;
 	int rc;
 
-	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
-	if (!np)
+	memblock_size = get_memblock_size();
+	if (!memblock_size)
 		return -EINVAL;
 
-	lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
-	if (!lmb_size) {
-		of_node_put(np);
-		return -EINVAL;
-	}
-
 	if (action == PSERIES_DRCONF_MEM_ADD) {
-		rc = memblock_add(*base, *lmb_size);
+		rc = memblock_add(*base, memblock_size);
 		rc = (rc < 0) ? -EINVAL : 0;
 	} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
-		rc = pseries_remove_memblock(*base, *lmb_size);
+		rc = pseries_remove_memblock(*base, memblock_size);
 	} else {
 		rc = -EINVAL;
 	}
 
-	of_node_put(np);
 	return rc;
 }


* [PATCH 8/8] v2 Update memory hotplug documentation
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (6 preceding siblings ...)
  2010-09-27 19:28 ` [PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries Nathan Fontenot
@ 2010-09-27 19:28 ` Nathan Fontenot
  2010-09-28 12:45   ` Avi Kivity
  2010-09-28 12:38 ` [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Robin Holt
  2010-09-28 12:44 ` Avi Kivity
  9 siblings, 1 reply; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-27 19:28 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, KAMEZAWA Hiroyuki, Dave Hansen

Update the memory hotplug documentation to reflect the new behavior of
memory blocks as presented in sysfs.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 Documentation/memory-hotplug.txt |   46 +++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 16 deletions(-)

Index: linux-next/Documentation/memory-hotplug.txt
===================================================================
--- linux-next.orig/Documentation/memory-hotplug.txt	2010-09-27 13:49:33.000000000 -0500
+++ linux-next/Documentation/memory-hotplug.txt	2010-09-27 13:50:48.000000000 -0500
@@ -126,36 +126,50 @@
 --------------------------------
 4 sysfs files for memory hotplug
 --------------------------------
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs.  Each section is part of
+a memory block under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
 
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block.  The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section id's
+for the memory block covered by the sysfs directory.  It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
 
 For example, assume 1GiB section size. A device for a memory starting at
 0x100000000 is /sys/device/system/memory/memory4
 (0x100000000 / 1Gib = 4)
 This device covers address range [0x100000000 ... 0x140000000)
 
-Under each section, you can see 4 files.
+Under each section, you can see 5 files.
 
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index' : read-only and contains section id, same as XXX.
-'state'      : read-write
-               at read:  contains online/offline state of memory.
-               at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
-               This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-               whether the memory section is removable or not
-               removable.  A value of 1 indicates that the memory
-               section is removable and a value of 0 indicates that
-               it is not removable.
+'phys_index'      : read-only and contains section id of the first section
+		    in the memory block, same as XXX.
+'end_phys_index'  : read-only and contains section id of the last section
+		    in the memory block.
+'state'           : read-write
+                    at read:  contains online/offline state of memory.
+                    at write: user can specify "online", "offline" command
+                    which will be performed on all sections in the block.
+'phys_device'     : read-only: designed to show the name of physical memory
+                    device.  This is not well implemented now.
+'removable'       : read-only: contains an integer value indicating
+                    whether the memory block is removable or not
+                    removable.  A value of 1 indicates that the memory
+                    block is removable and a value of 0 indicates that
+                    it is not removable. A memory block is removable only if
+                    every section in the block is removable.
 
 NOTE:
   These directories/files appear after physical memory hotplug phase.


* Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections
  2010-09-27 19:25 ` [PATCH 4/8] v2 Allow memory block to span multiple memory sections Nathan Fontenot
@ 2010-09-27 23:55   ` Dave Hansen
  2010-09-28 18:06     ` Nathan Fontenot
  2010-09-28 12:48   ` Robin Holt
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2010-09-27 23:55 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-mm, Greg KH, linux-kernel, KAMEZAWA Hiroyuki, linuxppc-dev

On Mon, 2010-09-27 at 14:25 -0500, Nathan Fontenot wrote:
> +static inline int base_memory_block_id(int section_nr)
> +{
> +       return section_nr / sections_per_block;
> +}
...
> -       mutex_lock(&mem_sysfs_mutex);
> -
> -       mem->phys_index = __section_nr(section);
> +       scn_nr = __section_nr(section);
> +       mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; 

I'm really regretting giving this variable such a horrid name.  I suck.

I think this is correct now:

	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
	mem->phys_index = section_nr / sections_per_block * sections_per_block;
	mem->phys_index = section_nr

Since it gets exported to userspace this way:

> +static ssize_t show_mem_start_phys_index(struct sys_device *dev,
>                         struct sysdev_attribute *attr, char *buf)
>  {
>         struct memory_block *mem =
>                 container_of(dev, struct memory_block, sysdev);
> -       return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
> +       unsigned long phys_index;
> +
> +       phys_index = mem->start_phys_index / sections_per_block;
> +       return sprintf(buf, "%08lx\n", phys_index);
> +}

The only other thing I'd say is that we need to put phys_index out of
its misery and call it what it is now: a section number.  I think it's
OK to call them "start/end_section_nr", at least inside the kernel.  I
intentionally used "phys_index" terminology in sysfs so that we _could_
eventually do this stuff and break the relationship between sections and
the sysfs dirs, but I think keeping the terminology around inside the
kernel is confusing now.

-- Dave


* Re: [PATCH 6/8] v2 Update node sysfs code
  2010-09-27 19:27 ` [PATCH 6/8] v2 Update node sysfs code Nathan Fontenot
@ 2010-09-28  9:29   ` Robin Holt
  2010-09-28 15:21     ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Robin Holt @ 2010-09-28  9:29 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

This patch may work, but it appears to be lacking, at least in the
link_mem_sections() function.  Assuming you have a memory block covering
2GB and a section size of 128MB (some values we are toying with for
large SGI machines), you end up calling register_mem_sect_under_node()
16 times which then takes the same steps.

I think you also need:

Index: linux-2.6.32/drivers/base/node.c
===================================================================
--- linux-2.6.32.orig/drivers/base/node.c	2010-09-28 04:18:53.848448349 -0500
+++ linux-2.6.32/drivers/base/node.c	2010-09-28 04:21:35.169446261 -0500
@@ -342,6 +342,7 @@
 		if (!err)
 			err = ret;
 
+		pfn = section_nr_to_pfn(mem_blk->end_phys_index);
 		/* discard ref obtained in find_memory_block() */
 		kobject_put(&mem_blk->sysdev.kobj);
 	}

Also, I don't think I much care for the weirdness that occurs if a
memory block spans two nodes.  I have not thought through how possible
(or likely) this is, but the code certainly permits it.  If that were
the case, how would we know which sections need to be taken offline, etc?
I wonder how much this will muddy up the information available in sysfs.

Thanks,
Robin

On Mon, Sep 27, 2010 at 02:27:09PM -0500, Nathan Fontenot wrote:
> Update the node sysfs code to be aware of the new capability for a memory
> block to contain multiple memory sections.  This requires an additional
> parameter to unregister_mem_sect_under_nodes so that we know which memory
> section of the memory block to unregister.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
> 
> ---
>  drivers/base/memory.c |    2 +-
>  drivers/base/node.c   |   12 ++++++++----
>  include/linux/node.h  |    6 ++++--
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> Index: linux-next/drivers/base/node.c
> ===================================================================
> --- linux-next.orig/drivers/base/node.c	2010-09-27 13:49:36.000000000 -0500
> +++ linux-next/drivers/base/node.c	2010-09-27 13:50:43.000000000 -0500
> @@ -346,8 +346,10 @@
>  		return -EFAULT;
>  	if (!node_online(nid))
>  		return 0;
> -	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>  		int page_nid;
>  
> @@ -371,7 +373,8 @@
>  }
>  
>  /* unregister memory section under all nodes that it spans */
> -int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
> +int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> +				    unsigned long phys_index)
>  {
>  	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
>  	unsigned long pfn, sect_start_pfn, sect_end_pfn;
> @@ -383,7 +386,8 @@
>  	if (!unlinked_nodes)
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
> -	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> +
> +	sect_start_pfn = section_nr_to_pfn(phys_index);
>  	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>  		int nid;
> Index: linux-next/drivers/base/memory.c
> ===================================================================
> --- linux-next.orig/drivers/base/memory.c	2010-09-27 13:50:38.000000000 -0500
> +++ linux-next/drivers/base/memory.c	2010-09-27 13:50:43.000000000 -0500
> @@ -587,9 +587,9 @@
>  
>  	mutex_lock(&mem_sysfs_mutex);
>  	mem = find_memory_block(section);
> +	unregister_mem_sect_under_nodes(mem, __section_nr(section));
>  
>  	if (atomic_dec_and_test(&mem->section_count)) {
> -		unregister_mem_sect_under_nodes(mem);
>  		mem_remove_simple_file(mem, phys_index);
>  		mem_remove_simple_file(mem, end_phys_index);
>  		mem_remove_simple_file(mem, state);
> Index: linux-next/include/linux/node.h
> ===================================================================
> --- linux-next.orig/include/linux/node.h	2010-09-27 13:49:36.000000000 -0500
> +++ linux-next/include/linux/node.h	2010-09-27 13:50:43.000000000 -0500
> @@ -44,7 +44,8 @@
>  extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  						int nid);
> -extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
> +extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> +					   unsigned long phys_index);
>  
>  #ifdef CONFIG_HUGETLBFS
>  extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
> @@ -72,7 +73,8 @@
>  {
>  	return 0;
>  }
> -static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
> +static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> +						  unsigned long phys_index)
>  {
>  	return 0;
>  }
> 
> 


* Re: [PATCH 2/8] v2 Add section count to memory_block struct
  2010-09-27 19:22 ` [PATCH 2/8] v2 Add section count to memory_block struct Nathan Fontenot
@ 2010-09-28  9:31   ` Robin Holt
  2010-09-28 18:14     ` Nathan Fontenot
  0 siblings, 1 reply; 33+ messages in thread
From: Robin Holt @ 2010-09-28  9:31 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

In the next patch, you introduce a mutex for adding/removing memory blocks.
Is there really a need for this to be atomic?  If you reorder the patches
so the mutex comes first, would the atomic be needed any longer?

Robin

On Mon, Sep 27, 2010 at 02:22:24PM -0500, Nathan Fontenot wrote:
> Add a section count property to the memory_block struct to track the number
> of memory sections that have been added/removed from a memory block. This
> allows us to know when the last memory section of a memory block has been
> removed so we can remove the memory block.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
> 
> ---
>  drivers/base/memory.c  |   16 ++++++++++------
>  include/linux/memory.h |    3 +++
>  2 files changed, 13 insertions(+), 6 deletions(-)
> 
> Index: linux-next/drivers/base/memory.c
> ===================================================================
> --- linux-next.orig/drivers/base/memory.c	2010-09-27 09:17:20.000000000 -0500
> +++ linux-next/drivers/base/memory.c	2010-09-27 09:31:35.000000000 -0500
> @@ -478,6 +478,7 @@
>  
>  	mem->phys_index = __section_nr(section);
>  	mem->state = state;
> +	atomic_inc(&mem->section_count);
>  	mutex_init(&mem->state_mutex);
>  	start_pfn = section_nr_to_pfn(mem->phys_index);
>  	mem->phys_device = arch_get_memory_phys_device(start_pfn);
> @@ -505,12 +506,15 @@
>  	struct memory_block *mem;
>  
>  	mem = find_memory_block(section);
> -	unregister_mem_sect_under_nodes(mem);
> -	mem_remove_simple_file(mem, phys_index);
> -	mem_remove_simple_file(mem, state);
> -	mem_remove_simple_file(mem, phys_device);
> -	mem_remove_simple_file(mem, removable);
> -	unregister_memory(mem, section);
> +
> +	if (atomic_dec_and_test(&mem->section_count)) {
> +		unregister_mem_sect_under_nodes(mem);
> +		mem_remove_simple_file(mem, phys_index);
> +		mem_remove_simple_file(mem, state);
> +		mem_remove_simple_file(mem, phys_device);
> +		mem_remove_simple_file(mem, removable);
> +		unregister_memory(mem, section);
> +	}
>  
>  	return 0;
>  }
> Index: linux-next/include/linux/memory.h
> ===================================================================
> --- linux-next.orig/include/linux/memory.h	2010-09-27 09:17:20.000000000 -0500
> +++ linux-next/include/linux/memory.h	2010-09-27 09:22:56.000000000 -0500
> @@ -19,10 +19,13 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/mutex.h>
> +#include <asm/atomic.h>
>  
>  struct memory_block {
>  	unsigned long phys_index;
>  	unsigned long state;
> +	atomic_t section_count;
> +
>  	/*
>  	 * This serializes all state change requests.  It isn't
>  	 * held during creation because the control files are
> 
> 


* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (7 preceding siblings ...)
  2010-09-27 19:28 ` [PATCH 8/8] v2 Update memory hotplug documentation Nathan Fontenot
@ 2010-09-28 12:38 ` Robin Holt
  2010-09-28 18:17   ` Nathan Fontenot
  2010-09-28 12:44 ` Avi Kivity
  9 siblings, 1 reply; 33+ messages in thread
From: Robin Holt @ 2010-09-28 12:38 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

I was tasked with looking at a slowdown in similarly sized SGI machines
booting x86_64.  Jack Steiner had already looked into memory_dev_init().
I was looking at link_mem_sections().

I made a dramatic improvement on a 16TB machine in that function by
merely caching the most recent memory section and checking to see if
the next memory section happens to be the subsequent in the linked list
of kobjects.

That simple cache reduced the time for link_mem_sections from 1 hour 27
minutes down to 46 seconds.
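
The idea, as a standalone sketch (the names below are made up, and
locking/refcounting are ignored for brevity), is to try the entry right
after the previous hit before falling back to the full linear scan:

#include <string.h>

struct obj {
	const char *name;
	struct obj *next;	/* entries sit in the list in creation order */
};

static struct obj *cached;	/* entry found for the previous section */

static struct obj *find_obj(struct obj *head, const char *name)
{
	struct obj *o;

	/* Sections are walked in order, so the entry following the last
	 * hit is almost always the one wanted; test it first. */
	if (cached && cached->next && strcmp(cached->next->name, name) == 0)
		return cached = cached->next;

	for (o = head; o; o = o->next)	/* full scan as a fallback */
		if (strcmp(o->name, name) == 0)
			return cached = o;
	return NULL;
}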

I would like to propose we implement something along those lines also,
but I am currently swamped.  I can probably get you a patch tomorrow
afternoon that applies at the end of this set.

Thanks,
Robin

On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
> This set of patches decouples the concept that a single memory
> section corresponds to a single directory in 
> /sys/devices/system/memory/.  On systems
> with large amounts of memory (1+ TB) there are perfomance issues
> related to creating the large number of sysfs directories.  For
> a powerpc machine with 1 TB of memory we are creating 63,000+
> directories.  This is resulting in boot times of around 45-50
> minutes for systems with 1 TB of memory and 8 hours for systems
> with 2 TB of memory.  With this patch set applied I am now seeing
> boot times of 5 minutes or less.
> 
> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against all sibling
> directories to ensure we do not create duplicates.  The list of
> directory nodes in sysfs is kept as an unsorted list which results
> in this being an exponentially longer operation as the number of
> directories are created.
> 
> The solution solved by this patch set is to allow a single
> directory in sysfs to span multiple memory sections.  This is
> controlled by an optional architecturally defined function
> memory_block_size_bytes().  The default definition of this
> routine returns a memory block size equal to the memory section
> size. This maintains the current layout of sysfs memory
> directories as it appears to userspace to remain the same as it
> is today.
> 
> For architectures that define their own version of this routine,
> as is done for powerpc in this patchset, the view in userspace
> would change such that each memoryXXX directory would span
> multiple memory sections.  The number of sections spanned would
> depend on the value reported by memory_block_size_bytes.
> 
> In both cases a new file 'end_phys_index' is created in each
> memoryXXX directory.  This file will contain the physical id
> of the last memory section covered by the sysfs directory.  For
> the default case, the value in 'end_phys_index' will be the same
> as in the existing 'phys_index' file.
> 
> This version of the patch set includes an update to to properly
> report block_size_bytes, phys_index, and end_phys_index.  Additionally,
> the patch that adds the end_phys_index sysfs file is now patch 5/8
> instead of being patch 2/8 as in the previous version of the patches.
> 
> -Nathan Fontenot


* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
                   ` (8 preceding siblings ...)
  2010-09-28 12:38 ` [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Robin Holt
@ 2010-09-28 12:44 ` Avi Kivity
  2010-09-28 15:12   ` Robin Holt
  2010-09-28 15:17   ` Dave Hansen
  9 siblings, 2 replies; 33+ messages in thread
From: Avi Kivity @ 2010-09-28 12:44 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> This set of patches decouples the concept that a single memory
> section corresponds to a single directory in
> /sys/devices/system/memory/.  On systems
> with large amounts of memory (1+ TB) there are perfomance issues
> related to creating the large number of sysfs directories.  For
> a powerpc machine with 1 TB of memory we are creating 63,000+
> directories.  This is resulting in boot times of around 45-50
> minutes for systems with 1 TB of memory and 8 hours for systems
> with 2 TB of memory.  With this patch set applied I am now seeing
> boot times of 5 minutes or less.
>
> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against all sibling
> directories to ensure we do not create duplicates.  The list of
> directory nodes in sysfs is kept as an unsorted list which results
> in this being an exponentially longer operation as the number of
> directories are created.
>
> The solution solved by this patch set is to allow a single
> directory in sysfs to span multiple memory sections.  This is
> controlled by an optional architecturally defined function
> memory_block_size_bytes().  The default definition of this
> routine returns a memory block size equal to the memory section
> size. This maintains the current layout of sysfs memory
> directories as it appears to userspace to remain the same as it
> is today.
>

Why not update sysfs directory creation to be fast, for example by using 
an rbtree instead of a linked list.  This fixes an implementation 
problem in the kernel instead of working around it and creating a new ABI.

New ABIs mean old tools won't work, and new tools need to understand 
both ABIs.

-- 
error compiling committee.c: too many arguments to function
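
A minimal sketch of the rbtree approach Avi suggests, written against the kernel's generic <linux/rbtree.h> API; the structure and function names below are illustrative and this is not the actual sysfs code:

#include <linux/rbtree.h>
#include <linux/string.h>
#include <linux/types.h>

struct sd_name_node {
	struct rb_node	rb;
	const char	*name;
};

/*
 * Insert 'new' into a name-keyed rbtree, or return false if a sibling
 * with the same name already exists.  The per-creation duplicate check
 * becomes an O(log n) lookup instead of an O(n) walk of all siblings.
 */
static bool sd_name_insert(struct rb_root *root, struct sd_name_node *new)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct sd_name_node *cur =
			rb_entry(*p, struct sd_name_node, rb);
		int cmp = strcmp(new->name, cur->name);

		parent = *p;
		if (cmp < 0)
			p = &(*p)->rb_left;
		else if (cmp > 0)
			p = &(*p)->rb_right;
		else
			return false;		/* duplicate name */
	}

	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
	return true;
}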

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 8/8] v2 Update memory hotplug documentation
  2010-09-27 19:28 ` [PATCH 8/8] v2 Update memory hotplug documentation Nathan Fontenot
@ 2010-09-28 12:45   ` Avi Kivity
  2010-09-28 18:18     ` Nathan Fontenot
  0 siblings, 1 reply; 33+ messages in thread
From: Avi Kivity @ 2010-09-28 12:45 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

  On 09/27/2010 09:28 PM, Nathan Fontenot wrote:
>
>   For example, assume 1GiB section size. A device for a memory starting at
>   0x100000000 is /sys/device/system/memory/memory4
>   (0x100000000 / 1Gib = 4)
>   This device covers address range [0x100000000 ... 0x140000000)
>
> -Under each section, you can see 4 files.
> +Under each section, you can see 5 files.

Shouldn't this be, 4 or 5 files depending on kernel version?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections
  2010-09-27 19:25 ` [PATCH 4/8] v2 Allow memory block to span multiple memory sections Nathan Fontenot
  2010-09-27 23:55   ` Dave Hansen
@ 2010-09-28 12:48   ` Robin Holt
  2010-09-28 18:20     ` Nathan Fontenot
  1 sibling, 1 reply; 33+ messages in thread
From: Robin Holt @ 2010-09-28 12:48 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

> +u32 __weak memory_block_size_bytes(void)
> +{
> +	return MIN_MEMORY_BLOCK_SIZE;
> +}
> +
> +static u32 get_memory_block_size(void)

Can we make this an unsigned long?  We are testing on a system whose
smallest possible configuration is 4GB per socket with 512 sockets.
We would like to be able to specify this as 2GB by default (results
in the least lost memory) and suggest we add a command line option
which overrides this value.  We have many installations where 16GB may
be optimal.  Large configurations will certainly become more prevalent.

...
> @@ -551,12 +608,16 @@
>  	unsigned int i;
>  	int ret;
>  	int err;
> +	int block_sz;

This one needs to match the return above.  In our tests, we ended up
with a negative sections_per_block which caused very unexpected results.

Robin
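
A standalone illustration of the type mismatch Robin hit; the sizes below are assumptions (the real section size depends on SECTION_SIZE_BITS), but they show how a 2GB block size stored in a signed int yields a negative sections_per_block:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t block_size_bytes = 1u << 31;	/* 2GB, as SGI would like        */
	uint32_t section_size     = 1u << 27;	/* e.g. 128MB sections (assumed) */

	/* implementation-defined conversion; wraps negative on common ABIs */
	int bad_block_sz = block_size_bytes;
	int bad_sections_per_block = bad_block_sz / (int)section_size;

	unsigned long block_sz = block_size_bytes;
	unsigned long sections_per_block = block_sz / section_size;

	printf("int:           block_sz=%d, sections_per_block=%d\n",
	       bad_block_sz, bad_sections_per_block);
	printf("unsigned long: block_sz=%lu, sections_per_block=%lu\n",
	       block_sz, sections_per_block);
	return 0;
}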

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 12:44 ` Avi Kivity
@ 2010-09-28 15:12   ` Robin Holt
  2010-09-28 16:34     ` Avi Kivity
  2010-09-29  2:50     ` Greg KH
  2010-09-28 15:17   ` Dave Hansen
  1 sibling, 2 replies; 33+ messages in thread
From: Robin Holt @ 2010-09-28 15:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
>  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> >This set of patches decouples the concept that a single memory
> >section corresponds to a single directory in
> >/sys/devices/system/memory/.  On systems
> >with large amounts of memory (1+ TB) there are perfomance issues
> >related to creating the large number of sysfs directories.  For
> >a powerpc machine with 1 TB of memory we are creating 63,000+
> >directories.  This is resulting in boot times of around 45-50
> >minutes for systems with 1 TB of memory and 8 hours for systems
> >with 2 TB of memory.  With this patch set applied I am now seeing
> >boot times of 5 minutes or less.
> >
> >The root of this issue is in sysfs directory creation. Every time
> >a directory is created a string compare is done against all sibling
> >directories to ensure we do not create duplicates.  The list of
> >directory nodes in sysfs is kept as an unsorted list which results
> >in this being an exponentially longer operation as the number of
> >directories are created.
> >
> >The solution solved by this patch set is to allow a single
> >directory in sysfs to span multiple memory sections.  This is
> >controlled by an optional architecturally defined function
> >memory_block_size_bytes().  The default definition of this
> >routine returns a memory block size equal to the memory section
> >size. This maintains the current layout of sysfs memory
> >directories as it appears to userspace to remain the same as it
> >is today.
> >
> 
> Why not update sysfs directory creation to be fast, for example by
> using an rbtree instead of a linked list.  This fixes an
> implementation problem in the kernel instead of working around it
> and creating a new ABI.

Because the old ABI creates 129,000+ entries inside
/sys/devices/system/memory with their associated links from
/sys/devices/system/node/node*/ back to those directory entries.

Thankfully things like rpm, hald, and other miscellaneous commands scan
that information.  On our 8 TB test machine, hald runs continuously
following boot for nearly an hour mostly scanning useless information
from /sys/.

Robin

> 
> New ABIs mean old tools won't work, and new tools need to understand
> both ABIs.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 12:44 ` Avi Kivity
  2010-09-28 15:12   ` Robin Holt
@ 2010-09-28 15:17   ` Dave Hansen
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2010-09-28 15:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linuxppc-dev, Greg KH, linux-kernel, linux-mm, KAMEZAWA Hiroyuki

On Tue, 2010-09-28 at 14:44 +0200, Avi Kivity wrote:
> Why not update sysfs directory creation to be fast, for example by using 
> an rbtree instead of a linked list.  This fixes an implementation 
> problem in the kernel instead of working around it and creating a new ABI.
> 
> New ABIs mean old tools won't work, and new tools need to understand 
> both ABIs.

Just to be clear, _these_ patches do not change the existing ABI.

They do add a new ABI: the end_phys_index file.  But, it is completely
redundant at the moment.  It could be taken out of these patches.

That said, fixing the directory creation speed is probably a worthwhile
endeavor too.

-- Dave

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 6/8] v2 Update node sysfs code
  2010-09-28  9:29   ` Robin Holt
@ 2010-09-28 15:21     ` Dave Hansen
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2010-09-28 15:21 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, linux-mm, KAMEZAWA Hiroyuki

On Tue, 2010-09-28 at 04:29 -0500, Robin Holt wrote:
> Also, I don't think I much care for the weirdness that occurs if a
> memory block spans two nodes.  I have not thought through how possible
> (or likely) this is, but the code certainly permits it.  If that were
> the case, how would we know which sections need to be taken offline,
> etc? 

Since the architecture is the one doing the memory_block_size_bytes()
override, I'd expect that the per-arch code knows enough to ensure that
this doesn't happen.  It's probably something to add to the
documentation or the patch descriptions.  "How should an architecture
define this?  When should it be overridden?"

It's just like the question of SECTION_SIZE.  What if a section spans a
node?  Well, they don't because the sections are a software concept and
we _define_ them to not be able to cross nodes.  If they do, just make
them smaller.

-- Dave
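
For reference, a sketch of the weak/strong override mechanism being discussed: generic code carries the weak default from patch 4/8, and an architecture supplies a strong definition that the linker prefers.  The 16MB value and the file placement are illustrative assumptions, not taken from the patches:

/* drivers/base/memory.c: generic weak default (as in the patch set) */
u32 __weak memory_block_size_bytes(void)
{
	return MIN_MEMORY_BLOCK_SIZE;	/* one section per block */
}

/* arch/<arch>/...: hypothetical strong override.  It should return a
 * multiple of the section size and, per the discussion above, a value
 * small enough that a block never straddles a node boundary.
 */
u32 memory_block_size_bytes(void)
{
	return 16UL * 1024 * 1024;	/* e.g. 16MB blocks (made up) */
}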

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 15:12   ` Robin Holt
@ 2010-09-28 16:34     ` Avi Kivity
  2010-09-29  2:50     ` Greg KH
  1 sibling, 0 replies; 33+ messages in thread
From: Avi Kivity @ 2010-09-28 16:34 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

  On 09/28/2010 05:12 PM, Robin Holt wrote:
> >  Why not update sysfs directory creation to be fast, for example by
> >  using an rbtree instead of a linked list.  This fixes an
> >  implementation problem in the kernel instead of working around it
> >  and creating a new ABI.
>
> Because the old ABI creates 129,000+ entries inside
> /sys/devices/system/memory with their associated links from
> /sys/devices/system/node/node*/ back to those directory entries.
>
> Thankfully things like rpm, hald, and other miscellaneous commands scan
> that information.  On our 8 TB test machine, hald runs continuously
> following boot for nearly an hour mostly scanning useless information
> from /sys/

I see - so the problem wasn't just kernel-internal; the ABI itself was 
unsuitable.  Too bad this wasn't considered at the time it was added.

(129k entries / 1 hour = 35 entries/sec; not very impressive)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections
  2010-09-27 23:55   ` Dave Hansen
@ 2010-09-28 18:06     ` Nathan Fontenot
  0 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-28 18:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Greg KH, linux-kernel, KAMEZAWA Hiroyuki, linuxppc-dev

On 09/27/2010 06:55 PM, Dave Hansen wrote:
> On Mon, 2010-09-27 at 14:25 -0500, Nathan Fontenot wrote:
>> +static inline int base_memory_block_id(int section_nr)
>> +{
>> +       return section_nr / sections_per_block;
>> +}
> ...
>> -       mutex_lock(&mem_sysfs_mutex);
>> -
>> -       mem->phys_index = __section_nr(section);
>> +       scn_nr = __section_nr(section);
>> +       mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; 
> 
> I'm really regretting giving this variable such a horrid name.  I suck.
> 
> I think this is correct now:
> 
> 	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
> 	mem->phys_index = section_nr / sections_per_block * sections_per_block;
> 	mem->phys_index = section_nr
> 
> Since it gets exported to userspace this way:
> 
>> +static ssize_t show_mem_start_phys_index(struct sys_device *dev,
>>                         struct sysdev_attribute *attr, char *buf)
>>  {
>>         struct memory_block *mem =
>>                 container_of(dev, struct memory_block, sysdev);
>> -       return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
>> +       unsigned long phys_index;
>> +
>> +       phys_index = mem->start_phys_index / sections_per_block;
>> +       return sprintf(buf, "%08lx\n", phys_index);
>> +}
> 
> The only other thing I'd say is that we need to put phys_index out of
> its misery and call it what it is now: a section number.  I think it's
> OK to call them "start/end_section_nr", at least inside the kernel.  I
> intentionally used "phys_index" terminology in sysfs so that we _could_
> eventually do this stuff and break the relationship between sections and
> the sysfs dirs, but I think keeping the terminology around inside the
> kernel is confusing now.

Yes, it took me a couple of looks to get the phys_index <-> section number
correlation.  I think changing the kernel names to start/end_section_number
is a good idea.

-Nathan
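
A worked example of the index arithmetic in question, assuming a made-up sections_per_block of 16 (the real value is the block size divided by the section size):

#include <stdio.h>

#define SECTIONS_PER_BLOCK 16	/* assumed for illustration only */

static int base_memory_block_id(int section_nr)
{
	return section_nr / SECTIONS_PER_BLOCK;
}

int main(void)
{
	int scn_nr = 37;	/* an arbitrary section number */
	int start_section = base_memory_block_id(scn_nr) * SECTIONS_PER_BLOCK;

	/* In the patch being discussed, the block's starting section is
	 * what gets stored in the memory_block, and the sysfs 'phys_index'
	 * file shows that value divided back down to a block number.
	 */
	printf("section %d -> block %d, start section %d, phys_index %d\n",
	       scn_nr, base_memory_block_id(scn_nr), start_section,
	       start_section / SECTIONS_PER_BLOCK);
	return 0;
}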

> 
> -- Dave
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/8] v2 Add section count to memory_block struct
  2010-09-28  9:31   ` Robin Holt
@ 2010-09-28 18:14     ` Nathan Fontenot
  0 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-28 18:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On 09/28/2010 04:31 AM, Robin Holt wrote:
> In the next patch, you introduce a mutex for adding/removing memory blocks.
> Is there really a need for this to be atomic?  If you reorder the patches
> so the mutex comes first, would the atomic be needed any longer?
> 

I think you're right.  Looking at the code with all patches applied I am only
updating the atomic when holding the mem_sysfs_mutex.  I think the atomic
could safely be changed to a regular int.

-Nathan
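
A sketch of what that change could look like with every update serialized by mem_sysfs_mutex; the helper names are illustrative and this assumes section_count has been changed to a plain int, so it is not the posted code:

#include <linux/mutex.h>

static DEFINE_MUTEX(mem_sysfs_mutex);

/* Assumes struct memory_block now carries 'int section_count' protected
 * by mem_sysfs_mutex; the mutex alone provides the needed exclusion.
 */
static void memblk_add_section(struct memory_block *mem)
{
	mutex_lock(&mem_sysfs_mutex);
	mem->section_count++;			/* was atomic_inc() */
	mutex_unlock(&mem_sysfs_mutex);
}

static int memblk_remove_section(struct memory_block *mem)
{
	int last;

	mutex_lock(&mem_sysfs_mutex);
	mem->section_count--;			/* was atomic_dec_and_test() */
	last = (mem->section_count == 0);
	mutex_unlock(&mem_sysfs_mutex);

	return last;	/* caller tears down the sysfs directory if set */
}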

> Robin
> 
> On Mon, Sep 27, 2010 at 02:22:24PM -0500, Nathan Fontenot wrote:
>> Add a section count property to the memory_block struct to track the number
>> of memory sections that have been added/removed from a memory block. This
>> allows us to know when the last memory section of a memory block has been
>> removed so we can remove the memory block.
>>
>> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
>>
>> ---
>>  drivers/base/memory.c  |   16 ++++++++++------
>>  include/linux/memory.h |    3 +++
>>  2 files changed, 13 insertions(+), 6 deletions(-)
>>
>> Index: linux-next/drivers/base/memory.c
>> ===================================================================
>> --- linux-next.orig/drivers/base/memory.c	2010-09-27 09:17:20.000000000 -0500
>> +++ linux-next/drivers/base/memory.c	2010-09-27 09:31:35.000000000 -0500
>> @@ -478,6 +478,7 @@
>>  
>>  	mem->phys_index = __section_nr(section);
>>  	mem->state = state;
>> +	atomic_inc(&mem->section_count);
>>  	mutex_init(&mem->state_mutex);
>>  	start_pfn = section_nr_to_pfn(mem->phys_index);
>>  	mem->phys_device = arch_get_memory_phys_device(start_pfn);
>> @@ -505,12 +506,15 @@
>>  	struct memory_block *mem;
>>  
>>  	mem = find_memory_block(section);
>> -	unregister_mem_sect_under_nodes(mem);
>> -	mem_remove_simple_file(mem, phys_index);
>> -	mem_remove_simple_file(mem, state);
>> -	mem_remove_simple_file(mem, phys_device);
>> -	mem_remove_simple_file(mem, removable);
>> -	unregister_memory(mem, section);
>> +
>> +	if (atomic_dec_and_test(&mem->section_count)) {
>> +		unregister_mem_sect_under_nodes(mem);
>> +		mem_remove_simple_file(mem, phys_index);
>> +		mem_remove_simple_file(mem, state);
>> +		mem_remove_simple_file(mem, phys_device);
>> +		mem_remove_simple_file(mem, removable);
>> +		unregister_memory(mem, section);
>> +	}
>>  
>>  	return 0;
>>  }
>> Index: linux-next/include/linux/memory.h
>> ===================================================================
>> --- linux-next.orig/include/linux/memory.h	2010-09-27 09:17:20.000000000 -0500
>> +++ linux-next/include/linux/memory.h	2010-09-27 09:22:56.000000000 -0500
>> @@ -19,10 +19,13 @@
>>  #include <linux/node.h>
>>  #include <linux/compiler.h>
>>  #include <linux/mutex.h>
>> +#include <asm/atomic.h>
>>  
>>  struct memory_block {
>>  	unsigned long phys_index;
>>  	unsigned long state;
>> +	atomic_t section_count;
>> +
>>  	/*
>>  	 * This serializes all state change requests.  It isn't
>>  	 * held during creation because the control files are
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 12:38 ` [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Robin Holt
@ 2010-09-28 18:17   ` Nathan Fontenot
  2010-09-29 19:28     ` Robin Holt
  0 siblings, 1 reply; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-28 18:17 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On 09/28/2010 07:38 AM, Robin Holt wrote:
> I was tasked with looking at a slowdown in similar sized SGI machines
> booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
> I was looking at link_mem_sections().
> 
> I made a dramatic improvement on a 16TB machine in that function by
> merely caching the most recent memory section and checking to see if
> the next memory section happens to be the subsequent in the linked list
> of kobjects.
> 
> That simple cache reduced the time for link_mem_sections from 1 hour 27
> minutes down to 46 seconds.

Nice!

> 
> I would like to propose we implement something along those lines also,
> but I am currently swamped.  I can probably get you a patch tomorrow
> afternoon that applies at the end of this set.

Should this be done as a separate patch?  This patch set concentrates on
updates to the memory code with the node updates only being done due to the
memory changes.

I think it's a good idea to do the caching and have no problem adding on to
this patchset if no one else has any objections.

-Nathan
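
A standalone sketch of the cache-the-last-hit idea Robin describes, not the posted patches: remember the entry found by the previous lookup and try its list successor first, falling back to the full O(n) scan only on a miss:

#include <stdio.h>
#include <stdlib.h>

struct blk { int id; struct blk *next; };

static struct blk *blk_list;	/* stand-in for the kobject list      */
static struct blk *last_hit;	/* cache: most recently found entry   */

static struct blk *find_blk_slow(int id)
{
	struct blk *b;

	for (b = blk_list; b; b = b->next)	/* O(n) walk */
		if (b->id == id)
			return b;
	return NULL;
}

static struct blk *find_blk(int id)
{
	/* Fast path: boot-time callers ask for entries in order, so the
	 * successor of the last hit is almost always the right answer.
	 */
	if (last_hit && last_hit->next && last_hit->next->id == id)
		return last_hit = last_hit->next;
	return last_hit = find_blk_slow(id);
}

int main(void)
{
	struct blk **tail = &blk_list;
	int i;

	for (i = 0; i < 10; i++) {	/* registered in order, as at boot */
		struct blk *b = calloc(1, sizeof(*b));
		b->id = i;
		*tail = b;
		tail = &b->next;
	}
	for (i = 0; i < 10; i++)
		printf("found block %d\n", find_blk(i)->id);
	return 0;
}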

> 
> Thanks,
> Robin
> 
> On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
>> This set of patches decouples the concept that a single memory
>> section corresponds to a single directory in 
>> /sys/devices/system/memory/.  On systems
>> with large amounts of memory (1+ TB) there are perfomance issues
>> related to creating the large number of sysfs directories.  For
>> a powerpc machine with 1 TB of memory we are creating 63,000+
>> directories.  This is resulting in boot times of around 45-50
>> minutes for systems with 1 TB of memory and 8 hours for systems
>> with 2 TB of memory.  With this patch set applied I am now seeing
>> boot times of 5 minutes or less.
>>
>> The root of this issue is in sysfs directory creation. Every time
>> a directory is created a string compare is done against all sibling
>> directories to ensure we do not create duplicates.  The list of
>> directory nodes in sysfs is kept as an unsorted list which results
>> in this being an exponentially longer operation as the number of
>> directories are created.
>>
>> The solution solved by this patch set is to allow a single
>> directory in sysfs to span multiple memory sections.  This is
>> controlled by an optional architecturally defined function
>> memory_block_size_bytes().  The default definition of this
>> routine returns a memory block size equal to the memory section
>> size. This maintains the current layout of sysfs memory
>> directories as it appears to userspace to remain the same as it
>> is today.
>>
>> For architectures that define their own version of this routine,
>> as is done for powerpc in this patchset, the view in userspace
>> would change such that each memoryXXX directory would span
>> multiple memory sections.  The number of sections spanned would
>> depend on the value reported by memory_block_size_bytes.
>>
>> In both cases a new file 'end_phys_index' is created in each
>> memoryXXX directory.  This file will contain the physical id
>> of the last memory section covered by the sysfs directory.  For
>> the default case, the value in 'end_phys_index' will be the same
>> as in the existing 'phys_index' file.
>>
>> This version of the patch set includes an update to to properly
>> report block_size_bytes, phys_index, and end_phys_index.  Additionally,
>> the patch that adds the end_phys_index sysfs file is now patch 5/8
>> instead of being patch 2/8 as in the previous version of the patches.
>>
>> -Nathan Fontenot
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 8/8] v2 Update memory hotplug documentation
  2010-09-28 12:45   ` Avi Kivity
@ 2010-09-28 18:18     ` Nathan Fontenot
  0 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-28 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On 09/28/2010 07:45 AM, Avi Kivity wrote:
>  On 09/27/2010 09:28 PM, Nathan Fontenot wrote:
>>
>>   For example, assume 1GiB section size. A device for a memory
>> starting at
>>   0x100000000 is /sys/device/system/memory/memory4
>>   (0x100000000 / 1Gib = 4)
>>   This device covers address range [0x100000000 ... 0x140000000)
>>
>> -Under each section, you can see 4 files.
>> +Under each section, you can see 5 files.
> 
> Shouldn't this be, 4 or 5 files depending on kernel version?
> 

Correct,  I'll update this.  Thanks.

-Nathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections
  2010-09-28 12:48   ` Robin Holt
@ 2010-09-28 18:20     ` Nathan Fontenot
  0 siblings, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-28 18:20 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On 09/28/2010 07:48 AM, Robin Holt wrote:
>> +u32 __weak memory_block_size_bytes(void)
>> +{
>> +	return MIN_MEMORY_BLOCK_SIZE;
>> +}
>> +
>> +static u32 get_memory_block_size(void)
> 
> Can we make this an unsigned long?  We are testing on a system whose
> smallest possible configuration is 4GB per socket with 512 sockets.
> We would like to be able to specify this as 2GB by default (results
> in the least lost memory) and suggest we add a command line option
> which overrides this value.  We have many installations where 16GB may
> be optimal.  Large configurations will certainly become more prevalent.

Works for me.

> 
> ...
>> @@ -551,12 +608,16 @@
>>  	unsigned int i;
>>  	int ret;
>>  	int err;
>> +	int block_sz;
> 
> This one needs to match the return above.  In our tests, we ended up
> with a negative sections_per_block which caused very unexpected results.

Oh, nice catch.  I'll update both of these.

-Nathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 15:12   ` Robin Holt
  2010-09-28 16:34     ` Avi Kivity
@ 2010-09-29  2:50     ` Greg KH
  2010-09-29  8:32       ` Avi Kivity
  1 sibling, 1 reply; 33+ messages in thread
From: Greg KH @ 2010-09-29  2:50 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, linux-kernel, Dave Hansen, linux-mm, Avi Kivity,
	KAMEZAWA Hiroyuki

On Tue, Sep 28, 2010 at 10:12:18AM -0500, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
> >  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> > >This set of patches decouples the concept that a single memory
> > >section corresponds to a single directory in
> > >/sys/devices/system/memory/.  On systems
> > >with large amounts of memory (1+ TB) there are perfomance issues
> > >related to creating the large number of sysfs directories.  For
> > >a powerpc machine with 1 TB of memory we are creating 63,000+
> > >directories.  This is resulting in boot times of around 45-50
> > >minutes for systems with 1 TB of memory and 8 hours for systems
> > >with 2 TB of memory.  With this patch set applied I am now seeing
> > >boot times of 5 minutes or less.
> > >
> > >The root of this issue is in sysfs directory creation. Every time
> > >a directory is created a string compare is done against all sibling
> > >directories to ensure we do not create duplicates.  The list of
> > >directory nodes in sysfs is kept as an unsorted list which results
> > >in this being an exponentially longer operation as the number of
> > >directories are created.
> > >
> > >The solution solved by this patch set is to allow a single
> > >directory in sysfs to span multiple memory sections.  This is
> > >controlled by an optional architecturally defined function
> > >memory_block_size_bytes().  The default definition of this
> > >routine returns a memory block size equal to the memory section
> > >size. This maintains the current layout of sysfs memory
> > >directories as it appears to userspace to remain the same as it
> > >is today.
> > >
> > 
> > Why not update sysfs directory creation to be fast, for example by
> > using an rbtree instead of a linked list.  This fixes an
> > implementation problem in the kernel instead of working around it
> > and creating a new ABI.
> 
> Because the old ABI creates 129,000+ entries inside
> /sys/devices/system/memory with their associated links from
> /sys/devices/system/node/node*/ back to those directory entries.
> 
> Thankfully things like rpm, hald, and other miscellaneous commands scan
> that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore, but what other commands/programs
read this information?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29  2:50     ` Greg KH
@ 2010-09-29  8:32       ` Avi Kivity
  2010-09-29 12:37         ` Greg KH
  0 siblings, 1 reply; 33+ messages in thread
From: Avi Kivity @ 2010-09-29  8:32 UTC (permalink / raw)
  To: Greg KH
  Cc: linuxppc-dev, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	KAMEZAWA Hiroyuki

  On 09/29/2010 04:50 AM, Greg KH wrote:
> >
> >  Because the old ABI creates 129,000+ entries inside
> >  /sys/devices/system/memory with their associated links from
> >  /sys/devices/system/node/node*/ back to those directory entries.
> >
> >  Thankfully things like rpm, hald, and other miscellaneous commands scan
> >  that information.
>
> Really?  Why?  Why would rpm care about this?  hald is dead now so we
> don't need to worry about that anymore,

That's not what compatibility means.  We can't just support 
latest-and-greatest userspace on latest-and-greatest kernels.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29  8:32       ` Avi Kivity
@ 2010-09-29 12:37         ` Greg KH
  2010-09-29 13:39           ` Kay Sievers
  2010-10-03  7:52           ` Avi Kivity
  0 siblings, 2 replies; 33+ messages in thread
From: Greg KH @ 2010-09-29 12:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linuxppc-dev, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	KAMEZAWA Hiroyuki

On Wed, Sep 29, 2010 at 10:32:34AM +0200, Avi Kivity wrote:
>  On 09/29/2010 04:50 AM, Greg KH wrote:
>> >
>> >  Because the old ABI creates 129,000+ entries inside
>> >  /sys/devices/system/memory with their associated links from
>> >  /sys/devices/system/node/node*/ back to those directory entries.
>> >
>> >  Thankfully things like rpm, hald, and other miscellaneous commands scan
>> >  that information.
>>
>> Really?  Why?  Why would rpm care about this?  hald is dead now so we
>> don't need to worry about that anymore,
>
> That's not what compatibility means.  We can't just support 
> latest-and-greatest userspace on latest-and-greatest kernels.

Oh, I know that, that's not what I was getting at at all here, sorry if
it came across that way.

I wanted to know so we could go fix programs that are mucking around in
these files, as odds are they shouldn't be doing that in the first
place.

Like rpm, why would it matter what the memory in the system looks like?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29 12:37         ` Greg KH
@ 2010-09-29 13:39           ` Kay Sievers
  2010-10-03  7:52           ` Avi Kivity
  1 sibling, 0 replies; 33+ messages in thread
From: Kay Sievers @ 2010-09-29 13:39 UTC (permalink / raw)
  To: Greg KH
  Cc: linuxppc-dev, linux-kernel, Dave Hansen, linux-mm, Avi Kivity,
	KAMEZAWA Hiroyuki, Robin Holt

On Wed, Sep 29, 2010 at 14:37, Greg KH <greg@kroah.com> wrote:
> On Wed, Sep 29, 2010 at 10:32:34AM +0200, Avi Kivity wrote:
>>  On 09/29/2010 04:50 AM, Greg KH wrote:
>>> >
>>> >  Because the old ABI creates 129,000+ entries inside
>>> >  /sys/devices/system/memory with their associated links from
>>> >  /sys/devices/system/node/node*/ back to those directory entries.
>>> >
>>> >  Thankfully things like rpm, hald, and other miscellaneous commands scan
>>> >  that information.
>>>
>>> Really?  Why?  Why would rpm care about this?  hald is dead now so we
>>> don't need to worry about that anymore,
>>
>> That's not what compatibility means.  We can't just support
>> latest-and-greatest userspace on latest-and-greatest kernels.
>
> Oh, I know that, that's not what I was getting at at all here, sorry if
> it came across that way.
>
> I wanted to know so we could go fix programs that are mucking around in
> these files, as odds are, the shouldn't be doing that in the first
> place.
>
> Like rpm, why would it matter what the memory in the system looks like?

HAL does many inefficient things, but I don't think it's using
/sys/system/, besides that it may check the cpufreq governors' state
there.

Kay

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-28 18:17   ` Nathan Fontenot
@ 2010-09-29 19:28     ` Robin Holt
  2010-09-30 15:17       ` Nathan Fontenot
  2010-09-30 16:39       ` Robin Holt
  0 siblings, 2 replies; 33+ messages in thread
From: Robin Holt @ 2010-09-29 19:28 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-mm, Greg KH, linux-kernel, Dave Hansen, linuxppc-dev,
	Robin Holt, KAMEZAWA Hiroyuki

On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
> On 09/28/2010 07:38 AM, Robin Holt wrote:
> > I was tasked with looking at a slowdown in similar sized SGI machines
> > booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
> > I was looking at link_mem_sections().
> > 
> > I made a dramatic improvement on a 16TB machine in that function by
> > merely caching the most recent memory section and checking to see if
> > the next memory section happens to be the subsequent in the linked list
> > of kobjects.
> > 
> > That simple cache reduced the time for link_mem_sections from 1 hour 27
> > minutes down to 46 seconds.
> 
> Nice!
> 
> > 
> > I would like to propose we implement something along those lines also,
> > but I am currently swamped.  I can probably get you a patch tomorrow
> > afternoon that applies at the end of this set.
> 
> Should this be done as a separate patch?  This patch set concentrates on
> updates to the memory code with the node updates only being done due to the
> memory changes.
> 
> I think its a good idea to do the caching and have no problem adding on to
> this patchset if no one else has any objections.

I am sorry.  I had meant to include you on the Cc: list.  I just posted a
set of patches (3 small patches) which implement the cache-most-recent bit
I alluded to above.  Search for a subject of "Speed up link_mem_sections
during boot" and you will find them.  I did add you to the Cc: list for
the next time I end up sending the set.

My next task is to implement an x86_64 SGI UV-specific chunk of code
for memory_block_size_bytes().  Would you consider adding that to your
patch set?  I expect to have that either later today or early tomorrow.

Robin

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29 19:28     ` Robin Holt
@ 2010-09-30 15:17       ` Nathan Fontenot
  2010-09-30 16:39       ` Robin Holt
  1 sibling, 0 replies; 33+ messages in thread
From: Nathan Fontenot @ 2010-09-30 15:17 UTC (permalink / raw)
  To: Robin Holt
  Cc: linuxppc-dev, Greg KH, linux-kernel, Dave Hansen, linux-mm,
	KAMEZAWA Hiroyuki

On 09/29/2010 02:28 PM, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
>> On 09/28/2010 07:38 AM, Robin Holt wrote:
>>> I was tasked with looking at a slowdown in similar sized SGI machines
>>> booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
>>> I was looking at link_mem_sections().
>>>
>>> I made a dramatic improvement on a 16TB machine in that function by
>>> merely caching the most recent memory section and checking to see if
>>> the next memory section happens to be the subsequent in the linked list
>>> of kobjects.
>>>
>>> That simple cache reduced the time for link_mem_sections from 1 hour 27
>>> minutes down to 46 seconds.
>>
>> Nice!
>>
>>>
>>> I would like to propose we implement something along those lines also,
>>> but I am currently swamped.  I can probably get you a patch tomorrow
>>> afternoon that applies at the end of this set.
>>
>> Should this be done as a separate patch?  This patch set concentrates on
>> updates to the memory code with the node updates only being done due to the
>> memory changes.
>>
>> I think its a good idea to do the caching and have no problem adding on to
>> this patchset if no one else has any objections.
> 
> I am sorry.  I had meant to include you on the Cc: list.  I just posted a
> set of patches (3 small patches) which implement the cache most recent bit
> I aluded to above.  Search for a subject of "Speed up link_mem_sections
> during boot" and you will find them.  I did add you to the Cc: list for
> the next time I end up sending the set.
> 
> My next task is to implement a x86_64 SGI UV specific chunk of code
> to memory_block_size_bytes().  Would you consider adding that to your
> patch set?  I expect to have that either later today or early tomorrow.
> 

No problem. I'm putting together a new patch set with updates from all of
the comments now, so go ahead and send it to me when you have it ready.

-Nathan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29 19:28     ` Robin Holt
  2010-09-30 15:17       ` Nathan Fontenot
@ 2010-09-30 16:39       ` Robin Holt
  1 sibling, 0 replies; 33+ messages in thread
From: Robin Holt @ 2010-09-30 16:39 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: linux-mm, Greg KH, linux-kernel, Dave Hansen, linuxppc-dev,
	Robin Holt, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	KAMEZAWA Hiroyuki

On Wed, Sep 29, 2010 at 02:28:30PM -0500, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
...
> My next task is to implement an x86_64 SGI UV-specific chunk of code
> for memory_block_size_bytes().  Would you consider adding that to your
> patch set?  I expect to have that either later today or early tomorrow.

The patch is below.

I left things at a u32, but I would really like it if you changed to an
unsigned long and adjusted my patch for me.

Thanks,
Robin

------------------------------------------------------------------------
Subject: [Patch] Implement memory_block_size_bytes for x86_64 when CONFIG_X86_UV


Nathan Fontenot has implemented a patch set for large memory configuration
systems which will combine drivers/base/memory.c memory sections
together into memory blocks with the default behavior being
unchanged from the current behavior.

In his patch set, he implements a memory_block_size_bytes() function
for PPC.  This is the equivalent patch for x86_64 when it has
CONFIG_X86_UV set.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
To: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: lkml <linux-kernel@vger.kernel.org>

---

 arch/x86/mm/init_64.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: memory_block/arch/x86/mm/init_64.c
===================================================================
--- memory_block.orig/arch/x86/mm/init_64.c	2010-09-29 14:46:50.711824616 -0500
+++ memory_block/arch/x86/mm/init_64.c	2010-09-29 14:46:55.683997672 -0500
@@ -50,6 +50,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 #include <linux/bootmem.h>
 
 static unsigned long dma_reserve __initdata;
@@ -928,6 +929,20 @@ const char *arch_vma_name(struct vm_area
 	return NULL;
 }
 
+#ifdef CONFIG_X86_UV
+#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
+
+u32 memory_block_size_bytes(void)
+{
+	if (is_uv_system()) {
+		printk("UV: memory block size 2GB\n");
+		return 2UL * 1024 * 1024 * 1024;
+	}
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+#endif
+
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
  2010-09-29 12:37         ` Greg KH
  2010-09-29 13:39           ` Kay Sievers
@ 2010-10-03  7:52           ` Avi Kivity
  1 sibling, 0 replies; 33+ messages in thread
From: Avi Kivity @ 2010-10-03  7:52 UTC (permalink / raw)
  To: Greg KH
  Cc: linuxppc-dev, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	KAMEZAWA Hiroyuki

  On 09/29/2010 02:37 PM, Greg KH wrote:
> >>  >   Thankfully things like rpm, hald, and other miscellaneous commands scan
> >>  >   that information.
> >>
> >>  Really?  Why?  Why would rpm care about this?  hald is dead now so we
> >>  don't need to worry about that anymore,
> >
> >  That's not what compatibility means.  We can't just support
> >  latest-and-greatest userspace on latest-and-greatest kernels.
>
> Oh, I know that, that's not what I was getting at at all here, sorry if
> it came across that way.
>
> I wanted to know so we could go fix programs that are mucking around in
> these files, as odds are, the shouldn't be doing that in the first
> place.
>
> Like rpm, why would it matter what the memory in the system looks like?
>

I see, thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2010-10-03  7:53 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-27 19:09 [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Nathan Fontenot
2010-09-27 19:21 ` [PATCH 1/8] v2 Move find_memory_block() routine Nathan Fontenot
2010-09-27 19:22 ` [PATCH 2/8] v2 Add section count to memory_block struct Nathan Fontenot
2010-09-28  9:31   ` Robin Holt
2010-09-28 18:14     ` Nathan Fontenot
2010-09-27 19:23 ` [PATCH 3/8] v2 Add mutex for adding/removing memory blocks Nathan Fontenot
2010-09-27 19:25 ` [PATCH 4/8] v2 Allow memory block to span multiple memory sections Nathan Fontenot
2010-09-27 23:55   ` Dave Hansen
2010-09-28 18:06     ` Nathan Fontenot
2010-09-28 12:48   ` Robin Holt
2010-09-28 18:20     ` Nathan Fontenot
2010-09-27 19:26 ` [PATCH 5/8] v2 Add end_phys_index file Nathan Fontenot
2010-09-27 19:27 ` [PATCH 6/8] v2 Update node sysfs code Nathan Fontenot
2010-09-28  9:29   ` Robin Holt
2010-09-28 15:21     ` Dave Hansen
2010-09-27 19:28 ` [PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries Nathan Fontenot
2010-09-27 19:28 ` [PATCH 8/8] v2 Update memory hotplug documentation Nathan Fontenot
2010-09-28 12:45   ` Avi Kivity
2010-09-28 18:18     ` Nathan Fontenot
2010-09-28 12:38 ` [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections Robin Holt
2010-09-28 18:17   ` Nathan Fontenot
2010-09-29 19:28     ` Robin Holt
2010-09-30 15:17       ` Nathan Fontenot
2010-09-30 16:39       ` Robin Holt
2010-09-28 12:44 ` Avi Kivity
2010-09-28 15:12   ` Robin Holt
2010-09-28 16:34     ` Avi Kivity
2010-09-29  2:50     ` Greg KH
2010-09-29  8:32       ` Avi Kivity
2010-09-29 12:37         ` Greg KH
2010-09-29 13:39           ` Kay Sievers
2010-10-03  7:52           ` Avi Kivity
2010-09-28 15:17   ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).