linux-kernel.vger.kernel.org archive mirror
* Re: memory hotremove prototype, take 3
@ 2003-12-01 20:12 Luck, Tony
  2003-12-02  3:01 ` IWAMOTO Toshihiro
  2003-12-02 22:26 ` Yasunori Goto
  0 siblings, 2 replies; 17+ messages in thread
From: Luck, Tony @ 2003-12-01 20:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Pavel Machek

Pavel Machek wrote:

> hotunplug seems cool... How do you deal with kernel data structures in
> memory "to be removed"? Or you simply don't allow kmalloc() to
> allocate there?

You guessed right.  Hot-removable memory can only be allocated
for uses that we can easily re-allocate.  So kmalloc() etc. have
to get memory from some area that we promise never to try to
remove.
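
As a rough illustration of that policy, here is a minimal userspace
sketch; the zone layout, the "removable" flag and the helper are made
up for illustration and are not taken from the patch:

#include <stdio.h>

/* Illustrative only: a zone carries a "removable" attribute. */
struct zone {
	const char *name;
	int removable;        /* 1 = may be hot-removed later */
	long free_pages;
};

/* Kernel-internal allocations must avoid removable zones, because
 * the objects placed there cannot be relocated later. */
static struct zone *pick_zone(struct zone *zones, int n, int allow_removable)
{
	int i;

	for (i = 0; i < n; i++) {
		if (zones[i].removable && !allow_removable)
			continue;
		if (zones[i].free_pages > 0)
			return &zones[i];
	}
	return NULL;
}

int main(void)
{
	struct zone zones[] = {
		{ "hotplug", 1, 4000 },   /* plenty free, but removable */
		{ "normal",  0, 1000 },   /* never removed              */
	};
	struct zone *z;

	z = pick_zone(zones, 2, 0);   /* kmalloc-like request        */
	printf("kernel allocation from: %s\n", z ? z->name : "none");

	z = pick_zone(zones, 2, 1);   /* user page, can be migrated  */
	printf("user allocation from:   %s\n", z ? z->name : "none");
	return 0;
}

Kernel allocations fall back to the non-removable zone even though the
removable one has more free pages; user pages may land anywhere.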

> During hotunplug, you copy pages to a new location. Would it simplify
> code if you forced them to be swapped out, instead? [Yep, it would be
> slower...]

There are some pages that will have to be copied (e.g. pages that
the user "mlock()d" should still be locked in their new location,
same for hugetlbfs pages).

-Tony Luck  



* Re: memory hotremove prototype, take 3
  2003-12-01 20:12 memory hotremove prototype, take 3 Luck, Tony
@ 2003-12-02  3:01 ` IWAMOTO Toshihiro
  2003-12-02  6:43   ` Hirokazu Takahashi
  2003-12-02 22:26 ` Yasunori Goto
  1 sibling, 1 reply; 17+ messages in thread
From: IWAMOTO Toshihiro @ 2003-12-02  3:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Luck, Tony, Pavel Machek

At Mon, 1 Dec 2003 12:12:09 -0800,
Luck, Tony <tony.luck@intel.com> wrote:
> 
> Pavel Machek wrote:

> > During hotunplug, you copy pages to a new location. Would it simplify
> > code if you forced them to be swapped out, instead? [Yep, it would be
> > slower...]
> 
> There are some pages that will have to be copied (e.g. pages that
> the user "mlock()d" should still be locked in their new location,
> same for hugetlbfs pages).

Using kswapd is easy, but doesn't always work well.  The patch
contains code to ignore page accessed bits when kswapd is run on
disabled zones, but that's not enough to swap out frequently used
pages.
In my patch, page copying, or "remapping", solves this problem by
blocking accesses to the page under operation.
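
A very rough userspace sketch of that remapping idea (the mapping table
and locking here are simplified stand-ins, not the patch's data
structures): lookups take a lock, so while a page is being copied to
its new location any access simply blocks until the switch is done.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* One logical page: lookups must take the lock, so a page that is
 * being remapped cannot be touched mid-copy. */
struct mapping {
	pthread_mutex_t lock;
	void *data;              /* current backing memory */
};

static void *lookup(struct mapping *m)
{
	void *p;

	pthread_mutex_lock(&m->lock);
	p = m->data;
	pthread_mutex_unlock(&m->lock);
	return p;
}

/* Move the page out of a region that is about to be removed. */
static void remap_page(struct mapping *m)
{
	void *new_page = malloc(PAGE_SIZE);   /* stands in for a page from a
						 non-removable area */

	pthread_mutex_lock(&m->lock);         /* block accesses       */
	memcpy(new_page, m->data, PAGE_SIZE); /* copy the contents    */
	free(m->data);                        /* old frame is now free */
	m->data = new_page;                   /* switch the mapping   */
	pthread_mutex_unlock(&m->lock);
}

int main(void)
{
	struct mapping m;

	pthread_mutex_init(&m.lock, NULL);
	m.data = malloc(PAGE_SIZE);
	strcpy(m.data, "frequently used page");

	remap_page(&m);
	printf("after remap: %s\n", (char *)lookup(&m));
	free(m.data);
	return 0;
}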

--
IWAMOTO Toshihiro



* Re: memory hotremove prototype, take 3
  2003-12-02  3:01 ` IWAMOTO Toshihiro
@ 2003-12-02  6:43   ` Hirokazu Takahashi
  0 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2003-12-02  6:43 UTC (permalink / raw)
  To: iwamoto; +Cc: linux-kernel, tony.luck, pavel

Hello,

> > > During hotunplug, you copy pages to a new location. Would it simplify
> > > code if you forced them to be swapped out, instead? [Yep, it would be
> > > slower...]
> > 
> > There are some pages that will have to be copied (e.g. pages that
> > the user "mlock()d" should still be locked in their new location,
> > same for hugetlbfs pages).

And some pages that have no backing store, such as sysfs or ramdisk
pages, have to be copied as well.

> Using kswapd is easy, but doesn't always work well.  The patch
> contains code to ignore page accessed bits when kswapd is run on
> disabled zones, but that's not enough to swap out frequently used
> pages.
> In my patch, page copying, or "remapping", solves this problem by
> blocking accesses to the page under operation.

Thank you,
Hirokazu Takahashi.


* Re: memory hotremove prototype, take 3
  2003-12-01 20:12 memory hotremove prototype, take 3 Luck, Tony
  2003-12-02  3:01 ` IWAMOTO Toshihiro
@ 2003-12-02 22:26 ` Yasunori Goto
  1 sibling, 0 replies; 17+ messages in thread
From: Yasunori Goto @ 2003-12-02 22:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Luck, Tony, IWAMOTO Toshihiro, Hirokazu Takahashi,
	Linux Hotplug Memory Support

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

Hello.

> Pavel Machek wrote:
> 
> > hotunplug seems cool... How do you deal with kernel data structures in
> > memory "to be removed"? Or you simply don't allow kmalloc() to
> > allocate there?
> 
> You guessed right.  Hot-removable memory can only be allocated
> for uses that we can easily re-allocate.  So kmalloc() etc. have
> to get memory from some area that we promise never to try to
> remove.

IMHO, to hot-remove memory, memory should be divided into
hotpluggable and non-hotpluggable attributes, and memory with each
attribute should be allocated in its own unit (e.g. a node).

(I posted the following mail two months ago.)
http://marc.theaimsgroup.com/?l=linux-kernel&m=106506389406876&w=2

Now I'm making a memory hot-add trial patch, but it doesn't work yet.
(The kernel panics when the memory enable command is executed.)
Once this patch works, I will post it again.

Thanks.

-- 
Yasunori Goto <ygoto at fsw.fujitsu.com>

[-- Attachment #2: 20031125.patch --]
[-- Type: application/octet-stream, Size: 34776 bytes --]

diff -duprb linux-2.6.0-test7/Makefile testdir/Makefile
--- linux-2.6.0-test7/Makefile	Wed Oct  8 12:24:17 2003
+++ testdir/Makefile	Sat Nov 22 17:55:21 2003
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 0
-EXTRAVERSION = -test7
+EXTRAVERSION = -test7-mem-hotplug
 
 # *DOCUMENTATION*
 # To see a list of typical targets execute "make help"
diff -duprb linux-2.6.0-test7/arch/i386/Kconfig testdir/arch/i386/Kconfig
--- linux-2.6.0-test7/arch/i386/Kconfig	Wed Oct  8 12:24:02 2003
+++ testdir/arch/i386/Kconfig	Sat Nov 22 17:52:36 2003
@@ -706,14 +706,18 @@ comment "NUMA (NUMA-Q) requires SMP, 64G
 comment "NUMA (Summit) requires SMP, 64GB highmem support, full ACPI"
 	depends on X86_SUMMIT && (!HIGHMEM64G || !ACPI || ACPI_HT_ONLY)
 
+config MEMHOTPLUGTEST
+	bool "Memory hotplug test"
+	default n
+
 config DISCONTIGMEM
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUGTEST
 	default y
 
 config HAVE_ARCH_BOOTMEM_NODE
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUGTEST
 	default y
 
 config HIGHPTE
diff -duprb linux-2.6.0-test7/arch/i386/kernel/setup.c testdir/arch/i386/kernel/setup.c
--- linux-2.6.0-test7/arch/i386/kernel/setup.c	Wed Oct  8 12:24:05 2003
+++ testdir/arch/i386/kernel/setup.c	Sat Nov 22 17:52:36 2003
@@ -114,6 +114,8 @@ extern void generic_apic_probe(char *);
 extern int root_mountflags;
 extern char _end[];
 
+extern unsigned long node_end_pfn[MAX_NUMNODES];
+
 unsigned long saved_videomode;
 
 #define RAMDISK_IMAGE_START_MASK  	0x07FF
@@ -611,7 +613,11 @@ unsigned long __init find_max_low_pfn(vo
 {
 	unsigned long max_low_pfn;
 
+#if CONFIG_MEMHOTPLUGTEST
+	max_low_pfn = node_end_pfn[0];
+#else
 	max_low_pfn = max_pfn;
+#endif
 	if (max_low_pfn > MAXMEM_PFN) {
 		if (highmem_pages == -1)
 			highmem_pages = max_pfn - MAXMEM_PFN;
diff -duprb linux-2.6.0-test7/arch/i386/mm/discontig.c testdir/arch/i386/mm/discontig.c
--- linux-2.6.0-test7/arch/i386/mm/discontig.c	Wed Oct  8 12:24:07 2003
+++ testdir/arch/i386/mm/discontig.c	Tue Nov 25 19:34:03 2003
@@ -28,6 +28,12 @@
 #include <linux/mmzone.h>
 #include <linux/highmem.h>
 #include <linux/initrd.h>
+#include <linux/proc_fs.h>
+
+#ifdef CONFIG_MEMHOTPLUGTEST
+#include <linux/sched.h>
+#endif
+
 #include <asm/e820.h>
 #include <asm/setup.h>
 #include <asm/mmzone.h>
@@ -80,6 +86,10 @@ unsigned long node_remap_offset[MAX_NUMN
 void *node_remap_start_vaddr[MAX_NUMNODES];
 void set_pmd_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags);
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+void set_pmd_pfn_withpgd(unsigned long vaddr, unsigned long pfn,pgd_t *pgd, pgprot_t flags);
+#endif
+
 /*
  * FLAT - support for basic PC memory model with discontig enabled, essentially
  *        a single node with all available processors in it with a flat
@@ -111,6 +121,44 @@ int __init get_memcfg_numa_flat(void)
 	return 1;
 }
 
+int __init get_memcfg_numa_blks(void)
+{
+	int i, pfn;
+
+	printk("NUMA - single node, flat memory mode, but broken in several blocks\n");
+
+	/* Run the memory configuration and find the top of memory. */
+	find_max_pfn();
+	max_pfn = max_pfn & ~(PTRS_PER_PTE - 1);
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		pfn = PFN_DOWN(256 << 20) * i;
+		node_start_pfn[i] = pfn;
+		pfn += PFN_DOWN(256 << 20);
+		if (pfn < max_pfn)
+			node_end_pfn[i] = pfn;
+		else {
+			node_end_pfn[i] = max_pfn;
+			i++;
+			printk("total %d blocks, max %d\n", i, (int)max_pfn);
+			break;
+		}
+	}
+
+	/* Fill in the physnode_map with our simplistic memory model,
+	   * all memory is in node 0.
+	 */
+	for (pfn = node_start_pfn[0]; pfn <= max_pfn;
+		pfn += PAGES_PER_ELEMENT) {
+		physnode_map[pfn / PAGES_PER_ELEMENT] = pfn / PFN_DOWN(256 << 20);
+	}
+
+	/* Indicate there is one node available. */
+	node_set_online(0);
+	numnodes = i;
+
+	return 1;
+}
+
 /*
  * Find the highest page frame number we have available for the node
  */
@@ -134,6 +182,12 @@ static void __init find_max_pfn_node(int
  */
 static void __init allocate_pgdat(int nid)
 {
+#if CONFIG_MEMHOTPLUGTEST
+	/* pg_dat allocate Node 0 statically */
+	NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
+	min_low_pfn += PFN_UP(sizeof(pg_data_t));
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+#else
 	if (nid)
 		NODE_DATA(nid) = (pg_data_t *)node_remap_start_vaddr[nid];
 	else {
@@ -141,6 +195,7 @@ static void __init allocate_pgdat(int ni
 		min_low_pfn += PFN_UP(sizeof(pg_data_t));
 		memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
 	}
+#endif
 }
 
 /*
@@ -183,6 +238,7 @@ static void __init register_bootmem_low_
 	}
 }
 
+/*static struct kcore_list numa_kc;*/
 void __init remap_numa_kva(void)
 {
 	void *vaddr;
@@ -196,9 +252,34 @@ void __init remap_numa_kva(void)
 				node_remap_start_pfn[node] + pfn, 
 				PAGE_KERNEL_LARGE);
 		}
+	/*	memset(node_remap_start_vaddr[node], 0,node_remap_size[node] * PAGE_SIZE); */
+	}
+/*	kclist_add(&numa_kc, node_remap_start_vaddr[numnodes - 1],
+		   node_remap_offset[numnodes - 1] << PAGE_SHIFT);*/
+}
+
+void remap_add_node_kva(int node)
+{
+	void *vaddr;
+	unsigned long pfn;
+	struct task_struct *p;
+	pgd_t *pg_dir;
+
+	read_lock(&tasklist_lock);
+	for_each_process(p){
+		pg_dir = p->mm->pgd;
+		for(pfn=0; pfn < node_remap_size[node]; pfn += PTRS_PER_PTE){
+			vaddr = node_remap_start_vaddr[node]+(pfn<<PAGE_SHIFT);
+			set_pmd_pfn_withpgd((ulong) vaddr,
+				node_remap_start_pfn[node] + pfn,
+				pg_dir + pgd_index( (ulong)vaddr ) ,
+				PAGE_KERNEL_LARGE);
+		}
 	}
+	read_unlock(&tasklist_lock);
 }
 
+
 static unsigned long calculate_numa_remap_pages(void)
 {
 	int nid;
@@ -206,8 +287,13 @@ static unsigned long calculate_numa_rema
 
 	for (nid = 1; nid < numnodes; nid++) {
 		/* calculate the size of the mem_map needed in bytes */
+#if CONFIG_MEMHOTPLUGTEST
+		size = (node_end_pfn[nid] - node_start_pfn[nid] + 1)
+			* sizeof(struct page);
+#else
 		size = (node_end_pfn[nid] - node_start_pfn[nid] + 1) 
 			* sizeof(struct page) + sizeof(pg_data_t);
+#endif
 		/* convert size to large (pmd size) pages, rounding up */
 		size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
 		/* now the roundup is correct, convert to PAGE_SIZE pages */
@@ -248,7 +334,9 @@ unsigned long __init setup_memory(void)
 	printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
 	       pages_to_mb(highend_pfn - highstart_pfn));
 #endif
+#ifndef CONFIG_MEMHOTPLUGTEST
 	system_max_low_pfn = max_low_pfn = max_low_pfn - reserve_pages;
+#endif
 	printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
 			pages_to_mb(system_max_low_pfn));
 	printk("min_low_pfn = %ld, max_low_pfn = %ld, highstart_pfn = %ld\n", 
@@ -426,7 +514,11 @@ void __init set_highmem_pages_init(int b
 void __init set_max_mapnr_init(void)
 {
 #ifdef CONFIG_HIGHMEM
+#ifdef CONFIG_MEMHOTPLUGTEST
+	highmem_start_page = phys_to_virt(max_low_pfn << PAGE_SHIFT);
+#else  /* !CONFIG_MEMHOTPLUGTEST */
 	highmem_start_page = NODE_DATA(0)->node_zones[ZONE_HIGHMEM].zone_mem_map;
+#endif /* !CONFIG_MEMHOTPLUGTEST */
 	num_physpages = highend_pfn;
 #else
 	num_physpages = max_low_pfn;
diff -duprb linux-2.6.0-test7/arch/i386/mm/pgtable.c testdir/arch/i386/mm/pgtable.c
--- linux-2.6.0-test7/arch/i386/mm/pgtable.c	Wed Oct  8 12:24:53 2003
+++ testdir/arch/i386/mm/pgtable.c	Tue Nov 25 19:23:46 2003
@@ -118,6 +118,30 @@ void set_pmd_pfn(unsigned long vaddr, un
 	 */
 	__flush_tlb_one(vaddr);
 }
+void set_pmd_pfn_withpgd(unsigned long vaddr, unsigned long pfn, pgd_t *pgd, pgprot_t flags)
+{
+	pmd_t *pmd;
+
+	if (vaddr & (PMD_SIZE-1)) {		/* vaddr is misaligned */
+		printk ("set_pmd_pfn: vaddr misaligned\n");
+		return; /* BUG(); */
+	}
+	if (pfn & (PTRS_PER_PTE-1)) {		/* pfn is misaligned */
+		printk ("set_pmd_pfn: pfn misaligned\n");
+		return; /* BUG(); */
+	}
+	if (pgd_none(*pgd)) {
+		printk ("set_pmd_pfn: pgd_none\n");
+		return; /* BUG(); */
+	}
+	pmd = pmd_offset(pgd, vaddr);
+	set_pmd(pmd, pfn_pmd(pfn, flags));
+	/*
+	 * It's enough to flush this one mapping.
+	 * (PGE mappings get flushed as well)
+	 */
+	__flush_tlb_one(vaddr);
+}
 
 void __set_fixmap (enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
 {
diff -duprb linux-2.6.0-test7/drivers/char/mem.c testdir/drivers/char/mem.c
--- linux-2.6.0-test7/drivers/char/mem.c	Wed Oct  8 12:24:06 2003
+++ testdir/drivers/char/mem.c	Sat Nov 22 17:53:41 2003
@@ -24,6 +24,9 @@
 #include <linux/smp_lock.h>
 #include <linux/devfs_fs_kernel.h>
 #include <linux/ptrace.h>
+#ifdef CONFIG_HIGHMEM
+#include <linux/highmem.h>
+#endif
 
 #include <asm/uaccess.h>
 #include <asm/io.h>
@@ -104,6 +107,36 @@ static ssize_t do_write_mem(struct file 
 	return written;
 }
 
+#ifdef CONFIG_HIGHMEM
+static ssize_t read_highmem(struct file * file, char * buf,
+ 			size_t count, loff_t *ppos)
+{
+	unsigned long p = *ppos;
+	ssize_t read = 0;
+	int off, pfn = p >> PAGE_SHIFT;
+	char *pp;
+	struct page *page;
+
+	if (! pfn_valid(pfn))
+		return 0;
+	page = pfn_to_page(pfn);
+	pp = kmap(page);
+
+	off = p & (PAGE_SIZE - 1);
+	if (count > PAGE_SIZE - off)
+		count = PAGE_SIZE - off;
+
+	if (copy_to_user(buf, pp + off, count)) {
+		kunmap(page);
+		return -EFAULT;
+	}
+	read += count;
+	*ppos += read;
+	kunmap(page);
+	return read;
+}
+
+#endif
 
 /*
  * This funcion reads the *physical* memory. The f_pos points directly to the 
@@ -118,7 +151,11 @@ static ssize_t read_mem(struct file * fi
 
 	end_mem = __pa(high_memory);
 	if (p >= end_mem)
+#ifdef CONFIG_HIGHMEM
+		return read_highmem(file, buf, count, ppos);
+#else
 		return 0;
+#endif
 	if (count > end_mem - p)
 		count = end_mem - p;
 	read = 0;
diff -duprb linux-2.6.0-test7/fs/proc/kcore.c testdir/fs/proc/kcore.c
--- linux-2.6.0-test7/fs/proc/kcore.c	Wed Oct  8 12:24:07 2003
+++ testdir/fs/proc/kcore.c	Sat Nov 22 17:54:58 2003
@@ -387,7 +387,7 @@ read_kcore(struct file *file, char __use
 			}
 			kfree(elf_buf);
 		} else {
-			if (kern_addr_valid(start)) {
+			if (1 /*kern_addr_valid(start)*/) {
 				unsigned long n;
 
 				n = copy_to_user(buffer, (char *)start, tsz);
diff -duprb linux-2.6.0-test7/include/asm-i386/mmzone.h testdir/include/asm-i386/mmzone.h
--- linux-2.6.0-test7/include/asm-i386/mmzone.h	Wed Oct  8 12:24:06 2003
+++ testdir/include/asm-i386/mmzone.h	Sat Nov 22 17:54:41 2003
@@ -128,6 +128,7 @@ static inline struct pglist_data *pfn_to
 #endif /* CONFIG_X86_NUMAQ */
 
 extern int get_memcfg_numa_flat(void );
+extern int get_memcfg_numa_blks(void );
 /*
  * This allows any one NUMA architecture to be compiled
  * for, and still fall back to the flat function if it
@@ -140,6 +141,9 @@ static inline void get_memcfg_numa(void)
 		return;
 #elif CONFIG_ACPI_SRAT
 	if (get_memcfg_from_srat())
+		return;
+#elif CONFIG_MEMHOTPLUGTEST
+	if (get_memcfg_numa_blks())
 		return;
 #endif
 
diff -duprb linux-2.6.0-test7/include/asm-i386/numnodes.h testdir/include/asm-i386/numnodes.h
--- linux-2.6.0-test7/include/asm-i386/numnodes.h	Wed Oct  8 12:24:02 2003
+++ testdir/include/asm-i386/numnodes.h	Sat Nov 22 17:54:41 2003
@@ -13,6 +13,10 @@
 /* Max 8 Nodes */
 #define NODES_SHIFT	3
 
+#elif defined(CONFIG_MEMHOTPLUGTEST)
+
+#define NODES_SHIFT	3
+
 #endif /* CONFIG_X86_NUMAQ */
 
 #endif /* _ASM_MAX_NUMNODES_H */
diff -duprb linux-2.6.0-test7/include/linux/mm.h testdir/include/linux/mm.h
--- linux-2.6.0-test7/include/linux/mm.h	Wed Oct  8 12:24:01 2003
+++ testdir/include/linux/mm.h	Sat Nov 22 17:54:21 2003
@@ -219,7 +219,14 @@ struct page {
  */
 #define put_page_testzero(p)				\
 	({						\
-		BUG_ON(page_count(p) == 0);		\
+		if (page_count(p) == 0) {		\
+			int i;						\
+			printk("Page: %lx ", (long)p);			\
+			for(i = 0; i < sizeof(struct page); i++)	\
+				printk(" %02x", ((unsigned char *)p)[i]); \
+			printk("\n");					\
+			BUG();				\
+		}					\
 		atomic_dec_and_test(&(p)->count);	\
 	})
 
@@ -622,5 +629,11 @@ kernel_map_pages(struct page *page, int 
 }
 #endif
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+#define	page_trace(p)	page_trace_func(p, __FUNCTION__, __LINE__)
+extern void page_trace_func(const struct page *, const char *, int);
+#else
+#define	page_trace(p)	do { } while(0)
+#endif
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff -duprb linux-2.6.0-test7/include/linux/mmzone.h testdir/include/linux/mmzone.h
--- linux-2.6.0-test7/include/linux/mmzone.h	Wed Oct  8 12:24:08 2003
+++ testdir/include/linux/mmzone.h	Sat Nov 22 17:54:23 2003
@@ -174,6 +174,7 @@ struct zone {
  * footprint of this construct is very small.
  */
 struct zonelist {
+	rwlock_t zonelist_lock; 
 	struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
 };
 
@@ -235,10 +236,29 @@ void wakeup_kswapd(struct zone *zone);
  * next_zone - helper magic for for_each_zone()
  * Thanks to William Lee Irwin III for this piece of ingenuity.
  */
+extern char zone_active[];
+
 static inline struct zone *next_zone(struct zone *zone)
 {
 	pg_data_t *pgdat = zone->zone_pgdat;
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+	unsigned int zone_idx = zone - pgdat->node_zones;
+	do{
+		if (zone_idx < MAX_NR_ZONES -1){
+			zone++;
+			zone_idx++;
+		}else if (pgdat->pgdat_next){
+			pgdat = pgdat->pgdat_next;
+			zone = pgdat->node_zones;
+			zone_idx=0;
+		}else
+			return NULL;
+	}while(!zone_active[pgdat->node_id * MAX_NR_ZONES + zone_idx]);
+
+	return zone;
+
+#else
 	if (zone - pgdat->node_zones < MAX_NR_ZONES - 1)
 		zone++;
 	else if (pgdat->pgdat_next) {
@@ -248,6 +268,7 @@ static inline struct zone *next_zone(str
 		zone = NULL;
 
 	return zone;
+#endif
 }
 
 /**
@@ -359,6 +380,10 @@ static inline unsigned int num_online_me
 	}
 	return num;
 }
+
+#ifdef CONFIG_MEMHOTPLUGTEST
+int zone_activep(const struct zone *);
+#endif
 
 #else /* !CONFIG_DISCONTIGMEM && !CONFIG_NUMA */
 
diff -duprb linux-2.6.0-test7/mm/page_alloc.c testdir/mm/page_alloc.c
--- linux-2.6.0-test7/mm/page_alloc.c	Wed Oct  8 12:24:01 2003
+++ testdir/mm/page_alloc.c	Tue Nov 25 18:48:01 2003
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/proc_fs.h>
 
 #include <asm/tlbflush.h>
 
@@ -52,6 +53,11 @@ EXPORT_SYMBOL(nr_swap_pages);
  */
 struct zone *zone_table[MAX_NR_ZONES*MAX_NUMNODES];
 EXPORT_SYMBOL(zone_table);
+#ifdef CONFIG_MEMHOTPLUGTEST
+char zone_active[MAX_NR_ZONES*MAX_NUMNODES];
+EXPORT_SYMBOL(zone_active);
+static const struct page *page_trace_list[10];
+#endif
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
@@ -512,9 +518,28 @@ static struct page *buffered_rmqueue(str
 		mod_page_state(pgalloc, 1 << order);
 		prep_new_page(page, order);
 	}
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (! zone_active[page->flags >> ZONE_SHIFT])
+		BUG();
+#endif
 	return page;
 }
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+int
+zone_activep(const struct zone *z)
+{
+	int i;
+
+	for(i = 0; ; i++) {
+		if (zone_table[i] == z)
+			return zone_active[i];
+		if (zone_table[i] == NULL)
+			BUG();
+	}
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  *
@@ -544,6 +569,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 	int i;
 	int cold;
 	int do_retry;
+	unsigned long flag;
 
 	might_sleep_if(wait);
 
@@ -551,10 +577,13 @@ __alloc_pages(unsigned int gfp_mask, uns
 	if (gfp_mask & __GFP_COLD)
 		cold = 1;
 
+	read_lock_irqsave(&zonelist->zonelist_lock,flag);
 	zones = zonelist->zones;  /* the list of zones suitable for gfp_mask */
 	classzone = zones[0]; 
-	if (classzone == NULL)    /* no zones in the zonelist */
+	if (classzone == NULL){    /* no zones in the zonelist */
+		read_unlock_irqrestore(&zonelist->zonelist_lock,flag);
 		return NULL;
+	}
 
 	/* Go through the zonelist once, looking for a zone with enough free */
 	min = 1UL << order;
@@ -562,6 +591,10 @@ __alloc_pages(unsigned int gfp_mask, uns
 		struct zone *z = zones[i];
 		unsigned long local_low;
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+		if (! zone_activep(z))
+			continue;
+#endif
 		/*
 		 * This is the fabled 'incremental min'. We let real-time tasks
 		 * dip their real-time paws a little deeper into reserves.
@@ -589,6 +622,10 @@ __alloc_pages(unsigned int gfp_mask, uns
 	for (i = 0; zones[i] != NULL; i++) {
 		unsigned long local_min;
 		struct zone *z = zones[i];
+#ifdef CONFIG_MEMHOTPLUGTEST
+		if (! zone_activep(z))
+			continue;
+#endif
 
 		local_min = z->pages_min;
 		if (gfp_mask & __GFP_HIGH)
@@ -612,6 +649,10 @@ rebalance:
 		/* go through the zonelist yet again, ignoring mins */
 		for (i = 0; zones[i] != NULL; i++) {
 			struct zone *z = zones[i];
+#ifdef CONFIG_MEMHOTPLUGTEST
+			if (! zone_activep(z))
+				continue;
+#endif
 
 			page = buffered_rmqueue(z, order, cold);
 			if (page)
@@ -668,6 +709,7 @@ rebalance:
 	}
 
 nopage:
+	read_unlock_irqrestore(&zonelist->zonelist_lock,flag);
 	if (!(gfp_mask & __GFP_NOWARN)) {
 		printk("%s: page allocation failure."
 			" order:%d, mode:0x%x\n",
@@ -676,6 +718,24 @@ nopage:
 	return NULL;
 got_pg:
 	kernel_map_pages(page, 1 << order, 1);
+#if 1 // debug
+	/* Validate page */
+	{
+		struct zone *z = page_zone(page);
+		int idx = page - z->zone_mem_map;
+		if (idx < 0 || idx >= z->spanned_pages) {
+			printk("0x%08x %d\n", (int)(page->flags >> ZONE_SHIFT), idx);
+			read_unlock(&zonelist->zonelist_lock);
+			BUG();
+		}
+	}
+#endif
+#ifdef CONFIG_MEMHOTPLUGTEST
+	read_unlock_irqrestore(&zonelist->zonelist_lock,flag);
+	if (! zone_active[page->flags >> ZONE_SHIFT]){
+		BUG();
+	}
+#endif
 	return page;
 }
 
@@ -1046,7 +1106,11 @@ void show_free_areas(void)
 /*
  * Builds allocation fallback zone lists.
  */
+#ifdef CONFIG_MEMHOTPLUGTEST
+static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
+#else
 static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
+#endif
 {
 	switch (k) {
 		struct zone *zone;
@@ -1076,6 +1140,9 @@ static int __init build_zonelists_node(p
 static void __init build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
+#ifdef CONFIG_MEMHOTPLUGTEST
+	struct zone *zone;
+#endif
 
 	local_node = pgdat->node_id;
 	printk("Building zonelist for node : %d\n", local_node);
@@ -1092,6 +1159,7 @@ static void __init build_zonelists(pg_da
 		if (i & __GFP_DMA)
 			k = ZONE_DMA;
 
+#ifndef CONFIG_MEMHOTPLUGTEST
  		j = build_zonelists_node(pgdat, zonelist, j, k);
  		/*
  		 * Now we build the zonelist so that it contains the zones
@@ -1107,6 +1175,26 @@ static void __init build_zonelists(pg_da
  			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
  
 		zonelist->zones[j++] = NULL;
+#else
+		rwlock_init(&zonelist->zonelist_lock);
+		for(; k >= 0; k--) {
+			zone = pgdat->node_zones + k;
+			if (!zone_activep(zone))
+				continue;
+			if (zone->present_pages)
+				zonelist->zones[j++] = zone;
+			for (node = local_node + 1; node < numnodes; node++) {
+				zone = NODE_DATA(node)->node_zones + k;
+				if (zone_activep(zone) && zone->present_pages)
+					zonelist->zones[j++] = zone;
+			}
+			for (node = 0; node < local_node; node++) {
+				zone = NODE_DATA(node)->node_zones + k;
+				if (zone_activep(zone) && zone->present_pages)
+					zonelist->zones[j++] = zone;
+			}
+		}
+#endif
 	} 
 }
 
@@ -1162,8 +1250,14 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
+#if CONFIG_MEMHOTPLUGTEST
+static void calculate_zone_totalpages(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+#else
 static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
 		unsigned long *zones_size, unsigned long *zholes_size)
+#endif
+
 {
 	unsigned long realtotalpages, totalpages = 0;
 	int i;
@@ -1199,6 +1293,20 @@ static void __init calculate_zone_bitmap
 	}
 }
 
+#if CONFIG_MEMHOTPLUGTEST
+static void calculate_addzone_bitmap(struct pglist_data *pgdat, unsigned long *zones_size)
+{
+	unsigned long size = zones_size[ZONE_HIGHMEM];
+
+	size = LONG_ALIGN((size + 7) >> 3);
+	if (size) {
+		pgdat->valid_addr_bitmap = (unsigned long *)kmalloc(size,GFP_KERNEL);
+		memset(pgdat->valid_addr_bitmap, 0, size);
+	}
+}
+
+#endif
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -1252,6 +1360,45 @@ static void __init free_area_init_core(s
 		unsigned long batch;
 
 		zone_table[nid * MAX_NR_ZONES + j] = zone;
+#ifdef CONFIG_MEMHOTPLUGTEST
+										/* only node 0 is active */
+		if ( nid ){ 							/* nodes 1-... are not active */
+										/* XXX : This should be changed. */
+			zone_active[nid * MAX_NR_ZONES + j ] = 0;
+			zone->spanned_pages = 0;
+			zone->present_pages = 0;
+			zone->name = zone_names[j];
+			spin_lock_init(&zone->lock);
+			spin_lock_init(&zone->lru_lock);
+			zone->zone_pgdat = pgdat;
+			zone->free_pages = 0;
+			for (cpu = 0; cpu < NR_CPUS; cpu++) {
+				struct per_cpu_pages *pcp;
+
+				pcp = &zone->pageset[cpu].pcp[0];	/* hot */
+				pcp->count = 0;
+				pcp->low = 0;
+				pcp->high = 0;
+				pcp->batch = 0;
+				INIT_LIST_HEAD(&pcp->list);
+
+				pcp = &zone->pageset[cpu].pcp[1];	/* cold */
+				pcp->count = 0;
+				pcp->low = 0;
+				pcp->high = 0;
+				pcp->batch = 0;
+				INIT_LIST_HEAD(&pcp->list);
+			}
+			INIT_LIST_HEAD(&zone->active_list);
+			INIT_LIST_HEAD(&zone->inactive_list);
+			atomic_set(&zone->refill_counter, 0);
+			zone->nr_active = 0;
+			zone->nr_inactive = 0;
+
+			continue;
+		}
+		zone_active[nid * MAX_NR_ZONES + j ] =  1 ;		/* only node 0 is active */
+#endif
 		realsize = size = zones_size[j];
 		if (zholes_size)
 			realsize -= zholes_size[j];
@@ -1295,8 +1442,8 @@ static void __init free_area_init_core(s
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
 		}
-		printk("  %s zone: %lu pages, LIFO batch:%lu\n",
-				zone_names[j], realsize, batch);
+		printk("  %s zone: %lu pages, LIFO batch:%lu start:%lu\n",
+				zone_names[j], realsize, batch, zone_start_pfn);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		atomic_set(&zone->refill_counter, 0);
@@ -1381,14 +1528,22 @@ void __init free_area_init_node(int nid,
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
 	calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (!node_mem_map && !nid) {
+#else
 	if (!node_mem_map) {
+#endif
 		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
 		node_mem_map = alloc_bootmem_node(pgdat, size);
 	}
 	pgdat->node_mem_map = node_mem_map;
 
 	free_area_init_core(pgdat, zones_size, zholes_size);
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if(!nid)memblk_set_online(node_to_memblk(nid));		/* only node 0 is online */
+#else
 	memblk_set_online(node_to_memblk(nid));
+#endif
 
 	calculate_zone_bitmap(pgdat, zones_size);
 }
@@ -1644,3 +1799,387 @@ int min_free_kbytes_sysctl_handler(ctl_t
 	setup_per_zone_pages_min();
 	return 0;
 }
+
+#ifdef CONFIG_MEMHOTPLUGTEST
+static void rebuild_all_zonelist(unsigned long nid)
+{
+	struct zonelist *zonelist;
+	unsigned long node, p_node, j=0;
+
+
+	zonelist = NODE_DATA(nid)->node_zonelists + ZONE_HIGHMEM;
+	write_lock(&zonelist->zonelist_lock);
+	memset(zonelist, 0, sizeof(*zonelist));
+
+	/* build zonelist for added zone */
+	j= build_zonelists_node( NODE_DATA(nid), zonelist, j, ZONE_HIGHMEM);
+
+	for ( node = nid + 1; node < numnodes; node++)
+		j = build_zonelists_node( NODE_DATA(node), zonelist, j, ZONE_HIGHMEM);
+	for (node = 0; node < nid ; node++)
+		j = build_zonelists_node( NODE_DATA(node), zonelist, j, ZONE_HIGHMEM);
+
+
+	/* rebuild zonelist for other node */
+	for( p_node = 0; p_node < numnodes ; p_node++){
+		zonelist = NODE_DATA(p_node)->node_zonelists + ZONE_HIGHMEM;
+		j=0;
+
+		j = build_zonelists_node( NODE_DATA(p_node), zonelist, j, ZONE_HIGHMEM);
+
+		for ( node = p_node + 1 ; node < numnodes ; node++ )
+			j = build_zonelists_node( NODE_DATA(node), zonelist, j, ZONE_HIGHMEM);
+		for ( node = 0; node < p_node; node++ )
+			j = build_zonelists_node( NODE_DATA(node), zonelist, j, ZONE_HIGHMEM);
+		zonelist->zones[j++] = NULL;
+
+	}
+	write_unlock(&zonelist->zonelist_lock);
+}
+
+
+static void free_area_add_core(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long i;
+	const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
+	int cpu, nid = pgdat->node_id;
+	struct page *lmem_map = pgdat->node_mem_map;
+	unsigned long zone_start_pfn = pgdat->node_start_pfn;
+
+	pgdat->nr_zones = 0;
+	init_waitqueue_head(&pgdat->kswapd_wait);
+
+	{
+		struct zone *zone = pgdat->node_zones + ZONE_HIGHMEM;
+		unsigned long size, realsize;
+		unsigned long batch;
+
+		zone_table[nid * MAX_NR_ZONES + ZONE_HIGHMEM] = zone;
+
+		realsize = size = zones_size[ZONE_HIGHMEM];
+		if (zholes_size)
+			realsize -= zholes_size[ZONE_HIGHMEM];
+
+		zone->spanned_pages = size;
+		zone->present_pages = realsize;
+		zone->name = zone_names[ZONE_HIGHMEM];
+		spin_lock_init(&zone->lock);
+		spin_lock_init(&zone->lru_lock);
+		zone->zone_pgdat = pgdat;
+		zone->free_pages = 0;
+
+		/*
+		 * The per-cpu-pages pools are set to around 1000th of the
+		 * size of the zone.  But no more than 1/4 of a meg - there's
+		 * no point in going beyond the size of L2 cache.
+		 *
+		 * OK, so we don't know how big the cache is.  So guess.
+		 */
+		batch = zone->present_pages / 1024;
+		if (batch * PAGE_SIZE > 256 * 1024)
+			batch = (256 * 1024) / PAGE_SIZE;
+		batch /= 4;		/* We effectively *= 4 below */
+		if (batch < 1)
+			batch = 1;
+
+		for (cpu = 0; cpu < NR_CPUS; cpu++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone->pageset[cpu].pcp[0];	/* hot */
+			pcp->count = 0;
+			pcp->low = 2 * batch;
+			pcp->high = 6 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[1];	/* cold */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
+		}
+		printk("  %s zone: %lu pages, LIFO batch:%lu start:%lu\n",
+				zone_names[ZONE_HIGHMEM], realsize, batch, zone_start_pfn);
+		INIT_LIST_HEAD(&zone->active_list);
+		INIT_LIST_HEAD(&zone->inactive_list);
+		atomic_set(&zone->refill_counter, 0);
+		zone->nr_active = 0;
+		zone->nr_inactive = 0;
+
+		/*
+		 * The per-page waitqueue mechanism uses hashed waitqueues
+		 * per zone.
+		 */
+		zone->wait_table_size = wait_table_size(size);
+		zone->wait_table_bits =
+			wait_table_bits(zone->wait_table_size);
+		zone->wait_table = (wait_queue_head_t *)kmalloc(zone->wait_table_size
+						* sizeof(wait_queue_head_t), GFP_KERNEL);
+				/* XXX: wait_table might have to be allocate own node. */
+
+		for(i = 0; i < zone->wait_table_size; ++i)
+			init_waitqueue_head(zone->wait_table + i);
+
+		pgdat->nr_zones = ZONE_HIGHMEM+1;
+
+		zone->zone_mem_map = lmem_map;
+		zone->zone_start_pfn = zone_start_pfn;
+
+		if ((zone_start_pfn) & (zone_required_alignment-1))
+			printk("BUG: wrong zone alignment, it will crash\n");
+
+		memmap_init_zone(lmem_map, size, nid, ZONE_HIGHMEM, zone_start_pfn);
+
+		for (i = 0; ; i++) {
+			unsigned long bitmap_size;
+
+			INIT_LIST_HEAD(&zone->free_area[i].free_list);
+			if (i == MAX_ORDER-1) {
+				zone->free_area[i].map = NULL;
+				break;
+			}
+
+			/*
+			 * Page buddy system uses "index >> (i+1)",
+			 * where "index" is at most "size-1".
+			 *
+			 * The extra "+3" is to round down to byte
+			 * size (8 bits per byte assumption). Thus
+			 * we get "(size-1) >> (i+4)" as the last byte
+			 * we can access.
+			 *
+			 * The "+1" is because we want to round the
+			 * byte allocation up rather than down. So
+			 * we should have had a "+7" before we shifted
+			 * down by three. Also, we have to add one as
+			 * we actually _use_ the last bit (it's [0,n]
+			 * inclusive, not [0,n[).
+			 *
+			 * So we actually had +7+1 before we shift
+			 * down by 3. But (n+8) >> 3 == (n >> 3) + 1
+			 * (modulo overflows, which we do not have).
+			 *
+			 * Finally, we LONG_ALIGN because all bitmap
+			 * operations are on longs.
+			 */
+			bitmap_size = (size-1) >> (i+4);
+			bitmap_size = LONG_ALIGN(bitmap_size+1);
+			zone->free_area[i].map =
+			  (unsigned long *) kmalloc(bitmap_size, GFP_KERNEL);
+				/* XXX: bitmap might have to be allocate own node too. */
+		}
+	}
+}
+
+extern void *node_remap_start_vaddr[];
+
+void free_area_add_node(int nid, struct pglist_data *pgdat,unsigned long *zones_size,
+		unsigned long node_start_pfn, unsigned long *zholes_size)
+{
+	unsigned long size;
+
+	calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+
+	size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
+	remap_add_node_kva(nid);
+
+	free_area_add_core(pgdat, zones_size, zholes_size);
+	calculate_addzone_bitmap(pgdat, zones_size);
+
+}
+
+extern unsigned long node_start_pfn[];
+extern unsigned long node_end_pfn[];
+
+static void node_enable(unsigned long nid)
+{
+	unsigned long idx = nid * MAX_NR_ZONES + ZONE_HIGHMEM;
+	unsigned long zones_size[MAX_NR_ZONES] =  {0, 0, 0};
+	unsigned long *zholes_size;
+
+	if (nid > numnodes){		/* XXX : nid should be contiguous for now */
+					/*       but this should be changed */
+		printk("nid=%d isn't possible to enable\n", nid);
+		return;
+	}
+
+	if (node_online(nid)){
+		printk("nid=%d is already enabled\n", nid);
+		return;
+	}
+
+	zones_size[ZONE_HIGHMEM] = node_end_pfn[nid] - node_start_pfn[nid];
+					/* XXX: This information should be got from firmware.
+					        However, this is emulation. */
+	if( !zones_size[ZONE_HIGHMEM] ){
+		printk("nid=%d is size 0\n",nid);
+		return;
+	}
+
+	zholes_size = get_zholes_size(nid);
+
+	free_area_add_node(nid, NODE_DATA(nid), zones_size, node_start_pfn[nid], zholes_size);
+
+	setup_per_zone_pages_min();	/* set up again */
+
+	rebuild_all_zonelist( nid);
+	memblk_set_online(node_to_memblk(nid));
+	node_set_online(nid);
+	zone_active[idx] = 1;
+
+}
+
+static int mhtest_read(char *page, char **start, off_t off, int count,
+    int *eof, void *data)
+{
+	char *p;
+	int i, len;
+	const struct zone *z;
+
+	p = page;
+	for(i = 0; ; i++) {
+		z = zone_table[i];
+		if (z == NULL)
+			break;
+		if (! z->present_pages)
+			/* skip empty zone */
+			continue;
+		len = sprintf(p, "Zone %d: %sabled free %d, active %d, present %d\n", i,
+		    zone_active[i] ? "en" : "dis", (int)z->free_pages, (int)z->nr_active,
+		    (int)z->present_pages);
+		p += len;
+	}
+	len = p - page;
+
+	if (len <= off + count)
+		*eof = 1;
+	*start = page + off;
+	len -= off;
+	if (len < 0)
+		len = 0;
+	if (len > count)
+		len = count;
+
+	return len;
+}
+
+static int mhtest_write(struct file *file, const char *buffer,
+    unsigned long count, void *data)
+{
+	unsigned long idx;
+	char buf[64], *p;
+	int i;
+
+	if (count > sizeof(buf) - 1)
+		count = sizeof(buf) - 1;
+	if (copy_from_user(buf, buffer, count))
+		return -EFAULT;
+
+	buf[count] = 0;
+
+	p = strchr(buf, ' ');
+	if (p == NULL)
+		goto out;
+
+	*p++ = '\0';
+	idx = simple_strtoul(p, NULL, 0);
+
+	if (strcmp(buf, "trace") == 0) {
+		for(i = 0; i < sizeof(page_trace_list) /
+		    sizeof(page_trace_list[0]); i++)
+			if (page_trace_list[i] == NULL) {
+				page_trace_list[i] = (struct page *)idx;
+				printk("add trace %lx\n", (unsigned long)idx);
+				goto out;
+			}
+		printk("page_trace_list is full (not added)\n");
+		goto out;
+	} else if (strcmp(buf, "untrace") == 0) {
+		for(i = 0; i < sizeof(page_trace_list) /
+		    sizeof(page_trace_list[0]); i++)
+			if (page_trace_list[i] == (struct page *)idx)
+				break;
+		if (i == sizeof(page_trace_list) / sizeof(page_trace_list[0])) {
+			printk("not registered\n");
+			goto out;
+		}
+		for(; i < sizeof(page_trace_list) /
+		    sizeof(page_trace_list[0]) - 1; i++)
+			page_trace_list[i] = page_trace_list[i + 1];
+		page_trace_list[i] = NULL;
+		goto out;
+	}
+	if (idx > MAX_NUMNODES) {
+		printk("Argument out of range\n");
+		goto out;
+	}
+	if (strcmp(buf, "disable") == 0) {
+		printk("disable node = %d\n", (int)idx);	/* XXX */
+		goto out;
+	} else if (strcmp(buf, "purge") == 0) {
+		/* XXX */
+	} else if (strcmp(buf, "enable") == 0) {
+		printk("enable node = %d\n", (int)idx);
+		node_enable(idx);
+	} else if (strcmp(buf, "active") == 0) {
+		/*
+		if (zone_table[idx] == NULL)
+			goto out;
+		spin_lock_irq(&zone_table[idx]->lru_lock);
+		i = 0;
+		list_for_each(l, &zone_table[idx]->active_list) {
+			printk(" %lx", (unsigned long)list_entry(l, struct page, lru));
+			i++;
+			if (i == 10)
+				break;
+		}
+		spin_unlock_irq(&zone_table[idx]->lru_lock);
+		printk("\n");
+		*/
+	} else if (strcmp(buf, "inuse") == 0) {
+		/*
+		if (zone_table[idx] == NULL)
+			goto out;
+		for(i = 0; i < zone_table[idx]->spanned_pages; i++)
+			if (page_count(&zone_table[idx]->zone_mem_map[i]))
+				printk(" %lx", (unsigned long)&zone_table[idx]->zone_mem_map[i]);
+		printk("\n");
+		*/
+	}
+out:
+	return count;
+}
+
+static int __init procmhtest_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("memhotplug", 0, NULL);
+	if (entry == NULL)
+		return -1;
+
+	entry->read_proc = &mhtest_read;
+	entry->write_proc = &mhtest_write;
+	return 0;
+}
+__initcall(procmhtest_init);
+
+void
+page_trace_func(const struct page *p, const char *func, int line) {
+	int i;
+
+	for(i = 0; i < sizeof(page_trace_list) /
+	    sizeof(page_trace_list[0]); i++) {
+		if (page_trace_list[i] == NULL)
+			return;
+		if (page_trace_list[i] == p)
+			break;
+	}
+	if (i == sizeof(page_trace_list) / sizeof(page_trace_list[0]))
+		return;
+
+	printk("Page %lx, %s %d\n", (unsigned long)p, func, line);
+}
+#endif


* Re: memory hotremove prototype, take 3
@ 2003-12-10  0:45 Luck, Tony
  0 siblings, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2003-12-10  0:45 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

> If your target is NUMA, then you really, really need CONFIG_NONLINEAR.
> We don't support multiple pgdats per node, nor do I wish to, as it'll
> make an unholy mess ;-). With CONFIG_NONLINEAR, the discontiguities
> within a node are buried down further, so we have much less complexity
> to deal with from the main VM. The abstraction also keeps the poor
> VM engineers trying to read / write the code saner via simplicity ;-)
> 
> WRT generic discontigmem support (not NUMA), doing that via pgdats
> should really go away, as there's no real difference between the 
> chunks of physical memory as far as the page allocator is concerned.
> The plan is to use Daniel's nonlinear stuff to replace that, and keep
> the pgdats strictly for NUMA. Same would apply to hotpluggable zones - 
> I'd hate to end up with 512 pgdats of stuff that are really all the
> same memory types underneath.

I guess this all depends on whether you allow bits of memory on
nodes to be hot-plugged ... or insist on the whole node being
added/removed in one fell swoop.  I'd expect the latter to be
a more common model, and in that case the "pgdat-for-the-node" is
the same as the "pgdat-for-the-hot-plug-zone", so you don't have
a proliferation of pgdats to support hotplug.

> The real issue you have is the mapping of the struct pages - if we can
> achieve a non-contig mapping of the mem_map / lmem_map array, we should
> be able to take memory on and offline reasonably easy. If you're willing
> for a first implementation to pre-allocate the struct page array for 
> every possible virtual address, it makes life a lot easier.

On 64-bit systems with CONFIG_VIRTUAL_MEMMAP, this would be trivial,
and avoids the need for the extra level of indirection in the psection[]
and vsection[] arrays in CONFIG_NONLINEAR (ok ... it doesn't really get
rid of the indirection, as the page table lookup to access the virtual
mem_map effectively ends up doing the same thing).
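
The virtual mem_map idea can be imitated entirely in userspace: reserve
one huge, sparse virtual array of page descriptors and let only the
touched parts get physical backing.  This is only an analogy for the
kernel mechanism, with invented sizes (64-bit only, ~8 GB of virtual
space):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Stand-in for struct page (the real one is a few dozen bytes). */
struct page { unsigned long flags; void *lru[2]; unsigned long count; };

int main(void)
{
	/* Enough descriptors for 1 TB of 4 KB pages. */
	size_t npages = (1UL << 40) / 4096;
	size_t bytes  = npages * sizeof(struct page);

	/* Sparse virtual array: no physical memory is committed yet. */
	struct page *vmem_map = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
				     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
				     -1, 0);
	if (vmem_map == MAP_FAILED) { perror("mmap"); return 1; }

	/* pfn_to_page() becomes simple pointer arithmetic ... */
	unsigned long pfn = 123456;
	struct page *pg = vmem_map + pfn;

	/* ... and only the touched region gets real backing pages. */
	memset(pg, 0, sizeof(*pg));
	printf("page for pfn %lu lives at %p\n", pfn, (void *)pg);

	munmap(vmem_map, bytes);
	return 0;
}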
 
> Adding the other layer of indirection for accessing the struct page array
> should fix up most of that, and is very easily abstracted out via the
> pfn_to_page macros and friends. I ripped out all the direct references
> to mem_map indexing already in 2.6, so it should all be nicely 
> abstracted out.

I did go back and look at the CONFIG_NONLINEAR patch again, and I
still can't see how to make it useful on 64-bit machines. Jack
Steiner asked a bunch of questions on how it would work for an
architecture like the SGI:
http://marc.theaimsgroup.com/?l=lse-tech&m=101828803506249&w=2
I don't remember seeing any answers on the list.  Assuming he
were to use a section size of 64MB (a convenient number for ia64)
he'd end up with psection[]/vsection[] tables with 8 million
entries each (@ 4 bytes/entry -> 64MB for the pair).
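
Spelling out that arithmetic (assuming the 8-million figure corresponds
to describing roughly a 2^49-byte, i.e. 512 TB, physical range in 64 MB
sections):

#include <stdio.h>

int main(void)
{
	unsigned long long addr_space   = 1ULL << 49;   /* ~512 TB        */
	unsigned long long section_size = 64ULL << 20;  /* 64 MB sections */
	unsigned long long entries      = addr_space / section_size;
	unsigned long long bytes_each   = entries * 4;  /* 4 bytes/entry  */

	printf("entries per table     : %llu (~8 million)\n", entries);
	printf("one table             : %llu MB\n", bytes_each >> 20);
	printf("psection[]+vsection[] : %llu MB\n", (2 * bytes_each) >> 20);
	return 0;
}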

-Tony Luck  



* Re: memory hotremove prototype, take 3
  2003-12-04 18:29           ` Martin J. Bligh
@ 2003-12-04 18:59             ` Jesse Barnes
  0 siblings, 0 replies; 17+ messages in thread
From: Jesse Barnes @ 2003-12-04 18:59 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, linux-mm

On Thu, Dec 04, 2003 at 10:29:53AM -0800, Martin J. Bligh wrote:
> >> IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
> >> clarify this issue when his code gets ready. :-)
> > 
> > Not on all systems.  On sn2 we use ia64's virtual memmap to make memory
> > within a node appear contiguous, even though it may not be.
> 
> Wasn't there a plan to get rid of that though? I forget what it was,
> probably using CONFIG_NONLINEAR too ... ?

I think CONFIG_NONLINEAR would do the trick, but no one's done the work
yet :)

Jesse


* Re: memory hotremove prototype, take 3
  2003-12-04 18:27         ` Jesse Barnes
@ 2003-12-04 18:29           ` Martin J. Bligh
  2003-12-04 18:59             ` Jesse Barnes
  0 siblings, 1 reply; 17+ messages in thread
From: Martin J. Bligh @ 2003-12-04 18:29 UTC (permalink / raw)
  To: Jesse Barnes, IWAMOTO Toshihiro; +Cc: linux-kernel, linux-mm

>> IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
>> clarify this issue when his code gets ready. :-)
> 
> Not on all systems.  On sn2 we use ia64's virtual memmap to make memory
> within a node appear contiguous, even though it may not be.

Wasn't there a plan to get rid of that though? I forget what it was,
probably using CONFIG_NONLINEAR too ... ?

M.



* Re: memory hotremove prototype, take 3
  2003-12-04 15:44       ` IWAMOTO Toshihiro
  2003-12-04 17:12         ` Martin J. Bligh
@ 2003-12-04 18:27         ` Jesse Barnes
  2003-12-04 18:29           ` Martin J. Bligh
  1 sibling, 1 reply; 17+ messages in thread
From: Jesse Barnes @ 2003-12-04 18:27 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: Martin J. Bligh, linux-kernel, linux-mm

On Fri, Dec 05, 2003 at 12:44:06AM +0900, IWAMOTO Toshihiro wrote:
> IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
> clarify this issue when his code gets ready. :-)

Not on all systems.  On sn2 we use ia64's virtual memmap to make memory
within a node appear contiguous, even though it may not be.

Jesse


* Re: memory hotremove prototype, take 3
  2003-12-04 15:44       ` IWAMOTO Toshihiro
@ 2003-12-04 17:12         ` Martin J. Bligh
  2003-12-04 18:27         ` Jesse Barnes
  1 sibling, 0 replies; 17+ messages in thread
From: Martin J. Bligh @ 2003-12-04 17:12 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: linux-kernel, linux-mm

>> > My target is somewhat NUMA-ish and fairly large.  So I'm not sure if
>> > CONFIG_NONLINEAR fits, but CONFIG_NUMA isn't perfect either.
>> 
>> If your target is NUMA, then you really, really need CONFIG_NONLINEAR.
>> We don't support multiple pgdats per node, nor do I wish to, as it'll
>> make an unholy mess ;-). With CONFIG_NONLINEAR, the discontiguities
>> within a node are buried down further, so we have much less complexity
>> to deal with from the main VM. The abstraction also keeps the poor
>> VM engineers trying to read / write the code saner via simplicity ;-)
> 
> IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
> clarify this issue when his code gets ready. :-)

Right - but then you can't use discontigmem's multiple pgdat's inside
a node to implement hotplug mem for NUMA systems.
 
> Preallocating struct page array isn't feasible for the target system
> because max memory / min memory ratio is large.
> Our plan is to use the beginning (or the end) of the memory block being
> hotplugged.  If a 2GB memory block is added, first ~20MB is used for
> the struct page array for the rest of the memory block.

Right - that makes perfect sense, it just has 2 problems:

1) You end up with a discontiguous mem_map array (fixable by adding a layer
of indirection in the wrapped macros).
2) on 32 bit, it's going to make a mess, as you need to map mem_map
inside the permanently mapped kernel area (aka ZONE_NORMAL+vmalloc space 
except in a kind of weird cornercase I created with remap_numa_kva, 
which creates a no-man's land of permanently mapped kernel memory 
between ZONE_NORMAL and VMALLOC_RESERVE area for the remapped 
lmem_maps from the other nodes).

>> You could just lock the pages, I'd think? I don't see at a glance
>> exactly what you were using this for, but would that work?
> 
> I haven't seriously considered implementing vmalloc'd memory, but I
> guess that would be too complicated if not impossible.
> Making kernel threads or interrupt handlers block on memory access
> sounds very difficult to me.

Aahh, maybe I understand now. You're saying you don't support hotplugging
ZONE_NORMAL, so you want to restrict vmalloc accesses to the non-hotplugged
areas? In which case things like HIGHPTE will be a nightmare as well ... ;-)
You also need to be very wary of where memlocked pages are allocated from.

M.



* Re: memory hotremove prototype, take 3
  2003-12-04  5:38     ` Martin J. Bligh
@ 2003-12-04 15:44       ` IWAMOTO Toshihiro
  2003-12-04 17:12         ` Martin J. Bligh
  2003-12-04 18:27         ` Jesse Barnes
  0 siblings, 2 replies; 17+ messages in thread
From: IWAMOTO Toshihiro @ 2003-12-04 15:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, linux-mm

At Wed, 03 Dec 2003 21:38:54 -0800,
Martin J. Bligh <mbligh@aracnet.com> wrote:
> > My target is somewhat NUMA-ish and fairly large.  So I'm not sure if
> > CONFIG_NONLINEAR fits, but CONFIG_NUMA isn't perfect either.
> 
> If your target is NUMA, then you really, really need CONFIG_NONLINEAR.
> We don't support multiple pgdats per node, nor do I wish to, as it'll
> make an unholy mess ;-). With CONFIG_NONLINEAR, the discontiguities
> within a node are buried down further, so we have much less complexity
> to deal with from the main VM. The abstraction also keeps the poor
> VM engineers trying to read / write the code saner via simplicity ;-)

IIRC, memory is contiguous within a NUMA node.  I think Goto-san will
clarify this issue when his code gets ready. :-)

> WRT generic discontigmem support (not NUMA), doing that via pgdats
> should really go away, as there's no real difference between the 
> chunks of physical memory as far as the page allocator is concerned.
> The plan is to use Daniel's nonlinear stuff to replace that, and keep
> the pgdats strictly for NUMA. Same would apply to hotpluggable zones - 
> I'd hate to end up with 512 pgdats of stuff that are really all the
> same memory types underneath.

Yes. Unnecessary zone rebalancing would suck.

> The real issue you have is the mapping of the struct pages - if we can
> achieve a non-contig mapping of the mem_map / lmem_map array, we should
> be able to take memory on and offline reasonably easy. If you're willing
> for a first implementation to pre-allocate the struct page array for 
> every possible virtual address, it makes life a lot easier.

Preallocating struct page array isn't feasible for the target system
because max memory / min memory ratio is large.
Our plan is to use the beginning (or the end) of the memory block being
hotplugged.  If a 2GB memory block is added, first ~20MB is used for
the struct page array for the rest of the memory block.
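
The ~20MB estimate follows directly from the struct page overhead; a
quick check, assuming 4 KB pages and roughly 40 bytes per struct page
(typical for i386 in the 2.6.0 timeframe):

#include <stdio.h>

int main(void)
{
	unsigned long block       = 2UL << 30;   /* 2 GB hot-added block */
	unsigned long page_size   = 4096;
	unsigned long struct_page = 40;          /* approximate size     */

	unsigned long pages   = block / page_size;    /* 524288          */
	unsigned long mem_map = pages * struct_page;  /* ~20 MB          */

	printf("pages in block    : %lu\n", pages);
	printf("mem_map for block : %lu MB\n", mem_map >> 20);
	return 0;
}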


> >> PS. What's this bit of the patch for?
> >> 
> >>  void *vmalloc(unsigned long size)
> >>  {
> >> +#ifdef CONFIG_MEMHOTPLUGTEST
> >> +       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
> >> +#else
> >>         return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
> >> +#endif
> >>  }
> > 
> > This is necessary because kernel memory cannot be swapped out.
> > Only highmem can be hot removed, though it doesn't need to be highmem.
> > We can define another zone attribute such as GFP_HOTPLUGGABLE.
> 
> You could just lock the pages, I'd think? I don't see at a glance
> exactly what you were using this for, but would that work?

I haven't seriously considered implementing vmalloc'd memory, but I
guess that would be too complicated if not impossible.
Making kernel threads or interrupt handlers block on memory access
sounds very difficult to me.

--
IWAMOTO Toshihiro


* Re: memory hotremove prototype, take 3
  2003-12-04  3:58   ` IWAMOTO Toshihiro
@ 2003-12-04  5:38     ` Martin J. Bligh
  2003-12-04 15:44       ` IWAMOTO Toshihiro
  0 siblings, 1 reply; 17+ messages in thread
From: Martin J. Bligh @ 2003-12-04  5:38 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: linux-kernel, linux-mm

> I used the discontigmem code because this is what we have now.
> My hacks such as zone_active[] will go away when the memory hot add
> code (which Goto-san is working on) is ready.

I understand that, but it'd be much cleaner (and more likely to get
accepted) to do it the other way.
 
>> Have you looked at Daniel's CONFIG_NONLINEAR stuff? That provides a much
>> cleaner abstraction for getting rid of discontiguous memory in the non
>> truly-NUMA case, and should work really well for doing mem hot add / remove
>> as well.
> 
> Thanks for pointing that out.  I looked at the patch.
> It should be doable to make my patch work with the CONFIG_NONLINEAR
> code.  For my code to work, basically the following functionalities
> are necessary:
> 1. disabling alloc_page from hot-removing area
> and
> 2. enumerating pages in use in hot-removing area.
> 
> My target is somewhat NUMA-ish and fairly large.  So I'm not sure if
> CONFIG_NONLINEAR fits, but CONFIG_NUMA isn't perfect either.

If your target is NUMA, then you really, really need CONFIG_NONLINEAR.
We don't support multiple pgdats per node, nor do I wish to, as it'll
make an unholy mess ;-). With CONFIG_NONLINEAR, the discontiguities
within a node are buried down further, so we have much less complexity
to deal with from the main VM. The abstraction also keeps the poor
VM engineers trying to read / write the code saner via simplicity ;-)

WRT generic discontigmem support (not NUMA), doing that via pgdats
should really go away, as there's no real difference between the 
chunks of physical memory as far as the page allocator is concerned.
The plan is to use Daniel's nonlinear stuff to replace that, and keep
the pgdats strictly for NUMA. Same would apply to hotpluggable zones - 
I'd hate to end up with 512 pgdats of stuff that are really all the
same memory types underneath.

The real issue you have is the mapping of the struct pages - if we can
achieve a non-contig mapping of the mem_map / lmem_map array, we should
be able to take memory on and offline reasonably easy. If you're willing
for a first implementation to pre-allocate the struct page array for 
every possible virtual address, it makes life a lot easier.

Adding the other layer of indirection for accessing the struct page array
should fix up most of that, and is very easily abstracted out via the
pfn_to_page macros and friends. I ripped out all the direct references
to mem_map indexing already in 2.6, so it should all be nicely 
abstracted out.
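
A sketch of what such an indirection layer can look like, hiding
per-section lmem_map fragments behind a pfn_to_page() macro; the names,
section size and allocation below are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct page. */
struct page { unsigned long flags; unsigned long count; };

/* Illustrative: 64 MB sections of 4 KB pages. */
#define SECTION_SHIFT     14                     /* 2^14 pages = 64 MB */
#define PAGES_PER_SECTION (1UL << SECTION_SHIFT)
#define MAX_SECTIONS      1024

/* One mem_map fragment per present section; absent sections stay NULL. */
static struct page *section_mem_map[MAX_SECTIONS];

#define pfn_to_page(pfn) \
	(&section_mem_map[(pfn) >> SECTION_SHIFT][(pfn) & (PAGES_PER_SECTION - 1)])

static void hotadd_section(unsigned long sec)
{
	/* In the kernel this memory would come from the added block itself. */
	section_mem_map[sec] = calloc(PAGES_PER_SECTION, sizeof(struct page));
}

int main(void)
{
	hotadd_section(3);                  /* only section 3 is present */

	unsigned long pfn = 3 * PAGES_PER_SECTION + 42;
	struct page *pg = pfn_to_page(pfn);

	pg->count = 1;
	printf("pfn %lu -> struct page %p (count %lu)\n",
	       pfn, (void *)pg, pg->count);
	return 0;
}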

>> PS. What's this bit of the patch for?
>> 
>>  void *vmalloc(unsigned long size)
>>  {
>> +#ifdef CONFIG_MEMHOTPLUGTEST
>> +       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
>> +#else
>>         return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
>> +#endif
>>  }
> 
> This is necessary because kernel memory cannot be swapped out.
> Only highmem can be hot removed, though it doesn't need to be highmem.
> We can define another zone attribute such as GFP_HOTPLUGGABLE.

You could just lock the pages, I'd think? I don't see at a glance
exactly what you were using this for, but would that work?

M.



* Re: memory hotremove prototype, take 3
  2003-12-03 19:41 ` Martin J. Bligh
@ 2003-12-04  3:58   ` IWAMOTO Toshihiro
  2003-12-04  5:38     ` Martin J. Bligh
  0 siblings, 1 reply; 17+ messages in thread
From: IWAMOTO Toshihiro @ 2003-12-04  3:58 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, linux-mm

At Wed, 03 Dec 2003 11:41:01 -0800,
Martin J. Bligh <mbligh@aracnet.com> wrote:
> 
> > this is a new version of my memory hotplug prototype patch, against
> > linux-2.6.0-test11.
> > 
> > Freeing 100% of a specified memory zone is non-trivial and necessary
> > for memory hot removal.  This patch splits memory into 1GB zones, and
> > implements complete zone memory freeing using kswapd or "remapping".
> > 
> > A bit more detailed explanation and some test scripts are at:
> > 	http://people.valinux.co.jp/~iwamoto/mh.html
> > 
> > Main changes against previous versions are:
> > - The stability is greatly improved.  Kernel crashes (probably related
> >   with kswapd) still happen, but they are rather rare so that I'm
> >   having difficulty reproducing crashes.
> >   Page remapping under simultaneous tar + rm -rf works.
> > - Implemented a solution to a deadlock caused by ext2_rename, which
> >   increments a refcount of a directory page twice.
> > 
> > Questions and comments are welcome.
> 
> I really think that doing this over zones and pgdats isn't the best approach.
> You're going to make memory allocation and reclaim vastly less efficient,
> and you're exposing a bunch of very specialised code inside the main
> memory paths. 

I used the discontigmem code because this is what we have now.
My hacks such as zone_active[] will go away when the memory hot add
code (which Goto-san is working on) is ready.

> Have you looked at Daniel's CONFIG_NONLINEAR stuff? That provides a much
> cleaner abstraction for getting rid of discontiguous memory in the non
> truly-NUMA case, and should work really well for doing mem hot add / remove
> as well.

Thanks for pointing that out.  I looked at the patch.
It should be doable to make my patch work with the CONFIG_NONLINEAR
code.  For my code to work, basically the following functionalities
are necessary:
1. disabling alloc_page from hot-removing area
and
2. enumerating pages in use in hot-removing area.
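
For item 2, the commented-out "inuse" command in the patch already
hints at the simplest form: walk the zone's mem_map and report pages
whose use count is non-zero.  A standalone sketch of that loop, with
simplified types:

#include <stdio.h>

struct page { int count; };

struct zone {
	struct page *zone_mem_map;
	unsigned long spanned_pages;
};

/* Report every page in the zone that still has users; these are the
 * pages that must be freed or remapped before the zone can go away. */
static unsigned long enumerate_inuse(const struct zone *z)
{
	unsigned long i, busy = 0;

	for (i = 0; i < z->spanned_pages; i++) {
		if (z->zone_mem_map[i].count) {
			printf("in use: page index %lu\n", i);
			busy++;
		}
	}
	return busy;
}

int main(void)
{
	struct page pages[8] = { {0}, {2}, {0}, {1}, {0}, {0}, {0}, {3} };
	struct zone z = { pages, 8 };

	printf("%lu pages still in use\n", enumerate_inuse(&z));
	return 0;
}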

My target is somewhat NUMA-ish and fairly large.  So I'm not sure if
CONFIG_NONLINEAR fits, but CONFIG_NUMA isn't perfect either.


> PS. What's this bit of the patch for?
> 
>  void *vmalloc(unsigned long size)
>  {
> +#ifdef CONFIG_MEMHOTPLUGTEST
> +       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
> +#else
>         return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
> +#endif
>  }

This is necessary because kernel memory cannot be swapped out.
Only highmem can be hot removed, though it doesn't need to be highmem.
We can define another zone attribute such as GFP_HOTPLUGGABLE.

--
IWAMOTO Toshihiro


* Re: memory hotremove prototype, take 3
  2003-12-01  3:41 IWAMOTO Toshihiro
  2003-12-01 19:56 ` Pavel Machek
@ 2003-12-03 19:41 ` Martin J. Bligh
  2003-12-04  3:58   ` IWAMOTO Toshihiro
  1 sibling, 1 reply; 17+ messages in thread
From: Martin J. Bligh @ 2003-12-03 19:41 UTC (permalink / raw)
  To: IWAMOTO Toshihiro, linux-kernel, linux-mm

> this is a new version of my memory hotplug prototype patch, against
> linux-2.6.0-test11.
> 
> Freeing 100% of a specified memory zone is non-trivial and necessary
> for memory hot removal.  This patch splits memory into 1GB zones, and
> implements complete zone memory freeing using kswapd or "remapping".
> 
> A bit more detailed explanation and some test scripts are at:
> 	http://people.valinux.co.jp/~iwamoto/mh.html
> 
> Main changes since the previous versions are:
> - The stability is greatly improved.  Kernel crashes (probably related
>   to kswapd) still happen, but they are rare enough that I'm having
>   difficulty reproducing them.
>   Page remapping works under a simultaneous tar + rm -rf workload.
> - Implemented a solution to a deadlock caused by ext2_rename, which
>   increments the refcount of a directory page twice.
> 
> Questions and comments are welcome.

I really think that doing this over zones and pgdats isn't the best approach.
You're going to make memory allocation and reclaim vastly less efficient,
and you're exposing a bunch of very specialised code inside the main
memory paths. 

Have you looked at Daniel's CONFIG_NONLINEAR stuff? That provides a much
cleaner abstraction for getting rid of discontiguous memory in the non
truly-NUMA case, and should work really well for doing mem hot add / remove
as well.

M.

PS. What's this bit of the patch for?

 void *vmalloc(unsigned long size)
 {
+#ifdef CONFIG_MEMHOTPLUGTEST
+       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
+#else
        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
+#endif
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: memory hotremove prototype, take 3
@ 2003-12-03 17:57 Luck, Tony
  0 siblings, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2003-12-03 17:57 UTC (permalink / raw)
  To: Perez-Gonzalez, Inaky, Yasunori Goto, Pavel Machek
  Cc: linux-kernel, IWAMOTO Toshihiro, Hirokazu Takahashi,
	Linux Hotplug Memory Support

> > IMHO, to hot-remove memory, memory attributes should be divided into
> > hotpluggable and non-hotpluggable, and memory of each attribute should
> > be allocated from its own unit (e.g. a node).
> 
> Why? I still don't get that -- we should be able to use the virtual
> addressing mechanism of any CPU to swap the backing of any virtual
> address under the rug, without needing to do anything more than
> allocate a page frame for the new physical location (I am ignoring
> here devices that are directly accessing physical memory -- a
> callback in the device model could be added to require them to
> reallocate their buffers).
> 
> Or am I dead wrong and naive?

Most (all?) Linux implementations make use of a large area
of memory that is mapped 1:1 (with a constant offset) from
kernel virtual space to physical space.  Kernel memory is
allocated in this virtual area.  If the processor supports
some form of large pages in the TLB, this 1:1 area uses the
large pages ... so it would require some major surgery to
remap portions of this area, and would have a negative effect
on performance (since you'd take more TLB misses).  It might
even be a correctness issue if the structures in this area
were needed to handle small page TLB faults in the area itself.
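
For concreteness, the 1:1 area is just a constant-offset translation;
a toy user-space model of it is below.  PAGE_OFFSET mirrors the
conventional i386 value, and to_phys()/to_virt() are simplified
stand-ins for the kernel's __pa()/__va(), not the real macros:

/*
 * Toy model of the kernel direct ("1:1") mapping: kernel virtual
 * address = physical address + PAGE_OFFSET.  Because the offset is
 * constant and the range is typically covered by large TLB entries,
 * pages in it cannot be transparently relocated the way user pages
 * can -- which is the problem described above.
 */
#include <stdio.h>

#define PAGE_OFFSET 0xC0000000UL	/* conventional i386 split */

static unsigned long to_phys(unsigned long vaddr)	/* like __pa() */
{
	return vaddr - PAGE_OFFSET;
}

static unsigned long to_virt(unsigned long paddr)	/* like __va() */
{
	return paddr + PAGE_OFFSET;
}

int main(void)
{
	unsigned long phys = 0x01000000UL;	/* 16MB */

	printf("phys %#lx <-> kernel virt %#lx\n", phys, to_virt(phys));
	printf("round trip back to phys: %#lx\n", to_phys(to_virt(phys)));
	return 0;
}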

-Tony

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: memory hotremove prototype, take 3
@ 2003-12-03  5:19 Perez-Gonzalez, Inaky
  0 siblings, 0 replies; 17+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-03  5:19 UTC (permalink / raw)
  To: Yasunori Goto, Pavel Machek
  Cc: linux-kernel, Luck, Tony, IWAMOTO Toshihiro, Hirokazu Takahashi,
	Linux Hotplug Memory Support


> From: Yasunori Goto

> IMHO, to hot-remove memory, memory attributes should be divided into
> hotpluggable and non-hotpluggable, and memory of each attribute should
> be allocated from its own unit (e.g. a node).

Why? I still don't get that -- we should be able to use the virtual
addressing mechanism of any CPU to swap the backing of any virtual
address under the rug, without needing to do anything more than
allocate a page frame for the new physical location (I am ignoring
here devices that are directly accessing physical memory -- a callback
in the device model could be added to require them to reallocate
their buffers).

Or am I dead wrong and naive?
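
That is, at a high level, close to what the remap_onepage() routine in
the patch does for pages it cannot simply evict: allocate a
replacement frame outside the area being removed, block access, copy
the contents, and retarget the mapping.  A toy user-space model of
that flow, where an array stands in for the page cache radix tree and
locking, writeback and PTE unmapping are all elided:

/*
 * Toy model of page "remapping": replace the frame backing a cached
 * page with a freshly allocated one, copying the contents.  The real
 * remap_onepage() in the patch must also lock the page, write back or
 * release buffers, and unmap PTEs; none of that is modelled here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

static void *page_cache[16];		/* stand-in for the radix tree */

static int remap_onepage_model(unsigned int idx)
{
	void *oldp = page_cache[idx];
	void *newp = malloc(PAGE_SIZE);	/* frame outside the removed area */

	if (newp == NULL)
		return -1;
	memcpy(newp, oldp, PAGE_SIZE);	/* copy while access is blocked */
	page_cache[idx] = newp;		/* retarget the mapping */
	free(oldp);			/* old frame may now be offlined */
	return 0;
}

int main(void)
{
	page_cache[0] = calloc(1, PAGE_SIZE);
	if (page_cache[0] == NULL)
		return 1;
	strcpy(page_cache[0], "some cached data");
	if (remap_onepage_model(0) == 0)
		printf("page 0 now backed elsewhere: \"%s\"\n",
		       (char *)page_cache[0]);
	free(page_cache[0]);
	return 0;
}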

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: memory hotremove prototype, take 3
  2003-12-01  3:41 IWAMOTO Toshihiro
@ 2003-12-01 19:56 ` Pavel Machek
  2003-12-03 19:41 ` Martin J. Bligh
  1 sibling, 0 replies; 17+ messages in thread
From: Pavel Machek @ 2003-12-01 19:56 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: linux-kernel, linux-mm

Hi!

> this is a new version of my memory hotplug prototype patch, against
> linux-2.6.0-test11.
> 
> Freeing 100% of a specified memory zone is non-trivial and necessary
> for memory hot removal.  This patch splits memory into 1GB zones, and
> implements complete zone memory freeing using kswapd or "remapping".
> 
> A bit more detailed explanation and some test scripts are at:
> 	http://people.valinux.co.jp/~iwamoto/mh.html

I scanned it...

hotunplug seems cool... How do you deal with kernel data structures in
memory "to be removed"? Or you simply don't allow kmalloc() to
allocate there?

During hotunplug, you copy pages to new locaion. Would it simplify
code if you forced them to be swapped out, instead? [Yep, it would be
slower...]
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* memory hotremove prototype, take 3
@ 2003-12-01  3:41 IWAMOTO Toshihiro
  2003-12-01 19:56 ` Pavel Machek
  2003-12-03 19:41 ` Martin J. Bligh
  0 siblings, 2 replies; 17+ messages in thread
From: IWAMOTO Toshihiro @ 2003-12-01  3:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Hi,

this is a new version of my memory hotplug prototype patch, against
linux-2.6.0-test11.

Freeing 100% of a specified memory zone is non-trivial and necessary
for memory hot removal.  This patch splits memory into 1GB zones, and
implements complete zone memory freeing using kswapd or "remapping".

A bit more detailed explanation and some test scripts are at:
	http://people.valinux.co.jp/~iwamoto/mh.html

Main changes since the previous versions are:
- The stability is greatly improved.  Kernel crashes (probably related
  to kswapd) still happen, but they are rare enough that I'm having
  difficulty reproducing them.
  Page remapping works under a simultaneous tar + rm -rf workload.
- Implemented a solution to a deadlock caused by ext2_rename, which
  increments the refcount of a directory page twice.

Questions and comments are welcome.
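
For anyone who wants to try it: the patch adds a single control file,
/proc/memhotplug.  The commands below correspond to the strings parsed
by mhtest_write() further down in the patch; zone indices come from
reading the same file, and "3" is just an example:

	cat /proc/memhotplug                  # list zones: enabled/disabled, free/active/present
	echo "disable 3" > /proc/memhotplug   # stop allocating from zone 3
	echo "purge 3" > /proc/memhotplug     # wake kswapd and drain per-cpu pages
	echo "remap 3" > /proc/memhotplug     # start the remap thread on zone 3
	echo "enable 3" > /proc/memhotplug    # bring zone 3 back into use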

$Id: memoryhotplug.patch,v 1.26 2003/11/28 09:12:12 iwamoto Exp $

diff -dpur linux-2.6.0-test11/arch/i386/Kconfig linux-2.6.0-test11-mh/arch/i386/Kconfig
--- linux-2.6.0-test11/arch/i386/Kconfig	Thu Nov 27 05:43:07 2003
+++ linux-2.6.0-test11-mh/arch/i386/Kconfig	Fri Nov 28 17:45:42 2003
@@ -706,14 +706,18 @@ comment "NUMA (NUMA-Q) requires SMP, 64G
 comment "NUMA (Summit) requires SMP, 64GB highmem support, full ACPI"
 	depends on X86_SUMMIT && (!HIGHMEM64G || !ACPI || ACPI_HT_ONLY)
 
+config MEMHOTPLUGTEST
+       bool "Memory hotplug test"
+       default n
+
 config DISCONTIGMEM
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUGTEST
 	default y
 
 config HAVE_ARCH_BOOTMEM_NODE
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUGTEST
 	default y
 
 config HIGHPTE
diff -dpur linux-2.6.0-test11/arch/i386/mm/discontig.c linux-2.6.0-test11-mh/arch/i386/mm/discontig.c
--- linux-2.6.0-test11/arch/i386/mm/discontig.c	Thu Nov 27 05:44:20 2003
+++ linux-2.6.0-test11-mh/arch/i386/mm/discontig.c	Fri Nov 28 17:45:42 2003
@@ -28,6 +28,7 @@
 #include <linux/mmzone.h>
 #include <linux/highmem.h>
 #include <linux/initrd.h>
+#include <linux/proc_fs.h>
 #include <asm/e820.h>
 #include <asm/setup.h>
 #include <asm/mmzone.h>
@@ -111,6 +112,49 @@ int __init get_memcfg_numa_flat(void)
 	return 1;
 }
 
+int __init get_memcfg_numa_blks(void)
+{
+	int i, pfn;
+
+	printk("NUMA - single node, flat memory mode, but broken in several blocks\n");
+
+	/* Run the memory configuration and find the top of memory. */
+	find_max_pfn();
+	if (max_pfn & (PTRS_PER_PTE - 1)) {
+		pfn = max_pfn & ~(PTRS_PER_PTE - 1);
+		printk("Rounding down maxpfn %d -> %d\n", max_pfn, pfn);
+		max_pfn = pfn;
+	}
+	for(i = 0; i < MAX_NUMNODES; i++) {
+		pfn = PFN_DOWN(1 << 30) * i;
+		node_start_pfn[i]  = pfn;
+		pfn += PFN_DOWN(1 << 30);
+		if (pfn < max_pfn)
+			node_end_pfn[i]	  = pfn;
+		else {
+			node_end_pfn[i]	  = max_pfn;
+			i++;
+			printk("total %d blocks, max %d\n", i, max_pfn);
+			break;
+		}
+	}
+
+	/* Fill in the physnode_map with our simplistic memory model,
+	* all memory is in node 0.
+	*/
+	for (pfn = node_start_pfn[0]; pfn <= max_pfn;
+	       pfn += PAGES_PER_ELEMENT)
+	{
+		physnode_map[pfn / PAGES_PER_ELEMENT] = pfn / PFN_DOWN(1 << 30);
+	}
+
+         /* Indicate there is one node available. */
+	node_set_online(0);
+	numnodes = i;
+
+	return 1;
+}
+
 /*
  * Find the highest page frame number we have available for the node
  */
@@ -183,6 +227,8 @@ static void __init register_bootmem_low_
 	}
 }
 
+static struct kcore_list numa_kc;
+
 void __init remap_numa_kva(void)
 {
 	void *vaddr;
@@ -196,7 +242,11 @@ void __init remap_numa_kva(void)
 				node_remap_start_pfn[node] + pfn, 
 				PAGE_KERNEL_LARGE);
 		}
+		memset(node_remap_start_vaddr[node], 0,
+		    node_remap_size[node] * PAGE_SIZE);
 	}
+	kclist_add(&numa_kc, node_remap_start_vaddr[numnodes - 1],
+	    node_remap_offset[numnodes - 1] << PAGE_SHIFT);
 }
 
 static unsigned long calculate_numa_remap_pages(void)
diff -dpur linux-2.6.0-test11/include/asm-i386/kmap_types.h linux-2.6.0-test11-mh/include/asm-i386/kmap_types.h
--- linux-2.6.0-test11/include/asm-i386/kmap_types.h	Thu Nov 27 05:44:56 2003
+++ linux-2.6.0-test11-mh/include/asm-i386/kmap_types.h	Fri Nov 28 17:52:08 2003
@@ -24,7 +24,13 @@ D(10)	KM_IRQ0,
 D(11)	KM_IRQ1,
 D(12)	KM_SOFTIRQ0,
 D(13)	KM_SOFTIRQ1,
+#ifdef CONFIG_MEMHOTPLUGTEST
+D(14)	KM_REMAP0,
+D(15)	KM_REMAP1,
+D(16)	KM_TYPE_NR,
+#else
 D(14)	KM_TYPE_NR
+#endif
 };
 
 #undef D
diff -dpur linux-2.6.0-test11/include/asm-i386/mmzone.h linux-2.6.0-test11-mh/include/asm-i386/mmzone.h
--- linux-2.6.0-test11/include/asm-i386/mmzone.h	Thu Nov 27 05:44:10 2003
+++ linux-2.6.0-test11-mh/include/asm-i386/mmzone.h	Fri Nov 28 17:45:42 2003
@@ -128,6 +128,10 @@ static inline struct pglist_data *pfn_to
 #endif /* CONFIG_X86_NUMAQ */
 
 extern int get_memcfg_numa_flat(void );
+#ifdef CONFIG_MEMHOTPLUGTEST
+extern int get_memcfg_numa_blks(void);
+#endif
+
 /*
  * This allows any one NUMA architecture to be compiled
  * for, and still fall back to the flat function if it
@@ -143,6 +147,10 @@ static inline void get_memcfg_numa(void)
 		return;
 #endif
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+	get_memcfg_numa_blks();
+	return;
+#endif
 	get_memcfg_numa_flat();
 }
 
diff -dpur linux-2.6.0-test11/include/asm-i386/numnodes.h linux-2.6.0-test11-mh/include/asm-i386/numnodes.h
--- linux-2.6.0-test11/include/asm-i386/numnodes.h	Thu Nov 27 05:43:09 2003
+++ linux-2.6.0-test11-mh/include/asm-i386/numnodes.h	Fri Nov 28 17:45:42 2003
@@ -13,6 +13,10 @@
 /* Max 8 Nodes */
 #define NODES_SHIFT	3
 
+#elif defined(CONFIG_MEMHOTPLUGTEST)
+
+#define NODES_SHIFT	3
+
 #endif /* CONFIG_X86_NUMAQ */
 
 #endif /* _ASM_MAX_NUMNODES_H */
diff -dpur linux-2.6.0-test11/include/linux/mm.h linux-2.6.0-test11-mh/include/linux/mm.h
--- linux-2.6.0-test11/include/linux/mm.h	Thu Nov 27 05:42:55 2003
+++ linux-2.6.0-test11-mh/include/linux/mm.h	Fri Nov 28 17:45:42 2003
@@ -219,7 +219,14 @@ struct page {
  */
 #define put_page_testzero(p)				\
 	({						\
-		BUG_ON(page_count(p) == 0);		\
+		if (page_count(p) == 0) {		\
+			int i;						\
+			printk("Page: %lx ", (long)p);			\
+			for(i = 0; i < sizeof(struct page); i++)	\
+				printk(" %02x", ((unsigned char *)p)[i]); \
+			printk("\n");					\
+			BUG();				\
+		}					\
 		atomic_dec_and_test(&(p)->count);	\
 	})
 
diff -dpur linux-2.6.0-test11/include/linux/mmzone.h linux-2.6.0-test11-mh/include/linux/mmzone.h
--- linux-2.6.0-test11/include/linux/mmzone.h	Thu Nov 27 05:44:20 2003
+++ linux-2.6.0-test11-mh/include/linux/mmzone.h	Fri Nov 28 17:45:42 2003
@@ -360,6 +360,10 @@ static inline unsigned int num_online_me
 	return num;
 }
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+int zone_activep(const struct zone *);
+int remapd(void *p);
+#endif
 #else /* !CONFIG_DISCONTIGMEM && !CONFIG_NUMA */
 
 #define node_online(node) \
diff -dpur linux-2.6.0-test11/include/linux/page-flags.h linux-2.6.0-test11-mh/include/linux/page-flags.h
--- linux-2.6.0-test11/include/linux/page-flags.h	Thu Nov 27 05:44:52 2003
+++ linux-2.6.0-test11-mh/include/linux/page-flags.h	Fri Nov 28 17:45:42 2003
@@ -76,6 +76,8 @@
 #define PG_reclaim		18	/* To be reclaimed asap */
 #define PG_compound		19	/* Part of a compound page */
 
+#define	PG_again		20
+
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -268,6 +270,10 @@ extern void get_full_page_state(struct p
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define SetPageCompound(page)	set_bit(PG_compound, &(page)->flags)
 #define ClearPageCompound(page)	clear_bit(PG_compound, &(page)->flags)
+
+#define PageAgain(page)	test_bit(PG_again, &(page)->flags)
+#define SetPageAgain(page)	set_bit(PG_again, &(page)->flags)
+#define ClearPageAgain(page)	clear_bit(PG_again, &(page)->flags)
 
 /*
  * The PageSwapCache predicate doesn't use a PG_flag at this time,
diff -dpur linux-2.6.0-test11/mm/filemap.c linux-2.6.0-test11-mh/mm/filemap.c
--- linux-2.6.0-test11/mm/filemap.c	Thu Nov 27 05:43:33 2003
+++ linux-2.6.0-test11-mh/mm/filemap.c	Fri Nov 28 17:45:42 2003
@@ -448,7 +448,8 @@ repeat:
 			spin_lock(&mapping->page_lock);
 
 			/* Has the page been truncated while we slept? */
-			if (page->mapping != mapping || page->index != offset) {
+			if (page->mapping != mapping || page->index != offset ||
+			    PageAgain(page)) {
 				unlock_page(page);
 				page_cache_release(page);
 				goto repeat;
@@ -677,6 +678,12 @@ page_not_up_to_date:
 			goto page_ok;
 		}
 
+		if (PageAgain(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto find_page;
+		}
+
 readpage:
 		/* ... and start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
@@ -1120,6 +1127,12 @@ page_not_uptodate:
 		goto success;
 	}
 
+	if (PageAgain(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto retry_find;
+	}
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1228,6 +1241,12 @@ page_not_uptodate:
 		goto success;
 	}
 
+	if (PageAgain(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto retry_find;
+	}
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1436,6 +1455,11 @@ retry:
 	if (PageUptodate(page)) {
 		unlock_page(page);
 		goto out;
+	}
+	if (PageAgain(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto retry;
 	}
 	err = filler(data, page);
 	if (err < 0) {
diff -dpur linux-2.6.0-test11/mm/page_alloc.c linux-2.6.0-test11-mh/mm/page_alloc.c
--- linux-2.6.0-test11/mm/page_alloc.c	Thu Nov 27 05:42:56 2003
+++ linux-2.6.0-test11-mh/mm/page_alloc.c	Fri Nov 28 17:45:42 2003
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/proc_fs.h>
 
 #include <asm/tlbflush.h>
 
@@ -52,6 +53,9 @@ EXPORT_SYMBOL(nr_swap_pages);
  */
 struct zone *zone_table[MAX_NR_ZONES*MAX_NUMNODES];
 EXPORT_SYMBOL(zone_table);
+#ifdef CONFIG_MEMHOTPLUGTEST
+static char zone_active[MAX_NR_ZONES*MAX_NUMNODES];
+#endif
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
@@ -411,7 +415,9 @@ int is_head_of_free_region(struct page *
 	spin_unlock_irqrestore(&zone->lock, flags);
         return 0;
 }
+#endif
 
+#if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_MEMHOTPLUGTEST)
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
  */
@@ -512,9 +518,28 @@ static struct page *buffered_rmqueue(str
 		mod_page_state(pgalloc, 1 << order);
 		prep_new_page(page, order);
 	}
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (page != NULL && ! zone_active[page->flags >> ZONE_SHIFT])
+		printk("alloc_page from disabled zone: %p\n", page);
+#endif
 	return page;
 }
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+int
+zone_activep(const struct zone *z)
+{
+	int i;
+
+	for(i = 0; ; i++) {
+		if (zone_table[i] == z)
+			return zone_active[i];
+		if (zone_table[i] == NULL)
+			BUG();
+	}
+}
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  *
@@ -562,6 +587,10 @@ __alloc_pages(unsigned int gfp_mask, uns
 		struct zone *z = zones[i];
 		unsigned long local_low;
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+		if (! zone_activep(z))
+			continue;
+#endif
 		/*
 		 * This is the fabled 'incremental min'. We let real-time tasks
 		 * dip their real-time paws a little deeper into reserves.
@@ -590,6 +619,10 @@ __alloc_pages(unsigned int gfp_mask, uns
 		unsigned long local_min;
 		struct zone *z = zones[i];
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+		if (! zone_activep(z))
+			continue;
+#endif
 		local_min = z->pages_min;
 		if (gfp_mask & __GFP_HIGH)
 			local_min >>= 2;
@@ -613,6 +646,10 @@ rebalance:
 		for (i = 0; zones[i] != NULL; i++) {
 			struct zone *z = zones[i];
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+			if (! zone_activep(z))
+				continue;
+#endif
 			page = buffered_rmqueue(z, order, cold);
 			if (page)
 				goto got_pg;
@@ -638,6 +675,10 @@ rebalance:
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *z = zones[i];
 
+#ifdef CONFIG_MEMHOTPLUGTEST
+		if (! zone_activep(z))
+			continue;
+#endif
 		min += z->pages_min;
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
@@ -1076,6 +1117,9 @@ static int __init build_zonelists_node(p
 static void __init build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
+#ifdef CONFIG_MEMHOTPLUGTEST
+	struct zone *zone;
+#endif
 
 	local_node = pgdat->node_id;
 	printk("Building zonelist for node : %d\n", local_node);
@@ -1091,7 +1135,7 @@ static void __init build_zonelists(pg_da
 			k = ZONE_HIGHMEM;
 		if (i & __GFP_DMA)
 			k = ZONE_DMA;
-
+#ifndef CONFIG_MEMHOTPLUGTEST
  		j = build_zonelists_node(pgdat, zonelist, j, k);
  		/*
  		 * Now we build the zonelist so that it contains the zones
@@ -1107,6 +1151,23 @@ static void __init build_zonelists(pg_da
  			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
  
 		zonelist->zones[j++] = NULL;
+#else
+		for(; k >= 0; k--) {
+			zone = pgdat->node_zones + k;
+			if (zone->present_pages)
+				zonelist->zones[j++] = zone;
+			for (node = local_node + 1; node < numnodes; node++) {
+				zone = NODE_DATA(node)->node_zones + k;
+				if (zone->present_pages)
+					zonelist->zones[j++] = zone;
+			}
+			for (node = 0; node < local_node; node++) {
+				zone = NODE_DATA(node)->node_zones + k;
+				if (zone->present_pages)
+					zonelist->zones[j++] = zone;
+			}
+		}
+#endif
 	} 
 }
 
@@ -1252,6 +1313,9 @@ static void __init free_area_init_core(s
 		unsigned long batch;
 
 		zone_table[nid * MAX_NR_ZONES + j] = zone;
+#ifdef CONFIG_MEMHOTPLUGTEST
+		zone_active[nid * MAX_NR_ZONES + j] = 1;
+#endif
 		realsize = size = zones_size[j];
 		if (zholes_size)
 			realsize -= zholes_size[j];
@@ -1644,3 +1708,145 @@ int min_free_kbytes_sysctl_handler(ctl_t
 	setup_per_zone_pages_min();
 	return 0;
 }
+
+#ifdef CONFIG_MEMHOTPLUGTEST
+static int mhtest_read(char *page, char **start, off_t off, int count,
+    int *eof, void *data)
+{
+	char *p;
+	int i, len;
+	const struct zone *z;
+
+	p = page;
+	for(i = 0; ; i++) {
+		z = zone_table[i];
+		if (z == NULL)
+			break;
+		if (! z->present_pages)
+			/* skip empty zone */
+			continue;
+		len = sprintf(p, "Zone %d: %sabled free %d, active %d, present %d\n", i,
+		    zone_active[i] ? "en" : "dis", z->free_pages, z->nr_active,
+		    z->present_pages);
+		p += len;
+	}
+	len = p - page;
+
+	if (len <= off + count)
+		*eof = 1;
+	*start = page + off;
+	len -= off;
+	if (len < 0)
+		len = 0;
+	if (len > count)
+		len = count;
+
+	return len;
+}
+
+static int mhtest_write(struct file *file, const char *buffer,
+    unsigned long count, void *data)
+{
+	unsigned long idx;
+	char buf[64], *p;
+	int i;
+	struct list_head *l;
+
+	if (count > sizeof(buf) - 1)
+		count = sizeof(buf) - 1;
+	if (copy_from_user(buf, buffer, count))
+		return -EFAULT;
+
+	buf[count] = 0;
+
+	p = strchr(buf, ' ');
+	if (p == NULL)
+		goto out;
+
+	*p++ = '\0';
+	idx = simple_strtoul(p, NULL, 0);
+
+	if (idx > MAX_NR_ZONES*MAX_NUMNODES) {
+		printk("Argument out of range\n");
+		goto out;
+	}
+	if (strcmp(buf, "disable") == 0) {
+		printk("disable %d\n", idx);
+		/* XXX */
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[idx]->pageset[i].pcp[0];	/* hot */
+			pcp->low = pcp->high = 0;
+
+			pcp = &zone_table[idx]->pageset[i].pcp[1];	/* cold */
+			pcp->low = pcp->high = 0;
+		}
+		zone_active[idx] = 0;
+		zone_table[idx]->pages_high = zone_table[idx]->present_pages;
+	} else if (strcmp(buf, "purge") == 0) {
+		if (zone_active[idx])
+			printk("Zone %d still active (proceeding anyway)\n",
+			    idx);
+		printk("purge %d\n", idx);
+		wake_up_interruptible(&zone_table[idx]->zone_pgdat->kswapd_wait);
+		/* XXX overkill, but who cares? */
+		on_each_cpu(drain_local_pages, NULL, 1, 1);
+	} else if (strcmp(buf, "enable") == 0) {
+		printk("enable %d\n", idx);
+		zone_active[idx] = 1;
+		zone_table[idx]->pages_high = 
+		    zone_table[idx]->pages_min * 3;
+		/* XXX */
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[idx]->pageset[i].pcp[0];	/* hot */
+			pcp->low = 2 * pcp->batch;
+			pcp->high = 6 * pcp->batch;
+
+			pcp = &zone_table[idx]->pageset[i].pcp[1];	/* cold */
+			pcp->high = 2 * pcp->batch;
+		}
+	} else if (strcmp(buf, "remap") == 0) {
+		on_each_cpu(drain_local_pages, NULL, 1, 1);
+		kernel_thread(remapd, zone_table[idx], CLONE_KERNEL);
+	} else if (strcmp(buf, "active") == 0) {
+		if (zone_table[idx] == NULL)
+			goto out;
+		spin_lock_irq(&zone_table[idx]->lru_lock);
+		i = 0;
+		list_for_each(l, &zone_table[idx]->active_list) {
+			printk(" %lx", (unsigned long)list_entry(l, struct page, lru));
+			i++;
+			if (i == 10)
+				break;
+		}
+		spin_unlock_irq(&zone_table[idx]->lru_lock);
+		printk("\n");
+	} else if (strcmp(buf, "inuse") == 0) {
+		if (zone_table[idx] == NULL)
+			goto out;
+		for(i = 0; i < zone_table[idx]->spanned_pages; i++)
+			if (page_count(&zone_table[idx]->zone_mem_map[i]))
+				printk(" %lx", (unsigned long)&zone_table[idx]->zone_mem_map[i]);
+		printk("\n");
+	}
+out:
+	return count;
+}
+
+static int __init procmhtest_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("memhotplug", 0, NULL);
+	if (entry == NULL)
+		return -1;
+
+	entry->read_proc = &mhtest_read;
+	entry->write_proc = &mhtest_write;
+	return 0;
+}
+__initcall(procmhtest_init);
+#endif
diff -dpur linux-2.6.0-test11/mm/shmem.c linux-2.6.0-test11-mh/mm/shmem.c
--- linux-2.6.0-test11/mm/shmem.c	Thu Nov 27 05:43:41 2003
+++ linux-2.6.0-test11-mh/mm/shmem.c	Fri Nov 28 17:45:42 2003
@@ -80,7 +80,15 @@ static inline struct page *shmem_dir_all
 	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
 	 * might be reconsidered if it ever diverges from PAGE_SIZE.
 	 */
-	return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#ifdef CONFIG_MEMHOTPLUGTEST
+	struct page* p = alloc_pages(gfp_mask & ~__GFP_HIGHMEM,
+	    PAGE_CACHE_SHIFT-PAGE_SHIFT);
+	printk("shmem_dir_alloc: %lx\n", (unsigned long)p);
+	return p;
+#else
+	return alloc_pages(gfp_mask & ~__GFP_HIGHMEM,
+	    PAGE_CACHE_SHIFT-PAGE_SHIFT);
+#endif
 }
 
 static inline void shmem_dir_free(struct page *page)
diff -dpur linux-2.6.0-test11/mm/truncate.c linux-2.6.0-test11-mh/mm/truncate.c
--- linux-2.6.0-test11/mm/truncate.c	Thu Nov 27 05:45:39 2003
+++ linux-2.6.0-test11-mh/mm/truncate.c	Fri Nov 28 17:45:42 2003
@@ -132,6 +132,10 @@ void truncate_inode_pages(struct address
 			next++;
 			if (TestSetPageLocked(page))
 				continue;
+			if (PageAgain(page)) {
+				unlock_page(page);
+				continue;
+			}
 			if (PageWriteback(page)) {
 				unlock_page(page);
 				continue;
@@ -165,6 +169,14 @@ void truncate_inode_pages(struct address
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
+			if (PageAgain(page)) {
+				unsigned long index = page->index;
+
+				unlock_page(page);
+				put_page(page);
+				page = find_lock_page(mapping, index);
+				pvec.pages[i] = page;
+			}
 			wait_on_page_writeback(page);
 			if (page->index > next)
 				next = page->index;
@@ -255,6 +267,14 @@ void invalidate_inode_pages2(struct addr
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
+			if (PageAgain(page)) {
+				unsigned long index = page->index;
+
+				unlock_page(page);
+				put_page(page);
+				page = find_lock_page(mapping, index);
+				pvec.pages[i] = page;
+			}
 			if (page->mapping == mapping) {	/* truncate race? */
 				wait_on_page_writeback(page);
 				next = page->index + 1;
diff -dpur linux-2.6.0-test11/mm/vmalloc.c linux-2.6.0-test11-mh/mm/vmalloc.c
--- linux-2.6.0-test11/mm/vmalloc.c	Thu Nov 27 05:44:23 2003
+++ linux-2.6.0-test11-mh/mm/vmalloc.c	Fri Nov 28 17:45:42 2003
@@ -447,7 +447,11 @@ EXPORT_SYMBOL(__vmalloc);
  */
 void *vmalloc(unsigned long size)
 {
+#ifdef CONFIG_MEMHOTPLUGTEST
+       return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
+#else
        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
+#endif
 }
 
 EXPORT_SYMBOL(vmalloc);
diff -dpur linux-2.6.0-test11/mm/vmscan.c linux-2.6.0-test11-mh/mm/vmscan.c
--- linux-2.6.0-test11/mm/vmscan.c	Thu Nov 27 05:43:06 2003
+++ linux-2.6.0-test11-mh/mm/vmscan.c	Fri Nov 28 17:55:35 2003
@@ -36,6 +36,9 @@
 #include <asm/div64.h>
 
 #include <linux/swapops.h>
+#ifdef CONFIG_KDB
+#include <linux/kdb.h>
+#endif
 
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
@@ -285,6 +288,8 @@ shrink_list(struct list_head *page_list,
 			goto keep_locked;
 
 		pte_chain_lock(page);
+		if ((! zone_activep(page_zone(page))) && page_mapped(page))
+			page_referenced(page);
 		referenced = page_referenced(page);
 		if (referenced && page_mapping_inuse(page)) {
 			/* In active use or really unfreeable.  Activate it. */
@@ -589,7 +594,7 @@ done:
  * But we had to alter page->flags anyway.
  */
 static void
-refill_inactive_zone(struct zone *zone, const int nr_pages_in,
+refill_inactive_zone(struct zone *zone, int nr_pages_in,
 			struct page_state *ps, int priority)
 {
 	int pgmoved;
@@ -607,6 +612,12 @@ refill_inactive_zone(struct zone *zone, 
 
 	lru_add_drain();
 	pgmoved = 0;
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (! zone_activep(zone)) {
+		nr_pages = nr_pages_in = zone->present_pages - zone->free_pages;
+		printk("Purging active list of disabled zone\n");
+	}
+#endif
 	spin_lock_irq(&zone->lru_lock);
 	while (nr_pages && !list_empty(&zone->active_list)) {
 		page = list_entry(zone->active_list.prev, struct page, lru);
@@ -658,12 +669,20 @@ refill_inactive_zone(struct zone *zone, 
 	 */
 	if (swap_tendency >= 100)
 		reclaim_mapped = 1;
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (! zone_activep(zone))
+		reclaim_mapped = 1;
+#endif
 
 	while (!list_empty(&l_hold)) {
 		page = list_entry(l_hold.prev, struct page, lru);
 		list_del(&page->lru);
 		if (page_mapped(page)) {
 			pte_chain_lock(page);
+#ifdef CONFIG_MEMHOTPLUGTEST
+			if (! zone_activep(zone))
+				page_referenced(page);	/* XXX */
+#endif
 			if (page_mapped(page) && page_referenced(page)) {
 				pte_chain_unlock(page);
 				list_add(&page->lru, &l_active);
@@ -767,6 +786,11 @@ shrink_zone(struct zone *zone, int max_s
 	ratio = (unsigned long)nr_pages * zone->nr_active /
 				((zone->nr_inactive | 1) * 2);
 	atomic_add(ratio+1, &zone->refill_counter);
+#ifdef CONFIG_MEMHOTPLUGTEST
+	if (! zone_activep(zone))
+		/* XXX */
+		atomic_add(SWAP_CLUSTER_MAX, &zone->refill_counter);
+#endif
 	if (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
 		int count;
 
@@ -1048,6 +1072,439 @@ int kswapd(void *p)
 		balance_pgdat(pgdat, 0, &ps);
 	}
 }
+
+#ifdef CONFIG_MEMHOTPLUGTEST
+static void
+print_buffer(struct page* page)
+{
+	struct address_space* mapping = page->mapping;
+	struct buffer_head *bh, *head;
+
+	spin_lock(&mapping->private_lock);
+	bh = head = page_buffers(page);
+	printk("buffers:");
+	do {
+		printk(" %lx %d\n", bh->b_state, atomic_read(&bh->b_count));
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	printk("\n");
+	spin_unlock(&mapping->private_lock);
+}
+/* try to remap a page. returns non-zero on failure */
+int remap_onepage(struct page *page)
+{
+	struct page *newpage;
+	struct zone *zone;
+	struct address_space *mapping = page->mapping;
+	char *np, *op;
+	void *p;
+	int waitcnt, error = -1;
+
+	newpage = alloc_page(GFP_HIGHUSER);
+	if (newpage == NULL)
+		return -ENOMEM;
+	if (TestSetPageLocked(newpage))
+		BUG();
+	lock_page(page);
+
+	if (! PagePrivate(page) && PageWriteback(page))
+#ifdef CONFIG_KDB
+		KDB_ENTER();
+#else
+		BUG();
+#endif
+	if (PagePrivate(page)) {
+		waitcnt = 100;
+		while (PageWriteback(page)) {
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(10);
+			__set_current_state(TASK_RUNNING);
+			if (! --waitcnt)
+				goto radixfail;
+		}
+
+		/* XXX copied from shrink_list() */
+		if (PageDirty(page) &&
+		    is_page_cache_freeable(page) &&
+		    mapping != NULL &&
+		    mapping->a_ops->writepage != NULL) {
+			spin_lock(&mapping->page_lock);
+			if (test_clear_page_dirty(page)) {
+				int res;
+				struct writeback_control wbc = {
+					.sync_mode = WB_SYNC_NONE,
+					.nr_to_write = SWAP_CLUSTER_MAX,
+					.nonblocking = 1,
+					.for_reclaim = 1,
+				};
+
+				list_move(&page->list, &mapping->locked_pages);
+				spin_unlock(&mapping->page_lock);
+
+				SetPageReclaim(page);
+				res = mapping->a_ops->writepage(page, &wbc);
+
+				if (res == WRITEPAGE_ACTIVATE) {
+					ClearPageReclaim(page);
+					goto radixfail;
+				}
+				if (!PageWriteback(page)) {
+					/* synchronous write or broken a_ops? */
+					ClearPageReclaim(page);
+				}
+				lock_page(page);
+				if (! PagePrivate(page))
+					goto bufferdone;
+			} else
+				spin_unlock(&mapping->page_lock);
+		}
+
+		waitcnt = 100;
+		while (1) {
+			if (try_to_release_page(page, GFP_KERNEL))
+				break;
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(10);
+			__set_current_state(TASK_RUNNING);
+			if (! --waitcnt) {
+				print_buffer(page);
+				goto radixfail;
+			}
+		}
+	}
+bufferdone:
+	if (mapping == NULL) {
+		/* The page is an anon page. Allocate swap entry. */
+		if (!add_to_swap(page))
+			goto radixfail;
+		mapping = page->mapping;
+	}
+	error = radix_tree_preload(GFP_KERNEL);
+	if (error)
+		goto radixfail;
+	if (PagePrivate(page)) /* XXX */
+		BUG();
+
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock(&mapping->page_lock);
+	if (mapping != page->mapping)
+		printk("mapping changed %p -> %p, page %p\n",
+		    mapping, page->mapping, page);
+	if (radix_tree_delete(&mapping->page_tree, page->index) == NULL) {
+		/* Page truncated. */
+		spin_unlock(&mapping->page_lock);
+		radix_tree_preload_end();
+		goto radixfail;
+	}
+	/* don't __put_page(page) here. truncate may be in progress */
+	newpage->flags |= page->flags & ~(1 << PG_uptodate) &
+	    ~(1 << PG_highmem) & ~(1 << PG_chainlock) &
+	    ~(1 << PG_direct) & ~(~0UL << ZONE_SHIFT);
+
+	/* list_del(&page->list); XXX */
+	radix_tree_insert(&mapping->page_tree, page->index, newpage);
+	page_cache_get(newpage);
+	newpage->mapping = mapping;
+	newpage->index = page->index;
+	if (PageDirty(page))
+		list_add(&newpage->list, &mapping->dirty_pages);
+	else
+		list_add(&newpage->list, &mapping->clean_pages);
+	spin_unlock(&mapping->page_lock);
+	radix_tree_preload_end();
+
+	pte_chain_lock(page);
+	if (page_mapped(page)) {
+		while ((error = try_to_unmap(page)) == SWAP_AGAIN) {
+			pte_chain_unlock(page);
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(1);
+			__set_current_state(TASK_RUNNING);
+			pte_chain_lock(page);
+		}
+		if (error == SWAP_FAIL) {
+			pte_chain_unlock(page); /* XXX */
+			/* either during mremap or mlocked */
+			goto unmapfail;
+		}
+	}
+	pte_chain_unlock(page);
+	if (PagePrivate(page))
+		printk("buffer reappeared\n");
+
+	unlock_page(page);	/* no lock needed while waiting page count */
+
+	waitcnt = 1;
+wait_again:
+	while (page_count(page) > 2) {
+		waitcnt++;
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(1);
+		if ((waitcnt % 5000) == 0) {
+			printk("remap_onepage: still waiting on %p %d\n", page, waitcnt);
+			break;
+		}
+		if (PagePrivate(page))
+			break;		/* see below */
+	}
+
+	lock_page(page);
+	BUG_ON(page_count(page) == 0);
+	if (PagePrivate(page))
+		try_to_release_page(page, GFP_KERNEL);
+	if (page_count(page) > 2) {
+		if (waitcnt > 50000)
+			goto unmapfail;
+		unlock_page(page);
+		goto wait_again;
+	}
+	if (PageReclaim(page) || PageWriteback(page) || PagePrivate(page))
+#ifdef CONFIG_KDB
+		KDB_ENTER();
+#else
+		BUG();
+#endif
+	if (page_count(page) == 1) {
+		/* page has been truncated.  free both pages. */
+		spin_lock(&mapping->page_lock);
+		p = radix_tree_lookup(&mapping->page_tree, newpage->index);
+		if (p != NULL) {
+			/* new cache page appeared after truncation */
+			printk("page %p newpage %p radix %p\n",
+			    page, newpage, p);
+			BUG_ON(p == newpage);
+		}
+		list_del(&newpage->list);
+		put_page(newpage);
+		if (page_count(newpage) != 1) {
+			printk("newpage count %d != 1, %p\n",
+			    page_count(newpage), newpage);
+			BUG();
+		}
+		/* No need to do page->list. remove_from_page_cache did. */
+		newpage->mapping = page->mapping = NULL;
+		spin_unlock(&mapping->page_lock);
+		ClearPageActive(page);
+		ClearPageActive(newpage);
+		unlock_page(page);
+		unlock_page(newpage);
+		put_page(page);
+		put_page(newpage);
+		return 0;
+	}
+
+	spin_lock(&mapping->page_lock);
+	list_del(&page->list); /* XXX */
+	page->mapping = NULL;
+	spin_unlock(&mapping->page_lock);
+	unlock_page(page);
+
+	np = kmap_atomic(newpage, KM_REMAP0);
+	op = kmap_atomic(page, KM_REMAP1);
+	if (np == NULL || op == NULL) {	/* XXX */
+		printk("%p %p %p %p\n", np, op, newpage, page);
+		BUG();
+	}
+	memcpy(np, op, PAGE_SIZE);
+	kunmap_atomic(page, KM_REMAP1);
+	kunmap_atomic(newpage, KM_REMAP0);
+	ClearPageActive(page);
+	__put_page(page);
+	put_page(page);
+
+	/* We are done. Finish and let the waiters run. */
+	SetPageUptodate(newpage);
+	/* XXX locking order correct? */
+	zone = page_zone(newpage);
+	spin_lock_irq(&zone->lru_lock);
+	if (PageActive(newpage)) {
+		list_add(&newpage->lru, &zone->active_list);
+		zone->nr_active++;
+	} else {
+		list_add(&newpage->lru, &zone->inactive_list);
+		zone->nr_inactive++;
+	}
+	SetPageLRU(newpage);
+	spin_unlock_irq(&zone->lru_lock);
+	unlock_page(newpage);
+	page_cache_release(newpage);
+	return 0;
+
+unmapfail:
+	/*
+	 * Try to unwind by notifying waiters.  If someone misbehaves,
+	 * we die.
+	 */
+	error = radix_tree_preload(GFP_KERNEL);
+	if (error)
+		BUG();
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock(&mapping->page_lock);
+	/* list_del(&newpage->list); */
+	if (radix_tree_delete(&mapping->page_tree, page->index) == NULL)
+		/* Hold extra count to handle truncate */
+		page_cache_get(newpage);
+	radix_tree_insert(&mapping->page_tree, page->index, page);
+	/* no page_cache_get(page); needed */
+	radix_tree_preload_end();
+	spin_unlock(&mapping->page_lock);
+
+	SetPageAgain(newpage);
+	/* XXX unmap needed?  No, it shouldn't.  Handled by fault handlers. */
+	unlock_page(newpage);
+
+	waitcnt = 1;
+	for(; page_count(newpage) > 2; waitcnt++) {
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(1);
+		if ((waitcnt % 10000) == 0) {
+			printk("You are hosed.\n");
+			printk("newpage %p\n", newpage);
+			BUG();
+		}
+	}
+	BUG_ON(PageUptodate(newpage));
+	ClearPageDirty(newpage);
+	ClearPageActive(newpage);
+	spin_lock(&mapping->page_lock);
+	newpage->mapping = NULL;
+	if (page_count(newpage) == 1) {
+		printk("newpage %p truncated. page %p\n", newpage, page);
+		BUG();
+	}
+	list_del(&newpage->list);
+	spin_unlock(&mapping->page_lock);
+	unlock_page(page);
+	__put_page(newpage);
+	__free_page(newpage);
+	return 1;
+	
+radixfail:
+	unlock_page(page);
+	unlock_page(newpage);
+	__free_page(newpage);
+	return 1;
+}
+
+static struct work_struct lru_drain_wq[NR_CPUS];
+static void
+lru_drain_schedule(void *p)
+{
+	int cpu = get_cpu();
+
+	schedule_work(&lru_drain_wq[cpu]);
+	put_cpu();
+}
+
+atomic_t remapd_count;
+int remapd(void *p)
+{
+	struct zone *zone = p;
+	struct page *page, *page1;
+	struct list_head *l;
+	int active, i, nr_failed = 0;
+	int fastmode = 100;
+	LIST_HEAD(failedp);
+
+	daemonize("remap%d", zone->zone_start_pfn);
+	if (atomic_read(&remapd_count) > 0) {
+		printk("remapd already running\n");
+		return 0;
+	}
+	atomic_inc(&remapd_count);
+	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+	while(nr_failed < 100) {
+		spin_lock_irq(&zone->lru_lock);
+		for(active = 0; active < 2; active++) {
+			l = active ? &zone->active_list :
+			    &zone->inactive_list;
+			for(i = 0; ! list_empty(l) && i < 10; i++) {
+				page = list_entry(l->prev, struct page, lru);
+				if (fastmode && PageLocked(page)) {
+					page1 = page;
+					while (fastmode && PageLocked(page)) {
+						page =
+						    list_entry(page->lru.prev,
+						    struct page, lru);
+						fastmode--;
+						if (&page->lru == l) {
+							/* scanned the whole
+							   list */
+							page = page1;
+							break;
+						}
+						if (page == page1)
+							BUG();
+					}
+					if (! fastmode) {
+						printk("used up fastmode\n");
+						page = page1;
+					}
+				}
+				if (! TestClearPageLRU(page))
+					BUG();
+				list_del(&page->lru);
+				if (page_count(page) == 0) {
+					/* the page is in pagevec_release();
+					   shrink_cache says so. */
+					SetPageLRU(page);
+					list_add(&page->lru, l);
+					continue;
+				}
+				if (active)
+					zone->nr_active--;
+				else
+					zone->nr_inactive--;
+				page_cache_get(page);
+				spin_unlock_irq(&zone->lru_lock);
+				goto got_page;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		break;
+
+	got_page:
+		if (remap_onepage(page)) {
+			nr_failed++;
+			list_add(&page->lru, &failedp);
+		}
+	}
+	if (list_empty(&failedp))
+		goto out;
+
+	while (! list_empty(&failedp)) {
+		spin_lock_irq(&zone->lru_lock);
+		page = list_entry(failedp.prev, struct page, lru);
+		list_del(&page->lru);
+		if (PageActive(page)) {
+			list_add(&page->lru, &zone->active_list);
+			zone->nr_active++;
+		} else {
+			list_add(&page->lru, &zone->inactive_list);
+			zone->nr_inactive++;
+		}
+		if (TestSetPageLRU(page))
+			BUG();
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+	}
+out:
+	atomic_dec(&remapd_count);
+	return 0;
+}
+			
+static int __init remapd_init(void)
+{
+	int i;
+
+	for(i = 0; i < NR_CPUS; i++)
+		INIT_WORK(&lru_drain_wq[i], lru_add_drain, NULL);
+	return 0;
+}
+
+module_init(remapd_init);
+#endif
 
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2003-12-10  0:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-01 20:12 memory hotremove prototype, take 3 Luck, Tony
2003-12-02  3:01 ` IWAMOTO Toshihiro
2003-12-02  6:43   ` Hirokazu Takahashi
2003-12-02 22:26 ` Yasunori Goto
  -- strict thread matches above, loose matches on Subject: below --
2003-12-10  0:45 Luck, Tony
2003-12-03 17:57 Luck, Tony
2003-12-03  5:19 Perez-Gonzalez, Inaky
2003-12-01  3:41 IWAMOTO Toshihiro
2003-12-01 19:56 ` Pavel Machek
2003-12-03 19:41 ` Martin J. Bligh
2003-12-04  3:58   ` IWAMOTO Toshihiro
2003-12-04  5:38     ` Martin J. Bligh
2003-12-04 15:44       ` IWAMOTO Toshihiro
2003-12-04 17:12         ` Martin J. Bligh
2003-12-04 18:27         ` Jesse Barnes
2003-12-04 18:29           ` Martin J. Bligh
2003-12-04 18:59             ` Jesse Barnes
