* [patch 0/3] memory hotplug prototype
@ 2004-04-06 10:53 IWAMOTO Toshihiro
  2004-04-06 10:56 ` [patch 1/3] " IWAMOTO Toshihiro
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-06 10:53 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is an updated version of the memory hotplug prototype patch, which I
have posted here several times.

Main changes are:

	* Changes to make hotpluggable normal zones possible:
		* Added two fields (removable, enabled) to pglist_data.
		* Added a __GFP_HOTREMOVABLE macro for page cache and
		  anonymous pages.
		* Added an element to node_zonelists[];
		  node_zonelists[3] is for __GFP_HOTREMOVABLE
		  (see the sketch after this list).

	* node_zonelists[] are calculated using pgdat->removable and
	  pgdat->enabled.  zone_active[] checks have been removed.

	* Remap code has been cleaned up to share code with hugepage
	  handling and to improve readability.

	* Some hacks to work better under high I/O load. (incomplete)

	* Changed the page remapping rollback detection logic so that it
	  does not use the PG_again bit, as suggested by Dave Hansen on
	  lhms-devel@sourceforge.  PG_again is kept for consistency
	  checks. (not well tested)

	* Added an argument to remap_onepage() specifying the node from
	  which replacement pages are allocated. (NUMA support, incomplete)
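
As a reference for the zonelist changes above, here is a minimal sketch
(not part of the patch).  __GFP_HOTREMOVABLE is the value 0x03 while
GFP_ZONEMASK stays 0x03, so GFP_HIGHUSER allocations pick the new fourth
zonelist, which build_zonelists() fills with hotremovable zones before
the ordinary fallbacks.  The structures are trimmed down and
pick_zonelist() is a made-up name; in the real tree the same indexing
happens where the zonelist is passed to __alloc_pages().

	#define __GFP_DMA		0x01
	#define __GFP_HIGHMEM		0x02
	#define __GFP_HOTREMOVABLE	0x03	/* page cache and anon pages */
	#define GFP_ZONEMASK		0x03
	#define MAX_NR_ZONELISTS	4

	/* trimmed-down structures, for illustration only */
	struct zone;
	struct zonelist { struct zone *zones[16]; };
	struct pglist_data {
		struct zonelist node_zonelists[MAX_NR_ZONELISTS];
		char removable, enabled;
	};

	/*
	 * Made-up helper: the zone modifier bits double as the zonelist
	 * index (0 normal, 1 DMA, 2 highmem, 3 hotremovable).
	 */
	static struct zonelist *pick_zonelist(struct pglist_data *pgdat,
	    unsigned int gfp_mask)
	{
		return &pgdat->node_zonelists[gfp_mask & GFP_ZONEMASK];
	}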


The /proc/memhotplug interface has been changed. For example:

        # echo plug 1 > /proc/memhotplug
		Plugs node 1.
        # echo enable 1 > /proc/memhotplug
		Enables page allocation from node 1.
        # echo disable 1 > /proc/memhotplug
		Disables page allocation from node 1.
        # echo remap 5 > /proc/memhotplug
		Frees pages in zone 5 by remapping.
        # echo unplug 1 > /proc/memhotplug
		Unplugs node 1. (All pages must be freed in advance)

	$ cat /proc/memhotplug
	Node 0 enabled nonhotremovable
		DMA[0]: free 250, active 940, present 4096
		Normal[1]: free 307, active 101623, present 126976

	Node 1 enabled hotremovable
		Normal[5]: free 336, active 9287, present 83968
		HighMem[6]: free 88, active 14406, present 45056
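
The write handler itself lives in va-proc_memhotplug.patch (patch 3 of
this series, sent separately).  Purely as a hypothetical sketch of how
the commands above could be dispatched to the hooks added by patches 1
and 2 (memhotplug_write_proc() and its parsing are invented here;
plug_node(), enable_node(), disable_node(), unplug_node() and remapd()
are the real entry points from the patches):

	#include <linux/kernel.h>	/* sscanf() */
	#include <linux/proc_fs.h>	/* write_proc_t */
	#include <linux/sched.h>	/* kernel_thread(), CLONE_KERNEL */
	#include <linux/mm.h>		/* zone_table[] */
	#include <linux/memhotplug.h>	/* remapd() */
	#include <asm/uaccess.h>	/* copy_from_user() */

	/* defined in arch/i386/mm/discontig.c by patch 2; declared here
	 * only for this sketch */
	extern void plug_node(int), enable_node(int), disable_node(int);
	extern int unplug_node(int);

	/* Hypothetical dispatch; not taken from va-proc_memhotplug.patch. */
	static int memhotplug_write_proc(struct file *file,
	    const char __user *buffer, unsigned long count, void *data)
	{
		char cmd[32];
		int arg;

		if (count >= sizeof(cmd))
			return -EINVAL;
		if (copy_from_user(cmd, buffer, count))
			return -EFAULT;
		cmd[count] = '\0';

		if (sscanf(cmd, "plug %d", &arg) == 1)
			plug_node(arg);
		else if (sscanf(cmd, "enable %d", &arg) == 1)
			enable_node(arg);
		else if (sscanf(cmd, "disable %d", &arg) == 1)
			disable_node(arg);
		else if (sscanf(cmd, "remap %d", &arg) == 1)
			/* assumes the number shown in /proc is a
			   zone_table[] index */
			kernel_thread(remapd, zone_table[arg], CLONE_KERNEL);
		else if (sscanf(cmd, "unplug %d", &arg) == 1) {
			if (unplug_node(arg))
				return -EBUSY;
		} else
			return -EINVAL;

		return count;
	}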


Known issues/TODO items:

	* kswapd doesn't terminate when a node is unplugged.

	* Currently, a page is written back to disk before remapping
	  if it has a dirty buffer.  This can be too slow, so such
	  pages need to be handled without issuing writebacks.
	  I guess this would require a new VFS interface.


My patch consists of 3 files:
	1. memoryhotplug.patch
		The main patch.

	2. va-emulation_memhotplug.patch
		Emulates hotpluggable memory blocks on an ordinary PC.

	3. va-proc_memhotplug.patch
		The /proc/memhotplug interface.

They are sent as follow-ups to this mail and will shortly be followed by
Takahashi's hugetlbfs page remapping patches.

--
IWAMOTO Toshihiro


* [patch 1/3] memory hotplug prototype
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
@ 2004-04-06 10:56 ` IWAMOTO Toshihiro
  2004-04-06 17:12   ` Dave Hansen
  2004-04-06 10:58 ` [patch 2/3] " IWAMOTO Toshihiro
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-06 10:56 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

memoryhotplug.patch: The main, most important patch.

$Id: memoryhotplug.patch,v 1.72 2004/04/06 10:56:05 iwamoto Exp $

diff -dpurN linux-2.6.5/arch/i386/Kconfig linux-2.6.5-mh/arch/i386/Kconfig
--- linux-2.6.5/arch/i386/Kconfig	Sun Apr  4 12:36:25 2004
+++ linux-2.6.5-mh/arch/i386/Kconfig	Mon Apr  5 12:44:53 2004
@@ -717,9 +717,19 @@ comment "NUMA (NUMA-Q) requires SMP, 64G
 comment "NUMA (Summit) requires SMP, 64GB highmem support, ACPI"
 	depends on X86_SUMMIT && (!HIGHMEM64G || !ACPI)
 
+config MEMHOTPLUG
+	bool "Memory hotplug test"
+	depends on !X86_PAE
+	default n
+
+config MEMHOTPLUG_BLKSIZE
+	int "Size of a memory hotplug unit (in MB, must be multiple of 256)."
+	range 256 1024
+	depends on MEMHOTPLUG
+
 config DISCONTIGMEM
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUG
 	default y
 
 config HAVE_ARCH_BOOTMEM_NODE
diff -dpurN linux-2.6.5/include/linux/gfp.h linux-2.6.5-mh/include/linux/gfp.h
--- linux-2.6.5/include/linux/gfp.h	Sun Apr  4 12:36:52 2004
+++ linux-2.6.5-mh/include/linux/gfp.h	Mon Apr  5 12:44:53 2004
@@ -7,9 +7,10 @@
 /*
  * GFP bitmasks..
  */
-/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
-#define __GFP_DMA	0x01
-#define __GFP_HIGHMEM	0x02
+/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low three bits) */
+#define __GFP_DMA		0x01
+#define __GFP_HIGHMEM		0x02
+#define __GFP_HOTREMOVABLE	0x03
 
 /*
  * Action modifiers - doesn't change the zoning
@@ -41,7 +42,7 @@
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_HOTREMOVABLE)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
diff -dpurN linux-2.6.5/include/linux/mm.h linux-2.6.5-mh/include/linux/mm.h
--- linux-2.6.5/include/linux/mm.h	Sun Apr  4 12:36:15 2004
+++ linux-2.6.5-mh/include/linux/mm.h	Mon Apr  5 12:44:53 2004
@@ -228,7 +228,14 @@ struct page {
  */
 #define put_page_testzero(p)				\
 	({						\
-		BUG_ON(page_count(p) == 0);		\
+		if (page_count(p) == 0) {		\
+			int i;						\
+			printk("Page: %lx ", (long)p);			\
+			for(i = 0; i < sizeof(struct page); i++)	\
+				printk(" %02x", ((unsigned char *)p)[i]); \
+			printk("\n");					\
+			BUG();				\
+		}					\
 		atomic_dec_and_test(&(p)->count);	\
 	})
 
@@ -286,6 +293,11 @@ static inline void put_page(struct page 
 }
 
 #endif		/* CONFIG_HUGETLB_PAGE */
+
+static inline int is_page_cache_freeable(struct page *page)
+{
+	return page_count(page) - !!PagePrivate(page) == 2;
+}
 
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
diff -dpurN linux-2.6.5/include/linux/mmzone.h linux-2.6.5-mh/include/linux/mmzone.h
--- linux-2.6.5/include/linux/mmzone.h	Sun Apr  4 12:37:23 2004
+++ linux-2.6.5-mh/include/linux/mmzone.h	Mon Apr  5 12:46:47 2004
@@ -160,8 +160,10 @@ struct zone {
 #define ZONE_DMA		0
 #define ZONE_NORMAL		1
 #define ZONE_HIGHMEM		2
+#define ZONE_HOTREMOVABLE	3	/* only for zonelists */
 
 #define MAX_NR_ZONES		3	/* Sync this with ZONES_SHIFT */
+#define MAX_NR_ZONELISTS	4
 #define ZONES_SHIFT		2	/* ceil(log2(MAX_NR_ZONES)) */
 
 #define GFP_ZONEMASK	0x03
@@ -203,7 +205,7 @@ struct zonelist {
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_NR_ZONES];
+	struct zonelist node_zonelists[MAX_NR_ZONELISTS];
 	int nr_zones;
 	struct page *node_mem_map;
 	struct bootmem_data *bdata;
@@ -215,6 +217,7 @@ typedef struct pglist_data {
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+	char removable, enabled;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff -dpurN linux-2.6.5/include/linux/page-flags.h linux-2.6.5-mh/include/linux/page-flags.h
--- linux-2.6.5/include/linux/page-flags.h	Sun Apr  4 12:37:37 2004
+++ linux-2.6.5-mh/include/linux/page-flags.h	Mon Apr  5 12:44:53 2004
@@ -76,6 +76,8 @@
 #define PG_reclaim		18	/* To be reclaimed asap */
 #define PG_compound		19	/* Part of a compound page */
 
+#define PG_again		20
+
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -297,6 +299,10 @@ extern void get_full_page_state(struct p
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define SetPageCompound(page)	set_bit(PG_compound, &(page)->flags)
 #define ClearPageCompound(page)	clear_bit(PG_compound, &(page)->flags)
+
+#define PageAgain(page)	test_bit(PG_again, &(page)->flags)
+#define SetPageAgain(page)	set_bit(PG_again, &(page)->flags)
+#define ClearPageAgain(page)	clear_bit(PG_again, &(page)->flags)
 
 /*
  * The PageSwapCache predicate doesn't use a PG_flag at this time,
diff -dpurN linux-2.6.5/include/linux/swap.h linux-2.6.5-mh/include/linux/swap.h
--- linux-2.6.5/include/linux/swap.h	Sun Apr  4 12:36:15 2004
+++ linux-2.6.5-mh/include/linux/swap.h	Mon Apr  5 12:44:53 2004
@@ -183,13 +183,13 @@ int FASTCALL(page_referenced(struct page
 struct pte_chain *FASTCALL(page_add_rmap(struct page *, pte_t *,
 					struct pte_chain *));
 void FASTCALL(page_remove_rmap(struct page *, pte_t *));
-int FASTCALL(try_to_unmap(struct page *));
+int FASTCALL(try_to_unmap(struct page *, struct list_head *));
 
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 #else
 #define page_referenced(page)	TestClearPageReferenced(page)
-#define try_to_unmap(page)	SWAP_FAIL
+#define try_to_unmap(page, force)	SWAP_FAIL
 #endif /* CONFIG_MMU */
 
 /* return values of try_to_unmap */
diff -dpurN linux-2.6.5/mm/Makefile linux-2.6.5-mh/mm/Makefile
--- linux-2.6.5/mm/Makefile	Sun Apr  4 12:37:36 2004
+++ linux-2.6.5-mh/mm/Makefile	Mon Apr  5 12:44:53 2004
@@ -12,3 +12,5 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+
+obj-$(CONFIG_MEMHOTPLUG) += memhotplug.o
diff -dpurN linux-2.6.5/mm/filemap.c linux-2.6.5-mh/mm/filemap.c
--- linux-2.6.5/mm/filemap.c	Sun Apr  4 12:36:55 2004
+++ linux-2.6.5-mh/mm/filemap.c	Mon Apr  5 12:44:53 2004
@@ -248,7 +248,8 @@ EXPORT_SYMBOL(filemap_fdatawait);
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, int gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error = radix_tree_preload((gfp_mask & ~GFP_ZONEMASK) |
+	    ((gfp_mask & GFP_ZONEMASK) == __GFP_DMA ? __GFP_DMA : 0));
 
 	if (error == 0) {
 		page_cache_get(page);
@@ -455,6 +456,7 @@ repeat:
 				page_cache_release(page);
 				goto repeat;
 			}
+			BUG_ON(PageAgain(page));
 		}
 	}
 	spin_unlock(&mapping->page_lock);
@@ -679,6 +681,8 @@ page_not_up_to_date:
 			goto page_ok;
 		}
 
+		BUG_ON(PageAgain(page));
+
 readpage:
 		/* ... and start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
@@ -1135,6 +1139,8 @@ page_not_uptodate:
 		goto success;
 	}
 
+	BUG_ON(PageAgain(page));
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1243,6 +1249,8 @@ page_not_uptodate:
 		goto success;
 	}
 
+	BUG_ON(PageAgain(page));
+
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
 		if (PageUptodate(page))
@@ -1451,6 +1459,8 @@ retry:
 		unlock_page(page);
 		goto out;
 	}
+	BUG_ON(PageAgain(page));
+
 	err = filler(data, page);
 	if (err < 0) {
 		page_cache_release(page);
diff -dpurN linux-2.6.5/mm/memory.c linux-2.6.5-mh/mm/memory.c
--- linux-2.6.5/mm/memory.c	Sun Apr  4 12:36:58 2004
+++ linux-2.6.5-mh/mm/memory.c	Mon Apr  5 12:44:53 2004
@@ -1248,6 +1248,7 @@ static int do_swap_page(struct mm_struct
 
 	pte_unmap(page_table);
 	spin_unlock(&mm->page_table_lock);
+again:
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		swapin_readahead(entry);
@@ -1280,6 +1281,14 @@ static int do_swap_page(struct mm_struct
 		goto out;
 	}
 	lock_page(page);
+	if (page->mapping == NULL) {
+		BUG_ON(! PageAgain(page));
+		unlock_page(page);
+		page_cache_release(page);
+		pte_chain_free(pte_chain);
+		goto again;
+	}
+	BUG_ON(PageAgain(page));
 
 	/*
 	 * Back out if somebody else faulted in this pte while we
diff -dpurN linux-2.6.5/mm/page_alloc.c linux-2.6.5-mh/mm/page_alloc.c
--- linux-2.6.5/mm/page_alloc.c	Sun Apr  4 12:36:17 2004
+++ linux-2.6.5-mh/mm/page_alloc.c	Tue Apr  6 13:27:58 2004
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/pagevec.h>
+#include <linux/memhotplug.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/notifier.h>
@@ -220,6 +221,7 @@ static inline void free_pages_check(cons
 			1 << PG_active	|
 			1 << PG_reclaim	|
 			1 << PG_slab	|
+			1 << PG_again	|
 			1 << PG_writeback )))
 		bad_page(function, page);
 	if (PageDirty(page))
@@ -327,12 +329,13 @@ static void prep_new_page(struct page *p
 			1 << PG_active	|
 			1 << PG_dirty	|
 			1 << PG_reclaim	|
+			1 << PG_again	|
 			1 << PG_writeback )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_checked | 1 << PG_mappedtodisk);
+			1 << PG_checked | 1 << PG_mappedtodisk | 1 << PG_again);
 	page->private = 0;
 	set_page_refs(page, order);
 }
@@ -390,7 +393,7 @@ static int rmqueue_bulk(struct zone *zon
 	return allocated;
 }
 
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMHOTPLUG)
 static void __drain_pages(unsigned int cpu)
 {
 	struct zone *zone;
@@ -433,7 +436,9 @@ int is_head_of_free_region(struct page *
 	spin_unlock_irqrestore(&zone->lock, flags);
         return 0;
 }
+#endif
 
+#if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_MEMHOTPLUG)
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
  */
@@ -1106,13 +1111,21 @@ void show_free_areas(void)
 /*
  * Builds allocation fallback zone lists.
  */
-static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
+static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
 {
+
+	if (! pgdat->enabled)
+		return j;
+	if (k != ZONE_HOTREMOVABLE &&
+	    pgdat->removable)
+		return j;
+
 	switch (k) {
 		struct zone *zone;
 	default:
 		BUG();
 	case ZONE_HIGHMEM:
+	case ZONE_HOTREMOVABLE:
 		zone = pgdat->node_zones + ZONE_HIGHMEM;
 		if (zone->present_pages) {
 #ifndef CONFIG_HIGHMEM
@@ -1239,24 +1252,48 @@ static void __init build_zonelists(pg_da
 
 #else	/* CONFIG_NUMA */
 
-static void __init build_zonelists(pg_data_t *pgdat)
+static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
+	int hotremovable;
+#ifdef CONFIG_MEMHOTPLUG
+	struct zone *zone;
+#endif
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
+	for (i = 0; i < MAX_NR_ZONELISTS; i++) {
 		struct zonelist *zonelist;
 
 		zonelist = pgdat->node_zonelists + i;
-		memset(zonelist, 0, sizeof(*zonelist));
+		/* memset(zonelist, 0, sizeof(*zonelist)); */
 
 		j = 0;
 		k = ZONE_NORMAL;
-		if (i & __GFP_HIGHMEM)
+		hotremovable = 0;
+		switch (i) {
+		default:
+			BUG();
+			return;
+		case 0:
+			k = ZONE_NORMAL;
+			break;
+		case __GFP_HIGHMEM:
 			k = ZONE_HIGHMEM;
-		if (i & __GFP_DMA)
+			break;
+		case __GFP_DMA:
 			k = ZONE_DMA;
+			break;
+		case __GFP_HOTREMOVABLE:
+#ifdef CONFIG_MEMHOTPLUG
+			k = ZONE_HIGHMEM;
+#else
+			k = ZONE_HOTREMOVABLE;
+#endif
+			hotremovable = 1;
+			break;
+		}
 
+#ifndef CONFIG_MEMHOTPLUG
  		j = build_zonelists_node(pgdat, zonelist, j, k);
  		/*
  		 * Now we build the zonelist so that it contains the zones
@@ -1267,22 +1304,59 @@ static void __init build_zonelists(pg_da
  		 * node N+1 (modulo N)
  		 */
  		for (node = local_node + 1; node < numnodes; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+			j = build_zonelists_node(NODE_DATA(node),
+			    zonelist, j, k);
  		for (node = 0; node < local_node; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- 
-		zonelist->zones[j++] = NULL;
+			j = build_zonelists_node(NODE_DATA(node),
+			    zonelist, j, k);
+#else
+		while (hotremovable >= 0) {
+			for(; k >= 0; k--) {
+				zone = pgdat->node_zones + k;
+				for (node = local_node; ;) {
+					if (NODE_DATA(node) == NULL ||
+					    ! NODE_DATA(node)->enabled ||
+					    (!! NODE_DATA(node)->removable) !=
+					    (!! hotremovable))
+						goto next;
+					zone = NODE_DATA(node)->node_zones + k;
+					if (zone->present_pages)
+						zonelist->zones[j++] = zone;
+				next:
+					node = (node + 1) % numnodes;
+					if (node == local_node)
+						break;
+				}
+			}
+			if (hotremovable) {
+				/* place non-hotremovable after hotremovable */
+				k = ZONE_HIGHMEM;
+			}
+			hotremovable--;
+		}
+#endif
+		BUG_ON(j > sizeof(zonelist->zones) /
+		    sizeof(zonelist->zones[0]) - 1);
+		for(; j < sizeof(zonelist->zones) /
+		    sizeof(zonelist->zones[0]); j++)
+			zonelist->zones[j] = NULL;
 	} 
 }
 
 #endif	/* CONFIG_NUMA */
 
-void __init build_all_zonelists(void)
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+build_all_zonelists(void)
 {
 	int i;
 
 	for(i = 0 ; i < numnodes ; i++)
-		build_zonelists(NODE_DATA(i));
+		if (NODE_DATA(i) != NULL)
+			build_zonelists(NODE_DATA(i));
 	printk("Built %i zonelists\n", numnodes);
 }
 
@@ -1354,7 +1428,7 @@ static void __init calculate_zone_totalp
  * up by free_all_bootmem() once the early boot process is
  * done. Non-atomic initialization, single-pass.
  */
-void __init memmap_init_zone(struct page *start, unsigned long size, int nid,
+void memmap_init_zone(struct page *start, unsigned long size, int nid,
 		unsigned long zone, unsigned long start_pfn)
 {
 	struct page *page;
@@ -1392,10 +1466,13 @@ static void __init free_area_init_core(s
 	int cpu, nid = pgdat->node_id;
 	struct page *lmem_map = pgdat->node_mem_map;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
+#ifdef CONFIG_MEMHOTPLUG
+	int cold = ! nid;
+#endif	
 
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
-	
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize;
@@ -1465,6 +1542,13 @@ static void __init free_area_init_core(s
 		zone->wait_table_size = wait_table_size(size);
 		zone->wait_table_bits =
 			wait_table_bits(zone->wait_table_size);
+#ifdef CONFIG_MEMHOTPLUG
+		if (! cold)
+			zone->wait_table = (wait_queue_head_t *)
+				kmalloc(zone->wait_table_size
+				* sizeof(wait_queue_head_t), GFP_KERNEL);
+		else
+#endif
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, zone->wait_table_size
 						* sizeof(wait_queue_head_t));
@@ -1519,6 +1603,13 @@ static void __init free_area_init_core(s
 			 */
 			bitmap_size = (size-1) >> (i+4);
 			bitmap_size = LONG_ALIGN(bitmap_size+1);
+#ifdef CONFIG_MEMHOTPLUG
+			if (! cold) {
+			zone->free_area[i].map = 
+			  (unsigned long *)kmalloc(bitmap_size, GFP_KERNEL);
+			memset(zone->free_area[i].map, 0, bitmap_size);
+			} else
+#endif
 			zone->free_area[i].map = 
 			  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 		}
@@ -1749,7 +1840,7 @@ void __init page_alloc_init(void)
  *	that the pages_{min,low,high} values for each zone are set correctly 
  *	with respect to min_free_kbytes.
  */
-static void setup_per_zone_pages_min(void)
+void setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
diff -dpurN linux-2.6.5/mm/rmap.c linux-2.6.5-mh/mm/rmap.c
--- linux-2.6.5/mm/rmap.c	Sun Apr  4 12:38:16 2004
+++ linux-2.6.5-mh/mm/rmap.c	Mon Apr  5 12:44:53 2004
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/memhotplug.h>
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/rmap-locking.h>
@@ -293,13 +294,18 @@ out_unlock:
  *		pte_chain_lock		shrink_list()
  *		    mm->page_table_lock	try_to_unmap_one(), trylock
  */
-static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t));
-static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr)
+static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t,
+    struct list_head *));
+static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr,
+    struct list_head *force)
 {
 	pte_t *ptep = rmap_ptep_map(paddr);
 	unsigned long address = ptep_to_address(ptep);
 	struct mm_struct * mm = ptep_to_mm(ptep);
 	struct vm_area_struct * vma;
+#ifdef CONFIG_MEMHOTPLUG
+	struct page_va_list *vlist;
+#endif
 	pte_t pte;
 	int ret;
 
@@ -325,8 +331,16 @@ static int fastcall try_to_unmap_one(str
 
 	/* The page is mlock()d, we cannot swap it out. */
 	if (vma->vm_flags & VM_LOCKED) {
-		ret = SWAP_FAIL;
-		goto out_unlock;
+		if (force == NULL) {
+			ret = SWAP_FAIL;
+			goto out_unlock;
+		}
+#ifdef CONFIG_MEMHOTPLUG
+		vlist = kmalloc(sizeof(struct page_va_list), GFP_KERNEL);
+		vlist->mm = mm;
+		vlist->addr = address;
+		list_add(&vlist->list, force);
+#endif
 	}
 
 	/* Nuke the page table entry. */
@@ -383,7 +397,7 @@ out_unlock:
  * SWAP_AGAIN	- we missed a trylock, try again later
  * SWAP_FAIL	- the page is unswappable
  */
-int fastcall try_to_unmap(struct page * page)
+int fastcall try_to_unmap(struct page * page, struct list_head *force)
 {
 	struct pte_chain *pc, *next_pc, *start;
 	int ret = SWAP_SUCCESS;
@@ -399,7 +413,7 @@ int fastcall try_to_unmap(struct page * 
 		BUG();
 
 	if (PageDirect(page)) {
-		ret = try_to_unmap_one(page, page->pte.direct);
+		ret = try_to_unmap_one(page, page->pte.direct, force);
 		if (ret == SWAP_SUCCESS) {
 			if (page_test_and_clear_dirty(page))
 				set_page_dirty(page);
@@ -420,7 +434,7 @@ int fastcall try_to_unmap(struct page * 
 		for (i = pte_chain_idx(pc); i < NRPTE; i++) {
 			pte_addr_t pte_paddr = pc->ptes[i];
 
-			switch (try_to_unmap_one(page, pte_paddr)) {
+			switch (try_to_unmap_one(page, pte_paddr, force)) {
 			case SWAP_SUCCESS:
 				/*
 				 * Release a slot.  If we're releasing the
diff -dpurN linux-2.6.5/mm/swap_state.c linux-2.6.5-mh/mm/swap_state.c
--- linux-2.6.5/mm/swap_state.c	Sun Apr  4 12:36:57 2004
+++ linux-2.6.5-mh/mm/swap_state.c	Mon Apr  5 12:44:53 2004
@@ -234,12 +234,21 @@ int move_from_swap_cache(struct page *pa
 	spin_lock(&swapper_space.page_lock);
 	spin_lock(&mapping->page_lock);
 
+	if (radix_tree_lookup(&page->mapping->page_tree, page->index)
+	    != page) {
+		/* remap in progress */
+		printk("move_from_swap_cache: under remap %p\n", page);
+		err = -EAGAIN;
+		goto out;
+	}
+	
 	err = radix_tree_insert(&mapping->page_tree, index, page);
 	if (!err) {
 		__delete_from_swap_cache(page);
 		___add_to_page_cache(page, mapping, index);
 	}
 
+out:
 	spin_unlock(&mapping->page_lock);
 	spin_unlock(&swapper_space.page_lock);
 
diff -dpurN linux-2.6.5/mm/swapfile.c linux-2.6.5-mh/mm/swapfile.c
--- linux-2.6.5/mm/swapfile.c	Sun Apr  4 12:36:26 2004
+++ linux-2.6.5-mh/mm/swapfile.c	Mon Apr  5 12:44:53 2004
@@ -607,6 +607,7 @@ static int try_to_unuse(unsigned int typ
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
+	again:
 		page = read_swap_cache_async(entry);
 		if (!page) {
 			/*
@@ -641,6 +642,13 @@ static int try_to_unuse(unsigned int typ
 		wait_on_page_locked(page);
 		wait_on_page_writeback(page);
 		lock_page(page);
+		if (page->mapping != &swapper_space) {
+			BUG_ON(! PageAgain(page));
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
+		BUG_ON(PageAgain(page));
 		wait_on_page_writeback(page);
 
 		/*
@@ -749,6 +757,7 @@ static int try_to_unuse(unsigned int typ
 
 			swap_writepage(page, &wbc);
 			lock_page(page);
+			BUG_ON(PageAgain(page));
 			wait_on_page_writeback(page);
 		}
 		if (PageSwapCache(page)) {
diff -dpurN linux-2.6.5/mm/truncate.c linux-2.6.5-mh/mm/truncate.c
--- linux-2.6.5/mm/truncate.c	Sun Apr  4 12:38:18 2004
+++ linux-2.6.5-mh/mm/truncate.c	Mon Apr  5 12:44:53 2004
@@ -132,6 +132,8 @@ void truncate_inode_pages(struct address
 			next++;
 			if (TestSetPageLocked(page))
 				continue;
+			/* no PageAgain(page) check; page->mapping check
+			 * is done in truncate_complete_page */
 			if (PageWriteback(page)) {
 				unlock_page(page);
 				continue;
@@ -165,6 +167,24 @@ void truncate_inode_pages(struct address
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
+			if (page->mapping == NULL) {
+				/* XXX Is page->index still valid? */
+				unsigned long index = page->index;
+				int again = PageAgain(page);
+
+				unlock_page(page);
+				put_page(page);
+				page = find_lock_page(mapping, index);
+				if (page == NULL) {
+					BUG_ON(again);
+					/* XXX */
+					if (page->index > next)
+						next = page->index;
+					next++;
+				}
+				BUG_ON(! again);
+				pvec.pages[i] = page;
+			}
 			wait_on_page_writeback(page);
 			if (page->index > next)
 				next = page->index;
@@ -257,14 +277,29 @@ void invalidate_inode_pages2(struct addr
 			struct page *page = pvec.pages[i];
 
 			lock_page(page);
-			if (page->mapping == mapping) {	/* truncate race? */
-				wait_on_page_writeback(page);
-				next = page->index + 1;
-				if (page_mapped(page))
-					clear_page_dirty(page);
-				else
-					invalidate_complete_page(mapping, page);
+			while (page->mapping != mapping) {
+				struct page *newpage;
+				unsigned long index = page->index;
+
+				BUG_ON(page->mapping != NULL);
+
+				unlock_page(page);
+				newpage = find_lock_page(mapping, index);
+				if (page == newpage) {
+					put_page(page);
+					break;
+				}
+				BUG_ON(! PageAgain(page));
+				pvec.pages[i] = newpage;
+				put_page(page);
+				page = newpage;
 			}
+			wait_on_page_writeback(page);
+			next = page->index + 1;
+			if (page_mapped(page))
+				clear_page_dirty(page);
+			else
+				invalidate_complete_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
diff -dpurN linux-2.6.5/mm/vmscan.c linux-2.6.5-mh/mm/vmscan.c
--- linux-2.6.5/mm/vmscan.c	Sun Apr  4 12:36:24 2004
+++ linux-2.6.5-mh/mm/vmscan.c	Mon Apr  5 12:44:53 2004
@@ -199,11 +199,6 @@ static inline int page_mapping_inuse(str
 	return 0;
 }
 
-static inline int is_page_cache_freeable(struct page *page)
-{
-	return page_count(page) - !!PagePrivate(page) == 2;
-}
-
 static int may_write_to_queue(struct backing_dev_info *bdi)
 {
 	if (current_is_kswapd())
@@ -311,7 +306,7 @@ shrink_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page)) {
+			switch (try_to_unmap(page, NULL)) {
 			case SWAP_FAIL:
 				pte_chain_unlock(page);
 				goto activate_locked;
@@ -1140,4 +1140,14 @@ static int __init kswapd_init(void)
 	return 0;
 }
 
+#ifdef CONFIG_MEMHOTPLUG
+void
+kswapd_start_one(pg_data_t *pgdat)
+{
+	pgdat->kswapd
+	= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
+	total_memory = nr_free_pagecache_pages();
+}
+#endif
+
 module_init(kswapd_init)
diff -dpurN linux-2.6.5/include/linux/memhotplug.h linux-2.6.5-mh/include/linux/memhotplug.h
--- linux-2.6.5/include/linux/memhotplug.h	Thu Jan  1 09:00:00 1970
+++ linux-2.6.5-mh/include/linux/memhotplug.h	Mon Apr  5 12:44:53 2004
@@ -0,0 +1,32 @@
+#ifndef _LINUX_MEMHOTPLUG_H
+#define _LINUX_MEMHOTPLUG_H
+
+#include <linux/config.h>
+#include <linux/mm.h>
+
+#ifdef __KERNEL__
+
+struct page_va_list {
+	struct mm_struct *mm;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct remap_operations {
+	struct page * (*remap_alloc_page)(int);
+	int (*remap_delete_page)(struct page *);
+	int (*remap_copy_page)(struct page *, struct page *);
+	int (*remap_lru_add_page)(struct page *);
+	int (*remap_release_buffers)(struct page *);
+	int (*remap_prepare)(struct page *page, int fastmode);
+	int (*remap_stick_page)(struct list_head *vlist);
+};
+
+extern int remapd(void *p);
+extern int remap_onepage(struct page *, int, int, struct remap_operations *);
+extern int remap_onepage_normal(struct page *, int, int);
+
+#define REMAP_ANYNODE  (-1)
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MEMHOTPLUG_H */
diff -dpurN linux-2.6.5/mm/memhotplug.c linux-2.6.5-mh/mm/memhotplug.c
--- linux-2.6.5/mm/memhotplug.c	Thu Jan  1 09:00:00 1970
+++ linux-2.6.5-mh/mm/memhotplug.c	Mon Apr  5 12:44:53 2004
@@ -0,0 +1,699 @@
+/*
+ *  linux/mm/memhotplug.c
+ *
+ *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
+ *
+ *  Support of memory hotplug, Iwamoto
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/rmap-locking.h>
+#include <linux/memhotplug.h>
+
+#ifdef CONFIG_KDB
+#include <linux/kdb.h>
+#endif
+
+static void
+print_buffer(struct page* page)
+{
+	struct address_space* mapping = page->mapping;
+	struct buffer_head *bh, *head;
+
+	spin_lock(&mapping->private_lock);
+	bh = head = page_buffers(page);
+	printk("buffers:");
+	do {
+		printk(" %lx %d", bh->b_state, atomic_read(&bh->b_count));
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	printk("\n");
+	spin_unlock(&mapping->private_lock);
+}
+
+static int
+stick_mlocked_page(struct list_head *vlist)
+{
+	struct page_va_list *v1;
+	struct vm_area_struct *vma;
+	int error;
+
+	while(!list_empty(vlist)) {
+		v1 = list_entry(vlist->next, struct page_va_list, list);
+		list_del(&v1->list);
+		vma = find_vma(v1->mm, v1->addr);
+		BUG_ON(! (vma->vm_flags & VM_LOCKED));
+		error = get_user_pages(current, v1->mm, v1->addr, PAGE_SIZE,
+		    (vma->vm_flags & VM_WRITE) != 0, 0, NULL, NULL);
+		BUG_ON(error <= 0);
+		kfree(v1);
+	}
+	return 0;
+}
+
+/* helper function for remap_onepage */
+#define	REMAPPREP_WB		1
+#define	REMAPPREP_BUFFER	2
+
+/*
+ * Try to free buffers if "page" has them.
+ */
+static int
+remap_preparepage(struct page *page, int fastmode)
+{
+	struct address_space *mapping;
+	int waitcnt = fastmode ? 0 : 100;
+
+	BUG_ON(! PageLocked(page));
+
+	mapping = page->mapping;
+
+	if (! PagePrivate(page) && PageWriteback(page) &&
+	    page->mapping != &swapper_space) {
+		printk("remap_preparepage: mapping %p page %p\n",
+		    page->mapping, page);
+		return -REMAPPREP_WB;
+	}
+
+	while (PageWriteback(page)) {
+		if (!waitcnt)
+			return -REMAPPREP_WB;
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(10);
+		__set_current_state(TASK_RUNNING);
+		waitcnt--;
+	}
+	if (PagePrivate(page)) {
+		/* XXX copied from shrink_list() */
+		if (PageDirty(page) &&
+		    is_page_cache_freeable(page) &&
+		    mapping != NULL &&
+		    mapping->a_ops->writepage != NULL) {
+			spin_lock(&mapping->page_lock);
+			if (test_clear_page_dirty(page)) {
+				int res;
+				struct writeback_control wbc = {
+					.sync_mode = WB_SYNC_NONE,
+					.nr_to_write = SWAP_CLUSTER_MAX,
+					.nonblocking = 1,
+					.for_reclaim = 1,
+				};
+
+				list_move(&page->list, &mapping->locked_pages);
+				spin_unlock(&mapping->page_lock);
+
+				SetPageReclaim(page);
+				res = mapping->a_ops->writepage(page, &wbc);
+
+				if (res == WRITEPAGE_ACTIVATE) {
+					ClearPageReclaim(page);
+					return -REMAPPREP_WB;
+				}
+				if (!PageWriteback(page)) {
+					/* synchronous write or broken a_ops? */
+					ClearPageReclaim(page);
+				}
+				lock_page(page);
+				mapping = page->mapping;
+				if (! PagePrivate(page))
+					return 0;
+			} else
+				spin_unlock(&mapping->page_lock);
+		}
+
+		while (1) {
+			if (try_to_release_page(page, GFP_KERNEL))
+				break;
+			if (! waitcnt)
+				return -REMAPPREP_BUFFER;
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(10);
+			__set_current_state(TASK_RUNNING);
+			waitcnt--;
+			if (! waitcnt)
+				print_buffer(page);
+		}
+	}
+	return 0;
+}
+
+/*
+ * Just assign swap space to a anonymous page if it doesn't have yet,
+ * so that the page can be handled like a page in the page cache
+ * since it in the swap cache. 
+ */
+static struct address_space *
+make_page_mapped(struct page *page)
+{
+	if (! page_mapped(page)) {
+		if (page_count(page) > 1)
+			printk("page %p not mapped: count %d\n",
+			    page, page_count(page));
+		return NULL;
+	}
+	/* The page is an anon page.  Allocate its swap entry. */
+	if (!add_to_swap(page))
+		return NULL;
+	return page->mapping;
+}
+
+/*
+ * Replace "page" with "newpage" on the radix tree.  After that, all
+ * new access to "page" will be redirected to "newpage" and it
+ * will be blocked until remapping has been done.
+ */
+static int
+radix_tree_replace_pages(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	if (radix_tree_preload(GFP_KERNEL))
+		return -1;
+
+	if (PagePrivate(page)) /* XXX */
+		BUG();
+
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock(&mapping->page_lock);
+	if (mapping != page->mapping)
+		printk("mapping changed %p -> %p, page %p\n",
+		    mapping, page->mapping, page);
+	if (radix_tree_delete(&mapping->page_tree, page->index) == NULL) {
+		/* Page truncated. */
+		spin_unlock(&mapping->page_lock);
+		radix_tree_preload_end();
+		return -1;
+	}
+	/* Don't __put_page(page) here.  Truncate may be in progress. */
+	newpage->flags |= page->flags & ~(1 << PG_uptodate) &
+	    ~(1 << PG_highmem) & ~(1 << PG_chainlock) &
+	    ~(1 << PG_direct) & ~(~0UL << NODEZONE_SHIFT);
+
+	/* list_del(&page->list); XXX */
+	radix_tree_insert(&mapping->page_tree, page->index, newpage);
+	page_cache_get(newpage);
+	newpage->mapping = mapping;
+	newpage->index = page->index;
+	spin_unlock(&mapping->page_lock);
+	radix_tree_preload_end();
+	return 0;
+}
+
+/*
+ * Remove all PTE mappings to "page".
+ */
+static int
+unmap_page(struct page *page, struct list_head *vlist)
+{
+	int error;
+	pte_chain_lock(page);
+	if (page_mapped(page)) {
+		while ((error = try_to_unmap(page, vlist)) == SWAP_AGAIN) {
+			pte_chain_unlock(page);
+			__set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(1);
+			__set_current_state(TASK_RUNNING);
+			pte_chain_lock(page);
+		}
+		if (error == SWAP_FAIL) {
+			pte_chain_unlock(page); /* XXX */
+			/* either during mremap or mlocked */
+			return -1;
+		}
+	}
+	pte_chain_unlock(page);
+	return 0;
+}
+
+/*
+ * Wait for "page" to become free.  Almost same as waiting for its
+ * page count to drop to 2, but truncated pages are special.
+ */
+static int
+wait_on_page_freeable(struct page *page, struct address_space *mapping,
+			struct list_head *vlist, int truncated,
+			int nretry, struct remap_operations *ops)
+{
+	while ((truncated + page_count(page)) > 2) {
+		if (nretry <= 0)
+			return -1;
+		/* no lock needed while waiting page count */
+		unlock_page(page);
+
+		while ((truncated + page_count(page)) > 2) {
+			nretry--;
+			current->state = TASK_INTERRUPTIBLE;
+			schedule_timeout(1);
+			if ((nretry % 5000) == 0) {
+				printk("remap_onepage: still waiting on %p %d\n", page, nretry);
+				break;
+			}
+			if (PagePrivate(page) || page_mapped(page))
+				break;		/* see below */
+		}
+
+		lock_page(page);
+		BUG_ON(page_count(page) == 0);
+		if (mapping != page->mapping && page->mapping != NULL)
+			printk("mapping changed %p -> %p, page %p\n",
+			    mapping, page->mapping, page);
+		if (PagePrivate(page))
+			ops->remap_release_buffers(page);
+		unmap_page(page, vlist);
+	}
+	return nretry;
+}
+
+/*
+ * A file which "page" belongs to has been truncated.  Free both pages.
+ */
+static void
+free_truncated_pages(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	void *p;
+	/* mapping->page_lock must be held. */
+	p = radix_tree_lookup(&mapping->page_tree, newpage->index);
+	if (p != NULL) {
+		/* new cache page appeared after truncation */
+		printk("page %p newpage %p radix %p\n",
+		    page, newpage, p);
+		BUG_ON(p == newpage);
+	}
+	BUG_ON(page->mapping != NULL);
+	put_page(newpage);
+	if (page_count(newpage) != 1) {
+		printk("newpage count %d != 1, %p\n",
+		    page_count(newpage), newpage);
+		BUG();
+	}
+	/* No need to do page->list.  remove_from_page_cache did. */
+	newpage->mapping = page->mapping = NULL;
+	spin_unlock(&mapping->page_lock);
+	ClearPageActive(page);
+	ClearPageActive(newpage);
+	unlock_page(page);
+	unlock_page(newpage);
+	put_page(page);
+	put_page(newpage);
+}
+
+static inline int
+is_page_truncated(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	void *p;
+	spin_lock(&mapping->page_lock);
+	if (page_count(page) == 1) {
+		/* page has been truncated. */
+		return 0;
+	}
+	p = radix_tree_lookup(&mapping->page_tree, newpage->index);
+	spin_unlock(&mapping->page_lock);
+	if (p == NULL) {
+		BUG_ON(page->mapping != NULL);
+		return -1;
+	}
+	return 1;
+}
+
+/*
+ * Replace "page" with "newpage" on the list of clean/dirty pages.
+ */
+static void
+remap_exchange_pages(struct page *page, struct page *newpage,
+			 struct address_space *mapping)
+{
+	spin_lock(&mapping->page_lock);
+	list_del(&page->list); /* XXX */
+	if (PageDirty(page)) {
+		SetPageDirty(newpage);
+		list_add(&newpage->list, &mapping->dirty_pages);
+	} else
+		list_add(&newpage->list, &mapping->clean_pages);
+	page->mapping = NULL;
+	spin_unlock(&mapping->page_lock);
+	unlock_page(page);
+
+	ClearPageActive(page);
+	__put_page(page);
+
+	/* We are done.  Finish and let the waiters run. */
+	SetPageUptodate(newpage);
+}
+
+/*
+ * Roll back all remapping operations.
+ */
+static int
+radix_tree_rewind_page(struct page *page, struct page *newpage,
+		 struct address_space *mapping)
+{
+	int waitcnt;
+	/*
+	 * Try to unwind by notifying waiters.  If someone misbehaves,
+	 * we die.
+	 */
+	if (radix_tree_preload(GFP_KERNEL))
+		BUG();
+	/* should {__add_to,__remove_from}_page_cache be used instead? */
+	spin_lock(&mapping->page_lock);
+	/* list_del(&newpage->list); */
+	if (radix_tree_delete(&mapping->page_tree, page->index) == NULL)
+		/* Hold extra count to handle truncate */
+		page_cache_get(newpage);
+	radix_tree_insert(&mapping->page_tree, page->index, page);
+	/* no page_cache_get(page); needed */
+	radix_tree_preload_end();
+	spin_unlock(&mapping->page_lock);
+
+	SetPageAgain(newpage);
+	/* XXX unmap needed?  No, it shouldn't.  Handled by fault handlers. */
+	unlock_page(newpage);
+
+	waitcnt = 1;
+	for(; page_count(newpage) > 2; waitcnt++) {
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(1);
+		if ((waitcnt % 10000) == 0) {
+			printk("You are hosed.\n");
+			printk("newpage %p\n", newpage);
+			BUG();
+		}
+	}
+	BUG_ON(PageUptodate(newpage));
+	ClearPageDirty(newpage);
+	ClearPageActive(newpage);
+	spin_lock(&mapping->page_lock);
+	newpage->mapping = NULL;
+	if (page_count(newpage) == 1) {
+		printk("newpage %p truncated. page %p\n", newpage, page);
+		BUG();
+	}
+	spin_unlock(&mapping->page_lock);
+	unlock_page(page);
+	BUG_ON(page_count(newpage) != 2);
+	ClearPageAgain(newpage);
+	__put_page(newpage);
+	return 1;
+}
+
+/*
+ * Allocate a new page from specified node.
+ */
+static struct page *
+remap_alloc_page(int nid)
+{
+	if (nid == REMAP_ANYNODE)
+		return alloc_page(GFP_HIGHUSER);
+	else
+		return alloc_pages_node(nid, GFP_HIGHUSER, 0);
+}
+
+static int
+remap_delete_page(struct page *page)
+{
+	BUG_ON(page_count(page) != 1);
+	put_page(page);
+	return 0;
+}
+
+static int
+remap_copy_page(struct page *to, struct page *from)
+{
+	copy_highpage(to, from);
+	return 0;
+}
+
+static int
+remap_lru_add_page(struct page *page)
+{
+#if 1
+	struct zone *zone;
+	/* XXX locking order correct? */
+	zone = page_zone(page);
+	spin_lock_irq(&zone->lru_lock);
+	if (PageActive(page)) {
+		list_add(&page->lru, &zone->active_list);
+		zone->nr_active++;
+	} else {
+		list_add(&page->lru, &zone->inactive_list);
+		zone->nr_inactive++;
+	}
+	SetPageLRU(page);
+	spin_unlock_irq(&zone->lru_lock);
+#endif
+#if 0
+	if (PageActive(page))
+		lru_cache_add_active(page);
+	else
+		lru_cache_add(page);
+#endif
+	return 0;
+}
+
+static int
+remap_release_buffer(struct page *page)
+{
+	try_to_release_page(page, GFP_KERNEL);
+	return 0;
+}
+
+static struct remap_operations remap_ops = {
+	.remap_alloc_page	= remap_alloc_page,
+        .remap_delete_page	= remap_delete_page,
+        .remap_copy_page	= remap_copy_page,
+        .remap_lru_add_page	= remap_lru_add_page,
+        .remap_release_buffers	= remap_release_buffer,
+        .remap_prepare		= remap_preparepage,
+        .remap_stick_page	= stick_mlocked_page
+};
+
+/*
+ * Try to remap a page.  Returns non-zero on failure.
+ */
+int remap_onepage(struct page *page, int nodeid, int fastmode,
+				struct remap_operations *ops)
+{
+	struct page *newpage;
+	struct address_space *mapping;
+	LIST_HEAD(vlist);
+	int truncated = 0;
+	int nretry = fastmode ? HZ/50: HZ*10; /* XXXX */
+
+	if ((newpage = ops->remap_alloc_page(nodeid)) == NULL)
+		return -ENOMEM;
+	if (TestSetPageLocked(newpage))
+		BUG();
+	lock_page(page);
+	mapping = page->mapping;
+
+	if (ops->remap_prepare && ops->remap_prepare(page, fastmode))
+		goto radixfail;
+	if (mapping == NULL && (mapping = make_page_mapped(page)) == NULL)
+		goto radixfail;
+	if (radix_tree_replace_pages(page, newpage, mapping))
+		goto radixfail;
+	if (unmap_page(page, &vlist))
+		goto unmapfail;
+	if (PagePrivate(page))
+		printk("buffer reappeared\n");
+wait_again:
+	if ((nretry = wait_on_page_freeable(page, mapping, &vlist, truncated, nretry, ops)) < 0)
+		goto unmapfail;
+
+	if (PageReclaim(page) || PageWriteback(page) || PagePrivate(page))
+#ifdef CONFIG_KDB
+		KDB_ENTER();
+#else
+		BUG();
+#endif
+	switch (is_page_truncated(page, newpage, mapping)) {
+		case 0:
+			/* has been truncated */
+			free_truncated_pages(page, newpage, mapping);
+			return 0;
+		case -1:
+			/* being truncated */
+			truncated = 1;
+			BUG_ON(page->mapping != NULL);
+			goto wait_again;
+		default:
+			/* through */
+	}
+	
+	BUG_ON(mapping != page->mapping);
+
+	ops->remap_copy_page(newpage, page);
+	remap_exchange_pages(page, newpage, mapping);
+	if (ops->remap_lru_add_page)
+		ops->remap_lru_add_page(newpage);
+	ops->remap_delete_page(page);
+
+	/*
+	 * Wake up all waiters which are waiting for completion
+	 * of remapping operations.
+	 */
+	unlock_page(newpage);
+
+	if (ops->remap_stick_page)
+		ops->remap_stick_page(&vlist);
+	page_cache_release(newpage);
+	return 0;
+
+unmapfail:
+	radix_tree_rewind_page(page, newpage, mapping);
+	if (ops->remap_stick_page)
+		ops->remap_stick_page(&vlist);
+	ops->remap_delete_page(newpage);
+	return 1;
+
+radixfail:
+	unlock_page(page);
+	unlock_page(newpage);
+	if (ops->remap_stick_page)
+		ops->remap_stick_page(&vlist);
+	ops->remap_delete_page(newpage);
+	return 1;
+}
+
+int remap_onepage_normal(struct page *page, int nodeid, int fastmode)
+{
+	return remap_onepage(page, nodeid, fastmode, &remap_ops);
+}
+
+static struct work_struct lru_drain_wq[NR_CPUS];
+static void
+lru_drain_schedule(void *p)
+{
+	int cpu = get_cpu();
+
+	schedule_work(&lru_drain_wq[cpu]);
+	put_cpu();
+}
+
+atomic_t remapd_count;
+int remapd(void *p)
+{
+	struct zone *zone = p;
+	struct page *page, *page1;
+	struct list_head *l;
+	int active, i, nr_failed = 0;
+	int fastmode = 100;
+	LIST_HEAD(failedp);
+
+	daemonize("remap%d", zone->zone_start_pfn);
+	if (atomic_read(&remapd_count) > 0) {
+		printk("remapd already running\n");
+		return 0;
+	}
+	atomic_inc(&remapd_count);
+	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
+	while(nr_failed < 100) {
+		spin_lock_irq(&zone->lru_lock);
+		for(active = 0; active < 2; active++) {
+			l = active ? &zone->active_list :
+			    &zone->inactive_list;
+			for(i = 0; ! list_empty(l) && i < 10; i++) {
+				page = list_entry(l->prev, struct page, lru);
+				if (fastmode && PageLocked(page)) {
+					page1 = page;
+					while (fastmode && PageLocked(page)) {
+						page =
+						    list_entry(page->lru.prev,
+						    struct page, lru);
+						fastmode--;
+						if (&page->lru == l) {
+							/* scanned the whole
+							   list */
+							page = page1;
+							break;
+						}
+						if (page == page1)
+							BUG();
+					}
+					if (! fastmode) {
+						printk("used up fastmode\n");
+						page = page1;
+					}
+				}
+				if (! TestClearPageLRU(page))
+					BUG();
+				list_del(&page->lru);
+				if (page_count(page) == 0) {
+					/* the page is in pagevec_release();
+					   shrink_cache says so. */
+					SetPageLRU(page);
+					list_add(&page->lru, l);
+					continue;
+				}
+				if (active)
+					zone->nr_active--;
+				else
+					zone->nr_inactive--;
+				page_cache_get(page);
+				spin_unlock_irq(&zone->lru_lock);
+				goto got_page;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		break;
+
+	got_page:
+		if (remap_onepage(page, REMAP_ANYNODE, fastmode, &remap_ops)) {
+			nr_failed++;
+			if (fastmode)
+				fastmode--;
+			list_add(&page->lru, &failedp);
+		}
+	}
+	if (list_empty(&failedp))
+		goto out;
+
+	while (! list_empty(&failedp)) {
+		page = list_entry(failedp.prev, struct page, lru);
+		list_del(&page->lru);
+		if (! TestSetPageLocked(page)) {
+			if (remap_preparepage(page, 10 /* XXX */)) {
+				unlock_page(page);
+			} else {
+				ClearPageLocked(page);	/* XXX */
+				if (! remap_onepage(page, REMAP_ANYNODE, 0, &remap_ops))
+					continue;
+			}
+		}
+		spin_lock_irq(&zone->lru_lock);
+		if (PageActive(page)) {
+			list_add(&page->lru, &zone->active_list);
+			zone->nr_active++;
+		} else {
+			list_add(&page->lru, &zone->inactive_list);
+			zone->nr_inactive++;
+		}
+		if (TestSetPageLRU(page))
+			BUG();
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+	}
+out:
+	atomic_dec(&remapd_count);
+	return 0;
+}
+
+static int __init remapd_init(void)
+{
+	int i;
+
+	for(i = 0; i < NR_CPUS; i++)
+		INIT_WORK(&lru_drain_wq[i], (void (*)(void *))lru_add_drain, NULL);
+	return 0;
+}
+
+module_init(remapd_init);


* [patch 2/3] memory hotplug prototype
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
  2004-04-06 10:56 ` [patch 1/3] " IWAMOTO Toshihiro
@ 2004-04-06 10:58 ` IWAMOTO Toshihiro
  2004-04-06 10:59 ` [patch 3/3] " IWAMOTO Toshihiro
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-06 10:58 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

va-emulation_memhotplug.patch:
	Emulation code for hotpluggable memory blocks on an ordinary PC.

$Id: va-emulation_memhotplug.patch,v 1.13 2004/04/06 04:32:20 iwamoto Exp $

diff -dpur linux-2.6.4/arch/i386/Kconfig linux-2.6.4-mh/arch/i386/Kconfig
--- linux-2.6.4/arch/i386/Kconfig	Thu Mar 11 11:55:22 2004
+++ linux-2.6.4-mh/arch/i386/Kconfig	Thu Apr  1 14:46:19 2004
@@ -736,7 +736,7 @@ config DISCONTIGMEM
 
 config HAVE_ARCH_BOOTMEM_NODE
 	bool
-	depends on NUMA
+	depends on NUMA || MEMHOTPLUG
 	default y
 
 config HIGHPTE
diff -dpur linux-2.6.5/arch/i386/mm/discontig.c linux-2.6.5-mh/arch/i386/mm/discontig.c
--- linux-2.6.5/arch/i386/mm/discontig.c	Sun Apr  4 12:37:23 2004
+++ linux-2.6.5-mh/arch/i386/mm/discontig.c	Tue Apr  6 11:43:21 2004
@@ -64,6 +64,7 @@ unsigned long node_end_pfn[MAX_NUMNODES]
 extern unsigned long find_max_low_pfn(void);
 extern void find_max_pfn(void);
 extern void one_highpage_init(struct page *, int, int);
+static unsigned long calculate_blk_remap_pages(void);
 
 extern struct e820map e820;
 extern unsigned long init_pg_tables_end;
@@ -111,6 +112,51 @@ int __init get_memcfg_numa_flat(void)
 	return 1;
 }
 
+int __init get_memcfg_numa_blks(void)
+{
+	int i, pfn;
+
+	printk("NUMA - single node, flat memory mode, but broken in several blocks\n");
+
+	/* Run the memory configuration and find the top of memory. */
+	find_max_pfn();
+	if (max_pfn & (PTRS_PER_PTE - 1)) {
+		pfn = max_pfn & ~(PTRS_PER_PTE - 1);
+		printk("Rounding down maxpfn %ld -> %d\n", max_pfn, pfn);
+		max_pfn = pfn;
+	}
+	for(i = 0; i < MAX_NUMNODES; i++) {
+		pfn = PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) * i;
+		node_start_pfn[i]  = pfn;
+		printk("node %d start %d\n", i, pfn);
+		pfn += PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+		if (pfn < max_pfn)
+			node_end_pfn[i]	  = pfn;
+		else {
+			node_end_pfn[i]	  = max_pfn;
+			i++;
+			printk("total %d blocks, max %ld\n", i, max_pfn);
+			break;
+		}
+	}
+
+	printk("physnode_map");
+	/* Needed for pfn_to_nid */
+	for (pfn = node_start_pfn[0]; pfn <= max_pfn;
+	       pfn += PAGES_PER_ELEMENT)
+	{
+		physnode_map[pfn / PAGES_PER_ELEMENT] =
+		    pfn / PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20);
+		printk(" %d", physnode_map[pfn / PAGES_PER_ELEMENT]);
+	}
+	printk("\n");
+
+	node_set_online(0);
+	numnodes = i;
+
+	return 1;
+}
+
 /*
  * Find the highest page frame number we have available for the node
  */
@@ -132,11 +178,21 @@ static void __init find_max_pfn_node(int
  * Allocate memory for the pg_data_t via a crude pre-bootmem method
  * We ought to relocate these onto their own node later on during boot.
  */
-static void __init allocate_pgdat(int nid)
+static void allocate_pgdat(int nid)
 {
-	if (nid)
+	if (nid) {
+#ifndef CONFIG_MEMHOTPLUG
 		NODE_DATA(nid) = (pg_data_t *)node_remap_start_vaddr[nid];
-	else {
+#else
+		int remapsize;
+		unsigned long addr;
+
+		remapsize = calculate_blk_remap_pages();
+		addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+		    (nid - 1) * remapsize));
+		NODE_DATA(nid) = (void *)addr;
+#endif
+	} else {
 		NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
 		min_low_pfn += PFN_UP(sizeof(pg_data_t));
 		memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
@@ -185,6 +241,7 @@ static void __init register_bootmem_low_
 
 void __init remap_numa_kva(void)
 {
+#ifndef CONFIG_MEMHOTPLUG
 	void *vaddr;
 	unsigned long pfn;
 	int node;
@@ -197,6 +254,7 @@ void __init remap_numa_kva(void)
 				PAGE_KERNEL_LARGE);
 		}
 	}
+#endif
 }
 
 static unsigned long calculate_numa_remap_pages(void)
@@ -227,6 +285,21 @@ static unsigned long calculate_numa_rema
 	return reserve_pages;
 }
 
+static unsigned long calculate_blk_remap_pages(void)
+{
+	unsigned long size;
+
+	/* calculate the size of the mem_map needed in bytes */
+	size = (PFN_DOWN(CONFIG_MEMHOTPLUG_BLKSIZE << 20) + 1)
+		* sizeof(struct page) + sizeof(pg_data_t);
+	/* convert size to large (pmd size) pages, rounding up */
+	size = (size + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
+	/* now the roundup is correct, convert to PAGE_SIZE pages */
+	size = size * PTRS_PER_PTE;
+
+	return size;
+}
+
 unsigned long __init setup_memory(void)
 {
 	int nid;
@@ -234,7 +307,7 @@ unsigned long __init setup_memory(void)
 	unsigned long reserve_pages;
 
 	get_memcfg_numa();
-	reserve_pages = calculate_numa_remap_pages();
+	reserve_pages = calculate_blk_remap_pages() * (numnodes - 1);
 
 	/* partially used pages are not usable - thus round upwards */
 	system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
@@ -256,14 +329,18 @@ unsigned long __init setup_memory(void)
 
 	printk("Low memory ends at vaddr %08lx\n",
 			(ulong) pfn_to_kaddr(max_low_pfn));
+#ifdef CONFIG_MEMHOTPLUG
+	for (nid = 1; nid < numnodes; nid++)
+		NODE_DATA(nid) = NULL;
+	for (nid = 0; nid < 1; nid++) {
+#else
 	for (nid = 0; nid < numnodes; nid++) {
+#endif
 		node_remap_start_vaddr[nid] = pfn_to_kaddr(
-			highstart_pfn - node_remap_offset[nid]);
+			max_low_pfn + calculate_blk_remap_pages() * nid);
 		allocate_pgdat(nid);
-		printk ("node %d will remap to vaddr %08lx - %08lx\n", nid,
-			(ulong) node_remap_start_vaddr[nid],
-			(ulong) pfn_to_kaddr(highstart_pfn
-			    - node_remap_offset[nid] + node_remap_size[nid]));
+		printk ("node %d will remap to vaddr %08lx - \n", nid,
+			(ulong) node_remap_start_vaddr[nid]);
 	}
 	printk("High memory starts at vaddr %08lx\n",
 			(ulong) pfn_to_kaddr(highstart_pfn));
@@ -275,9 +352,12 @@ unsigned long __init setup_memory(void)
 	/*
 	 * Initialize the boot-time allocator (with low memory only):
 	 */
-	bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
+	bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0,
+	    (system_max_low_pfn > node_end_pfn[0]) ?
+	    node_end_pfn[0] : system_max_low_pfn);
 
-	register_bootmem_low_pages(system_max_low_pfn);
+	register_bootmem_low_pages((system_max_low_pfn > node_end_pfn[0]) ?
+	    node_end_pfn[0] : system_max_low_pfn);
 
 	/*
 	 * Reserve the bootmem bitmap itself as well. We do this in two
@@ -342,14 +422,24 @@ void __init zone_sizes_init(void)
 	 * Clobber node 0's links and NULL out pgdat_list before starting.
 	 */
 	pgdat_list = NULL;
-	for (nid = numnodes - 1; nid >= 0; nid--) {       
+#ifndef CONFIG_MEMHOTPLUG
+	for (nid = numnodes - 1; nid >= 0; nid--) {
+#else
+	for (nid = 0; nid >= 0; nid--) {
+#endif
 		if (nid)
 			memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+		if (nid == 0)
+			NODE_DATA(nid)->enabled = 1;
 		NODE_DATA(nid)->pgdat_next = pgdat_list;
 		pgdat_list = NODE_DATA(nid);
 	}
 
+#ifdef CONFIG_MEMHOTPLUG
+	for (nid = 0; nid < 1; nid++) {
+#else
 	for (nid = 0; nid < numnodes; nid++) {
+#endif
 		unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
 		unsigned long *zholes_size;
 		unsigned int max_dma;
@@ -368,14 +458,17 @@ void __init zone_sizes_init(void)
 		} else {
 			if (low < max_dma)
 				zones_size[ZONE_DMA] = low;
-			else {
+			else if (low <= high) {
 				BUG_ON(max_dma > low);
-				BUG_ON(low > high);
 				zones_size[ZONE_DMA] = max_dma;
 				zones_size[ZONE_NORMAL] = low - max_dma;
 #ifdef CONFIG_HIGHMEM
 				zones_size[ZONE_HIGHMEM] = high - low;
 #endif
+			} else {
+				BUG_ON(max_dma > low);
+				zones_size[ZONE_DMA] = max_dma;
+				zones_size[ZONE_NORMAL] = high - max_dma;
 			}
 		}
 		zholes_size = get_zholes_size(nid);
@@ -423,12 +516,198 @@ void __init set_highmem_pages_init(int b
 #endif
 }
 
-void __init set_max_mapnr_init(void)
+void set_max_mapnr_init(void)
 {
 #ifdef CONFIG_HIGHMEM
+#ifndef CONFIG_MEMHOTPLUG
 	highmem_start_page = NODE_DATA(0)->node_zones[ZONE_HIGHMEM].zone_mem_map;
+#else
+	struct pglist_data *z = NULL;
+	int i;
+
+	for (i = 0; i < numnodes; i++) {
+		if (NODE_DATA(i) == NULL)
+			continue;
+		z = NODE_DATA(i);
+		highmem_start_page = z->node_zones[ZONE_HIGHMEM].zone_mem_map;
+		if (highmem_start_page != NULL)
+			break;
+	}
+	if (highmem_start_page == NULL)
+		highmem_start_page =
+		    z->node_zones[ZONE_NORMAL].zone_mem_map +
+		    z->node_zones[ZONE_NORMAL].spanned_pages;
+#endif
 	num_physpages = highend_pfn;
 #else
 	num_physpages = max_low_pfn;
 #endif
+}
+
+void
+plug_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
+	unsigned long *zholes_size, addr, pfn;
+	unsigned long remapsize;
+	int i, j;
+	struct page *node_mem_map;
+	pg_data_t **pgdat;
+	struct mm_struct *mm;
+	struct task_struct *tsk;
+
+	unsigned long start = node_start_pfn[nid];
+	unsigned long high = node_end_pfn[nid];
+
+	BUG_ON(nid == 0);
+
+	allocate_pgdat(nid);
+
+	remapsize = calculate_blk_remap_pages();
+	addr = (unsigned long)(pfn_to_kaddr(max_low_pfn +
+	    (nid - 1) * remapsize));
+	
+	/* shrink size,
+	   which is done in calculate_numa_remap_pages() if normal NUMA */
+	high -= remapsize;
+	BUG_ON(start > high);
+
+	read_lock(&tasklist_lock);
+	for_each_process(tsk) {
+		mm = tsk->mm;
+		if (mm == NULL)
+			continue;
+		spin_lock(&mm->page_table_lock);
+		for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE) {
+			pgd_t *pgd;
+			pmd_t *pmd;
+
+			pgd = pgd_offset(mm, addr + (pfn << PAGE_SHIFT));
+			pmd = pmd_offset(pgd, addr + (pfn << PAGE_SHIFT));
+			set_pmd(pmd, pfn_pmd(high + pfn, PAGE_KERNEL_LARGE));
+		}
+		spin_unlock(&mm->page_table_lock);
+	}
+	read_unlock(&tasklist_lock);
+	for(pfn = 0; pfn < remapsize; pfn += PTRS_PER_PTE)
+                set_pmd_pfn(addr + (pfn << PAGE_SHIFT), high + pfn,
+                    PAGE_KERNEL_LARGE);
+	flush_tlb();
+
+	node_mem_map = (struct page *)((addr + sizeof(pg_data_t) +
+	    PAGE_SIZE - 1) & PAGE_MASK);
+	memset(node_mem_map, 0, (remapsize << PAGE_SHIFT) -
+	    ((char *)node_mem_map - (char *)addr));
+
+	printk("plug_node: %p %p\n", NODE_DATA(nid), node_mem_map);
+	memset(NODE_DATA(nid), 0, sizeof(*NODE_DATA(nid)));
+	printk("zeroed nodedata\n");
+
+	/* XXX defaults to hotremovable */ 
+	NODE_DATA(nid)->removable = 1;
+
+	BUG_ON(virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT > start);
+	if (start <= max_low_pfn)
+		zones_size[ZONE_NORMAL] =
+		    (max_low_pfn > high ? high : max_low_pfn) - start;
+#ifdef CONFIG_HIGHMEM
+	if (high > max_low_pfn)
+		zones_size[ZONE_HIGHMEM] = high -
+		    ((start > max_low_pfn) ? start : max_low_pfn);
+#endif
+
+	zholes_size = get_zholes_size(nid);
+	free_area_init_node(nid, NODE_DATA(nid), node_mem_map, zones_size,
+	    start, zholes_size);
+
+	/* lock? */
+	for(pgdat = &pgdat_list; *pgdat; pgdat = &(*pgdat)->pgdat_next)
+		if ((*pgdat)->node_id > nid) {
+			NODE_DATA(nid)->pgdat_next = *pgdat;
+			*pgdat = NODE_DATA(nid);
+			break;
+		}
+	if (*pgdat == NULL)
+		*pgdat = NODE_DATA(nid);
+	{
+		struct zone *z;
+		for_each_zone (z)
+			printk("%p ", z);
+		printk("\n");
+	}
+	set_max_mapnr_init();
+
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z;
+		struct page *p;
+
+		z = &NODE_DATA(nid)->node_zones[i];
+
+		for(j = 0; j < z->spanned_pages; j++) {
+			p = &z->zone_mem_map[j];
+			ClearPageReserved(p);
+			if (i == ZONE_HIGHMEM) {
+				set_bit(PG_highmem, &p->flags);
+				totalhigh_pages++;
+			}
+			set_page_count(p, 1);
+			__free_page(p);
+		}
+	}
+	kswapd_start_one(NODE_DATA(nid));
+	setup_per_zone_pages_min();
+}
+
+void
+enable_node(int node)
+{
+
+	NODE_DATA(node)->enabled = 1;
+	build_all_zonelists();
+}
+
+void
+makepermanent_node(int node)
+{
+
+	NODE_DATA(node)->removable = 0;
+	build_all_zonelists();
+}
+	
+void
+disable_node(int node)
+{
+
+	NODE_DATA(node)->enabled = 0;
+	build_all_zonelists();
+}
+
+int
+unplug_node(int node)
+{
+	int i;
+	struct zone *z;
+	pg_data_t *pgdat;
+
+	if (NODE_DATA(node)->enabled)
+		return -1;
+	for(i = 0; i < MAX_NR_ZONES; i++) {
+		z = zone_table[NODEZONE(node, i)];
+		if (z->present_pages != z->free_pages)
+			return -1;
+	}
+
+	/* lock? */
+	for(pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
+		if (pgdat->pgdat_next == NODE_DATA(node)) {
+			pgdat->pgdat_next = pgdat->pgdat_next->pgdat_next;
+			break;
+		}
+	BUG_ON(pgdat == NULL);
+
+	for(i = 0; i < MAX_NR_ZONES; i++)
+		zone_table[NODEZONE(node, i)] = NULL;
+	NODE_DATA(node) = NULL;
+
+	return 0;
 }
diff -dpur linux-2.6.4/arch/i386/mm/fault.c linux-2.6.4-mh/arch/i386/mm/fault.c
--- linux-2.6.4/arch/i386/mm/fault.c	Thu Mar 11 11:55:20 2004
+++ linux-2.6.4-mh/arch/i386/mm/fault.c	Wed Mar 31 19:38:26 2004
@@ -502,6 +502,8 @@ vmalloc_fault:
 		if (!pmd_present(*pmd_k))
 			goto no_context;
 		set_pmd(pmd, *pmd_k);
+		if (pmd_large(*pmd_k))
+			return;
 
 		pte_k = pte_offset_kernel(pmd_k, address);
 		if (!pte_present(*pte_k))
diff -dpur linux-2.6.4/arch/i386/mm/init.c linux-2.6.4-mh/arch/i386/mm/init.c
--- linux-2.6.4/arch/i386/mm/init.c	Thu Mar 11 11:55:37 2004
+++ linux-2.6.4-mh/arch/i386/mm/init.c	Wed Mar 31 19:38:26 2004
@@ -43,6 +43,7 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 unsigned long highstart_pfn, highend_pfn;
+extern unsigned long node_end_pfn[MAX_NUMNODES];
 
 static int do_test_wp_bit(void);
 
@@ -481,7 +482,11 @@ void __init mem_init(void)
 	totalram_pages += __free_all_bootmem();
 
 	reservedpages = 0;
+#ifdef CONFIG_MEMHOTPLUG
+	for (tmp = 0; tmp < node_end_pfn[0]; tmp++)
+#else
 	for (tmp = 0; tmp < max_low_pfn; tmp++)
+#endif
 		/*
 		 * Only count reserved RAM pages
 		 */
diff -dpur linux-2.6.4/include/asm-i386/mmzone.h linux-2.6.4-mh/include/asm-i386/mmzone.h
--- linux-2.6.4/include/asm-i386/mmzone.h	Thu Mar 11 11:55:27 2004
+++ linux-2.6.4-mh/include/asm-i386/mmzone.h	Wed Mar 31 19:38:26 2004
@@ -17,7 +17,9 @@
 		#include <asm/srat.h>
 	#endif
 #else /* !CONFIG_NUMA */
+#ifndef CONFIG_MEMHOTPLUG
 	#define get_memcfg_numa get_memcfg_numa_flat
+#endif
 	#define get_zholes_size(n) (0)
 #endif /* CONFIG_NUMA */
 
@@ -41,7 +43,7 @@ extern u8 physnode_map[];
 
 static inline int pfn_to_nid(unsigned long pfn)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined (CONFIG_MEMHOTPLUG)
 	return(physnode_map[(pfn) / PAGES_PER_ELEMENT]);
 #else
 	return 0;
@@ -132,6 +134,10 @@ static inline int pfn_valid(int pfn)
 #endif
 
 extern int get_memcfg_numa_flat(void );
+#ifdef CONFIG_MEMHOTPLUG
+extern int get_memcfg_numa_blks(void);
+#endif
+
 /*
  * This allows any one NUMA architecture to be compiled
  * for, and still fall back to the flat function if it
@@ -144,6 +150,9 @@ static inline void get_memcfg_numa(void)
 		return;
 #elif CONFIG_ACPI_SRAT
 	if (get_memcfg_from_srat())
+		return;
+#elif CONFIG_MEMHOTPLUG
+	if (get_memcfg_numa_blks())
 		return;
 #endif
 
diff -dpur linux-2.6.4/include/asm-i386/numnodes.h linux-2.6.4-mh/include/asm-i386/numnodes.h
--- linux-2.6.4/include/asm-i386/numnodes.h	Thu Mar 11 11:55:23 2004
+++ linux-2.6.4-mh/include/asm-i386/numnodes.h	Wed Mar 31 19:38:26 2004
@@ -13,6 +13,8 @@
 /* Max 8 Nodes */
 #define NODES_SHIFT	3
 
-#endif /* CONFIG_X86_NUMAQ */
+#elif defined(CONFIG_MEMHOTPLUG)
+#define NODES_SHIFT	3
+#endif
 
 #endif /* _ASM_MAX_NUMNODES_H */
diff -dpur linux-2.6.4/mm/page_alloc.c linux-2.6.4-mh/mm/page_alloc.c
--- linux-2.6.4/mm/page_alloc.c	Thu Mar 11 11:55:22 2004
+++ linux-2.6.4-mh/mm/page_alloc.c	Thu Apr  1 16:54:26 2004
@@ -1177,7 +1177,12 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+calculate_zone_totalpages(struct pglist_data *pgdat,
 		unsigned long *zones_size, unsigned long *zholes_size)
 {
 	unsigned long realtotalpages, totalpages = 0;
@@ -1231,8 +1236,13 @@ void __init memmap_init_zone(struct page
  *   - mark all memory queues empty
  *   - clear the memory bitmaps
  */
-static void __init free_area_init_core(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
+#ifdef CONFIG_MEMHOTPLUG
+static void
+#else
+static void __init
+#endif
+free_area_init_core(struct pglist_data *pgdat,
+	unsigned long *zones_size, unsigned long *zholes_size)
 {
 	unsigned long i, j;
 	const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
@@ -1371,7 +1381,12 @@ static void __init free_area_init_core(s
 	}
 }
 
-void __init free_area_init_node(int nid, struct pglist_data *pgdat,
+#ifdef CONFIG_MEMHOTPLUG
+void
+#else
+void __init
+#endif
+free_area_init_node(int nid, struct pglist_data *pgdat,
 		struct page *node_mem_map, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 3/3] memory hotplug prototype
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
  2004-04-06 10:56 ` [patch 1/3] " IWAMOTO Toshihiro
  2004-04-06 10:58 ` [patch 2/3] " IWAMOTO Toshihiro
@ 2004-04-06 10:59 ` IWAMOTO Toshihiro
  2004-04-06 11:47 ` [patch 0/3] " IWAMOTO Toshihiro
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-06 10:59 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

va-proc_memhotplug.patch:
	/proc/memhotplug interface for testing/debugging

$Id: va-proc_memhotplug.patch,v 1.8 2004/04/05 10:46:32 iwamoto Exp $

diff -dpur linux-2.6.5/mm/page_alloc.c linux-2.6.5-mh/mm/page_alloc.c
--- linux-2.6.5/mm/page_alloc.c	Sun Apr  4 12:36:17 2004
+++ linux-2.6.5-mh/mm/page_alloc.c	Mon Apr  5 15:22:09 2004
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/proc_fs.h>

 #include <asm/tlbflush.h>

@@ -1843,3 +1950,243 @@ int min_free_kbytes_sysctl_handler(ctl_t
 	setup_per_zone_pages_min();
 	return 0;
 }
+
+#ifdef CONFIG_MEMHOTPLUG
+static int mhtest_read(char *page, char **start, off_t off, int count,
+    int *eof, void *data)
+{
+	char *p;
+	int i, j, len;
+	const struct pglist_data *pgdat;
+	const struct zone *z;
+
+	p = page;
+	for(i = 0; i < numnodes; i++) {
+		pgdat = NODE_DATA(i);
+		if (pgdat == NULL)
+			continue;
+		len = sprintf(p, "Node %d %sabled %shotremovable\n", i,
+		    pgdat->enabled ? "en" : "dis",
+		    pgdat->removable ? "" : "non");
+		p += len;
+		for (j = 0; j < MAX_NR_ZONES; j++) {
+			z = &pgdat->node_zones[j];
+			if (! z->present_pages)
+				/* skip empty zone */
+				continue;
+			len = sprintf(p,
+			    "\t%s[%d]: free %ld, active %ld, present %ld\n",
+			    z->name, NODEZONE(i, j),
+			    z->free_pages, z->nr_active, z->present_pages);
+			p += len;
+		}
+	}
+	len = p - page;
+
+	if (len <= off + count)
+		*eof = 1;
+	*start = page + off;
+	len -= off;
+	if (len < 0)
+		len = 0;
+	if (len > count)
+		len = count;
+
+	return len;
+}
+
+static void mhtest_enable(int);
+static void mhtest_disable(int);
+static void mhtest_plug(int);
+static void mhtest_unplug(int);
+static void mhtest_purge(int);
+static void mhtest_remap(int);
+static void mhtest_active(int);
+static void mhtest_inuse(int);
+
+const static struct {
+	char *cmd;
+	void (*func)(int);
+	char zone_check;
+} mhtest_cmds[] = {
+	{ "disable", mhtest_disable, 1 },
+	{ "enable", mhtest_enable, 1 },
+	{ "plug", mhtest_plug, 0 },
+	{ "unplug", mhtest_unplug, 0 },
+	{ "purge", mhtest_purge, 1 },
+	{ "remap", mhtest_remap, 1 },
+	{ "active", mhtest_active, 1 },
+	{ "inuse", mhtest_inuse, 1 },
+	{ NULL, NULL }};
+
+static void
+mhtest_disable(int idx) {
+	int i, z;
+
+	printk("disable %d\n", idx);
+	/* XXX */
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0];	/* hot */
+			pcp->low = pcp->high = 0;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1];	/* cold */
+			pcp->low = pcp->high = 0;
+		}
+		zone_table[NODEZONE(idx, z)]->pages_high =
+		    zone_table[NODEZONE(idx, z)]->present_pages;
+	}
+	disable_node(idx);
+}
+static void
+mhtest_enable(int idx) {
+	int i, z;
+
+	printk("enable %d\n", idx);
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		zone_table[NODEZONE(idx, z)]->pages_high = 
+		    zone_table[NODEZONE(idx, z)]->pages_min * 3;
+		/* XXX */
+		for (i = 0; i < NR_CPUS; i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[0];	/* hot */
+			pcp->low = 2 * pcp->batch;
+			pcp->high = 6 * pcp->batch;
+
+			pcp = &zone_table[NODEZONE(idx, z)]->pageset[i].pcp[1];	/* cold */
+			pcp->high = 2 * pcp->batch;
+		}
+	}
+	enable_node(idx);
+}
+
+static void
+mhtest_plug(int idx) {
+
+	if (NODE_DATA(idx) != NULL) {
+		printk("Already plugged\n");
+		return;
+	}
+	plug_node(idx);
+}
+
+static void
+mhtest_unplug(int idx) {
+
+	unplug_node(idx);
+}
+
+static void
+mhtest_purge(int idx)
+{
+	printk("purge %d\n", idx);
+	wake_up_interruptible(&zone_table[idx]->zone_pgdat->kswapd_wait);
+	/* XXX overkill, but who cares? */
+	on_each_cpu(drain_local_pages, NULL, 1, 1);
+}
+
+static void
+mhtest_remap(int idx) {
+
+	on_each_cpu(drain_local_pages, NULL, 1, 1);
+	kernel_thread(remapd, zone_table[idx], CLONE_KERNEL);
+}
+
+static void
+mhtest_active(int idx)
+{
+	struct list_head *l;
+	int i;
+
+	if (zone_table[idx] == NULL)
+		return;
+	spin_lock_irq(&zone_table[idx]->lru_lock);
+	i = 0;
+	list_for_each(l, &zone_table[idx]->active_list) {
+		printk(" %lx", (unsigned long)list_entry(l, struct page, lru));
+		i++;
+		if (i == 10)
+			break;
+	}
+	spin_unlock_irq(&zone_table[idx]->lru_lock);
+	printk("\n");
+}
+
+static void
+mhtest_inuse(int idx)
+{
+	int i;
+
+	if (zone_table[idx] == NULL)
+		return;
+	for(i = 0; i < zone_table[idx]->spanned_pages; i++)
+		if (page_count(&zone_table[idx]->zone_mem_map[i]))
+			printk(" %p", &zone_table[idx]->zone_mem_map[i]);
+	printk("\n");
+}
+
+static int mhtest_write(struct file *file, const char *buffer,
+    unsigned long count, void *data)
+{
+	int idx;
+	char buf[64], *p;
+	int i;
+
+	if (count > sizeof(buf) - 1)
+		count = sizeof(buf) - 1;
+	if (copy_from_user(buf, buffer, count))
+		return -EFAULT;
+
+	buf[count] = 0;
+
+	p = strchr(buf, ' ');
+	if (p == NULL)
+		goto out;
+
+	*p++ = '\0';
+	idx = (int)simple_strtoul(p, NULL, 0);
+
+	if (idx > MAX_NR_ZONES*MAX_NUMNODES) {
+		printk("Argument out of range\n");
+		goto out;
+	}
+
+	for(i = 0; ; i++) {
+		if (mhtest_cmds[i].cmd == NULL)
+			break;
+		if (strcmp(buf, mhtest_cmds[i].cmd) == 0) {
+			if (mhtest_cmds[i].zone_check) {
+				if (zone_table[idx] == NULL) {
+					printk("Zone %d not plugged\n", idx);
+					return count;
+				}
+			} else if (strcmp(buf, "plug") != 0 &&
+			    NODE_DATA(idx) == NULL) {
+				printk("Node %d not plugged\n", idx);
+				return count;
+			}
+			(mhtest_cmds[i].func)(idx);
+			break;
+		}
+	}
+out:
+	return count;
+}
+
+static int __init procmhtest_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("memhotplug", 0, NULL);
+	if (entry == NULL)
+		return -1;
+
+	entry->read_proc = &mhtest_read;
+	entry->write_proc = &mhtest_write;
+	return 0;
+}
+__initcall(procmhtest_init);
+#endif

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
                   ` (2 preceding siblings ...)
  2004-04-06 10:59 ` [patch 3/3] " IWAMOTO Toshihiro
@ 2004-04-06 11:47 ` IWAMOTO Toshihiro
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
  2004-04-07 18:12 ` [patch 0/3] memory hotplug prototype Martin J. Bligh
  5 siblings, 0 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-06 11:47 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

> This is an updated version of memory hotplug prototype patch, which I
> have posted here several times.

Though it is not obvious from the patch files, they are for linux-2.6.5.

--
IWAMOTO Toshihiro

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 0/6] memory hotplug for hugetlbpages
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
                   ` (3 preceding siblings ...)
  2004-04-06 11:47 ` [patch 0/3] " IWAMOTO Toshihiro
@ 2004-04-06 12:41 ` Hirokazu Takahashi
  2004-04-06 12:44   ` [patch 1/6] " Hirokazu Takahashi
                     ` (5 more replies)
  2004-04-07 18:12 ` [patch 0/3] memory hotplug prototype Martin J. Bligh
  5 siblings, 6 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:41 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

Hello,

I have also updated the memory hotplug patches for hugetlbpages.
The patches are against linux 2.6.5 and work with the memory
hotplug patches which Iwamoto just posted.

The main changes from the previous version are:

	* The remap code is now shared between hugetlbpage handling
	  and normal page handling.  A new remap_operations table is
	  introduced for this purpose, so that any kind of page whose
	  behaviour differs from a normal page can be handled.  This
	  makes the remap code simpler.  (A sketch of the table is
	  shown below, after this list.)

	* Made the pagefault routine for hugetlbpages robust.
	  Two or more threads may fault on the same hugetlbpage
	  at the same time.
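
To illustrate the first item, here is a rough sketch of the
remap_operations table.  This is only an illustration, not code from
the patches: the field names and callback types follow the
hugepage_remap_ops definition in patch 5/6, while the struct layout
shown here, the remap_via_ops() helper and its flow are simplified
assumptions (the real driver is remap_onepage(), and the
.remap_prepare and .remap_stick_page hooks are omitted because the
hugetlbpage case leaves them NULL).

	struct page;	/* handled only via pointers in this sketch */

	struct remap_operations {
		struct page *(*remap_alloc_page)(int nid);
		int (*remap_delete_page)(struct page *page);
		int (*remap_copy_page)(struct page *to, struct page *from);
		int (*remap_lru_add_page)(struct page *page);
		int (*remap_release_buffers)(struct page *page);
		/* .remap_prepare and .remap_stick_page not shown */
	};

	/*
	 * Simplified driver: allocate a replacement page of the same
	 * kind, copy the contents, make the new page visible, then
	 * free the old one.  Error handling and locking are trimmed.
	 */
	static int remap_via_ops(struct page *old, int nid,
				 const struct remap_operations *ops)
	{
		struct page *new = ops->remap_alloc_page(nid);

		if (new == NULL)
			return -1;	/* -ENOMEM in the real code */
		ops->remap_copy_page(new, old);
		ops->remap_lru_add_page(new);
		return ops->remap_delete_page(old);
	}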


The /proc/memhotplug interface also provides information about hugetlbpages.

  ex.)
   $ cat /proc/memhotplug 
   Node 0 enabled nonhotremovable
           DMA[0]: free 964, active 34, present 4096 / HugePage free 0, total 0

           Normal[1]: free 9641, active 36633, present 126976 / HugePage free 0, total 0

   Node 1 enabled hotremovable
           Normal[5]: free 208, active 86115, present 94208 / HugePage free 0, total 0

           HighMem[6]: free 0, active 17234, present 34816 / HugePage free 8, total 16

   Node 2 enabled hotremovable
           HighMem[10]: free 272, active 109643, present 128000 / HugePage free 0, total 0


How to apply:

 # cd linux-2.6.5

   First of all, apply Iwamoto's patches here.

 # patch -p1 < va00-hugepagealloc.patch
 # patch -p1 < va01-hugepagefault.patch
 # patch -p1 < va02-hugepagelist.patch
 # patch -p1 < va03-hugepagermap.patch
 # patch -p1 < va04-hugeremap.patch
 # patch -p1 < va05-hugepageproc.patch


Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 1/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
@ 2004-04-06 12:44   ` Hirokazu Takahashi
  2004-04-06 12:45   ` [patch 2/6] " Hirokazu Takahashi
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:44 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is part 1 of the memory hotplug patches for hugetlbpages.

$Id: va-hugepagealloc.patch,v 1.4 2004/04/01 14:10:46 taka Exp $

--- linux-2.6.4.ORG/include/linux/page-flags.h	Thu Apr  1 14:24:07 2032
+++ linux-2.6.4/include/linux/page-flags.h	Thu Apr  1 15:32:16 2032
@@ -77,6 +77,7 @@
 #define PG_compound		19	/* Part of a compound page */
 
 #define PG_again		20
+#define PG_booked		21
 
 
 /*
@@ -275,6 +276,10 @@ extern void get_full_page_state(struct p
 #define PageAgain(page)	test_bit(PG_again, &(page)->flags)
 #define SetPageAgain(page)	set_bit(PG_again, &(page)->flags)
 #define ClearPageAgain(page)	clear_bit(PG_again, &(page)->flags)
+
+#define PageBooked(page)	test_bit(PG_booked, &(page)->flags)
+#define SetPageBooked(page)	set_bit(PG_booked, &(page)->flags)
+#define ClearPageBooked(page)	clear_bit(PG_booked, &(page)->flags)
 
 /*
  * The PageSwapCache predicate doesn't use a PG_flag at this time,
--- linux-2.6.4.ORG/include/linux/mmzone.h	Thu Apr  1 14:24:07 2032
+++ linux-2.6.4/include/linux/mmzone.h	Thu Apr  1 15:32:16 2032
@@ -154,6 +154,9 @@ struct zone {
 	char			*name;
 	unsigned long		spanned_pages;	/* total size, including holes */
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
+	unsigned long		contig_pages_alloc_hint;
+	unsigned long		booked_pages;
+	long			scan_pages;
 } ____cacheline_maxaligned_in_smp;
 
 #define ZONE_DMA		0
--- linux-2.6.4.ORG/mm/page_alloc.c	Thu Apr  1 14:24:25 2032
+++ linux-2.6.4/mm/page_alloc.c	Thu Apr  1 15:32:16 2032
@@ -182,7 +182,11 @@ static inline void __free_pages_bulk (st
 		BUG();
 	index = page_idx >> (1 + order);
 
-	zone->free_pages -= mask;
+	if (!PageBooked(page))
+		zone->free_pages -= mask;
+	else {
+		zone->booked_pages -= mask;
+	}
 	while (mask + (1 << (MAX_ORDER-1))) {
 		struct page *buddy1, *buddy2;
 
@@ -201,6 +205,9 @@ static inline void __free_pages_bulk (st
 		buddy2 = base + page_idx;
 		BUG_ON(bad_range(zone, buddy1));
 		BUG_ON(bad_range(zone, buddy2));
+		if (PageBooked(buddy1) != PageBooked(buddy2)) {
+			break;
+		}
 		list_del(&buddy1->list);
 		mask <<= 1;
 		area++;
@@ -356,8 +363,13 @@ static struct page *__rmqueue(struct zon
 		area = zone->free_area + current_order;
 		if (list_empty(&area->free_list))
 			continue;
+		list_for_each_entry(page, &area->free_list, list) {
+			if (!PageBooked(page))
+				goto gotit;
+		}
+		continue;
 
-		page = list_entry(area->free_list.next, struct page, list);
+gotit:
 		list_del(&page->list);
 		index = page - zone->zone_mem_map;
 		if (current_order != MAX_ORDER-1)
@@ -463,6 +475,11 @@ static void fastcall free_hot_cold_page(
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	if (PageBooked(page)) {
+		__free_pages_ok(page, 0);
+		return;
+	}
+
 	kernel_map_pages(page, 1, 0);
 	inc_page_state(pgfree);
 	free_pages_check(__FUNCTION__, page);
@@ -530,6 +547,241 @@ static struct page *buffered_rmqueue(str
 	return page;
 }
 
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+/* 
+ * Check wheter the page is freeable or not.
+ * It might not be free even if this function says OK,
+ * when it is just being allocated.
+ * This check is almost sufficient but not perfect.
+ */
+static inline int is_page_freeable(struct page *page)
+{
+	return (page->mapping || page_mapped(page) || !page_count(page)) &&
+	    !(page->flags & (1<<PG_reserved|1<<PG_compound|1<<PG_booked|1<<PG_slab));
+}
+
+static inline int is_free_page(struct page *page)
+{
+	return !(page_mapped(page) ||
+		page->mapping != NULL ||
+		page_count(page) != 0 ||
+		(page->flags & (
+			1 << PG_reserved|
+			1 << PG_compound|
+			1 << PG_booked	|
+			1 << PG_lru	|
+			1 << PG_private |
+			1 << PG_locked	|
+			1 << PG_active	|
+			1 << PG_reclaim	|
+			1 << PG_dirty	|
+			1 << PG_slab	|
+			1 << PG_writeback )));
+}
+
+static int
+try_to_book_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+	struct page	*p;
+	int booked_count = 0;
+	unsigned long	flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for (p = page; p < &page[1<<order]; p++) {
+		if (!is_page_freeable(p))
+			goto out;
+		if (is_free_page(p))
+			booked_count++;
+		SetPageBooked(p);
+	}
+
+	zone->booked_pages = booked_count;
+	zone->free_pages -= booked_count;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return 1;
+out:
+	for (p--; p >= page; p--) {
+		ClearPageBooked(p);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return 0;
+}
+
+static struct page *
+book_pages(struct zone *zone, unsigned int gfp_mask, unsigned int order)
+{
+	unsigned long	num = 1<<order;
+	unsigned long	slot = zone->contig_pages_alloc_hint;
+	struct page	*page;
+	
+	slot = (slot + num - 1) & ~(num - 1);	/* align */
+
+	for ( ; zone->scan_pages > 0; slot += num) {
+		zone->scan_pages -= num;
+		if (slot + num > zone->present_pages)
+			slot = 0;
+		page = &zone->zone_mem_map[slot];
+		if (try_to_book_pages(zone, page, order)) {
+			zone->contig_pages_alloc_hint = slot + num;
+			return page;
+		}
+	}
+	return NULL;
+}
+
+static void
+unbook_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+	struct page	*p;
+	for (p = page; p < &page[1<<order]; p++) {
+		ClearPageBooked(p);
+	}
+}
+
+/*
+ * sweepout_pages() might not work well as the booked pages 
+ * may include some unfreeable pages.
+ */
+static int
+sweepout_pages(struct zone *zone, struct page *page, int num)
+{
+	struct page *p;
+	int failed = 0;
+	int retry = 0;
+	int retry_save = 0;
+	int retry_count = 20;
+
+again:
+	on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+	for (p = page; p <= &page[num - 1]; p++) {
+		if (!page_count(p))
+			continue;
+		if (!PageBooked(p)) {
+			printk(KERN_ERR "ERROR sweepout_pages: page:%p isn't booked. page(%p) num(%d)\n", p, page, num);
+		}
+
+		spin_lock_irq(&zone->lru_lock);
+		if (!PageLRU(p)) {
+			spin_unlock_irq(&zone->lru_lock);
+			retry++;
+			continue;
+		}
+		list_del(&p->lru);
+		if (!TestClearPageLRU(p))
+			BUG();
+		if (PageActive(p)) {
+			zone->nr_active--;
+			if (page_count(p) == 0) {
+				/* the page is in pagevec_release();
+				   shrink_cache says so. */
+				SetPageLRU(p);
+				list_add(&p->lru, &zone->active_list);
+				spin_unlock_irq(&zone->lru_lock);
+				continue;
+			}
+		} else {
+			zone->nr_inactive--;
+			if (page_count(p) == 0) {
+				/* the page is in pagevec_release();
+				   shrink_cache says so. */
+				SetPageLRU(p);
+				list_add(&p->lru, &zone->inactive_list);
+				spin_unlock_irq(&zone->lru_lock);
+				continue;
+			}
+		}
+		page_cache_get(p);
+		spin_unlock_irq(&zone->lru_lock);
+		if (remap_onepage_normal(p, REMAP_ANYNODE, 0)) {
+			failed++;
+			spin_lock_irq(&zone->lru_lock);
+			if (PageActive(p)) {
+				list_add(&p->lru, &zone->active_list);
+				zone->nr_active++;
+			} else {
+				list_add(&p->lru, &zone->inactive_list);
+				zone->nr_inactive++;
+			}
+			SetPageLRU(p);
+			spin_unlock_irq(&zone->lru_lock);
+			page_cache_release(p);
+		}
+	}
+	if (retry && (retry_count--)) {
+		retry_save = retry;
+		retry = 0;
+		schedule_timeout(HZ/4);
+		/* Actually we should wait on the pages */
+		goto again;
+	}
+	on_each_cpu((void (*)(void*))drain_local_pages, NULL, 1, 1);
+	return failed;
+}
+
+/*
+ * Allocate contiguous pages even if pages are fragmented in zones.
+ * Migrating pages helps to make enough space in them.
+ */
+static struct page *
+force_alloc_pages(unsigned int gfp_mask, unsigned int order,
+			struct zonelist *zonelist)
+{
+	struct zone **zones = zonelist->zones;
+	struct zone *zone;
+	struct page *page = NULL;
+	unsigned long flags;
+	int i;
+	int ret;
+
+	static DECLARE_MUTEX(bookedpage_sem);
+
+	if (down_trylock(&bookedpage_sem)) {
+		down(&bookedpage_sem);
+	}
+
+	for (i = 0; zones[i] != NULL; i++) {
+		zone = zones[i];
+		zone->scan_pages = zone->present_pages;
+		while (zone->scan_pages > 0) {
+			page = book_pages(zone, gfp_mask, order);
+			if (!page)
+				break;
+			ret = sweepout_pages(zone, page, 1<<order);
+			if (ret) {
+				spin_lock_irqsave(&zone->lock, flags);
+				unbook_pages(zone, page, order);
+				page = NULL;
+
+				zone->free_pages += zone->booked_pages;
+				spin_unlock_irqrestore(&zone->lock, flags);
+				continue;
+			}
+			spin_lock_irqsave(&zone->lock, flags);
+			unbook_pages(zone, page, order);
+			zone->free_pages += zone->booked_pages;
+			page = __rmqueue(zone, order);
+			spin_unlock_irqrestore(&zone->lock, flags);
+			if (page) {
+				prep_compound_page(page, order);
+				up(&bookedpage_sem);
+				return page;
+			}
+		}
+	}
+	up(&bookedpage_sem);
+	return NULL;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
+static inline int
+enough_pages(struct zone *zone, unsigned long min, const int wait)
+{
+	return (long)zone->free_pages - (long)min >= 0 ||
+		(!wait && (long)zone->free_pages - (long)zone->pages_high >= 0);
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  *
@@ -585,8 +837,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 			local_low >>= 1;
 		min += local_low;
 
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, cold);
 			if (page)
 		       		goto got_pg;
@@ -610,8 +861,7 @@ __alloc_pages(unsigned int gfp_mask, uns
 		if (rt_task(p))
 			local_min >>= 1;
 		min += local_min;
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, cold);
 			if (page)
 				goto got_pg;
@@ -653,14 +903,27 @@ rebalance:
 		struct zone *z = zones[i];
 
 		min += z->pages_min;
-		if (z->free_pages >= min ||
-				(!wait && z->free_pages >= z->pages_high)) {
+		if (enough_pages(z, min, wait)) {
 			page = buffered_rmqueue(z, order, cold);
 			if (page)
 				goto got_pg;
 		}
 		min += z->pages_low * sysctl_lower_zone_protection;
 	}
+
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+	/*
+	 * Defrag pages to allocate large contiguous pages
+	 *
+	 * FIXME: The following code will work only if CONFIG_HUGETLB_PAGE
+	 *        flag is on.
+	 */
+	if (order) {
+		page = force_alloc_pages(gfp_mask, order, zonelist);
+		if (page)
+			goto got_pg;
+	}
+#endif /* CONFIG_HUGETLB_PAGE */
 
 	/*
 	 * Don't let big-order allocations loop unless the caller explicitly
--- linux-2.6.4.ORG/mm/memhotplug.c	Thu Apr  1 14:24:07 2032
+++ linux-2.6.4/mm/memhotplug.c	Thu Apr  1 15:32:16 2032
@@ -180,7 +180,7 @@ radix_tree_replace_pages(struct page *pa
 	}
 	/* don't __put_page(page) here. truncate may be in progress */
 	newpage->flags |= page->flags & ~(1 << PG_uptodate) &
-	    ~(1 << PG_highmem) & ~(1 << PG_chainlock) &
+	    ~(1 << PG_highmem) & ~(1 << PG_chainlock) & ~(1 << PG_booked) &
 	    ~(1 << PG_direct) & ~(~0UL << NODEZONE_SHIFT);
 
 	/* list_del(&page->list); XXX */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 2/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
  2004-04-06 12:44   ` [patch 1/6] " Hirokazu Takahashi
@ 2004-04-06 12:45   ` Hirokazu Takahashi
  2004-04-06 12:45   ` [patch 3/6] " Hirokazu Takahashi
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:45 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is part 2 of the memory hotplug patches for hugetlbpages.

$Id: va-hugepagefault.patch,v 1.8 2004/04/05 06:13:36 taka Exp $

--- linux-2.6.5.ORG/include/linux/hugetlb.h	Mon Apr  5 16:13:27 2032
+++ linux-2.6.5/include/linux/hugetlb.h	Mon Apr  5 16:15:15 2032
@@ -24,10 +24,12 @@ struct page *follow_huge_addr(struct mm_
 			unsigned long address, int write);
 struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
 					unsigned long address);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-				pmd_t *pmd, int write);
+struct page *follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *,
+				unsigned long address, pmd_t *pmd, int write);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+				int, unsigned long);
 
 extern int htlbpage_max;
 
@@ -72,12 +74,13 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_report_meminfo(buf)		0
 #define hugepage_vma(mm, addr)			0
 #define mark_mm_hugetlb(mm, vma)		do { } while (0)
-#define follow_huge_pmd(mm, addr, pmd, write)	0
+#define follow_huge_pmd(mm, vma, addr, pmd, write)	0
 #define is_aligned_hugepage_range(addr, len)	0
 #define prepare_hugepage_range(addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(addr, len)	0
 #define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0)
+#define hugetlb_fault(mm, vma, write, addr)	0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.5.ORG/mm/memory.c	Mon Apr  5 16:13:38 2032
+++ linux-2.6.5/mm/memory.c	Mon Apr  5 16:14:02 2032
@@ -643,7 +643,7 @@ follow_page(struct mm_struct *mm, unsign
 	if (pmd_none(*pmd))
 		goto out;
 	if (pmd_huge(*pmd))
-		return follow_huge_pmd(mm, address, pmd, write);
+		return follow_huge_pmd(mm, vma, address, pmd, write);
 	if (pmd_bad(*pmd))
 		goto out;
 
@@ -1628,7 +1628,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return hugetlb_fault(mm, vma, write_access, address);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd
--- linux-2.6.5.ORG/arch/i386/mm/hugetlbpage.c	Mon Apr  5 16:13:30 2032
+++ linux-2.6.5/arch/i386/mm/hugetlbpage.c	Mon Apr  5 16:14:02 2032
@@ -142,8 +142,10 @@ int copy_hugetlb_page_range(struct mm_st
 			goto nomem;
 		src_pte = huge_pte_offset(src, addr);
 		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
+		if (!pte_none(entry)) {
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+		}
 		set_pte(dst_pte, entry);
 		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 		addr += HPAGE_SIZE;
@@ -173,6 +175,11 @@ follow_hugetlb_page(struct mm_struct *mm
 
 			pte = huge_pte_offset(mm, vaddr);
 
+			if (!pte || pte_none(*pte)) {
+				hugetlb_fault(mm, vma, 0, vaddr);
+				pte = huge_pte_offset(mm, vaddr);
+			}
+
 			/* hugetlb should be locked, and hence, prefaulted */
 			WARN_ON(!pte || pte_none(*pte));
 
@@ -261,12 +268,17 @@ int pmd_huge(pmd_t pmd)
 }
 
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	struct page *page;
 
 	page = pte_page(*(pte_t *)pmd);
+
+	if (!page) {
+		hugetlb_fault(mm, vma, write, address);
+		page = pte_page(*(pte_t *)pmd);
+	}
 	if (page) {
 		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
 		get_page(page);
@@ -329,54 +341,94 @@ zap_hugepage_range(struct vm_area_struct
 	spin_unlock(&mm->page_table_lock);
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte = huge_pte_alloc(mm, address);
+	int ret;
 
 	BUG_ON(vma->vm_start & ~HPAGE_MASK);
 	BUG_ON(vma->vm_end & ~HPAGE_MASK);
 
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	if (!pte) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
 
-		if (!pte) {
-			ret = -ENOMEM;
+	if (!pte_none(*pte)) {
+		ret = VM_FAULT_MINOR;
+		goto out;
+	}
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+again:
+	page = find_lock_page(mapping, idx);
+
+	if (!page) {
+		if (hugetlb_get_quota(mapping)) {
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
+		page = alloc_hugetlb_page();
 		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_hugetlb_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+			hugetlb_put_quota(mapping);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		if (ret) {
+			hugetlb_put_quota(mapping);
+			free_huge_page(page);
 			unlock_page(page);
-			if (ret) {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
+			goto again;
 		}
+	}
+	spin_lock(&mm->page_table_lock);
+	if (pte_none(*pte)) {
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+		flush_tlb_page(vma, address);
+		update_mmu_cache(vma, address, *pte);
+	} else {
+		huge_page_release(page);
 	}
+	spin_unlock(&mm->page_table_lock);
+	unlock_page(page);
+	ret = VM_FAULT_MINOR;
 out:
+	return ret;
+}
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr;
+	int ret = 0;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+		if (addr < vma->vm_start)
+			addr = vma->vm_start;
+		if (addr >= vma->vm_end) {
+			ret = 0;
+			break;
+		}
+		spin_unlock(&mm->page_table_lock);
+		ret = hugetlb_fault(mm, vma, 1, addr);
+		schedule();
+		spin_lock(&mm->page_table_lock);
+		if (ret == VM_FAULT_SIGBUS) {
+			ret = -ENOMEM;
+			break;
+		}
+		ret = 0;
+	}
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 3/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
  2004-04-06 12:44   ` [patch 1/6] " Hirokazu Takahashi
  2004-04-06 12:45   ` [patch 2/6] " Hirokazu Takahashi
@ 2004-04-06 12:45   ` Hirokazu Takahashi
  2004-04-06 12:48   ` [Lhms-devel] [patch 4/6] " Hirokazu Takahashi
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:45 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is part 3 of the memory hotplug patches for hugetlbpages.

$Id: va-hugepagelist.patch,v 1.5 2004/03/30 06:14:28 iwamoto Exp $

--- linux-2.6.1.ORG2/arch/i386/mm/hugetlbpage.c	Wed Mar 24 22:49:45 2032
+++ linux-2.6.1/arch/i386/mm/hugetlbpage.c	Wed Mar 24 22:59:40 2032
@@ -25,8 +25,20 @@ int     htlbpage_max;
 static long    htlbzone_pages;
 
 static struct list_head hugepage_freelists[MAX_NUMNODES];
+static struct list_head hugepage_alllists[MAX_NUMNODES];
 static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
 
+static void register_huge_page(struct page *page)
+{
+	list_add(&page[1].list,
+		&hugepage_alllists[page_zone(page)->zone_pgdat->node_id]);
+}
+
+static void unregister_huge_page(struct page *page)
+{
+	list_del(&page[1].list);
+}
+
 static void enqueue_huge_page(struct page *page)
 {
 	list_add(&page->list,
@@ -462,6 +474,7 @@ static int try_to_free_low(int count)
 	list_for_each(p, &hugepage_freelists[0]) {
 		if (map) {
 			list_del(&map->list);
+			unregister_huge_page(map);
 			update_and_free_page(map);
 			htlbpagemem--;
 			map = NULL;
@@ -474,6 +487,7 @@ static int try_to_free_low(int count)
 	}
 	if (map) {
 		list_del(&map->list);
+		unregister_huge_page(map);
 		update_and_free_page(map);
 		htlbpagemem--;
 		count++;
@@ -500,6 +514,7 @@ static int set_hugetlb_mem_size(int coun
 			if (page == NULL)
 				break;
 			spin_lock(&htlbpage_lock);
+			register_huge_page(page);
 			enqueue_huge_page(page);
 			htlbpagemem++;
 			htlbzone_pages++;
@@ -514,6 +529,7 @@ static int set_hugetlb_mem_size(int coun
 		if (page == NULL)
 			break;
 		spin_lock(&htlbpage_lock);
+		unregister_huge_page(page);
 		update_and_free_page(page);
 		spin_unlock(&htlbpage_lock);
 	}
@@ -546,14 +562,17 @@ static int __init hugetlb_init(void)
 	if (!cpu_has_pse)
 		return -ENODEV;
 
-	for (i = 0; i < MAX_NUMNODES; ++i)
+	for (i = 0; i < MAX_NUMNODES; ++i) {
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
+		INIT_LIST_HEAD(&hugepage_alllists[i]);
+	}
 
 	for (i = 0; i < htlbpage_max; ++i) {
 		page = alloc_fresh_huge_page();
 		if (!page)
 			break;
 		spin_lock(&htlbpage_lock);
+		register_huge_page(page);
 		enqueue_huge_page(page);
 		spin_unlock(&htlbpage_lock);
 	}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Lhms-devel] [patch 4/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
                     ` (2 preceding siblings ...)
  2004-04-06 12:45   ` [patch 3/6] " Hirokazu Takahashi
@ 2004-04-06 12:48   ` Hirokazu Takahashi
  2004-04-06 13:02     ` Russell King
  2004-04-06 12:49   ` [patch 5/6] " Hirokazu Takahashi
  2004-04-06 12:50   ` [patch 6/6] " Hirokazu Takahashi
  5 siblings, 1 reply; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:48 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is part 4 of the memory hotplug patches for hugetlbpages.

$Id: va-hugepagermap.patch,v 1.11 2004/04/01 08:57:40 taka Exp $

--- linux-2.6.4.ORG/include/asm-i386/rmap.h	Thu Mar 11 11:55:34 2004
+++ linux-2.6.4/include/asm-i386/rmap.h	Thu Apr  1 15:38:03 2032
@@ -1,6 +1,9 @@
 #ifndef _I386_RMAP_H
 #define _I386_RMAP_H
 
+#define pmd_add_rmap(page, mm, address)		do { } while (0)
+#define pmd_remove_rmap(page)			do { } while (0)
+
 /* nothing to see, move along */
 #include <asm-generic/rmap.h>
 
--- linux-2.6.4.ORG/include/asm-generic/rmap.h	Thu Mar 11 11:55:34 2004
+++ linux-2.6.4/include/asm-generic/rmap.h	Thu Apr  1 15:38:03 2032
@@ -87,4 +87,51 @@ static inline void rmap_ptep_unmap(pte_t
 }
 #endif
 
+static inline void __pmd_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
+{
+#ifdef BROKEN_PPC_PTE_ALLOC_ONE
+	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
+	extern int mem_init_done;
+
+	if (!mem_init_done)
+		return;
+#endif
+	page->mapping = (void *)mm;
+	page->index = address & ~((PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE) - 1);
+}
+
+static inline void __pmd_remove_rmap(struct page * page)
+{
+	page->mapping = NULL;
+	page->index = 0;
+}
+
+static inline void pgd_add_rmap(struct page * page, struct mm_struct * mm)
+{
+#ifdef BROKEN_PPC_PTE_ALLOC_ONE
+	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
+	extern int mem_init_done;
+
+	if (!mem_init_done)
+		return;
+#endif
+	page->mapping = (void *)mm;
+	page->index = 0;
+}
+
+static inline void pgd_remove_rmap(struct page * page)
+{
+	page->mapping = NULL;
+	page->index = 0;
+}
+
+#if !defined(pmd_add_rmap)
+#define pmd_add_rmap(page, mm, address)	__pmd_add_rmap(page, mm, address)
+#endif
+
+#if !defined(pmd_remove_rmap)
+#define pmd_remove_rmap(page)		__pmd_remove_rmap(page)
+#endif
+
+
 #endif /* _GENERIC_RMAP_H */
--- linux-2.6.4.ORG/include/linux/hugetlb.h	Thu Apr  1 15:36:20 2032
+++ linux-2.6.4/include/linux/hugetlb.h	Thu Apr  1 15:38:03 2032
@@ -29,6 +29,7 @@ int is_aligned_hugepage_range(unsigned l
 int pmd_huge(pmd_t pmd);
 extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
 				int, unsigned long);
+int try_to_unmap_hugepage(struct page *, pte_addr_t, struct list_head *);
 
 extern int htlbpage_max;
 
@@ -68,6 +69,7 @@ static inline int is_vm_hugetlb_page(str
 #define is_hugepage_only_range(addr, len)	0
 #define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0)
 #define hugetlb_fault(mm, vma, write, addr)	0
+#define try_to_unmap_hugepage(page, paddr, force)	0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.4.ORG/mm/rmap.c	Thu Apr  1 14:24:07 2032
+++ linux-2.6.4/mm/rmap.c	Thu Apr  1 15:38:03 2032
@@ -30,6 +30,7 @@
 #include <linux/rmap-locking.h>
 #include <linux/cache.h>
 #include <linux/percpu.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/rmap.h>
@@ -298,15 +299,27 @@ static int FASTCALL(try_to_unmap_one(str
 static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr,
     struct list_head *force)
 {
-	pte_t *ptep = rmap_ptep_map(paddr);
-	unsigned long address = ptep_to_address(ptep);
-	struct mm_struct * mm = ptep_to_mm(ptep);
+	pte_t *ptep;
+	unsigned long address;
+	struct mm_struct * mm;
 	struct vm_area_struct * vma;
 #ifdef CONFIG_MEMHOTPLUG
 	struct page_va_list *vlist;
 #endif
 	pte_t pte;
 	int ret;
+
+	/*
+	 * Is there any better way to check whether the page is
+	 * HugePage or not?
+	 */
+	if (PageCompound(page))
+		return try_to_unmap_hugepage(page, paddr, force);
+
+	ptep = rmap_ptep_map(paddr);
+	address = ptep_to_address(ptep);
+	mm = ptep_to_mm(ptep);
+  
 
 	if (!mm)
 		BUG();
--- linux-2.6.4.ORG/mm/memory.c	Thu Apr  1 15:36:20 2032
+++ linux-2.6.4/mm/memory.c	Thu Apr  1 15:38:03 2032
@@ -113,6 +113,7 @@ static inline void free_one_pgd(struct m
 {
 	int j;
 	pmd_t * pmd;
+	struct page *page;
 
 	if (pgd_none(*dir))
 		return;
@@ -125,6 +126,8 @@ static inline void free_one_pgd(struct m
 	pgd_clear(dir);
 	for (j = 0; j < PTRS_PER_PMD ; j++)
 		free_one_pmd(tlb, pmd+j);
+	page = virt_to_page(pmd);
+	pmd_remove_rmap(page);
 	pmd_free_tlb(tlb, pmd);
 }
 
@@ -1667,6 +1670,7 @@ int handle_mm_fault(struct mm_struct *mm
 pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
 	pmd_t *new;
+	struct page *page;
 
 	spin_unlock(&mm->page_table_lock);
 	new = pmd_alloc_one(mm, address);
@@ -1682,6 +1686,8 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
 		pmd_free(new);
 		goto out;
 	}
+	page = virt_to_page(new);
+	pmd_add_rmap(new, mm, address);
 	pgd_populate(mm, pgd, new);
 out:
 	return pmd_offset(pgd, address);
--- linux-2.6.4.ORG/arch/i386/mm/pgtable.c	Thu Mar 11 11:55:56 2004
+++ linux-2.6.4/arch/i386/mm/pgtable.c	Thu Apr  1 15:38:03 2032
@@ -21,6 +21,7 @@
 #include <asm/e820.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
+#include <asm/rmap.h>
 
 void show_mem(void)
 {
@@ -206,22 +207,34 @@ void pgd_dtor(void *pgd, kmem_cache_t *c
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
+	struct page *page;
 	pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);
 
-	if (PTRS_PER_PMD == 1 || !pgd)
+	if (!pgd)
 		return pgd;
+	if (PTRS_PER_PMD == 1) {
+		page = virt_to_page(pgd);
+		pgd_add_rmap(page, mm);
+		return pgd;
+	}
 
 	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
 		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
 		if (!pmd)
 			goto out_oom;
+		page = virt_to_page(pmd);
+		__pmd_add_rmap(page, mm, (PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE)*i);
 		set_pgd(&pgd[i], __pgd(1 + __pa((u64)((u32)pmd))));
 	}
 	return pgd;
 
 out_oom:
-	for (i--; i >= 0; i--)
-		kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1));
+	for (i--; i >= 0; i--) {
+		pmd_t *pmd = (pmd_t *)__va(pgd_val(pgd[i])-1);
+		page = virt_to_page(pmd);
+		__pmd_remove_rmap(page);
+		kmem_cache_free(pmd_cache, pmd);
+	}
 	kmem_cache_free(pgd_cache, pgd);
 	return NULL;
 }
@@ -229,11 +242,20 @@ out_oom:
 void pgd_free(pgd_t *pgd)
 {
 	int i;
+	struct page *page;
 
 	/* in the PAE case user pgd entries are overwritten before usage */
-	if (PTRS_PER_PMD > 1)
-		for (i = 0; i < USER_PTRS_PER_PGD; ++i)
-			kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1));
+	if (PTRS_PER_PMD > 1) {
+		for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
+			pmd_t *pmd = (pmd_t *)__va(pgd_val(pgd[i])-1);
+			page = virt_to_page(pmd);
+			__pmd_remove_rmap(page);
+			kmem_cache_free(pmd_cache, pmd);
+		}
+	} else {
+		page = virt_to_page(pgd);
+		pgd_remove_rmap(page);
+	}
 	/* in the non-PAE case, clear_page_tables() clears user pgd entries */
 	kmem_cache_free(pgd_cache, pgd);
 }
--- linux-2.6.4.ORG/arch/i386/mm/hugetlbpage.c	Thu Apr  1 15:37:21 2032
+++ linux-2.6.4/arch/i386/mm/hugetlbpage.c	Thu Apr  1 15:38:03 2032
@@ -12,11 +12,13 @@
 #include <linux/pagemap.h>
 #include <linux/smp_lock.h>
 #include <linux/slab.h>
+#include <linux/rmap-locking.h>
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
+#include <asm/rmap.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
@@ -28,6 +30,29 @@ static struct list_head hugepage_freelis
 static struct list_head hugepage_alllists[MAX_NUMNODES];
 static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
 
+static inline void hugepgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
+{
+/* 	page->mapping = (void *)mm; */
+	page->index = address & ~((PTRS_PER_PTE * HPAGE_SIZE) - 1);
+}
+
+static inline void hugepgtable_remove_rmap(struct page * page)
+{
+	page->mapping = NULL;
+	page->index = 0;
+}
+
+static inline struct pte_chain *hugepage_add_rmap(struct page *page, pte_t *ptep, struct pte_chain *pte_chain)
+{
+	return page_add_rmap(page, ptep, pte_chain);
+}
+
+static inline void hugepage_remove_rmap(struct page *page, pte_t *ptep)
+{
+	ClearPageReferenced(page);	/* XXX */
+	page_remove_rmap(page, ptep);
+}
+
 static void register_huge_page(struct page *page)
 {
 	list_add(&page[1].list,
@@ -37,10 +62,12 @@ static void register_huge_page(struct pa
 static void unregister_huge_page(struct page *page)
 {
 	list_del(&page[1].list);
+	INIT_LIST_HEAD(&page[1].list);
 }
 
 static void enqueue_huge_page(struct page *page)
 {
+/* 	set_page_count(page, 1); */
 	list_add(&page->list,
 		&hugepage_freelists[page_zone(page)->zone_pgdat->node_id]);
 }
@@ -97,9 +124,14 @@ static pte_t *huge_pte_alloc(struct mm_s
 {
 	pgd_t *pgd;
 	pmd_t *pmd = NULL;
+	struct page *page;
 
 	pgd = pgd_offset(mm, addr);
 	pmd = pmd_alloc(mm, pgd, addr);
+	page = virt_to_page(pmd);
+	/* The following call may be redundant. */
+	hugepgtable_add_rmap(page, mm, addr);
+
 	return (pte_t *) pmd;
 }
 
@@ -147,24 +179,49 @@ int copy_hugetlb_page_range(struct mm_st
 	struct page *ptepage;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	struct pte_chain *pte_chain = NULL;
+
+	pte_chain = pte_chain_alloc(GFP_ATOMIC);
+	if (!pte_chain) {
+		spin_unlock(&dst->page_table_lock);
+		pte_chain = pte_chain_alloc(GFP_KERNEL);
+		spin_lock(&dst->page_table_lock);
+		if (!pte_chain)
+			goto nomem;
+	}
 
 	while (addr < end) {
 		dst_pte = huge_pte_alloc(dst, addr);
 		if (!dst_pte)
 			goto nomem;
+		spin_lock(&src->page_table_lock);
 		src_pte = huge_pte_offset(src, addr);
 		entry = *src_pte;
 		if (!pte_none(entry)) {
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			pte_chain = hugepage_add_rmap(ptepage, dst_pte, pte_chain);
 		}
 		set_pte(dst_pte, entry);
+		spin_unlock(&src->page_table_lock);
 		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 		addr += HPAGE_SIZE;
+		if (pte_chain)
+			continue;
+		pte_chain = pte_chain_alloc(GFP_ATOMIC);
+		if (!pte_chain) {
+			spin_unlock(&dst->page_table_lock);
+			pte_chain = pte_chain_alloc(GFP_KERNEL);
+			spin_lock(&dst->page_table_lock);
+			if (!pte_chain)
+				goto nomem;
+		}
 	}
+	pte_chain_free(pte_chain);
 	return 0;
 
 nomem:
+	pte_chain_free(pte_chain);
 	return -ENOMEM;
 }
 
@@ -336,6 +393,7 @@ void unmap_hugepage_range(struct vm_area
 		if (pte_none(*pte))
 			continue;
 		page = pte_page(*pte);
+		hugepage_remove_rmap(page, pte);
 		huge_page_release(page);
 		pte_clear(pte);
 	}
@@ -360,6 +418,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	struct page *page;
 	unsigned long idx;
 	pte_t *pte = huge_pte_alloc(mm, address);
+	struct pte_chain *pte_chain = NULL;
 	int ret;
 
 	BUG_ON(vma->vm_start & ~HPAGE_MASK);
@@ -375,6 +434,16 @@ int hugetlb_fault(struct mm_struct *mm, 
 		goto out;
 	}
 
+	pte_chain = pte_chain_alloc(GFP_ATOMIC);
+	if (!pte_chain) {
+		pte_chain = pte_chain_alloc(GFP_KERNEL);
+		if (!pte_chain) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		pte = huge_pte_alloc(mm, address);
+	}
+
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
 		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 again:
@@ -402,6 +471,7 @@ again:
 	spin_lock(&mm->page_table_lock);
 	if (pte_none(*pte)) {
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+		pte_chain = hugepage_add_rmap(page, pte, pte_chain);
 		flush_tlb_page(vma, address);
 		update_mmu_cache(vma, address, *pte);
 	} else {
@@ -411,6 +481,7 @@ again:
 	unlock_page(page);
 	ret = VM_FAULT_MINOR;
 out:
+	pte_chain_free(pte_chain);
 	return ret;
 }
 
@@ -441,6 +512,86 @@ int hugetlb_prefault(struct address_spac
 		}
 		ret = 0;
 	}
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+static inline unsigned long hugeptep_to_address(pte_t *ptep)
+{
+	struct page *page = kmap_atomic_to_page(ptep);
+	unsigned long low_bits;
+	low_bits = ((unsigned long)ptep & ~PAGE_MASK)/sizeof(pte_t)*HPAGE_SIZE;
+	return page->index + low_bits;
+}
+
+int try_to_unmap_hugepage(struct page * page, pte_addr_t paddr, struct list_head *force)
+{
+	pte_t *ptep = rmap_ptep_map(paddr);
+	unsigned long address = hugeptep_to_address(ptep);
+	struct mm_struct * mm = ptep_to_mm(ptep);
+	struct vm_area_struct * vma;
+	pte_t pte;
+	int ret;
+
+	if (!mm)
+		BUG();
+
+	/*
+	 * We need the page_table_lock to protect us from page faults,
+	 * munmap, fork, etc...
+	 */
+	if (!spin_trylock(&mm->page_table_lock)) {
+		rmap_ptep_unmap(ptep);
+		return SWAP_AGAIN;
+	}
+
+
+	/* During mremap, it's possible pages are not in a VMA. */
+	vma = find_vma(mm, address);
+	if (!vma) {
+		ret = SWAP_FAIL;
+		goto out_unlock;
+	}
+
+	/* The page is mlock()d, we cannot swap it out. */
+	if (force == NULL && (vma->vm_flags & VM_LOCKED)) {
+		BUG();	/* Never come here */
+	}
+
+	/* Nuke the page table entry. */
+	flush_cache_page(vma, address);
+	pte = ptep_get_and_clear(ptep);
+	flush_tlb_range(vma, address, address + HPAGE_SIZE);
+
+	if (PageSwapCache(page)) {
+		BUG();	/* Never come here */
+	} else {
+		unsigned long pgidx;
+		/*
+		 * If a nonlinear mapping then store the file page offset
+		 * in the pte.
+		 */
+		pgidx = (address - vma->vm_start) >> HPAGE_SHIFT;
+		pgidx += vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
+		pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+		if (page->index != pgidx) {
+#if 0
+			set_pte(ptep, pgoff_to_pte(page->index));
+			BUG_ON(!pte_file(*ptep));
+#endif
+		}
+	}
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pte))
+		set_page_dirty(page);
+
+	mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
+	page_cache_release(page);
+	ret = SWAP_SUCCESS;
+
+out_unlock:
+	rmap_ptep_unmap(ptep);
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 5/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
                     ` (3 preceding siblings ...)
  2004-04-06 12:48   ` [Lhms-devel] [patch 4/6] " Hirokazu Takahashi
@ 2004-04-06 12:49   ` Hirokazu Takahashi
  2004-04-06 12:50   ` [patch 6/6] " Hirokazu Takahashi
  5 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:49 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is part 5 of the memory hotplug patches for hugetlbpages.

--- linux-2.6.5.ORG/include/linux/hugetlb.h	Tue Apr  6 22:28:09 2032
+++ linux-2.6.5/include/linux/hugetlb.h	Tue Apr  6 15:00:59 2032
@@ -31,6 +31,7 @@ int pmd_huge(pmd_t pmd);
 extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
 				int, unsigned long);
 int try_to_unmap_hugepage(struct page *, pte_addr_t, struct list_head *);
+int remap_hugetlb_pages(struct zone *);
 
 extern int htlbpage_max;
 
@@ -83,6 +84,7 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0)
 #define hugetlb_fault(mm, vma, write, addr)	0
 #define try_to_unmap_hugepage(page, paddr, force)	0
+#define remap_hugetlb_pages(zone)		0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0		/* Keep the compiler happy */
--- linux-2.6.5.ORG/arch/i386/mm/hugetlbpage.c	Tue Apr  6 22:28:09 2032
+++ linux-2.6.5/arch/i386/mm/hugetlbpage.c	Tue Apr  6 22:30:59 2032
@@ -13,6 +13,7 @@
 #include <linux/smp_lock.h>
 #include <linux/slab.h>
 #include <linux/rmap-locking.h>
+#include <linux/memhotplug.h>
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
@@ -92,7 +93,10 @@ static struct page *dequeue_huge_page(vo
 static struct page *alloc_fresh_huge_page(void)
 {
 	static int nid = 0;
+	struct pglist_data *pgdat;
 	struct page *page;
+	while ((pgdat = NODE_DATA(nid)) == NULL || !pgdat->enabled)
+		nid = (nid + 1) % numnodes;
 	page = alloc_pages_node(nid, GFP_HIGHUSER, HUGETLB_PAGE_ORDER);
 	nid = (nid + 1) % numnodes;
 	return page;
@@ -114,6 +118,8 @@ static struct page *alloc_hugetlb_page(v
 	htlbpagemem--;
 	spin_unlock(&htlbpage_lock);
 	set_page_count(page, 1);
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+			 1 << PG_referenced | 1 << PG_again);
 	page->lru.prev = (void *)free_huge_page;
 	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
 		clear_highpage(&page[i]);
@@ -468,6 +474,15 @@ again:
 			goto again;
 		}
 	}
+
+	if (page->mapping == NULL) {
+		 BUG_ON(! PageAgain(page));
+		/* This page will go back to freelists[] */
+		huge_page_release(page);	/* XXX */
+		unlock_page(page);
+		goto again;
+	}
+
 	spin_lock(&mm->page_table_lock);
 	if (pte_none(*pte)) {
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
@@ -614,7 +629,7 @@ static void update_and_free_page(struct 
 	__free_pages(page, HUGETLB_PAGE_ORDER);
 }
 
-static int try_to_free_low(int count)
+int try_to_free_hugepages(int idx, int count, struct zone *zone)
 {
 	struct list_head *p;
 	struct page *page, *map;
@@ -622,7 +637,7 @@ static int try_to_free_low(int count)
 	map = NULL;
 	spin_lock(&htlbpage_lock);
 	/* all lowmem is on node 0 */
-	list_for_each(p, &hugepage_freelists[0]) {
+	list_for_each(p, &hugepage_freelists[idx]) {
 		if (map) {
 			list_del(&map->list);
 			unregister_huge_page(map);
@@ -633,7 +648,8 @@ static int try_to_free_low(int count)
 				break;
 		}
 		page = list_entry(p, struct page, list);
-		if (!PageHighMem(page))
+		if ((zone == NULL && !PageHighMem(page)) ||
+					(page_zone(page) == zone))
 			map = page;
 	}
 	if (map) {
@@ -647,6 +663,11 @@ static int try_to_free_low(int count)
 	return count;
 }
 
+int try_to_free_low(int count)
+{
+	return try_to_free_hugepages(0, count, NULL);
+}
+
 static int set_hugetlb_mem_size(int count)
 {
 	int lcount;
@@ -686,6 +707,146 @@ static int set_hugetlb_mem_size(int coun
 	}
 	return (int) htlbzone_pages;
 }
+
+#ifdef CONFIG_MEMHOTPLUG
+static int copy_hugepage(struct page *to, struct page *from)
+{
+	int size;
+	for (size = 0; size < HPAGE_SIZE; size += PAGE_SIZE) {
+		copy_highpage(to, from);
+		to++;
+		from++;
+	}
+	return 0;
+}
+
+/*
+ * Allocate a hugepage from the Buddy system directly.
+ */
+static struct page *
+hugepage_remap_alloc(int nid)
+{
+	struct page *page;
+	/* 
+	 * ToDo:
+	 * - NUMA aware page allocation is required.  We should allocate
+	 *   a hugepage from the node which the process depends on.
+	 * - New hugepages should be preallocated prior to remapping pages
+	 *   so that a lack of memory can be detected beforehand.
+	 * - New hugepages should be allocated from the node specified by nid.
+	 */
+	page = alloc_fresh_huge_page();
+	
+	if (page == NULL) {
+		printk(KERN_WARNING "remap: Failed to allocate new hugepage\n");
+	} else {
+		spin_lock(&htlbpage_lock);
+		register_huge_page(page);
+		enqueue_huge_page(page);
+		htlbpagemem++;
+		htlbzone_pages++;
+		spin_unlock(&htlbpage_lock);
+	}
+	page = alloc_hugetlb_page();
+	unregister_huge_page(page);	/* XXXX */
+	return page;
+}
+
+/*
+ * Free a hugepage back into the Buddy system directly.
+ */
+static int
+hugepage_delete(struct page *page)
+{
+        BUG_ON(page_count(page) != 1);
+        BUG_ON(page->mapping);
+
+	spin_lock(&htlbpage_lock);
+	update_and_free_page(page);
+	spin_unlock(&htlbpage_lock);
+        return 0;
+}
+
+static int
+hugepage_register(struct page *page)
+{
+	spin_lock(&htlbpage_lock);
+	register_huge_page(page);
+	spin_unlock(&htlbpage_lock);
+        return 0;
+}
+
+static int
+hugepage_release_buffer(struct page *page)
+{
+	BUG();
+	return -1;
+}
+
+static struct remap_operations hugepage_remap_ops = {
+	.remap_alloc_page       = hugepage_remap_alloc,
+	.remap_delete_page      = hugepage_delete,
+	.remap_copy_page        = copy_hugepage,
+	.remap_lru_add_page     = hugepage_register,
+	.remap_release_buffers  = hugepage_release_buffer,
+	.remap_prepare          = NULL,
+	.remap_stick_page       = NULL
+};
+
+int remap_hugetlb_pages(struct zone *zone)
+{
+	struct list_head *p;
+	struct page *page, *map;
+	int idx = zone->zone_pgdat->node_id;
+	LIST_HEAD(templist);
+	int ret = 0;
+
+	try_to_free_hugepages(idx, -htlbpagemem, zone);
+/* 	htlbpage_max = set_hugetlb_mem_size(htlbpage_max); */
+
+	map = NULL;
+	spin_lock(&htlbpage_lock);
+	list_for_each(p, &hugepage_alllists[idx]) {
+		page = list_entry(p, struct page, list);
+		if (map) {
+			page_cache_get(map-1);
+			unregister_huge_page(map-1);
+			list_add(&map->list, &templist);
+			map = NULL;
+		}
+		if (page_zone(page) == zone) {
+			map = page;
+		}
+	}
+	if (map) {
+		page_cache_get(map-1);
+		unregister_huge_page(map-1);
+		list_add(&map->list, &templist);
+		map = NULL;
+	}
+	spin_unlock(&htlbpage_lock);
+
+	while (!list_empty(&templist)) {
+		page = list_entry(templist.next, struct page, list);
+		list_del(&page->list);
+		INIT_LIST_HEAD(&page->list);
+		page--;
+
+		if (page_count(page) <= 1 || page->mapping == NULL ||
+				remap_onepage(page, REMAP_ANYNODE, 0, &hugepage_remap_ops)) {
+			/* free the page later */
+			spin_lock(&htlbpage_lock);
+			register_huge_page(page);
+			spin_unlock(&htlbpage_lock);
+			page_cache_release(page);
+			ret++;
+		}
+	}
+
+	htlbpage_max = set_hugetlb_mem_size(htlbpage_max);
+	return ret;
+}
+#endif /* CONFIG_MEMHOTPLUG */
 
 int hugetlb_sysctl_handler(ctl_table *table, int write,
 		struct file *file, void *buffer, size_t *length)
--- linux-2.6.5.ORG/mm/memhotplug.c	Tue Apr  6 22:28:09 2032
+++ linux-2.6.5/mm/memhotplug.c	Tue Apr  6 15:00:59 2032
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/rmap-locking.h>
+#include <linux/hugetlb.h>
 #include <linux/memhotplug.h>
 
 #ifdef CONFIG_KDB
@@ -595,6 +596,8 @@ int remapd(void *p)
 		return 0;
 	}
 	atomic_inc(&remapd_count);
+	if (remap_hugetlb_pages(zone))
+		goto out;
 	on_each_cpu(lru_drain_schedule, NULL, 1, 1);
 	while(nr_failed < 100) {
 		spin_lock_irq(&zone->lru_lock);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 6/6] memory hotplug for hugetlbpages
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
                     ` (4 preceding siblings ...)
  2004-04-06 12:49   ` [patch 5/6] " Hirokazu Takahashi
@ 2004-04-06 12:50   ` Hirokazu Takahashi
  5 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 12:50 UTC (permalink / raw)
  To: linux-kernel, lhms-devel

This is a part 6 of memory hotplug patches for hugetlbpages.

$Id: va-hugepageproc.patch,v 1.7 2004/04/05 06:46:16 iwamoto Exp $

--- linux-2.6.4.ORG/mm/page_alloc.c	Thu Apr  1 18:32:34 2032
+++ linux-2.6.4/mm/page_alloc.c	Thu Apr  1 18:32:44 2032
@@ -1989,6 +1989,8 @@ int min_free_kbytes_sysctl_handler(ctl_t
 }
 
 #ifdef CONFIG_MEMHOTPLUG
+extern int mhtest_hpage_read(char *p, int, int);
+
 static int mhtest_read(char *page, char **start, off_t off, int count,
     int *eof, void *data)
 {
@@ -2012,9 +2014,15 @@ static int mhtest_read(char *page, char 
 				/* skip empty zone */
 				continue;
 			len = sprintf(p,
-			    "\t%s[%d]: free %ld, active %ld, present %ld\n",
+			    "\t%s[%d]: free %ld, active %ld, present %ld",
 			    z->name, NODEZONE(i, j),
 			    z->free_pages, z->nr_active, z->present_pages);
+			p += len;
+#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_MEMHOTPLUG)
+			len = mhtest_hpage_read(p, i, j);
+			p += len;
+#endif
+			len = sprintf(p, "\n");
 			p += len;
 		}
 	}
--- linux-2.6.4.ORG/arch/i386/mm/hugetlbpage.c	Thu Apr  1 18:30:33 2032
+++ linux-2.6.4/arch/i386/mm/hugetlbpage.c	Thu Apr  1 18:32:44 2032
@@ -846,6 +846,24 @@ int remap_hugetlb_pages(struct zone *zon
 }
 #endif /* CONFIG_MEMHOTPLUG */
 
+#ifdef CONFIG_MEMHOTPLUG
+int mhtest_hpage_read(char *p, int nodenum, int zonenum)
+{
+	struct page *page;
+	int total = 0;
+	int free = 0;
+	spin_lock(&htlbpage_lock);
+	list_for_each_entry(page, &hugepage_alllists[nodenum], list) {
+		if (page_zonenum(page) == zonenum) total++;
+	}
+	list_for_each_entry(page, &hugepage_freelists[nodenum], list) {
+		if (page_zonenum(page) == zonenum) free++;
+	}
+	spin_unlock(&htlbpage_lock);
+	return sprintf(p, " / HugePage free %d, total %d", free, total);
+}
+#endif
+
 int hugetlb_sysctl_handler(ctl_table *table, int write,
 		struct file *file, void *buffer, size_t *length)
 {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] [patch 4/6] memory hotplug for hugetlbpages
  2004-04-06 12:48   ` [Lhms-devel] [patch 4/6] " Hirokazu Takahashi
@ 2004-04-06 13:02     ` Russell King
  2004-04-06 13:11       ` Hirokazu Takahashi
  0 siblings, 1 reply; 27+ messages in thread
From: Russell King @ 2004-04-06 13:02 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel, lhms-devel

On Tue, Apr 06, 2004 at 09:48:01PM +0900, Hirokazu Takahashi wrote:
> @@ -1667,6 +1670,7 @@ int handle_mm_fault(struct mm_struct *mm
>  pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
>  {
>  	pmd_t *new;
> +	struct page *page;
>  
>  	spin_unlock(&mm->page_table_lock);
>  	new = pmd_alloc_one(mm, address);
> @@ -1682,6 +1686,8 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
>  		pmd_free(new);
>  		goto out;
>  	}
> +	page = virt_to_page(new);
> +	pmd_add_rmap(new, mm, address);

Doesn't this want to be:

	pmd_add_rmap(page, mm, address);

?

And how about collapsing this down to:

	pmd_add_rmap(virt_to_page(new), mm, address);

?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] [patch 4/6] memory hotplug for hugetlbpages
  2004-04-06 13:02     ` Russell King
@ 2004-04-06 13:11       ` Hirokazu Takahashi
  0 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-06 13:11 UTC (permalink / raw)
  To: rmk+lkml; +Cc: linux-kernel, lhms-devel

Hello,

> > @@ -1667,6 +1670,7 @@ int handle_mm_fault(struct mm_struct *mm
> >  pmd_t fastcall *__pmd_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
> >  {
> >  	pmd_t *new;
> > +	struct page *page;
> >  
> >  	spin_unlock(&mm->page_table_lock);
> >  	new = pmd_alloc_one(mm, address);
> > @@ -1682,6 +1686,8 @@ pmd_t fastcall *__pmd_alloc(struct mm_st
> >  		pmd_free(new);
> >  		goto out;
> >  	}
> > +	page = virt_to_page(new);
> > +	pmd_add_rmap(new, mm, address);
> 
> Doesn't this want to be:
> 
> 	pmd_add_rmap(page, mm, address);
> 
> ?
> 
> And how about collapsing this down to:
> 
> 	pmd_add_rmap(virt_to_page(new), mm, address);

Yes, it can be. I'll fix it. 

But I guess this code will be replaced with objrmap in the not too
distant future.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 1/3] memory hotplug prototype
  2004-04-06 10:56 ` [patch 1/3] " IWAMOTO Toshihiro
@ 2004-04-06 17:12   ` Dave Hansen
  2004-04-07  6:10     ` IWAMOTO Toshihiro
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2004-04-06 17:12 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: Linux Kernel Mailing List, lhms

On Tue, 2004-04-06 at 03:56, IWAMOTO Toshihiro wrote:
>                 for (node = local_node + 1; node < numnodes; node++)
> -                       j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
> +                       j = build_zonelists_node(NODE_DATA(node),
> +                           zonelist, j, k);

This line is probably too long already, but please leave it.  If you
want to clean things up, please send a separate cleanup patch.

> -/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
> -#define __GFP_DMA	0x01
> -#define __GFP_HIGHMEM	0x02
> +/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low three bits) */
> +#define __GFP_DMA		0x01
> +#define __GFP_HIGHMEM		0x02
> +#define __GFP_HOTREMOVABLE	0x03

Are you still determined to add a zone like this?  Hotplug will
eventually have to be done on all of the zones (NORMAL, DMA, etc...), so
it seems a bit shortsighted to add a zone like this.  I think it would
be much more valuable to be able to attempt hotremove operations on any
zone, even if they are unlikely to succeed.

>  #define put_page_testzero(p)				\
>  	({						\
> -		BUG_ON(page_count(p) == 0);		\
> +		if (page_count(p) == 0) {		\
> +			int i;						\
> +			printk("Page: %lx ", (long)p);			\
> +			for(i = 0; i < sizeof(struct page); i++)	\
> +				printk(" %02x", ((unsigned char *)p)[i]); \
> +			printk("\n");					\
> +			BUG();				\
> +		}					\
>  		atomic_dec_and_test(&(p)->count);	\
>  	})

Could you pull this debugging code out, or put it in an out-of-line
function?  Stuff like this in inline functions or macros just bloats
code.

> -#define try_to_unmap(page)	SWAP_FAIL
> +#define try_to_unmap(page, force)	SWAP_FAIL
>  #endif /* CONFIG_MMU *
...
> +#ifdef CONFIG_MEMHOTPLUG
> +               vlist = kmalloc(sizeof(struct page_va_list), GFP_KERNEL);
> +               vlist->mm = mm;
> +               vlist->addr = address;
> +               list_add(&vlist->list, force);
> +#endif


Could you explain what you're trying to do with try_to_unmap() and why
all of the calls need to be changed?

> -	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> +	int error = radix_tree_preload((gfp_mask & ~GFP_ZONEMASK) |
> +	    ((gfp_mask & GFP_ZONEMASK) == __GFP_DMA ? __GFP_DMA : 0));

What is this doing?  Trying to filter off the highmem flag without
affecting the hotremove flag???

>  	lock_page(page);
> +	if (page->mapping == NULL) {
> +		BUG_ON(! PageAgain(page));
> +		unlock_page(page);
> +		page_cache_release(page);
> +		pte_chain_free(pte_chain);
> +		goto again;
> +	}
> +	BUG_ON(PageAgain(page));

You might want to add a little comment here noting that this is for the
hotremove case only.  No normal paths hit it.
 

> -#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
> +#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMHOTPLUG)

I think it's safe to say that you need an aggregate config option for
this one :)

>  /*
>   * Builds allocation fallback zone lists.
>   */
> -static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
> +static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
>  {
> +
> +	if (! pgdat->enabled)
> +		return j;

Normal Linux style is:
> +	if (!pgdat->enabled)
> +		return j;


> +	if (k != ZONE_HOTREMOVABLE &&
> +	    pgdat->removable)
> +		return j;

What is this check supposed to do?


> -		memset(zonelist, 0, sizeof(*zonelist));
> +		/* memset(zonelist, 0, sizeof(*zonelist)); */

Why is this memset unnecessary now?


>  		j = 0;
>  		k = ZONE_NORMAL;
> -		if (i & __GFP_HIGHMEM)
> +		hotremovable = 0;
> +		switch (i) {
> +		default:
> +			BUG();
> +			return;
> +		case 0:
> +			k = ZONE_NORMAL;
> +			break;
> +		case __GFP_HIGHMEM:
>  			k = ZONE_HIGHMEM;
> -		if (i & __GFP_DMA)
> +			break;
> +		case __GFP_DMA:
>  			k = ZONE_DMA;
> +			break;
> +		case __GFP_HOTREMOVABLE:
> +#ifdef CONFIG_MEMHOTPLUG
> +			k = ZONE_HIGHMEM;
> +#else
> +			k = ZONE_HOTREMOVABLE;
> +#endif
> +			hotremovable = 1;
> +			break;
> +		}

What if, in the header you did this:
#ifndef CONFIG_MEMHOTPLUG
#define ZONE_HOTREMOVABLE ZONE_HIGHMEM
#endif

Then, you wouldn't need the #ifdef in the .c file.

There is way too much ifdef'ing going on in this code.  The general
Linux rule is: no #ifdef's in .c files.  This is a good example why :)

> +#ifndef CONFIG_MEMHOTPLUG
>   		j = build_zonelists_node(pgdat, zonelist, j, k);
>   		/*
>   		 * Now we build the zonelist so that it contains the zones
> @@ -1267,22 +1304,59 @@ static void __init build_zonelists(pg_da
>   		 * node N+1 (modulo N)
>   		 */
>   		for (node = local_node + 1; node < numnodes; node++)
> - 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
> +			j = build_zonelists_node(NODE_DATA(node),
> +			    zonelist, j, k);
>   		for (node = 0; node < local_node; node++)
> - 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
> - 
> -		zonelist->zones[j++] = NULL;
> +			j = build_zonelists_node(NODE_DATA(node),
> +			    zonelist, j, k);
> +#else
> +		while (hotremovable >= 0) {
> +			for(; k >= 0; k--) {
> +				zone = pgdat->node_zones + k;
> +				for (node = local_node; ;) {
> +					if (NODE_DATA(node) == NULL ||
> +					    ! NODE_DATA(node)->enabled ||
> +					    (!! NODE_DATA(node)->removable) !=
> +					    (!! hotremovable))
> +						goto next;
> +					zone = NODE_DATA(node)->node_zones + k;
> +					if (zone->present_pages)
> +						zonelist->zones[j++] = zone;
> +				next:
> +					node = (node + 1) % numnodes;
> +					if (node == local_node)
> +						break;
> +				}
> +			}
> +			if (hotremovable) {
> +				/* place non-hotremovable after hotremovable */
> +				k = ZONE_HIGHMEM;
> +			}
> +			hotremovable--;
> +		}
> +#endif
> +		BUG_ON(j > sizeof(zonelist->zones) /
> +		    sizeof(zonelist->zones[0]) - 1);
> +		for(; j < sizeof(zonelist->zones) /
> +		    sizeof(zonelist->zones[0]); j++)
> +			zonelist->zones[j] = NULL;
>  	} 
>  }

That code needs to be separated out somehow.  It's exceedingly hard to
understand what's going on.

> -void __init build_all_zonelists(void)
> +#ifdef CONFIG_MEMHOTPLUG
> +void
> +#else
> +void __init
> +#endif
> +build_all_zonelists(void)
>  {
>  	int i;
>  
>  	for(i = 0 ; i < numnodes ; i++)
> -		build_zonelists(NODE_DATA(i));
> +		if (NODE_DATA(i) != NULL)
> +			build_zonelists(NODE_DATA(i));
>  	printk("Built %i zonelists\n", numnodes);
>  }

Please make that __init __devinit, instead of using the #ifdef.  That
will turn off the __init when CONFIG_HOTPLUG is turned on.  BTW, you
should make HOTPLUGMEM dependent on CONFIG_HOTPLUG.


-- Dave


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 1/3] memory hotplug prototype
  2004-04-06 17:12   ` Dave Hansen
@ 2004-04-07  6:10     ` IWAMOTO Toshihiro
  0 siblings, 0 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-07  6:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: IWAMOTO Toshihiro, Linux Kernel Mailing List, lhms

At Tue, 06 Apr 2004 10:12:58 -0700,
Dave Hansen wrote:
> > -/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
> > -#define __GFP_DMA	0x01
> > -#define __GFP_HIGHMEM	0x02
> > +/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low three bits) */
> > +#define __GFP_DMA		0x01
> > +#define __GFP_HIGHMEM		0x02
> > +#define __GFP_HOTREMOVABLE	0x03
> 
> Are you still determined to add a zone like this?  Hotplug will
> eventually have to be done on all of the zones (NORMAL, DMA, etc...), so
> it seems a bit shortsighted to add a zone like this.  I think it would

The hotremoval naming is a bit confusing.
There is a __GFP_HOTREMOVABLE macro for alloc_page and a
node_zonelists[ZONE_HOTREMOVABLE] list, but there is no
node_zones[ZONE_HOTREMOVABLE].  The hotremovable attribute of a zone is
stored in its pgdat, and every zone can be hotremovable.

> > -#define try_to_unmap(page)	SWAP_FAIL
> > +#define try_to_unmap(page, force)	SWAP_FAIL
> >  #endif /* CONFIG_MMU *

> Could you explain what you're trying to do with try_to_unmap() and why
> all of the calls need to be changed?

This is necessary for handling mlocked pages.

> > -	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > +	int error = radix_tree_preload((gfp_mask & ~GFP_ZONEMASK) |
> > +	    ((gfp_mask & GFP_ZONEMASK) == __GFP_DMA ? __GFP_DMA : 0));
> 
> What is this doing?  Trying to filter off the highmem flag without
> affecting the hotremove flag???

Because __GFP_HOTREMOVABLE & ~ __GFP_HIGHMEM evaluates to __GFP_DMA.
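
Spelled out with the zone modifier values from the patch (illustration
only, this is not part of the diff):

	#define __GFP_DMA		0x01
	#define __GFP_HIGHMEM		0x02
	#define __GFP_HOTREMOVABLE	0x03	/* == __GFP_DMA | __GFP_HIGHMEM */

	/*
	 * __GFP_HOTREMOVABLE & ~__GFP_HIGHMEM == 0x03 & ~0x02 == 0x01,
	 * which is __GFP_DMA.  Simply masking off __GFP_HIGHMEM would
	 * therefore turn a hotremovable allocation into a DMA one, so the
	 * radix_tree_preload() call above strips the whole GFP_ZONEMASK and
	 * re-adds __GFP_DMA only when the caller really asked for DMA.
	 */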


> >  /*
> >   * Builds allocation fallback zone lists.
> >   */
> > -static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
> > +static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
> >  {
> > +
> > +	if (! pgdat->enabled)
> > +		return j;
> 
> Normal Linux style is:
> > +	if (!pgdat->enabled)
> > +		return j;

Ok.

> > +	if (k != ZONE_HOTREMOVABLE &&
> > +	    pgdat->removable)
> > +		return j;
> 
> What is this check supposed to do?

This code was used when build_zonelists_node was called from
build_zonelists even if CONFIG_MEMHOTPLUG was defined.
So, this diff is no longer necessary.

> > -		memset(zonelist, 0, sizeof(*zonelist));
> > +		/* memset(zonelist, 0, sizeof(*zonelist)); */
> 
> Why is this memset unnecessary now?

This diff is also no longer necessary; the unused tail elements of the
zonelist are zeroed in build_zonelists anyway.

--
IWAMOTO Toshihiro

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
                   ` (4 preceding siblings ...)
  2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
@ 2004-04-07 18:12 ` Martin J. Bligh
  2004-04-07 18:59   ` [Lhms-devel] " Mike Kravetz
  2004-04-08  9:16   ` IWAMOTO Toshihiro
  5 siblings, 2 replies; 27+ messages in thread
From: Martin J. Bligh @ 2004-04-07 18:12 UTC (permalink / raw)
  To: IWAMOTO Toshihiro, linux-kernel, lhms-devel

> This is an updated version of memory hotplug prototype patch, which I
> have posted here several times.

I really, really suggest you take a look at Dave McCracken's work, which
he posted as "Basic nonlinear for x86" recently. It's going to be much
much easier to use this abstraction than creating 1000s of zones ...

M.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-07 18:12 ` [patch 0/3] memory hotplug prototype Martin J. Bligh
@ 2004-04-07 18:59   ` Mike Kravetz
  2004-04-07 19:20     ` Dave Hansen
  2004-04-07 22:33     ` Martin J. Bligh
  2004-04-08  9:16   ` IWAMOTO Toshihiro
  1 sibling, 2 replies; 27+ messages in thread
From: Mike Kravetz @ 2004-04-07 18:59 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, lhms-devel

On Wed, Apr 07, 2004 at 11:12:55AM -0700, Martin J. Bligh wrote:
> > This is an updated version of memory hotplug prototype patch, which I
> > have posted here several times.
> 
> I really, really suggest you take a look at Dave McCracken's work, which
> he posted as "Basic nonlinear for x86" recently. It's going to be much
> much easier to use this abstraction than creating 1000s of zones ...
> 

I agree.  However, one could argue that taking a zone offline is 'easier'
than taking a 'section' offline: at least right now.  Note that I said
easier NOT better.  Currently a section represents a subset of one or more
zones.  Ideally, these sections represent units that can be added or
removed.  IIRC these sections only define a range of physical memory.
To determine if it is possible to take a section offline, one needs to
dig into the zone(s) that the section may be associated with.  We'll
have to do things like:
- Stop allocations of pages associated with the section.
- Grab all 'free pages' associated with the section.
- Try to reclaim/free all pages associated with the section.
  - Work on this until all pages in the section are not in use (or free).
  - OR give up if we know we will not succeed.

My claim of zones being easier to work with today is due to the
fact that zones contain much of the data/infrastructure to make
these types of operations easy.  For example, in IWAMOTO's code a
node/zone can be taken offline when 'z->present_pages == z->free_pages'.
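
To make the contrast concrete, here is a rough sketch (the section-side
helper is hypothetical; only the zone-side check reflects existing
fields):

	/* zone case: everything needed is already in struct zone */
	static int zone_can_offline(struct zone *z)
	{
		return z->present_pages == z->free_pages;
	}

	/* section case: all we have is a pfn range, so we must inspect
	 * the owning zone(s) page by page */
	static int section_can_offline(unsigned long start_pfn,
					unsigned long nr_pages)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
			if (page_count(pfn_to_page(pfn)) != 0)
				return 0;	/* something still in use */
		return 1;
	}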

I've been thinking about how to take a section (or any 'block') of
memory offline.  To do this the offlining code needs to somehow
'allocate' all the pages associated with the section.  After
'allocation', the code knows the pages are not 'in use' and can safely
be taken offline.  Any suggestions on how to approach this?  I don't think
we can add any infrastructure to section definitions as these will
need to be relatively small.  Do we need to teach some of the code
paths that currently understand zones about sections?

Thanks,
-- 
Mike

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-07 18:59   ` [Lhms-devel] " Mike Kravetz
@ 2004-04-07 19:20     ` Dave Hansen
  2004-04-07 22:33     ` Martin J. Bligh
  1 sibling, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2004-04-07 19:20 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Martin J. Bligh, IWAMOTO Toshihiro, Linux Kernel Mailing List, lhms

On Wed, 2004-04-07 at 11:59, Mike Kravetz wrote:
> I've been thinking about how to take a section (or any 'block') of
> memory offline.  To do this the offlining code needs to somehow
> 'allocate' all the pages associated with the section.  After
> 'allocation', the code knows the pages are not 'in use' and safely
> offline.  Any suggestions on how to approach this?  I don't think
> we can add any infrastructure to section definitions as these will
> need to be relatively small.  Do we need to teach some of the code
> paths that currently understand zones about sections?

No, we don't need to teach zones about sections, yet (we may never have
to).  Zones currently know about contiguous pfn ranges. 
CONFIG_NONLINEAR (CNL) allows arbitrary mappings from virtual and
physical memory locations to pfns.  Using this, we can keep zones
dealing with contiguous pfn ranges, and still back them with any random
memory that we want.

I think that the code changes that will be needed to offline sections
will be mostly limited to teaching the page allocator a few things about
allocating particular pfn (or physical) ranges.  If you can alloc() a
certain set of memory, then it's free to offline.

offline_section(section) {
	mark_section_going_offline(section)
	pfn = section_to_pfn(section)
	some_alloc_pages(SECTION_ORDER, pfn, ...)
	if alloc is ok
		mark_section_offline(section)
	else
		try smaller pieces
}

The mark_section_going_offline() state will be checked by the
__free_pages() path to intercept any page that's going offline and keep
it from being freed back into the page allocator.  After being
intercepted like this, the page can then be considered "allocated" for
the removal process.
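
A sketch of what that interception could look like in the free path
(section_going_offline() and capture_offline_page() are placeholders,
not existing interfaces; the rest is modelled on 2.6's __free_pages()):

void __free_pages(struct page *page, unsigned int order)
{
	if (!PageReserved(page) && put_page_testzero(page)) {
		/*
		 * A page whose section has been marked "going offline" is
		 * captured here instead of going back to the buddy lists;
		 * from now on it counts as allocated by the removal code.
		 */
		if (section_going_offline(page)) {
			capture_offline_page(page, order);
			return;
		}
		if (order == 0)
			free_hot_page(page);
		else
			__free_pages_ok(page, order);
	}
}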

I think the mark_section_going_offline() stage is what some others have
referred to as a "caged" state before memory is offlined.

-- Dave


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-07 18:59   ` [Lhms-devel] " Mike Kravetz
  2004-04-07 19:20     ` Dave Hansen
@ 2004-04-07 22:33     ` Martin J. Bligh
  2004-04-08 12:41       ` Hirokazu Takahashi
  1 sibling, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-04-07 22:33 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: IWAMOTO Toshihiro, linux-kernel, lhms-devel

--On Wednesday, April 07, 2004 11:59:53 -0700 Mike Kravetz <kravetz@us.ibm.com> wrote:

> On Wed, Apr 07, 2004 at 11:12:55AM -0700, Martin J. Bligh wrote:
>> > This is an updated version of memory hotplug prototype patch, which I
>> > have posted here several times.
>> 
>> I really, really suggest you take a look at Dave McCracken's work, which
>> he posted as "Basic nonlinear for x86" recently. It's going to be much
>> much easier to use this abstraction than creating 1000s of zones ...
>> 
> 
> I agree.  However, one could argue that taking a zone offline is 'easier'
> than taking a 'section' offline: at least right now.  Note that I said
> easier NOT better.  Currently a section represents a subset of one or more
> zones.  Ideally, these sections represent units that can be added or
> removed.  IIRC these sections only define a range of physical memory.
> To determine if it is possible to take a section offline, one needs to
> dig into the zone(s) that the section may be associated with.  We'll
> have to do things like:
> - Stop allocations of pages associated with the section.
> - Grab all 'free pages' associated with the section.
> - Try to reclaim/free all pages associated with the section.
>   - Work on this until all pages in the section are not in use (or free).
>   - OR give up if we know we will not succeed.
> 
> My claim of zones being easier to work with today is due to the
> fact that zones contain much of the data/infrastructure to make
> these types of operations easy.  For example, in IWAMOTO's code a
> node/zone can be taken offline when 'z->present_pages == z->free_pages'.

I really think the level of difference in difficulty here is trivial.
The hard bit is freeing the pages, not measuring them. I would suspect
altering the swap path to just not "free" the pages, but put them into
a pool for hotplug is fairly easy.

M.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-07 18:12 ` [patch 0/3] memory hotplug prototype Martin J. Bligh
  2004-04-07 18:59   ` [Lhms-devel] " Mike Kravetz
@ 2004-04-08  9:16   ` IWAMOTO Toshihiro
  2004-04-08 10:19     ` [Lhms-devel] " IWAMOTO Toshihiro
  2004-04-08 16:56     ` Martin J. Bligh
  1 sibling, 2 replies; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-08  9:16 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, lhms-devel

At Wed, 07 Apr 2004 11:12:55 -0700,
Martin J. Bligh <mbligh@aracnet.com> wrote:
> 
> > This is an updated version of memory hotplug prototype patch, which I
> > have posted here several times.
> 
> I really, really suggest you take a look at Dave McCracken's work, which
> he posted as "Basic nonlinear for x86" recently. It's going to be much
> much easier to use this abstraction than creating 1000s of zones ...

Well, I think his patch is orthogonal to mine.  My ultimate target
is IA64 and it will only support node-sized memory hotplugging.

If you need fine-grained memory resizing, that shouldn't be hard to
do.  As others have pointed out, per section hotremovable is not as
easy as per zone one, but we've done a similar thing for hugetlbfs
support.  Look for PG_again in Takahashi's patch.

--
IWAMOTO Toshihiro

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-08  9:16   ` IWAMOTO Toshihiro
@ 2004-04-08 10:19     ` IWAMOTO Toshihiro
  2004-04-08 12:10       ` Hirokazu Takahashi
  2004-04-08 16:56     ` Martin J. Bligh
  1 sibling, 1 reply; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-08 10:19 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, lhms-devel

At Thu, 08 Apr 2004 18:16:10 +0900,
IWAMOTO Toshihiro wrote:
> 
> At Wed, 07 Apr 2004 11:12:55 -0700,
> Martin J. Bligh <mbligh@aracnet.com> wrote:
> > 
> > > This is an updated version of memory hotplug prototype patch, which I
> > > have posted here several times.
> > 
> > I really, really suggest you take a look at Dave McCracken's work, which
> > he posted as "Basic nonlinear for x86" recently. It's going to be much
> > much easier to use this abstraction than creating 1000s of zones ...
> 
> Well, I think his patch is orthogonal to mine.  My ultimate target
> is IA64 and it will only support node-sized memory hotplugging.
> 
> If you need fine-grained memory resizing, that shouldn't be hard to
> do.  As others have pointed out, per section hotremovable is not as
> easy as per zone one, but we've done a similar thing for hugetlbfs
> support.  Look for PG_again in Takahashi's patch.

Err, s/PG_again/PG_booked/
Pages with the PG_booked bit set are skipped in alloc_pages.
Alternatively, when such pages are freed, they can be linked onto a
list other than free_list to avoid being reused, but the buddy
bitmap handling would be a bit tricky in this case.
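
As an illustration only (PageBooked() and the booked_pages list are
assumed helpers around Takahashi's PG_booked bit, not code from the
posted patches):

	static LIST_HEAD(booked_pages);

	/* allocation side: never hand out a booked page */
	static inline int page_is_allocatable(struct page *page)
	{
		return !PageBooked(page);
	}

	/* the free-side alternative: divert booked pages onto a separate
	 * list instead of the buddy free_list, so they stay out of
	 * circulation; the cost is that the buddy bitmap no longer
	 * matches the free lists */
	static inline int booked_page_diverted(struct page *page)
	{
		if (!PageBooked(page))
			return 0;
		list_add(&page->list, &booked_pages);
		return 1;
	}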

--
IWAMOTO Toshihiro


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-08 10:19     ` [Lhms-devel] " IWAMOTO Toshihiro
@ 2004-04-08 12:10       ` Hirokazu Takahashi
  0 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-08 12:10 UTC (permalink / raw)
  To: mbligh, linux-kernel, lhms-devel; +Cc: iwamoto

Hello,

> > > > This is an updated version of memory hotplug prototype patch, which I
> > > > have posted here several times.
> > > 
> > > I really, really suggest you take a look at Dave McCracken's work, which
> > > he posted as "Basic nonlinear for x86" recently. It's going to be much
> > > much easier to use this abstraction than creating 1000s of zones ...
> > 
> > Well, I think his patch is orthogonal to mine.  My ultimate target
> > is IA64 and it will only support node-sized memory hotplugging.
> > 
> > If you need fine-grained memory resizing, that shouldn't be hard to
> > do.  As others have pointed out, per section hotremovable is not as
> > easy as per zone one, but we've done a similar thing for hugetlbfs
> > support.  Look for PG_again in Takahashi's patch.
> 
> Err, s/PG_again/PG_booked/
> Pages with the PG_booked bit set are skipped in alloc_pages.
> Alternatively, when such pages are freed, they can be linked onto a
> list other than free_list to avoid being reused, but the buddy
> bitmap handling would be a bit tricky in this case.

It might be possible, but it's not easy to do.

If the page count equals 0, where do you think the page is?
It might be in the buddy system or in the per-cpu-pages pools, or it
might be part of a compound page, or it might be in the middle of being
allocated or freed.  If it is in the buddy system, which zone's
free_area is it linked to?

It's very hard to determine where the page actually is, which is why I
introduced the PG_booked flag to avoid reusing it.

Is there any better way?


Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lhms-devel] Re: [patch 0/3] memory hotplug prototype
  2004-04-07 22:33     ` Martin J. Bligh
@ 2004-04-08 12:41       ` Hirokazu Takahashi
  0 siblings, 0 replies; 27+ messages in thread
From: Hirokazu Takahashi @ 2004-04-08 12:41 UTC (permalink / raw)
  To: mbligh; +Cc: kravetz, iwamoto, linux-kernel, lhms-devel

Hello,

> >> > This is an updated version of memory hotplug prototype patch, which I
> >> > have posted here several times.
> >> 
> >> I really, really suggest you take a look at Dave McCracken's work, which
> >> he posted as "Basic nonlinear for x86" recently. It's going to be much
> >> much easier to use this abstraction than creating 1000s of zones ...
> >> 
> > 
> > I agree.  However, one could argue that taking a zone offline is 'easier'
> > than taking a 'section' offline: at least right now.  Note that I said
> > easier NOT better.  Currently a section represents a subset of one or more
> > zones.  Ideally, these sections represent units that can be added or
> > removed.  IIRC these sections only define a range of physical memory.
> > To determine if it is possible to take a section offline, one needs to
> > dig into the zone(s) that the section may be associated with.  We'll
> > have to do things like:
> > - Stop allocations of pages associated with the section.
> > - Grab all 'free pages' associated with the section.
> > - Try to reclaim/free all pages associated with the section.
> >   - Work on this until all pages in the section are not in use (or free).
> >   - OR give up if we know we will not succeed.
> > 
> > My claim of zones being easier to work with today is due to the
> > fact that zones contain much of the data/infrastructure to make
> > these types of operations easy.  For example, in IWAMOTO's code a
> > node/zone can be taken offline when 'z->present_pages == z->free_pages'.
> 
> I really think the level of difference in difficulty here is trivial.
> The hard bit is freeing the pages, not measuring them. I would suspect
> altering the swap path to just not "free" the pages, but put them into
> a pool for hotplug is fairly easy.

This is what IWAMOTO did.

The swap path won't work as well as you expect since it can't handle
busy pages.  It just skips them and only frees idle pages.

And page cache memory without backing store can't be handled that way
either.

Regards,
Hirokazu Takahashi


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-08  9:16   ` IWAMOTO Toshihiro
  2004-04-08 10:19     ` [Lhms-devel] " IWAMOTO Toshihiro
@ 2004-04-08 16:56     ` Martin J. Bligh
  2004-04-09  2:37       ` IWAMOTO Toshihiro
  1 sibling, 1 reply; 27+ messages in thread
From: Martin J. Bligh @ 2004-04-08 16:56 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: linux-kernel, lhms-devel

> At Wed, 07 Apr 2004 11:12:55 -0700,
> Martin J. Bligh <mbligh@aracnet.com> wrote:
>> 
>> > This is an updated version of memory hotplug prototype patch, which I
>> > have posted here several times.
>> 
>> I really, really suggest you take a look at Dave McCracken's work, which
>> he posted as "Basic nonlinear for x86" recently. It's going to be much
>> much easier to use this abstraction than creating 1000s of zones ...
> 
> Well, I think his patch is orthogonal to mine.  My ultimate target
> is IA64 and it will only support node-sized memory hotplugging.

Are you saying you're creating a patch for ia64 only, rather than
an arch-independent one?

M.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-08 16:56     ` Martin J. Bligh
@ 2004-04-09  2:37       ` IWAMOTO Toshihiro
  2004-04-09  5:18         ` Martin J. Bligh
  0 siblings, 1 reply; 27+ messages in thread
From: IWAMOTO Toshihiro @ 2004-04-09  2:37 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: IWAMOTO Toshihiro, linux-kernel, lhms-devel

At Thu, 08 Apr 2004 09:56:35 -0700,
Martin J. Bligh <mbligh@aracnet.com> wrote:
> 
> > At Wed, 07 Apr 2004 11:12:55 -0700,
> > Martin J. Bligh <mbligh@aracnet.com> wrote:
> >> 
> >> > This is an updated version of memory hotplug prototype patch, which I
> >> > have posted here several times.
> >> 
> >> I really, really suggest you take a look at Dave McCracken's work, which
> >> he posted as "Basic nonlinear for x86" recently. It's going to be much
> >> much easier to use this abstraction than creating 1000s of zones ...
> > 
> > Well, I think his patch is orthogonal to mine.  My ultimate target
> > is IA64 and it will only support node-sized memory hotplugging.
> 
> > Are you saying you're creating a patch for ia64 only, rather than
> > an arch-independent one?

I'm not trying to make my patches IA64 specific, and they are quite
machine independent.  However, some parts, such as placing mem_map at
boot time or handling hotplug events, are machine dependent by nature.

I'm afraid that the memsection thing is overkill for systems whose
hotpluggable memory unit is large.  I understand there's a need for
memsections, but I think they should be optional.

BTW, I think memory hotplugging on 32 bit archs is difficult because
of their narrow address space.  For example:
	1. A system boots with 1GB * 4 blocks of RAM.
	2. The second RAM block is removed.
	3. A 2GB block is inserted in the second slot.
Where do these RAM blocks appear in physical space, and where should
their mem_map be placed?

--
IWAMOTO Toshihiro



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/3] memory hotplug prototype
  2004-04-09  2:37       ` IWAMOTO Toshihiro
@ 2004-04-09  5:18         ` Martin J. Bligh
  0 siblings, 0 replies; 27+ messages in thread
From: Martin J. Bligh @ 2004-04-09  5:18 UTC (permalink / raw)
  To: IWAMOTO Toshihiro; +Cc: linux-kernel, lhms-devel

> I'm not trying to make my patches IA64 specific, and they are quite
> machine independent.  However, some parts, such as placing mem_map at
> boot time or handling hotplug events, are machine dependent by nature.

Sounds good - maybe I'd misread your intentions - sorry.
 
> I'm afraid that the memsection thing is overkill for systems whose
> hotpluggable memory unit is large.  I understand there's a need for
> memsections, but I think they should be optional.

If you're doing things that are logically nodes on the system anyway,
that makes sense. However, I really hate the idea of breaking up existing
logical groupings of memory just to do hotplug ... zones represent a logical
type of memory to the system, where the memory is somehow logically
"different" from that in other zones (eg it fulfils some constraint of
DMA or "permanently mapped kernel space", or has a different locality
to cpus (NUMA)). Yes, I know the current non-NUMA discontigmem violates
that premise (and thus must die ;-)).

Perhaps I'm confusing your patches with somebody else's - it gets hard
to keep track of all the groups involved, sorry ;-)

> BTW, I think memory hotplugging on 32 bit archs is difficult because
> of their narrow address space.  For example:
> 	1. A system boots with 1GB * 4 blocks of RAM.
> 	2. The second RAM block is removed.
> 	3. A 2GB block is inserted in the second slot.
> Where do these RAM blocks appear in physical space, 

I don't think we have control over that - it's a machine issue.

> and where should their mem_map be placed? 

ia32 has a really hard time doing that, as the mem_map has to live in
permanently mapped KVA space.  Moreover, you have alignment requirements
between sections.
Probably for the first cut, we'll have to just reserve enough mem_map 
for any physical address that might appear for that machine. If Dave's
nonlinear stuff can sort out the alignment requirements, we can probably 
do a "vmalloc-like" remapping out of physical ranges in the new mem
segment, much as I did in remap_numa_kva for NUMA support.

M.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2004-04-09  5:18 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-06 10:53 [patch 0/3] memory hotplug prototype IWAMOTO Toshihiro
2004-04-06 10:56 ` [patch 1/3] " IWAMOTO Toshihiro
2004-04-06 17:12   ` Dave Hansen
2004-04-07  6:10     ` IWAMOTO Toshihiro
2004-04-06 10:58 ` [patch 2/3] " IWAMOTO Toshihiro
2004-04-06 10:59 ` [patch 3/3] " IWAMOTO Toshihiro
2004-04-06 11:47 ` [patch 0/3] " IWAMOTO Toshihiro
2004-04-06 12:41 ` [patch 0/6] memory hotplug for hugetlbpages Hirokazu Takahashi
2004-04-06 12:44   ` [patch 1/6] " Hirokazu Takahashi
2004-04-06 12:45   ` [patch 2/6] " Hirokazu Takahashi
2004-04-06 12:45   ` [patch 3/6] " Hirokazu Takahashi
2004-04-06 12:48   ` [Lhms-devel] [patch 4/6] " Hirokazu Takahashi
2004-04-06 13:02     ` Russell King
2004-04-06 13:11       ` Hirokazu Takahashi
2004-04-06 12:49   ` [patch 5/6] " Hirokazu Takahashi
2004-04-06 12:50   ` [patch 6/6] " Hirokazu Takahashi
2004-04-07 18:12 ` [patch 0/3] memory hotplug prototype Martin J. Bligh
2004-04-07 18:59   ` [Lhms-devel] " Mike Kravetz
2004-04-07 19:20     ` Dave Hansen
2004-04-07 22:33     ` Martin J. Bligh
2004-04-08 12:41       ` Hirokazu Takahashi
2004-04-08  9:16   ` IWAMOTO Toshihiro
2004-04-08 10:19     ` [Lhms-devel] " IWAMOTO Toshihiro
2004-04-08 12:10       ` Hirokazu Takahashi
2004-04-08 16:56     ` Martin J. Bligh
2004-04-09  2:37       ` IWAMOTO Toshihiro
2004-04-09  5:18         ` Martin J. Bligh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).