* [RFC][PATCH 0/6] Critical Page Pool
@ 2005-12-14  7:50 Matthew Dobson
  2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

Here is the latest version of the Critical Page Pool patches.  Besides
bugfixes, I've removed all the slab cleanup work from the series.  Also,
since one of the main questions about the patch series seems to revolve
around how to appropriately size the pool, I've added some basic statistics
about the critical page pool, viewable by reading
/proc/sys/vm/critical_pages.  The code now exports how many pages were
requested, how many pages are currently in use, and the maximum number of
pages that were ever in use.
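
For illustration, writing a pool size and then reading the file back
might look like this (the format comes from the patch's sysctl handler;
the numbers are invented):

  Critical Page Pool Size: 1024
  Critical Pages In Use: 37
  Max Critical Pages In Use: 512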

The overall purpose of this patch series is to allow a system administrator
to reserve a number of pages in a 'critical pool' that is set aside for
situations when the system is 'in emergency'.  It is up to the individual
administrator to determine when his/her system is 'in emergency'.  This is
not meant to (necessarily) anticipate OOM situations, though that is
certainly one possible use.  The purpose this was originally designed for
is to allow the networking code to keep functioning despite the system
losing its (potentially networked) swap device, and thus temporarily
putting the system under extreme memory pressure.

Any comments about the code or the overall design are very welcome.
Patches against 2.6.15-rc5.

-Matt


* [RFC][PATCH 1/6] Create Critical Page Pool
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
@ 2005-12-14  7:52 ` Matthew Dobson
  2005-12-14 10:48   ` Andrea Arcangeli
  2005-12-14 13:30   ` Rik van Riel
  2005-12-14  7:54 ` [RFC][PATCH 2/6] in_emergency Trigger Matthew Dobson
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 238 bytes --]

Create the basic Critical Page Pool.  Any allocation specifying
__GFP_CRITICAL will, as a last resort before failing the allocation, try to
get a page from the critical pool.  For now, only singleton (order 0) pages
are supported.
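
As a minimal, hypothetical caller sketch (the function name is invented;
GFP_ATOMIC_CRIT is the helper this patch defines), the critical pool is
only touched once every normal allocation path has failed:

	/* Hypothetical: order-0 page for an emergency receive buffer. */
	static void *emergency_rx_buffer(void)
	{
		/* Dips into the critical pool only as a last resort. */
		struct page *page = alloc_page(GFP_ATOMIC_CRIT);

		return page ? page_address(page) : NULL;
	}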

-Matt

[-- Attachment #2: critical_pool.patch --]
[-- Type: text/x-patch, Size: 11007 bytes --]

Implement a Critical Page Pool:

* Write a number of pages into /proc/sys/vm/critical_pages
* These pages will be reserved for 'critical' allocations, signified
     by the __GFP_CRITICAL flag passed to an allocation request
* Reading /proc/sys/vm/critical_pages will give statistics about the
     critical page pool
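
For example, on an architecture with 4KB pages, writing 1024 into the
file asks the kernel to set aside 1024 * 4KB = 4MB for the pool.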

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/Documentation/sysctl/vm.txt	2005-12-13 15:56:55.819232488 -0800
+++ linux-2.6.15-rc5+critical_pool/Documentation/sysctl/vm.txt	2005-12-13 16:01:57.783326968 -0800
@@ -26,6 +26,7 @@ Currently, these files are in /proc/sys/
 - min_free_kbytes
 - laptop_mode
 - block_dump
+- critical_pages
 
 ==============================================================
 
@@ -102,3 +103,12 @@ This is used to force the Linux VM to ke
 of kilobytes free.  The VM uses this number to compute a pages_min
 value for each lowmem zone in the system.  Each lowmem zone gets 
 a number of reserved free pages based proportionally on its size.
+
+==============================================================
+
+critical_pages:
+
+This is used to force the Linux VM to reserve a pool of pages for
+emergency (__GFP_CRITICAL) allocations.  Allocations with this flag
+MUST succeed.
+The number written into this file is the number of pages to reserve.
Index: linux-2.6.15-rc5+critical_pool/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/gfp.h	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/gfp.h	2005-12-13 15:56:57.531972112 -0800
@@ -47,6 +47,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_CRITICAL	((__force gfp_t)0x40000u) /* Critical allocation. MUST succeed! */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -55,7 +56,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_CRITICAL)
 
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_NOIO	(__GFP_WAIT)
@@ -64,6 +65,8 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
 			 __GFP_HIGHMEM)
+#define GFP_ATOMIC_CRIT	(GFP_ATOMIC | __GFP_CRITICAL)
+#define GFP_KERNEL_CRIT	(GFP_KERNEL | __GFP_CRITICAL)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.15-rc5+critical_pool/include/linux/sysctl.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/sysctl.h	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/sysctl.h	2005-12-13 16:02:13.757898464 -0800
@@ -180,6 +180,7 @@ enum
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_CRITICAL_PAGES=29,	/* # of pages to reserve for __GFP_CRITICAL allocs */
 };
 
 
Index: linux-2.6.15-rc5+critical_pool/kernel/sysctl.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/kernel/sysctl.c	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/kernel/sysctl.c	2005-12-13 16:01:57.784326816 -0800
@@ -849,6 +849,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_CRITICAL_PAGES,
+		.procname	= "critical_pages",
+		.data		= &critical_pages,
+		.maxlen		= sizeof(critical_pages),
+		.mode		= 0644,
+		.proc_handler	= &critical_pages_sysctl_handler,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.15-rc5+critical_pool/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/page_alloc.c	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/page_alloc.c	2005-12-13 16:01:57.810322864 -0800
@@ -53,6 +53,68 @@ unsigned long totalram_pages __read_most
 unsigned long totalhigh_pages __read_mostly;
 long nr_swap_pages;
 
+/* The number of pages to maintain in the critical page pool */
+int critical_pages = 0;
+
+/* Critical Page Pool control structure */
+static struct critical_pool {
+	unsigned int free, min_free;
+	spinlock_t lock;
+	struct list_head pages;
+} critical_pool = {
+	.free		= 0,
+	.min_free	= 0,
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.pages		= LIST_HEAD_INIT(critical_pool.pages),
+};
+
+/* MCD - This needs to be arch specific */
+#define CRITICAL_POOL_GFPMASK	(GFP_KERNEL)
+
+static inline int is_critical_pool_full(void)
+{
+	return critical_pool.free >= critical_pages;
+}
+
+static inline struct page *get_critical_page(gfp_t gfpmask)
+{
+	struct page *page = NULL;
+
+	spin_lock(&critical_pool.lock);
+	if (!list_empty(&critical_pool.pages)) {
+		/* Grab the next free critical pool page */
+		page = list_entry(critical_pool.pages.next, struct page, lru);
+		list_del(&page->lru);
+		critical_pool.free--;
+		if (critical_pool.free < critical_pool.min_free)
+			critical_pool.min_free = critical_pool.free;
+	}
+	spin_unlock(&critical_pool.lock);
+
+	return page;
+}
+
+static inline int put_critical_page(struct page *page)
+{
+	int ret = 0;
+
+	spin_lock(&critical_pool.lock);
+	if (!is_critical_pool_full()) {
+		/*
+		 * We snatched this page away in the process of being freed, so
+		 * we must re-increment its count so we don't cause problems
+		 * when we hand it back out to a future __GFP_CRITICAL alloc.
+		 */
+		BUG_ON(!get_page_testone(page));
+		list_add(&page->lru, &critical_pool.pages);
+		critical_pool.free++;
+		ret = 1;
+	}
+	spin_unlock(&critical_pool.lock);
+
+	return ret;
+}
+
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
  *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
@@ -302,6 +364,10 @@ static inline void __free_pages_bulk (st
 
 	if (unlikely(order))
 		destroy_compound_page(page, order);
+	else if (!is_critical_pool_full())
+		/* If the critical pool isn't full, add this page to it */
+		if (put_critical_page(page))
+			return;
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
@@ -997,6 +1063,19 @@ rebalance:
 	}
 
 nopage:
+	/*
+	 * This is the LAST DITCH effort.
+	 * We maintain a pool of singleton (order 0) 'critical' pages,
+	 * specifically for allocation requests marked __GFP_CRITICAL.
+	 * Rather than fail one of these allocations, take a page,
+	 * if there are any, from the critical pool.
+	 */
+	if ((gfp_mask & __GFP_CRITICAL) && !order) {
+		page = get_critical_page(gfp_mask);
+		if (page)
+			goto got_pg;
+	}
+
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
 			" order:%d, mode:0x%x\n",
@@ -2539,6 +2618,90 @@ int lowmem_reserve_ratio_sysctl_handler(
 	return 0;
 }
 
+static inline void fill_critical_pool(int num)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		page = alloc_page(CRITICAL_POOL_GFPMASK);
+		if (!page)
+			return;
+		spin_lock(&critical_pool.lock);
+		list_add(&page->lru, &critical_pool.pages);
+		critical_pool.free++;
+		critical_pool.min_free++;
+		spin_unlock(&critical_pool.lock);
+	}
+}
+
+static inline void drain_critical_pool(int num)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		spin_lock(&critical_pool.lock);
+		BUG_ON(critical_pool.free < 0);
+		if (list_empty(&critical_pool.pages) || !critical_pool.free) {
+			spin_unlock(&critical_pool.lock);
+			break;
+		}
+			
+		page = list_entry(critical_pool.pages.next, struct page, lru);
+		list_del(&page->lru);
+		critical_pool.free--;
+		if (critical_pool.free < critical_pool.min_free)
+			critical_pool.min_free = critical_pool.free;
+		spin_unlock(&critical_pool.lock);
+			
+		__free_pages(page, 0);
+	}
+}
+
+/*
+ * critical_pages_sysctl_handler - handle r/w of /proc/sys/vm/critical_pages
+ *     On write, add/remove pages to/from the critical page pool.
+ *     On read, dump out some simple stats about the critical page pool.
+ */
+int critical_pages_sysctl_handler(ctl_table *table, int write,
+				   struct file *file, void __user *buffer,
+				   size_t *length, loff_t *ppos)
+{
+	char buf[100];
+	int num, len;
+
+	if (write) {
+		proc_dointvec(table, write, file, buffer, length, ppos);
+
+		num = critical_pages - critical_pool.free;
+		if (num > 0)
+			fill_critical_pool(num);
+		else if (num < 0)
+			drain_critical_pool(-num);
+
+		return 0;
+	}
+
+	if (!*length || *ppos) {
+		*length = 0;
+		return 0;
+	}
+
+	sprintf(buf, "Critical Page Pool Size: %d\n"
+		"Critical Pages In Use: %d\n"
+		"Max Critical Pages In Use: %d\n", critical_pages,
+		critical_pages - critical_pool.free,
+		critical_pages - critical_pool.min_free);
+	len = strlen(buf);
+	if (copy_to_user(buffer, buf, len))
+		return -EFAULT;
+
+	*length = len;
+	*ppos += len;
+	return 0;
+}
+
 __initdata int hashdist = HASHDIST_DEFAULT;
 
 #ifdef CONFIG_NUMA
Index: linux-2.6.15-rc5+critical_pool/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/mmzone.h	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/mmzone.h	2005-12-13 15:56:57.537971200 -0800
@@ -422,6 +422,8 @@ int min_free_kbytes_sysctl_handler(struc
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
+int critical_pages_sysctl_handler(struct ctl_table *, int, struct file *,
+				  void __user *, size_t *, loff_t *);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6.15-rc5+critical_pool/include/linux/mm.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/mm.h	2005-12-13 15:56:55.820232336 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/mm.h	2005-12-13 16:01:57.783326968 -0800
@@ -32,6 +32,8 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+extern int critical_pages;
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>


* [RFC][PATCH 2/6] in_emergency Trigger
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
  2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
@ 2005-12-14  7:54 ` Matthew Dobson
  2005-12-14  7:56 ` [RFC][PATCH 3/6] Slab Prep: get/return_object Matthew Dobson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 268 bytes --]

Create the 'in_emergency' trigger, to allow userspace to turn access to the
critical pool on and off.  The rationale behind this is to ensure that the
critical pool stays full for *actual* emergency situations, and isn't used
for transient, low-mem situations.
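
A hypothetical usage sequence (the trigger is a plain integer sysctl;
the scenario is invented for illustration):

  echo 1 > /proc/sys/vm/in_emergency   <- swap device lost; open the pool
  echo 0 > /proc/sys/vm/in_emergency   <- swap restored; close it again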

-Matt

[-- Attachment #2: emergency_trigger.patch --]
[-- Type: text/x-patch, Size: 4928 bytes --]

Create a userspace trigger: /proc/sys/vm/in_emergency that notifies the kernel
that the system is in an emergency state, and allows the kernel to delve into
the 'critical pool' to satisfy __GFP_CRITICAL allocations.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/Documentation/sysctl/vm.txt	2005-12-13 16:01:57.783326968 -0800
+++ linux-2.6.15-rc5+critical_pool/Documentation/sysctl/vm.txt	2005-12-13 16:02:40.935766800 -0800
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/
 - laptop_mode
 - block_dump
 - critical_pages
+- in_emergency
 
 ==============================================================
 
@@ -112,3 +113,12 @@ This is used to force the Linux VM to re
 emergency (__GFP_CRITICAL) allocations.  Allocations with this flag
 MUST succeed.
 The number written into this file is the number of pages to reserve.
+
+==============================================================
+
+in_emergency:
+
+This is used to let the Linux VM know that userspace thinks that the system is
+in an emergency situation.
+Writing a non-zero value into this file tells the VM we *are* in an emergency
+situation, and writing zero tells the VM we *are not* in an emergency situation.
Index: linux-2.6.15-rc5+critical_pool/include/linux/sysctl.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/sysctl.h	2005-12-13 16:02:13.757898464 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/sysctl.h	2005-12-13 16:02:40.937766496 -0800
@@ -181,6 +181,7 @@ enum
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
 	VM_CRITICAL_PAGES=29,	/* # of pages to reserve for __GFP_CRITICAL allocs */
+	VM_IN_EMERGENCY=30,	/* tell the VM if we are/aren't in an emergency */
 };
 
 
Index: linux-2.6.15-rc5+critical_pool/kernel/sysctl.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/kernel/sysctl.c	2005-12-13 16:01:57.784326816 -0800
+++ linux-2.6.15-rc5+critical_pool/kernel/sysctl.c	2005-12-13 16:02:40.942765736 -0800
@@ -859,6 +859,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 	},
+	{
+		.ctl_name	= VM_IN_EMERGENCY,
+		.procname	= "in_emergency",
+		.data		= &system_in_emergency,
+		.maxlen		= sizeof(system_in_emergency),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.15-rc5+critical_pool/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/page_alloc.c	2005-12-13 16:01:57.810322864 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/page_alloc.c	2005-12-13 16:02:40.946765128 -0800
@@ -53,6 +53,10 @@ unsigned long totalram_pages __read_most
 unsigned long totalhigh_pages __read_mostly;
 long nr_swap_pages;
 
+/* Is the system in an emergency situation? */
+int system_in_emergency = 0;
+EXPORT_SYMBOL(system_in_emergency);
+
 /* The number of pages to maintain in the critical page pool */
 int critical_pages = 0;
 
@@ -927,7 +931,7 @@ struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct zone **z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
@@ -936,6 +940,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	int alloc_flags;
 	int did_some_progress;
 
+	if (is_emergency_alloc(gfp_mask)) {
+		/*
+		 * If the system is 'in emergency' and this is a critical
+		 * allocation, then make sure we don't sleep
+		 */
+		gfp_mask &= ~__GFP_WAIT;
+		gfp_mask |= __GFP_HIGH;
+		wait = 0;
+	}
+
 	might_sleep_if(wait);
 
 restart:
@@ -1070,7 +1084,7 @@ nopage:
 	 * Rather than fail one of these allocations, take a page,
 	 * if there are any, from the critical pool.
 	 */
-	if ((gfp_mask & __GFP_CRITICAL) && !order) {
+	if (is_emergency_alloc(gfp_mask) && !order) {
 		page = get_critical_page(gfp_mask);
 		if (page)
 			goto got_pg;
Index: linux-2.6.15-rc5+critical_pool/include/linux/mm.h
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/include/linux/mm.h	2005-12-13 16:01:57.783326968 -0800
+++ linux-2.6.15-rc5+critical_pool/include/linux/mm.h	2005-12-13 16:02:40.950764520 -0800
@@ -33,6 +33,12 @@ extern int sysctl_legacy_va_layout;
 #endif
 
 extern int critical_pages;
+extern int system_in_emergency;
+
+static inline int is_emergency_alloc(gfp_t gfpmask)
+{
+	return system_in_emergency && (gfpmask & __GFP_CRITICAL);
+}
 
 #include <asm/page.h>
 #include <asm/pgtable.h>


* [RFC][PATCH 3/6] Slab Prep: get/return_object
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
  2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
  2005-12-14  7:54 ` [RFC][PATCH 2/6] in_emergency Trigger Matthew Dobson
@ 2005-12-14  7:56 ` Matthew Dobson
  2005-12-14  8:19   ` Pekka Enberg
  2005-12-14  7:58 ` [RFC][PATCH 4/6] Slab Prep: slab_destruct() Matthew Dobson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 235 bytes --]

Create 2 helper functions in mm/slab.c: get_object() and return_object().
These functions reduce some existing duplicated code in the slab allocator
and will be used when adding Critical Page Pool support to the slab allocator.
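
After this patch, the paired usage looks like the following (simplified
from the diff below):

	/* allocate one object from slabp... */
	objp = get_object(cachep, slabp, numa_node_id());
	/* ...and later give it back */
	return_object(cachep, slabp, objp, numa_node_id());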

-Matt

[-- Attachment #2: slab_prep-get_return_object.patch --]
[-- Type: text/x-patch, Size: 3912 bytes --]

Create two helper functions: get_object_from_slab() & return_object_to_slab().
Use these two helper function to replace duplicated code in mm/slab.c

These functions will also be reused by a later patch in this series.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/slab.c	2005-12-13 15:56:55.459287208 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/slab.c	2005-12-13 16:05:21.308386456 -0800
@@ -2140,6 +2140,42 @@ static void kmem_flagcheck(kmem_cache_t 
 	}
 }
 
+static void *get_object(kmem_cache_t *cachep, struct slab *slabp, int nodeid)
+{
+	void *objp = slabp->s_mem + (slabp->free * cachep->objsize);
+	kmem_bufctl_t next;
+
+	slabp->inuse++;
+	next = slab_bufctl(slabp)[slabp->free];
+#if DEBUG
+	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
+	WARN_ON(slabp->nodeid != nodeid);
+#endif
+	slabp->free = next;
+
+	return objp;
+}
+
+static void return_object(kmem_cache_t *cachep, struct slab *slabp, void *objp,
+			  int nodeid)
+{
+	unsigned int objnr = (objp - slabp->s_mem) / cachep->objsize;
+
+#if DEBUG
+	/* Verify that the slab belongs to the intended node */
+	WARN_ON(slabp->nodeid != nodeid);
+
+	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
+		printk(KERN_ERR "slab: double free detected in cache "
+		       "'%s', objp %p\n", cachep->name, objp);
+		BUG();
+	}
+#endif
+	slab_bufctl(slabp)[objnr] = slabp->free;
+	slabp->free = objnr;
+	slabp->inuse--;
+}
+
 static void set_slab_attr(kmem_cache_t *cachep, struct slab *slabp, void *objp)
 {
 	int i;
@@ -2418,22 +2454,12 @@ retry:
 		check_slabp(cachep, slabp);
 		check_spinlock_acquired(cachep);
 		while (slabp->inuse < cachep->num && batchcount--) {
-			kmem_bufctl_t next;
 			STATS_INC_ALLOCED(cachep);
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			/* get obj pointer */
-			ac->entry[ac->avail++] = slabp->s_mem +
-				slabp->free*cachep->objsize;
-
-			slabp->inuse++;
-			next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
-			slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-			WARN_ON(numa_node_id() != slabp->nodeid);
-#endif
-		       	slabp->free = next;
+			ac->entry[ac->avail++] = get_object(cachep, slabp,
+							    numa_node_id());
 		}
 		check_slabp(cachep, slabp);
 
@@ -2565,7 +2591,6 @@ static void *__cache_alloc_node(kmem_cac
  	struct slab *slabp;
  	struct kmem_list3 *l3;
  	void *obj;
- 	kmem_bufctl_t next;
  	int x;
 
  	l3 = cachep->nodelists[nodeid];
@@ -2591,14 +2616,7 @@ retry:
 
  	BUG_ON(slabp->inuse == cachep->num);
 
- 	/* get obj pointer */
- 	obj =  slabp->s_mem + slabp->free*cachep->objsize;
- 	slabp->inuse++;
- 	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
- 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-#endif
- 	slabp->free = next;
+	obj = get_object(cachep, slabp, nodeid);
  	check_slabp(cachep, slabp);
  	l3->free_objects--;
  	/* move slabp to correct slabp list: */
@@ -2637,29 +2655,14 @@ static void free_block(kmem_cache_t *cac
 	for (i = 0; i < nr_objects; i++) {
 		void *objp = objpp[i];
 		struct slab *slabp;
-		unsigned int objnr;
 
 		slabp = page_get_slab(virt_to_page(objp));
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
-		objnr = (objp - slabp->s_mem) / cachep->objsize;
 		check_spinlock_acquired_node(cachep, node);
 		check_slabp(cachep, slabp);
-
-#if DEBUG
-		/* Verify that the slab belongs to the intended node */
-		WARN_ON(slabp->nodeid != node);
-
-		if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
-			printk(KERN_ERR "slab: double free detected in cache "
-					"'%s', objp %p\n", cachep->name, objp);
-			BUG();
-		}
-#endif
-		slab_bufctl(slabp)[objnr] = slabp->free;
-		slabp->free = objnr;
+		return_object(cachep, slabp, objp, node);
 		STATS_DEC_ACTIVE(cachep);
-		slabp->inuse--;
 		l3->free_objects++;
 		check_slabp(cachep, slabp);
 


* [RFC][PATCH 4/6] Slab Prep: slab_destruct()
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
                   ` (2 preceding siblings ...)
  2005-12-14  7:56 ` [RFC][PATCH 3/6] Slab Prep: get/return_object Matthew Dobson
@ 2005-12-14  7:58 ` Matthew Dobson
  2005-12-14  8:37   ` Pekka Enberg
  2005-12-14  7:59 ` [RFC][PATCH 5/6] Slab Prep: Move cache_grow() Matthew Dobson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 220 bytes --]

Create a helper function for slab_destroy() called slab_destruct().  Remove
some ifdefs inside functions and generally make the slab destroying code
more readable prior to slab support for the Critical Page Pool.

-Matt

[-- Attachment #2: slab_prep-slab_destruct.patch --]
[-- Type: text/x-patch, Size: 2089 bytes --]

Create a helper function, slab_destruct(), called from slab_destroy().  This
makes slab_destroy() smaller and more readable, and moves ifdefs outside the
function body.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/slab.c	2005-12-05 10:20:43.886907432 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/slab.c	2005-12-05 10:20:45.289694176 -0800
@@ -1401,15 +1401,13 @@ static void check_poison_obj(kmem_cache_
 }
 #endif
 
-/* Destroy all the objs in a slab, and release the mem back to the system.
- * Before calling the slab must have been unlinked from the cache.
- * The cache-lock is not held/needed.
+#if DEBUG
+/**
+ * slab_destruct - call the registered destructor for each object in
+ *      a slab that is to be destroyed.
  */
-static void slab_destroy (kmem_cache_t *cachep, struct slab *slabp)
+static void slab_destruct(kmem_cache_t *cachep, struct slab *slabp)
 {
-	void *addr = slabp->s_mem - slabp->colouroff;
-
-#if DEBUG
 	int i;
 	for (i = 0; i < cachep->num; i++) {
 		void *objp = slabp->s_mem + cachep->objsize * i;
@@ -1435,7 +1433,10 @@ static void slab_destroy (kmem_cache_t *
 		if (cachep->dtor && !(cachep->flags & SLAB_POISON))
 			(cachep->dtor)(objp+obj_dbghead(cachep), cachep, 0);
 	}
+}
 #else
+static void slab_destruct(kmem_cache_t *cachep, struct slab *slabp)
+{
 	if (cachep->dtor) {
 		int i;
 		for (i = 0; i < cachep->num; i++) {
@@ -1443,8 +1444,19 @@ static void slab_destroy (kmem_cache_t *
 			(cachep->dtor)(objp, cachep, 0);
 		}
 	}
+}
 #endif
 
+/**
+ * Destroy all the objs in a slab, and release the mem back to the system.
+ * Before calling the slab must have been unlinked from the cache.
+ * The cache-lock is not held/needed.
+ */
+static void slab_destroy(kmem_cache_t *cachep, struct slab *slabp)
+{
+	void *addr = slabp->s_mem - slabp->colouroff;
+
+	slab_destruct(cachep, slabp);
 	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
 		struct slab_rcu *slab_rcu;
 


* [RFC][PATCH 5/6] Slab Prep: Move cache_grow()
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
                   ` (3 preceding siblings ...)
  2005-12-14  7:58 ` [RFC][PATCH 4/6] Slab Prep: slab_destruct() Matthew Dobson
@ 2005-12-14  7:59 ` Matthew Dobson
  2005-12-14  8:02 ` [RFC][PATCH 6/6] Critical Page Pool: Slab Support Matthew Dobson
  2005-12-14 10:08 ` [RFC][PATCH 0/6] Critical Page Pool Pavel Machek
  6 siblings, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  7:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 203 bytes --]

Move cache_grow() a few lines further down in mm/slab.c to gain access to a
couple debugging functions that will be used by the next patch.  Also,
rename a goto label and fixup a couple comments.

-Matt

[-- Attachment #2: slab_prep-cache_grow.patch --]
[-- Type: text/x-patch, Size: 5684 bytes --]

Move cache_grow() below some debugging function definitions, so those debugging
functions can be inserted into cache_grow() by the next patch without needing
forward declarations.

Also, do a few small cleanups:
	Tidy up a few comments
	Rename a label to something readable

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/slab.c	2005-12-13 16:08:04.123634776 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/slab.c	2005-12-13 16:14:25.757617592 -0800
@@ -2203,96 +2203,6 @@ static void set_slab_attr(kmem_cache_t *
 	} while (--i);
 }
 
-/*
- * Grow (by 1) the number of slabs within a cache.  This is called by
- * kmem_cache_alloc() when there are no active objs left in a cache.
- */
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
-{
-	struct slab	*slabp;
-	void		*objp;
-	size_t		 offset;
-	gfp_t	 	 local_flags;
-	unsigned long	 ctor_flags;
-	struct kmem_list3 *l3;
-
-	/* Be lazy and only check for valid flags here,
- 	 * keeping it out of the critical path in kmem_cache_alloc().
-	 */
-	if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
-		BUG();
-	if (flags & SLAB_NO_GROW)
-		return 0;
-
-	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
-	local_flags = (flags & SLAB_LEVEL_MASK);
-	if (!(local_flags & __GFP_WAIT))
-		/*
-		 * Not allowed to sleep.  Need to tell a constructor about
-		 * this - it might need to know...
-		 */
-		ctor_flags |= SLAB_CTOR_ATOMIC;
-
-	/* About to mess with non-constant members - lock. */
-	check_irq_off();
-	spin_lock(&cachep->spinlock);
-
-	/* Get colour for the slab, and cal the next value. */
-	offset = cachep->colour_next;
-	cachep->colour_next++;
-	if (cachep->colour_next >= cachep->colour)
-		cachep->colour_next = 0;
-	offset *= cachep->colour_off;
-
-	spin_unlock(&cachep->spinlock);
-
-	check_irq_off();
-	if (local_flags & __GFP_WAIT)
-		local_irq_enable();
-
-	/*
-	 * The test for missing atomic flag is performed here, rather than
-	 * the more obvious place, simply to reduce the critical path length
-	 * in kmem_cache_alloc(). If a caller is seriously mis-behaving they
-	 * will eventually be caught here (where it matters).
-	 */
-	kmem_flagcheck(cachep, flags);
-
-	/* Get mem for the objs.
-	 * Attempt to allocate a physical page from 'nodeid',
-	 */
-	if (!(objp = kmem_getpages(cachep, flags, nodeid)))
-		goto failed;
-
-	/* Get slab management. */
-	if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
-		goto opps1;
-
-	slabp->nodeid = nodeid;
-	set_slab_attr(cachep, slabp, objp);
-
-	cache_init_objs(cachep, slabp, ctor_flags);
-
-	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
-	check_irq_off();
-	l3 = cachep->nodelists[nodeid];
-	spin_lock(&l3->list_lock);
-
-	/* Make slab active. */
-	list_add_tail(&slabp->list, &(l3->slabs_free));
-	STATS_INC_GROWN(cachep);
-	l3->free_objects += cachep->num;
-	spin_unlock(&l3->list_lock);
-	return 1;
-opps1:
-	kmem_freepages(cachep, objp);
-failed:
-	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
-	return 0;
-}
-
 #if DEBUG
 
 /*
@@ -2414,6 +2324,90 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
+/**
+ * Grow (by 1) the number of slabs within a cache.  This is called by
+ * kmem_cache_alloc() when there are no active objs left in a cache.
+ */
+static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+{
+	struct slab *slabp;
+	void *objp;
+	size_t offset;
+	gfp_t local_flags;
+	unsigned long ctor_flags;
+	struct kmem_list3 *l3;
+
+	/*
+	 * Be lazy and only check for valid flags here,
+	 * keeping it out of the critical path in kmem_cache_alloc().
+	 */
+	if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
+		BUG();
+	if (flags & SLAB_NO_GROW)
+		return 0;
+
+	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+	local_flags = (flags & SLAB_LEVEL_MASK);
+	if (!(local_flags & __GFP_WAIT))
+		/* The constructor might need to know it can't sleep */
+		ctor_flags |= SLAB_CTOR_ATOMIC;
+
+	/* About to mess with non-constant members - lock. */
+	check_irq_off();
+	spin_lock(&cachep->spinlock);
+	/* Get colour for the slab, and calculate the next value. */
+	offset = cachep->colour_next;
+	cachep->colour_next++;
+	if (cachep->colour_next >= cachep->colour)
+		cachep->colour_next = 0;
+	offset *= cachep->colour_off;
+	/* done...  Unlock. */
+	spin_unlock(&cachep->spinlock);
+
+	check_irq_off();
+	if (local_flags & __GFP_WAIT)
+		local_irq_enable();
+
+	/*
+	 * Ensure caller isn't asking for DMA memory if the slab wasn't created
+	 * with the SLAB_DMA flag.
+	 * Also ensure the caller *is* asking for DMA memory if the slab was
+	 * created with the SLAB_DMA flag.
+	 */
+	kmem_flagcheck(cachep, flags);
+
+	/* Get memory for the objects by allocating a page from 'nodeid'. */
+	if (!(objp = kmem_getpages(cachep, flags, nodeid)))
+		goto failed;
+
+	/* Get slab management. */
+	if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
+		goto failed_freepages;
+
+	slabp->nodeid = nodeid;
+	set_slab_attr(cachep, slabp, objp);
+	cache_init_objs(cachep, slabp, ctor_flags);
+
+	if (local_flags & __GFP_WAIT)
+		local_irq_disable();
+	check_irq_off();
+	l3 = cachep->nodelists[nodeid];
+	spin_lock(&l3->list_lock);
+
+	/* Make slab active. */
+	list_add_tail(&slabp->list, &(l3->slabs_free));
+	STATS_INC_GROWN(cachep);
+	l3->free_objects += cachep->num;
+	spin_unlock(&l3->list_lock);
+	return 1;
+failed_freepages:
+	kmem_freepages(cachep, objp);
+failed:
+	if (local_flags & __GFP_WAIT)
+		local_irq_disable();
+	return 0;
+}
+
 static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
 {
 	int batchcount;


* [RFC][PATCH 6/6] Critical Page Pool: Slab Support
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
                   ` (4 preceding siblings ...)
  2005-12-14  7:59 ` [RFC][PATCH 5/6] Slab Prep: Move cache_grow() Matthew Dobson
@ 2005-12-14  8:02 ` Matthew Dobson
  2005-12-14 10:08 ` [RFC][PATCH 0/6] Critical Page Pool Pavel Machek
  6 siblings, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14  8:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: andrea, Sridhar Samudrala, pavel, Andrew Morton, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 577 bytes --]

Finally, add support for the Critical Page Pool to the Slab Allocator.  We
need the slab allocator to be at least marginally aware of the existence of
critical pages, or else we leave open the possibility of non-critical slab
allocations stealing objects from 'critical' slabs.  We add a separate,
node-unspecific list to kmem_cache_t called slabs_crit.  We keep all
partial and full critical slabs on this list.  We don't keep empty critical
slabs around, in the interest of giving this memory back to the VM ASAP in
what is typically a high memory pressure situation.

-Matt

[-- Attachment #2: slab_support.patch --]
[-- Type: text/x-patch, Size: 7230 bytes --]

Modify the Slab Allocator to support the addition of a Critical Pool to the VM.
We want to ensure that if a cache is allocated a new slab page from the
Critical Pool during an emergency situation, only other __GFP_CRITICAL
allocations are satisfied from that slab.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>

Index: linux-2.6.15-rc5+critical_pool/mm/slab.c
===================================================================
--- linux-2.6.15-rc5+critical_pool.orig/mm/slab.c	2005-12-13 16:14:25.757617592 -0800
+++ linux-2.6.15-rc5+critical_pool/mm/slab.c	2005-12-13 16:32:08.300086584 -0800
@@ -221,8 +221,9 @@ struct slab {
 	unsigned long		colouroff;
 	void			*s_mem;		/* including colour offset */
 	unsigned int		inuse;		/* num of objs active in slab */
-	kmem_bufctl_t		free;
+	unsigned short		critical;	/* is this a critical slab? */
 	unsigned short          nodeid;
+	kmem_bufctl_t		free;
 };
 
 /*
@@ -395,6 +396,9 @@ struct kmem_cache {
 	unsigned int		slab_size;
 	unsigned int		dflags;		/* dynamic flags */
 
+	/* list of critical slabs for this cache */
+	struct list_head	slabs_crit;
+
 	/* constructor func */
 	void (*ctor)(void *, kmem_cache_t *, unsigned long);
 
@@ -1770,6 +1774,7 @@ next:
 		cachep->gfpflags |= GFP_DMA;
 	spin_lock_init(&cachep->spinlock);
 	cachep->objsize = size;
+	INIT_LIST_HEAD(&cachep->slabs_crit);
 
 	if (flags & CFLGS_OFF_SLAB)
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
@@ -2086,6 +2091,7 @@ static struct slab* alloc_slabmgmt(kmem_
 	slabp->inuse = 0;
 	slabp->colouroff = colour_off;
 	slabp->s_mem = objp+colour_off;
+	slabp->critical = 0;
 
 	return slabp;
 }
@@ -2161,7 +2167,8 @@ static void *get_object(kmem_cache_t *ca
 	next = slab_bufctl(slabp)[slabp->free];
 #if DEBUG
 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-	WARN_ON(slabp->nodeid != nodeid);
+	if (nodeid >= 0)
+		WARN_ON(slabp->nodeid != nodeid);
 #endif
 	slabp->free = next;
 
@@ -2175,7 +2182,8 @@ static void return_object(kmem_cache_t *
 
 #if DEBUG
 	/* Verify that the slab belongs to the intended node */
-	WARN_ON(slabp->nodeid != nodeid);
+	if (nodeid >= 0)
+		WARN_ON(slabp->nodeid != nodeid);
 
 	if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
 		printk(KERN_ERR "slab: double free detected in cache "
@@ -2324,18 +2332,64 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
+static inline int is_critical_object(void *obj)
+{
+	struct slab *slabp;
+
+	if (!obj)
+		return 0;
+
+	slabp = page_get_slab(virt_to_page(obj));
+	return slabp->critical;
+}
+
+static inline void *get_critical_object(kmem_cache_t *cachep, gfp_t flags)
+{
+	struct slab *slabp;
+	void *objp = NULL;
+
+	spin_lock(&cachep->spinlock);
+	/* search for any partially free critical slabs */
+	if (!list_empty(&cachep->slabs_crit))
+		list_for_each_entry(slabp, &cachep->slabs_crit, list)
+			if (slabp->free != BUFCTL_END) {
+				objp = get_object(cachep, slabp, -1);
+				check_slabp(cachep, slabp);
+				break;
+			}
+	spin_unlock(&cachep->spinlock);
+
+	return objp;
+}
+
+static inline void free_critical_object(kmem_cache_t *cachep, void *objp)
+{
+	struct slab *slabp = page_get_slab(virt_to_page(objp));
+
+	check_slabp(cachep, slabp);
+	return_object(cachep, slabp, objp, -1);
+	check_slabp(cachep, slabp);
+
+	if (!slabp->inuse) {
+		BUG_ON(cachep->flags & SLAB_DESTROY_BY_RCU);
+
+		list_del(&slabp->list);
+		slab_destroy(cachep, slabp);
+	}
+}
+
 /**
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+static void *cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
 {
 	struct slab *slabp;
 	void *objp;
 	size_t offset;
 	gfp_t local_flags;
 	unsigned long ctor_flags;
-	struct kmem_list3 *l3;
+	int critical = is_emergency_alloc(flags) && !cachep->gfporder;
 
 	/*
 	 * Be lazy and only check for valid flags here,
@@ -2344,7 +2398,14 @@ static int cache_grow(kmem_cache_t *cach
 	if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
 		BUG();
 	if (flags & SLAB_NO_GROW)
-		return 0;
+		return NULL;
+
+	/*
+	 * If we are in an emergency situation and this is a 'critical' alloc,
+	 * see if we can get an object from an existing critical slab first.
+	 */
+	if (critical && (objp = get_critical_object(cachep, flags)))
+		return objp;
 
 	ctor_flags = SLAB_CTOR_CONSTRUCTOR;
 	local_flags = (flags & SLAB_LEVEL_MASK);
@@ -2391,21 +2452,30 @@ static int cache_grow(kmem_cache_t *cach
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
-	l3 = cachep->nodelists[nodeid];
-	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */
-	list_add_tail(&slabp->list, &(l3->slabs_free));
 	STATS_INC_GROWN(cachep);
-	l3->free_objects += cachep->num;
-	spin_unlock(&l3->list_lock);
-	return 1;
+	if (!critical) {
+		struct kmem_list3 *l3 = cachep->nodelists[nodeid];
+		spin_lock(&l3->list_lock);
+		list_add_tail(&slabp->list, &l3->slabs_free);
+		l3->free_objects += cachep->num;
+		spin_unlock(&l3->list_lock);
+	} else {
+		slabp->critical = 1;
+		spin_lock(&cachep->spinlock);
+		list_add_tail(&slabp->list, &cachep->slabs_crit);
+		spin_unlock(&cachep->spinlock);
+		objp = get_object(cachep, slabp, -1);
+		check_slabp(cachep, slabp);
+	}
+	return objp;
 failed_freepages:
 	kmem_freepages(cachep, objp);
 failed:
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
-	return 0;
+	return NULL;
 }
 
 static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
@@ -2483,15 +2553,18 @@ alloc_done:
 	spin_unlock(&l3->list_lock);
 
 	if (unlikely(!ac->avail)) {
-		int x;
-		x = cache_grow(cachep, flags, numa_node_id());
+		void *obj = cache_grow(cachep, flags, numa_node_id());
+
+		/* critical objects don't "grow" the slab, just return 'obj' */
+		if (is_critical_object(obj))
+			return obj;
 
-		// cache_grow can reenable interrupts, then ac could change.
+		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = ac_data(cachep);
-		if (!x && ac->avail == 0)	// no objects in sight? abort
+		if (!obj && ac->avail == 0) /* No objects in sight?  Abort.  */
 			return NULL;
 
-		if (!ac->avail)		// objects refilled by interrupt?
+		if (!ac->avail)		/* objects refilled by interrupt?    */
 			goto retry;
 	}
 	ac->touched = 1;
@@ -2597,7 +2670,6 @@ static void *__cache_alloc_node(kmem_cac
  	struct slab *slabp;
  	struct kmem_list3 *l3;
  	void *obj;
- 	int x;
 
  	l3 = cachep->nodelists[nodeid];
  	BUG_ON(!l3);
@@ -2639,11 +2711,15 @@ retry:
 
 must_grow:
  	spin_unlock(&l3->list_lock);
- 	x = cache_grow(cachep, flags, nodeid);
+	obj = cache_grow(cachep, flags, nodeid);
 
- 	if (!x)
+	if (!obj)
  		return NULL;
 
+	/* critical objects don't "grow" the slab, just return 'obj' */
+	if (is_critical_object(obj))
+		goto done;
+
  	goto retry;
 done:
  	return obj;
@@ -2758,6 +2834,11 @@ static inline void __cache_free(kmem_cac
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
+	if (is_critical_object(objp)) {
+		free_critical_object(cachep, objp);
+		return;
+	}
+
 	/* Make sure we are not freeing a object from another
 	 * node to the array cache on this cpu.
 	 */


* Re: [RFC][PATCH 3/6] Slab Prep: get/return_object
  2005-12-14  7:56 ` [RFC][PATCH 3/6] Slab Prep: get/return_object Matthew Dobson
@ 2005-12-14  8:19   ` Pekka Enberg
  2005-12-14 16:26     ` Matthew Dobson
  0 siblings, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2005-12-14  8:19 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

Hi Matt,

On 12/14/05, Matthew Dobson <colpatch@us.ibm.com> wrote:
> Create 2 helper functions in mm/slab.c: get_object() and return_object().
> These functions reduce some existing duplicated code in the slab allocator
> and will be used when adding Critical Page Pool support to the slab allocator.

May I suggest different naming, slab_get_obj and slab_put_obj ?

                                            Pekka


* Re: [RFC][PATCH 4/6] Slab Prep: slab_destruct()
  2005-12-14  7:58 ` [RFC][PATCH 4/6] Slab Prep: slab_destruct() Matthew Dobson
@ 2005-12-14  8:37   ` Pekka Enberg
  2005-12-14 16:30     ` Matthew Dobson
  0 siblings, 1 reply; 26+ messages in thread
From: Pekka Enberg @ 2005-12-14  8:37 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

On 12/14/05, Matthew Dobson <colpatch@us.ibm.com> wrote:
> Create a helper function for slab_destroy() called slab_destruct().  Remove
> some ifdefs inside functions and generally make the slab destroying code
> more readable prior to slab support for the Critical Page Pool.

Looks good. How about calling it slab_destroy_objs instead?

                          Pekka


* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
                   ` (5 preceding siblings ...)
  2005-12-14  8:02 ` [RFC][PATCH 6/6] Critical Page Pool: Slab Support Matthew Dobson
@ 2005-12-14 10:08 ` Pavel Machek
  2005-12-14 12:01   ` Andrea Arcangeli
  2005-12-14 15:55   ` Matthew Dobson
  6 siblings, 2 replies; 26+ messages in thread
From: Pavel Machek @ 2005-12-14 10:08 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, andrea, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

Hi!

> The overall purpose of this patch series is to allow a system administrator
> to reserve a number of pages in a 'critical pool' that is set aside for
> situations when the system is 'in emergency'.  It is up to the individual
> administrator to determine when his/her system is 'in emergency'.  This is
> not meant to (necessarily) anticipate OOM situations, though that is
> certainly one possible use.  The purpose this was originally designed for
> is to allow the networking code to keep functioning despite the system
> losing its (potentially networked) swap device, and thus temporarily
> putting the system under extreme memory pressure.

I don't see how this can ever work.

How can _userspace_ know about what allocations are critical to the
kernel?!

And as you noticed, it does not work for your original usage case,
because the reserved memory pool would have to be "sum of all network
interface bandwidths * amount of time expected to survive without
network" which is way too much.

If you want a few emergency pages for some strange hack you are doing
(swapping over network?), just put swap into a ramdisk and swapon() it
when you are in an emergency, or use memory hotplug and plug a few more
gigabytes into your machine. But don't go introducing infrastructure
that _can't_ be used right.
								Pavel
-- 
Thanks, Sharp!


* Re: [RFC][PATCH 1/6] Create Critical Page Pool
  2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
@ 2005-12-14 10:48   ` Andrea Arcangeli
  2005-12-14 13:30   ` Rik van Riel
  1 sibling, 0 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2005-12-14 10:48 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

Hi Matthew,

On Tue, Dec 13, 2005 at 11:52:46PM -0800, Matthew Dobson wrote:
> Create the basic Critical Page Pool.  Any allocation specifying
> __GFP_CRITICAL will, as a last resort before failing the allocation, try to
> get a page from the critical pool.  For now, only singleton (order 0) pages
> are supported.

Hmm sorry, but this design looks wrong to me. Since the caller has to
use __GFP_CRITICAL anyway, why don't you build this critical pool
_outside_ the page allocator exactly like the mempool does?

Then you will also get a huge advantage, namely being able to create
more than one critical pool without having to add a __GFP_CRITICAL2 next
month.

So IMHO, if anything, you should create something like a mempool (if the
mempool isn't good enough already for your usage), so more subsystems
can register their critical pools. Call it criticalpool.c or similar but
I wouldn't mess with __GFP_* and page_alloc.c, and the sysctl should be
in the user subsystem, not global.

Or perhaps you can share the mempool code and extend the mempool API to
refill itself internally automatically as soon as pages are being
released.

You may still need a single hook in the __free_pages path, to refill
pools transparently from any freeing (not only the freeing of your
subsystem) but such a hook is acceptable. You may need to set
priorities in the criticalpool.c API as well to choose which pool to
refill first, or whether to refill them in round robin when they have
the same priority.

I would touch page_alloc.c only with regard to the prioritized pool
refilling with a registration hook, and I would definitely not use a
global pool, and I wouldn't use a __GFP_ bitflag for it.

Each slab will then be allowed to have its own critical pool too, not
a global one. A global one driven by the __GFP_CRITICAL flag will
quickly become useless as soon as you have more than one subsystem using
it, plus it unnecessarily messes with the page_alloc.c APIs, where the
only thing you care about is catching the freeing operation with a hook.
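
(For reference, a minimal sketch of the mempool API in question, with
2.6-era signatures; the cache and pool names are invented:)

	#include <linux/mempool.h>
	#include <linux/slab.h>

	static kmem_cache_t *crit_cachep;	/* hypothetical object cache */
	static mempool_t *crit_pool;

	static int __init crit_pool_init(void)
	{
		crit_cachep = kmem_cache_create("crit_objs", 256, 0, 0,
						NULL, NULL);
		if (!crit_cachep)
			return -ENOMEM;
		/* Pre-allocate 64 objects; mempool_alloc() falls back on
		 * them when the page allocator cannot satisfy a request. */
		crit_pool = mempool_create(64, mempool_alloc_slab,
					   mempool_free_slab, crit_cachep);
		return crit_pool ? 0 : -ENOMEM;
	}

A subsystem would then call mempool_alloc(crit_pool, GFP_ATOMIC) and
mempool_free(obj, crit_pool) rather than bare page allocator calls.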


* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 10:08 ` [RFC][PATCH 0/6] Critical Page Pool Pavel Machek
@ 2005-12-14 12:01   ` Andrea Arcangeli
  2005-12-14 13:03     ` Alan Cox
  2005-12-14 16:03     ` Matthew Dobson
  2005-12-14 15:55   ` Matthew Dobson
  1 sibling, 2 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2005-12-14 12:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Matthew Dobson, linux-kernel, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

On Wed, Dec 14, 2005 at 11:08:41AM +0100, Pavel Machek wrote:
> because reserved memory pool would have to be "sum of all network
> interface bandwidths * ammount of time expected to survive without
> network" which is way too much.

Yes, a global pool isn't really useful. A per-subsystem pool would be
more reasonable...

> gigabytes into your machine. But don't go introducing infrastructure
> that _can't_ be used right.

Agreed, the current design of the patch can't be used right.


* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 12:01   ` Andrea Arcangeli
@ 2005-12-14 13:03     ` Alan Cox
  2005-12-14 16:37       ` Matthew Dobson
  2005-12-14 16:03     ` Matthew Dobson
  1 sibling, 1 reply; 26+ messages in thread
From: Alan Cox @ 2005-12-14 13:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pavel Machek, Matthew Dobson, linux-kernel, Sridhar Samudrala,
	Andrew Morton, Linux Memory Management

On Wed, 2005-12-14 at 13:01 +0100, Andrea Arcangeli wrote:
> On Wed, Dec 14, 2005 at 11:08:41AM +0100, Pavel Machek wrote:
> > because the reserved memory pool would have to be "sum of all network
> > interface bandwidths * amount of time expected to survive without
> > network" which is way too much.
> 
> Yes, a global pool isn't really useful. A per-subsystem pool would be
> more reasonable...


The whole extra critical level seems dubious in itself. In 2.0/2.2 days
there were a set of patches that just dropped incoming memory on sockets
when the memory was tight unless they were marked as critical (ie NFS
swap). It worked rather well. The rest of the changes beyond that seem
excessive.



* Re: [RFC][PATCH 1/6] Create Critical Page Pool
  2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
  2005-12-14 10:48   ` Andrea Arcangeli
@ 2005-12-14 13:30   ` Rik van Riel
  2005-12-14 16:26     ` Matthew Dobson
  1 sibling, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2005-12-14 13:30 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

On Tue, 13 Dec 2005, Matthew Dobson wrote:

> Create the basic Critical Page Pool.  Any allocation specifying 
> __GFP_CRITICAL will, as a last resort before failing the allocation, try 
> to get a page from the critical pool.  For now, only singleton (order 0) 
> pages are supported.

How are you going to limit the number of GFP_CRITICAL
allocations to something smaller than the number of
pages in the pool ?

Unless you can do that, all guarantees are off...

-- 
All Rights Reversed


* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 10:08 ` [RFC][PATCH 0/6] Critical Page Pool Pavel Machek
  2005-12-14 12:01   ` Andrea Arcangeli
@ 2005-12-14 15:55   ` Matthew Dobson
  2005-12-15 16:26     ` Pavel Machek
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 15:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, andrea, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

Pavel Machek wrote:
> Hi!
> 
> 
>>The overall purpose of this patch series is to allow a system administrator
>>to reserve a number of pages in a 'critical pool' that is set aside for
>>situations when the system is 'in emergency'.  It is up to the individual
>>administrator to determine when his/her system is 'in emergency'.  This is
>>not meant to (necessarily) anticipate OOM situations, though that is
>>certainly one possible use.  The purpose this was originally designed for
>>is to allow the networking code to keep functioning despite the system
>>losing its (potentially networked) swap device, and thus temporarily
>>putting the system under extreme memory pressure.
> 
> 
> I don't see how this can ever work.
> 
> How can _userspace_ know about what allocations are critical to the
> kernel?!

Well, it isn't userspace that is determining *which* allocations are
critical to the kernel.  That is statically determined at compile time by
using the flag __GFP_CRITICAL on specific *kernel* allocations.  Sridhar,
cc'd on this mail, has a set of patches that sprinkle the __GFP_CRITICAL
flag throughout the networking code to take advantage of this pool.
Userspace is in charge of determining *when* we're in an emergency
situation, and should thus use the critical pool, but not *which*
allocations are critical to surviving this emergency situation.


> And as you noticed, it does not work for your original usage case,
> because reserved memory pool would have to be "sum of all network
> interface bandwidths * ammount of time expected to survive without
> network" which is way too much.

Well, I never suggested it didn't work for my original usage case.  The
discussion we had is that it would be incredibly difficult to give a 100%
iron-clad guarantee that the pool would NEVER run out of pages.  But we can
size the pool, especially given a decent workload approximation, so as to
make failure far less likely.


> If you want few emergency pages for some strange hack you are doing
> (swapping over network?), just put swap into ramdisk and swapon() it
> when you are in emergency, or use memory hotplug and plug few more
> gigabytes into your machine. But don't go introducing infrastructure
> that _can't_ be used right.

Well, that's basically the point of posting these patches as an RFC.  I'm
not quite so delusional as to think they're going to get picked up right
now.  I was, however, hoping for feedback to figure out how to design
infrastructure that *can* be used right, as well as trying to find other
potential users of such a feature.

Thanks!

-Matt


* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 12:01   ` Andrea Arcangeli
  2005-12-14 13:03     ` Alan Cox
@ 2005-12-14 16:03     ` Matthew Dobson
  1 sibling, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 16:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pavel Machek, linux-kernel, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

Andrea Arcangeli wrote:
> On Wed, Dec 14, 2005 at 11:08:41AM +0100, Pavel Machek wrote:
> 
>>because reserved memory pool would have to be "sum of all network
>>interface bandwidths * ammount of time expected to survive without
>>network" which is way too much.
> 
> 
> Yes, a global pool isn't really useful. A per-subsystem pool would be
> more reasonable...

Which is an idea that I toyed with, as well.  The problem that I ran into
is how to tag an allocation as belonging to a specific subsystem.  For
example, in our code we need networking to use the critical pool.  How do
we let __alloc_pages() know which allocations belong to networking?
Networking needs named slab allocations, kmalloc allocations, and whole
page allocations to function.  Should each subsystem get its own GFP flag
(GFP_NETWORKING, GFP_SCSI, GFP_SOUND, GFP_TERMINAL, ad nauseam)?  Should we
create these pools dynamically and pass a reference to which pool each
specific allocation uses (thus adding a parameter to all memory allocation
functions in the kernel)?  I realize that per-subsystem pools would be
better, but I thought about this for a while and couldn't come up with a
reasonable way to do it.
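
Just to make the second option concrete, here is a rough sketch -- every
name in it is hypothetical, and nothing like it exists in the posted
patches:

    #include <linux/gfp.h>

    /* Hypothetical per-subsystem pool handle.  The pain point is the
     * extra argument: every allocation interface in the kernel (and
     * every caller on the critical path) would have to grow it. */
    struct crit_pool {
            unsigned long nr_pages;     /* pages reserved for this pool */
            unsigned long nr_used;      /* pages currently handed out */
    };

    struct page *alloc_pages_pool(gfp_t gfp_mask, unsigned int order,
                                  struct crit_pool *pool);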


>>gigabytes into your machine. But don't go introducing infrastructure
>>that _can't_ be used right.
> 
> 
> Agreed, the current design of the patch can't be used right.

Well, it can for our use, but I recognize that isn't going to be a huge
selling point! :)  As I mentioned in my reply to Pavel, I'd really like to
find a way to design something that WOULD be generally useful.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/6] Create Critical Page Pool
  2005-12-14 13:30   ` Rik van Riel
@ 2005-12-14 16:26     ` Matthew Dobson
  2005-12-15  3:29       ` Matt Mackall
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 16:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

Rik van Riel wrote:
> On Tue, 13 Dec 2005, Matthew Dobson wrote:
> 
> 
>>Create the basic Critical Page Pool.  Any allocation specifying 
>>__GFP_CRITICAL will, as a last resort before failing the allocation, try 
>>to get a page from the critical pool.  For now, only singleton (order 0) 
>>pages are supported.
> 
> 
> How are you going to limit the number of GFP_CRITICAL
> allocations to something smaller than the number of
> pages in the pool?

We can't.


> Unless you can do that, all guarantees are off...

Well, I was careful not to use the word guarantee in my post. ;)  The idea
is not to offer a 100% guarantee that the pool will never be exhausted.
The idea is to offer a pool that, sized appropriately, offers a very good
chance of surviving your emergency situation.  The definitions of what
counts as a critical allocation and what constitutes the emergency
situation are left intentionally somewhat vague, so as to offer more
flexibility.  For our use, certain networking allocations are critical and
our emergency situation is a 2 minute window of potential extreme memory
pressure.  For others it could be something completely different, but the
expectation is that the emergency situation would be of finite duration,
since the pool is a fixed size.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 3/6] Slab Prep: get/return_object
  2005-12-14  8:19   ` Pekka Enberg
@ 2005-12-14 16:26     ` Matthew Dobson
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 16:26 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

Pekka Enberg wrote:
> Hi Matt,
> 
> On 12/14/05, Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
>>Create 2 helper functions in mm/slab.c: get_object() and return_object().
>>These functions reduce some existing duplicated code in the slab allocator
>>and will be used when adding Critical Page Pool support to the slab allocator.
> 
> 
> May I suggest different naming, slab_get_obj and slab_put_obj ?
> 
>                                             Pekka

Sure.  Those sound much better than mine. :)

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/6] Slab Prep: slab_destruct()
  2005-12-14  8:37   ` Pekka Enberg
@ 2005-12-14 16:30     ` Matthew Dobson
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 16:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, andrea, Sridhar Samudrala, pavel, Andrew Morton,
	Linux Memory Management

Pekka Enberg wrote:
> On 12/14/05, Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
>>Create a helper function for slab_destroy() called slab_destruct().  Remove
>>some ifdefs inside functions and generally make the slab destroying code
>>more readable prior to slab support for the Critical Page Pool.
> 
> 
> Looks good. How about calling it slab_destroy_objs instead?
> 
>                           Pekka

I called it slab_destruct() because it's the part of the old slab_destroy()
that called the slab destructor to destroy the slab's objects.
slab_destroy_objs() is reasonable as well, though, and I can live with that.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 13:03     ` Alan Cox
@ 2005-12-14 16:37       ` Matthew Dobson
  2005-12-14 19:17         ` Alan Cox
  2005-12-15 16:27         ` Pavel Machek
  0 siblings, 2 replies; 26+ messages in thread
From: Matthew Dobson @ 2005-12-14 16:37 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrea Arcangeli, Pavel Machek, linux-kernel, Sridhar Samudrala,
	Andrew Morton, Linux Memory Management

Alan Cox wrote:
> On Mer, 2005-12-14 at 13:01 +0100, Andrea Arcangeli wrote:
> 
>>On Wed, Dec 14, 2005 at 11:08:41AM +0100, Pavel Machek wrote:
>>
>>>because reserved memory pool would have to be "sum of all network
>>>interface bandwidths * ammount of time expected to survive without
>>>network" which is way too much.
>>
>>Yes, a global pool isn't really useful. A per-subsystem pool would be
>>more reasonable...
> 
> 
> 
> The whole extra critical level seems dubious in itself. In 2.0/2.2 days
> there were a set of patches that just dropped incoming memory on sockets
> when the memory was tight unless they were marked as critical (ie NFS
> swap). It worked rather well. The rest of the changes beyond that seem
> excessive.

Actually, Sridhar's code (mentioned earlier in this thread) *does* drop
incoming packets that are not 'critical', but unfortunately you need to
completely copy the packet into kernel memory before you can do any
processing on it to determine whether or not it's 'critical', and thus
accept or reject it.  If network traffic is coming in at a good clip and
the system is already under memory pressure, it's going to be difficult to
receive all these packets, which was the inspiration for this patchset.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 16:37       ` Matthew Dobson
@ 2005-12-14 19:17         ` Alan Cox
  2005-12-15 16:27         ` Pavel Machek
  1 sibling, 0 replies; 26+ messages in thread
From: Alan Cox @ 2005-12-14 19:17 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Andrea Arcangeli, Pavel Machek, linux-kernel, Sridhar Samudrala,
	Andrew Morton, Linux Memory Management

On Mer, 2005-12-14 at 08:37 -0800, Matthew Dobson wrote:
> Actually, Sridhar's code (mentioned earlier in this thread) *does* drop
> incoming packets that are not 'critical', but unfortunately you need to

I realise that, but if you look at the previous history in 2.0 and 2.2
this was all that was ever needed.  It thus raises the question: why all
the extra support and logic this time around?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/6] Create Critical Page Pool
  2005-12-14 16:26     ` Matthew Dobson
@ 2005-12-15  3:29       ` Matt Mackall
  0 siblings, 0 replies; 26+ messages in thread
From: Matt Mackall @ 2005-12-15  3:29 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Rik van Riel, linux-kernel, andrea, Sridhar Samudrala, pavel,
	Andrew Morton, Linux Memory Management

On Wed, Dec 14, 2005 at 08:26:09AM -0800, Matthew Dobson wrote:
> Rik van Riel wrote:
> > On Tue, 13 Dec 2005, Matthew Dobson wrote:
> > 
> > 
> >>Create the basic Critical Page Pool.  Any allocation specifying 
> >>__GFP_CRITICAL will, as a last resort before failing the allocation, try 
> >>to get a page from the critical pool.  For now, only singleton (order 0) 
> >>pages are supported.
> > 
> > 
> > How are you going to limit the number of GFP_CRITICAL
> > allocations to something smaller than the number of
> > pages in the pool?
> 
> We can't.
> 
> 
> > Unless you can do that, all guarantees are off...
> 
> Well, I was careful not to use the word guarantee in my post. ;)  The idea
> is not to offer a 100% guarantee that the pool will never be exhausted.
> The idea is to offer a pool that, sized appropriately, offers a very good
> chance of surviving your emergency situation.  The definitions of what
> counts as a critical allocation and what constitutes the emergency
> situation are left intentionally somewhat vague, so as to offer more
> flexibility.  For our use, certain networking allocations are critical and
> our emergency situation is a 2 minute window of potential extreme memory
> pressure.  For others it could be something completely different, but the
> expectation is that the emergency situation would be of finite duration,
> since the pool is a fixed size.

What's your plan for handling the no-room-to-receive-ACKs problem? 

Without addressing this, this is a non-starter for most of the network
OOM problems I care about.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 15:55   ` Matthew Dobson
@ 2005-12-15 16:26     ` Pavel Machek
  2005-12-15 21:51       ` Matthew Dobson
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Machek @ 2005-12-15 16:26 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: linux-kernel, andrea, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

Hi!

> > I don't see how this can ever work.
> > 
> > How can _userspace_ know about what allocations are critical to the
> > kernel?!
> 
> Well, it isn't userspace that is determining *which* allocations are
> critical to the kernel.  That is statically determined at compile time by
> using the flag __GFP_CRITICAL on specific *kernel* allocations.  Sridhar,
> cc'd on this mail, has a set of patches that sprinkle the __GFP_CRITICAL
> flag throughout the networking code to take advantage of this pool.
> Userspace is in charge of determining *when* we're in an emergency
> situation, and should thus use the critical pool, but not *which*

It still is not too reliable.  If your userspace tool is swapped out
(etc.), it may not get a chance to wake up.
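
A partial mitigation -- only a sketch, and it does not help if the kernel
side stalls -- would be for the trigger tool to lock itself in memory:

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            /* Pin all current and future pages so the trigger tool
             * itself cannot be swapped out. */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                    perror("mlockall");
                    return 1;
            }
            /* ... monitor the system and set the in_emergency trigger ... */
            return 0;
    }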

> > And as you noticed, it does not work for your original usage case,
> > because reserved memory pool would have to be "sum of all network
> > interface bandwidths * amount of time expected to survive without
> > network" which is way too much.
> 
> Well, I never suggested it didn't work for my original usage case.  The
> discussion we had is that it would be incredibly difficult to give a 100%
> iron-clad guarantee that the pool would NEVER run out of pages.  But we can
> size the pool, especially given a decent workload approximation, so as to
> make failure far less likely.

Perhaps you should add a file in Documentation/ explaining that it is not
reliable?

> > If you want few emergency pages for some strange hack you are doing
> > (swapping over network?), just put swap into ramdisk and swapon() it
> > when you are in emergency, or use memory hotplug and plug few more
> > gigabytes into your machine. But don't go introducing infrastructure
> > that _can't_ be used right.
> 
> Well, that's basically the point of posting these patches as an RFC.  I'm
> not quite so delusional as to think they're going to get picked up right
> now.  I was, however, hoping for feedback to figure out how to design
> infrastructure that *can* be used right, as well as trying to find other
> potential users of such a feature.

Well, we don't usually take infrastructure that has no in-kernel
users, and an example user would indeed be nice.
							Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-14 16:37       ` Matthew Dobson
  2005-12-14 19:17         ` Alan Cox
@ 2005-12-15 16:27         ` Pavel Machek
  1 sibling, 0 replies; 26+ messages in thread
From: Pavel Machek @ 2005-12-15 16:27 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Alan Cox, Andrea Arcangeli, linux-kernel, Sridhar Samudrala,
	Andrew Morton, Linux Memory Management

Hi!

> > The whole extra critical level seems dubious in itself. In 2.0/2.2 days
> > there were a set of patches that just dropped incoming memory on sockets
> > when the memory was tight unless they were marked as critical (ie NFS
> > swap). It worked rather well. The rest of the changes beyond that seem
> > excessive.
> 
> Actually, Sridhar's code (mentioned earlier in this thread) *does* drop
> incoming packets that are not 'critical', but unfortunately you need to
> completely copy the packet into kernel memory before you can do any
> processing on it to determine whether or not it's 'critical', and thus
> accept or reject it.  If network traffic is coming in at a good clip and
> the system is already under memory pressure, it's going to be difficult to
> receive all these packets, which was the inspiration for this patchset.

You should be able to do all this with a single, MTU-sized buffer.

Receive the packet into the buffer.  If it is critical, pass it up;
otherwise drop it.  Yes, it may drop some "important" packets, but that's
okay; packet loss is expected on networks.
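
A sketch of that scheme -- hw_copy_rx_frame() and packet_is_critical()
are hypothetical stand-ins for the driver- and subsystem-specific parts:

    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical hooks: */
    void hw_copy_rx_frame(struct net_device *dev, u8 *buf, unsigned int len);
    int packet_is_critical(const u8 *buf, unsigned int len);

    /* One preallocated, MTU-sized landing buffer. */
    static u8 emergency_buf[ETH_FRAME_LEN];

    static int emergency_rx(struct net_device *dev, unsigned int len)
    {
            struct sk_buff *skb;

            /* Pull the frame off the hardware into the fixed buffer. */
            hw_copy_rx_frame(dev, emergency_buf, len);

            if (!packet_is_critical(emergency_buf, len))
                    return 0;       /* drop it; loss is expected */

            skb = dev_alloc_skb(len);
            if (!skb)
                    return 0;       /* can still fail under pressure */

            memcpy(skb_put(skb, len), emergency_buf, len);
            skb->protocol = eth_type_trans(skb, dev);
            return netif_rx(skb);
    }
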
								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-15 16:26     ` Pavel Machek
@ 2005-12-15 21:51       ` Matthew Dobson
  2005-12-16  5:02         ` Sridhar Samudrala
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Dobson @ 2005-12-15 21:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, andrea, Sridhar Samudrala, Andrew Morton,
	Linux Memory Management

Pavel Machek wrote:
>>>And as you noticed, it does not work for your original usage case,
>>>because reserved memory pool would have to be "sum of all network
>>>interface bandwidths * amount of time expected to survive without
>>>network" which is way too much.
>>
>>Well, I never suggested it didn't work for my original usage case.  The
>>discussion we had is that it would be incredibly difficult to give a 100%
>>iron-clad guarantee that the pool would NEVER run out of pages.  But we can
>>size the pool, especially given a decent workload approximation, so as to
>>make failure far less likely.
> 
> 
> Perhaps you should add file in Documentation/ explaining it is not
> reliable?

That's a good suggestion.  I will rework the patch's additions to
Documentation/sysctl/vm.txt to be more clear about exactly what we're
providing.


>>>If you want few emergency pages for some strange hack you are doing
>>>(swapping over network?), just put swap into ramdisk and swapon() it
>>>when you are in emergency, or use memory hotplug and plug few more
>>>gigabytes into your machine. But don't go introducing infrastructure
>>>that _can't_ be used right.
>>
>>Well, that's basically the point of posting these patches as an RFC.  I'm
>>not quite so delusional as to think they're going to get picked up right
>>now.  I was, however, hoping for feedback to figure out how to design
>>infrastructure that *can* be used right, as well as trying to find other
>>potential users of such a feature.
> 
> 
> Well, we don't usually take infrastructure that has no in-kernel
> users, and example user would indeed be nice.
> 							Pavel

Understood.  I certainly wouldn't expect otherwise.  I'll see if I can get
Sridhar to post his networking changes that take advantage of this.

Thanks!

-Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 0/6] Critical Page Pool
  2005-12-15 21:51       ` Matthew Dobson
@ 2005-12-16  5:02         ` Sridhar Samudrala
  0 siblings, 0 replies; 26+ messages in thread
From: Sridhar Samudrala @ 2005-12-16  5:02 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Pavel Machek, linux-kernel, andrea, Andrew Morton,
	Linux Memory Management

Matthew Dobson wrote:

>Pavel Machek wrote:
>
>>>>And as you noticed, it does not work for your original usage case,
>>>>because reserved memory pool would have to be "sum of all network
>>>>interface bandwidths * amount of time expected to survive without
>>>>network" which is way too much.
>>>>
>>>Well, I never suggested it didn't work for my original usage case.  The
>>>discussion we had is that it would be incredibly difficult to give a 100%
>>>iron-clad guarantee that the pool would NEVER run out of pages.  But we can
>>>size the pool, especially given a decent workload approximation, so as to
>>>make failure far less likely.
>>>
>>Perhaps you should add a file in Documentation/ explaining that it is not
>>reliable?
>>
>
>That's a good suggestion.  I will rework the patch's additions to
>Documentation/sysctl/vm.txt to be more clear about exactly what we're
>providing.
>
>
>>>>If you want few emergency pages for some strange hack you are doing
>>>>(swapping over network?), just put swap into ramdisk and swapon() it
>>>>when you are in emergency, or use memory hotplug and plug few more
>>>>gigabytes into your machine. But don't go introducing infrastructure
>>>>that _can't_ be used right.
>>>>
>>>Well, that's basically the point of posting these patches as an RFC.  I'm
>>>not quite so delusional as to think they're going to get picked up right
>>>now.  I was, however, hoping for feedback to figure out how to design
>>>infrastructure that *can* be used right, as well as trying to find other
>>>potential users of such a feature.
>>>
>>Well, we don't usually take infrastructure that has no in-kernel
>>users, and an example user would indeed be nice.
>>							Pavel
>>
>
>Understood.  I certainly wouldn't expect otherwise.  I'll see if I can get
>Sridhar to post his networking changes that take advantage of this.
>
I posted these patches yesterday on lkml and netdev; here is a link to
the thread:
    http://thread.gmane.org/gmane.linux.kernel/357835

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2005-12-16  5:03 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-14  7:50 [RFC][PATCH 0/6] Critical Page Pool Matthew Dobson
2005-12-14  7:52 ` [RFC][PATCH 1/6] Create " Matthew Dobson
2005-12-14 10:48   ` Andrea Arcangeli
2005-12-14 13:30   ` Rik van Riel
2005-12-14 16:26     ` Matthew Dobson
2005-12-15  3:29       ` Matt Mackall
2005-12-14  7:54 ` [RFC][PATCH 2/6] in_emergency Trigger Matthew Dobson
2005-12-14  7:56 ` [RFC][PATCH 3/6] Slab Prep: get/return_object Matthew Dobson
2005-12-14  8:19   ` Pekka Enberg
2005-12-14 16:26     ` Matthew Dobson
2005-12-14  7:58 ` [RFC][PATCH 4/6] Slab Prep: slab_destruct() Matthew Dobson
2005-12-14  8:37   ` Pekka Enberg
2005-12-14 16:30     ` Matthew Dobson
2005-12-14  7:59 ` [RFC][PATCH 5/6] Slab Prep: Move cache_grow() Matthew Dobson
2005-12-14  8:02 ` [RFC][PATCH 6/6] Critical Page Pool: Slab Support Matthew Dobson
2005-12-14 10:08 ` [RFC][PATCH 0/6] Critical Page Pool Pavel Machek
2005-12-14 12:01   ` Andrea Arcangeli
2005-12-14 13:03     ` Alan Cox
2005-12-14 16:37       ` Matthew Dobson
2005-12-14 19:17         ` Alan Cox
2005-12-15 16:27         ` Pavel Machek
2005-12-14 16:03     ` Matthew Dobson
2005-12-14 15:55   ` Matthew Dobson
2005-12-15 16:26     ` Pavel Machek
2005-12-15 21:51       ` Matthew Dobson
2005-12-16  5:02         ` Sridhar Samudrala

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).