linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] foundations for reserve-based allocation
@ 2007-08-06 10:29 Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 01/10] mm: gfp_to_alloc_flags() Peter Zijlstra
                   ` (12 more replies)
  0 siblings, 13 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson


In the interest of getting swap over network working and of posting in smaller
series, here is the first series.

This series lays the foundations needed to do reserve based allocation.
Traditionally we have used mempools (and others like radix_tree_preload) to
handle the problem.

However, this does not fit the network stack, which is built around
variable-sized allocations using kmalloc().

This calls for a different approach.

We want a guarantee of N bytes from kmalloc(); this translates into a demand
on the slab allocator for 2*N+m (due to the power-of-two nature of the kmalloc
slabs), where m is the meta-data needed by the allocator itself.

The slab allocator then puts a demand of P pages on the page allocator.

So we need functions translating our demanded kmalloc space into a page
reserve limit, and then need to provide a reserve of pages.

And we need to ensure that, once we hit the reserve, the slab allocator honours
the reserve's access rules. That is, a regular allocation may not get objects
from a slab that was allocated from the reserves.

There is already a page reserve, but it does not fully comply with our needs.
For example, it does not guarantee a strict level (due to the relative nature
of ALLOC_HIGH and ALLOC_HARDER). Hence we augment this reserve with a strict
limit.

Furthermore, a new __GFP flag is added to allow easy access to the reserves
alongside the existing PF_MEMALLOC.

Users of this infrastructure will need to do the necessary bean counting to
ensure they stay within the requested limits.
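
As a rough sketch of the intended usage (illustrative only: the helpers come
from later patches in this series, and the RX_RESERVE_SKBS / rx_* names are
made up for the example):

	static int rx_reserve_pages;

	/*
	 * Size the reserve: enough pages to back RX_RESERVE_SKBS kmalloc()
	 * allocations of up to 'size' bytes each (kestimate_single(), patch 6),
	 * then grow the emergency pool accordingly (adjust_memalloc_reserve(),
	 * patch 9).
	 */
	static int rx_setup_reserve(size_t size)
	{
		rx_reserve_pages = kestimate_single(size, GFP_ATOMIC, RX_RESERVE_SKBS);
		return adjust_memalloc_reserve(rx_reserve_pages);
	}

	/*
	 * Allocation site entitled to the reserves (patch 10); the caller is
	 * responsible for the bean counting against rx_reserve_pages.
	 */
	static void *rx_alloc(size_t size)
	{
		return kmalloc(size, GFP_ATOMIC | __GFP_MEMALLOC);
	}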

NOTE: I have currently only implemented SLUB support; I'm still hoping
at least one of the other slab allocators will die soon and save me the trouble.

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 01/10] mm: gfp_to_alloc_flags()
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-gfp-to-alloc_flags.patch --]
[-- Type: text/plain, Size: 5546 bytes --]

abstract out the gfp to alloc_flags mapping so it can be used in other places.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   10 +++++
 mm/page_alloc.c |   98 ++++++++++++++++++++++++++++++++------------------------
 2 files changed, 66 insertions(+), 42 deletions(-)

Index: linux-2.6-2/mm/internal.h
===================================================================
--- linux-2.6-2.orig/mm/internal.h
+++ linux-2.6-2/mm/internal.h
@@ -37,4 +37,14 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 #endif
Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -887,14 +887,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1217,6 +1209,44 @@ try_next_zone:
 }
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
@@ -1268,48 +1298,28 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
-				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order, zonelist,
+				ALLOC_NO_WATERMARKS);
+		if (page)
+			goto got_pg;
+
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1318,6 +1328,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 01/10] mm: gfp_to_alloc_flags() Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 10:29 ` [PATCH 03/10] mm: tag reserve pages Peter Zijlstra
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: global-ALLOC_NO_WATERMARKS.patch --]
[-- Type: text/plain, Size: 1296 bytes --]

Change ALLOC_NO_WATERMARKS page allocation such that dipping into the reserves
becomes a system wide event.

This has the advantage that logic dealing with reserve pages need not be node
aware (when we're this low on memory, speed is usually not an issue).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
---
 mm/page_alloc.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -1311,6 +1311,21 @@ restart:
 rebalance:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+			gfp_zone(gfp_mask);
+
+		/*
+		 * Before going bare metal, try to get a page above the
+		 * critical threshold - ignoring CPU sets.
+		 */
+		page = get_page_from_freelist(gfp_mask, order, zonelist,
+				ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER);
+		if (page)
+			goto got_pg;
+
 		/* go through the zonelist yet again, ignoring mins */
 		page = get_page_from_freelist(gfp_mask, order, zonelist,
 				ALLOC_NO_WATERMARKS);

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 03/10] mm: tag reserve pages
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 01/10] mm: gfp_to_alloc_flags() Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 10:29 ` [PATCH 04/10] mm: slub: add knowledge of reserve pages Peter Zijlstra
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: page_alloc-reserve.patch --]
[-- Type: text/plain, Size: 1218 bytes --]

Tag pages allocated from the reserves with a non-zero page->reserve.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6-2/include/linux/mm_types.h
===================================================================
--- linux-2.6-2.orig/include/linux/mm_types.h
+++ linux-2.6-2/include/linux/mm_types.h
@@ -60,6 +60,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -1186,8 +1186,10 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-		if (page)
+		if (page) {
+			page->reserve = (alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 03/10] mm: tag reserve pages Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-08  0:13   ` Christoph Lameter
  2007-08-20  7:38   ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 05/10] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: reserve-slab.patch --]
[-- Type: text/plain, Size: 4829 bytes --]

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it.

Care is taken to only touch the SLUB slow path.

Because the reserve threshold is system wide (by virtue of the previous patches),
we can make do with a single kmem_cache-wide state.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    2 +
 mm/slub.c                |   75 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 70 insertions(+), 7 deletions(-)

Index: linux-2.6-2/include/linux/slub_def.h
===================================================================
--- linux-2.6-2.orig/include/linux/slub_def.h
+++ linux-2.6-2/include/linux/slub_def.h
@@ -50,6 +50,8 @@ struct kmem_cache {
 	struct kobject kobj;	/* For sysfs */
 #endif
 
+	struct page *reserve_slab;
+
 #ifdef CONFIG_NUMA
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-2/mm/slub.c
===================================================================
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -20,11 +20,13 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include "internal.h"
 
 /*
  * Lock order:
- *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   1. reserve_lock
+ *   2. slab_lock(page)
+ *   3. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -258,6 +260,8 @@ static inline int sysfs_slab_alias(struc
 static inline void sysfs_slab_remove(struct kmem_cache *s) {}
 #endif
 
+static DEFINE_SPINLOCK(reserve_lock);
+
 /********************************************************************
  * 			Core slab cache functions
  *******************************************************************/
@@ -1069,7 +1073,7 @@ static void setup_object(struct kmem_cac
 		s->ctor(object, s, 0);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -1087,6 +1091,7 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
@@ -1457,6 +1462,7 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	int cpu = smp_processor_id();
+	int reserve = 0;
 
 	if (!page)
 		goto new_slab;
@@ -1486,10 +1492,25 @@ new_slab:
 	if (page) {
 		s->cpu_slab[cpu] = page;
 		goto load_freelist;
-	}
+	} else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+		goto try_reserve;
 
-	page = new_slab(s, gfpflags, node);
-	if (page) {
+alloc_slab:
+	page = new_slab(s, gfpflags, node, &reserve);
+	if (page && !reserve) {
+		if (unlikely(s->reserve_slab)) {
+			struct page *reserve;
+
+			spin_lock(&reserve_lock);
+			reserve = s->reserve_slab;
+			s->reserve_slab = NULL;
+			spin_unlock(&reserve_lock);
+
+			if (reserve) {
+				slab_lock(reserve);
+				unfreeze_slab(s, reserve);
+			}
+		}
 		cpu = smp_processor_id();
 		if (s->cpu_slab[cpu]) {
 			/*
@@ -1517,6 +1538,18 @@ new_slab:
 		SetSlabFrozen(page);
 		s->cpu_slab[cpu] = page;
 		goto load_freelist;
+	} else if (page) {
+		spin_lock(&reserve_lock);
+		if (s->reserve_slab) {
+			discard_slab(s, page);
+			page = s->reserve_slab;
+			goto got_reserve;
+		}
+		slab_lock(page);
+		SetSlabFrozen(page);
+		s->reserve_slab = page;
+		spin_unlock(&reserve_lock);
+		goto use_reserve;
 	}
 	return NULL;
 debug:
@@ -1528,6 +1561,31 @@ debug:
 	page->freelist = object[page->offset];
 	slab_unlock(page);
 	return object;
+
+try_reserve:
+	spin_lock(&reserve_lock);
+	page = s->reserve_slab;
+	if (!page) {
+		spin_unlock(&reserve_lock);
+		goto alloc_slab;
+	}
+
+got_reserve:
+	slab_lock(page);
+	if (!page->freelist) {
+		s->reserve_slab = NULL;
+		spin_unlock(&reserve_lock);
+		unfreeze_slab(s, page);
+		goto alloc_slab;
+	}
+	spin_unlock(&reserve_lock);
+
+use_reserve:
+	object = page->freelist;
+	page->inuse++;
+	page->freelist = object[page->offset];
+	slab_unlock(page);
+	return object;
 }
 
 /*
@@ -1872,10 +1930,11 @@ static struct kmem_cache_node * __init e
 {
 	struct page *page;
 	struct kmem_cache_node *n;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+	page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
 
 	BUG_ON(!page);
 	n = page->freelist;
@@ -2091,6 +2150,8 @@ static int kmem_cache_open(struct kmem_c
 	s->defrag_ratio = 100;
 #endif
 
+	s->reserve_slab = NULL;
+
 	if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		return 1;
 error:

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 05/10] mm: allow mempool to fall back to memalloc reserves
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 04/10] mm: slub: add knowledge of reserve pages Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 06/10] mm: kmem_estimate_pages() Peter Zijlstra
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-mempool_fixup.patch --]
[-- Type: text/plain, Size: 1329 bytes --]

Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempool.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6-2/mm/mempool.c
===================================================================
--- linux-2.6-2.orig/mm/mempool.c
+++ linux-2.6-2/mm/mempool.c
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include "internal.h"
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
 	void *element;
 	unsigned long flags;
 	wait_queue_t wait;
-	gfp_t gfp_temp;
+	gfp_t gfp_temp, gfp_orig = gfp_mask;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -228,6 +229,15 @@ repeat_alloc:
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
+	/* if we really have a right to the emergency reserves, try those */
+	if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
+		if (gfp_temp & __GFP_NOMEMALLOC) {
+			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			goto repeat_alloc;
+		} else
+			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+	}
+
 	/* We must not sleep in the GFP_ATOMIC case */
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 06/10] mm: kmem_estimate_pages()
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 05/10] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 07/10] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-kmem_estimate_pages.patch --]
[-- Type: text/plain, Size: 4110 bytes --]

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.
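
For example (an illustrative sketch; the cache pointer and object counts are
made up), a user wanting to guarantee a fixed number of objects could size
its reserve like this:

	/* upper bound, in pages, for 16 objects from a given kmem_cache */
	unsigned pages = kmem_estimate_pages(my_cache, GFP_ATOMIC, 16);

	/* the same for 16 kmalloc() allocations of up to 512 bytes each */
	unsigned kpages = kestimate_single(512, GFP_ATOMIC, 16);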

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slab.h |    3 +
 mm/slub.c            |   90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

Index: linux-2.6-2/include/linux/slab.h
===================================================================
--- linux-2.6-2.orig/include/linux/slab.h
+++ linux-2.6-2/include/linux/slab.h
@@ -58,6 +58,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -92,6 +93,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6-2/mm/slub.c
===================================================================
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -2206,6 +2206,45 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate @count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long slabs;
+
+	if (WARN_ON(!s) || WARN_ON(!s->objects))
+		return 0;
+
+	slabs = DIV_ROUND_UP(objects, s->objects);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object.
+	 */
+	if (s->objects > 1) {
+		if (!(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS)) {
+			/*
+			 * Account the possible additional overhead if per cpu
+			 * slabs are currently empty and have to be allocated.
+			 * This is very unlikely but a possible scenario
+			 * immediately after kmem_cache_shrink.
+			 */
+			slabs += num_online_cpus();
+		} else {
+			/*
+			 * when using the reserves there will be only a single
+			 * slab per kmem_cache.
+			 */
+			slabs += 1;
+		}
+	}
+
+	return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
  * Attempt to free all slabs on a node. Return the number of slabs we
  * were unable to free.
  */
@@ -2508,6 +2547,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous sizes.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < KMALLOC_SHIFT_HIGH; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = &dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = &kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_estimate_pages(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 07/10] mm: allow PF_MEMALLOC from softirq context
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (5 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 06/10] mm: kmem_estimate_pages() Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 08/10] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2218 bytes --]

Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save and restore current->flags; ksoftirqd has its own
task_struct.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    4 ++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -1239,9 +1239,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
Index: linux-2.6-2/kernel/softirq.c
===================================================================
--- linux-2.6-2.orig/kernel/softirq.c
+++ linux-2.6-2/kernel/softirq.c
@@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -249,6 +251,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -1336,6 +1336,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 08/10] mm: serialize access to min_free_kbytes
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (6 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 07/10] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 09/10] mm: emergency pool Peter Zijlstra
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1842 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -101,6 +101,7 @@ static char * const zone_names[MAX_NR_ZO
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -3634,12 +3635,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -3693,6 +3694,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -3728,7 +3738,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 09/10] mm: emergency pool
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (7 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 08/10] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 10:29 ` [PATCH 10/10] mm: __GFP_MEMALLOC Peter Zijlstra
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6644 bytes --]

Provide means to reserve a specific amount of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
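
Usage is expected to look roughly like this (an illustrative sketch; the page
count and error handling are made up):

	/* grow the emergency pool by 256 pages; reclaim is kicked to satisfy it */
	if (adjust_memalloc_reserve(256))
		return -ENOMEM;

	/* ... rely on the reserve ... */

	/* and give the pages back when done */
	adjust_memalloc_reserve(-256);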

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mmzone.h |    3 +
 mm/page_alloc.c        |   82 +++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 +--
 3 files changed, 78 insertions(+), 13 deletions(-)

Index: linux-2.6-2/include/linux/mmzone.h
===================================================================
--- linux-2.6-2.orig/include/linux/mmzone.h
+++ linux-2.6-2/include/linux/mmzone.h
@@ -191,7 +191,7 @@ enum zone_type {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_emerg, pages_min, pages_low, pages_high;
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -589,6 +589,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -103,6 +103,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1000,7 +1002,7 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx] + z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1410,8 +1412,8 @@ nofail_alloc:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -1625,9 +1627,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE)),
 			K(zone_page_state(zone, NR_INACTIVE)),
 			K(zone->present_pages),
@@ -3585,7 +3587,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -3642,7 +3644,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -3654,11 +3657,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -3677,12 +3682,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -3703,6 +3710,63 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+static void __adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+}
+
+static int test_reserve_limits(void)
+{
+	struct zone *zone;
+	int node;
+
+	for_each_zone(zone)
+		wakeup_kswapd(zone, 0);
+
+	for_each_online_node(node) {
+		struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+		if (!page)
+			return -ENOMEM;
+
+		__free_page(page);
+	}
+
+	return 0;
+}
+
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks reclaim into action to
+ *	satisfy the higher watermarks.
+ *
+ *	returns -ENOMEM when it failed to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int err = 0;
+
+	mutex_lock(&var_free_mutex);
+	__adjust_memalloc_reserve(pages);
+	if (pages > 0) {
+		err = test_reserve_limits();
+		if (err) {
+			__adjust_memalloc_reserve(-pages);
+			goto unlock;
+		}
+	}
+	printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+	mutex_unlock(&var_free_mutex);
+	return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6-2/mm/vmstat.c
===================================================================
--- linux-2.6-2.orig/mm/vmstat.c
+++ linux-2.6-2/mm/vmstat.c
@@ -558,9 +558,9 @@ static int zoneinfo_show(struct seq_file
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone_page_state(zone, NR_FREE_PAGES),
-			   zone->pages_min,
-			   zone->pages_low,
-			   zone->pages_high,
+			   zone->pages_emerg + zone->pages_min,
+			   zone->pages_emerg + zone->pages_low,
+			   zone->pages_emerg + zone->pages_high,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 10/10] mm: __GFP_MEMALLOC
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (8 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 09/10] mm: emergency pool Peter Zijlstra
@ 2007-08-06 10:29 ` Peter Zijlstra
  2007-08-06 17:35 ` [PATCH 00/10] foundations for reserve-based allocation Daniel Phillips
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 10:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 2074 bytes --]

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.
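
For instance (illustrative only), an allocation site that has already done its
reserve accounting could do:

	/* may ignore the watermarks, unlike a plain GFP_ATOMIC allocation */
	void *obj = kmalloc(size, GFP_ATOMIC | __GFP_MEMALLOC);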

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6-2/include/linux/gfp.h
===================================================================
--- linux-2.6-2.orig/include/linux/gfp.h
+++ linux-2.6-2/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -56,7 +57,7 @@ struct vm_area_struct;
 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
-			__GFP_NOFAIL|__GFP_NORETRY|__GFP_COMP| \
+			__GFP_NOFAIL|__GFP_NORETRY|__GFP_MEMALLOC|__GFP_COMP| \
 			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
 			__GFP_MOVABLE)
 
Index: linux-2.6-2/mm/page_alloc.c
===================================================================
--- linux-2.6-2.orig/mm/page_alloc.c
+++ linux-2.6-2/mm/page_alloc.c
@@ -1242,7 +1242,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

--


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (9 preceding siblings ...)
  2007-08-06 10:29 ` [PATCH 10/10] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2007-08-06 17:35 ` Daniel Phillips
  2007-08-06 18:17   ` Peter Zijlstra
  2007-08-06 17:56 ` Christoph Lameter
  2007-08-06 20:23 ` Matt Mackall
  12 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 03:29, Peter Zijlstra wrote:
> In the interest of getting swap over network working and posting in
> smaller series, here is the first series.
>
> This series lays the foundations needed to do reserve based
> allocation. Traditionally we have used mempools (and others like
> radix_tree_preload) to handle the problem.
>
> However this does not fit the network stack. It is built around
> variable sized allocations using kmalloc().
>
> This calls for a different approach.
>
> We want a guarantee for N bytes from kmalloc(), this translates to a
> demand on the slab allocator for 2*N+m (due to the power-of-two
> nature of kmalloc slabs), where m is the meta-data needed by the
> allocator itself.

Where does the 2* come from?  Isn't it exp2(ceil(log2(N + m)))?

> The slab allocator then puts a demand of P pages on the page
> allocator.
>
> So we need functions translating our demanded kmalloc space into a
> page reserve limit, and then need to provide a reserve of pages.
> 
> And we need to ensure that once we hit the reserve, the slab
> allocator honours the reserve's access. That is, a regular allocation
> may not get objects from a slab allocated from the reserves.

Patch [3/10] adds a new field to struct page.  I do not think this is 
necessary.   Allocating a page from reserve does not make it special.  
All we care about is that the total number of pages taken out of 
reserve is balanced by the total pages freed by a user of the reserve.

We do care about slab fragmentation in the sense that a slab page may be 
pinned in the slab by an unprivileged allocation and so that page may 
never be returned to the global page reserve.  One way to solve this is 
to have a per slabpage flag indicating the page came from reserve, and 
prevent mixing of privileged and unprivileged allocations on such a 
page.

> There is already a page reserve, but it does not fully comply with
> our needs. For example, it does not guarantee a strict level (due to
> the relative nature of ALLOC_HIGH and ALLOC_HARDER). Hence we augment
> this reserve with a strict limit.
>
> Furthermore a new __GFP flag is added to allow easy access to the
> reserves along-side the existing PF_MEMALLOC.
>
> Users of this infrastructure will need to do the necessary bean
> counting to ensure they stay within the requested limits.

This patch set is _way_ less intimidating than its predecessor.  
However, I see we have entered the era of sets of patch sets, since it 
is impossible to understand the need for this allocation infrastructure 
without reading the dependent network patch set.  Waiting with 
breathless anticipation.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-08-06 17:35 ` [PATCH 00/10] foundations for reserve-based allocation Daniel Phillips
@ 2007-08-06 17:56 ` Christoph Lameter
  2007-08-06 18:33   ` Peter Zijlstra
  2007-08-06 20:23 ` Matt Mackall
  12 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> We want a guarantee for N bytes from kmalloc(), this translates to a demand
> on the slab allocator for 2*N+m (due to the power-of-two nature of kmalloc 
> slabs), where m is the meta-data needed by the allocator itself.

The guarantee occurs in what context? Looks like its global here but 
allocations may be restricted to a cpuset context? What happens in a 
GFP_THISNODE allocation? Or a memory policy restricted allocations?

> So we need functions translating our demanded kmalloc space into a page
> reserve limit, and then need to provide a reserve of pages.

Only kmalloc? What about skb heads and such?

> And we need to ensure that once we hit the reserve, the slab allocator honours
> the reserve's access. That is, a regular allocation may not get objects from
> a slab allocated from the reserves.

From a cpuset we may hit the reserves since cpuset memory is out and then
the rest of the system fails allocations?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 10:29 ` [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 18:21     ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> Change ALLOC_NO_WATERMARK page allocation such that dipping into the reserves
> becomes a system wide event.

Shudder. That can just be a disaster for NUMA. Both performance wise and
logic wise. One cpuset being low on memory should not affect applications
in other cpusets.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 10:29 ` [PATCH 03/10] mm: tag reserve pages Peter Zijlstra
@ 2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 18:13     ` Daniel Phillips
                       ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> ===================================================================
> --- linux-2.6-2.orig/include/linux/mm_types.h
> +++ linux-2.6-2/include/linux/mm_types.h
> @@ -60,6 +60,7 @@ struct page {
>  	union {
>  		pgoff_t index;		/* Our offset within mapping. */
>  		void *freelist;		/* SLUB: freelist req. slab lock */
> +		int reserve;		/* page_alloc: page is a reserve page */

Extending page struct ???

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:11   ` Christoph Lameter
@ 2007-08-06 18:13     ` Daniel Phillips
  2007-08-06 18:28     ` Peter Zijlstra
  2007-08-06 19:34     ` Andi Kleen
  2 siblings, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 18:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:11, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > ===================================================================
> > --- linux-2.6-2.orig/include/linux/mm_types.h
> > +++ linux-2.6-2/include/linux/mm_types.h
> > @@ -60,6 +60,7 @@ struct page {
> >  	union {
> >  		pgoff_t index;		/* Our offset within mapping. */
> >  		void *freelist;		/* SLUB: freelist req. slab lock */
> > +		int reserve;		/* page_alloc: page is a reserve page */
>
> Extending page struct ???

See my comment above, I do not think this is necessary.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 17:35 ` [PATCH 00/10] foundations for reserve-based allocation Daniel Phillips
@ 2007-08-06 18:17   ` Peter Zijlstra
  2007-08-06 18:40     ` Daniel Phillips
  2007-08-06 19:31     ` Daniel Phillips
  0 siblings, 2 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 18:17 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 2007-08-06 at 10:35 -0700, Daniel Phillips wrote:
> On Monday 06 August 2007 03:29, Peter Zijlstra wrote:

> > We want a guarantee for N bytes from kmalloc(), this translates to a
> > demand on the slab allocator for 2*N+m (due to the power-of-two
> > nature of kmalloc slabs), where m is the meta-data needed by the
> > allocator itself.
> 
> Where does the 2* come from?  Isn't it exp2(ceil(log2(N + m)))?

Given a size distribution of 2^n, the worst-case slack space is 100%: allocations
of (2^m) + 1 bytes will always need 2^(m+1) bytes.

lim_{n -> inf} (2^(n+1)/((2^n)+1)) = 
2^lim_{n -> inf} ((n+1)-n) = 2^1 = 2
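
Concretely (illustrative numbers): a request for 2^9 + 1 = 513 bytes has to be
served from the 1024-byte kmalloc cache, so in the worst case the slab layer
needs roughly twice the requested space; hence the 2*N in the cover letter.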

> Patch [3/10] adds a new field to struct page.

No it doesn't.

>   I do not think this is 
> necessary.   Allocating a page from reserve does not make it special.  
> All we care about is that the total number of pages taken out of 
> reserve is balanced by the total pages freed by a user of the reserve.

And how do we know a page was taken out of the reserves?

This is done by looking at page->reserve (an overload of page->index), and
this value can be destroyed as soon as it is observed. It is, in a sense, an
extra return value.
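
Roughly like this (a sketch of the consumer side, not the exact SLUB code from
patch 4):

	struct page *page = alloc_pages(gfp, order);
	int from_reserve = 0;

	if (page)
		/* read it immediately; the union member gets reused afterwards */
		from_reserve = page->reserve;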

> We do care about slab fragmentation in the sense that a slab page may be 
> pinned in the slab by an unprivileged allocation and so that page may 
> never be returned to the global page reserve.

A slab page obtained from the reserve will never serve an object to an
unprivileged allocation.

>   One way to solve this is 
> to have a per slabpage flag indicating the page came from reserve, and 
> prevent mixing of privileged and unprivileged allocations on such a 
> page.

is done.

> This patch set is _way_ less intimidating than its predecessor.  
> However, I see we have entered the era of sets of patch sets, since it 
> is impossible to understand the need for this allocation infrastructure 
> without reading the dependent network patch set.  Waiting with 
> breathless anticipation.

Yeah, there were some objections to the size of it.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:11   ` Christoph Lameter
@ 2007-08-06 18:21     ` Daniel Phillips
  2007-08-06 18:31       ` Peter Zijlstra
  2007-08-06 18:42       ` Christoph Lameter
  0 siblings, 2 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 18:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:11, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > Change ALLOC_NO_WATERMARK page allocation such that dipping into
> > the reserves becomes a system wide event.
>
> > Shudder. That can just be a disaster for NUMA. Both performance wise
> and logic wise. One cpuset being low on memory should not affect
> applications in other cpusets.

Currently your system likely would have died here, so ending up with a 
reserve page temporarily on the wrong node is already an improvement. 

I agree that the reserve pool should be per-node in the end, but I do 
not think that serves the interest of simplifying the initial patch 
set.  How about a numa performance patch that adds onto the end of 
Peter's series?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 18:13     ` Daniel Phillips
@ 2007-08-06 18:28     ` Peter Zijlstra
  2007-08-06 19:34     ` Andi Kleen
  2 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 18:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 2007-08-06 at 11:11 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> 
> > ===================================================================
> > --- linux-2.6-2.orig/include/linux/mm_types.h
> > +++ linux-2.6-2/include/linux/mm_types.h
> > @@ -60,6 +60,7 @@ struct page {
> >  	union {

	^^^^^^^^^^
> >  		pgoff_t index;		/* Our offset within mapping. */
> >  		void *freelist;		/* SLUB: freelist req. slab lock */
> > +		int reserve;		/* page_alloc: page is a reserve page */
> 
> Extending page struct ???


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:21     ` Daniel Phillips
@ 2007-08-06 18:31       ` Peter Zijlstra
  2007-08-06 18:43         ` Daniel Phillips
  2007-08-06 19:11         ` Christoph Lameter
  2007-08-06 18:42       ` Christoph Lameter
  1 sibling, 2 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 18:31 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Christoph Lameter, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 2007-08-06 at 11:21 -0700, Daniel Phillips wrote:
> On Monday 06 August 2007 11:11, Christoph Lameter wrote:
> > On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> > > Change ALLOC_NO_WATERMARK page allocation such that dipping into
> > > the reserves becomes a system wide event.
> >
> > Shudder. That can just be a disaster for NUMA. Both performance wise
> > and logic wise. One cpuset being low on memory should not affect
> > applications in other cpusets.

Do note that these are only PF_MEMALLOC allocations that will break the
cpuset. And one can argue that these are not application allocations but
system allocations.

> Currently your system likely would have died here, so ending up with a 
> reserve page temporarily on the wrong node is already an improvement. 
> 
> I agree that the reserve pool should be per-node in the end, but I do 
> not think that serves the interest of simplifying the initial patch 
> set.  How about a numa performance patch that adds onto the end of 
> Peter's series?

Trouble with keeping this per node is that all the code dealing with the
reserve needs to keep per-cpu state, which, given that the system is
really crawling at that moment, seems excessive.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 17:56 ` Christoph Lameter
@ 2007-08-06 18:33   ` Peter Zijlstra
  0 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 18:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 2007-08-06 at 10:56 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> 
> > We want a guarantee for N bytes from kmalloc(), this translates to a demand
> > on the slab allocator for 2*N+m (due to the power-of-two nature of kmalloc 
> > slabs), where m is the meta-data needed by the allocator itself.
> 
> The guarantee occurs in what context? Looks like its global here but 
> allocations may be restricted to a cpuset context? What happens in a 
> GFP_THISNODE allocation? Or a memory policy restricted allocations?

From what I could understand of the low-level network allocations, these
try to be node affine at best and are not subject to mempolicies due to
taking place from irq context.

> > So we need functions translating our demanded kmalloc space into a page
> > reserve limit, and then need to provide a reserve of pages.
> 
> Only kmalloc? What about skb heads and such?

kmalloc is the hardest thing to get right; there are also functions to
calculate the reserves for kmem_cache-like allocations, see
kmem_estimate_pages().
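
Very roughly, the kmalloc translation looks like this (only a sketch of
the idea, assuming power-of-two kmalloc slabs; this is not the actual
kmem_estimate_pages() from the patches):

	/*
	 * Worst-case pages needed to back @bytes of kmalloc() space:
	 * power-of-two rounding can waste up to half of every object,
	 * plus a bit of slop for the allocator's own metadata.
	 */
	static unsigned long kmalloc_reserve_pages_sketch(size_t bytes)
	{
		unsigned long worst = 2 * bytes;

		return DIV_ROUND_UP(worst, PAGE_SIZE) + 1;
	}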

> > And we need to ensure that once we hit the reserve, the slab allocator honours
> > the reserve's access. That is, a regular allocation may not get objects from
> > a slab allocated from the reserves.
> 
> From a cpuset we may hit the reserves since cpuset memory is out and then 
> the rest of the system fails allocations?

No, do see patch 2/10. What we do is we fail cpuset local allocations,
_except_ PF_MEMALLOC (or __GFP_MEMALLOC). These will break out of the
mempolicy bounds before dipping into the reserves.

This has the effect that all nodes will be low before we hit the
reserves. Also, given that PF_MEMALLOC will usually use only a little
amount of memory, the impact of breaking out of these bounds is quite
limited.
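
Schematically the ordering is (illustrative only, written in terms of the
existing page allocator helpers; this is not the literal patch):

	/* first sweep all nodes at the normal watermark, without
	 * ALLOC_CPUSET, i.e. ignoring the cpuset bounds */
	page = get_page_from_freelist(gfp_mask, order, zonelist,
				      ALLOC_WMARK_MIN);
	if (!page)
		/* every node is low; only now dip below the watermarks */
		page = get_page_from_freelist(gfp_mask, order, zonelist,
					      ALLOC_NO_WATERMARKS);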


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 18:17   ` Peter Zijlstra
@ 2007-08-06 18:40     ` Daniel Phillips
  2007-08-06 19:31     ` Daniel Phillips
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:17, Peter Zijlstra wrote:
> lim_{n -> inf} (2^(n+1)/((2^n)+1)) = 2^lim_{n -> inf} ((n+1)-n) = 2^1 = 2

Glad I asked :-)

> > Patch [3/10] adds a new field to struct page.
>
> No it doesn't.

True.  It is not immediately obvious from the declaration that the 
overloaded field is always freed up before anybody else needs to use 
the union.

> >   I do not think this is
> > necessary.   Allocating a page from reserve does not make it
> > special. All we care about is that the total number of pages taken
> > out of reserve is balanced by the total pages freed by a user of
> > the reserve.
>
> And how do we know a page was taken out of the reserves?
>
> This is done by looking at page->reserve (overload of page->index)
> and this value can be destroyed as soon as its observed. It is in a
> sense an extra return value.

Ah I see.  I used to let alloc_pages fail then repeat the allocation 
with __GFP_MEMALLOC set, which was easy but stupidly repetitious.  Your 
technique is better, though returning the status in the page still 
looks a little funny.  This is really about saving a page flag, no?

> > We do care about slab fragmentation in the sense that a slab page
> > may be pinned in the slab by an unprivileged allocation and so that
> > page may never be returned to the global page reserve.
>
A slab page obtained from the reserve will never serve an object to
an unprivileged allocation.
>
> >   One way to solve this is
> > to have a per slabpage flag indicating the page came from reserve,
> > and prevent mixing of privileged and unprivileged allocations on
> > such a page.
>
> is done.

Serves me right for not reading that bit.  So the score for that round 
is: peterz 3, phillips 0 ;-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:21     ` Daniel Phillips
  2007-08-06 18:31       ` Peter Zijlstra
@ 2007-08-06 18:42       ` Christoph Lameter
  2007-08-06 18:48         ` Daniel Phillips
  1 sibling, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:42 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Daniel Phillips wrote:

> Currently your system likely would have died here, so ending up with a 
> reserve page temporarily on the wrong node is already an improvement. 

The system would have died? Why? The application in the cpuset that 
ran out of memory should have died, not the system.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:31       ` Peter Zijlstra
@ 2007-08-06 18:43         ` Daniel Phillips
  2007-08-06 19:11         ` Christoph Lameter
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 18:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:31, Peter Zijlstra wrote:
> > I agree that the reserve pool should be per-node in the end, but I
> > do not think that serves the interest of simplifying the initial
> > patch set.  How about a numa performance patch that adds onto the
> > end of Peter's series?
>
> Trouble with keeping this per node is that all the code dealing with
> the reserve needs to keep per-cpu state, which given that the system
> is really crawling at that moment, seems excessive.

It does.  I was suggesting that Christoph think about the NUMA part, our 
job just to save the world ;-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 19:34     ` Andi Kleen
@ 2007-08-06 18:43       ` Christoph Lameter
  2007-08-06 18:47         ` Peter Zijlstra
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Andi Kleen wrote:

> > >  		pgoff_t index;		/* Our offset within mapping. */
> > >  		void *freelist;		/* SLUB: freelist req. slab lock */
> > > +		int reserve;		/* page_alloc: page is a reserve page */
> > 
> > Extending page struct ???
> 
> Note it's an union.

Ok. Then under what conditions can we use reserve? Right after alloc?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:43       ` Christoph Lameter
@ 2007-08-06 18:47         ` Peter Zijlstra
  2007-08-06 18:59           ` Andi Kleen
  0 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 18:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 2007-08-06 at 11:43 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Andi Kleen wrote:
> 
> > > >  		pgoff_t index;		/* Our offset within mapping. */
> > > >  		void *freelist;		/* SLUB: freelist req. slab lock */
> > > > +		int reserve;		/* page_alloc: page is a reserve page */
> > > 
> > > Extending page struct ???
> > 
> > Note it's an union.
> 
> Ok. Then under what conditions can we use reserve? Right after alloc?

Yes, it's usually only observed right after alloc. It's basically an extra
return value.

Daniel suggested it was about saving page flags, _maybe_. The value is
1) rare and 2) usually only interesting right after alloc. So spending a
precious page flag, which would keep this state for the duration of the
whole allocation, seemed like a waste.
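
To illustrate, a hypothetical caller (not code from the patches) would
consume it immediately, before the page is used for anything that reuses
the union:

	struct page *page = alloc_pages(gfp_mask | __GFP_MEMALLOC, order);

	if (page && page->reserve) {
		/*
		 * The page came out of the emergency reserve; note that
		 * now, because page->reserve overlays page->index and is
		 * gone as soon as the page is put to use.
		 */
		charge_emergency_reserve(1 << order);	/* hypothetical hook */
	}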


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:42       ` Christoph Lameter
@ 2007-08-06 18:48         ` Daniel Phillips
  2007-08-06 18:51           ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 18:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > Currently your system likely would have died here, so ending up
> > with a reserve page temporarily on the wrong node is already an
> > improvement.
>
> The system would have died? Why?

Because a block device may have deadlocked here, leaving the system 
unable to clean dirty memory, or unable to load executables over the 
network for example.

> The application in the cpuset that ran out of memory should have died
> not the system. 

If your "application" is a virtual block device then you can land in 
deep doodoo.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:48         ` Daniel Phillips
@ 2007-08-06 18:51           ` Christoph Lameter
  2007-08-06 19:15             ` Daniel Phillips
  2007-08-06 20:12             ` Matt Mackall
  0 siblings, 2 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:51 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Daniel Phillips wrote:

> On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > Currently your system likely would have died here, so ending up
> > > with a reserve page temporarily on the wrong node is already an
> > > improvement.
> >
> > The system would have died? Why?
> 
> Because a block device may have deadlocked here, leaving the system 
> unable to clean dirty memory, or unable to load executables over the 
> network for example.

So this is a locking problem that has not been taken care of?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:47         ` Peter Zijlstra
@ 2007-08-06 18:59           ` Andi Kleen
  2007-08-06 19:09             ` Christoph Lameter
  2007-08-06 19:10             ` Andrew Morton
  0 siblings, 2 replies; 85+ messages in thread
From: Andi Kleen @ 2007-08-06 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Andi Kleen, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Matt Mackall, Lee Schermerhorn, Steve Dickson

> precious page flag

I always cringe when I hear that. It's really more that node/sparsemem
use too many bits. If we get rid of 32bit NUMA that problem would be
gone for the node at least because it could be moved into the mostly
unused upper 32bit part on 64bit architectures.

The alternative would be to investigate again what it does to the
kernel to just use different lookup methods for this.

-Andi

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:59           ` Andi Kleen
@ 2007-08-06 19:09             ` Christoph Lameter
  2007-08-06 19:10             ` Andrew Morton
  1 sibling, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 19:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Andi Kleen wrote:

> I always cringe when I hear that. It's really more that node/sparsemem
> use too many bits. If we get rid of 32bit NUMA that problem would be
> gone for the node at least because it could be moved into the mostly
> unused upper 32bit part on 64bit architectures.

Looks like that will not be possible. Seems that embedded systems 
now want NUMA support.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:59           ` Andi Kleen
  2007-08-06 19:09             ` Christoph Lameter
@ 2007-08-06 19:10             ` Andrew Morton
  2007-08-06 19:16               ` Christoph Lameter
                                 ` (2 more replies)
  1 sibling, 3 replies; 85+ messages in thread
From: Andrew Morton @ 2007-08-06 19:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Christoph Lameter, linux-kernel, linux-mm,
	David Miller, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007 20:59:26 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > precious page flag
> 
> > I always cringe when I hear that. It's really more that node/sparsemem
> use too many bits. If we get rid of 32bit NUMA that problem would be
> gone for the node at least because it could be moved into the mostly
> unused upper 32bit part on 64bit architectures.

Removing 32-bit NUMA is attractive - NUMAQ we can probably live without,
not sure about summit.  But superh is starting to use NUMA now, due to
varying access times of various sorts of memory, and one can envisage other
embedded setups doing that.

Plus I don't think there are many flags left in the upper 32-bits.  ia64
swooped in and gobbled lots of them, although it's not immediately clear
how many were consumed.

> The alternative would be to investigate again what it does to the
> kernel to just use different lookup methods for this.

That's cringeworthy too, I expect.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:31       ` Peter Zijlstra
  2007-08-06 18:43         ` Daniel Phillips
@ 2007-08-06 19:11         ` Christoph Lameter
  2007-08-06 19:31           ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 19:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Phillips, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> > > Shudder. That can just be a disaster for NUMA. Both performance wise
> > > and logic wise. One cpuset being low on memory should not affect
> > > applications in other cpusets.
> 
> Do note that these are only PF_MEMALLOC allocations that will break the
> cpuset. And one can argue that these are not application allocation but
> system allocations.

This is global: global locking, etc. On a large NUMA system this will 
cause significant delays. One fears that a livelock may result.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:51           ` Christoph Lameter
@ 2007-08-06 19:15             ` Daniel Phillips
  2007-08-06 20:12             ` Matt Mackall
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:51, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > > Currently your system likely would have died here, so ending up
> > > > with a reserve page temporarily on the wrong node is already an
> > > > improvement.
> > >
> > > The system would have died? Why?
> >
> > Because a block device may have deadlocked here, leaving the system
> > unable to clean dirty memory, or unable to load executables over
> > the network for example.
>
> So this is a locking problem that has not been taken care of?

A deadlock problem that we used to call memory recursion deadlock, aka: 
somebody in the vm writeout path needs to allocate memory, but there is 
no memory, so vm writeout stops forever.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 19:10             ` Andrew Morton
@ 2007-08-06 19:16               ` Christoph Lameter
  2007-08-06 19:38               ` Matt Mackall
  2007-08-06 20:18               ` Andi Kleen
  2 siblings, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 19:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 6 Aug 2007, Andrew Morton wrote:

> Plus I don't think there are many flags left in the upper 32-bits.  ia64
> swooped in and gobbled lots of them, although it's not immediately clear
> how many were consumed.

IA64 uses one of these bits for the uncached allocator.  10 bits used for 
the node. Sparsemem may use some of the rest.




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 18:17   ` Peter Zijlstra
  2007-08-06 18:40     ` Daniel Phillips
@ 2007-08-06 19:31     ` Daniel Phillips
  2007-08-06 19:36       ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 11:17, Peter Zijlstra wrote:
> And how do we know a page was taken out of the reserves?

Why not return that in the low bit of the page address?  This is a 
little more cache efficient, does not leave that odd footprint in the 
page union and forces the caller to examine the 
alloc_pages(...P_MEMALLOC) return, making it harder to overlook the 
fact that it got a page out of reserve and forget to put one back 
later.
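
Something along these lines (purely illustrative helpers; struct page is
word aligned, so bit 0 of the pointer is free):

	static inline struct page *tag_reserve(struct page *page)
	{
		return (struct page *)((unsigned long)page | 1);
	}

	static inline int page_was_reserve(struct page *page)
	{
		return (unsigned long)page & 1;
	}

	static inline struct page *untag_reserve(struct page *page)
	{
		return (struct page *)((unsigned long)page & ~1UL);
	}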

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 19:11         ` Christoph Lameter
@ 2007-08-06 19:31           ` Peter Zijlstra
  2007-08-06 20:12             ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 19:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 2007-08-06 at 12:11 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> 
> > > > Shudder. That can just be a disaster for NUMA. Both performance wise
> > > > and logic wise. One cpuset being low on memory should not affect
> > > > applications in other cpusets.
> > 
> > Do note that these are only PF_MEMALLOC allocations that will break the
> > cpuset. And one can argue that these are not application allocation but
> > system allocations.
> 
> This is global, global locking etc etc. On a large NUMA system this will 
> cause significant delays. One fears that a livelock may result.

The only new lock is in SLUB, and I'm not aware of any regular
PF_MEMALLOC paths using slab allocations, but I'll instrument the
regular reclaim path to verify this.

The functionality this is aimed at is swap over network, and I doubt
you'll be enabling that on these machines.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 18:11   ` Christoph Lameter
  2007-08-06 18:13     ` Daniel Phillips
  2007-08-06 18:28     ` Peter Zijlstra
@ 2007-08-06 19:34     ` Andi Kleen
  2007-08-06 18:43       ` Christoph Lameter
  2 siblings, 1 reply; 85+ messages in thread
From: Andi Kleen @ 2007-08-06 19:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

Christoph Lameter <clameter@sgi.com> writes:

> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> 
> > ===================================================================
> > --- linux-2.6-2.orig/include/linux/mm_types.h
> > +++ linux-2.6-2/include/linux/mm_types.h
> > @@ -60,6 +60,7 @@ struct page {
> >  	union {
> >  		pgoff_t index;		/* Our offset within mapping. */
> >  		void *freelist;		/* SLUB: freelist req. slab lock */
> > +		int reserve;		/* page_alloc: page is a reserve page */
> 
> Extending page struct ???

Note it's an union.

-Andi

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 19:31     ` Daniel Phillips
@ 2007-08-06 19:36       ` Peter Zijlstra
  2007-08-06 19:53         ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 19:36 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 2007-08-06 at 12:31 -0700, Daniel Phillips wrote:
> On Monday 06 August 2007 11:17, Peter Zijlstra wrote:
> > And how do we know a page was taken out of the reserves?
> 
> Why not return that in the low bit of the page address?  This is a 
> little more cache efficient, does not leave that odd footprint in the 
> page union and forces the caller to examine the 
> alloc_pages(...P_MEMALLOC) return, making it harder to overlook the 
> fact that it got a page out of reserve and forget to put one back 
> later.

This would require auditing all page allocation sites to check whether
they ever happen under PF_MEMALLOC or the like. Because if an allocator
ever fails to check the low bit and assumes it's a valid struct page *,
stuff will go *bang*.




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 19:10             ` Andrew Morton
  2007-08-06 19:16               ` Christoph Lameter
@ 2007-08-06 19:38               ` Matt Mackall
  2007-08-06 20:18               ` Andi Kleen
  2 siblings, 0 replies; 85+ messages in thread
From: Matt Mackall @ 2007-08-06 19:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Peter Zijlstra, Christoph Lameter, linux-kernel,
	linux-mm, David Miller, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, Aug 06, 2007 at 12:10:53PM -0700, Andrew Morton wrote:
> On Mon, 6 Aug 2007 20:59:26 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > precious page flag
> > 
> > I always cringe when I hear that. It's really more that node/sparsemem
> > use too many bits. If we get rid of 32bit NUMA that problem would be
> > gone for the node at least because it could be moved into the mostly
> > unused upper 32bit part on 64bit architectures.
> 
> Removing 32-bit NUMA is attractive - NUMAQ we can probably live without,
> not sure about summit.  But superh is starting to use NUMA now, due to
> varying access times of various sorts of memory, and one can envisage other
> embedded setups doing that.
> 
> Plus I don't think there are many flags left in the upper 32-bits.  ia64
> swooped in and gobbled lots of them, although it's not immediately clear
> how many were consumed.
> 
> > The alternative would be to investigate again what it does to the
> > kernel to just use different lookup methods for this.
> 
> That's cringeworthy too, I expect.

Perhaps the node info could be pulled out into a parallel and
effectively read-only array of shorts.
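
Roughly like this, with made-up names (one entry per memory section,
filled in once at boot and never written afterwards):

	static unsigned short section_node[NR_MEM_SECTIONS] __read_mostly;

	static inline int page_to_nid_sketch(struct page *page)
	{
		return section_node[page_to_pfn(page) >> PFN_SECTION_SHIFT];
	}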

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 19:36       ` Peter Zijlstra
@ 2007-08-06 19:53         ` Daniel Phillips
  0 siblings, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 12:36, Peter Zijlstra wrote:
> On Mon, 2007-08-06 at 12:31 -0700, Daniel Phillips wrote:
> > On Monday 06 August 2007 11:17, Peter Zijlstra wrote:
> > > And how do we know a page was taken out of the reserves?
> >
> > Why not return that in the low bit of the page address?  This is a
> > little more cache efficient, does not leave that odd footprint in
> > the page union and forces the caller to examine the
> > alloc_pages(...P_MEMALLOC) return, making it harder to overlook the
> > fact that it got a page out of reserve and forget to put one back
> > later.
>
> This would require auditing all page allocation sites to ensure they
> ever happen under PF_MEMALLOC or the like. Because if an allocator
> ever fails to check the low bit and assumes its a valid struct page
> *, stuff will go *bang*.

But is that not exactly what we want?  If somebody sets __GFP_MEMALLOC 
then they had better be prepared to deal with the result.

OK, I see there are cases where the gfp flags are blindly passed 
down level after level (tending to indicate a design flaw anyway) but 
they usually also get passed back up the chain without examination.  
Yes, an audit would be needed, but since when has that ever stood in 
the way of a minuscule optimization?  And if we miss a case it 
helpfully goes bang on us, no need to play games with new function 
names or the like.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 19:31           ` Peter Zijlstra
@ 2007-08-06 20:12             ` Christoph Lameter
  0 siblings, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 20:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Phillips, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> The functionality this is aimed at is swap over network, and I doubt
> you'll be enabling that on these machines.

So add #ifdefs around it?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 18:51           ` Christoph Lameter
  2007-08-06 19:15             ` Daniel Phillips
@ 2007-08-06 20:12             ` Matt Mackall
  2007-08-06 20:19               ` Christoph Lameter
  1 sibling, 1 reply; 85+ messages in thread
From: Matt Mackall @ 2007-08-06 20:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, Aug 06, 2007 at 11:51:45AM -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> 
> > On Monday 06 August 2007 11:42, Christoph Lameter wrote:
> > > On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > > > Currently your system likely would have died here, so ending up
> > > > with a reserve page temporarily on the wrong node is already an
> > > > improvement.
> > >
> > > The system would have died? Why?
> > 
> > Because a block device may have deadlocked here, leaving the system 
> > unable to clean dirty memory, or unable to load executables over the 
> > network for example.
> 
> So this is a locking problem that has not been taken care of?

No.

It's very simple:

1) memory becomes full
2) we try to free memory by paging or swapping
3) I/O requires a memory allocation which fails because memory is full
4) box dies because it's unable to dig itself out of OOM

Most I/O paths can deal with this by having a mempool for their I/O
needs. For network I/O, this turns out to be prohibitively hard due to
the complexity of the stack.
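
For a block driver the mempool pattern typically looks like the below
(illustrative; struct io_ctx is a stand-in for whatever the driver needs
per request):

	#include <linux/mempool.h>

	static mempool_t *io_pool;

	/* at init time: set aside memory for 16 in-flight requests */
	io_pool = mempool_create_kmalloc_pool(16, sizeof(struct io_ctx));

	/* in the writeout path: may sleep, but never fails once created */
	struct io_ctx *ctx = mempool_alloc(io_pool, GFP_NOIO);
	/* ... do the I/O; on completion give the memory back: */
	mempool_free(ctx, io_pool);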

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 03/10] mm: tag reserve pages
  2007-08-06 19:10             ` Andrew Morton
  2007-08-06 19:16               ` Christoph Lameter
  2007-08-06 19:38               ` Matt Mackall
@ 2007-08-06 20:18               ` Andi Kleen
  2 siblings, 0 replies; 85+ messages in thread
From: Andi Kleen @ 2007-08-06 20:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Peter Zijlstra, Christoph Lameter, linux-kernel,
	linux-mm, David Miller, Daniel Phillips, Pekka Enberg,
	Matt Mackall, Lee Schermerhorn, Steve Dickson

On Mon, Aug 06, 2007 at 12:10:53PM -0700, Andrew Morton wrote:
> On Mon, 6 Aug 2007 20:59:26 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > precious page flag
> > 
> > I always cringe when I hear that. It's really more that node/sparsemem
> > use too many bits. If we get rid of 32bit NUMA that problem would be
> > gone for the node at least because it could be moved into the mostly
> > unused upper 32bit part on 64bit architectures.
> 
> Removing 32-bit NUMA is attractive - NUMAQ we can probably live without,
> not sure about summit.  But superh is starting to use NUMA now, due to
> varying access times of various sorts of memory, and one can envisage other
> embedded setups doing that.

They can just use phys_to_nid() or equivalent instead. Putting the node
into the page flags is just a very minor optimization.  Actually, I doubt
you could benchmark the difference. While in theory the hash lookup
could be another cache miss, in practice it should already be hot since
it's used elsewhere.

> Plus I don't think there are many flags left in the upper 32-bits.  ia64
> swooped in and gobbled lots of them, although it's not immediately clear
> how many were consumed.

Really?  They forgot to document it then.

Anyways, if they don't have enough bits left they can always just
use the hash table and drop the node completely. Shouldn't make too much 
difference and IA64 has gobs of cache anyways.

I'm actually thinking about a PG_arch_2 on x86_64 too. arch_1 is already
used now but another one would be useful in c_p_a().

-Andi

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:12             ` Matt Mackall
@ 2007-08-06 20:19               ` Christoph Lameter
  2007-08-06 20:26                 ` Peter Zijlstra
                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 20:19 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Daniel Phillips, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Matt Mackall wrote:

> > > Because a block device may have deadlocked here, leaving the system 
> > > unable to clean dirty memory, or unable to load executables over the 
> > > network for example.
> > 
> > So this is a locking problem that has not been taken care of?
> 
> No.
> 
> It's very simple:
> 
> 1) memory becomes full

We do have limits to avoid memory getting too full.

> 2) we try to free memory by paging or swapping
> 3) I/O requires a memory allocation which fails because memory is full
> 4) box dies because it's unable to dig itself out of OOM
> 
> Most I/O paths can deal with this by having a mempool for their I/O
> needs. For network I/O, this turns out to be prohibitively hard due to
> the complexity of the stack.

The common solution is to have a reserve (min_free_kbytes). The problem 
with the network stack seems to be that the amount of reserve needed 
cannot be predicted accurately.

The solution may be as simple as configuring the reserves right and 
avoiding the unbounded memory allocations. That is possible if one 
would make sure that the network layer triggers reclaim once in a 
while.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-08-06 17:56 ` Christoph Lameter
@ 2007-08-06 20:23 ` Matt Mackall
  2007-08-07  0:09   ` Daniel Phillips
  12 siblings, 1 reply; 85+ messages in thread
From: Matt Mackall @ 2007-08-06 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Christoph Lameter,
	Lee Schermerhorn, Steve Dickson

On Mon, Aug 06, 2007 at 12:29:22PM +0200, Peter Zijlstra wrote:
> 
> In the interrest of getting swap over network working and posting in smaller
> series, here is the first series.
> 
> This series lays the foundations needed to do reserve based allocation.
> Traditionally we have used mempools (and others like radix_tree_preload) to
> handle the problem.
> 
> However this does not fit the network stack. It is built around variable
> sized allocations using kmalloc().
> 
> This calls for a different approach.

One wonders if containers are a possible solution. I can already solve
this problem with virtualization: have one VM manage all the network
I/O and export the device as a simpler virtual block device to other
VMs. Provided this VM isn't doing any "real" work and is sized
appropriately, it won't get wedged. Since the other VMs now submit I/O
through the simpler block interface, they can avoid getting wedged
with the standard mempool approach.

If we can run nbd and friends inside their own container that can give
similar isolation, we might not need to add this other complexity.

Just food for thought. I haven't looked closely enough at the
containers implementations yet to determine whether this is possible
or if the overhead in performance or complexity is acceptable.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:19               ` Christoph Lameter
@ 2007-08-06 20:26                 ` Peter Zijlstra
  2007-08-06 21:05                   ` Christoph Lameter
  2007-08-06 20:27                 ` Andrew Morton
  2007-08-06 22:47                 ` Daniel Phillips
  2 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-06 20:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, Daniel Phillips, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

[-- Attachment #1: Type: text/plain, Size: 1657 bytes --]

On Mon, 2007-08-06 at 13:19 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Matt Mackall wrote:
> 
> > > > Because a block device may have deadlocked here, leaving the system 
> > > > unable to clean dirty memory, or unable to load executables over the 
> > > > network for example.
> > > 
> > > So this is a locking problem that has not been taken care of?
> > 
> > No.
> > 
> > It's very simple:
> > 
> > 1) memory becomes full
> 
> We do have limits to avoid memory getting too full.
> 
> > 2) we try to free memory by paging or swapping
> > 3) I/O requires a memory allocation which fails because memory is full
> > 4) box dies because it's unable to dig itself out of OOM
> > 
> > Most I/O paths can deal with this by having a mempool for their I/O
> > needs. For network I/O, this turns out to be prohibitively hard due to
> > the complexity of the stack.
> 
> The common solution is to have a reserve (min_free_kbytes). 

This patch set builds on that. Trouble with min_free_kbytes is the
relative nature of ALLOC_HIGH and ALLOC_HARDER.
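
zone_watermark_ok() of this era only ever scales the watermark, roughly
like this (paraphrased, not the literal code):

	long min = z->pages_min;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;		/* __GFP_HIGH */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;		/* rt tasks / atomic allocations */

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;		/* refuse the allocation */

so there is never an absolute number of pages an emergency user can
count on.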

> The problem 
> with the network stack seems to be that the amount of reserve needed 
> cannot be predicted accurately.
> 
> The solution may be as simple as configuring the reserves right and 
> avoid the unbounded memory allocations. 

Which is what the next series of patches will be doing. Please do look
in detail at these networked swap patches I've been posting for the last
year or so.

> That is possible if one 
> would make sure that the network layer triggers reclaim once in a 
> while.

This does not make sense; we cannot reclaim from reclaim.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:19               ` Christoph Lameter
  2007-08-06 20:26                 ` Peter Zijlstra
@ 2007-08-06 20:27                 ` Andrew Morton
  2007-08-06 23:16                   ` Daniel Phillips
  2007-08-06 22:47                 ` Daniel Phillips
  2 siblings, 1 reply; 85+ messages in thread
From: Andrew Morton @ 2007-08-06 20:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, Daniel Phillips, Peter Zijlstra, linux-kernel,
	linux-mm, David Miller, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007 13:19:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Mon, 6 Aug 2007, Matt Mackall wrote:
> 
> > > > Because a block device may have deadlocked here, leaving the system 
> > > > unable to clean dirty memory, or unable to load executables over the 
> > > > network for example.
> > > 
> > > So this is a locking problem that has not been taken care of?
> > 
> > No.
> > 
> > It's very simple:
> > 
> > 1) memory becomes full
> 
> We do have limits to avoid memory getting too full.
> 
> > 2) we try to free memory by paging or swapping
> > 3) I/O requires a memory allocation which fails because memory is full
> > 4) box dies because it's unable to dig itself out of OOM
> > 
> > Most I/O paths can deal with this by having a mempool for their I/O
> > needs. For network I/O, this turns out to be prohibitively hard due to
> > the complexity of the stack.
> 
> The common solution is to have a reserve (min_free_kbytes). The problem 
> with the network stack seems to be that the amount of reserve needed 
> cannot be predicted accurately.
> 
> The solution may be as simple as configuring the reserves right and 
> avoid the unbounded memory allocations. That is possible if one 
> would make sure that the network layer triggers reclaim once in a 
> while.

Such a simple fix would be attractive.  Some of the net drivers now have
remarkably large rx and tx queues.  One wonders if this is playing a part
in the problem and whether reducing the queue sizes would help.

I guess we'd need to reduce the queue size on all NICs in the machine
though, which might be somewhat of a performance problem.

I don't think we've seen a lot of justification for those large queues. 
I'm suspecting it's a few percent in carefully-chosen workloads (of the
microbenchmark kind?)


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:26                 ` Peter Zijlstra
@ 2007-08-06 21:05                   ` Christoph Lameter
  2007-08-06 22:59                     ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 21:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Mackall, Daniel Phillips, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> > The solution may be as simple as configuring the reserves right and 
> > avoid the unbounded memory allocations. 
> 
> Which is what the next series of patches will be doing. Please do look
> in detail at these networked swap patches I've been posting for the last
> year or so.
> 
> > That is possible if one 
> > would make sure that the network layer triggers reclaim once in a 
> > while.
> 
> This does not make sense, we cannot reclaim from reclaim.

But we should limit the amount of allocation we do while performing 
reclaim. F.e. refilling memory pools during reclaim should be disabled.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:19               ` Christoph Lameter
  2007-08-06 20:26                 ` Peter Zijlstra
  2007-08-06 20:27                 ` Andrew Morton
@ 2007-08-06 22:47                 ` Daniel Phillips
  2 siblings, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 22:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

(What Peter already wrote, but in different words)

On Monday 06 August 2007 13:19, Christoph Lameter wrote:
> The solution may be as simple as configuring the reserves right and
> avoid the unbounded memory allocations.

Exactly.  That is what this patch set is about.  This is the part that 
provides some hooks to extend the traditional reserves to be able to 
handle some of the more difficult situations.

> That is possible if one 
> would make sure that the network layer triggers reclaim once in a
> while.

No.  It does no good at all for network to do a bunch of work 
reclaiming, then have some other random task (for example, a heavy 
writer) swoop in and grab the reclaimed memory before net can use it.  
Also, net allocates memory in interrupt context where shrink_caches is 
not possible.  The correct solution is to _reserve_ the memory net 
needs for vm writeout, which is in Peter's next patch set coming down 
the pipe.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 21:05                   ` Christoph Lameter
@ 2007-08-06 22:59                     ` Daniel Phillips
  2007-08-06 23:14                       ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 22:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 14:05, Christoph Lameter wrote:
> > > That is possible if one
> > > would make sure that the network layer triggers reclaim once in a
> > > while.
> >
> > This does not make sense, we cannot reclaim from reclaim.
>
> But we should limit the amounts of allocation we do while performing
> reclaim...

Correct.  That is what the throttling part of these patches is about.  
In order to fix the vm writeout deadlock problem properly, two things 
are necessary:

  1) Throttle the vm writeout path to use a bounded amount of memory

  2) Provide access to a sufficiently large amount of reserve memory for 
each memory user in the vm writeout path

You can understand every detail of this patch set and the following ones 
coming from Peter in terms of those two requirements.

> F.e. refilling memory pools during reclaim should be disabled.

Actually, recursing into the vm should be disabled entirely but that is 
a rather deeply ingrained part of mm culture we do not propose to 
fiddle with just now.

Memory pools are refilled when the pool user frees some memory, not ever 
by the mm.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 22:59                     ` Daniel Phillips
@ 2007-08-06 23:14                       ` Christoph Lameter
  2007-08-06 23:49                         ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-06 23:14 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Peter Zijlstra, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Daniel Phillips wrote:

> Correct.  That is what the throttling part of these patches is about.  

Where are those patches?

> In order to fix the vm writeout deadlock problem properly, two things 
> are necessary:
> 
>   1) Throttle the vm writeout path to use a bounded amount of memory
> 
>   2) Provide access to a sufficiently large amount of reserve memory for 
> each memory user in the vm writeout path
> 
> You can understand every detail of this patch set and the following ones 
> coming from Peter in terms of those two requirements.

AFAICT: This patchset is not throttling processes but failing allocations. 
The patchset does not reconfigure the memory reserves as expected. Instead 
new reserve logic is added. And I suspect that we have the same issues as 
in earlier releases, with various corner cases not being covered. Code is 
added that is supposedly not used. If it ever is used on a large config 
then we are in very deep trouble, because the new code paths themselves 
serialize things in order to give some allocations precedence over the 
other allocations that are made to fail ....






^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 20:27                 ` Andrew Morton
@ 2007-08-06 23:16                   ` Daniel Phillips
  0 siblings, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 23:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Matt Mackall, Peter Zijlstra, linux-kernel,
	linux-mm, David Miller, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 13:27, Andrew Morton wrote:
> On Mon, 6 Aug 2007 13:19:26 -0700 (PDT) Christoph Lameter wrote:
> > The solution may be as simple as configuring the reserves right and
> > avoid the unbounded memory allocations. That is possible if one
> > would make sure that the network layer triggers reclaim once in a
> > while.
>
> Such a simple fix would be attractive.  Some of the net drivers now
> have remarkably large rx and tx queues.  One wonders if this is
> playing a part in the problem and whether reducing the queue sizes
> would help.

There is nothing wrong with having huge rx and tx queues, except when 
they lie in the vm writeout path.  In that case, we do indeed throttle 
them to reasonable numbers.

See:

   http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
   down(&info->throttle_sem);
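
The pattern is about as simple as it gets (simplified; MAX_INFLIGHT and
submit_write() are placeholders, not the driver's real names):

	sema_init(&info->throttle_sem, MAX_INFLIGHT);

	/* submission path: wait for a slot we know we can back with memory */
	down(&info->throttle_sem);
	submit_write(req);

	/* completion path, when the ack comes back over the network */
	up(&info->throttle_sem);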

The only way we have ever gotten ddsnap to run reliably under heavy load 
without deadlocking is with Peter's patch set (a distant descendant of 
mine from two years or so ago) and we did put a lot of effort into 
trying to make congestion_wait and friends do the job instead.

To be sure, it is a small subset of Peter's full patch set that actually 
does the job that congestion_wait cannot be made to do, which is to 
guarantee that socket->write() is always able to get enough memory to 
send out the vm traffic without recursing.  What Peter's patches do is 
make it _nice_ to fix these problems.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 23:14                       ` Christoph Lameter
@ 2007-08-06 23:49                         ` Daniel Phillips
  2007-08-07 22:18                           ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-06 23:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Monday 06 August 2007 16:14, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> > Correct.  That is what the throttling part of these patches is
> > about.
>
> Where are those patches?

Here is one user:

   http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
   down(&info->throttle_sem);

Peter has another (swap over net).  A third is the network stack itself, 
which is in the full patch set that Peter has posted a number of times 
in the past but did not appear today because Peter broke the full patch 
set up into multiple sets to make it all easier to understand.

> AFAICT: This patchset is not throttling processes but failing
> allocations.

Failing allocations?  Where do you see that?  As far as I can see, 
Peter's patch set allows allocations to fail exactly where the user has 
always specified they may fail, and in no new places.  If there is a 
flaw in that logic, please let us know.

What the current patch set actually does is allow some critical 
allocations that would have failed or recursed into deadlock before to 
succeed instead, allowing vm writeout to complete successfully.  You 
may quibble with exactly how he accomplishes that, but preventing these 
allocations from failing is not optional.

> The patchset does not reconfigure the memory reserves as 
> expected.

What do you mean by that?  Expected by who?

> Instead new reserve logic is added.

Peter did not actually have to add a new layer of abstraction to 
alloc_pages to impose order on the hardcoded hacks that currently live 
in there to decide how far various callers can dig into reserves.  It 
would probably help people understand this patch set if that part were 
taken out for now and replaced with the original seat-of-the-pants two 
line hack I had in the original.  But I do not see anything wrong with 
what Peter has written there, it just takes a little more time to read.

> And I suspect that we  
> have the same issues as in earlier releases with various corner cases
> not being covered.

Do you have an example?

> Code is added that is supposedly not used.

What makes you think that?

> If it  ever is on a large config then we are in very deep trouble by
> the new code paths themselves that serialize things in order to give
> some allocations precendence over the other allocations that are made
> to fail ....

You mean by allocating the reserve memory on the wrong node in NUMA?  
That is on a code path that avoids destroying your machine performance 
or killing the machine entirely as with current kernels, for which a 
few cachelines pulled to another node is a small price to pay.  And you 
are free to use your special expertise in NUMA to make those fallback 
paths even more efficient, but first you need to understand what they 
are doing and why.

At your service for any more questions :-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/10] foundations for reserve-based allocation
  2007-08-06 20:23 ` Matt Mackall
@ 2007-08-07  0:09   ` Daniel Phillips
  0 siblings, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-07  0:09 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Christoph Lameter,
	Lee Schermerhorn, Steve Dickson

Hi Matt,

On Monday 06 August 2007 13:23, Matt Mackall wrote:
> On Mon, Aug 06, 2007 at 12:29:22PM +0200, Peter Zijlstra wrote:
> > In the interrest of getting swap over network working and posting
> > in smaller series, here is the first series.
> >
> > This series lays the foundations needed to do reserve based
> > allocation. Traditionally we have used mempools (and others like
> > radix_tree_preload) to handle the problem.
> >
> > However this does not fit the network stack. It is built around
> > variable sized allocations using kmalloc().
> >
> > This calls for a different approach.
>
> One wonders if containers are a possible solution. I can already
> solve this problem with virtualization: have one VM manage all the
> network I/O and export the device as a simpler virtual block device
> to other VMs. Provided this VM isn't doing any "real" work and is
> sized appropriately, it won't get wedged. Since the other VMs now
> submit I/O through the simpler block interface, they can avoid
> getting wedged with the standard mempool approach.

The patch set posted today amounts to hardly any new bytes of code at 
all, and the other necessary bits coming down the pipe are not much 
heavier.  Compare to dedicating a whole VM to suppressing just one of 
the symptoms.  Much better just to cure the disease and save your VM 
for a more noble purpose.

> If we can run nbd and friends inside their own container that can
> give similar isolation, we might not need to add this other
> complexity.

This patch set fixes a severe problem that is so far unfixed in spite of 
a number of attempts, and does it in a simple way that translates into 
a small amount of code.  If it comes across as complex, that is a 
matter of presentation and tweaking.  Subtle, yes, but not complex.

> Just food for thought. I haven't looked closely enough at the
> containers implementations yet to determine whether this is possible
> or if the overhead in performance or complexity is acceptable.

Immensely more overhead than Peter's patches.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-06 23:49                         ` Daniel Phillips
@ 2007-08-07 22:18                           ` Christoph Lameter
  2007-08-08  7:24                             ` Peter Zijlstra
  2007-08-08  7:37                             ` Daniel Phillips
  0 siblings, 2 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-07 22:18 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Peter Zijlstra, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Mon, 6 Aug 2007, Daniel Phillips wrote:

> > AFAICT: This patchset is not throttling processes but failing
> > allocations.
> 
> Failing allocations?  Where do you see that?  As far as I can see, 
> Peter's patch set allows allocations to fail exactly where the user has 
> always specified they may fail, and in no new places.  If there is a 
> flaw in that logic, please let us know.

See the code added to slub: allocations are either satisfied via the 
reserve path or they fail.

> > The patchset does not reconfigure the memory reserves as 
> > expected.
> 
> What do you mean by that?  Expected by who?

What would be expected is some recalculation of min_free_kbytes.

> > And I suspect that we  
> > have the same issues as in earlier releases with various corner cases
> > not being covered.
> 
> Do you have an example?

Try NUMA constraints and zone limitations.
 
> > Code is added that is supposedly not used.
> 
> What makes you think that?

Because the argument is that performance does not matter since the code 
paths are not used.

> > If it  ever is on a large config then we are in very deep trouble by
> > the new code paths themselves that serialize things in order to give
> > some allocations precendence over the other allocations that are made
> > to fail ....
> 
> You mean by allocating the reserve memory on the wrong node in NUMA?  

No, I mean all 1024 processors of our system running into this fail/succeed 
thingy that was added.

> That is on a code path that avoids destroying your machine performance 
> or killing the machine entirely as with current kernels, for which a 

As far as I know from our systems: The current kernels do not kill the 
machine if the reserves are configured the right way.

> few cachelines pulled to another node is a small price to pay.  And you 
> are free to use your special expertise in NUMA to make those fallback 
> paths even more efficient, but first you need to understand what they 
> are doing and why.

There is your problem. The justification is not clear at all and the 
solution likely causes unrelated problems.


 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-06 10:29 ` [PATCH 04/10] mm: slub: add knowledge of reserve pages Peter Zijlstra
@ 2007-08-08  0:13   ` Christoph Lameter
  2007-08-08  1:44     ` Matt Mackall
  2007-08-20  7:38   ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-08  0:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Pekka Enberg, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it.

Is this patch actually necessary?

If you are in an atomic context and bound to a cpu then a per cpu slab is 
assigned to you and no one else can take object aways from that process 
since nothing else can run on the cpu. The point of the patch is to avoid 
other processes draining objects right? So you do not need the 
modifications in that case.

If you are not in an atomic context and are preemptable or can switch 
allocation context then you can create another context in which reclaim 
could be run to remove some clean pages and get you more memory. Again no 
need for the patch.

I guess you may be limited in not being able to call into reclaim again 
because it is already running. Maybe that can be fixed? F.e. zone reclaim 
does that for the NUMA case. It simply scans for easily reclaimable pages.

We could guarantee easily reclaimable pages to exist in much larger 
numbers than the reserves of min_free_kbytes. So in a tight spot one could 
reclaim from those.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08  0:13   ` Christoph Lameter
@ 2007-08-08  1:44     ` Matt Mackall
  2007-08-08 17:13       ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Matt Mackall @ 2007-08-08  1:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Lee Schermerhorn,
	Steve Dickson

On Tue, Aug 07, 2007 at 05:13:52PM -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
> 
> > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > contexts that are entitled to it.
> 
> Is this patch actually necessary?
>
 > If you are in an atomic context and bound to a cpu then a per cpu slab is 
> > assigned to you and no one else can take objects away from that process 
> since nothing else can run on the cpu.

Servicing I/O over the network requires an allocation to send a buffer
and an allocation to later receive the acknowledgement. We can't free
our send buffer (or the memory it's supposed to clean) until the
relevant ack is received. We have to hold our reserves privately
throughout, even if an interrupt that wants to do GFP_ATOMIC
allocation shows up in-between.

> If you are not in an atomic context and are preemptable or can switch 
> allocation context then you can create another context in which reclaim 
> could be run to remove some clean pages and get you more memory. Again no 
> need for the patch.

By the point that this patch is relevant, there are already no clean
pages. The only way to free up more memory is via I/O.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-07 22:18                           ` Christoph Lameter
@ 2007-08-08  7:24                             ` Peter Zijlstra
  2007-08-08 18:06                               ` Christoph Lameter
  2007-08-08  7:37                             ` Daniel Phillips
  1 sibling, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-08  7:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

[-- Attachment #1: Type: text/plain, Size: 6007 bytes --]

On Tue, 2007-08-07 at 15:18 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> 
> > > AFAICT: This patchset is not throttling processes but failing
> > > allocations.
> > 
> > Failing allocations?  Where do you see that?  As far as I can see, 
> > Peter's patch set allows allocations to fail exactly where the user has 
> > always specified they may fail, and in no new places.  If there is a 
> > flaw in that logic, please let us know.
> 
> See the code added to slub: Allocations are satisfied from the reserve 
> path or they are failing.

Allocations are satisfied from the reserve IFF the allocation context is
entitled to the reserve, otherwise it will try to allocate a new slab.
And that is exactly like any other low mem situation, allocations do
fail under pressure, but not more so with this patch.

> > > The patchset does not reconfigure the memory reserves as 
> > > expected.
> > 
> > What do you mean by that?  Expected by who?
> 
> What would be expected is some recalculation of min_free_kbytes?

And I have been telling you:

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;

so with both flags the watermark drops to 3/8 of min_free_kbytes (min/2 after
ALLOC_HIGH, then minus a quarter of that for ALLOC_HARDER); that is still a
relative level, which is why a strict, fixed limit is required.


> > > Code is added that is supposedly not used.
> > 
> > What makes you think that?
> 
> Because the argument is that performance does not matter since the code 
> paths are not used.

The argument is that once you hit these code paths, you don't much care
for performance, not the other way around.

> > And I suspect that we  
> > have the same issues as in earlier releases with various corner cases
> > not being covered.
> > 
> > Do you have an example?
> 
> Try NUMA constraints and zone limitations.

How exactly are these failing?  Breaking out of the policy boundaries
in the reserve path is no different from IRQ context allocations not
honouring them.

> > > If it  ever is on a large config then we are in very deep trouble by
> > > the new code paths themselves that serialize things in order to give
> > > some allocations precedence over the other allocations that are made
> > > to fail ....
> > 
> > You mean by allocating the reserve memory on the wrong node in NUMA?  
> 
> No I mean all 1024 processors of our system running into this fail/succeed 
> thingy that was added.

You mean to say, 1024 cpus running into the kmem_cache-wide reserve
slab?  Yeah, if that happens it will hurt.

I just instrumented the kernel to find the allocations done under
PF_MEMALLOC for a heavy (non-swapping) reclaim load:

localhost ~ # cat /proc/ip_trace 
79	[<000000006001c140>] do_ubd_request+0x97/0x176
1	[<000000006006dfef>] allocate_slab+0x44/0x9a
165	[<00000000600e75d7>] new_handle+0x1d/0x47
1	[<00000000600ee897>] __jbd_kmalloc+0x18/0x1a
141	[<00000000600eeb2d>] journal_alloc_journal_head+0x15/0x6f
1	[<0000000060136907>] current_io_context+0x38/0x95
1	[<0000000060139634>] alloc_as_io_context+0x18/0x95

this is from about 15 mins of load-5 trashing file backed reclaim.

So sadly I was mistaken, you will run into it, albeit from the numbers
it's not very frequent. (and here I thought that path was fully covered
with mempools)

> > That is on a code path that avoids destroying your machine performance 
> > or killing the machine entirely as with current kernels, for which a 
> 
> As far as I know from our systems: The current kernels do not kill the 
> machine if the reserves are configured the right way.

AFAIK the current kernels do not allow for swap over nfs and a bunch of
other things.

The scenario I'm interested in is (two machines: client [A]/server [B]):

 1) networked machine [A] runs swap-over-net (NFS/iSCSI)
 2) someone trips over the NFS/iSCSI server's [B] ethernet cable
   / someone does /etc/init.d/nfs stop

 3) the machine [A] stops dead in its tracks in the middle of reclaim
 4) the machine [A] keeps on receiving unrelated network traffic for
    whatever other purpose the machine served

   - time passes -

 5) cable gets reconnected / nfs server started [B]
 6) client [A] succeeds with the reconnect and happily continues its
    swapping life.

Between 4-5 it needs to receive and discard an unspecified amount of
network traffic while user-space is basically frozen solid.

FWIW this scenario works here.

> > few cachelines pulled to another node is a small price to pay.  And you 
> > are free to use your special expertise in NUMA to make those fallback 
> > paths even more efficient, but first you need to understand what they 
> > are doing and why.
> 
> There is your problem. The justification is not clear at all and the 
> solution likely causes unrelated problems.

The situation I'm wanting to avoid is during 3), the machine will slowly
but steadily freeze over as userspace allocations start to block on
outstanding swap-IO.

Since I need a fixed reserve to operate 4), I do not want these slowly
freezing processes to 'accidentally' gobble up slabs allocated from the
reserve for more important matters.

So what the modification to the slab allocator does is:

 - if there is a reserve slab:
   - if the context allows access to the reserves
     - serve object
   - otherwise try to allocate a new slab
     - if this succeeds
       - remove the reserve slab since clearly the pressure is gone
       - serve from the new slab
     - otherwise fail

Now, that 'fail' part may scare you; note, however, that __GFP_WAIT
allocations will stick around in the allocation path for a while, just
like in regular low memory situations.
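
For reference, a minimal sketch of that decision flow (the helper names
below are made up purely for illustration, this is not the actual slub
code):

/* illustrative pseudocode of the slow path decision described above */
void *alloc_object(struct kmem_cache *s, gfp_t gfpflags)
{
	if (cpu_slab_is_reserve(s)) {			/* hypothetical helper */
		if (context_may_use_reserves(gfpflags))	/* PF_MEMALLOC / __GFP_MEMALLOC */
			return serve_from_cpu_slab(s);	/* entitled: serve object */

		if (try_new_slab(s, gfpflags)) {	/* pressure has lifted */
			clear_cpu_slab_reserve(s);	/* unmark the reserve slab */
			return serve_from_cpu_slab(s);
		}
		return NULL;		/* fail, like any other low memory alloc */
	}
	return serve_from_cpu_slab(s);			/* normal path untouched */
}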


Anyway, I'm in a bit of a mess having proven even file backed reclaim
hits these paths.

 1) I'm reluctant to add #ifdef in core allocation paths
 2) I'm reluctant to add yet another NR_CPUS array to struct kmem_cache

Christoph, does this all explain the situation?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-07 22:18                           ` Christoph Lameter
  2007-08-08  7:24                             ` Peter Zijlstra
@ 2007-08-08  7:37                             ` Daniel Phillips
  2007-08-08 18:09                               ` Christoph Lameter
  1 sibling, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-08  7:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/7/07, Christoph Lameter <clameter@sgi.com> wrote:
> > > AFAICT: This patchset is not throttling processes but failing
> > > allocations.
> >
> > Failing allocations?  Where do you see that?  As far as I can see,
> > Peter's patch set allows allocations to fail exactly where the user has
> > always specified they may fail, and in no new places.  If there is a
> > flaw in that logic, please let us know.
>
> See the code added to slub: Allocations are satisfied from the reserve
> path or they are failing.

First off, what I know about this patch set is, it works.  Without it,
ddsnap deadlocks on network IO, and with it, no deadlock.  We are
niggling over how it works, not whether it works, or whether it is
needed.

Next a confession: I have not yet applied this particular patch and
read the resulting file.  However, the algorithm it is supposed to
implement with respect to slab is:

  1. If the allocation can be satisfied in the usual way, do that.
  2. Otherwise, if the GFP flags do not include __GFP_MEMALLOC and
PF_MEMALLOC is not set, fail the allocation
  3. Otherwise, if the memcache's reserve quota is not reached,
satisfy the request, allocating a new page from the MEMALLOC reserve,
bump the memcache's reserve counter and succeed
  4. Die.  We are the walking dead.  Do whatever we feel like, for
example fail the allocation.  This is a kernel bug; if we are lucky,
enough of the kernel will remain running to get some diagnostics.

If it does not implement that algorithm, please shout.
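
In code form, the algorithm above would look roughly like this (the
reserve_used/reserve_quota fields and the helpers are hypothetical names,
only there to illustrate the four steps):

/* sketch only; not the actual slub code */
void *memcache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
	void *object = try_usual_alloc(s, gfpflags);		/* step 1 */
	if (object)
		return object;

	if (!(gfpflags & __GFP_MEMALLOC) &&
	    !(current->flags & PF_MEMALLOC))
		return NULL;					/* step 2 */

	if (s->reserve_used < s->reserve_quota) {		/* step 3 */
		s->reserve_used++;
		return alloc_from_memalloc_reserve(s, gfpflags);
	}

	WARN_ON(1);						/* step 4 */
	return NULL;
}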

> > > The patchset does not reconfigure the memory reserves as
> > > expected.
> >
> > What do you mean by that?  Expected by who?
>
> What would be expected is some recalculation of min_free_kbytes?

I still do not know exactly what you are talking about.  The patch set
provides a means of adjusting the global memalloc reserve when a
memalloc pool user starts or stops.  We could leave that part entirely
out of the patch set and just rely on the reserve being "big enough"
as we have done since the dawn of time.  That would leave less to
niggle about and the reserve adjustment mechanism could  be submitted
later.  Would that be better?

> > > And I suspect that we
> > > have the same issues as in earlier releases with various corner cases
> > > not being covered.
> >
> > Do you have an example?
>
> Try NUMA constraints and zone limitations.

Are you worried about a correctness issue that would prevent the
machine from operating, or are you just worried about allocating
reserve pages to the local node for performance reasons?

> > > Code is added that is supposedly not used.
> >
> > What makes you think that?
>
> Because the argument is that performance does not matter since the code
> paths are not used.

Used, yes.   Maybe heavily, maybe not.  The point is, with current
kernels you would never get to these new code paths because your
machine would have slowed to a crawl or deadlocked.  So not having the
perfect NUMA implementation of the new memory reserves is perhaps a
new performance issue for NUMA, but it is no show stopper, and
according to design, existing code paths are not degraded by any
measurable extent.  Deadlock in current kernels is a showstopper, that
is what this patch set fixes.

Since there is no regression for NUMA here (there is not supposed to
be anyway) I do not think that adding more complexity to the patch set
to optimize NUMA in this corner case that you could never even get to
before is warranted.  That is properly a follow-on NUMA-specific
patch, analogous to a per-arch optimization.

> > > If it  ever is on a large config then we are in very deep trouble by
> > > the new code paths themselves that serialize things in order to give
> > > some allocations precedence over the other allocations that are made
> > > to fail ....
> >
> > You mean by allocating the reserve memory on the wrong node in NUMA?
>
> No I mean all 1024 processors of our system running into this fail/succeed
> thingy that was added.

If an allocation now fails that would have succeeded in the past, the
patch set is buggy.  I can't say for sure one way or another at this
time of night.  If you see something, could you please mention a
file/line number?

> > That is on a code path that avoids destroying your machine performance
> > or killing the machine entirely as with current kernels, for which a
>
> As far as I know from our systems: The current kernels do not kill the
> machine if the reserves are configured the right way.

Current kernels deadlock on a regular basis doing fancy block IO.  I
am not sure what you mean.

> > few cachelines pulled to another node is a small price to pay.  And you
> > are free to use your special expertise in NUMA to make those fallback
> > paths even more efficient, but first you need to understand what they
> > are doing and why.
>
> There is your problem. The justification is not clear at all and the
> solution likely causes unrelated problems.

Well I hope the justification is clear now.  Not deadlocking is a very
good thing, and we have a before and after test case.  Getting late
here, Peter's email shift starts now ;-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08  1:44     ` Matt Mackall
@ 2007-08-08 17:13       ` Christoph Lameter
  2007-08-08 17:39         ` Andrew Morton
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-08 17:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Pekka Enberg, Lee Schermerhorn,
	Steve Dickson

On Tue, 7 Aug 2007, Matt Mackall wrote:

>  > If you are in an atomic context and bound to a cpu then a per cpu slab is 
> > assigned to you and no one else can take objects away from that process 
> > since nothing else can run on the cpu.
> 
> Servicing I/O over the network requires an allocation to send a buffer
> and an allocation to later receive the acknowledgement. We can't free
> our send buffer (or the memory it's supposed to clean) until the
> relevant ack is received. We have to hold our reserves privately
> throughout, even if an interrupt that wants to do GFP_ATOMIC
> allocation shows up in-between.

If you can take an interrupt then you can move to a different allocation 
context. This means reclaim could free up more pages if we tell reclaim 
not to allocate any memory.
 
> > If you are not in an atomic context and are preemptable or can switch 
> > allocation context then you can create another context in which reclaim 
> > could be run to remove some clean pages and get you more memory. Again no 
> > need for the patch.
> 
> By the point that this patch is relevant, there are already no clean
> pages. The only way to free up more memory is via I/O.

That is never true. The dirty ratio limit limits the number of dirty pages 
in memory. There is always a large percentage of memory that is kept 
clean. Pages that are file backed and clean can be freed without any 
additional memory allocation. This is true for the executable code that 
you must have to execute any instructions. We could guarantee that the 
number of pages reclaimable without memory allocs stays above certain 
limits by checking VM counters.
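
As a rough sketch, such a check could look like the below; NR_FILE_PAGES
and NR_FILE_DIRTY are the existing vmstat counters, the threshold is an
arbitrary example value:

/* estimate what is freeable without doing any I/O or allocation */
static int enough_easily_reclaimable(void)
{
	unsigned long clean_file = global_page_state(NR_FILE_PAGES) -
				   global_page_state(NR_FILE_DIRTY);

	/* example threshold: twice the min_free_kbytes reserve, in pages */
	return clean_file > 2 * (min_free_kbytes >> (PAGE_SHIFT - 10));
}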

I think there are two ways to address this in a simpler way:

1. Allow recursive calls into reclaim. If we are in a PF_MEMALLOC context 
then we can still scan lru lists and free up memory of clean pages. Idea 
patch follows.

2. Make pageout figure out if the write action requires actual I/O 
submission. If so then the submission will *not* immediately free memory 
and we have to wait for I/O to complete. In that case do not immediately
initiate I/O (which would not free up memory, and it's bad to initiate 
I/O when we do not have enough free memory) but put all those pages on a 
pageout list. When reclaim has reclaimed enough memory then go through the 
pageout list and trigger I/O. That can be done without PF_MEMALLOC so that 
additional reclaim could be triggered as needed. Maybe we can just get rid 
of PF_MEMALLOC and some of the contorted code around it?




Recursive reclaim concept patch:

---
 include/linux/swap.h |    2 ++
 mm/page_alloc.c      |   11 +++++++++++
 mm/vmscan.c          |   27 +++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h	2007-08-08 04:31:06.000000000 -0700
+++ linux-2.6/include/linux/swap.h	2007-08-08 04:31:28.000000000 -0700
@@ -190,6 +190,8 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zone **zones, int order,
 					gfp_t gfp_mask);
+extern unsigned long emergency_free_pages(struct zone **zones, int order,
+					gfp_t gfp_mask);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-08 04:17:33.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-08 04:39:26.000000000 -0700
@@ -1306,6 +1306,17 @@ nofail_alloc:
 				zonelist, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
+
+			/*
+			 * We cannot go into full synchronous reclaim
+			 * but we can still scan for easily reclaimable
+			 * pages.
+			 */
+			if (p->flags & PF_MEMALLOC &&
+				emergency_free_pages(zonelist->zones, order,
+								gfp_mask))
+				goto nofail_alloc;
+
 			if (gfp_mask & __GFP_NOFAIL) {
 				congestion_wait(WRITE, HZ/50);
 				goto nofail_alloc;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-08 04:21:14.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-08 04:42:24.000000000 -0700
@@ -1204,6 +1204,33 @@ out:
 }
 
 /*
+ * Emergency reclaim. We are already in the vm write out path
+ * and we have exhausted all memory. We have to free memory without
+ * any additional allocations. So no writes and no swap. Get
+ * as bare bones as we can.
+ */
+unsigned long emergency_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+{
+	int priority;
+	unsigned long nr_reclaimed = 0;
+	struct scan_control sc = {
+		.gfp_mask = gfp_mask,
+		.swap_cluster_max = SWAP_CLUSTER_MAX,
+		.order = order,
+	};
+
+	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		sc.nr_scanned = 0;
+		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		if (nr_reclaimed >= sc.swap_cluster_max)
+			return 1;
+	}
+
+	/* top priority shrink_caches still had more to do? don't OOM, then */
+	return sc.all_unreclaimable;
+}
+
+/*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08 17:13       ` Christoph Lameter
@ 2007-08-08 17:39         ` Andrew Morton
  2007-08-08 17:57           ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Andrew Morton @ 2007-08-08 17:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Daniel Phillips, Pekka Enberg, Lee Schermerhorn,
	Steve Dickson

On Wed, 8 Aug 2007 10:13:05 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> I think there are two ways to address this in a simpler way:
> 
> 1. Allow recursive calls into reclaim. If we are in a PF_MEMALLOC context 
> then we can still scan lru lists and free up memory of clean pages. Idea 
> patch follows.
> 
> 2. Make pageout figure out if the write action requires actual I/O 
> submission. If so then the submission will *not* immediately free memory 
> and we have to wait for I/O to complete. In that case do not immediately
> initiate I/O (which would not free up memory, and it's bad to initiate 
> I/O when we do not have enough free memory) but put all those pages on a 
> pageout list. When reclaim has reclaimed enough memory then go through the 
> pageout list and trigger I/O. That can be done without PF_MEMALLOC so that 
> additional reclaim could be triggered as needed. Maybe we can just get rid 
> of PF_MEMALLOC and some of the contorted code around it?

3.  Perform page reclaim from hard IRQ context.  Pretty simple to
implement, most of the work would be needed in the rmap code.  It might be
better to make it opt-in via a new __GFP_flag.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08 17:39         ` Andrew Morton
@ 2007-08-08 17:57           ` Christoph Lameter
  2007-08-08 18:46             ` Andrew Morton
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-08 17:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matt Mackall, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Daniel Phillips, Pekka Enberg, Lee Schermerhorn,
	Steve Dickson

On Wed, 8 Aug 2007, Andrew Morton wrote:

> 3.  Perform page reclaim from hard IRQ context.  Pretty simple to
> implement, most of the work would be needed in the rmap code.  It might be
> better to make it opt-in via a new __GFP_flag.

In a hardirq context one is bound to a processor and through the slab 
allocator to a slab from which one allocates objects. The slab is per cpu 
and so the slab is reserved for the current context. Nothing can take 
objects away from it. The modifications here would not be needed for that 
context.

I think in general irq context reclaim is doable. Cannot see obvious 
issues on a first superficial pass through rmap.c. The irq holdoff would 
be pretty long though which may make it unacceptable.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-08  7:24                             ` Peter Zijlstra
@ 2007-08-08 18:06                               ` Christoph Lameter
  0 siblings, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-08 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Phillips, Matt Mackall, linux-kernel, linux-mm,
	David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg,
	Lee Schermerhorn, Steve Dickson

On Wed, 8 Aug 2007, Peter Zijlstra wrote:

> Christoph, does this all explain the situation?

Sort of. I am still very sceptical that this will work reliably. I'd 
rather look at an alternate solution like fixing reclaim. Could you have a 
look at Andrew's and my comments on the slub patch?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-08  7:37                             ` Daniel Phillips
@ 2007-08-08 18:09                               ` Christoph Lameter
  2007-08-09 18:41                                 ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-08 18:09 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Wed, 8 Aug 2007, Daniel Phillips wrote:

>   1. If the allocation can be satisfied in the usual way, do that.
>   2. Otherwise, if the GFP flags do not include __GFP_MEMALLOC and
> PF_MEMALLOC is not set, fail the allocation
>   3. Otherwise, if the memcache's reserve quota is not reached,
> satisfy the request, allocating a new page from the MEMALLOC reserve,
> bump the memcache's reserve counter and succeed

Maybe we need to kill PF_MEMALLOC....

> > Try NUMA constraints and zone limitations.
> 
> Are you worried about a correctness issue that would prevent the
> machine from operating, or are you just worried about allocating
> reserve pages to the local node for performance reasons?

I am worried that allocation constraints will make the approach incorrect. 
Because logically you must have distinct pools for each set of allocation 
constraints. Otherwise something will drain the precious reserve slab.

> > No I mean all 1024 processors of our system running into this fail/succeed
> > thingy that was added.
> 
> If an allocation now fails that would have succeeded in the past, the
> patch set is buggy.  I can't say for sure one way or another at this
> time of night.  If you see something, could you please mention a
> file/line number?

It seems that allocations fail that the reclaim logic should have taken 
care of letting succeed. Not good.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08 17:57           ` Christoph Lameter
@ 2007-08-08 18:46             ` Andrew Morton
  2007-08-10  1:54               ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Andrew Morton @ 2007-08-08 18:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, Peter Zijlstra, linux-kernel, linux-mm,
	David Miller, Daniel Phillips, Pekka Enberg, Lee Schermerhorn,
	Steve Dickson

On Wed, 8 Aug 2007 10:57:13 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> I think in general irq context reclaim is doable. Cannot see obvious 
> issues on a first superficial pass through rmap.c. The irq holdoff would 
> be pretty long though which may make it unacceptable.

The IRQ holdoff could be tremendous.  But if it is sufficiently infrequent
and if the worst effect is merely a network rx ring overflow then the tradeoff
might be a good one.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-08 18:09                               ` Christoph Lameter
@ 2007-08-09 18:41                                 ` Daniel Phillips
  2007-08-09 18:49                                   ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-09 18:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/8/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 8 Aug 2007, Daniel Phillips wrote:
> Maybe we need to kill PF_MEMALLOC....

Shrink_caches needs to be able to recurse into filesystems at least,
and for the duration of the recursion the filesystem must have
privileged access to reserves.  Consider the difficulty of handling
that with anything other than a process flag.

> > > Try NUMA constraints and zone limitations.
> >
> > Are you worried about a correctness issue that would prevent the
> > machine from operating, or are you just worried about allocating
> > reserve pages to the local node for performance reasons?
>
> I am worried that allocation constraints will make the approach incorrect.
> Because logically you must have distinct pools for each set of allocation 
> constraints. Otherwise something will drain the precious reserve slab.

To be precise, each user of the reserve must have bounded reserve
requirements, and a shared reserve must be at least as large as the
total of the bounds.
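
(Purely as an illustration: three reserve users with worst-case needs of
4, 8 and 16 pages can safely share a single pool of 28 pages; as long as
each stays within its own bound, none of them can starve the others.)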

Peter's contribution here was to modify the slab slightly so that
multiple slabs can share the global memalloc reserve.  Not needing to
pre-allocate pages into the reserve reduces complexity.  The total
size of the reserve can also be reduced by this pool sharing strategy
in theory, but the patch set does not attempt this.

Anyway, you are right that if allocation constraints are not
calculated and enforced on the user side then the global reserve may
drain and get the system into a very bad state.  That is why each user
on the block writeout path must be audited for reserve requirements
and throttled to a requirement no greater than its share of the global
pool.  This patch set does not provide an example of such block IO
throttling, but you can see one in ddsnap at the link I provided
earlier (throttle_sem).
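
A minimal sketch of that kind of throttle, assuming a counting semaphore
initialised to this user's share of the reserve (in units of in-flight
requests); this is not the actual ddsnap code:

/* sema_init(&throttle_sem, my_reserve_share) at setup time */
static struct semaphore throttle_sem;

static void submit_throttled(int rw, struct bio *bio)
{
	down(&throttle_sem);	/* block once our whole share is in flight */
	submit_bio(rw, bio);
}

/* called from the bio completion path */
static void throttled_io_done(void)
{
	up(&throttle_sem);	/* one unit of our reserve share is free again */
}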

> > > No I mean all 1024 processors of our system running into this fail/succeed
> > > thingy that was added.
> >
> > If an allocation now fails that would have succeeded in the past, the
> > patch set is buggy.  I can't say for sure one way or another at this
> > time of night.  If you see something, could you please mention a
> > file/line number?
>
> It seems that allocations fail that the reclaim logic should have taken
> care of letting succeed. Not good.

I do not see a case like that, if you can see one then we would like
to know.  We make a guarantee to each reserve user that the memory it
has reserved will be available immediately from the global memalloc
pool.  Bugs can break this, and we do not automatically monitor each
user just now to detect violations, but both are also the case with
the traditional memalloc paths.

In theory, we could reduce the size of the global memalloc pool by
including "easily freeable" memory in it.  This is just an
optimization and does not belong in this patch set, which fixes a
system integrity issue.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-09 18:41                                 ` Daniel Phillips
@ 2007-08-09 18:49                                   ` Christoph Lameter
  2007-08-10  0:17                                     ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-09 18:49 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Thu, 9 Aug 2007, Daniel Phillips wrote:

> On 8/8/07, Christoph Lameter <clameter@sgi.com> wrote:
> > On Wed, 8 Aug 2007, Daniel Phillips wrote:
> > Maybe we need to kill PF_MEMALLOC....
> 
> Shrink_caches needs to be able to recurse into filesystems at least,
> and for the duration of the recursion the filesystem must have
> privileged access to reserves.  Consider the difficulty of handling
> that with anything other than a process flag.

Shrink_caches needs to allocate memory? Hmmm... Maybe we can only limit 
the PF_MEMALLOC use.

> In theory, we could reduce the size of the global memalloc pool by
> including "easily freeable" memory in it.  This is just an
> optimization and does not belong in this patch set, which fixes a
> system integrity issue.

I think the main thing would be to fix reclaim to not do stupid things 
like triggering writeout early in the reclaim pass and to allow reentry 
into reclaim. The idea of memory pools always sounded strange to me given 
that you have a lot of memory in a zone that is reclaimable as needed.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-09 18:49                                   ` Christoph Lameter
@ 2007-08-10  0:17                                     ` Daniel Phillips
  2007-08-10  1:48                                       ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-10  0:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 9 Aug 2007, Daniel Phillips wrote:
> > On 8/8/07, Christoph Lameter <clameter@sgi.com> wrote:
> > > On Wed, 8 Aug 2007, Daniel Phillips wrote:
> > > Maybe we need to kill PF_MEMALLOC....
> > Shrink_caches needs to be able to recurse into filesystems at least,
> > and for the duration of the recursion the filesystem must have
> > privileged access to reserves.  Consider the difficulty of handling
> > that with anything other than a process flag.
>
> Shrink_caches needs to allocate memory? Hmmm... Maybe we can only limit
> the PF_MEMALLOC use.

PF_MEMALLOC is not such a bad thing.  It will usually be less code
than mempool for the same use case, besides being able to handle a
wider range of problems.  We introduce __GFP_MEMALLOC for situations 
where the need for reserve memory is locally known, as in the network
stack, which is similar or identical to the use case for mempool.  One
could reasonably ask why we need mempool with a lighter alternative
available.  But this is a case of to each their own I think.  Either
technique will work for reserve management.
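
To make the comparison concrete, a rough sketch of the two styles (the
sizes and flags are example values; __GFP_MEMALLOC is the flag added by
this series, the mempool calls are the existing interface):

/* set up with mempool_create_kmalloc_pool(16, 512) at init time */
static mempool_t *pool;

static void *alloc_mempool_style(void)
{
	/* draw from a private, pre-allocated per-user reserve */
	return mempool_alloc(pool, GFP_NOIO);
}

static void *alloc_reserve_flag_style(void)
{
	/* draw from the shared memalloc reserve when the context is entitled */
	return kmalloc(512, GFP_ATOMIC | __GFP_MEMALLOC);
}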

> > In theory, we could reduce the size of the global memalloc pool by
> > including "easily freeable" memory in it.  This is just an
> > optimization and does not belong in this patch set, which fixes a
> > system integrity issue.
>
> I think the main thing would be to fix reclaim to not do stupid things
> like triggering writeout early in the reclaim pass and to allow reentry
> into reclaim. The idea of memory pools always sounded strange to me given
> that you have a lot of memory in a zone that is reclaimable as needed.

You can fix reclaim as much as you want and the basic deadlock will
still not go away.  When you finally do get to writing something out,
memory consumers in the writeout path are going to cause problems,
which this patch set fixes.

Agreed that the idea of mempool always sounded strange, and we show
how to get rid of them, but that is not the immediate purpose of this
patch set.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10  0:17                                     ` Daniel Phillips
@ 2007-08-10  1:48                                       ` Christoph Lameter
  2007-08-10  3:34                                         ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-10  1:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Thu, 9 Aug 2007, Daniel Phillips wrote:

> You can fix reclaim as much as you want and the basic deadlock will
> still not go away.  When you finally do get to writing something out,
> memory consumers in the writeout path are going to cause problems,
> which this patch set fixes.

We currently also do *not* write out immediately. I/O is queued when 
submitted so it does *not* reduce memory. It is better to actually delay 
writeout until you have thrown out clean pages. At that point the free 
memory is at its high point. If memory goes below the high point again by 
these writes then we can again reclaim until things are right.

> Agreed that the idea of mempool always sounded strange, and we show
> how to get rid of them, but that is not the immediate purpose of this
> patch set.

Ok, mempools are unrelated. The allocation problems that this patch 
addresses can be fixed by making reclaim more intelligent. This may likely 
make mempools less of an issue in the kernel. If we can reclaim in an 
emergency even in ATOMIC contexts then things get much easier.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-08 18:46             ` Andrew Morton
@ 2007-08-10  1:54               ` Daniel Phillips
  2007-08-10  2:01                 ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-10  1:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Matt Mackall, Peter Zijlstra, linux-kernel,
	linux-mm, David Miller, Pekka Enberg

On 8/8/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 8 Aug 2007 10:57:13 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > I think in general irq context reclaim is doable. Cannot see obvious
> > issues on a first superficial pass through rmap.c. The irq holdoff would
> > be pretty long though which may make it unacceptable.
>
> The IRQ holdoff could be tremendous.  But if it is sufficiently infrequent
> and if the worst effect is merely a network rx ring overflow then the tradeoff
> might be a good one.

Hi Andrew,

No matter how you look at this problem, you still need to have _some_
sort of reserve, and limit access to it.  We extend existing methods,
you are proposing what seems like an entirely new reserve
management system.  Great idea, maybe, but it does not solve the
deadlocks.  You still need some organized way of being sure that your
reserve is as big as you need (hopefully not an awful lot bigger) and
you still have to make sure that nobody dips into that reserve further
than they are allowed to.

So translation: reclaim from "easily freeable" lists is an
optimization, maybe a great one.  Probably great.  Reclaim from atomic
context is also a great idea, probably. But you are talking about a
whole nuther patch set.  Neither of those are in themselves a fix for
these deadlocks.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-10  1:54               ` Daniel Phillips
@ 2007-08-10  2:01                 ` Christoph Lameter
  0 siblings, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-10  2:01 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, Matt Mackall, Peter Zijlstra, linux-kernel,
	linux-mm, David Miller, Pekka Enberg

On Thu, 9 Aug 2007, Daniel Phillips wrote:

> No matter how you look at this problem, you still need to have _some_
> sort of reserve, and limit access to it.  We extend existing methods,

The reserve is in the memory in the zone and reclaim can guarantee that 
there are a sufficient number of easily reclaimable pages in it.

> you are proposing what seems like an entirely new reserve

The reserve always has been managed by per zone counters. Nothing new 
there.

> management system.  Great idea, maybe, but it does not solve the
> deadlocks.  You still need some organized way of being sure that your
> reserve is as big as you need (hopefully not an awful lot bigger) and
> you still have to make sure that nobody dips into that reserve further
> than they are allowed to.

Nope, there is no need to have additional reserves. You delay the writeout 
until you are finished with reclaim. Then you do the writeout. During 
writeout reclaim may be called as needed. After the writeout is complete 
then you recheck the vm counters again to be sure that dirty ratio / 
easily reclaimable ratio and mem low / high boundaries are still okay. If not, go 
back to reclaim.
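
Spelled out as pseudocode (the function names are placeholders for the
scheme described above, not existing kernel interfaces):

static void reclaim_with_deferred_writeout(void)
{
	do {
		reclaim_clean_pages();		/* frees memory, needs no I/O */
		submit_deferred_pageout_list();	/* writes may allocate; reclaim
						   can be re-entered as needed */
	} while (!vm_counters_ok());		/* dirty ratio, easily reclaimable
						   ratio, low/high watermarks */
}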

> So translation: reclaim from "easily freeable" lists is an
> optimization, maybe a great one.  Probably great.  Reclaim from atomic
> context is also a great idea, probably. But you are talking about a
> whole nuther patch set.  Neither of those are in themselves a fix for
> these deadlocks.

Yes they are a much better fix and may allow code cleanup by getting rid 
of checks for PF_MEMALLOC. They integrate in a straightforward way 
into the existing reclaim methods.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10  1:48                                       ` Christoph Lameter
@ 2007-08-10  3:34                                         ` Daniel Phillips
  2007-08-10  3:48                                           ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-10  3:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> The allocation problems that this patch addresses can be fixed by making reclaim
> more intelligent.

If you believe that the deadlock problems we address here can be
better fixed by making reclaim more intelligent then please post a
patch and we will test it.  I am highly skeptical, but the proof is in
the patch.

> If we can reclaim in an emergency even in ATOMIC contexts then things get much
> easier.

It is already easy, and it is already fixed in this patch series.
Sure, we can pare these patches down a little more, but you are going
to have a really hard time coming up with something simpler that
actually works.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10  3:34                                         ` Daniel Phillips
@ 2007-08-10  3:48                                           ` Christoph Lameter
  2007-08-10  8:15                                             ` Daniel Phillips
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-10  3:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Thu, 9 Aug 2007, Daniel Phillips wrote:

> If you believe that the deadlock problems we address here can be
> better fixed by making reclaim more intelligent then please post a
> patch and we will test it.  I am highly skeptical, but the proof is in
> the patch.

Then please test the patch that I posted here earlier to reclaim even if 
PF_MEMALLOC is set. It may require some fixups but it should address your 
issues in most vm load situations.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10  3:48                                           ` Christoph Lameter
@ 2007-08-10  8:15                                             ` Daniel Phillips
  2007-08-10 17:46                                               ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-10  8:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/9/07, Christoph Lameter <clameter@sgi.com> wrote:
> > If you believe that the deadlock problems we address here can be
> > better fixed by making reclaim more intelligent then please post a
> > patch and we will test it.  I am highly skeptical, but the proof is in
> > the patch.
>
> Then please test the patch that I posted here earlier to reclaim even if
> PF_MEMALLOC is set. It may require some fixups but it should address your
> issues in most vm load situations.

It is quite clear what is in your patch.  Instead of just grabbing a
page off the buddy free lists in a critical allocation situation you
go invoke shrink_caches.  Why oh why?  All the memory needed to get
through these crunches is already sitting right there on the buddy
free lists, ready to be used, why would you go off scanning instead?
And this does not work in atomic contexts at all, that is a whole
thing you would have to develop, and why?  You just offered us
functionality that we already have, except your idea has issues.

You do not do anything to prevent mixing of ordinary slab allocations
of unknown duration with critical allocations of controlled duration.
This is _very important_ for sk_alloc.  How are you going to take
care of that?

In short, you provide a piece we don't need because we already have it
in a more efficient form, your approach does not work in atomic
context, and you need to solve the slab object problem.  You also need
integration with sk_alloc.   That is just what I noticed on a
once-over-lightly.  Your patch has a _long_ way to go before it is
ready to try.

We have already presented a patch set that is tested and is known to
solve the deadlocks.  This patch set has been more than two years in
development.  It covers problems you have not even begun to think
about, which we have been aware of for years.  Your idea is not
anywhere close to working.  Why don't you just work with us instead?
There are certainly improvements that can be made to the posted patch
set.  Running off and learning from scratch how to do this is not
really helpful.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10  8:15                                             ` Daniel Phillips
@ 2007-08-10 17:46                                               ` Christoph Lameter
  2007-08-10 23:25                                                 ` Daniel Phillips
  2007-08-13  6:55                                                 ` Daniel Phillips
  0 siblings, 2 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-10 17:46 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Fri, 10 Aug 2007, Daniel Phillips wrote:

> It is quite clear what is in your patch.  Instead of just grabbing a
> page off the buddy free lists in a critical allocation situation you
> go invoke shrink_caches.  Why oh why?  All the memory needed to get

Because we get to the code of interest when we have no memory on the 
buddy free lists and need to reclaim memory to fill them up again.

> You do not do anything to prevent mixing of ordinary slab allocations
> of unknown duration with critical allocations of controlled duration.
>  This  is _very important_ for sk_alloc.  How are you going to take
> care of that?

It is not necessary because you can reclaim memory as needed.

> There are certainly improvements that can be made to the posted patch
> set.  Running off and learning from scratch how to do this is not
> really helpful.

The idea of adding code to deal with "I have no memory" situations 
in a kernel that is based on having as much memory as possible in use at all 
times is plainly the wrong approach. If you need memory then memory needs 
to be reclaimed. That is the basic way that things work and following that 
through brings about a much less invasive solution without all the issues 
that the proposed solution creates.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10 17:46                                               ` Christoph Lameter
@ 2007-08-10 23:25                                                 ` Daniel Phillips
  2007-08-13  6:55                                                 ` Daniel Phillips
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Phillips @ 2007-08-10 23:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On 8/10/07, Christoph Lameter <clameter@sgi.com> wrote:
> The idea of adding code to deal with "I have no memory" situations
> in a kernel that is based on having as much memory as possible in use at all
> times is plainly the wrong approach.

No.  It is you who have read the patches wrongly, because what you
imply here is exactly backwards.

> If you need memory then memory needs
> to be reclaimed. That is the basic way that things work

Wrong.  A naive reading of your comment would suggest you do not
understand how PF_MEMALLOC works, and that it has worked that way from
day one (well, since long before I arrived) and that we just do more
of the same, except better.

> and following that
> through brings about a much less invasive solution without all the issues
> that the proposed solution creates.

What issues?  Test case please, a real one that you have run yourself.
 Please, no more theoretical issues that cannot be demonstrated in
practice because they do not exist.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-10 17:46                                               ` Christoph Lameter
  2007-08-10 23:25                                                 ` Daniel Phillips
@ 2007-08-13  6:55                                                 ` Daniel Phillips
  2007-08-13 23:04                                                   ` Christoph Lameter
  1 sibling, 1 reply; 85+ messages in thread
From: Daniel Phillips @ 2007-08-13  6:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Friday 10 August 2007 10:46, Christoph Lameter wrote:
> On Fri, 10 Aug 2007, Daniel Phillips wrote:
> > It is quite clear what is in your patch.  Instead of just grabbing
> > a page off the buddy free lists in a critical allocation situation
> > you go invoke shrink_caches.  Why oh why?  All the memory needed to
> > get
>
> Because we get to the code of interest when we have no memory on the
> buddy free lists...

Ah wait, that statement is incorrect and may well be the crux of your 
misunderstanding.  Buddy free lists are not exhausted until the entire 
memalloc reserve has been depleted, which would indicate a kernel bug 
and imminent system death.

> ...and need to reclaim memory to fill them up again. 

That we do, but we satisfy the allocations in the vm writeout path 
first, without waiting for shrink_caches to do its thing.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
  2007-08-13  6:55                                                 ` Daniel Phillips
@ 2007-08-13 23:04                                                   ` Christoph Lameter
  0 siblings, 0 replies; 85+ messages in thread
From: Christoph Lameter @ 2007-08-13 23:04 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, Peter Zijlstra, Matt Mackall, linux-kernel,
	linux-mm, David Miller, Andrew Morton, Daniel Phillips

On Sun, 12 Aug 2007, Daniel Phillips wrote:

> > Because we get to the code of interest when we have no memory on the
> > buddy free lists...
> 
> Ah wait, that statement is incorrect and may well be the crux of your 
> misunderstanding.  Buddy free lists are not exhausted until the entire 
> memalloc reserve has been depleted, which would indicate a kernel bug 
> and imminent system death.

I added the call to reclaim where the memory is out.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-06 10:29 ` [PATCH 04/10] mm: slub: add knowledge of reserve pages Peter Zijlstra
  2007-08-08  0:13   ` Christoph Lameter
@ 2007-08-20  7:38   ` Peter Zijlstra
  2007-08-20  7:43     ` Peter Zijlstra
  2007-08-20  9:12     ` Pekka J Enberg
  1 sibling, 2 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-20  7:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

[-- Attachment #1: Type: text/plain, Size: 6382 bytes --]

Ok, so I got rid of the global stuff, this also obsoletes 3/10.


---
Subject: mm: slub: add knowledge of reserve pages

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it.

Care is taken to only touch the SLUB slow path.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@sgi.com>
---
 mm/slub.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 69 insertions(+), 18 deletions(-)

Index: linux-2.6-2/mm/slub.c
===================================================================
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -20,11 +20,12 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1069,7 +1070,7 @@ static void setup_object(struct kmem_cac
 		s->ctor(object, s, 0);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -1087,6 +1088,7 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
@@ -1403,12 +1405,36 @@ static inline void flush_slab(struct kme
 }
 
 /*
+ * cpu slab reserve magic
+ *
+ * we mark reserve status in the lsb of the ->cpu_slab[] pointer.
+ */
+static inline unsigned long cpu_slab_reserve(struct kmem_cache *s, int cpu)
+{
+	return unlikely((unsigned long)s->cpu_slab[cpu] & 1);
+}
+
+static inline void
+cpu_slab_set(struct kmem_cache *s, int cpu, struct page *page, int reserve)
+{
+	if (unlikely(reserve))
+		page = (struct page *)((unsigned long)page | 1);
+
+	s->cpu_slab[cpu] = page;
+}
+
+static inline struct page *cpu_slab(struct kmem_cache *s, int cpu)
+{
+	return (struct page *)((unsigned long)s->cpu_slab[cpu] & ~1UL);
+}
+
+/*
  * Flush cpu slab.
  * Called from IPI handler with interrupts disabled.
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct page *page = s->cpu_slab[cpu];
+	struct page *page = cpu_slab(s, cpu);
 
 	if (likely(page))
 		flush_slab(s, page, cpu);
@@ -1457,10 +1483,22 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	int cpu = smp_processor_id();
+	int reserve = 0;
 
 	if (!page)
 		goto new_slab;
 
+	if (cpu_slab_reserve(s, cpu)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves
+		 * we must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto alloc_slab;
+		reserve = 1;
+	}
+
 	slab_lock(page);
 	if (unlikely(node != -1 && page_to_nid(page) != node))
 		goto another_slab;
@@ -1468,10 +1506,9 @@ load_freelist:
 	object = page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SlabDebug(page)))
+	if (unlikely(SlabDebug(page) || reserve))
 		goto debug;
 
-	object = page->freelist;
 	page->lockless_freelist = object[page->offset];
 	page->inuse = s->objects;
 	page->freelist = NULL;
@@ -1484,14 +1521,28 @@ another_slab:
 new_slab:
 	page = get_partial(s, gfpflags, node);
 	if (page) {
-		s->cpu_slab[cpu] = page;
+		cpu_slab_set(s, cpu, page, reserve);
 		goto load_freelist;
 	}
 
-	page = new_slab(s, gfpflags, node);
+alloc_slab:
+	page = new_slab(s, gfpflags, node, &reserve);
 	if (page) {
+		struct page *slab;
+
 		cpu = smp_processor_id();
-		if (s->cpu_slab[cpu]) {
+		slab = cpu_slab(s, cpu);
+
+		if (cpu_slab_reserve(s, cpu) && !reserve) {
+			/*
+			 * If the current cpu_slab is a reserve slab but we
+			 * managed to allocate a new slab the pressure is
+			 * lifted and we can unmark the current one.
+			 */
+			cpu_slab_set(s, cpu, slab, 0);
+		}
+
+		if (slab) {
 			/*
 			 * Someone else populated the cpu_slab while we
 			 * enabled interrupts, or we have gotten scheduled
@@ -1499,29 +1550,28 @@ new_slab:
 			 * requested node even if __GFP_THISNODE was
 			 * specified. So we need to recheck.
 			 */
-			if (node == -1 ||
-				page_to_nid(s->cpu_slab[cpu]) == node) {
+			if (node == -1 || page_to_nid(slab) == node) {
 				/*
 				 * Current cpuslab is acceptable and we
 				 * want the current one since its cache hot
 				 */
 				discard_slab(s, page);
-				page = s->cpu_slab[cpu];
+				page = slab;
 				slab_lock(page);
 				goto load_freelist;
 			}
 			/* New slab does not fit our expectations */
-			flush_slab(s, s->cpu_slab[cpu], cpu);
+			flush_slab(s, slab, cpu);
 		}
 		slab_lock(page);
 		SetSlabFrozen(page);
-		s->cpu_slab[cpu] = page;
+		cpu_slab_set(s, cpu, page, reserve);
 		goto load_freelist;
 	}
 	return NULL;
 debug:
-	object = page->freelist;
-	if (!alloc_debug_processing(s, page, object, addr))
+	if (SlabDebug(page) &&
+			!alloc_debug_processing(s, page, object, addr))
 		goto another_slab;
 
 	page->inuse++;
@@ -1548,7 +1598,7 @@ static void __always_inline *slab_alloc(
 	unsigned long flags;
 
 	local_irq_save(flags);
-	page = s->cpu_slab[smp_processor_id()];
+	page = cpu_slab(s, smp_processor_id());
 	if (unlikely(!page || !page->lockless_freelist ||
 			(node != -1 && page_to_nid(page) != node)))
 
@@ -1873,10 +1923,11 @@ static struct kmem_cache_node * __init e
 {
 	struct page *page;
 	struct kmem_cache_node *n;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+	page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
 
 	BUG_ON(!page);
 	n = page->freelist;
@@ -3189,7 +3240,7 @@ static unsigned long slab_objects(struct
 	per_cpu = nodes + nr_node_ids;
 
 	for_each_possible_cpu(cpu) {
-		struct page *page = s->cpu_slab[cpu];
+		struct page *page = cpu_slab(s, cpu);
 		int node;
 
 		if (page) {


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20  7:38   ` Peter Zijlstra
@ 2007-08-20  7:43     ` Peter Zijlstra
  2007-08-20  9:12     ` Pekka J Enberg
  1 sibling, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-20  7:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Miller, Andrew Morton, Daniel Phillips,
	Pekka Enberg, Christoph Lameter, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 2007-08-20 at 09:38 +0200, Peter Zijlstra wrote:
> Ok, so I got rid of the global stuff, this also obsoletes 3/10.
 
2/10 that is


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20  7:38   ` Peter Zijlstra
  2007-08-20  7:43     ` Peter Zijlstra
@ 2007-08-20  9:12     ` Pekka J Enberg
  2007-08-20  9:17       ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Pekka J Enberg @ 2007-08-20  9:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

Hi Peter,

On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
>  {

[snip]

> +	*reserve = page->reserve;

Any reason why the callers that are actually interested in this don't do 
page->reserve on their own?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20  9:12     ` Pekka J Enberg
@ 2007-08-20  9:17       ` Peter Zijlstra
  2007-08-20  9:28         ` Pekka Enberg
  0 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-20  9:17 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

[-- Attachment #1: Type: text/plain, Size: 948 bytes --]

On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> Hi Peter,
> 
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> > +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
> >  {
> 
> [snip]
> 
> > +	*reserve = page->reserve;
> 
> Any reason why the callers that are actually interested in this don't do 
> page->reserve on their own?

because new_slab() destroys the content?

struct page {
	...
	union {
		pgoff_t index;		/* Our offset within mapping. */
		void *freelist;		/* SLUB: freelist req. slab lock */
		int reserve;		/* page_alloc: page is a reserve page */
		atomic_t frag_count;	/* skb fragment use count */
	};
	...
};

static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
	...
	*reserve = page->reserve;
	...
	page->freelist = start;
	...
}


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20  9:17       ` Peter Zijlstra
@ 2007-08-20  9:28         ` Pekka Enberg
  2007-08-20 19:26           ` Christoph Lameter
  0 siblings, 1 reply; 85+ messages in thread
From: Pekka Enberg @ 2007-08-20  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, David Miller, Andrew Morton,
	Daniel Phillips, Christoph Lameter, Matt Mackall,
	Lee Schermerhorn, Steve Dickson

Hi Peter,

On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > Any reason why the callers that are actually interested in this don't do
> > page->reserve on their own?

On 8/20/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> because new_slab() destroys the content?

Right. So maybe we could move the initialization parts of new_slab()
to __new_slab() so that the callers that are actually interested in
'reserve' could do allocate_slab(), store page->reserve and do the rest of
the initialization with it?
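
Something like this, very roughly (the split and the names below are only a
sketch of what I mean, not actual SLUB code):

static struct page *__new_slab(struct kmem_cache *s, gfp_t flags,
			       struct page *page)
{
	/*
	 * Everything new_slab() does today after the page is obtained:
	 * node accounting, the setup_object() loop, page->freelist,
	 * page->inuse and the slab page flags.
	 */
	return page;
}

static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page = allocate_slab(s, flags, node);

	if (!page)
		return NULL;
	return __new_slab(s, flags, page);
}

A caller that cares about the reserve status could then do:

	page = allocate_slab(s, gfpflags, node);
	if (page) {
		reserve = page->reserve;	/* read before the union is reused */
		page = __new_slab(s, gfpflags, page);
	}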

As for the __GFP_WAIT handling, I *think* we can move the interrupt
enable/disable to allocate_slab()... Christoph?

                                       Pekka

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20  9:28         ` Pekka Enberg
@ 2007-08-20 19:26           ` Christoph Lameter
  2007-08-20 20:08             ` Peter Zijlstra
  0 siblings, 1 reply; 85+ messages in thread
From: Christoph Lameter @ 2007-08-20 19:26 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 20 Aug 2007, Pekka Enberg wrote:

> Hi Peter,
> 
> On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > > Any reason why the callers that are actually interested in this don't do
> > > page->reserve on their own?
> 
> On 8/20/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > because new_slab() destroys the content?
> 
> Right. So maybe we could move the initialization parts of new_slab()
> to __new_slab() so that the callers that are actually interested in
> 'reserve' could do allocate_slab(), store page->reserve and do the rest of
> the initialization with it?

I am still not convinced about this approach and there seems to be 
agreement that this is not working on large NUMA. So #ifdef it out? 
!CONFIG_NUMA? Some more general approach that does not rely on a single 
slab being a reserve?

The object is to check the alloc flags when having allocated a reserve 
slab right? Adding another flag SlabReserve and keying off on that one may 
be the easiest solution.
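
For illustration only (the bit and the helpers below are made up; a real
patch would have to find a free page flag):

#define PG_reserve_slab		PG_owner_priv_1	/* placeholder bit for the example */

static inline int SlabReserve(struct page *page)
{
	return test_bit(PG_reserve_slab, &page->flags);
}

static inline void SetSlabReserve(struct page *page)
{
	set_bit(PG_reserve_slab, &page->flags);
}

static inline void ClearSlabReserve(struct page *page)
{
	clear_bit(PG_reserve_slab, &page->flags);
}

and then __slab_alloc() could key off it:

	if (SlabReserve(page) &&
	    !(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
		goto alloc_slab;	/* re-test the watermarks */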

I have pending patches here that add per cpu structures. Those will make 
that job easier.

> As for the __GFP_WAIT handling, I *think* we can move the interrupt
> enable/disable to allocate_slab()... Christoph?

The reason the enable/disable is in new_slab() is to minimize interrupt 
holdoff time. If we move it into allocate_slab() then the slab preparation is 
done with interrupts disabled.
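
Roughly, the current layout is (schematic, not the literal code):

static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page;

	if (flags & __GFP_WAIT)
		local_irq_enable();	/* the page allocation may sleep */

	page = allocate_slab(s, flags, node);
	if (page) {
		/* object setup, freelist linking -- still with irqs enabled */
	}

	if (flags & __GFP_WAIT)
		local_irq_disable();
	return page;
}

Pushing the enable/disable down into allocate_slab() would shrink that window
to just the page allocation and leave the setup loop running with interrupts
off.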

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages
  2007-08-20 19:26           ` Christoph Lameter
@ 2007-08-20 20:08             ` Peter Zijlstra
  0 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2007-08-20 20:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-kernel, linux-mm, David Miller,
	Andrew Morton, Daniel Phillips, Matt Mackall, Lee Schermerhorn,
	Steve Dickson

On Mon, 2007-08-20 at 12:26 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Pekka Enberg wrote:
> 
> > Hi Peter,
> > 
> > On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > > > Any reason why the callers that are actually interested in this don't do
> > > > page->reserve on their own?
> > 
> > On 8/20/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > because new_slab() destroys the content?
> > 
> > Right. So maybe we could move the initialization parts of new_slab()
> > to __new_slab() so that the callers that are actually interested in
> > 'reserve' could do allocate_slab(), store page->reserve and do the rest of
> > the initialization with it?
> 
> I am still not convinced about this approach and there seems to be 
> agreement that this is not working on large NUMA. So #ifdef it out? 
> !CONFIG_NUMA? Some more general approach that does not rely on a single 
> slab being a reserve?

See the patch I sent earlier today?

> The object is to check the alloc flags when having allocated a reserve 
> slab right?

The initial idea was to make each slab allocation respect the watermarks
like page allocation does (the page rank thingies, if you remember).
That is, if the slab is allocated from below the ALLOC_MIN|ALLOC_HARDER
threshold, an ALLOC_MIN|ALLOC_HIGH allocation would get memory, but an
ALLOC_MIN allocation would not.

Now, we only needed the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER <->
ALLOC_NO_WATERMARKS transition and hence fell back to a binary system
that is not quite fair wrt all the other levels but suffices for the
problem at hand.

So we want to ensure that slab allocations that are _not_ entitled to
ALLOC_NO_WATERMARKS memory will not get objects when a page allocation
with the same right would fail, even if there is a slab present.
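
In pseudo-C, the per-level idea would have looked something like this
(alloc_rank() and page_rank are made up for the example; the posted patch
only keeps the rank-0 case):

/* lower rank = deeper access into the reserves */
static int alloc_rank(int alloc_flags)
{
	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return 0;
	if (alloc_flags & ALLOC_HARDER)
		return 1;
	if (alloc_flags & ALLOC_HIGH)
		return 2;
	return 3;
}

	/*
	 * A slab would remember the rank it was allocated at; an object
	 * allocation is only served from it when the current context has
	 * an equal or deeper rank, otherwise we go back to the page
	 * allocator to re-test the watermarks.
	 */
	if (alloc_rank(gfp_to_alloc_flags(gfpflags)) > page_rank)
		goto alloc_slab;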

>  Adding another flag SlabReserve and keying off on that one may 
> be the easiest solution.

Trouble with something like that is that page flags are persistent; you'd
need to clear them on all slabs when the reserve status flips -> O(n) -> unwanted.

> I have pending patches here that add per cpu structures. Those will make 
> that job easier.

Yeah, I've seen earlier versions of those.


^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2007-08-20 20:09 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-06 10:29 [PATCH 00/10] foundations for reserve-based allocation Peter Zijlstra
2007-08-06 10:29 ` [PATCH 01/10] mm: gfp_to_alloc_flags() Peter Zijlstra
2007-08-06 10:29 ` [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
2007-08-06 18:11   ` Christoph Lameter
2007-08-06 18:21     ` Daniel Phillips
2007-08-06 18:31       ` Peter Zijlstra
2007-08-06 18:43         ` Daniel Phillips
2007-08-06 19:11         ` Christoph Lameter
2007-08-06 19:31           ` Peter Zijlstra
2007-08-06 20:12             ` Christoph Lameter
2007-08-06 18:42       ` Christoph Lameter
2007-08-06 18:48         ` Daniel Phillips
2007-08-06 18:51           ` Christoph Lameter
2007-08-06 19:15             ` Daniel Phillips
2007-08-06 20:12             ` Matt Mackall
2007-08-06 20:19               ` Christoph Lameter
2007-08-06 20:26                 ` Peter Zijlstra
2007-08-06 21:05                   ` Christoph Lameter
2007-08-06 22:59                     ` Daniel Phillips
2007-08-06 23:14                       ` Christoph Lameter
2007-08-06 23:49                         ` Daniel Phillips
2007-08-07 22:18                           ` Christoph Lameter
2007-08-08  7:24                             ` Peter Zijlstra
2007-08-08 18:06                               ` Christoph Lameter
2007-08-08  7:37                             ` Daniel Phillips
2007-08-08 18:09                               ` Christoph Lameter
2007-08-09 18:41                                 ` Daniel Phillips
2007-08-09 18:49                                   ` Christoph Lameter
2007-08-10  0:17                                     ` Daniel Phillips
2007-08-10  1:48                                       ` Christoph Lameter
2007-08-10  3:34                                         ` Daniel Phillips
2007-08-10  3:48                                           ` Christoph Lameter
2007-08-10  8:15                                             ` Daniel Phillips
2007-08-10 17:46                                               ` Christoph Lameter
2007-08-10 23:25                                                 ` Daniel Phillips
2007-08-13  6:55                                                 ` Daniel Phillips
2007-08-13 23:04                                                   ` Christoph Lameter
2007-08-06 20:27                 ` Andrew Morton
2007-08-06 23:16                   ` Daniel Phillips
2007-08-06 22:47                 ` Daniel Phillips
2007-08-06 10:29 ` [PATCH 03/10] mm: tag reserve pages Peter Zijlstra
2007-08-06 18:11   ` Christoph Lameter
2007-08-06 18:13     ` Daniel Phillips
2007-08-06 18:28     ` Peter Zijlstra
2007-08-06 19:34     ` Andi Kleen
2007-08-06 18:43       ` Christoph Lameter
2007-08-06 18:47         ` Peter Zijlstra
2007-08-06 18:59           ` Andi Kleen
2007-08-06 19:09             ` Christoph Lameter
2007-08-06 19:10             ` Andrew Morton
2007-08-06 19:16               ` Christoph Lameter
2007-08-06 19:38               ` Matt Mackall
2007-08-06 20:18               ` Andi Kleen
2007-08-06 10:29 ` [PATCH 04/10] mm: slub: add knowledge of reserve pages Peter Zijlstra
2007-08-08  0:13   ` Christoph Lameter
2007-08-08  1:44     ` Matt Mackall
2007-08-08 17:13       ` Christoph Lameter
2007-08-08 17:39         ` Andrew Morton
2007-08-08 17:57           ` Christoph Lameter
2007-08-08 18:46             ` Andrew Morton
2007-08-10  1:54               ` Daniel Phillips
2007-08-10  2:01                 ` Christoph Lameter
2007-08-20  7:38   ` Peter Zijlstra
2007-08-20  7:43     ` Peter Zijlstra
2007-08-20  9:12     ` Pekka J Enberg
2007-08-20  9:17       ` Peter Zijlstra
2007-08-20  9:28         ` Pekka Enberg
2007-08-20 19:26           ` Christoph Lameter
2007-08-20 20:08             ` Peter Zijlstra
2007-08-06 10:29 ` [PATCH 05/10] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-08-06 10:29 ` [PATCH 06/10] mm: kmem_estimate_pages() Peter Zijlstra
2007-08-06 10:29 ` [PATCH 07/10] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2007-08-06 10:29 ` [PATCH 08/10] mm: serialize access to min_free_kbytes Peter Zijlstra
2007-08-06 10:29 ` [PATCH 09/10] mm: emergency pool Peter Zijlstra
2007-08-06 10:29 ` [PATCH 10/10] mm: __GFP_MEMALLOC Peter Zijlstra
2007-08-06 17:35 ` [PATCH 00/10] foundations for reserve-based allocation Daniel Phillips
2007-08-06 18:17   ` Peter Zijlstra
2007-08-06 18:40     ` Daniel Phillips
2007-08-06 19:31     ` Daniel Phillips
2007-08-06 19:36       ` Peter Zijlstra
2007-08-06 19:53         ` Daniel Phillips
2007-08-06 17:56 ` Christoph Lameter
2007-08-06 18:33   ` Peter Zijlstra
2007-08-06 20:23 ` Matt Mackall
2007-08-07  0:09   ` Daniel Phillips
