linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/28] Swap over NFS -v16
@ 2008-02-20 14:46 Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 01/28] mm: gfp_to_alloc_flags() Peter Zijlstra
                   ` (28 more replies)
  0 siblings, 29 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

Hi,

Another posting of the full swap over NFS series. 

Andrew/Linus, could we start thinking of sticking this in -mm?

[ patches against 2.6.25-rc2-mm1, also to be found online at:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.25-rc2-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


  Part 1, patches 1-11

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the added problem that the network subsystem does not use
fixed-size allocations, but relies heavily on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework. This framework
could also be used to get rid of some of the __GFP_NOFAIL users.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it currently doesn't do SLOB.
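
To illustrate the intended use, here is a rough sketch against the reserve API
introduced in patch 10 (the mem_reserve_* names come from the new
include/linux/reserve.h; my_cache and the my_* helpers are made up for this
example, and the charge helpers are assumed to return non-zero on success, as
the reserve code documents). The __GFP_MEMALLOC flag is added in patch 9:

    #include <linux/reserve.h>
    #include <linux/slab.h>

    static struct kmem_cache *my_cache;	/* created elsewhere */
    static struct mem_reserve my_reserve;

    static int __init my_reserve_setup(void)
    {
        mem_reserve_init(&my_reserve, "my subsystem reserve",
                         &mem_reserve_root);
        /* back up to 1024 objects of my_cache with reserve pages */
        return mem_reserve_kmem_cache_set(&my_reserve, my_cache, 1024);
    }

    static void *my_alloc_object(void)
    {
        void *obj;

        /* charge one object; failure means the reserve is exhausted */
        if (!mem_reserve_kmem_cache_charge(&my_reserve, 1, 0))
            return NULL;

        obj = kmem_cache_alloc(my_cache, GFP_ATOMIC | __GFP_MEMALLOC);
        if (!obj)
            mem_reserve_kmem_cache_charge(&my_reserve, -1, 0);

        return obj;
    }

    static void my_free_object(void *obj)
    {
        kmem_cache_free(my_cache, obj);
        mem_reserve_kmem_cache_charge(&my_reserve, -1, 0);
    }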

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: sl[au]b: add knowledge of reserve pages
 4 - mm: kmem_estimate_pages()
 5 - mm: allow PF_MEMALLOC from softirq context
 6 - mm: serialize access to min_free_kbytes
 7 - mm: emergency pool
 8 - mm: system wide ALLOC_NO_WATERMARK
 9 - mm: __GFP_MEMALLOC
10 - mm: memory reserve management
11 - selinux: tag avc cache alloc as non-critical


  Part 2, patches 12-14

Provide some generic network infrastructure needed later on.

12 - net: wrap sk->sk_backlog_rcv()
13 - net: packet split receive api
14 - net: sk_allocation() - concentrate socket related allocations


  Part 3, patches 15-21

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive paths require memory allocations.

That is, in the BIO layer writeback completion is usually just an ISR flipping
a bit and waking stuff up. A network writeback completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet arrives within the window
that memory buys us.

The solution to this problem lies in the fact that the network is assumed to be
lossy. Even now, when there is no memory to receive packets, the network card
has to discard packets. What we do is move that decision into the network stack.

So we reserve a little pool to act as a receive buffer; this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (which
would have been dropped anyway). [ NOTE: this is a stable mode of operation with
limited memory usage, exactly the kind of thing we need ]

Again, care is taken to confine most of the overhead to the slow path: only
packets allocated from the reserves suffer the extra atomic overhead needed for
accounting.
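
In (hedged) pseudo-C, the receive-side decision amounts to roughly the
following; the helper names here are invented for illustration, the real hooks
live in patches 15-21:

    /*
     * Sketch only: keep a packet that was built from reserve memory only
     * if it can contribute to writeback progress (e.g. it is headed for
     * the socket backing the swap file); otherwise drop it, which is what
     * the NIC would have done anyway had there been no reserve.
     */
    static bool keep_emergency_skb(struct sk_buff *skb, struct sock *sk)
    {
        if (!skb_came_from_reserve(skb))	/* normal memory: keep */
            return true;
        if (sk && sock_serves_memalloc(sk))	/* aids writeback: keep */
            return true;
        return false;				/* toss, as before */
    }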

15 - netvm: network reserve infrastructure
16 - netvm: INET reserves.
17 - netvm: hook skb allocation to reserves
18 - netvm: filter emergency skbs.
19 - netvm: prevent a TCP specific deadlock
20 - netfilter: NF_QUEUE vs emergency skbs
21 - netvm: skb processing


  Part 4, patches 22-23

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

This provides new a_ops to handle swapcache pages and could be used to obsolete
the bmap usage for swapfiles.
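
A hedged sketch of what the filesystem hook-up could look like; the member
names below are assumptions made for illustration, the actual methods are
introduced in patches 22 and 23:

    /* sketch only: hypothetical a_ops extensions for a swap-capable fs */
    static int myfs_swapfile(struct address_space *mapping, int enable);
    static int myfs_swap_out(struct file *file, struct page *page,
                             struct writeback_control *wbc);
    static int myfs_swap_in(struct file *file, struct page *page);

    static const struct address_space_operations myfs_aops = {
        /* ... the usual readpage/writepage methods ... */
        .swapfile = myfs_swapfile,	/* claim/release at swapon/swapoff */
        .swap_out = myfs_swap_out,	/* write a PG_swapcache page */
        .swap_in  = myfs_swap_in,	/* read a PG_swapcache page back */
    };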

22 - mm: add support for non block device backed swap files
23 - mm: methods for teaching filesystems about PG_swapcache pages


  Part 5, patches 24-28

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

24 - nfs: remove mempools
25 - nfs: teach the NFS client how to treat PG_swapcache pages
26 - nfs: disable data cache revalidation for swapfiles
27 - nfs: enable swap on NFS
28 - nfs: fix various memory recursions possible with swap over NFS.


Changes since -v15:
 - fwd port
 - more SGE fragment drivers ported
 - made the new swapfile logic unconditional
 - various bug fixes and cleanups




* [PATCH 01/28] mm: gfp_to_alloc_flags()
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 02/28] mm: tag reserve pages Peter Zijlstra
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-gfp-to-alloc_flags.patch --]
[-- Type: text/plain, Size: 5547 bytes --]

Factor out the gfp to alloc_flags mapping so it can be used in other places.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   11 ++++++
 mm/page_alloc.c |   98 ++++++++++++++++++++++++++++++++------------------------
 2 files changed, 67 insertions(+), 42 deletions(-)

Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -47,4 +47,15 @@ static inline unsigned long page_order(s
 	VM_BUG_ON(!PageBuddy(page));
 	return page_private(page);
 }
+
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 #endif
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1127,14 +1127,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1523,6 +1515,44 @@ static void set_page_owner(struct page *
 #endif /* CONFIG_PAGE_OWNER */
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
@@ -1577,48 +1607,28 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
-				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order, zonelist,
+				ALLOC_NO_WATERMARKS);
+		if (page)
+			goto got_pg;
+
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1627,6 +1637,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */

--



* [PATCH 02/28] mm: tag reserve pages
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 01/28] mm: gfp_to_alloc_flags() Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 03/28] mm: slb: add knowledge of reserve pages Peter Zijlstra
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: page_alloc-reserve.patch --]
[-- Type: text/plain, Size: 1265 bytes --]

Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -73,6 +73,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1418,8 +1418,10 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);

--



* [PATCH 03/28] mm: slb: add knowledge of reserve pages
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 01/28] mm: gfp_to_alloc_flags() Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 02/28] mm: tag reserve pages Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 04/28] mm: kmem_estimate_pages() Peter Zijlstra
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: reserve-slub.patch --]
[-- Type: text/plain, Size: 11005 bytes --]

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slub_def.h |    1 
 mm/slab.c                |   60 +++++++++++++++++++++++++++++++++++++++--------
 mm/slub.c                |   42 +++++++++++++++++++++-----------
 3 files changed, 80 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -21,11 +21,12 @@
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
+#include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1098,15 +1099,15 @@ static struct page *allocate_slab(struct
 	return page;
 }
 
-static void setup_object(struct kmem_cache *s, struct page *page,
-				void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
 		s->ctor(s, object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static
+struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -1121,6 +1122,7 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
@@ -1228,8 +1230,7 @@ static __always_inline int slab_trylock(
 /*
  * Management of partially allocated slabs
  */
-static void add_partial(struct kmem_cache_node *n,
-				struct page *page, int tail)
+static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	spin_lock(&n->list_lock);
 	n->nr_partial++;
@@ -1240,8 +1241,7 @@ static void add_partial(struct kmem_cach
 	spin_unlock(&n->list_lock);
 }
 
-static void remove_partial(struct kmem_cache *s,
-						struct page *page)
+static void remove_partial(struct kmem_cache *s, struct page *page)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
 
@@ -1256,7 +1256,8 @@ static void remove_partial(struct kmem_c
  *
  * Must hold list_lock.
  */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
+static inline
+int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
@@ -1514,11 +1515,21 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+	int reserve;
 #ifdef SLUB_FASTPATH
 	unsigned long flags;
 
 	local_irq_save(flags);
 #endif
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves we
+		 * must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto grow_slab;
+	}
 	if (!c->page)
 		goto new_slab;
 
@@ -1530,7 +1541,7 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(object == c->page->end))
 		goto another_slab;
-	if (unlikely(SlabDebug(c->page)))
+	if (unlikely(SlabDebug(c->page) || c->reserve))
 		goto debug;
 
 	object = c->page->freelist;
@@ -1557,16 +1568,18 @@ new_slab:
 		goto load_freelist;
 	}
 
+grow_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	new = new_slab(s, gfpflags, node, &reserve);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
 	if (new) {
 		c = get_cpu_slab(s, smp_processor_id());
+		c->reserve = reserve;
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1594,8 +1607,8 @@ new_slab:
 
 	return NULL;
 debug:
-	object = c->page->freelist;
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (SlabDebug(c->page) &&
+			!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
 	c->page->inuse++;
@@ -2153,10 +2166,11 @@ static struct kmem_cache_node *early_kme
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -37,6 +37,7 @@ struct kmem_cache_cpu {
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
+	int reserve;		/* Did the current page come from the reserve */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -115,6 +115,8 @@
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
 
+#include 	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -265,7 +267,8 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	unsigned int touched:1,
+		     reserve:1;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
@@ -761,6 +764,27 @@ static inline struct array_cache *cpu_ca
 	return cachep->array[smp_processor_id()];
 }
 
+/*
+ * If the last page came from the reserves, and the current allocation context
+ * does not have access to them, force an allocation to test the watermarks.
+ */
+static inline int slab_force_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+	if (unlikely(cpu_cache_get(cachep)->reserve) &&
+			!(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		return 1;
+
+	return 0;
+}
+
+static inline void slab_set_reserve(struct kmem_cache *cachep, int reserve)
+{
+	struct array_cache *ac = cpu_cache_get(cachep);
+
+	if (unlikely(ac->reserve != reserve))
+		ac->reserve = reserve;
+}
+
 static inline struct kmem_cache *__find_general_cachep(size_t size,
 							gfp_t gfpflags)
 {
@@ -960,6 +984,7 @@ static struct array_cache *alloc_arrayca
 		nc->limit = entries;
 		nc->batchcount = batchcount;
 		nc->touched = 0;
+		nc->reserve = 0;
 		spin_lock_init(&nc->lock);
 	}
 	return nc;
@@ -1663,7 +1688,8 @@ __initcall(cpucache_init);
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		int *reserve)
 {
 	struct page *page;
 	int nr_pages;
@@ -1685,6 +1711,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	*reserve = page->reserve;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2113,6 +2140,7 @@ static int __init_refok setup_cpu_cache(
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
 	cpu_cache_get(cachep)->touched = 0;
+	cpu_cache_get(cachep)->reserve = 0;
 	cachep->batchcount = 1;
 	cachep->limit = BOOT_CPUCACHE_ENTRIES;
 	return 0;
@@ -2768,6 +2796,7 @@ static int cache_grow(struct kmem_cache 
 	size_t offset;
 	gfp_t local_flags;
 	struct kmem_list3 *l3;
+	int reserve;
 
 	/*
 	 * Be lazy and only check for valid flags here,  keeping it out of the
@@ -2806,7 +2835,7 @@ static int cache_grow(struct kmem_cache 
 	 * 'nodeid'.
 	 */
 	if (!objp)
-		objp = kmem_getpages(cachep, local_flags, nodeid);
+		objp = kmem_getpages(cachep, local_flags, nodeid, &reserve);
 	if (!objp)
 		goto failed;
 
@@ -2823,6 +2852,7 @@ static int cache_grow(struct kmem_cache 
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
+	slab_set_reserve(cachep, reserve);
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */
@@ -2957,7 +2987,8 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep,
+		gfp_t flags, int must_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2967,6 +2998,8 @@ static void *cache_alloc_refill(struct k
 	node = numa_node_id();
 
 	check_irq_off();
+	if (unlikely(must_refill))
+		goto force_grow;
 	ac = cpu_cache_get(cachep);
 retry:
 	batchcount = ac->batchcount;
@@ -3035,11 +3068,14 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || must_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3194,17 +3230,18 @@ static inline void *____cache_alloc(stru
 {
 	void *objp;
 	struct array_cache *ac;
+	int must_refill = slab_force_alloc(cachep, flags);
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && !must_refill)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, must_refill);
 	}
 	return objp;
 }
@@ -3246,7 +3283,7 @@ static void *fallback_alloc(struct kmem_
 	gfp_t local_flags;
 	struct zone **z;
 	void *obj = NULL;
-	int nid;
+	int nid, reserve;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
@@ -3280,10 +3317,11 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, flags, -1);
+		obj = kmem_getpages(cache, flags, -1, &reserve);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
+			slab_set_reserve(cache, reserve);
 			/*
 			 * Insert into the appropriate per node queues
 			 */
@@ -3322,6 +3360,9 @@ static void *____cache_alloc_node(struct
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
+	if (unlikely(slab_force_alloc(cachep, flags)))
+		goto force_grow;
+
 retry:
 	check_irq_off();
 	spin_lock(&l3->list_lock);
@@ -3359,6 +3400,7 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;

--



* [PATCH 04/28] mm: kmem_estimate_pages()
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 03/28] mm: slb: add knowledge of reserve pages Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:05   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-kmem_estimate_pages.patch --]
[-- Type: text/plain, Size: 6384 bytes --]

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slab.h |    4 ++
 mm/slab.c            |   75 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c            |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,8 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep,
+			gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -94,6 +96,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2465,6 +2465,37 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long slabs;
+
+	if (WARN_ON(!s) || WARN_ON(!s->objects))
+		return 0;
+
+	slabs = DIV_ROUND_UP(objects, s->objects);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object.
+	 */
+	if (s->objects > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		slabs += num_online_cpus();
+	}
+
+	return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
  * Attempt to free all slabs on a node. Return the number of slabs we
  * were unable to free.
  */
@@ -2818,6 +2849,57 @@ static unsigned long count_partial(struc
 }
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous sizes.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account for the worst-case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = &kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_estimate_pages(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -3851,6 +3851,81 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
+ * return the max number of pages required to allocate count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *cachep,
+		gfp_t flags, int objects)
+{
+	/*
+	 * (1) memory for objects,
+	 */
+	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
+	unsigned nr_pages = nr_slabs << cachep->gfporder;
+
+	/*
+	 * (2) memory for each per-cpu queue (nr_cpu_ids),
+	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
+	 * (4) some amount of memory for the slab management structures
+	 *
+	 * XXX: truly account these
+	 */
+	nr_pages += 1 + ilog2(nr_pages);
+
+	return nr_pages;
+}
+
+/*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = kmem_find_general_cachep(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_estimate_pages(s, flags, count);
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous sizes.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+	struct cache_sizes *csizep = malloc_sizes;
+
+	/*
+	 * multiply by two, in order to account for the worst-case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (csizep = malloc_sizes; csizep->cs_cachep; csizep++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & __GFP_DMA))
+			s = csizep->cs_dmacachep;
+		else
+#endif
+			s = csizep->cs_cachep;
+
+		if (s)
+			pages += kmem_estimate_pages(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * This initializes kmem_list3 or resizes various caches for all nodes.
  */
 static int alloc_kmemlist(struct kmem_cache *cachep)

--
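
A hedged usage sketch of the estimation helpers added above; foo_cachep is a
made-up cache, and the resulting page count would typically be handed to the
reserve code introduced in patch 10:

    /*
     * Sketch only: worst-case page demand for 256 objects from foo_cachep
     * plus 64KB worth of kmalloc() allocations.
     */
    static unsigned long foo_page_demand(void)
    {
        unsigned long pages = 0;

        pages += kmem_estimate_pages(foo_cachep, GFP_ATOMIC, 256);
        pages += kestimate(GFP_ATOMIC, 64 * 1024);

        return pages;
    }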



* [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 04/28] mm: kmem_estimate_pages() Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:05   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 06/28] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2287 bytes --]

Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save current->flags; ksoftirqd will have its own
task_struct.

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    4 ++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1471,9 +1471,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -213,6 +213,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -251,6 +253,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1497,6 +1497,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else

--
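
For reference, a hedged sketch of the save/grant/restore pattern this enables;
the later network patches do something along these lines around packet
processing:

    /* sketch only: temporarily grant this context access to the reserves */
    static void my_process_with_reserves(void)
    {
        unsigned long pflags = current->flags;

        current->flags |= PF_MEMALLOC;
        /* ... allocations here may dip into the emergency reserves ... */
        tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }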



* [PATCH 06/28] mm: serialize access to min_free_kbytes
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 07/28] mm: emergency pool Peter Zijlstra
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1836 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4087,12 +4088,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4147,6 +4148,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4182,7 +4192,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

--



* [PATCH 07/28] mm: emergency pool
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 06/28] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:05   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6847 bytes --]

Provide the means to reserve a specific number of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mmzone.h |    3 +
 mm/page_alloc.c        |   84 +++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 +--
 3 files changed, 79 insertions(+), 14 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -213,7 +213,7 @@ enum zone_type {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_emerg, pages_min, pages_low, pages_high;
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -683,6 +683,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1240,7 +1242,7 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min+z->lowmem_reserve[classzone_idx]+z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1569,7 +1571,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
 	int do_retry;
-	int alloc_flags;
+	int alloc_flags = 0;
 	int did_some_progress;
 
 	might_sleep_if(wait);
@@ -1721,8 +1723,8 @@ nofail_alloc:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -1937,9 +1939,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE)),
 			K(zone_page_state(zone, NR_INACTIVE)),
 			K(zone->present_pages),
@@ -4125,7 +4127,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4182,7 +4184,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -4194,11 +4197,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -4217,12 +4222,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = 0;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -4244,6 +4251,63 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+static void __adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+}
+
+static int test_reserve_limits(void)
+{
+	struct zone *zone;
+	int node;
+
+	for_each_zone(zone)
+		wakeup_kswapd(zone, 0);
+
+	for_each_online_node(node) {
+		struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+		if (!page)
+			return -ENOMEM;
+
+		__free_page(page);
+	}
+
+	return 0;
+}
+
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks reclaim into action to
+ *	satisfy the higher watermarks.
+ *
+ *	returns -ENOMEM when it failed to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int err = 0;
+
+	mutex_lock(&var_free_mutex);
+	__adjust_memalloc_reserve(pages);
+	if (pages > 0) {
+		err = test_reserve_limits();
+		if (err) {
+			__adjust_memalloc_reserve(-pages);
+			goto unlock;
+		}
+	}
+	printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+	mutex_unlock(&var_free_mutex);
+	return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -754,9 +754,9 @@ static void zoneinfo_show_print(struct s
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   zone->pages_emerg + zone->pages_min,
+		   zone->pages_emerg + zone->pages_low,
+		   zone->pages_emerg + zone->pages_high,
 		   zone->pages_scanned,
 		   zone->nr_scan_active, zone->nr_scan_inactive,
 		   zone->spanned_pages,

--
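
A hedged usage sketch of adjust_memalloc_reserve(); the my_* names are invented
and the size is arbitrary:

    /* sketch only: grow the emergency pool by 128 pages at setup time */
    static int my_reserve_grow(void)
    {
        int err = adjust_memalloc_reserve(128);

        if (err)	/* -ENOMEM: reclaim could not meet the new watermarks */
            return err;

        return 0;
    }

    static void my_reserve_shrink(void)
    {
        adjust_memalloc_reserve(-128);	/* hand the pages back */
    }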



* [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 07/28] mm: emergency pool Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:05   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 09/28] mm: __GFP_MEMALLOC Peter Zijlstra
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: global-ALLOC_NO_WATERMARKS.patch --]
[-- Type: text/plain, Size: 853 bytes --]

Change ALLOC_NO_WATERMARKS page allocation such that the reserves are system
wide - which they already are per setup_per_zone_pages_min(). When we scrape
the barrel, do it properly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1552,6 +1552,12 @@ restart:
 rebalance:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+			gfp_zone(gfp_mask);
+
 		/* go through the zonelist yet again, ignoring mins */
 		page = get_page_from_freelist(gfp_mask, order, zonelist,
 				ALLOC_NO_WATERMARKS);

--



* [PATCH 09/28] mm: __GFP_MEMALLOC
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:06   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 10/28] mm: memory reserve management Peter Zijlstra
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 2048 bytes --]

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1474,7 +1474,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

--
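
Illustration only: with this flag an otherwise ordinary atomic allocation may
ignore the watermarks, so it should only be used for demand that has been
accounted against a reserve beforehand:

    /* sketch only */
    static struct page *grab_emergency_page(void)
    {
        return alloc_page(GFP_ATOMIC | __GFP_MEMALLOC);
    }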



* [PATCH 10/28] mm: memory reserve management
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 09/28] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:06   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 11/28] selinux: tag avc cache alloc as non-critical Peter Zijlstra
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-reserve.patch --]
[-- Type: text/plain, Size: 14372 bytes --]

Generic reserve management code. 

It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools could be built, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/reserve.h |   54 ++++++
 mm/Makefile             |    2 
 mm/reserve.c            |  429 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 484 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/reserve.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,54 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mem_reserve {
+	struct mem_reserve *parent;
+	struct list_head children;
+	struct list_head siblings;
+
+	const char *name;
+
+	long pages;
+	long limit;
+	long usage;
+	spinlock_t lock;	/* protects limit and usage */
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+			struct mem_reserve *node);
+int mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages,
+			     int overcommit);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+			       int overcommit);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+			       struct kmem_cache *s,
+			       int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+				  long objs,
+				  int overcommit);
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-			   page_isolation.o $(mmu-y)
+			   page_isolation.o reserve.o $(mmu-y)
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)	+= bounce.o
Index: linux-2.6/mm/reserve.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,429 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of objects of a specified
+ * size. Since memory is managed in pages, this reserve demand is then
+ * translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the units start mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent about
+ * which node is charged; resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty and has no limit unit; it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+	.children = LIST_HEAD_INIT(mem_reserve_root.children),
+	.siblings = LIST_HEAD_INIT(mem_reserve_root.siblings),
+	.name = "total reserve",
+	.lock = __SPIN_LOCK_UNLOCKED(mem_reserve_root.lock),
+};
+EXPORT_SYMBOL_GPL(mem_reserve_root);
+
+/**
+ * mem_reserve_init - initialize a memory reserve object
+ * @res - the new reserve object
+ * @name - a name for this reserve
+ */
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent)
+{
+	memset(res, 0, sizeof(*res));
+	INIT_LIST_HEAD(&res->children);
+	INIT_LIST_HEAD(&res->siblings);
+	res->name = name;
+	spin_lock_init(&res->lock);
+
+	if (parent)
+		mem_reserve_connect(res, parent);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_init);
+
+/*
+ * propagate the pages and limit changes up the tree.
+ */
+static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
+{
+	unsigned long flags;
+
+	for ( ; res; res = res->parent) {
+		res->pages += pages;
+
+		if (limit) {
+			spin_lock_irqsave(&res->lock, flags);
+			res->limit += limit;
+			spin_unlock_irqrestore(&res->lock, flags);
+		}
+	}
+}
+
+/**
+ * __mem_reserve_add - primitive to change the size of a reserve
+ * @res - reserve to change
+ * @pages - page delta
+ * @limit - usage limit delta
+ *
+ * Returns -ENOMEM when a size increase is not possible atm.
+ */
+static int __mem_reserve_add(struct mem_reserve *res, long pages, long limit)
+{
+	int ret = 0;
+	long reserve;
+
+	reserve = mem_reserve_root.pages;
+	__calc_reserve(res, pages, 0);
+	reserve = mem_reserve_root.pages - reserve;
+
+	if (reserve) {
+		ret = adjust_memalloc_reserve(reserve);
+		if (ret)
+			__calc_reserve(res, -pages, 0);
+	}
+
+	if (!ret)
+		__calc_reserve(res, 0, limit);
+
+	return ret;
+}
+
+/**
+ * __mem_reserve_charge - primitive to charge object usage to a reserve
+ * @res - reserve to charge
+ * @charge - size of the charge
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success, zero on failure.
+ */
+static
+int __mem_reserve_charge(struct mem_reserve *res, long charge, int overcommit)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&res->lock, flags);
+	if (charge < 0 || res->usage + charge < res->limit || overcommit) {
+		res->usage += charge;
+		if (unlikely(res->usage < 0))
+			res->usage = 0;
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&res->lock, flags);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_connect - connect a reserve to another in a child-parent relation
+ * @new_child - the reserve node to connect (child)
+ * @node - the reserve node to connect to (parent)
+ *
+ * Returns -ENOMEM when the new connection would increase the reserve (parent
+ * is connected to mem_reserve_root) and there is no memory to do so.
+ *
+ * The child is _NOT_ connected on error.
+ */
+int mem_reserve_connect(struct mem_reserve *new_child, struct mem_reserve *node)
+{
+	int ret;
+
+	WARN_ON(!new_child->name);
+
+	mutex_lock(&mem_reserve_mutex);
+	new_child->parent = node;
+	list_add(&new_child->siblings, &node->children);
+	ret = __mem_reserve_add(node, new_child->pages, new_child->limit);
+	if (ret) {
+		new_child->parent = NULL;
+		list_del_init(&new_child->siblings);
+	}
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_connect);
+
+/**
+ * mem_reserve_disconnect - sever a node's connection to the reserve tree
+ * @node - the node to disconnect
+ *
+ * Could, in theory, return -ENOMEM, but since disconnecting a node _should_
+ * only decrease the reserves, that _should_ not happen.
+ */
+int mem_reserve_disconnect(struct mem_reserve *node)
+{
+	int ret;
+
+	BUG_ON(!node->parent);
+
+	mutex_lock(&mem_reserve_mutex);
+	ret = __mem_reserve_add(node->parent, -node->pages, -node->limit);
+	if (!ret) {
+		node->parent = NULL;
+		list_del_init(&node->siblings);
+	}
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_disconnect);
+
+#ifdef CONFIG_PROC_FS
+
+/*
+ * Simple output of the reserve tree in: /proc/reserve_info
+ * Example:
+ *
+ * localhost ~ # cat /proc/reserve_info
+ * total reserve                  8156K (0/544817)
+ *   total network reserve          8156K (0/544817)
+ *     network TX reserve             196K (0/49)
+ *       protocol TX pages              196K (0/49)
+ *     network RX reserve             7960K (0/544768)
+ *       IPv6 route cache               1372K (0/4096)
+ *       IPv4 route cache               5468K (0/16384)
+ *       SKB data reserve               1120K (0/524288)
+ *         IPv6 fragment cache            560K (0/262144)
+ *         IPv4 fragment cache            560K (0/262144)
+ */
+
+static void mem_reserve_show_item(struct seq_file *m, struct mem_reserve *res,
+				  int nesting)
+{
+	int i;
+	struct mem_reserve *child;
+
+	for (i = 0; i < nesting; i++)
+		seq_puts(m, "  ");
+
+	seq_printf(m, "%-30s %ldK (%ld/%ld)\n",
+		   res->name, res->pages << (PAGE_SHIFT - 10),
+		   res->usage, res->limit);
+
+	list_for_each_entry(child, &res->children, siblings)
+		mem_reserve_show_item(m, child, nesting+1);
+}
+
+static int mem_reserve_show(struct seq_file *m, void *v)
+{
+	mutex_lock(&mem_reserve_mutex);
+	mem_reserve_show_item(m, &mem_reserve_root, 0);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return 0;
+}
+
+static int mem_reserve_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mem_reserve_show, NULL);
+}
+
+static const struct file_operations mem_reserve_operations = {
+	.open = mem_reserve_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static __init int mem_reserve_proc_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("reserve_info", S_IRUSR, NULL);
+	if (entry)
+		entry->proc_fops = &mem_reserve_operations;
+
+	return 0;
+}
+
+__initcall(mem_reserve_proc_init);
+
+#endif
+
+/*
+ * alloc_page helpers
+ */
+
+/**
+ * mem_reserve_pages_set - set a reserve's size in pages
+ * @res - reserve to set
+ * @pages - size in pages to set it to
+ *
+ * Returns -ENOMEM when it fails to set the reserve. On failure the old size
+ * is preserved.
+ */
+int mem_reserve_pages_set(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages -= res->pages;
+	ret = __mem_reserve_add(res, pages, pages);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_set);
+
+/**
+ * mem_reserve_pages_add - change the size in a relative way
+ * @res - reserve to change
+ * @pages - number of pages to add (or subtract when negative)
+ *
+ * Similar to mem_reserve_pages_set, except that the argument is relative
+ * instead of absolute.
+ *
+ * Returns -ENOMEM when it fails to increase.
+ */
+int mem_reserve_pages_add(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	ret = __mem_reserve_add(res, pages, pages);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_pages_charge - charge page usage to a reserve
+ * @res - reserve to charge
+ * @pages - size to charge
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_pages_charge(struct mem_reserve *res,
+			     long pages, int overcommit)
+{
+	return __mem_reserve_charge(res, pages, overcommit);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_charge);
+
+/*
+ * kmalloc helpers
+ */
+
+/**
+ * mem_reserve_kmalloc_set - set this reserve to @bytes worth of kmalloc memory
+ * @res - reserve to change
+ * @bytes - size in bytes to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes)
+{
+	int ret;
+	long pages;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kestimate(GFP_ATOMIC, bytes);
+	pages -= res->pages;
+	bytes -= res->limit;
+	ret = __mem_reserve_add(res, pages, bytes);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_set);
+
+/**
+ * mem_reserve_kmalloc_charge - charge bytes to a reserve
+ * @res - reserve to charge
+ * @bytes - bytes to charge
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+			       int overcommit)
+{
+	if (bytes < 0)
+		bytes = -roundup_pow_of_two(-bytes);
+	else
+		bytes = roundup_pow_of_two(bytes);
+
+	return __mem_reserve_charge(res, bytes, overcommit);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_charge);
+
+/*
+ * kmem_cache helpers
+ */
+
+/**
+ * mem_reserve_kmem_cache_set - set reserve to @objects worth of kmem_cache_alloc of @s
+ * @res - reserve to set
+ * @s - kmem_cache to reserve from
+ * @objects - number of objects to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmem_cache_set(struct mem_reserve *res, struct kmem_cache *s,
+			       int objects)
+{
+	int ret;
+	long pages;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kmem_estimate_pages(s, GFP_ATOMIC, objects);
+	pages -= res->pages;
+	objects -= res->limit;
+	ret = __mem_reserve_add(res, pages, objects);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_set);
+
+/**
+ * mem_reserve_kmem_cache_charge - charge (or uncharge) usage of objs
+ * @res - reserve to charge
+ * @objs - objects to charge for
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res, long objs,
+				  int overcommit)
+{
+	return __mem_reserve_charge(res, objs, overcommit);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_charge);
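
As a rough illustration of the intended use of this API, a hypothetical
consumer with its own kmem_cache might do something like the sketch below.
Only the mem_reserve_*() calls are from this patch; the foo_* names and the
reserve size are made up.

	static struct mem_reserve foo_reserve;
	static struct kmem_cache *foo_cachep;

	static int __init foo_reserve_init(void)
	{
		int err;

		mem_reserve_init(&foo_reserve, "foo reserve", &mem_reserve_root);
		err = mem_reserve_kmem_cache_set(&foo_reserve, foo_cachep, 16);
		if (err)
			mem_reserve_disconnect(&foo_reserve);
		return err;
	}

	static void *foo_alloc(gfp_t gfp)
	{
		void *obj;

		/* if the charge fails, we must not dip into the reserve */
		if (!mem_reserve_kmem_cache_charge(&foo_reserve, 1, 0))
			return NULL;
		obj = kmem_cache_alloc(foo_cachep, gfp | __GFP_MEMALLOC);
		if (!obj)
			mem_reserve_kmem_cache_charge(&foo_reserve, -1, 0);
		return obj;
	}

	static void foo_free(void *obj)
	{
		kmem_cache_free(foo_cachep, obj);
		/* a negative charge returns the objects to the reserve */
		mem_reserve_kmem_cache_charge(&foo_reserve, -1, 0);
	}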

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 11/28] selinux: tag avc cache alloc as non-critical
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 10/28] mm: memory reserve management Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 12/28] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra, James Morris

[-- Attachment #1: mm-selinux-emergency.patch --]
[-- Type: text/plain, Size: 768 bytes --]

Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: James Morris <jmorris@namei.org>
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/security/selinux/avc.c
===================================================================
--- linux-2.6.orig/security/selinux/avc.c
+++ linux-2.6/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 12/28] net: wrap sk->sk_backlog_rcv()
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 11/28] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 13/28] net: packet split receive api Peter Zijlstra
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: net-backlog.patch --]
[-- Type: text/plain, Size: 2909 bytes --]

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    5 +++++
 include/net/tcp.h    |    2 +-
 net/core/sock.c      |    4 ++--
 net/ipv4/tcp.c       |    2 +-
 net/ipv4/tcp_timer.c |    2 +-
 5 files changed, 10 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -474,6 +474,11 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)			\
 	({	int __rc;						\
 		release_sock(__sk);					\
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -325,7 +325,7 @@ int sk_receive_skb(struct sock *sk, stru
 		 */
 		mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-		rc = sk->sk_backlog_rcv(sk, skb);
+		rc = sk_backlog_rcv(sk, skb);
 
 		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	} else
@@ -1360,7 +1360,7 @@ static void __release_sock(struct sock *
 			struct sk_buff *next = skb->next;
 
 			skb->next = NULL;
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 			/*
 			 * We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1158,7 +1158,7 @@ static void tcp_prequeue_process(struct 
 	 * necessary */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);
 	local_bh_enable();
 
 	/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -203,7 +203,7 @@ static void tcp_delack_timer(unsigned lo
 		NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
 		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 		tp->ucopy.memory = 0;
 	}
Index: linux-2.6/include/net/tcp.h
===================================================================
--- linux-2.6.orig/include/net/tcp.h
+++ linux-2.6/include/net/tcp.h
@@ -879,7 +879,7 @@ static inline int tcp_prequeue(struct so
 			BUG_ON(sock_owned_by_user(sk));
 
 			while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
-				sk->sk_backlog_rcv(sk, skb1);
+				sk_backlog_rcv(sk, skb1);
 				NET_INC_STATS_BH(LINUX_MIB_TCPPREQUEUEDROPPED);
 			}
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 13/28] net: packet split receive api
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 12/28] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 14/28] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: net-ps_rx.patch --]
[-- Type: text/plain, Size: 9721 bytes --]

Add some packet-split receive hooks.

For one, this allows doing NUMA node-affine page allocations. Later on these
hooks will be extended to do emergency reserve allocations for fragments.
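
The resulting driver pattern looks roughly like the sketch below; the
surrounding ring bookkeeping is made up, only netdev_alloc_page(),
netdev_free_page() and skb_add_rx_frag() are the hooks added by this patch.

	/* refill path: allocate a page local to the device's NUMA node */
	page = netdev_alloc_page(netdev);
	if (!page)
		goto no_buffers;

	/* receive path: attach the filled page as a fragment */
	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, 0, length);

	/* teardown/error path: hand the page back */
	netdev_free_page(netdev, page);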

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/net/bnx2.c             |    8 +++-----
 drivers/net/e1000/e1000_main.c |    8 ++------
 drivers/net/e1000e/netdev.c    |    7 ++-----
 drivers/net/igb/igb_main.c     |    8 ++------
 drivers/net/ixgbe/ixgbe_main.c |   10 +++-------
 drivers/net/sky2.c             |   16 ++++++----------
 include/linux/skbuff.h         |   23 +++++++++++++++++++++++
 net/core/skbuff.c              |   20 ++++++++++++++++++++
 8 files changed, 61 insertions(+), 39 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4478,12 +4478,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
 			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
 					PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-			                   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
 			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
@@ -4691,7 +4687,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 			if (j < adapter->rx_ps_pages) {
 				if (likely(!ps_page->ps_page[j])) {
 					ps_page->ps_page[j] =
-						alloc_page(GFP_ATOMIC);
+						netdev_alloc_page(netdev);
 					if (unlikely(!ps_page->ps_page[j])) {
 						adapter->alloc_rx_buff_failed++;
 						goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -846,6 +846,9 @@ static inline void skb_fill_page_desc(st
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1339,6 +1342,26 @@ static inline struct sk_buff *netdev_all
 	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ *	netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *
+ * 	Allocate a new page node local to the specified device.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+	return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+	__free_page(page);
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,26 @@ struct sk_buff *__netdev_alloc_skb(struc
 	return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	struct page *page;
+
+	page = alloc_pages_node(node, gfp_mask, 0);
+	return page;
+}
+EXPORT_SYMBOL(__netdev_alloc_page);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		int size)
+{
+	skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+EXPORT_SYMBOL(skb_add_rx_frag);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
Index: linux-2.6/drivers/net/sky2.c
===================================================================
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1216,7 +1216,7 @@ static struct sk_buff *sky2_rx_alloc(str
 	}
 
 	for (i = 0; i < sky2->rx_nfrags; i++) {
-		struct page *page = alloc_page(GFP_ATOMIC);
+		struct page *page = netdev_alloc_page(sky2->netdev);
 
 		if (!page)
 			goto free_partial;
@@ -2088,8 +2088,8 @@ static struct sk_buff *receive_copy(stru
 }
 
 /* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
-			  unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+			  unsigned int hdr_space, unsigned int length)
 {
 	int i, num_frags;
 	unsigned int size;
@@ -2106,15 +2106,11 @@ static void skb_put_frags(struct sk_buff
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			netdev_free_page(sky2->netdev, frag->page);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
-
-			frag->size = size;
-			skb->data_len += size;
-			skb->truesize += size;
-			skb->len += size;
+			skb_add_rx_frag(skb, i, frag->page, 0, size);
 			length -= size;
 		}
 	}
@@ -2141,7 +2137,7 @@ static struct sk_buff *receive_new(struc
 	sky2_rx_map_skb(sky2->hw->pdev, re, hdr_space);
 
 	if (skb_shinfo(skb)->nr_frags)
-		skb_put_frags(skb, hdr_space, length);
+		skb_put_frags(sky2, skb, hdr_space, length);
 	else
 		skb_put(skb, length);
 	return skb;
Index: linux-2.6/drivers/net/bnx2.c
===================================================================
--- linux-2.6.orig/drivers/net/bnx2.c
+++ linux-2.6/drivers/net/bnx2.c
@@ -2356,7 +2356,7 @@ bnx2_alloc_rx_page(struct bnx2 *bp, u16 
 	struct sw_pg *rx_pg = &bp->rx_pg_ring[index];
 	struct rx_bd *rxbd =
 		&bp->rx_pg_desc_ring[RX_RING(index)][RX_IDX(index)];
-	struct page *page = alloc_page(GFP_ATOMIC);
+	struct page *page = netdev_alloc_page(bp->dev);
 
 	if (!page)
 		return -ENOMEM;
@@ -2381,7 +2381,7 @@ bnx2_free_rx_page(struct bnx2 *bp, u16 i
 	pci_unmap_page(bp->pdev, pci_unmap_addr(rx_pg, mapping), PAGE_SIZE,
 		       PCI_DMA_FROMDEVICE);
 
-	__free_page(page);
+	netdev_free_page(bp->dev, page);
 	rx_pg->page = NULL;
 }
 
@@ -2705,9 +2705,7 @@ bnx2_rx_skb(struct bnx2 *bp, struct bnx2
 			}
 
 			frag_size -= frag_len;
-			skb->data_len += frag_len;
-			skb->truesize += frag_len;
-			skb->len += frag_len;
+			skb_add_rx_frag(skb, i, rx_pg->page, 0, frag_len);
 
 			pg_prod = NEXT_RX_BD(pg_prod);
 			pg_cons = RX_PG_RING_IDX(NEXT_RX_BD(pg_cons));
Index: linux-2.6/drivers/net/e1000e/netdev.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000e/netdev.c
+++ linux-2.6/drivers/net/e1000e/netdev.c
@@ -252,7 +252,7 @@ static void e1000_alloc_rx_buffers_ps(st
 				continue;
 			}
 			if (!ps_page->page) {
-				ps_page->page = alloc_page(GFP_ATOMIC);
+				ps_page->page = netdev_alloc_page(netdev);
 				if (!ps_page->page) {
 					adapter->alloc_rx_buff_failed++;
 					goto no_buffers;
@@ -714,11 +714,8 @@ static bool e1000_clean_rx_irq_ps(struct
 			pci_unmap_page(pdev, ps_page->dma, PAGE_SIZE,
 				       PCI_DMA_FROMDEVICE);
 			ps_page->dma = 0;
-			skb_fill_page_desc(skb, j, ps_page->page, 0, length);
+			skb_add_rx_frag(skb, j, ps_page->page, 0, length);
 			ps_page->page = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 copydone:
Index: linux-2.6/drivers/net/igb/igb_main.c
===================================================================
--- linux-2.6.orig/drivers/net/igb/igb_main.c
+++ linux-2.6/drivers/net/igb/igb_main.c
@@ -3506,13 +3506,9 @@ static bool igb_clean_rx_irq_adv(struct 
 			pci_unmap_page(pdev, buffer_info->page_dma,
 				PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			buffer_info->page_dma = 0;
-			skb_fill_page_desc(skb, j, buffer_info->page,
-						0, length);
+			skb_add_rx_frag(skb, j, buffer_info->page, 0, length);
 			buffer_info->page = NULL;
 
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 			rx_desc->wb.upper.status_error = 0;
 			if (staterr & E1000_RXD_STAT_EOP)
 				break;
@@ -3614,7 +3610,7 @@ static void igb_alloc_rx_buffers_adv(str
 		rx_desc = E1000_RX_DESC_ADV(*rx_ring, i);
 
 		if (adapter->rx_ps_hdr_size && !buffer_info->page) {
-			buffer_info->page = alloc_page(GFP_ATOMIC);
+			buffer_info->page = netdev_alloc_page(netdev);
 			if (!buffer_info->page) {
 				adapter->alloc_rx_buff_failed++;
 				goto no_buffers;
Index: linux-2.6/drivers/net/ixgbe/ixgbe_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ixgbe/ixgbe_main.c
+++ linux-2.6/drivers/net/ixgbe/ixgbe_main.c
@@ -367,7 +367,7 @@ static void ixgbe_alloc_rx_buffers(struc
 
 		if (!rx_buffer_info->page &&
 				(adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
-			rx_buffer_info->page = alloc_page(GFP_ATOMIC);
+			rx_buffer_info->page = netdev_alloc_page(netdev);
 			if (!rx_buffer_info->page) {
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
@@ -490,13 +490,9 @@ static bool ixgbe_clean_rx_irq(struct ix
 			pci_unmap_page(pdev, rx_buffer_info->page_dma,
 				       PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
-			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
-					   rx_buffer_info->page, 0, upper_len);
+			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+					rx_buffer_info->page, 0, upper_len);
 			rx_buffer_info->page = NULL;
-
-			skb->len += upper_len;
-			skb->data_len += upper_len;
-			skb->truesize += upper_len;
 		}
 
 		i++;

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 14/28] net: sk_allocation() - concentrate socket related allocations
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 13/28] net: packet split receive api Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: net-sk_allocation.patch --]
[-- Type: text/plain, Size: 5243 bytes --]

Introduce sk_allocation(); this function allows injecting socket-specific
flags into each socket-related allocation.
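
The conversion itself is mechanical; a socket-related allocation that used a
bare gfp mask, e.g.:

	skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);

now routes the mask through the socket:

	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));

For now sk_allocation() simply returns the mask unchanged; a later patch in
this series makes it add the socket's __GFP_MEMALLOC flag.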

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h    |    5 +++++
 net/ipv4/tcp.c        |    3 ++-
 net/ipv4/tcp_output.c |   12 +++++++-----
 net/ipv6/tcp_ipv6.c   |   14 +++++++++-----
 4 files changed, 23 insertions(+), 11 deletions(-)

Index: linux-2.6/net/ipv4/tcp_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2078,7 +2078,8 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
@@ -2104,7 +2105,7 @@ void tcp_send_active_reset(struct sock *
 	struct sk_buff *skb;
 
 	/* NOTE: No TCP options attached and we never retransmit this. */
-	skb = alloc_skb(MAX_TCP_HEADER, priority);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
 	if (!skb) {
 		NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
 		return;
@@ -2171,7 +2172,8 @@ struct sk_buff *tcp_make_synack(struct s
 	__u8 *md5_hash_location;
 #endif
 
-	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+			sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return NULL;
 
@@ -2425,7 +2427,7 @@ void tcp_send_ack(struct sock *sk)
 	 * tcp_transmit_skb() will set the ownership to this
 	 * sock.
 	 */
-	buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL) {
 		inet_csk_schedule_ack(sk);
 		inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2460,7 +2462,7 @@ static int tcp_xmit_probe_skb(struct soc
 	struct sk_buff *skb;
 
 	/* We don't queue it, tcp_transmit_skb() sets ownership. */
-	skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return -1;
 
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -427,6 +427,11 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+	return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -568,7 +568,8 @@ static int tcp_v6_md5_do_add(struct sock
 	} else {
 		/* reallocate new list if current one is full. */
 		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
 			if (!tp->md5sig_info) {
 				kfree(newkey);
 				return -ENOMEM;
@@ -581,7 +582,8 @@ static int tcp_v6_md5_do_add(struct sock
 		}
 		if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
 			keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-				       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+				       (tp->md5sig_info->entries6 + 1)),
+				       sk_allocation(sk, GFP_ATOMIC));
 
 			if (!keys) {
 				tcp_free_md5sig_pool();
@@ -705,7 +707,7 @@ static int tcp_v6_parse_md5_keys (struct
 		struct tcp_sock *tp = tcp_sk(sk);
 		struct tcp_md5sig_info *p;
 
-		p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+		p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
 		if (!p)
 			return -ENOMEM;
 
@@ -1006,7 +1008,7 @@ static void tcp_v6_send_reset(struct soc
 	 */
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL)
 		return;
 
@@ -1085,10 +1087,12 @@ static void tcp_v6_send_ack(struct tcp_t
 	struct tcp_md5sig_key *key;
 	struct tcp_md5sig_key tw_key;
 #endif
+	gfp_t gfp_mask = GFP_ATOMIC;
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (!tw && skb->sk) {
 		key = tcp_v6_md5_do_lookup(skb->sk, &ipv6_hdr(skb)->daddr);
+		gfp_mask = sk_allocation(skb->sk, gfp_mask);
 	} else if (tw && tw->tw_md5_keylen) {
 		tw_key.key = tw->tw_md5_key;
 		tw_key.keylen = tw->tw_md5_keylen;
@@ -1106,7 +1110,7 @@ static void tcp_v6_send_ack(struct tcp_t
 #endif
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 gfp_mask);
 	if (buff == NULL)
 		return;
 
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -636,7 +636,8 @@ struct sk_buff *sk_stream_alloc_skb(stru
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);
 
-	skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+	skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+			       sk_allocation(sk, gfp));
 	if (skb) {
 		if (sk_wmem_schedule(sk, skb->truesize)) {
 			/*

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 15/28] netvm: network reserve infrastructure
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 14/28] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:06   ` Andrew Morton
  2008-02-24  6:52   ` Mike Snitzer
  2008-02-20 14:46 ` [PATCH 16/28] netvm: INET reserves Peter Zijlstra
                   ` (13 subsequent siblings)
  28 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm-reserve.patch --]
[-- Type: text/plain, Size: 7304 bytes --]

Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed bounded, since it is sized to the upper
bound of memory that can be used for sending pages (not quite true, but good
enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel-side; exposing such a socket to user-space is a BUG.
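
A kernel-side user (say, the socket backing a swap file) would mark its
socket roughly as sketched below; the error handling is illustrative, the
sk_set_memalloc()/sk_clear_memalloc() interfaces are the ones added here.

	err = sk_set_memalloc(sk);	/* set SOCK_MEMALLOC, grow the reserve */
	if (err < 0)
		return err;

	/* ... use the socket to service the VM ... */

	sk_clear_memalloc(sk);		/* on teardown, shrink the reserve again */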

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |   35 +++++++++++++++-
 net/Kconfig        |    3 +
 net/core/sock.c    |  113 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 150 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -51,6 +51,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 
@@ -405,6 +406,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -427,9 +429,40 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since we might
+ * need to bounce the data. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+	return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-	return gfp_mask;
+	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/reserve.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+struct mem_reserve net_skb_reserve;
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+	return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+	return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+	return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+	mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_MEMALLOC sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+	int nr_socks;
+	int err;
+
+	err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
+	if (err)
+		return err;
+
+	nr_socks = atomic_read(&memalloc_socks);
+	if (!nr_socks && socks > 0)
+		err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
+	nr_socks = atomic_add_return(socks, &memalloc_socks);
+	if (!nr_socks && socks)
+		err = mem_reserve_disconnect(&net_reserve);
+
+	if (err)
+		mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
+
+	return err;
+}
+
+/**
+ *	sk_set_memalloc - sets %SOCK_MEMALLOC
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+#ifndef CONFIG_NETVM
+	BUG();
+#endif
+	if (!set) {
+		int err = sk_adjust_memalloc(1, 0);
+		if (err)
+			return err;
+
+		sock_set_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation |= __GFP_MEMALLOC;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+int sk_clear_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation &= ~__GFP_MEMALLOC;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -968,6 +1074,7 @@ void sk_free(struct sock *sk)
 {
 	struct sk_filter *filter;
 
+	sk_clear_memalloc(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
@@ -1095,6 +1202,12 @@ void __init sk_init(void)
 		sysctl_wmem_max = 131071;
 		sysctl_rmem_max = 131071;
 	}
+
+	mem_reserve_init(&net_reserve, "total network reserve", NULL);
+	mem_reserve_init(&net_rx_reserve, "network RX reserve", &net_reserve);
+	mem_reserve_init(&net_skb_reserve, "SKB data reserve", &net_rx_reserve);
+	mem_reserve_init(&net_tx_reserve, "network TX reserve", &net_reserve);
+	mem_reserve_init(&net_tx_pages, "protocol TX pages", &net_tx_reserve);
 }
 
 /*
Index: linux-2.6/net/Kconfig
===================================================================
--- linux-2.6.orig/net/Kconfig
+++ linux-2.6/net/Kconfig
@@ -250,6 +250,9 @@ endmenu
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config NETVM
+	def_bool n
+
 endif   # if NET
 endmenu # Networking
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 16/28] netvm: INET reserves.
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 17/28] netvm: hook skb allocation to reserves Peter Zijlstra
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm-reserve-inet.patch --]
[-- Type: text/plain, Size: 12009 bytes --]

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under the generic RX reserve; its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassembly line we avoid fragment-attack deadlocks.

Use sysctl proc handler and strategy routines to update these limits and
return -ENOMEM to user space when the reserve cannot be grown.
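
All four handlers below follow the same pattern, schematically (the route
cache variants use mem_reserve_kmem_cache_set() instead):

	old = cached_value;
	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
	if (!ret && write) {
		ret = mem_reserve_kmalloc_set(&reserve, cached_value);
		if (!ret)
			high_thresh = cached_value;	/* commit new limit */
		else
			cached_value = old;		/* roll back, -ENOMEM */
	}
	return ret;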

Adds to the reserve tree:

  total network reserve      
    network TX reserve       
      protocol TX pages      
    network RX reserve       
+     IPv6 route cache       
+     IPv4 route cache       
      SKB data reserve       
+       IPv6 fragment cache  
+       IPv4 fragment cache  

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/ipv4/ip_fragment.c |   65 +++++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/route.c       |   65 +++++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/reassembly.c  |   65 +++++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/route.c       |   65 +++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 252 insertions(+), 8 deletions(-)

Index: linux-2.6/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_fragment.c
+++ linux-2.6/net/ipv4/ip_fragment.c
@@ -44,6 +44,7 @@
 #include <linux/udp.h>
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
+#include <linux/reserve.h>
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -591,17 +592,72 @@ int ip_defrag(struct sk_buff *skb, u32 u
 	return -ENOMEM;
 }
 
+static struct mem_reserve ipv4_frag_reserve;
+
 #ifdef CONFIG_SYSCTL
+static int ipv4_frag_bytes;
+
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_bytes, ret;
+
+	if (!write)
+		ipv4_frag_bytes = init_net.ipv4.frags.high_thresh;
+	old_bytes = ipv4_frag_bytes;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+					      ipv4_frag_bytes);
+		if (!ret)
+			init_net.ipv4.frags.high_thresh = ipv4_frag_bytes;
+		else
+			ipv4_frag_bytes = old_bytes;
+	}
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int old_bytes, ret;
+	int write = (newval && newlen);
+
+	if (!write)
+		ipv4_frag_bytes = init_net.ipv4.frags.high_thresh;
+	old_bytes = ipv4_frag_bytes;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+					      ipv4_frag_bytes);
+		if (!ret)
+			init_net.ipv4.frags.high_thresh = ipv4_frag_bytes;
+		else
+			ipv4_frag_bytes = old_bytes;
+	}
+
+	return ret;
+}
+
 static int zero;
 
 static struct ctl_table ip4_frags_ctl_table[] = {
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_HIGH_THRESH,
 		.procname	= "ipfrag_high_thresh",
-		.data		= &init_net.ipv4.frags.high_thresh,
+		.data		= &ipv4_frag_bytes,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
@@ -736,6 +792,11 @@ void __init ipfrag_init(void)
 	ip4_frags.frag_expire = ip_expire;
 	ip4_frags.secret_interval = 10 * 60 * HZ;
 	inet_frags_init(&ip4_frags);
+
+	mem_reserve_init(&ipv4_frag_reserve, "IPv4 fragment cache",
+			 &net_skb_reserve);
+	mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+				init_net.ipv4.frags.high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6/net/ipv6/reassembly.c
===================================================================
--- linux-2.6.orig/net/ipv6/reassembly.c
+++ linux-2.6/net/ipv6/reassembly.c
@@ -43,6 +43,7 @@
 #include <linux/random.h>
 #include <linux/jhash.h>
 #include <linux/skbuff.h>
+#include <linux/reserve.h>
 
 #include <net/sock.h>
 #include <net/snmp.h>
@@ -628,15 +629,70 @@ static struct inet6_protocol frag_protoc
 	.flags		=	INET6_PROTO_NOPOLICY,
 };
 
+static struct mem_reserve ipv6_frag_reserve;
+
 #ifdef CONFIG_SYSCTL
+static int ipv6_frag_bytes;
+
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_bytes, ret;
+
+	if (!write)
+		ipv6_frag_bytes = init_net.ipv6.frags.high_thresh;
+	old_bytes = ipv6_frag_bytes;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv6_frag_reserve,
+					      ipv6_frag_bytes);
+		if (!ret)
+			init_net.ipv6.frags.high_thresh = ipv6_frag_bytes;
+		else
+			ipv6_frag_bytes = old_bytes;
+	}
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int old_bytes, ret;
+	int write = (newval && newlen);
+
+	if (!write)
+		ipv6_frag_bytes = init_net.ipv6.frags.high_thresh;
+	old_bytes = ipv6_frag_bytes;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv6_frag_reserve,
+					      ipv6_frag_bytes);
+		if (!ret)
+			init_net.ipv6.frags.high_thresh = ipv6_frag_bytes;
+		else
+			ipv6_frag_bytes = old_bytes;
+	}
+
+	return ret;
+}
+
 static struct ctl_table ip6_frags_ctl_table[] = {
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_HIGH_THRESH,
 		.procname	= "ip6frag_high_thresh",
-		.data		= &init_net.ipv6.frags.high_thresh,
+		.data		= &ipv6_frag_bytes,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
@@ -758,6 +814,11 @@ int __init ipv6_frag_init(void)
 	ip6_frags.frag_expire = ip6_frag_expire;
 	ip6_frags.secret_interval = 10 * 60 * HZ;
 	inet_frags_init(&ip6_frags);
+
+	mem_reserve_init(&ipv6_frag_reserve, "IPv6 fragment cache",
+			 &net_skb_reserve);
+	mem_reserve_kmalloc_set(&ipv6_frag_reserve,
+				init_net.ipv6.frags.high_thresh);
 out:
 	return ret;
 }
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -109,6 +109,7 @@
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
+#include <linux/reserve.h>
 
 #define RT_FL_TOS(oldflp) \
     ((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
@@ -2794,6 +2795,8 @@ void ip_rt_multicast_event(struct in_dev
 	rt_cache_flush(0);
 }
 
+static struct mem_reserve ipv4_route_reserve;
+
 #ifdef CONFIG_SYSCTL
 static int flush_delay;
 
@@ -2827,6 +2830,58 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int ipv4_route_size;
+
+static int proc_dointvec_route(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_size, ret;
+
+	if (!write)
+		ipv4_route_size = ip_rt_max_size;
+	old_size = ipv4_route_size;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, ipv4_route_size);
+		if (!ret)
+			ip_rt_max_size = ipv4_route_size;
+		else
+			ipv4_route_size = old_size;
+	}
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int old_size, ret;
+	int write = (newval && newlen);
+
+	if (!write)
+		ipv4_route_size = ip_rt_max_size;
+	old_size = ipv4_route_size;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, ipv4_route_size);
+		if (!ret)
+			ip_rt_max_size = ipv4_route_size;
+		else
+			ipv4_route_size = old_size;
+	}
+
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
 	{
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2848,10 +2903,11 @@ ctl_table ipv4_route_table[] = {
 	{
 		.ctl_name	= NET_IPV4_ROUTE_MAX_SIZE,
 		.procname	= "max_size",
-		.data		= &ip_rt_max_size,
+		.data		= &ipv4_route_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_route,
+		.strategy	= &sysctl_intvec_route,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3026,6 +3082,11 @@ int __init ip_rt_init(void)
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
 
+	mem_reserve_init(&ipv4_route_reserve, "IPv4 route cache",
+			&net_rx_reserve);
+	mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+			ipv4_dst_ops.kmem_cachep, ip_rt_max_size);
+
 	devinet_init();
 	ip_fib_init();
 
Index: linux-2.6/net/ipv6/route.c
===================================================================
--- linux-2.6.orig/net/ipv6/route.c
+++ linux-2.6/net/ipv6/route.c
@@ -38,6 +38,7 @@
 #include <linux/in6.h>
 #include <linux/init.h>
 #include <linux/if_arp.h>
+#include <linux/reserve.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <net/net_namespace.h>
@@ -2391,6 +2392,8 @@ static inline void ipv6_route_proc_fini(
 }
 #endif	/* CONFIG_PROC_FS */
 
+static struct mem_reserve ipv6_route_reserve;
+
 #ifdef CONFIG_SYSCTL
 
 static
@@ -2406,6 +2409,58 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int ipv6_route_size;
+
+static int proc_dointvec_route(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_size, ret;
+
+	if (!write)
+		ipv6_route_size = ip6_rt_max_size;
+	old_size = ipv6_route_size;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv6_route_reserve,
+				ip6_dst_ops.kmem_cachep, ipv6_route_size);
+		if (!ret)
+			ip6_rt_max_size = ipv6_route_size;
+		else
+			ipv6_route_size = old_size;
+	}
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int old_size, ret;
+	int write = (newval && newlen);
+
+	if (!write)
+		ipv6_route_size = ip6_rt_max_size;
+	old_size = ipv6_route_size;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv6_route_reserve,
+				ip6_dst_ops.kmem_cachep, ipv6_route_size);
+		if (!ret)
+			ip6_rt_max_size = ipv6_route_size;
+		else
+			ipv6_route_size = old_size;
+	}
+
+	return ret;
+}
+
 ctl_table ipv6_route_table_template[] = {
 	{
 		.procname	=	"flush",
@@ -2425,10 +2480,11 @@ ctl_table ipv6_route_table_template[] = 
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_MAX_SIZE,
 		.procname	=	"max_size",
-		.data		=	&init_net.ipv6.sysctl.ip6_rt_max_size,
+		.data		=	&ipv6_route_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-		.proc_handler	=	&proc_dointvec,
+		.proc_handler	=	&proc_dointvec_route,
+		.strategy	= 	&sysctl_intvec_route,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2519,6 +2575,11 @@ int __init ip6_route_init(void)
 
 	ip6_dst_blackhole_ops.kmem_cachep = ip6_dst_ops.kmem_cachep;
 
+	mem_reserve_init(&ipv6_route_reserve, "IPv6 route cache",
+			&net_rx_reserve);
+	mem_reserve_kmem_cache_set(&ipv6_route_reserve,
+			ip6_dst_ops.kmem_cachep, ip6_rt_max_size);
+
 	ret = fib6_init();
 	if (ret)
 		goto out_kmem_cache;

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 17/28] netvm: hook skb allocation to reserves
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 16/28] netvm: INET reserves Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:06   ` Andrew Morton
  2008-02-20 14:46 ` [PATCH 18/28] netvm: filter emergency skbs Peter Zijlstra
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm-skbuff-reserve.patch --]
[-- Type: text/plain, Size: 14777 bytes --]

Change the skb allocation API to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have different sharing semantics than those
indicated by skb_shinfo()->dataref.

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.
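
Consumers can then special-case such skbs; for instance, a later patch in
this series filters emergency skbs in the receive path, conceptually:

	if (skb_emergency(skb) && !sk_has_memalloc(sk))
		goto drop;	/* reserve memory is only for SOCK_MEMALLOC sockets */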

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 
 include/linux/skbuff.h   |   26 +++++-
 net/core/skbuff.c        |  177 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 177 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -308,7 +308,9 @@ struct sk_buff {
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
 #endif
-	/* 2 byte hole */
+	__u8 			emergency:1;
+				/* 7 bit hole */
+	/* 1 byte hole */
 
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
@@ -339,10 +341,22 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+	return unlikely(skb->emergency);
+#else
+	return false;
+#endif
+}
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -352,7 +366,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
@@ -1297,7 +1311,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -1343,6 +1358,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1359,7 +1375,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0, memalloc = sk_memalloc_socks();
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+	if (memalloc && (flags & SKB_ALLOC_RX))
+		gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
+#endif
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
-		goto out;
+		goto noskb;
 
-	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask, node);
 	if (!data)
@@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	 * See comment in sk_buff definition, just before the 'tail' member
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -219,7 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -227,12 +235,31 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
+noskb:
+#ifdef CONFIG_NETVM
+	/* Attempt emergency allocation when RX skb. */
+	if (likely(!(flags & SKB_ALLOC_RX) || !memalloc))
+		goto out;
+
+	if (!emergency) {
+		if (rx_emergency_get(size)) {
+			gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			gfp_mask |= __GFP_MEMALLOC;
+			emergency = 1;
+			goto retry_alloc;
+		}
+	} else
+		rx_emergency_put(size);
+#endif
+
 	goto out;
 }
 
@@ -255,7 +282,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -268,11 +295,36 @@ struct page *__netdev_alloc_page(struct 
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+#ifdef CONFIG_NETVM
+	gfp_mask |= __GFP_NOMEMALLOC | __GFP_NOWARN;
+#endif
+
 	page = alloc_pages_node(node, gfp_mask, 0);
+
+#ifdef CONFIG_NETVM
+	if (!page && rx_emergency_get(PAGE_SIZE)) {
+		gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_NOWARN);
+		gfp_mask |= __GFP_MEMALLOC;
+		page = alloc_pages_node(node, gfp_mask, 0);
+		if (!page)
+			rx_emergency_put(PAGE_SIZE);
+	}
+#endif
+
 	return page;
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+#ifdef CONFIG_NETVM
+	if (unlikely(page->reserve))
+		rx_emergency_put(PAGE_SIZE);
+#endif
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -280,6 +332,34 @@ void skb_add_rx_frag(struct sk_buff *skb
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+	/*
+	 * Fix-up the emergency accounting; make sure all pages match
+	 * skb->emergency.
+	 *
+	 * This relies on page->reserve to be preserved between
+	 * the call to __netdev_alloc_page() and this call.
+	 */
+	if (skb_emergency(skb)) {
+		/*
+		 * If the page was not an emergency alloc (ALLOC_NO_WATERMARK)
+		 * we can use overcommit accounting, since we already have the
+		 * memory.
+		 */
+		if (!page->reserve)
+			rx_emergency_get_overcommit(PAGE_SIZE);
+		atomic_set(&page->frag_count, 1);
+	} else if (unlikely(page->reserve)) {
+		/*
+		 * Rare case; the skb wasn't allocated under pressure but
+		 * the page was. We need to return the page. This can offset
+	 * the accounting a little, but it's a constant shift, it does
+		 * not accumulate.
+		 */
+		rx_emergency_put(PAGE_SIZE);
+	}
+#endif
 }
 EXPORT_SYMBOL(skb_add_rx_frag);
 
@@ -309,21 +389,47 @@ static void skb_clone_fraglist(struct sk
 		skb_get(list);
 }
 
+static inline void skb_get_page(struct sk_buff *skb, struct page *page)
+{
+	get_page(page);
+	if (skb_emergency(skb))
+		atomic_inc(&page->frag_count);
+}
+
+static inline void skb_put_page(struct sk_buff *skb, struct page *page)
+{
+	if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
+		rx_emergency_put(PAGE_SIZE);
+	put_page(page);
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+		int size;
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+		size = skb->end;
+#else
+		size = skb->end - skb->head;
+#endif
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
-			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+				skb_put_page(skb,
+					     skb_shinfo(skb)->frags[i].page);
+			}
 		}
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
 		kfree(skb->head);
+		if (skb_emergency(skb))
+			rx_emergency_put(size);
 	}
 }
 
@@ -444,6 +550,7 @@ static void __copy_skb_header(struct sk_
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	new->ipvs_property	= old->ipvs_property;
 #endif
+	new->emergency		= old->emergency;
 	new->protocol		= old->protocol;
 	new->mark		= old->mark;
 	__nf_copy(new, old);
@@ -532,6 +639,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_MEMALLOC;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -563,6 +673,14 @@ static void copy_skb_header(struct sk_bu
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		return SKB_ALLOC_RX;
+
+	return 0;
+}
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -583,15 +701,17 @@ static void copy_skb_header(struct sk_bu
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb->data - skb->head;
+	int size;
 	/*
 	 *	Allocate the copy buffer
 	 */
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end + skb->data_len, gfp_mask);
+	size = skb->end + skb->data_len;
 #else
-	n = alloc_skb(skb->end - skb->head + skb->data_len, gfp_mask);
+	size = skb->end - skb->head + skb->data_len;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		return NULL;
 
@@ -626,12 +746,14 @@ struct sk_buff *pskb_copy(struct sk_buff
 	/*
 	 *	Allocate the copy buffer
 	 */
+	int size;
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end, gfp_mask);
+	size = skb->end;
 #else
-	n = alloc_skb(skb->end - skb->head, gfp_mask);
+	size = skb->end - skb->head;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		goto out;
 
@@ -650,8 +772,9 @@ struct sk_buff *pskb_copy(struct sk_buff
 		int i;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			skb_shinfo(n)->frags[i] = *frag;
+			skb_get_page(n, frag->page);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -699,6 +822,14 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
+	if (skb_emergency(skb)) {
+		if (rx_emergency_get(size))
+			gfp_mask |= __GFP_MEMALLOC;
+		else
+			goto nodata;
+	} else
+		gfp_mask |= __GFP_NOMEMALLOC;
+
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
@@ -714,7 +845,7 @@ int pskb_expand_head(struct sk_buff *skb
 	       sizeof(struct skb_shared_info));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+		skb_get_page(skb, skb_shinfo(skb)->frags[i].page);
 
 	if (skb_shinfo(skb)->frag_list)
 		skb_clone_fraglist(skb);
@@ -793,8 +924,8 @@ struct sk_buff *skb_copy_expand(const st
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+					gfp_mask, skb_alloc_rx_flag(skb), -1);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
 	int off;
@@ -911,7 +1042,7 @@ drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
@@ -1080,7 +1211,7 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1852,6 +1983,7 @@ static inline void skb_split_no_header(s
 			skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
 
 			if (pos < len) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
 				/* Split frag.
 				 * We have two variants in this case:
 				 * 1. Move all the frag to the second
@@ -1860,7 +1992,7 @@ static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				skb_get_page(skb1, page);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -2190,7 +2322,8 @@ struct sk_buff *skb_segment(struct sk_bu
 		if (hsize > len || !sg)
 			hsize = len;
 
-		nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
+		nskb = __alloc_skb(hsize + doffset + headroom, GFP_ATOMIC,
+				   skb_alloc_rx_flag(skb), -1);
 		if (unlikely(!nskb))
 			goto err;
 
@@ -2235,7 +2368,7 @@ struct sk_buff *skb_segment(struct sk_bu
 			BUG_ON(i >= nfrags);
 
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			skb_get_page(nskb, frag->page);
 			size = frag->size;
 
 			if (pos < offset) {
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -74,6 +74,7 @@ struct page {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
 		int reserve;		/* page_alloc: page is a reserve page */
+		atomic_t frag_count;	/* skb fragment use count */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 18/28] netvm: filter emergency skbs.
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 17/28] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 19/28] netvm: prevent a stream specific deadlock Peter Zijlstra
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm-sk_filter.patch --]
[-- Type: text/plain, Size: 765 bytes --]

Toss all emergency packets that are not destined for a SOCK_MEMALLOC socket.
This ensures our precious memory reserve doesn't get stuck waiting for
user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.
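
For illustration, a minimal sketch of the socket side this check relies on,
assuming the sk_set_memalloc() helper introduced earlier in this series; the
example_* function below is hypothetical and not part of this patch:

#include <net/sock.h>

/* Hypothetical kernel user, e.g. the swap-over-NFS transport socket. */
static void example_mark_swap_socket(struct sock *sk)
{
	/*
	 * Mark the socket SOCK_MEMALLOC so that emergency skbs pass the
	 * sk_filter() check below instead of being tossed.
	 */
	sk_set_memalloc(sk);
}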

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -998,6 +998,9 @@ static inline int sk_filter(struct sock 
 {
 	int err;
 	struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;
 	
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 19/28] netvm: prevent a stream specific deadlock
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 18/28] netvm: filter emergency skbs Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm-tcp-deadlock.patch --]
[-- Type: text/plain, Size: 3124 bytes --]

It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This would prevent SOCK_MEMALLOC
sockets from receiving data, which would prevent userspace from running,
which in turn is needed to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    7 ++++---
 net/core/sock.c      |    2 +-
 net/ipv4/tcp_input.c |    8 ++++----
 net/sctp/ulpevent.c  |    2 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -791,12 +791,13 @@ static inline int sk_wmem_schedule(struc
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
-static inline int sk_rmem_schedule(struct sock *sk, int size)
+static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
-		__sk_mem_schedule(sk, size, SK_MEM_RECV);
+	return skb->truesize <= sk->sk_forward_alloc ||
+		__sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
+		skb_emergency(skb);
 }
 
 static inline void sk_mem_reclaim(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -388,7 +388,7 @@ int sock_queue_rcv_skb(struct sock *sk, 
 	if (err)
 		goto out;
 
-	if (!sk_rmem_schedule(sk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb)) {
 		err = -ENOBUFS;
 		goto out;
 	}
Index: linux-2.6/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_input.c
+++ linux-2.6/net/ipv4/tcp_input.c
@@ -3858,9 +3858,9 @@ static void tcp_data_queue(struct sock *
 queue_and_out:
 			if (eaten < 0 &&
 			    (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-			     !sk_rmem_schedule(sk, skb->truesize))) {
+			     !sk_rmem_schedule(sk, skb))) {
 				if (tcp_prune_queue(sk) < 0 ||
-				    !sk_rmem_schedule(sk, skb->truesize))
+				    !sk_rmem_schedule(sk, skb))
 					goto drop;
 			}
 			skb_set_owner_r(skb, sk);
@@ -3932,9 +3932,9 @@ drop:
 	TCP_ECN_check_ce(tp, skb);
 
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, skb->truesize)) {
+	    !sk_rmem_schedule(sk, skb)) {
 		if (tcp_prune_queue(sk) < 0 ||
-		    !sk_rmem_schedule(sk, skb->truesize))
+		    !sk_rmem_schedule(sk, skb))
 			goto drop;
 	}
 
Index: linux-2.6/net/sctp/ulpevent.c
===================================================================
--- linux-2.6.orig/net/sctp/ulpevent.c
+++ linux-2.6/net/sctp/ulpevent.c
@@ -701,7 +701,7 @@ struct sctp_ulpevent *sctp_ulpevent_make
 	if (rx_count >= asoc->base.sk->sk_rcvbuf) {
 
 		if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
-		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb->truesize)))
+		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb)))
 			goto fail;
 	}
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 19/28] netvm: prevent a stream specific deadlock Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 21/28] netvm: skb processing Peter Zijlstra
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 835 bytes --]

To avoid memory getting stuck waiting for userspace, drop all emergency
packets that receive an NF_QUEUE verdict. This of course requires the regular
storage route to not include an NF_QUEUE target ;-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/netfilter/core.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===================================================================
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
 		ret = 1;
 		goto unlock;
 	} else if (verdict == NF_DROP) {
+drop:
 		kfree_skb(skb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+		if (skb_emergency(*pskb))
+			goto drop;
 		if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
 			goto next_hook;

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 21/28] netvm: skb processing
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: netvm.patch --]
[-- Type: text/plain, Size: 4836 bytes --]

In order to make sure emergency packets receive all memory needed to proceed,
ensure that processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.
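
The pattern used below in both netif_receive_skb() and __sk_backlog_rcv(),
distilled into a stand-alone sketch (illustrative only; it assumes the
tsk_restore_flags() helper used by this patch):

#include <linux/sched.h>
#include <net/sock.h>

static int example_process_with_memalloc(struct sock *sk, struct sk_buff *skb,
					 int (*process)(struct sock *,
							struct sk_buff *))
{
	unsigned long pflags = current->flags;
	int ret;

	/* Allow this packet's processing to dip into the memory reserves. */
	current->flags |= PF_MEMALLOC;
	ret = process(sk, skb);
	/* Restore only the PF_MEMALLOC bit to its previous state. */
	tsk_restore_flags(current, pflags, PF_MEMALLOC);

	return ret;
}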

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    5 ++++
 net/core/dev.c     |   59 +++++++++++++++++++++++++++++++++++++++++++++++------
 net/core/sock.c    |   18 ++++++++++++++++
 3 files changed, 76 insertions(+), 6 deletions(-)

Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2004,6 +2004,30 @@ out:
 }
 #endif
 
+/*
+ * Filter the protocols for which the reserves are adequate.
+ *
+ * Before adding a protocol make sure that it is either covered by the existing
+ * reserves, or add reserves covering the memory need of the new protocol's
+ * packet processing.
+ */
+static int skb_emergency_protocol(struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		switch (skb->protocol) {
+		case __constant_htons(ETH_P_ARP):
+		case __constant_htons(ETH_P_IP):
+		case __constant_htons(ETH_P_IPV6):
+		case __constant_htons(ETH_P_8021Q):
+			break;
+
+		default:
+			return 0;
+		}
+
+	return 1;
+}
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2025,10 +2049,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_MEMALLOC sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (skb_emergency(skb))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (netpoll_receive_skb(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
@@ -2039,7 +2076,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -2058,6 +2095,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (skb_emergency(skb))
+		goto skip_taps;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -2066,19 +2106,23 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 ncls:
 #endif
 
+	if (!skb_emergency_protocol(skb))
+		goto drop;
+
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
@@ -2094,6 +2138,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -2101,8 +2146,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 	return ret;
 }
 
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -512,8 +512,13 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	if (skb_emergency(skb))
+		return __sk_backlog_rcv(sk, skb);
+
 	return sk->sk_backlog_rcv(sk, skb);
 }
 
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 
+#ifdef CONFIG_NETVM
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+	unsigned long pflags = current->flags;
+
+	/* these should have been dropped before queueing */
+	BUG_ON(!sk_has_memalloc(sk));
+
+	current->flags |= PF_MEMALLOC;
+	ret = sk->sk_backlog_rcv(sk, skb);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+	return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 22/28] mm: add support for non block device backed swap files
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 21/28] netvm: skb processing Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 16:30   ` Randy Dunlap
  2008-02-26 12:45   ` Miklos Szeredi
  2008-02-20 14:46 ` [PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
                   ` (6 subsequent siblings)
  28 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-swapfile.patch --]
[-- Type: text/plain, Size: 11996 bytes --]

New address_space_operations methods are added:
  int swapfile(struct address_space *, int);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When, during sys_swapon(), the swapfile() method is found and returns no
error, the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops
and make use of swap_{out,in}() to write/read swapcache pages.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and that the address_space should take adequate measures
(such as reserving memory for mempools).

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) provide a struct file * for credential context (normally not needed
    in the context of writepage, as the page content is normally dirtied
    using one of the following interfaces:
      write_{begin,end}()
      {prepare,commit}_write()
      page_mkwrite()
    which do have the file context).
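
As a sketch of the filesystem side (not part of this patch; the examplefs_*
names are hypothetical), a filesystem wanting to back swap files would wire
the new methods into its address_space_operations roughly as follows:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int examplefs_swapfile(struct address_space *mapping, int enable)
{
	/*
	 * enable != 0: called from swapon - pin the block map and reserve
	 * whatever memory ->swap_{out,in}() will need.
	 * enable == 0: called from swapoff - release those resources again.
	 */
	return 0;
}

static int examplefs_swap_out(struct file *file, struct page *page,
			      struct writeback_control *wbc)
{
	/* Write the swapcache page using the credentials in @file, then unlock. */
	unlock_page(page);
	return 0;
}

static int examplefs_swap_in(struct file *file, struct page *page)
{
	/* Read the swapcache page using the credentials in @file, then unlock. */
	unlock_page(page);
	return 0;
}

static const struct address_space_operations examplefs_aops = {
	/* ... the usual readpage/writepage/etc. methods ... */
	.swapfile	= examplefs_swapfile,
	.swap_out	= examplefs_swap_out,
	.swap_in	= examplefs_swap_in,
};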

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/filesystems/Locking |   19 +++++++++++++
 Documentation/filesystems/vfs.txt |   17 ++++++++++++
 include/linux/buffer_head.h       |    2 -
 include/linux/fs.h                |    8 +++++
 include/linux/swap.h              |    4 ++
 mm/page_io.c                      |   52 ++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c                   |    4 +-
 mm/swapfile.c                     |   26 ++++++++++++++++++-
 8 files changed, 128 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -120,6 +120,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -217,6 +218,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
@@ -250,6 +253,7 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
+extern struct swap_info_struct *page_swap_info(struct page *);
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 struct backing_dev_info;
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -97,11 +98,21 @@ int swap_writepage(struct page *page, st
 {
 	struct bio *bio;
 	int ret = 0, rw = WRITE;
+	struct swap_info_struct *sis = page_swap_info(page);
 
 	if (remove_exclusive_swap_page(page)) {
 		unlock_page(page);
 		goto out;
 	}
+
+	if (sis->flags & SWP_FILE) {
+		ret = sis->swap_file->f_mapping->
+			a_ops->swap_out(sis->swap_file, page, wbc);
+		if (!ret)
+			count_vm_event(PSWPOUT);
+		return ret;
+	}
+
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -120,13 +131,54 @@ out:
 	return ret;
 }
 
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		int (*spd)(struct page *) = a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+		if (!spd)
+			spd = __set_page_dirty_buffers;
+#endif
+		return (*spd)(page);
+	}
+
+	return __set_page_dirty_nobuffers(page);
+}
+
 int swap_readpage(struct file *file, struct page *page)
 {
 	struct bio *bio;
 	int ret = 0;
+	struct swap_info_struct *sis = page_swap_info(page);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageUptodate(page));
+
+	if (sis->flags & SWP_FILE) {
+		ret = sis->swap_file->f_mapping->
+			a_ops->swap_in(sis->swap_file, page);
+		if (!ret)
+			count_vm_event(PSWPIN);
+		return ret;
+	}
+
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -27,8 +27,8 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.sync_page	= block_sync_page,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.sync_page	= swap_sync_page,
+	.set_page_dirty	= swap_set_page_dirty,
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1012,6 +1012,12 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+
+	if (sis->flags & SWP_FILE) {
+		sis->flags &= ~SWP_FILE;
+		sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 0);
+	}
 }
 
 /*
@@ -1104,6 +1110,17 @@ static int setup_swap_extents(struct swa
 		goto done;
 	}
 
+	if (sis->swap_file->f_mapping->a_ops->swapfile) {
+		ret = sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 1);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1668,7 +1685,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
@@ -1793,6 +1810,13 @@ get_swap_info_struct(unsigned type)
 	return &swap_info[type];
 }
 
+struct swap_info_struct *page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return &swap_info[swp_type(swap)];
+}
+
 /*
  * swap_lock prevents swap_map being freed. Don't grab an extra
  * reference on the swaphandle, it doesn't matter if it becomes unused.
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -481,6 +481,14 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+
+	/*
+	 * swapfile support
+	 */
+	int (*swapfile)(struct address_space *, int);
+	int (*swap_out)(struct file *file, struct page *page,
+			struct writeback_control *wbc);
+	int (*swap_in)(struct file *file, struct page *page);
 };
 
 /*
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -171,6 +171,9 @@ prototypes:
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
 	int (*launder_page) (struct page *);
+	int (*swapfile) (struct address_space *, int);
+	int (*swap_out) (struct file *, struct page *, struct writeback_control *);
+	int (*swap_in)  (struct file *, struct page *);
 
 locking rules:
 	All except set_page_dirty may block
@@ -192,6 +195,9 @@ invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
 launder_page:		no	yes
+swapfile		no
+swap_out		no	yes, unlocks
+swap_in			no	yes, unlocks
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -291,6 +297,19 @@ cleaned, or an error value if not. Note 
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
+	->swapfile() will be called with a non zero argument on address spaces
+backing non block device backed swapfiles. A return value of zero indicates
+success. In which case this address space can be used for backing swapspace.
+The swapspace operations will be proxied to the address space operations.
+Swapoff will call this method with a zero argument to release the address
+space.
+
+	->swap_out() when swapfile() returned success, this method is used to
+write the swap page.
+
+	->swap_in() when swapfile() returned success, this method is used to
+read the swap page.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -339,7 +339,7 @@ static inline void invalidate_inode_buff
 static inline int remove_inode_buffers(struct inode *inode) { return 1; }
 static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
 static inline void invalidate_bdev(struct block_device *bdev) {}
-
+static inline void block_sync_page(struct page *) { }
 
 #endif /* CONFIG_BLOCK */
 #endif /* _LINUX_BUFFER_HEAD_H */
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -543,6 +543,10 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*swapfile)(struct address_space *, int);
+	int (*swap_out)(struct file *file, struct page *page,
+			struct writeback_control *wbc);
+	int (*swap_in)(struct file *file, struct page *page);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -728,6 +732,19 @@ struct address_space_operations {
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
 
+  swapfile: Called with a non-zero argument when swapon is used on a file. A
+	return value of zero indicates success. In which case this
+	address_space can be used to back swapspace. The swapspace operations
+	will be proxied to this address space's ->swap_{out,in} methods.
+	Swapoff will call this method with a zero argument to release the
+	address space.
+
+  swap_out: Called to write a swapcache page to a backing store, similar to
+	writepage.
+
+  swap_in: Called to read a swapcache page from a backing store, similar to
+	readpage.
+
 The File Object
 ===============
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 24/28] nfs: remove mempools Peter Zijlstra
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: mm-page_file_methods.patch --]
[-- Type: text/plain, Size: 3457 bytes --]

In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. For regular pagecache pages it is equivalent to page->index, but it
also gives the correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
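
A purely illustrative sketch of how a filesystem path can use these helpers
to resolve the backing inode and file position of a page, regardless of
whether it is a pagecache or a swapcache page (examplefs_page_inode() is a
hypothetical helper, not part of this patch):

#include <linux/mm.h>
#include <linux/pagemap.h>

static struct inode *examplefs_page_inode(struct page *page, loff_t *pos)
{
	/* swap_file->f_mapping for swapcache pages, page->mapping otherwise */
	struct address_space *mapping = page_file_mapping(page);

	/* swp_offset() based index for swapcache pages, ->index otherwise */
	*pos = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT;

	return mapping->host;
}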

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm.h      |   25 +++++++++++++++++++++++++
 include/linux/pagemap.h |    2 +-
 mm/swapfile.c           |   19 +++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -600,6 +600,17 @@ static inline struct address_space *page
 	return mapping;
 }
 
+extern struct address_space *__page_file_mapping(struct page *);
+
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_mapping(page);
+
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -616,6 +627,20 @@ static inline pgoff_t page_index(struct 
 	return page->index;
 }
 
+extern pgoff_t __page_file_index(struct page *page);
+
+/*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_index(page);
+
+	return page->index;
+}
+
 /*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1818,6 +1818,25 @@ struct swap_info_struct *page_swap_info(
 }
 
 /*
+ * out-of-line __page_file_ methods to avoid include hell.
+ */
+
+struct address_space *__page_file_mapping(struct page *page)
+{
+	VM_BUG_ON(!PageSwapCache(page));
+	return page_swap_info(page)->swap_file->f_mapping;
+}
+EXPORT_SYMBOL_GPL(__page_file_mapping);
+
+pgoff_t __page_file_index(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	VM_BUG_ON(!PageSwapCache(page));
+	return swp_offset(swap);
+}
+EXPORT_SYMBOL_GPL(__page_file_index);
+
+/*
  * swap_lock prevents swap_map being freed. Don't grab an extra
  * reference on the swaphandle, it doesn't matter if it becomes unused.
  */

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 24/28] nfs: remove mempools
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-no-mempool.patch --]
[-- Type: text/plain, Size: 4919 bytes --]

With the introduction of the shared dirty page accounting in 2.6.19, NFS
should not be able to surprise the VM with all dirty pages. Thus it should
always be able to free some memory. Hence there is no more need for mempools.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/read.c  |   15 +++------------
 fs/nfs/write.c |   27 +++++----------------------
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ	(32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_rdata_mempool);
+				kmem_cache_free(nfs_rdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
 	struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_rdata_mempool);
+	kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -595,16 +592,10 @@ int __init nfs_init_readpagecache(void)
 	if (nfs_rdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-						     nfs_rdata_cachep);
-	if (nfs_rdata_mempool == NULL)
-		return -ENOMEM;
-
 	return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-	mempool_destroy(nfs_rdata_mempool);
 	kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE		(32)
-#define MIN_POOL_COMMIT		(4)
-
 /*
  * Local function declarations
  */
@@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_commit_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_wdata_mempool);
+				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_wdata_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1514,16 +1509,6 @@ int __init nfs_init_writepagecache(void)
 	if (nfs_wdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
-						     nfs_wdata_cachep);
-	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
-
-	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
-						      nfs_wdata_cachep);
-	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
-
 	/*
 	 * NFS congestion size, scale with available memory.
 	 *
@@ -1549,8 +1534,6 @@ int __init nfs_init_writepagecache(void)
 
 void nfs_destroy_writepagecache(void)
 {
-	mempool_destroy(nfs_commit_mempool);
-	mempool_destroy(nfs_wdata_mempool);
 	kmem_cache_destroy(nfs_wdata_cachep);
 }
 

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 24/28] nfs: remove mempools Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 26/28] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-swapcache.patch --]
[-- Type: text/plain, Size: 12476 bytes --]

Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/file.c     |    6 +++---
 fs/nfs/internal.h |    7 ++++---
 fs/nfs/pagelist.c |    6 +++---
 fs/nfs/read.c     |    6 +++---
 fs/nfs/write.c    |   51 ++++++++++++++++++++++++++-------------------------
 5 files changed, 39 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -359,7 +359,7 @@ static void nfs_invalidate_page(struct p
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
-	nfs_wb_page_cancel(page->mapping->host, page);
+	nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -370,7 +370,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-	return nfs_wb_page(page->mapping->host, page);
+	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -397,7 +397,7 @@ static int nfs_vm_page_mkwrite(struct vm
 	struct address_space *mapping;
 
 	lock_page(page);
-	mapping = page->mapping;
+	mapping = page_file_mapping(page);
 	if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping)
 		goto out_unlock;
 
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -76,11 +76,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -376,7 +376,7 @@ void nfs_pageio_cond_complete(struct nfs
  * nfs_scan_list - Scan a list for matching requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  * @tag: tag to scan for
  *
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -458,11 +458,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -509,7 +509,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 	int error;
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
@@ -138,13 +138,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -155,7 +155,7 @@ static void nfs_grow_file(struct page *p
 static void nfs_set_pageerror(struct page *page)
 {
 	SetPageError(page);
-	nfs_zap_mapping(page->mapping->host, page->mapping);
+	nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
 }
 
 /* We can set the PG_uptodate flag if we see that a write request
@@ -185,7 +185,7 @@ static int nfs_writepage_setup(struct nf
 		ret = PTR_ERR(req);
 		if (ret != -EBUSY)
 			return ret;
-		ret = nfs_wb_page(page->mapping->host, page);
+		ret = nfs_wb_page(page_file_mapping(page)->host, page);
 		if (ret != 0)
 			return ret;
 	}
@@ -219,7 +219,7 @@ static int nfs_set_page_writeback(struct
 	int ret = test_set_page_writeback(page);
 
 	if (!ret) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_long_inc_return(&nfss->writeback) >
@@ -231,7 +231,7 @@ static int nfs_set_page_writeback(struct
 
 static void nfs_end_page_writeback(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
@@ -246,7 +246,7 @@ static void nfs_end_page_writeback(struc
 static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
 				struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req;
 	int ret;
 
@@ -289,12 +289,12 @@ static int nfs_page_async_flush(struct n
 
 static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
 	nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);
 
-	nfs_pageio_cond_complete(pgio, page->index);
+	nfs_pageio_cond_complete(pgio, page_file_index(page));
 	return nfs_page_async_flush(pgio, page);
 }
 
@@ -306,7 +306,7 @@ static int nfs_writepage_locked(struct p
 	struct nfs_pageio_descriptor pgio;
 	int err;
 
-	nfs_pageio_init_write(&pgio, page->mapping->host, wb_priority(wbc));
+	nfs_pageio_init_write(&pgio, page_file_mapping(page)->host, wb_priority(wbc));
 	err = nfs_do_writepage(page, wbc, &pgio);
 	nfs_pageio_complete(&pgio);
 	if (err < 0)
@@ -438,7 +438,8 @@ nfs_mark_request_commit(struct nfs_page 
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+	inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
+			BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -525,7 +526,7 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+		dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
 				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
@@ -539,7 +540,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -575,7 +576,7 @@ static inline int nfs_scan_commit(struct
 static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
 		struct page *page, unsigned int offset, unsigned int bytes)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_file_mapping(page);
 	struct inode *inode = mapping->host;
 	struct nfs_page		*req, *new = NULL;
 	pgoff_t		rqend, end;
@@ -686,7 +687,7 @@ int nfs_flush_incompatible(struct file *
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status = nfs_wb_page(page->mapping->host, page);
+		status = nfs_wb_page(page_file_mapping(page)->host, page);
 	} while (status == 0);
 	return status;
 }
@@ -712,7 +713,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = nfs_file_open_context(file);
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	int		status = 0;
 
 	nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
@@ -980,7 +981,7 @@ static void nfs_writeback_done_partial(s
 	}
 
 	if (nfs_write_need_commit(data)) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 
 		spin_lock(&inode->i_lock);
 		if (test_bit(PG_NEED_RESCHED, &req->wb_flags)) {
@@ -1231,7 +1232,7 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+		dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
 				BDI_RECLAIMABLE);
 		nfs_clear_page_tag_locked(req);
 	}
@@ -1258,7 +1259,7 @@ static void nfs_commit_done(struct rpc_t
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+		dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
 				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
@@ -1424,7 +1425,7 @@ int nfs_wb_page_cancel(struct inode *ino
 	loff_t range_start = page_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1457,7 +1458,7 @@ int nfs_wb_page_cancel(struct inode *ino
 	}
 	if (!PagePrivate(page))
 		return 0;
-	ret = nfs_sync_mapping_wait(page->mapping, &wbc, FLUSH_INVALIDATE);
+	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
 out:
 	return ret;
 }
@@ -1468,7 +1469,7 @@ static int nfs_wb_page_priority(struct i
 	loff_t range_start = page_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1484,7 +1485,7 @@ static int nfs_wb_page_priority(struct i
 	}
 	if (!PagePrivate(page))
 		return 0;
-	ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
+	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)
 		return 0;
 out:
Index: linux-2.6/fs/nfs/internal.h
===================================================================
--- linux-2.6.orig/fs/nfs/internal.h
+++ linux-2.6/fs/nfs/internal.h
@@ -252,13 +252,14 @@ void nfs_super_set_maxbytes(struct super
 static inline
 unsigned int nfs_page_length(struct page *page)
 {
-	loff_t i_size = i_size_read(page->mapping->host);
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
 
 	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
 		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
-		if (page->index < end_index)
+		if (page_index < end_index)
 			return PAGE_CACHE_SIZE;
-		if (page->index == end_index)
+		if (page_index == end_index)
 			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
 	}
 	return 0;

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 26/28] nfs: disable data cache revalidation for swapfiles
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 27/28] nfs: enable swap on NFS Peter Zijlstra
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 5292 bytes --]

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/inode.c |    6 ++++
 fs/nfs/write.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 65 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -763,6 +763,12 @@ int nfs_revalidate_mapping_nolock(struct
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
 	struct nfs_page *req = NULL;
 
-	if (PagePrivate(page)) {
+	if (PagePrivate(page))
 		req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			kref_get(&req->wb_kref);
-	}
+	else if (unlikely(PageSwapCache(page)))
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+	if (get && req)
+		kref_get(&req->wb_kref);
+
 	return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+	return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+	struct inode *inode = page_file_mapping(page)->host;
+	struct nfs_page *req = NULL;
+
+	spin_lock(&inode->i_lock);
+	req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+	spin_unlock(&inode->i_lock);
+
+	/*
+	 * hole here plugged by the caller holding onto PG_locked
+	 */
+
+	return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+	if (PagePrivate(page))
+		return 1;
+
+	if (unlikely(PageSwapCache(page)))
+		return __nfs_page_has_request(page);
+
+	return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
 	spin_unlock(&inode->i_lock);
 	return req;
 }
@@ -252,7 +289,7 @@ static int nfs_page_async_flush(struct n
 
 	spin_lock(&inode->i_lock);
 	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req == NULL) {
 			spin_unlock(&inode->i_lock);
 			return 0;
@@ -367,8 +404,14 @@ static void nfs_inode_add_request(struct
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	nfsi->npages++;
 	kref_get(&req->wb_kref);
 	radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
@@ -386,8 +429,10 @@ static void nfs_inode_remove_request(str
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&inode->i_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
 	radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
 	nfsi->npages--;
 	if (!nfsi->npages) {
@@ -593,7 +638,7 @@ static struct nfs_page * nfs_update_requ
 		}
 
 		spin_lock(&inode->i_lock);
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req) {
 			if (!nfs_set_page_tag_locked(req)) {
 				int error;
@@ -1460,7 +1505,7 @@ int nfs_wb_page_cancel(struct inode *ino
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
+	if (!nfs_page_has_request(page))
 		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
 out:
@@ -1487,7 +1532,7 @@ static int nfs_wb_page_priority(struct i
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
+	if (!nfs_page_has_request(page))
 		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 27/28] nfs: enable swap on NFS
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (25 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 26/28] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-20 14:46 ` [PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
  2008-02-23  8:06 ` [PATCH 00/28] Swap over NFS -v16 Andrew Morton
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-swap_ops.patch --]
[-- Type: text/plain, Size: 10795 bytes --]

Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC, as well as re-set
SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects,
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/Kconfig                  |   17 +++++++++++
 fs/nfs/file.c               |   12 ++++++++
 fs/nfs/write.c              |   19 +++++++++++++
 include/linux/nfs_fs.h      |    2 +
 include/linux/sunrpc/xprt.h |    5 ++-
 net/sunrpc/sched.c          |    9 ++++--
 net/sunrpc/xprtsock.c       |   63 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 124 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -373,6 +373,13 @@ static int nfs_launder_page(struct page 
 	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -387,6 +394,11 @@ const struct address_space_operations nf
 	.direct_IO = nfs_direct_IO,
 #endif
 	.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+	.swapfile = nfs_swapfile,
+	.swap_out = nfs_swap_out,
+	.swap_in = nfs_readpage,
+#endif
 };
 
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -362,6 +362,25 @@ int nfs_writepage(struct page *page, str
 	return ret;
 }
 
+int nfs_swap_out(struct file *file, struct page *page,
+		 struct writeback_control *wbc)
+{
+	struct nfs_open_context *ctx = nfs_file_open_context(file);
+	int status;
+
+	status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+	if (status < 0) {
+		nfs_set_pageerror(page);
+		goto out;
+	}
+
+	status = nfs_writepage_locked(page, wbc);
+
+out:
+	unlock_page(page);
+	return status;
+}
+
 static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
 {
 	int ret;
Index: linux-2.6/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.orig/include/linux/nfs_fs.h
+++ linux-2.6/include/linux/nfs_fs.h
@@ -453,6 +453,8 @@ extern int  nfs_flush_incompatible(struc
 extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
+extern int  nfs_swap_out(struct file *file, struct page *page,
+			 struct writeback_control *wbc);
 
 /*
  * Try to write back everything synchronously (but check the
Index: linux-2.6/fs/Kconfig
===================================================================
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1661,6 +1661,18 @@ config NFS_DIRECTIO
 	  causes open() to return EINVAL if a file residing in NFS is
 	  opened with the O_DIRECT flag.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
+	  For more details, see Documentation/vm_deadlock.txt
+
+	  If unsure, say N.
+
 config NFSD
 	tristate "NFS server support"
 	depends on INET
@@ -1794,6 +1806,11 @@ config SUNRPC_BIND34
 	  If unsure, say N to get traditional behavior (version 2 rpcbind
 	  requests only).
 
+config SUNRPC_SWAP
+	def_bool n
+	depends on SUNRPC
+	select NETVM
+
 config RPCSEC_GSS_KRB5
 	tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
 	depends on SUNRPC && EXPERIMENTAL
Index: linux-2.6/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -143,7 +143,9 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
 	unsigned int		bind_index;	/* bind function index */
 
 	/*
@@ -241,6 +243,7 @@ void			xprt_complete_rqst(struct rpc_tas
 void			xprt_release_rqst_cong(struct rpc_task *task);
 void			xprt_disconnect_done(struct rpc_xprt *xprt);
 void			xprt_force_disconnect(struct rpc_xprt *xprt);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
Index: linux-2.6/net/sunrpc/sched.c
===================================================================
--- linux-2.6.orig/net/sunrpc/sched.c
+++ linux-2.6/net/sunrpc/sched.c
@@ -766,7 +766,10 @@ struct rpc_buffer {
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
 	struct rpc_buffer *buf;
-	gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+	gfp_t gfp = GFP_NOWAIT;
+
+	if (RPC_IS_SWAPPER(task))
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
@@ -839,6 +842,8 @@ static void rpc_init_task(struct rpc_tas
 		kref_get(&task->tk_client->cl_kref);
 		if (task->tk_client->cl_softrtry)
 			task->tk_flags |= RPC_TASK_SOFT;
+		if (task->tk_client->cl_xprt->swapper)
+			task->tk_flags |= RPC_TASK_SWAPPER;
 	}
 
 	if (task->tk_ops->rpc_call_prepare != NULL)
@@ -865,7 +870,7 @@ static void rpc_init_task(struct rpc_tas
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 static void rpc_free_task(struct rcu_head *rcu)
Index: linux-2.6/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1451,6 +1451,9 @@ static void xs_udp_finish_connecting(str
 		transport->sock = sock;
 		transport->inet = sk;
 
+		if (xprt->swapper)
+			sk_set_memalloc(sk);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1468,11 +1471,15 @@ static void xs_udp_connect_worker4(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1495,6 +1502,7 @@ static void xs_udp_connect_worker4(struc
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -1509,11 +1517,15 @@ static void xs_udp_connect_worker6(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1536,6 +1548,7 @@ static void xs_udp_connect_worker6(struc
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1595,6 +1608,9 @@ static int xs_tcp_finish_connecting(stru
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 
+	if (xprt->swapper)
+		sk_set_memalloc(transport->inet);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -1613,11 +1629,15 @@ static void xs_tcp_connect_worker4(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1659,6 +1679,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -1673,11 +1694,15 @@ static void xs_tcp_connect_worker6(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1718,6 +1743,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -2055,6 +2081,43 @@ int init_socket_xprt(void)
 	return 0;
 }
 
+#ifdef CONFIG_SUNRPC_SWAP
+#define RPC_BUF_RESERVE_PAGES \
+	kestimate_single(sizeof(struct rpc_rqst), GFP_KERNEL, RPC_MAX_SLOT_TABLE)
+#define RPC_RESERVE_PAGES	(RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	int err = 0;
+
+	if (enable) {
+		/*
+		 * keep one extra sock reference so the reserve won't dip
+		 * when the socket gets reconnected.
+		 */
+		err = sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
+		if (!err) {
+			sk_set_memalloc(transport->inet);
+			xprt->swapper = 1;
+		}
+	} else if (xprt->swapper) {
+		xprt->swapper = 0;
+		sk_clear_memalloc(transport->inet);
+		sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+#endif
+
 /**
  * cleanup_socket_xprt - remove xprtsock's sysctls, unregister
  *

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS.
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (26 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 27/28] nfs: enable swap on NFS Peter Zijlstra
@ 2008-02-20 14:46 ` Peter Zijlstra
  2008-02-23  8:06 ` [PATCH 00/28] Swap over NFS -v16 Andrew Morton
  28 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 14:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust
  Cc: Peter Zijlstra

[-- Attachment #1: nfs-alloc-recursions.patch --]
[-- Type: text/plain, Size: 1861 bytes --]

GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);

--


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 22/28] mm: add support for non block device backed swap files
  2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
@ 2008-02-20 16:30   ` Randy Dunlap
  2008-02-20 16:46     ` Peter Zijlstra
  2008-02-26 12:45   ` Miklos Szeredi
  1 sibling, 1 reply; 73+ messages in thread
From: Randy Dunlap @ 2008-02-20 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Wed, 20 Feb 2008 15:46:32 +0100 Peter Zijlstra wrote:

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  Documentation/filesystems/Locking |   19 +++++++++++++
>  Documentation/filesystems/vfs.txt |   17 ++++++++++++
>  include/linux/buffer_head.h       |    2 -
>  include/linux/fs.h                |    8 +++++
>  include/linux/swap.h              |    4 ++
>  mm/page_io.c                      |   52 ++++++++++++++++++++++++++++++++++++++
>  mm/swap_state.c                   |    4 +-
>  mm/swapfile.c                     |   26 ++++++++++++++++++-
>  8 files changed, 128 insertions(+), 4 deletions(-)

> Index: linux-2.6/Documentation/filesystems/Locking
> ===================================================================
> --- linux-2.6.orig/Documentation/filesystems/Locking
> +++ linux-2.6/Documentation/filesystems/Locking

> @@ -291,6 +297,19 @@ cleaned, or an error value if not. Note 
>  getting mapped back in and redirtied, it needs to be kept locked
>  across the entire operation.
>  
> +	->swapfile() will be called with a non zero argument on address spaces

                                           non-zero

> +backing non block device backed swapfiles. A return value of zero indicates
> +success. In which case this address space can be used for backing swapspace.

   success, in which case

> +The swapspace operations will be proxied to the address space operations.
> +Swapoff will call this method with a zero argument to release the address
> +space.
> +
> +	->swap_out() when swapfile() returned success, this method is used to
> +write the swap page.
> +
> +	->swap_in() when swapfile() returned success, this method is used to
> +read the swap page.
> +
>  	Note: currently almost all instances of address_space methods are
>  using BKL for internal serialization and that's one of the worst sources
>  of contention. Normally they are calling library functions (in fs/buffer.c)

> Index: linux-2.6/Documentation/filesystems/vfs.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/filesystems/vfs.txt
> +++ linux-2.6/Documentation/filesystems/vfs.txt
> @@ -728,6 +732,19 @@ struct address_space_operations {
>    	prevent redirtying the page, it is kept locked during the whole
>  	operation.
>  
> +  swapfile: Called with a non-zero argument when swapon is used on a file. A
> +	return value of zero indicates success. In which case this

                                       success, in which case this

> +	address_space can be used to back swapspace. The swapspace operations
> +	will be proxied to this address space's ->swap_{out,in} methods.
> +	Swapoff will call this method with a zero argument to release the
> +	address space.
> +
> +  swap_out: Called to write a swapcache page to a backing store, similar to
> +	writepage.
> +
> +  swap_in: Called to read a swapcache page from a backing store, similar to
> +	readpage.
> +
>  The File Object
>  ===============

---
~Randy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 22/28] mm: add support for non block device backed swap files
  2008-02-20 16:30   ` Randy Dunlap
@ 2008-02-20 16:46     ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-20 16:46 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Wed, 2008-02-20 at 08:30 -0800, Randy Dunlap wrote:
> On Wed, 20 Feb 2008 15:46:32 +0100 Peter Zijlstra wrote:

< grammar mistakes >

Thanks Randy!


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 04/28] mm: kmem_estimate_pages()
  2008-02-20 14:46 ` [PATCH 04/28] mm: kmem_estimate_pages() Peter Zijlstra
@ 2008-02-23  8:05   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:14 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.
> 
> This lays the foundation for a generic reserve framework as presented in
> a later patch in this series. This framework needs to convert object demand
> (kmalloc() bytes, kmem_cache_alloc() objects) to pages.
> 
> ...
>
>  /*
> + * return the max number of pages required to allocated count
> + * objects from the given cache
> + */
> +unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)

You might want to have another go at that comment.

> +/*
> + * return the max number of pages required to allocate @bytes from kmalloc
> + * in an unspecified number of allocation of heterogeneous size.
> + */
> +unsigned kestimate(gfp_t flags, size_t bytes)

And its pal.
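
Perhaps something along these lines (wording only, just to show the sort of
thing I mean):

  /*
   * Return an upper bound on the number of pages needed to allocate
   * @objects objects from cache @s with allocation flags @flags.
   */
  unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)

  /*
   * Return an upper bound on the number of pages needed to satisfy
   * kmalloc() requests totalling @bytes bytes, spread over an unknown
   * number of allocations of unknown sizes.
   */
  unsigned kestimate(gfp_t flags, size_t bytes)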



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context
  2008-02-20 14:46 ` [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2008-02-23  8:05   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:15 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> a borrowed context save current->flags, ksoftirqd will have its own 
> task_struct.

The second sentence doesn't make sense.

> This is needed to allow network softirq packet processing to make use of
> PF_MEMALLOC.
>
> ...
>
> +#define tsk_restore_flags(p, pflags, mask) \
> +	do {	(p)->flags &= ~(mask); \
> +		(p)->flags |= ((pflags) & (mask)); } while (0)
> +

Does it need to be a macro?

If so, it really should cook up a temporary to avoid referencing p twice -
the children might be watching.
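
Untested sketch of the non-macro version, which avoids the double evaluation
altogether:

  static inline void tsk_restore_flags(struct task_struct *p,
                                       unsigned long pflags,
                                       unsigned long mask)
  {
          /* restore only the bits in @mask from the saved @pflags */
          p->flags = (p->flags & ~mask) | (pflags & mask);
  }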

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/28] mm: emergency pool
  2008-02-20 14:46 ` [PATCH 07/28] mm: emergency pool Peter Zijlstra
@ 2008-02-23  8:05   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:17 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> @@ -213,7 +213,7 @@ enum zone_type {
>  
>  struct zone {
>  	/* Fields commonly accessed by the page allocator */
> -	unsigned long		pages_min, pages_low, pages_high;
> +	unsigned long		pages_emerg, pages_min, pages_low, pages_high;

It would be nice to make these one-per-line, then document them.
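
Roughly like so, say (the pages_emerg description below is just my reading of
the patch):

  	/* pages reserved for PF_MEMALLOC / __GFP_MEMALLOC allocations */
  	unsigned long		pages_emerg;

  	/* the usual allocator watermarks */
  	unsigned long		pages_min;
  	unsigned long		pages_low;
  	unsigned long		pages_high;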

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK
  2008-02-20 14:46 ` [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2008-02-23  8:05   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:18 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
> wide - which they are per setup_per_zone_pages_min(), when we scrape the
> barrel, do it properly.
> 

The changelog is fairly incomprehensible.

>  mm/page_alloc.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1552,6 +1552,12 @@ restart:
>  rebalance:
>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  nofail_alloc:
> +		/*
> +		 * break out of mempolicy boundaries
> +		 */
> +		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> +			gfp_zone(gfp_mask);
> +
>  		/* go through the zonelist yet again, ignoring mins */
>  		page = get_page_from_freelist(gfp_mask, order, zonelist,
>  				ALLOC_NO_WATERMARKS);

As is the patch.  People who care about mempolicies will want a better
explanation, please, so they can check that we're not busting their stuff.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 09/28] mm: __GFP_MEMALLOC
  2008-02-20 14:46 ` [PATCH 09/28] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2008-02-23  8:06   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:19 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
> much like PF_MEMALLOC.
> 

'twould be nice if the changelog had some explanation of the reason
for this change.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 10/28] mm: memory reserve management
  2008-02-20 14:46 ` [PATCH 10/28] mm: memory reserve management Peter Zijlstra
@ 2008-02-23  8:06   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust,
	Alexey Dobriyan

On Wed, 20 Feb 2008 15:46:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Generic reserve management code. 
> 
> It provides methods to reserve and charge. Upon this, generic alloc/free style
> reserve pools could be build, which could fully replace mempool_t
> functionality.
> 
> It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Generally: the comments in this code are a bit straggly and hard to follow.
They'd be worth a revisit.

> +/*
> + * Simple output of the reserve tree in: /proc/reserve_info
> + * Example:
> + *
> + * localhost ~ # cat /proc/reserve_info
> + * total reserve                  8156K (0/544817)
> + *   total network reserve          8156K (0/544817)
> + *     network TX reserve             196K (0/49)
> + *       protocol TX pages              196K (0/49)
> + *     network RX reserve             7960K (0/544768)
> + *       IPv6 route cache               1372K (0/4096)
> + *       IPv4 route cache               5468K (0/16384)
> + *       SKB data reserve               1120K (0/524288)
> + *         IPv6 fragment cache            560K (0/262144)
> + *         IPv4 fragment cache            560K (0/262144)
> + */

Well, "Simple" was a freudian typo.  Not designed for programmatic parsing,
I see.

> +static __init int mem_reserve_proc_init(void)
> +{
> +	struct proc_dir_entry *entry;
> +
> +	entry = create_proc_entry("reserve_info", S_IRUSR, NULL);

I think we're supposed to use proc_create().  Blame Alexey.

> +	if (entry)
> +		entry->proc_fops = &mem_reserve_opterations;
> +
> +	return 0;
> +}
> +
> +__initcall(mem_reserve_proc_init);

module_init() is more trendy.
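
I.e. something like this (sketch only, keeping the existing
mem_reserve_opterations name):

  static int __init mem_reserve_proc_init(void)
  {
          if (!proc_create("reserve_info", S_IRUSR, NULL,
                           &mem_reserve_opterations))
                  return -ENOMEM;
          return 0;
  }
  module_init(mem_reserve_proc_init);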



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 15/28] netvm: network reserve infrastructure
  2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
@ 2008-02-23  8:06   ` Andrew Morton
  2008-02-24  6:52   ` Mike Snitzer
  1 sibling, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:25 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Provide the basic infrastructure to reserve and charge/account network memory.
> 
> We provide the following reserve tree:
> 
> 1)  total network reserve
> 2)    network TX reserve
> 3)      protocol TX pages
> 4)    network RX reserve
> 5)      SKB data reserve
> 
> [1] is used to make all the network reserves a single subtree, for easy
> manipulation.
> 
> [2] and [4] are merely for aesthetic reasons.
> 
> The TX pages reserve [3] is assumed bounded by it being the upper bound of
> memory that can be used for sending pages (not quite true, but good enough)
> 
> The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
> against in the fallback path.
> 
> The consumers for these reserves are sockets marked with:
>   SOCK_MEMALLOC
> 
> Such sockets are to be used to service the VM (iow. to swap over). They
> must be handled kernel side, exposing such a socket to user-space is a BUG.
> 
> +/**
> + *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + *	@socks: number of new %SOCK_MEMALLOC sockets
> + *	@tx_resserve_pages: number of pages to (un)reserve for TX
> + *
> + *	This function adjusts the memalloc reserve based on system demand.
> + *	The RX reserve is a limit, and only added once, not for each socket.
> + *
> + *	NOTE:
> + *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
> + *	   we need not account the pages like we do for RX pages.
> + */
> +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
> +{
> +	int nr_socks;
> +	int err;
> +
> +	err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
> +	if (err)
> +		return err;
> +
> +	nr_socks = atomic_read(&memalloc_socks);
> +	if (!nr_socks && socks > 0)
> +		err = mem_reserve_connect(&net_reserve, &mem_reserve_root);

This looks like it should have some locking?

> +	nr_socks = atomic_add_return(socks, &memalloc_socks);
> +	if (!nr_socks && socks)
> +		err = mem_reserve_disconnect(&net_reserve);

Or does that try to make up for it?  Still looks fishy.

> +	if (err)
> +		mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
> +
> +	return err;
> +}
> +
> +/**
> + *	sk_set_memalloc - sets %SOCK_MEMALLOC
> + *	@sk: socket to set it on
> + *
> + *	Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
> + *	accordingly.
> + */
> +int sk_set_memalloc(struct sock *sk)
> +{
> +	int set = sock_flag(sk, SOCK_MEMALLOC);
> +#ifndef CONFIG_NETVM
> +	BUG();
> +#endif

??  #error, maybe?

> +	if (!set) {
> +		int err = sk_adjust_memalloc(1, 0);
> +		if (err)
> +			return err;
> +
> +		sock_set_flag(sk, SOCK_MEMALLOC);
> +		sk->sk_allocation |= __GFP_MEMALLOC;
> +	}
> +	return !set;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_memalloc);
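
Even something as simple as one mutex around the whole thing would do.
Sketch only - it assumes all callers can sleep, and __sk_adjust_memalloc()
here is just the existing body moved into a helper:

  static DEFINE_MUTEX(memalloc_socks_lock);

  int sk_adjust_memalloc(int socks, long tx_reserve_pages)
  {
          int err;

          /* serialize the connect/disconnect transitions */
          mutex_lock(&memalloc_socks_lock);
          err = __sk_adjust_memalloc(socks, tx_reserve_pages);
          mutex_unlock(&memalloc_socks_lock);

          return err;
  }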


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 17/28] netvm: hook skb allocation to reserves
  2008-02-20 14:46 ` [PATCH 17/28] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2008-02-23  8:06   ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:27 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Change the skb allocation api to indicate RX usage and use this to fall back to
> the reserve when needed. SKBs allocated from the reserve are tagged in
> skb->emergency.
> 
> Teach all other skb ops about emergency skbs and the reserve accounting.
> 
> Use the (new) packet split API to allocate and track fragment pages from the
> emergency reserve. Do this using an atomic counter in page->index. This is
> needed because the fragments have a different sharing semantic than that
> indicated by skb_shinfo()->dataref. 
> 
> Note that the decision to distinguish between regular and emergency SKBs allows
> the accounting overhead to be limited to the later kind.
> 
> ...
>
> +static inline void skb_get_page(struct sk_buff *skb, struct page *page)
> +{
> +	get_page(page);
> +	if (skb_emergency(skb))
> +		atomic_inc(&page->frag_count);
> +}
> +
> +static inline void skb_put_page(struct sk_buff *skb, struct page *page)
> +{
> +	if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
> +		rx_emergency_put(PAGE_SIZE);
> +	put_page(page);
> +}

I'm thinking we should do `#define slowcall inline' then use that in the future.

>  static void skb_release_data(struct sk_buff *skb)
>  {
>  	if (!skb->cloned ||
>  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>  			       &skb_shinfo(skb)->dataref)) {
> +		int size;
> +
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> +		size = skb->end;
> +#else
> +		size = skb->end - skb->head;
> +#endif

The patch adds rather a lot of ifdefs.
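
A little helper would at least keep it in one place (name invented here):

  static inline unsigned int skb_end_offset(const struct sk_buff *skb)
  {
  #ifdef NET_SKBUFF_DATA_USES_OFFSET
          return skb->end;                /* already an offset */
  #else
          return skb->end - skb->head;    /* pointer variant */
  #endif
  }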



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
                   ` (27 preceding siblings ...)
  2008-02-20 14:46 ` [PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
@ 2008-02-23  8:06 ` Andrew Morton
  2008-02-26  6:03   ` Neil Brown
  28 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2008-02-23  8:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Another posting of the full swap over NFS series. 

Well I looked.  There's rather a lot of it and I wouldn't pretend to
understand it.

What is the NFS and net people's take on all of this?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 15/28] netvm: network reserve infrastructure
  2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
  2008-02-23  8:06   ` Andrew Morton
@ 2008-02-24  6:52   ` Mike Snitzer
  1 sibling, 0 replies; 73+ messages in thread
From: Mike Snitzer @ 2008-02-24  6:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Wed, Feb 20, 2008 at 9:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Provide the basic infrastructure to reserve and charge/account network memory.
...

>  Index: linux-2.6/net/core/sock.c
>  ===================================================================
>  --- linux-2.6.orig/net/core/sock.c
>  +++ linux-2.6/net/core/sock.c
...
>  +/**
>  + *     sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
>  + *     @socks: number of new %SOCK_MEMALLOC sockets
>  + *     @tx_resserve_pages: number of pages to (un)reserve for TX
>  + *
>  + *     This function adjusts the memalloc reserve based on system demand.
>  + *     The RX reserve is a limit, and only added once, not for each socket.
>  + *
>  + *     NOTE:
>  + *        @tx_reserve_pages is an upper-bound of memory used for TX hence
>  + *        we need not account the pages like we do for RX pages.
>  + */
>  +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
>  +{
>  +       int nr_socks;
>  +       int err;
>  +
>  +       err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
>  +       if (err)
>  +               return err;
>  +
>  +       nr_socks = atomic_read(&memalloc_socks);
>  +       if (!nr_socks && socks > 0)
>  +               err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
>  +       nr_socks = atomic_add_return(socks, &memalloc_socks);
>  +       if (!nr_socks && socks)
>  +               err = mem_reserve_disconnect(&net_reserve);
>  +
>  +       if (err)
>  +               mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
>  +
>  +       return err;
>  +}

EXPORT_SYMBOL_GPL(sk_adjust_memalloc); is needed here to build sunrpc
as a module.

Mike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-23  8:06 ` [PATCH 00/28] Swap over NFS -v16 Andrew Morton
@ 2008-02-26  6:03   ` Neil Brown
  2008-02-26 10:50     ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-02-26  6:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Saturday February 23, akpm@linux-foundation.org wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Another posting of the full swap over NFS series. 
> 
> Well I looked.  There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

> 
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages.  Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmem_cache_alloc only return memory from those pages to MEMALLOC
users. 
MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.
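
The pattern, as in the xprtsock connect workers in patch 27, is roughly:

  	unsigned long pflags = current->flags;

  	if (xprt->swapper)	/* transport serves a SOCK_MEMALLOC socket */
  		current->flags |= PF_MEMALLOC;

  	/* ... allocate / connect the socket ... */

  	tsk_restore_flags(current, pflags, PF_MEMALLOC);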

The upshot is that providing any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, its memory allocations
should never fail.

As memory is requested in small units, but allocated as pages, there
needs to be a conversion from small units to pages.  One of the
patches does this and appears to err on the side of being over-generous,
which is the right thing to do.


Memory reservations are organised in a tree.  I really don't
understand the tree.  Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack.  This particularly includes the route cache and
skbs for incoming messages.  I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap file.  They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape.  I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

   Please avoid irrelevant reformatting in patches.  It makes them
   harder to read.  e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
-				void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

   This introduces
         kestimate
         kestimate_single
         kmem_estimate_pages

   The last obviously returns a number of pages.  The contrast seems
   to suggest the others don't.   But they do...
   I don't think the names are very good, but I concede that it is
   hard to choose good names here.  Maybe:
          kmalloc_estimate_variable
          kmalloc_estimate_fixed
          kmem_alloc_estimate
   ???

mm-reserve.patch

   I'm confused by __mem_reserve_add.

+	reserve = mem_reserve_root.pages;
+	__calc_reserve(res, pages, 0);
+	reserve = mem_reserve_root.pages - reserve;

   __calc_reserve will always add 'pages' to mem_reserve_root.pages.
   So this is a complex way of doing
        reserve = pages;
        __calc_reserve(res, pages, 0);

    And as you can calculate reserve before calling __calc_reserve
    (which seems odd when stated that way), the whole function looks
    like it could become:

           ret = adjust_memalloc_reserve(pages);
	   if (!ret)
		__calc_reserve(res, pages, limit);
	   return ret;

    What am I missing?

    Also, mem_reserve_disconnect really should be a "void" function.
    Just put a BUG_ON(ret) and don't return anything.

    Finally, I'll just repeat that the purpose of the tree structure
    eludes me.

net-sk_allocation.patch

    Why are the "GFP_KERNEL" call sites just changed to
    "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

    I assume there is a good reason, and seeing it in the change log
    would educate me and make the patch more obviously correct.

netvm-reserve.patch

    Function names again:

         sk_adjust_memalloc
         sk_set_memalloc

    sound similar.  Purpose is completely different.

mm-page_file_methods.patch

    This makes page_offset and others more expensive by adding a
    conditional jump to a function call that is not usually made.

    Why do swap pages have a different index to everyone else?

nfs-swap_ops.patch

    What happens if you have two swap files on the same NFS
    filesystem?
    I assume ->swapfile gets called twice.  But it hasn't been written
    to nest, so the first swapoff will disable swapping for both
    files??

That's all for now :-)

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26  6:03   ` Neil Brown
@ 2008-02-26 10:50     ` Peter Zijlstra
  2008-02-26 12:00       ` Peter Zijlstra
                         ` (3 more replies)
  0 siblings, 4 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-26 10:50 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

Hi Neil,


On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> On Saturday February 23, akpm@linux-foundation.org wrote:
 
> > What is the NFS and net people's take on all of this?
> 
> Well I'm only vaguely an NFS person, barely a net person, sporadically
> an mm person, but I've had a look and it seems to mostly make sense.

Thanks for taking a look, and giving such elaborate feedback. I'll try
and address these issues asap, but first let me reply to a few points
here.

> We introduce a new "emergency" concept for page allocation.
> The size of the emergency pool is set by various reservations by
> different potential users.
> If the number of free pages is below the "emergency" size, then only
> users with a "MEMALLOC" flag get to allocate pages.  Further, those
> pages get a "reserve" flag set which propagates into slab/slub so
> kmalloc/kmemalloc only return memory from those pages to MEMALLOC
> users. 
> MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
> SOCK_MEMALLOC set which will cause certain pieces of code to
> temporarily set PF_MEMALLOC while working on that socket.

Small detail: there is also __GFP_MEMALLOC; it is used for single
allocations to avoid setting and unsetting PF_MEMALLOC - like in the skb
alloc, once we have determined we would otherwise fail and still have room.

> The upshot is that providing any MEMALLOC user reserves an appropriate
> amount of emergency space, returns the emergency memory promptly, and
> sets PF_MEMALLOC whenever allocating memory, it's memory allocations
> should never fail.
> 
> As memory is requested is small units, but allocated as pages, there
> needs to be a conversion from small-units to pages.  One of the
> patches does this and appears to err on the side of be over-generous,
> which is the right thing to do.
> 
> 
> Memory reservations are organised in a tree.  I really don't
> understand the tree.  Is it just to make /proc/reserve_info look more
> helpful?
> Certainly all the individual reservations need to be recorded, and the
> cumulative reservation needs also to be recorded (currently in the
> root of the tree) but what are all the other levels used for?

Ah, there is a little trick there, I hint at that in the reserve.c
description comment:

+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).

And I actually use that, if we show a little of the tree (which Andrew
rightly dislikes for not being machine parseable - will fix):

+ * localhost ~ # cat /proc/reserve_info
+ * total reserve                  8156K (0/544817)
+ *   total network reserve          8156K (0/544817)
+ *     network TX reserve             196K (0/49)
+ *       protocol TX pages              196K (0/49)
+ *     network RX reserve             7960K (0/544768)
+ *       IPv6 route cache               1372K (0/4096)
+ *       IPv4 route cache               5468K (0/16384)
+ *       SKB data reserve               1120K (0/524288)
+ *         IPv6 fragment cache            560K (0/262144)
+ *         IPv4 fragment cache            560K (0/262144)

We see that the 'SKB data reserve' is built up of the IPv4 and IPv6
fragment cache reserves.

I use the 'SKB data reserve' to charge memory against and account usage,
but use its children to grow/shrink the actual reserve.

This allows you to see the individual reserves, but still use an
aggregate.
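
In rough pseudo-code (the reserve names and the charge call below are only
illustrative, the real interfaces are in the mm-reserve and netvm patches):

  	/* the leaves size the reserve ... */
  	mem_reserve_pages_add(&ipv4_frag_cache_reserve, ipv4_frag_pages);
  	mem_reserve_pages_add(&ipv6_frag_cache_reserve, ipv6_frag_pages);

  	/* ... but usage is charged against the parent aggregate */
  	if (mem_reserve_charge(&skb_data_reserve, size))
  		goto drop;	/* over the reserve: drop the packet */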

The tree form is the simplest structure that allowed such things;
another nice thing is that you can easily detach whole sub-trees to stop
actually reserving the memory, but continue tracking their potential
needs.

This is done when there are no SOCK_MEMALLOC sockets around. The 'total
network reserve' is detached, reducing the 'total reserve' to 0
(assuming no other reserve trees) but the individual reserves are still
tracking their potential need for when it will be re-attached.

With only a single user this might seem a little too much, but I have
hopes for more users.

> Reservations are used for all the transient memory that might be used
> by the network stack.  This particularly includes the route cache and
> skbs for incoming messages.  I have no idea if there is anything else
> that needs to be allowed for.

This is something I'd like feedback on from the network gurus. In my
reading there weren't many other allocation sites, but hey, I'm not much
of a net person myself. (I did write some instrumentation to track
allocations, but I'm sure I didn't get full coverage of the stack with
my simple usage).

> Filesystems can advertise (via address_space_operations) that files
> may be used as swap file.  They then provide swapout/swapin methods
> which are like writepage/readpage but may behave differently and have
> a different way to get credentials from a 'struct file'.

Yes, the added benefit is that even regular blockdev filesystem swap
files could move to this interface and we'd finally be able to remove
->bmap().

> So in general, the patch set looks to have the right sort of shape.  I
> cannot be very authoritative on the details as there are a lot of
> them, and they touch code that I'm not very familiar with.
> 
> Some specific comments on patches:
> 
> 
> reserve-slub.patch
> 
>    Please avoid irrelevant reformatting in patches.  It makes them
>    harder to read.  e.g.:
> 
> -static void setup_object(struct kmem_cache *s, struct page *page,
> -				void *object)
> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)

Right, I'll split out the cleanups and send those in separately.

> mm-kmem_estimate_pages.patch
> 
>    This introduces
>          kestimate
>          kestimate_single
>          kmem_estimate_pages
> 
>    The last obviously returns a number of pages.  The contrast seems
>    to suggest the others don't.   But they do...
>    I don't think the names are very good, but I concede that it is
>    hard to choose good names here.  Maybe:
>           kmalloc_estimate_variable
>           kmalloc_estimate_fixed
>           kmem_alloc_estimate
>    ???

You caught me here (and further on), I'm one of those who needs a little
help when it comes to names :-). I'll try and improve the ones you
pointed out.

> mm-reserve.patch
> 
>    I'm confused by __mem_reserve_add.
> 
> +	reserve = mem_reserve_root.pages;
> +	__calc_reserve(res, pages, 0);
> +	reserve = mem_reserve_root.pages - reserve;
> 
>    __calc_reserve will always add 'pages' to mem_reserve_root.pages.
>    So this is a complex way of doing
>         reserve = pages;
>         __calc_reserve(res, pages, 0);
> 
>     And as you can calculate reserve before calling __calc_reserve
>     (which seems odd when stated that way), the whole function looks
>     like it could become:
> 
>            ret = adjust_memalloc_reserve(pages);
> 	   if (!ret)
> 		__calc_reserve(res, pages, limit);
> 	   return ret;
> 
>     What am I missing?

Probably the horrible twist my brain has. Looking at it makes me doubt
my own sanity. I think you're right - it would also clean up
__calc_reserve() a little.

This is what review for :-)

>     Also, mem_reserve_disconnect really should be a "void" function.
>     Just put a BUG_ON(ret) and don't return anything.

Agreed, I was being over cautious here. I'll WARN_ON, as Andrew is
scared of BUGs :-)

>     Finally, I'll just repeat that the purpose of the tree structure
>     eludes me.

I hope to have cleared that up. It came from the desire of my users
(there are quite a few out there who use some form of this code) to see
what the reserve is made up of - to see what tunables to use, and their
effect.

The tree form was the easiest that allowed me to keep individual
reserves and work with aggregates. (Must confess, some nodes are purely
decoration).

> net-sk_allocation.patch
> 
>     Why are the "GFP_KERNEL" call sites just changed to
>     "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??
> 
>     I assume there is a good reason, and seeing it in the change log
>     would educate me and make the patch more obviously correct.

Good point. I think because of legacy (with this patch-set) reasons. It's
not needed for my use of sk_allocation(), but that gives me no right to
deny other people more creative uses of this construct. I'll rectify
this.

> netvm-reserve.patch
> 
>     Function names again:
> 
>          sk_adjust_memalloc
>          sk_set_memalloc
> 
>     sound similar.  Purpose is completely different.

The naming thing... I'll try and come up with better ones.

> mm-page_file_methods.patch
> 
>     This makes page_offset and others more expensive by adding a
>     conditional jump to a function call that is not usually made.
> 
>     Why do swap pages have a different index to everyone else?

Because the page->index of an anonymous page is related to its (anon)vma
so that it satisfies the constraints for vm_normal_page().

The index in the swap file is totally unrelated and quite random. Hence
the swap-cache uses page->private to store it.

Moving these functions inline (esp __page_file_index seems doable)
results in a horrible include hell.
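
For reference, the helpers in question have roughly this shape (a sketch
only; the swp_offset()/page_private() details follow the usual swap-cache
conventions, the exact bodies in the patch may differ):

pgoff_t __page_file_index(struct page *page)
{
	swp_entry_t entry = { .val = page_private(page) };

	/* swap-cache pages keep their swap offset in page->private */
	VM_BUG_ON(!PageSwapCache(page));
	return swp_offset(entry);
}

static inline pgoff_t page_file_index(struct page *page)
{
	if (unlikely(PageSwapCache(page)))
		return __page_file_index(page);

	return page->index;
}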

> nfs-swap_ops.patch
> 
>     What happens if you have two swap files on the same NFS
>     filesystem?
>     I assume ->swapfile gets called twice.  But it hasn't been written
>     to nest, so the first swapoff will disable swapping for both
>     files??

Hmm... you are quite right (again). I failed to consider this. Not sure
how to rectify this as xprt->swapper is but a single bit.

I'll think about this.

Thanks!


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 10:50     ` Peter Zijlstra
@ 2008-02-26 12:00       ` Peter Zijlstra
  2008-02-26 15:29       ` Miklos Szeredi
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-26 12:00 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Tue, 2008-02-26 at 11:50 +0100, Peter Zijlstra wrote:

> > mm-reserve.patch
> > 
> >    I'm confused by __mem_reserve_add.
> > 
> > +	reserve = mem_reserve_root.pages;
> > +	__calc_reserve(res, pages, 0);
> > +	reserve = mem_reserve_root.pages - reserve;
> > 
> >    __calc_reserve will always add 'pages' to mem_reserve_root.pages.
> >    So this is a complex way of doing
> >         reserve = pages;
> >         __calc_reserve(res, pages, 0);
> > 
> >     And as you can calculate reserve before calling __calc_reserve
> >     (which seems odd when stated that way), the whole function looks
> >     like it could become:
> > 
> >            ret = adjust_memalloc_reserve(pages);
> > 	   if (!ret)
> > 		__calc_reserve(res, pages, limit);
> > 	   return ret;
> > 
> >     What am I missing?
> 
> Probably the horrible twist my brain has. Looking at it makes me doubt
> my own sanity. I think you're right - it would also clean up
> __calc_reserve() a little.
> 
> This is what review is for :-)

Ah, you confused me. Well, I confused me - this does deserve a comment,
it's tricksy.

It's correct. The trick is, the mem_reserve in question (res) need not be
connected to mem_reserve_root.

In that case, mem_reserve_root.pages will not change, but we do
propagate the change as far up as possible, so that
mem_reserve_connect() can just observe the parent and child without
being bothered by the rest of the hierarchy.
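
To spell the trick out, the function is roughly (reconstructed from the
fragment quoted above and this explanation; the limit handling is elided
in this sketch):

static int __mem_reserve_add(struct mem_reserve *res, long pages, long limit)
{
	long reserve;
	int ret = 0;

	/*
	 * __calc_reserve() adds 'pages' to res and to every ancestor res is
	 * currently connected to.  If res hangs off mem_reserve_root, the
	 * root total moves by 'pages'; if res is not (yet) connected, it
	 * does not move at all.
	 */
	reserve = mem_reserve_root.pages;
	__calc_reserve(res, pages, 0);
	reserve = mem_reserve_root.pages - reserve;

	/* only the part that reached the root needs actual watermark space */
	if (reserve)
		ret = adjust_memalloc_reserve(reserve);

	return ret;
}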




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 22/28] mm: add support for non block device backed swap files
  2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
  2008-02-20 16:30   ` Randy Dunlap
@ 2008-02-26 12:45   ` Miklos Szeredi
  2008-02-26 12:58     ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Miklos Szeredi @ 2008-02-26 12:45 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: torvalds, akpm, linux-kernel, linux-mm, netdev, trond.myklebust,
	a.p.zijlstra, linux-fsdevel

Starting review in the middle, because this is the part I'm most
familiar with.

> New address_space_operations methods are added:
>   int swapfile(struct address_space *, int);

Separate ->swapon() and ->swapoff() methods would be so much cleaner IMO.

Also is there a reason why 'struct file *' cannot be supplied to these
functions?

[snip]

> +int swap_set_page_dirty(struct page *page)
> +{
> +	struct swap_info_struct *sis = page_swap_info(page);
> +
> +	if (sis->flags & SWP_FILE) {
> +		const struct address_space_operations *a_ops =
> +			sis->swap_file->f_mapping->a_ops;
> +		int (*spd)(struct page *) = a_ops->set_page_dirty;
> +#ifdef CONFIG_BLOCK
> +		if (!spd)
> +			spd = __set_page_dirty_buffers;
> +#endif

This ifdef is not really needed.  Just require ->set_page_dirty() be
filled in by filesystems which want swapfiles (and others too, in the
longer term, the fallback is just historical crud).

Here's an incremental patch addressing these issues and beautifying
the new code.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c	2008-02-26 11:15:58.000000000 +0100
+++ linux/mm/page_io.c	2008-02-26 13:40:55.000000000 +0100
@@ -106,8 +106,10 @@ int swap_writepage(struct page *page, st
 	}
 
 	if (sis->flags & SWP_FILE) {
-		ret = sis->swap_file->f_mapping->
-			a_ops->swap_out(sis->swap_file, page, wbc);
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_out(swap_file, page, wbc);
 		if (!ret)
 			count_vm_event(PSWPOUT);
 		return ret;
@@ -136,12 +138,13 @@ void swap_sync_page(struct page *page)
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	if (sis->flags & SWP_FILE) {
-		const struct address_space_operations *a_ops =
-			sis->swap_file->f_mapping->a_ops;
-		if (a_ops->sync_page)
-			a_ops->sync_page(page);
-	} else
+		struct address_space *mapping = sis->swap_file->f_mapping;
+
+		if (mapping->a_ops->sync_page)
+			mapping->a_ops->sync_page(page);
+	} else {
 		block_sync_page(page);
+	}
 }
 
 int swap_set_page_dirty(struct page *page)
@@ -149,17 +152,12 @@ int swap_set_page_dirty(struct page *pag
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	if (sis->flags & SWP_FILE) {
-		const struct address_space_operations *a_ops =
-			sis->swap_file->f_mapping->a_ops;
-		int (*spd)(struct page *) = a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-		if (!spd)
-			spd = __set_page_dirty_buffers;
-#endif
-		return (*spd)(page);
-	}
+		struct address_space *mapping = sis->swap_file->f_mapping;
 
-	return __set_page_dirty_nobuffers(page);
+		return mapping->a_ops->set_page_dirty(page);
+	} else {
+		return __set_page_dirty_nobuffers(page);
+	}
 }
 
 int swap_readpage(struct file *file, struct page *page)
@@ -172,8 +170,10 @@ int swap_readpage(struct file *file, str
 	BUG_ON(PageUptodate(page));
 
 	if (sis->flags & SWP_FILE) {
-		ret = sis->swap_file->f_mapping->
-			a_ops->swap_in(sis->swap_file, page);
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_in(swap_file, page);
 		if (!ret)
 			count_vm_event(PSWPIN);
 		return ret;
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2008-02-26 11:15:58.000000000 +0100
+++ linux/include/linux/fs.h	2008-02-26 13:29:40.000000000 +0100
@@ -485,7 +485,8 @@ struct address_space_operations {
 	/*
 	 * swapfile support
 	 */
-	int (*swapfile)(struct address_space *, int);
+	int (*swapon)(struct file *file);
+	int (*swapoff)(struct file *file);
 	int (*swap_out)(struct file *file, struct page *page,
 			struct writeback_control *wbc);
 	int (*swap_in)(struct file *file, struct page *page);
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c	2008-02-26 12:43:57.000000000 +0100
+++ linux/mm/swapfile.c	2008-02-26 13:34:57.000000000 +0100
@@ -1014,9 +1014,11 @@ static void destroy_swap_extents(struct 
 	}
 
 	if (sis->flags & SWP_FILE) {
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
 		sis->flags &= ~SWP_FILE;
-		sis->swap_file->f_mapping->a_ops->
-			swapfile(sis->swap_file->f_mapping, 0);
+		mapping->a_ops->swapoff(swap_file);
 	}
 }
 
@@ -1092,7 +1094,9 @@ add_swap_extent(struct swap_info_struct 
  */
 static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 {
-	struct inode *inode;
+	struct file *swap_file = sis->swap_file;
+	struct address_space *mapping = swap_file->f_mapping;
+	struct inode *inode = mapping->host;
 	unsigned blocks_per_page;
 	unsigned long page_no;
 	unsigned blkbits;
@@ -1103,16 +1107,14 @@ static int setup_swap_extents(struct swa
 	int nr_extents = 0;
 	int ret;
 
-	inode = sis->swap_file->f_mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		ret = add_swap_extent(sis, 0, sis->max, 0);
 		*span = sis->pages;
 		goto done;
 	}
 
-	if (sis->swap_file->f_mapping->a_ops->swapfile) {
-		ret = sis->swap_file->f_mapping->a_ops->
-			swapfile(sis->swap_file->f_mapping, 1);
+	if (mapping->a_ops->swapon) {
+		ret = mapping->a_ops->swapon(swap_file);
 		if (!ret) {
 			sis->flags |= SWP_FILE;
 			ret = add_swap_extent(sis, 0, sis->max, 0);




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 22/28] mm: add support for non block device backed swap files
  2008-02-26 12:45   ` Miklos Szeredi
@ 2008-02-26 12:58     ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-26 12:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: torvalds, akpm, linux-kernel, linux-mm, netdev, trond.myklebust,
	linux-fsdevel


On Tue, 2008-02-26 at 13:45 +0100, Miklos Szeredi wrote:
> Starting review in the middle, because this is the part I'm most
> familiar with.
> 
> > New address_space_operations methods are added:
> >   int swapfile(struct address_space *, int);
> 
> Separate ->swapon() and ->swapoff() methods would be so much cleaner IMO.

I'm ok with that, but it's a_ops bloat; do we care about that? I guess
since it has limited instances - typically one per filesystem - there is
no issue here.

> Also is there a reason why 'struct file *' cannot be supplied to these
> functions?

No real reason here. I guess it's cleaner indeed. Thanks.

> > +int swap_set_page_dirty(struct page *page)
> > +{
> > +	struct swap_info_struct *sis = page_swap_info(page);
> > +
> > +	if (sis->flags & SWP_FILE) {
> > +		const struct address_space_operations *a_ops =
> > +			sis->swap_file->f_mapping->a_ops;
> > +		int (*spd)(struct page *) = a_ops->set_page_dirty;
> > +#ifdef CONFIG_BLOCK
> > +		if (!spd)
> > +			spd = __set_page_dirty_buffers;
> > +#endif
> 
> This ifdef is not really needed.  Just require ->set_page_dirty() be
> filled in by filesystems which want swapfiles (and others too, in the
> longer term, the fallback is just historical crud).

Agreed. This is a good motivation to clean up that stuff.

> Here's an incremental patch addressing these issues and beautifying
> the new code.

Thanks, I'll fold it into the patch and update the documentation. I'll
put your creds in akpm style.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 10:50     ` Peter Zijlstra
  2008-02-26 12:00       ` Peter Zijlstra
@ 2008-02-26 15:29       ` Miklos Szeredi
  2008-02-26 15:41         ` Peter Zijlstra
  2008-02-26 15:43         ` Peter Zijlstra
  2008-02-26 17:56       ` Andrew Morton
  2008-02-27  5:51       ` Neil Brown
  3 siblings, 2 replies; 73+ messages in thread
From: Miklos Szeredi @ 2008-02-26 15:29 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: neilb, akpm, torvalds, linux-kernel, linux-mm, netdev, trond.myklebust

> > mm-page_file_methods.patch
> > 
> >     This makes page_offset and others more expensive by adding a
> >     conditional jump to a function call that is not usually made.
> > 
> >     Why do swap pages have a different index to everyone else?
> 
> Because the page->index of an anonymous page is related to its (anon)vma
> so that it satisfies the constraints for vm_normal_page().
> 
> The index in the swap file is totally unrelated and quite random. Hence
> the swap-cache uses page->private to store it.

Yeah, and putting the condition into page_offset() will confuse code
which uses it for finding the offset in the VMA or in a tmpfs file.

So why not just have a separate page_swap_offset() function, used
exclusively by swap_in/out()?

Miklos

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 15:29       ` Miklos Szeredi
@ 2008-02-26 15:41         ` Peter Zijlstra
  2008-02-26 15:43         ` Peter Zijlstra
  1 sibling, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-26 15:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: neilb, akpm, torvalds, linux-kernel, linux-mm, netdev, trond.myklebust


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > > 
> > >     This makes page_offset and others more expensive by adding a
> > >     conditional jump to a function call that is not usually made.
> > > 
> > >     Why do swap pages have a different index to everyone else?
> > 
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> > 
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
> 
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA 

Right, do we do that anywhere?

> or in a tmpfs file.

Good point. I really should go read tmpfs some day, it's really a blind
spot for me.

> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

That would require duplicating quite a lot of NFS code from what I can
see.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 15:29       ` Miklos Szeredi
  2008-02-26 15:41         ` Peter Zijlstra
@ 2008-02-26 15:43         ` Peter Zijlstra
  2008-02-26 15:47           ` Miklos Szeredi
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-26 15:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: neilb, akpm, torvalds, linux-kernel, linux-mm, netdev, trond.myklebust


On Tue, 2008-02-26 at 16:29 +0100, Miklos Szeredi wrote:
> > > mm-page_file_methods.patch
> > > 
> > >     This makes page_offset and others more expensive by adding a
> > >     conditional jump to a function call that is not usually made.
> > > 
> > >     Why do swap pages have a different index to everyone else?
> > 
> > Because the page->index of an anonymous page is related to its (anon)vma
> > so that it satisfies the constraints for vm_normal_page().
> > 
> > The index in the swap file is totally unrelated and quite random. Hence
> > the swap-cache uses page->private to store it.
> 
> Yeah, and putting the condition into page_offset() will confuse code
> which uses it for finding the offset in the VMA or in a tmpfs file.
> 
> So why not just have a separate page_swap_offset() function, used
> exclusively by swap_in/out()?

Ah, we can add page_file_offset() to match page_file_index() and
page_file_mapping(), and convert NFS to use page_file_offset() where
appropriate, as I already did for those others.

That would sort out the mess, right?
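
Something along these lines, following the page_file_index() convention
(a sketch only):

static inline loff_t page_file_offset(struct page *page)
{
	/* like page_offset(), but uses the swap file index for swap-cache pages */
	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
}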


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 15:43         ` Peter Zijlstra
@ 2008-02-26 15:47           ` Miklos Szeredi
  0 siblings, 0 replies; 73+ messages in thread
From: Miklos Szeredi @ 2008-02-26 15:47 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: miklos, neilb, akpm, torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

> > > > mm-page_file_methods.patch
> > > > 
> > > >     This makes page_offset and others more expensive by adding a
> > > >     conditional jump to a function call that is not usually made.
> > > > 
> > > >     Why do swap pages have a different index to everyone else?
> > > 
> > > Because the page->index of an anonymous page is related to its (anon)vma
> > > so that it satisfies the constraints for vm_normal_page().
> > > 
> > > The index in the swap file is totally unrelated and quite random. Hence
> > > the swap-cache uses page->private to store it.
> > 
> > Yeah, and putting the condition into page_offset() will confuse code
> > which uses it for finding the offset in the VMA or in a tmpfs file.
> > 
> > So why not just have a separate page_swap_offset() function, used
> > exclusively by swap_in/out()?
> 
> Ah, we can do the page_file_offset() to match page_file_index() and
> page_file_mapping(). And convert NFS to use page_file_offset() where
> appropriate, as I already did for these others.
> 
> That would sort out the mess, right?

Yes, that sounds perfect.

Miklos

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 10:50     ` Peter Zijlstra
  2008-02-26 12:00       ` Peter Zijlstra
  2008-02-26 15:29       ` Miklos Szeredi
@ 2008-02-26 17:56       ` Andrew Morton
  2008-02-27  5:51       ` Neil Brown
  3 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2008-02-26 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Neil Brown, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Tue, 26 Feb 2008 11:50:42 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> > On Saturday February 23, akpm@linux-foundation.org wrote:
>  
> > > What is the NFS and net people's take on all of this?
> > 
> > Well I'm only vaguely an NFS person, barely a net person, sporadically
> > an mm person, but I've had a look and it seems to mostly make sense.
> 
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Neil's overview of what-all-this-is and how-it-all-works is really good. 
I'd suggest that you take it over, flesh it out and attach it firmly to the
patchset.  It really helps.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-26 10:50     ` Peter Zijlstra
                         ` (2 preceding siblings ...)
  2008-02-26 17:56       ` Andrew Morton
@ 2008-02-27  5:51       ` Neil Brown
  2008-02-27  7:58         ` Peter Zijlstra
  3 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-02-27  5:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


Hi Peter,

On Tuesday February 26, a.p.zijlstra@chello.nl wrote:
> Hi Neil,
> 
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Thanks... the tree thing is starting to make sense, and I'm not
confused by __mem_reserve_add any more :-)

I've been having a closer read of some of the code that I skimmed over
before and I have some more questions.

1/ I note there is no way to tell if memory returned by kmalloc is
  from the emergency reserve - which contrasts with alloc_page
  which does make that information available through page->reserve.
  This seems a slightly unfortunate aspect of the interface.

  It seems to me that __alloc_skb could be simpler if this
  information was available.  It currently tries the allocation
  normally, then if that fails it retries with __GFP_MEMALLOC set and
  if that works it assumes it was from the emergency pool ... which it
  might not be, though the assumption is safe enough.

  It would seem to be nicer if you could just make the one alloc call,
  setting GFP_MEMALLOC if that might be appropriate.  Then if the
  memory came from the emergency reserve, flag it as an emergency skb.

  However doing that would have issues with reservations.... the
  mem_reserve would need to be passed to kmalloc :-(

2/ It doesn't appear to be possible to wait on a reservation. i.e. if
   mem_reserve_*_charge fails, we might sometimes want to wait until
   it will succeed.  This feature is an integral part of how mempools
   work and are used.  If you want reservations to (be able to)
   replace mempools, then waiting really is needed.

   It seems that waiting would work best if it was done quite low down
   inside kmalloc.  That would require kmalloc to know which
   'mem_reserve' you are using which it currently doesn't.

   If it did, then it could choose to use emergency pages if the
   mem_reserve allowed it more space, otherwise require a regular page.
   And if __GFP_WAIT is set then each time around the loop it could
   revise whether to use an emergency page or not, based on whether it
   can successfully charge the reservation.

   Of course, having a mem_reserve available for PF_MEMALLOC
   allocations would require changing every kmalloc call in the
   protected region, which might not be ideal, but might not be a
   major hassle, and would ensure that you find all kmalloc calls that
   might be made while in PF_MEMALLOC state.

3/ Thinking about the tree structure a bit more:  Your motivation
   seems to be that it allows you to make two separate reservations,
   and then charge some memory usage against either one or the other.
   This seems to go against a key attribute of reservations.  I would
   have thought that an important rule about reservations is that
   no-one else can use memory reserved for a particular purpose.
   So if IPv6 reserves some memory, and the IPv4 uses it, that doesn't
   seem like something we want to encourage...


4/ __netdev_alloc_page is bothering me a bit.
   This is used to allocate memory for incoming fragments in a
   (potentially multi-fragment) packet.  But we only rx_emergency_get
   for each page as it arrives rather than all at once at the start.
   So you could have a situation where two very large packets are
   arriving at the same time and there is enough reserve to hold
   either of them but not both.  The current code will hand out that
   reservation a little (well, one page) at a time to each packet and
   will run out before either packet has been fully received.  This
   seems like a bad thing.  Is it?

   Is it possible to do the rx_emergency_get just once for the whole
   packet, I wonder?

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  5:51       ` Neil Brown
@ 2008-02-27  7:58         ` Peter Zijlstra
  2008-02-27  8:05           ` Pekka Enberg
                             ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-27  7:58 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Wed, 2008-02-27 at 16:51 +1100, Neil Brown wrote:
> Hi Peter,
> 
> On Tuesday February 26, a.p.zijlstra@chello.nl wrote:
> > Hi Neil,
> > 
> > Thanks for taking a look, and giving such elaborate feedback. I'll try
> > and address these issues asap, but first let me reply to a few points
> > here.
> 
> Thanks... the tree thing is starting to make sense, and I'm not
> confused by __mem_reserve_add any more :-)
> 
> I've been having a closer read of some of the code that I skimmed over
> before and I have some more questions.
> 
> 1/ I note there is no way to tell if memory returned by kmalloc is
>   from the emergency reserve - which contrasts with alloc_page
>   which does make that information available through page->reserve.
>   This seems a slightly unfortunate aspect of the interface.

Yes, but alas there is no room to store such information in kmalloc().
That is, in a sane way. I think it was Daniel Phillips who suggested
encoding it in the return pointer by flipping the low bit - but that is
just too ugly and breaks all current kmalloc sites to boot.

>   It seems to me that __alloc_skb could be simpler if this
>   information was available.  It currently tries the allocation
>   normally, then if that fails it retries with __GFP_MEMALLOC set and
>   if that works it assumes it was from the emergency pool ... which it
>   might not be, though the assumption is safe enough.
> 
>   It would seem to be nicer if you could just make the one alloc call,
>   setting GFP_MEMALLOC if that might be appropriate.  Then if the
>   memory came from the emergency reserve, flag it as an emergency skb.
> 
>   However doing that would have issues with reservations.... the
>   mem_reserve would need to be passed to kmalloc :-(

Yes, it would require a massive overhaul of quite a few things. I agree,
it would all be nicer, but I think you see why I didn't do it.

> 2/ It doesn't appear to be possible to wait on a reservation. i.e. if
>    mem_reserve_*_charge fails, we might sometimes want to wait until
>    it will succeed.  This feature is an integral part of how mempools
>    work and are used.  If you want reservations to (be able to)
>    replace mempools, then waiting really is needed.
> 
>    It seems that waiting would work best if it was done quite low down
>    inside kmalloc.  That would require kmalloc to know which
>    'mem_reserve' you are using which it currently doesn't.
> 
>    If it did, then it could choose to use emergency pages if the
>    mem_reserve allowed it more space, otherwise require a regular page.
>    And if __GFP_WAIT is set then each time around the loop it could
>    revise whether to use an emergency page or not, based on whether it
>    can successfully charge the reservation.

Like mempools, we could add a wrapper with a mem_reserve and waitqueue
inside, strip __GFP_WAIT, try, see if the reservation allows, and wait
if not.

I haven't yet done such a wrapper because it wasn't needed. But it could
be done.
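
Roughly like this, purely as a sketch (the reserve_pool structure is
invented here, and the charge calls follow the mem_reserve_*_charge()
interface mentioned above):

struct reserve_pool {
	struct mem_reserve	*res;
	wait_queue_head_t	wait;
};

static void *reserve_kmalloc(struct reserve_pool *pool, size_t size, gfp_t gfp)
{
	void *obj = kmalloc(size, gfp & ~__GFP_WAIT);

	while (!obj) {
		if (!(gfp & __GFP_WAIT))
			break;
		/* mempool style: sleep until the reserve can cover this object */
		wait_event(pool->wait,
			   mem_reserve_kmalloc_charge(pool->res, size, 0));
		obj = kmalloc(size, (gfp & ~__GFP_WAIT) | __GFP_MEMALLOC);
		if (!obj)
			/* give the charge back and go around again */
			mem_reserve_kmalloc_charge(pool->res, -size, 0);
	}
	return obj;
}

/*
 * The matching free path would uncharge emergency objects and do
 * wake_up(&pool->wait); knowing whether an object was charged is
 * exactly the kmem_is_emergency() question discussed below.
 */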

>    Of course, having a mem_reserve available for PF_MEMALLOC
>    allocations would require changing every kmalloc call in the
>    protected region, which might not be ideal, but might not be a
>    major hassle, and would ensure that you find all kmalloc calls that
>    might be made while in PF_MEMALLOC state.

I assumed the current PF_MEMALLOC usage was good enough for the current
reserves - not quite true as it's potentially unlimited, but it seems to
work in practice.

I did try to find all allocation sites in the paths I enabled
PF_MEMALLOC over.

> 3/ Thinking about the tree structure a bit more:  Your motivation
>    seems to be that it allows you to make two separate reservations,
>    and then charge some memory usage against either one or the other.
>    This seems to go against a key attribute of reservations.  I would
>    have thought that an important rule about reservations is that
>    no-one else can use memory reserved for a particular purpose.
>    So if IPv6 reserves some memory, and the IPv4 uses it, that doesn't
>    seem like something we want to encourage...

Well, we only have one kind of skb; a network packet doesn't know if it
belongs to IPv4 or IPv6 (or yet a whole different address family) when
it comes in. So we grow the skb pool to overflow both defragment caches.

But yeah, it's something where you need to know what you're doing - as
with so many other things in the kernel, hence I didn't worry too much.

> 4/ __netdev_alloc_page is bothering me a bit.
>    This is used to allocate memory for incoming fragments in a
>    (potentially multi-fragment) packet.  But we only rx_emergency_get
>    for each page as it arrives rather than all at once at the start.
>    So you could have a situation where two very large packets are
>    arriving at the same time and there is enough reserve to hold
>    either of them but not both.  The current code will hand out that
>    reservation a little (well, one page) at a time to each packet and
>    will run out before either packet has been fully received.  This
>    seems like a bad thing.  Is it?
> 
>    Is it possible to do the rx_emergency_get just once for the whole
>    packet, I wonder?

I honestly don't know enough about network cards and drivers to answer
this. It was a great feat I managed this much :-)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  7:58         ` Peter Zijlstra
@ 2008-02-27  8:05           ` Pekka Enberg
  2008-02-27  8:14             ` Peter Zijlstra
  2008-02-29 11:51             ` Peter Zijlstra
  2008-02-29  1:29           ` Neil Brown
       [not found]           ` <1837 <1204626509.6241.39.camel@lappy>
  2 siblings, 2 replies; 73+ messages in thread
From: Pekka Enberg @ 2008-02-27  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust

Hi Peter,

On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>  > 1/ I note there is no way to tell if memory returned by kmalloc is
>  >   from the emergency reserve - which contrasts with alloc_page
>  >   which does make that information available through page->reserve.
>  >   This seems a slightly unfortunate aspect of the interface.
>
>  Yes, but alas there is no room to store such information in kmalloc().
>  That is, in a sane way. I think it was Daniel Phillips who suggested
>  encoding it in the return pointer by flipping the low bit - but that is
>  just too ugly and breaks all current kmalloc sites to boot.

Why can't you add a kmem_is_emergency() to SLUB that looks up the
cache/slab/page (whatever is the smallest unit of the emergency pool
here) for the object and use that?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  8:05           ` Pekka Enberg
@ 2008-02-27  8:14             ` Peter Zijlstra
  2008-02-27  8:33               ` Peter Zijlstra
  2008-02-29 11:51             ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-27  8:14 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
> 
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >  > 1/ I note there is no way to tell if memory returned by kmalloc is
> >  >   from the emergency reserve - which contrasts with alloc_page
> >  >   which does make that information available through page->reserve.
> >  >   This seems a slightly unfortunate aspect of the interface.
> >
> >  Yes, but alas there is no room to store such information in kmalloc().
> >  That is, in a sane way. I think it was Daniel Phillips who suggested
> >  encoding it in the return pointer by flipping the low bit - but that is
> >  just too ugly and breaks all current kmalloc sites to boot.
> 
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

There is an idea.. :-) It would mean preserving page->reserve, but SLUB
has plenty of page flags to pick from. Or maybe I should move the thing
to a page flag anyway. If we do that, SLAB would allow something similar,
just look up the page for whatever address you get and look at PG_emerg
or something.

Having this would clean things up. I'll go work on this.
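
With a sticky page flag the test could be as small as this (a sketch;
PG_emergency / PageEmergency() is the hypothetical flag being discussed
here):

int kmem_is_emergency(const void *obj)
{
	/*
	 * Works for both SL[AU]B: find the (head) page backing the slab
	 * this object lives in and test the sticky flag set by the page
	 * allocator when it dipped into the reserves.
	 */
	return PageEmergency(virt_to_head_page(obj));
}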


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  8:14             ` Peter Zijlstra
@ 2008-02-27  8:33               ` Peter Zijlstra
  2008-02-27  8:43                 ` Pekka J Enberg
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-27  8:33 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust


On Wed, 2008-02-27 at 09:14 +0100, Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> > Hi Peter,
> > 
> > On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >  > 1/ I note there is no way to tell if memory returned by kmalloc is
> > >  >   from the emergency reserve - which contrasts with alloc_page
> > >  >   which does make that information available through page->reserve.
> > >  >   This seems a slightly unfortunate aspect of the interface.
> > >
> > >  Yes, but alas there is no room to store such information in kmalloc().
> > >  That is, in a sane way. I think it was Daniel Phillips who suggested
> > >  encoding it in the return pointer by flipping the low bit - but that is
> > >  just too ugly and breaks all current kmalloc sites to boot.
> > 
> > Why can't you add a kmem_is_emergency() to SLUB that looks up the
> > cache/slab/page (whatever is the smallest unit of the emergency pool
> > here) for the object and use that?
> 
> There is an idea.. :-) It would mean preserving page->reserved, but SLUB
> has plenty of page flags to pick from. Or maybe I should move the thing
> to a page flag anyway. If we do that SLAB would allow something similar,
> just look up the page for whatever address you get and look at PG_emerg
> or something.
> 
> Having this would clean things up. I'll go work on this.

Humm, and here I sit staring at the screen. Perhaps I should go get my
morning juice, but...

  if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
    foo = kmalloc(sizeof(*foo), gfp|__GFP_MEMALLOC);
    if (!kmem_is_emergency(foo))
      mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
  } else
    foo = kmalloc(sizeof(*foo), gfp);

Just doesn't look too pretty..

And needing to always account the allocation seems wrong.. but I'll take
poison and see if that wakes up my mind.




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  8:33               ` Peter Zijlstra
@ 2008-02-27  8:43                 ` Pekka J Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka J Enberg @ 2008-02-27  8:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust

On Wed, 27 Feb 2008, Peter Zijlstra wrote:
> Humm, and here I sit staring at the screen. Perhaps I should go get my
> morning juice, but...
> 
>   if (mem_reserve_kmalloc_charge(my_res, sizeof(*foo), 0)) {
>     foo = kmalloc(sizeof(*foo), gfp|__GFP_MEMALLOC);
>     if (!kmem_is_emergency(foo))
>       mem_reserve_kmalloc_charge(my_res, -sizeof(*foo), 0);
>   } else
>     foo = kmalloc(sizeof(*foo), gfp);
> 
> Just doesn't look too pretty..
> 
> And needing to always account the allocation seems wrong.. but I'll take
> poison and see if that wakes up my mind.

Hmm, perhaps this is just hand-waving but why don't you have a 
kmalloc_reserve() function in SLUB that does the accounting properly?

			Pekka

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  7:58         ` Peter Zijlstra
  2008-02-27  8:05           ` Pekka Enberg
@ 2008-02-29  1:29           ` Neil Brown
  2008-02-29 10:21             ` Peter Zijlstra
       [not found]           ` <1837 <1204626509.6241.39.camel@lappy>
  2 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-02-29  1:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


So I've been pondering all this some more trying to find the pattern,
and things are beginning to crystalise (I hope).

One of the approaches I have been taking is to compare it to mempools
(which I think I understand) and work out what the important
differences are.

One difference is that you don't wait for memory to become available
(as I mentioned earlier).  Rather you just try to get the memory and
if it isn't available, you drop the packet.  This probably makes sense
for incoming packets as you rely on the packet being re-sent, and
hopefully various back-off algorithms will slow things down a bit so
that there is a good chance that memory will be available next time...

For outgoing messages I'm less clear on exactly what is going on.
Maybe I haven't looked at that code properly yet, but I would expect
there would be a place for waiting for memory to become available
somewhere in the out-going path ??

But there is another important difference to mempools which I think is
worth exploring.  With mempools, you are certain that the memory will
only be used to make forward progress in writing out dirty data.  So
if you find that there isn't enough memory at the moment and you have
to wait, you can be sure someone else is making forward progress and
so waiting isn't such a bad thing.

With your reservations it isn't quite the same.  Reserved memory can
be used for related purposes.  In particular, any incoming packet can
use some reserved memory.  Once the purpose of that packet is
discovered (i.e. that matching socket is found), the memory will be
freed again.  But there is a period of time when memory is being used
for an inappropriate purpose.  The consequences of this should be
clearly understood.

In particular, the memory that is reserved for the emergency pool
should include some overhead to acknowledge the fact that memory
might be used for short periods of time for unrelated purposes.

I think we can fit this acknowledgement into the current model quite
easily, and it makes the tree structure suddenly make lots of sense
(whereas before I was still struggling with it).

A key observation in this design is "Sometimes we need to allocate
emergency memory without knowing exactly what it is going to be used
for".  I think we should make that explicit in the implementation as
follows:

  We have a tree of reservations (as you already do) where levels in
  the tree correspond to more explicit knowledge of how the memory
  will be used.
  At the top level there is a generic 'page' reservation.  Below that
  to one side we have a 'SLUB/SLAB' reservation.  I'm not sure yet
  exactly what that will look like.
  Also below the 'page' reservation is a reservation for pages to hold
  incoming network fragments.
  Below the SLxB reservation is a reservation for skbs, which is
  parent to a reservation for IPv4 skbs and another for IPv6 skbs.

Each of these nodes has its own independent reservation - parents are
not simply the sum of the children.
The sum over the whole tree is given to the VM as the size of the
emergency pool to reserve for emergency allocations.

Now, every actual allocation from the emergency pool effectively comes
in at the top of the tree and moves down as its purpose is more fully
understood.  Every emergency allocation is *always* charged to one
node in the tree, though which node may change.

e.g.
  A network driver asks for a page to store a fragment.
  netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
  If alloc_page needs to dive into the emergency pool, it first
  charges the one page against the root of the reservation tree.
  If this succeeds, it returns the page with ->reserve set.  If the
  reservation fails, it ignores the GFP_MEMALLOC and fails.
  netdev_alloc_page notices that the page is a ->reserve page, and
  knows that it has been charged to the top 'page' reservation, but it
  should be charged to the network-page reservation.  So it tries to
  charge against the network-pages reservation, and reverses the
  charge against 'pages'.  If the network-pages reservation fails, the
  page is freed and netdev_alloc_page fails.
  As you can see, the charge moves down the tree as more information
  becomes available.

  Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
  'ipv4_skb'.

  At the bottom levels, the reservations say how much memory is
  needed for that particular usage to be able to make sensible forward
  progress.
  At the higher levels, the reservation says how much overhead we need
  to allow to ensure that transient invalid uses don't unduly limit
  available emergency memory.  As pages are likely to be immediately
  re-charged lower down the tree, the reservation at the top level
  would probably be proportional to the number of CPUs (probably one
  page per CPU would be perfect).  Lower down, different calculations
  might suggest different intermediate reservations.
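
As code, the example above might look something like this (a sketch of
the proposed model, not of the posted patches; the charge helpers and
reserve names are invented for illustration):

struct page *netdev_alloc_page(gfp_t gfp)
{
	struct page *page = alloc_page(gfp | __GFP_MEMALLOC);

	if (page && page->reserve) {
		/*
		 * alloc_page() charged the generic 'pages' node; now that we
		 * know the page holds a network fragment, move the charge to
		 * the more specific node, or fail if that node is exhausted.
		 */
		if (mem_reserve_pages_charge(&net_rx_pages, 1)) {
			mem_reserve_pages_charge(&root_pages, -1);
		} else {
			__free_page(page);
			page = NULL;
		}
	}
	return page;
}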

Of course, these things don't need to be explicitly structured as a
tree.  There is no need for 'parent' or 'sibling' pointers.  The code
implicitly knows where to move charges from and to.
You still need an explicit structure to allow groups of reservations
that are activated or de-activated as a whole.  That can use your
current tree structure, or whatever else turns out to make sense.

This model, I think, captures the important "allocate before charging"
aspect of reservations that you need (particularly for incoming
network packets) and it makes that rule apply throughout the different
stages that an allocated chunk of memory goes through.

With this model, alloc_page could fail more often, as it now also
fails if the top level reservation is exhausted.  This may seem
un-necessary, but I think it could be a good thing.  It means that at
very busy times (when lots of requests are needing emergency memory)
we drop requests randomly and very early.  If we are going to drop a
request eventually, dropping it early means we waste less time on it
which is probably a good thing.


So: Does this model help others with understanding how the
reservations work, or am I just over-engineering?

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-29  1:29           ` Neil Brown
@ 2008-02-29 10:21             ` Peter Zijlstra
  2008-03-02 22:18               ` Neil Brown
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-29 10:21 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Fri, 2008-02-29 at 12:29 +1100, Neil Brown wrote:
> So I've been pondering all this some more trying to find the pattern,
> and things are beginning to crystalise (I hope).
> 
> One of the approaches I have been taking is to compare it to mempools
> (which I think I understand) and work out what the important
> differences are.
> 
> One difference is that you don't wait for memory to become available
> (as I mentioned earlier).  Rather you just try to get the memory and
> if it isn't available, you drop the packet.  This probably makes sense
> for incoming packets as you rely on the packet being re-sent, and
> hopefully various back-off algorithms will slow things down a bit so
> that there is a good chance that memory will be available next time...
> 
> For outgoing messages I'm less clear on exactly what is going on.
> Maybe I haven't looked at that code properly yet, but I would expect
> there would be a place for waiting for memory to become available
> somewhere in the out-going path ??

The tx path is a bit fuzzed. I assume it has an upper limit, take a stab
at that upper limit and leave it at that.

It should be full of holes, and there is some work on writeout
throttling to fill some of them - but I haven't seen any lockups in this
area for a long long while.

> But there is another important difference to mempools which I think is
> worth exploring.  With mempools, you are certain that the memory will
> only be used to make forward progress in writing out dirty data.  So
> if you find that there isn't enough memory at the moment and you have
> to wait, you can be sure someone else is making forward progress and
> so waiting isn't such a bad thing.
> 
> With your reservations it isn't quite the same.  Reserved memory can
> be used for related purposes.  In particular, any incoming packet can
> use some reserved memory.  Once the purpose of that packet is
> discovered (i.e. that matching socket is found), the memory will be
> freed again.  But there is a period of time when memory is being used
> for an inappropriate purpose.  The consequences of this should be
> clearly understood.

IIRC the route-cache is in this state. Entries there can be added before
we can decide to keep or toss the packet. So we reserve enough memory to
overflow the route-cache (route-cache reclaim keeps it in bounds).

> In particular, the memory that is reserved for the emergency pool
> should include some overhead to acknowledge the fact that memory
> might be used for short periods of time for unrelated purposes.
> 
> I think we can fit this acknowledgement into the current model quite
> easily, and it makes the tree structure suddenly make lots of sense
> (whereas before I was still struggling with it).
> 
> A key observation in this design is "Sometimes we need to allocate
> emergency memory without knowing exactly what it is going to be used
> for".  I think we should make that explicit in the implementation as
> follows:
> 
>   We have a tree of reservations (as you already do) where levels in
>   the tree correspond to more explicit knowledge of how the memory
>   will be used.
>   At the top level there is a generic 'page' reservation.  Below that
>   to one side with have a 'SLUB/SLAB' reservation.  I'm not sure yet
>   exactly what that will look like.
>   Also below the 'page' reservation is a reservation for pages to hold
>   incoming network fragments.
>   Below the SLxB reservation is a reservation for skbs, which is
>   parent to a reservation for IPv4 skbs and another for IPv6 skbs.
> 
> Each of these nodes has its own independent reservation - parents are
> not simply the sum of the children.
> The sum over the whole tree is given to the VM as the size of the
> emergency pool to reserve for emergency allocations.
> 
> Now, every actual allocation from the emergency pool effectively comes
> in at the top of the tree and moves down as its purpose is more fully
> understood.  Every emergency allocation is *always* charged to one
> node in the tree, though which node may change.
> 
> e.g.
>   A network driver asks for a page to store a fragment.
>   netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
>   If alloc_page needs to dive into the emergency pool, it first
>   charges the one page against the root of the reservation tree.
>   If this succeeds, it returns the page with ->reserve set.  If the
>   reservation fails, it ignores the GFP_MEMALLOC and fails.
>   netdev_alloc_page notices that the page is a ->reserve page, and
>   knows that it has been charged to the top 'page' reservation, but it
>   should be charged to the network-page reservation.  So it tries to
>   charge against the network-pages reservation, and reverses the
>   charge against 'pages'.  If the network-pages reservation fails, the
>   page is freed and netdev_alloc_page fails.
>   As you can see, the charge moves down the tree as more information
>   becomes available.
> 
>   Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
>   'ipv4_skb'.
> 
>   At the bottom levels, the reservations say how much memory is
>   needed for that particular usage to be able to make sensible forward
>   progress.
>   At the higher levels, the reservation says how much overhead we need
>   to allow to ensure that transient invalid uses don't unduly limit
>   available emergency memory.  As pages are likely to be immediately
>   re-charged lower down the tree, the reservation at the top level
>   would probably be proportional to the number of CPUs (probably one
>   page per CPU would be perfect).  Lower down, different calculations
>   might suggest different intermediate reservations.
> 
> Of course, these things don't need to be explicitly structured as a
> tree.  There is no need for 'parent' or 'sibling' pointers.  The code
> implicitly knows where to move charges from and to.
> You still need an explicit structure to allow groups of reservations
> that are activated or de-activated as a whole.  That can use your
> current tree structure, or whatever else turns out to make sense.
> 
> This model, I think, captures the important "allocate before charging"
> aspect of reservations that you need (particularly for incoming
> network packets) and it makes that rule apply throughout the different
> stages that an allocated chunk of memory goes through.

I'm a bit confused here; the only way to keep the allocations bounded is
by accounting before allocation (well, the other way is to bound the
number of concurrent allocations).

Also, I try not to account when not needed, like with the route-cache.
We already know it has bounded memory usage because it maintains that
itself. So by just supplying enough memory to overflow the thing you're
home safe.

While the model of moving the accounting down might work, I think it is
not needed. We don't need to know if it's IPv4 or IPv6 or yet another
protocol, as long as we have enough skb room to overflow whatever caches
are in between incoming packets and socket de-multiplex.

> With this model, alloc_page could fail more often, as it now also
> fails if the top level reservation is exhausted.  This may seem
> un-necessary, but I think it could be a good thing.  It means that at
> very busy times (when lots of requests are needing emergency memory)
> we drop requests randomly and very early.  If we are going to drop a
> request eventually, dropping it early means we waste less time on it
> which is probably a good thing.

But, might you not be dropping the few packets we do want, early as
well?

> So: Does this model help others with understanding how the
> reservations work, or am I just over-engineering?

Sounds like a bit of overkill to me.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-27  8:05           ` Pekka Enberg
  2008-02-27  8:14             ` Peter Zijlstra
@ 2008-02-29 11:51             ` Peter Zijlstra
  2008-02-29 11:58               ` Pekka Enberg
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-29 11:51 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust


On Wed, 2008-02-27 at 10:05 +0200, Pekka Enberg wrote:
> Hi Peter,
> 
> On Wed, Feb 27, 2008 at 9:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >  > 1/ I note there is no way to tell if memory returned by kmalloc is
> >  >   from the emergency reserve - which contrasts with alloc_page
> >  >   which does make that information available through page->reserve.
> >  >   This seems a slightly unfortunate aspect of the interface.
> >
> >  Yes, but alas there is no room to store such information in kmalloc().
> >  That is, in a sane way. I think it was Daniel Phillips who suggested
> >  encoding it in the return pointer by flipping the low bit - but that is
> >  just too ugly and breaks all current kmalloc sites to boot.
> 
> Why can't you add a kmem_is_emergency() to SLUB that looks up the
> cache/slab/page (whatever is the smallest unit of the emergency pool
> here) for the object and use that?

I made page->reserve into PG_emergency and made that bit stick for the
lifetime of that page allocation. I then made kmem_is_emergency() look
up the head page backing that allocation's slab and return
PageEmergency().

This gives a consistent kmem_is_emergency() - that is, if during the
lifetime of the kmem allocation it returns true once, it must return
true always.

You can then, using this properly, push the accounting into
kmalloc_reserve() and kfree_reserve() (and
kmem_cache_{alloc,free}_reserve).

Which yields very pretty code all round. (I can make it public if you'd
like to see it.)
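
Roughly along these lines (a sketch of what such wrappers could look
like, not the actual code referred to above):

void *kmalloc_reserve(size_t size, gfp_t gfp, struct mem_reserve *res)
{
	void *obj;

	/* try a normal allocation first, without touching the reserves */
	obj = kmalloc(size, gfp & ~__GFP_MEMALLOC);
	if (obj || !(gfp & __GFP_MEMALLOC))
		return obj;

	/* charge the reserve before we are allowed to dip into it */
	if (!mem_reserve_kmalloc_charge(res, size, 0))
		return NULL;

	obj = kmalloc(size, gfp | __GFP_MEMALLOC);
	if (!obj || !kmem_is_emergency(obj))
		/* didn't actually consume reserve memory: give the charge back */
		mem_reserve_kmalloc_charge(res, -size, 0);

	return obj;
}

void kfree_reserve(const void *obj, struct mem_reserve *res, size_t size)
{
	if (obj && kmem_is_emergency(obj))
		mem_reserve_kmalloc_charge(res, -size, 0);
	kfree(obj);
}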

However...

This is a stricter model than I had before, and has one ramification I'm
not entirely sure I like.

It means the page remains a reserve page throughout its lifetime, which
means the slab remains a reserve slab throughout its lifetime. Therefore
it may never be used for !reserve allocations. Which in turn generates
complexities for the partial list.

In my previous model I had the reserve accounting external to the
allocation, which relaxed the strict need for consistency here, and I
dropped the reserve status once we were above the page limits again. 

I managed to complicate the SLUB patch with this extra constraint, by
checking reserve against PageEmergency() when scanning the partial list,
but gave up on SLAB.

Does this sound like something I should pursue? I feel it might
complicate the slab allocators too much..


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-29 11:51             ` Peter Zijlstra
@ 2008-02-29 11:58               ` Pekka Enberg
  2008-02-29 12:18                 ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Pekka Enberg @ 2008-02-29 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust, Christoph Lameter

Hi Peter,

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>  I made page->reserve into PG_emergency and made that bit stick for the
>  lifetime of that page allocation. I then made kmem_is_emergency() look
>  up the head page backing that allocation's slab and return
>  PageEmergency().

[snip]

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>  This is a stricter model than I had before, and has one ramification I'm
>  not entirely sure I like.
>
>  It means the page remains a reserve page throughout its lifetime, which
>  means the slab remains a reserve slab throughout its lifetime. Therefore
>  it may never be used for !reserve allocations. Which in turn generates
>  complexities for the partial list.

Hmm, so why don't we then clear the PG_emergency flag and
allocate a new fresh page to the reserves?

On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>  Does this sound like something I should pursuit? I feel it might
>  complicate the slab allocators too much..

I can't answer that question until I see the code ;-). But overall, I
think it's better to put that code in SLUB rather than trying to work
around it elsewhere. The fact is, as soon as you have some sort of
reservation for _objects_, you need help from the SLUB allocator.

                           Pekka

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-29 11:58               ` Pekka Enberg
@ 2008-02-29 12:18                 ` Peter Zijlstra
  2008-02-29 12:29                   ` Pekka Enberg
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-02-29 12:18 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust, Christoph Lameter


On Fri, 2008-02-29 at 13:58 +0200, Pekka Enberg wrote:
> Hi Peter,
> 
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >  I made page->reserve into PG_emergency and made that bit stick for the
> >  lifetime of that page allocation. I then made kmem_is_emergency() look
> >  up the head page backing that allocation's slab and return
> >  PageEmergency().
> 
> [snip]
> 
> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >  This is a stricter model than I had before, and has one ramification I'm
> >  not entirely sure I like.
> >
> >  It means the page remains a reserve page throughout its lifetime, which
> >  means the slab remains a reserve slab throughout its lifetime. Therefore
> >  it may never be used for !reserve allocations. Which in turn generates
> >  complexities for the partial list.
> 
> Hmm, so why don't we then clear the PG_emergency flag

Clearing PG_emergency would mean kmem_is_emergency() would return false
in kfree_reserve() and fail to un-charge the object.

Previously objects would track their account status themselves (when
needed) and freeing PG_emergency wouldn't be a problem.

> and allocate a new fresh page to the reserves?

Not sure I understand this properly. We would only do this once the page
watermarks are high enough, so the reserves are full again.

> On Fri, Feb 29, 2008 at 1:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >  Does this sound like something I should pursue? I feel it might
> >  complicate the slab allocators too much..
> 
> I can't answer that question until I see the code ;-). But overall, I
> think it's better to put that code in SLUB rather than trying to work
> around it elsewhere. The fact is, as soon as you have some sort of
> reservation for _objects_, you need help from the SLUB allocator.

Well, I agree that consolidating it makes sense. And like I said,
it gives pretty code. However, it also puts the burden of this feature
on everyone and might affect performance - still, it's only the slow path,
but still.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-29 12:18                 ` Peter Zijlstra
@ 2008-02-29 12:29                   ` Pekka Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2008-02-29 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Neil Brown, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, netdev, trond.myklebust, Christoph Lameter

On Fri, Feb 29, 2008 at 2:18 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>  Clearing PG_emergency would mean kmem_is_emergency() would return false
>  in kfree_reserve() and fail to un-charge the object.
>
>  Previously objects would track their account status themselves (when
>  needed) and freeing PG_emergency wouldn't be a problem.
>
>  > and allocate a new fresh page to the reserves?
>
>  Not sure I understand this properly. We would only do this once the page
>  watermarks are high enough, so the reserves are full again.

The problem with PG_emergency is that, once the watermarks are high
again, SLUB keeps holding on to the emergency page and it cannot be used
for regular kmalloc allocations, right?

So the way to fix this is to batch uncharge the objects and clear
PG_emergency for the said SLUB pages thus freeing them for regular
allocations. And to compensate for the loss in the reserves, we ask
the page allocator to give a new one that SLUB knows nothing about.

If you don't do this, the reserve page may hold only a few objects
yet remain unavailable for regular allocations. So we might be
forced into "emergency mode" even though there's enough memory
available to satisfy the allocation.

                          Pekka

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-02-29 10:21             ` Peter Zijlstra
@ 2008-03-02 22:18               ` Neil Brown
  2008-03-02 23:33                 ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-03-02 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Friday February 29, a.p.zijlstra@chello.nl wrote:
> 
> The tx path is a bit fuzzed. I assume it has an upper limit, take a stab
> at that upper limit and leave it at that.
> 
> It should be full of holes, and there is some work on writeout
> throttling to fill some of them - but I haven't seen any lockups in this
> area for a long long while.

I think this is very interesting and useful.

You seem to be saying that the write-throttling is enough to avoid any
starvation on the transmit path. i.e. the VM is limiting the amount
of dirty memory so that when we desperately need memory on the
writeout path we can always get it, without lots of careful
accounting.

So why doesn't this work for the receive side?  What - exactly - is
the difference?

I think the difference is that on the receive side we have to hand
out memory before we know how it will be used (i.e. whether it is for
a SK_MEMALLOC socket or not) and so emergency memory could get stuck
in some non-emergency usage.

So suppose we forgot about all the allocation tracking (that doesn't
seem to be needed on the send side so maybe isn't on the receive side)
and just focus on not letting emergency memory get used for the wrong
thing.

So: Have some global flag that says "we are into the emergency pool"
which gets set the first time an allocation has to dip below the low
water mark, and cleared when an allocation succeeds without needing to
dip that low.

Then whenever we have memory that might have been allocated from below
the watermark (i.e. an incoming packet) and we find out that it isn't
required for write-out (i.e. it gets attached to a socket for which
SK_MEMALLOC is not set) we test the global flag and if it is set, we
drop the packet and free the memory.
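
In (made-up) code the idea would be roughly this - a stand-alone model
of the proposal, not of the existing patches:

static int memory_emergency;    /* "we are into the emergency pool"       */

/* page allocator: remember whether this allocation had to dip below
 * the low watermark */
void note_allocation(int dipped_below_low_watermark)
{
        memory_emergency = dipped_below_low_watermark;
}

/* socket demux: a packet arrived for a socket without SK_MEMALLOC set */
int keep_non_memalloc_packet(void)
{
        return !memory_emergency;       /* 0 means drop it and free the skb */
}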

To clarify: 
   no new accounting
   no new reservations
   just "drop non-writeout packets when we are low on memory"

Is there any chance that could work reliably?

I call this the "Linus" approach because I remember reading somewhere
(that google cannot find for me) where Linus said he didn't think a
provably correct implementation was the way to go - just something
that made it very likely that we won't run out of memory at an awkward
time.

I guess my position is that any accounting that we do needs to have a
clear theoretical model underneath it so we can reason about it.
I cannot see a clear model beneath the current code so I'm trying
to come up with models that seem to capture the important elements of
the code, and to explore them.

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-02 22:18               ` Neil Brown
@ 2008-03-02 23:33                 ` Peter Zijlstra
  2008-03-03 23:41                   ` Neil Brown
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-03-02 23:33 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Mon, 2008-03-03 at 09:18 +1100, Neil Brown wrote:
> On Friday February 29, a.p.zijlstra@chello.nl wrote:
> > 
> > The tx path is a bit fuzzed. I assume it has an upper limit, take a stab
> > at that upper limit and leave it at that.
> > 
> > It should be full of holes, and there is some work on writeout
> > throttling to fill some of them - but I haven't seen any lockups in this
> > area for a long long while.
> 
> I think this is very interesting and useful.
> 
> You seem to be saying that the write-throttling is enough to avoid any
> starvation on the transmit path. i.e. the VM is limiting the amount
> of dirty memory so that when we desperately need memory on the
> writeout path we can always get it, without lots of careful
> accounting.
> 
> So why doesn't this work for the receive side?  What - exactly - is
> the difference?

The TX path needs to be able to make progress in that it must be able to
send out at least one full request (page). The thing the TX path must
not do is tie up so much memory sending out pages that we can't receive
any incoming packets.

So, having a throttle on the amount of writes in progress, and
sufficient memory to back those, seems like a solid approach here.

NFS has such a limit in its congestion logic. But I'm quite sure I'm
failing to allocate enough memory to back it, as I got confused by the
whole RPC code.

> I think the difference is that on the receive side we have to hand
> out memory before we know how it will be used (i.e. whether it is for
> a SK_MEMALLOC socket or not) and so emergency memory could get stuck
> in some non-emergency usage.
> 
> So suppose we forgot about all the allocation tracking (that doesn't
> seem to be needed on the send side so maybe isn't on the receive side)
> and just focus on not letting emergency memory get used for the wrong
> thing.
> 
> So: Have some global flag that says "we are into the emergency pool"
> which gets set the first time an allocation has to dip below the low
> water mark, and cleared when an allocation succeeds without needing to
> dip that low.

That is basically what the slub logic I added does. Except that global
flags in the vm make people very nervous, so it's a little more complex.

> Then whenever we have memory that might have been allocated from below
> the watermark (i.e. an incoming packet) 

Which is what I do in the skb_alloc() path.

> and we find out that it isn't
> required for write-out (i.e. it gets attached to a socket for which
> SK_MEMALLOC is not set) we test the global flag and if it is set, we
> drop the packet and free the memory.

Which is somewhat more complex than you make it sound, but that is
exactly what I do.

> To clarify: 
>    no new accounting
>    no new reservations
>    just "drop non-writeout packets when we are low on memory"
> 
> Is there any chance that could work reliably?

You need to be able to overflow the ip fragment assembly cache, or we
could get stuck with all memory in fragments.

Same for other memory usage before we hit the socket de-multiplex, like
the route-cache.

I just refined those points here; you need to drop more than
non-writeout packets, you need to drop all packets not meant for
SK_MEMALLOC.

You also need to allow some writeout packets, because if you hit 'oom'
and need to write-out some pages to free up memory,...

I did the reservation because I wanted some guarantee we'd be able to
over-flow the caches mentioned. The alternative is working with the
variable ratio that the current reserve has.

The accounting makes the whole system more robust. I wanted to make the
state stable enough to survive a connection drop, or server reset for a
long while, and it does. During a swapping workload and heavy network
load, I can pull the network cable, or shut down the NFS server and
leave it down for over 30 minutes. When I bring it back up again, stuff
resumes.

> I call this the "Linus" approach because I remember reading somewhere
> (that google cannot find for me) where Linus said he didn't think a
> provably correct implementation was the way to go - just something
> that made it very likely that we won't run out of memory at an awkward
> time.
> 
> I guess my position is that any accounting that we do needs to have a
> clear theoretical model underneath it so we can reason about it.
> I cannot see a clear model beneath the current code so I'm trying
> to come up with models that seem to capture the important elements of
> the code, and to explore them.

From my POV there is a model, and I've tried to convey it, but clearly
I'm failing horribly. Let me try again:

Create a stable state where you can receive an unlimited amount of
network packets awaiting the one packet you need to move forward.

To do so we need to distinguish needed from unneeded packets; we do this
by means of SK_MEMALLOC. So we need to be able to receive packets up to
that point.

The unlimited amount of packets means unlimited time; which means that
our state must not consume memory, merely use memory. That is, the
amount of memory used must not grow unbounded over time.

So we must guarantee that all memory allocated will be promptly freed
again, and never allocate more than available.

Because this state is not the normal state, we need a trigger to enter
this state (and consequently a trigger to leave this state). We do that
by detecting a low memory situation just like you propose. We enter this
state once normal memory allocations fail and leave this state once they
start succeeding again.

We need the accounting to ensure we never allocate more than is
available, but more importantly because we need to ensure progress for
those packets we already have allocated.

A packet is received, it can be a fragment, it will be placed in the
fragment cache for packet re-assembly.

We need to ensure we can overflow this fragment cache in order that
something will come out at the other end. If under a fragment attack,
the fragment cache limit will prune the oldest fragments, freeing up
memory to receive new ones.

Eventually we'd be able to receive either a whole packet, or enough
fragments to assemble one.

Next comes routing the packet; we need to know where to process the
packet; local or non-local. This potentially involves filling the
route-cache.

If at this point there is no memory available because we forgot to limit
the amount of memory available for skb allocation we again are stuck.

The route-cache, like the fragment assembly, is already accounted and
will prune old (unused) entries once the total memory usage exceeds a
pre-determined amount of memory.

Eventually we'll end up at socket demux, matching packets to sockets
which allows us to either toss the packet or consume it. Dropping
packets is allowed because network is assumed lossy, and we have not yet
acknowledged the receive.

Does this make sense?


Then we have TX, which like I said above needs to operate under certain
limits as well. We need to be able to send out packets when under
pressure in order to relieve said pressure.

We need to ensure doing so will not exhaust our reserves.

Writing out a page typically takes a little memory, you fudge some
packets with protocol info, mtu size etc.. send them out, and wait for
an acknowledge from the other end, and drop the stuff and go on writing
other pages.

So sending out pages does not consume memory if we're able to receive
ACKs. Being able to receive packets is what all the previous was
about.

Now of course there is some RPC concurrency, TCP windows and other
funnies going on, but I assumed - and I don't think that's a wrong
assumption - that sending out pages will consume endless amounts of
memory.

Nor will it keep on sending pages, once there is a certain amount of
packets outstanding (nfs congestion logic), it will wait, at which point
it should have no memory in use at all.

Anyway I did get lost in the RPC code, and I know I didn't fully account
everything, but under some (hopefully realistic) assumptions I think the
model is sound.

Does this make sense?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-02 23:33                 ` Peter Zijlstra
@ 2008-03-03 23:41                   ` Neil Brown
  2008-03-04 10:28                     ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-03-03 23:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


Hi Peter,

 Thanks for trying to spell it out for me. :-)

On Monday March 3, a.p.zijlstra@chello.nl wrote:
> 
> From my POV there is a model, and I've tried to convey it, but clearly
> I'm failing ^[$,3r_^[(Bhorribly. Let me try again:
> 
> Create a stable state where you can receive an unlimited amount of
> network packets awaiting the one packet you need to move forward.

Yep.

> 
> To do so we need to distinguish needed from unneeded packets; we do this
> by means of SK_MEMALLOC. So we need to be able to receive packets up to
> that point.

Yep.

> 
> The unlimited amount of packets means unlimited time; which means that
> our state must not consume memory, merely use memory. That is, the
> amount of memory used must not grow unbounded over time.

Yes.  Good point.

> 
> So we must guarantee that all memory allocated will be promptly freed
> again, and never allocate more than available.

Definitely.

> 
> Because this state is not the normal state, we need a trigger to enter
> this state (and consequently a trigger to leave this state). We do that
> by detecting a low memory situation just like you propose. We enter this
> state once normal memory allocations fail and leave this state once they
> start succeeding again.

Agreed.

> 
> We need the accounting to ensure we never allocate more than is
> available, but more importantly because we need to ensure progress for
> those packets we already have allocated.

Maybe...
 1/ Memory is used 
     a/ in caches, such as the fragment cache and the route cache
     b/ in transient allocations on their way from one place to
        another. e.g. network card to fragment cache, frag cache to
        socket. 
    The caches can (do?) impose a natural limit on the amount of
    memory they use.  The transient allocations should be satisfied
    from the normal low watermark pool.  When we are in a low memory
    conditions we can expect packet loss so we expect network streams
    to slow down, so we expect there to be fewer bits in transit.
    Also in low memory conditions the caches would be extra-cautious
    not to use too much memory.
    So it isn't completely clear (to me) that extra accounting is needed.

 2/ If we were to do accounting to "ensure progress for those packets
    we already have allocated", then I would expect a reservation
    (charge) of max_packet_size when a fragment arrives on the network
    card - or at least when a new fragment is determined to not match
    any packet already in the fragment cache.  But I didn't see that
    in your code.  I saw incremental charges as each page arrived.
    And that implementation does seem to fit the model.
  
> 
> A packet is received, it can be a fragment, it will be placed in the
> fragment cache for packet re-assembly.

Understood.

> 
> We need to ensure we can overflow this fragment cache in order that
> something will come out at the other end. If under a fragment attack,
> the fragment cache limit will prune the oldest fragments, freeing up
> memory to receive new ones.

I don't understand why we want to "overflow this fragment cache".
I picture the cache having a target size.  When under this size,
fragments might be allowed to live longer.  When at or over the target
size, old fragments are pruned earlier.  When in a low memory
situation it might be even more keen to prune old fragments, to keep
beneath the target size.
When you say "overflow this fragment cache", I picture deliberately
allowing the cache to get bigger than the target size.  I don't
understand why you would want to do that.

> 
> Eventually we'd be able to receive either a whole packet, or enough
> fragments to assemble one.

That would be important, yes.

> 
> Next comes routing the packet; we need to know where to process the
> packet; local or non-local. This potentially involves filling the
> route-cache.
> 
> If at this point there is no memory available because we forgot to limit
> the amount of memory available for skb allocation we again are stuck.

Those skbs we allocated - they are either sitting in the fragment
cache, or have been attached to a SK_MEMALLOC socket, or have been
freed - correct?  If so, then there is already a limit to how much
memory they can consume.

> 
> The route-cache, like the fragment assembly, is already accounted and
> will prune old (unused) entries once the total memory usage exceeds a
> pre-determined amount of memory.

Good.  So as long as the normal emergency reserves covers the size of
the route cache plus the size of the fragment cache plus a little bit
of slack, we should be safe - yes?

> 
> Eventually we'll end up at socket demux, matching packets to sockets
> which allows us to either toss the packet or consume it. Dropping
> packets is allowed because network is assumed lossy, and we have not yet
> acknowledged the receive.
> 
> Does this make sense?

Lots of it does, yes.

> 
> 
> Then we have TX, which like I said above needs to operate under certain
> limits as well. We need to be able to send out packets when under
> pressure in order to relieve said pressure.

Catch-22 ?? :-)

> 
> We need to ensure doing so will not exhaust our reserves.
> 
> Writing out a page typically takes a little memory, you fudge some
> packets with protocol info, mtu size etc.. send them out, and wait for
> an acknowledge from the other end, and drop the stuff and go on writing
> other pages.

Yes, rate-limiting those write-outs should keep that moving.

> 
> So sending out pages does not consume memory if we're able to receive
> ACKs. Being able to receive packets is what all the previous was
> about.
> 
> Now of course there is some RPC concurrency, TCP windows and other
> funnies going on, but I assumed - and I don't think that's a wrong
> assumption - that sending out pages will consume endless amounts of
                                          ^not ??
> memory.

Sounds fair.

> 
> Nor will it keep on sending pages, once there is a certain amount of
> packets outstanding (nfs congestion logic), it will wait, at which point
> it should have no memory in use at all.

Providing it frees any headers it attached to each page (or had
allocated them from a private pool), it should have no memory in use.
I'd have to check through the RPC code (I get lost in there too) to
see how much memory is tied up by each outstanding page write.

> 
> Anyway I did get lost in the RPC code, and I know I didn't fully account
> everything, but under some (hopefully realistic) assumptions I think the
> model is sound.
> 
> Does this make sense?

Yes.

So I can see two possible models here.

The first is the "bounded cache" or "locally bounded" model.
At every step in the path from writepage to clear_page_writeback,
the amount of extra memory used is bounded by some local rules.
NFS and RPC uses congestion logic to limit the number of outstanding
writes.  For incoming packets, the fragment cache and route cache
impose their own limits.
We simply need that the VM reserves a total amount of memory to meet
the sum of those local limits.

Your code embodies this model with the tree of reservations.  The root
of the tree stores the sum of all the reservations below, and this
number is given to the VM.
The value of the tree is that different components can register their
needs independently, and the whole tree (or subtrees) can be attached
or not depending on global conditions, such as whether there are any
SK_MEMALLOC sockets or not.

However I don't see how the charging that you implemented fits into
this model.
You don't do any significant charging for the route cache.  But you do
for skbs.  Why?  Don't the majority of those skbs live in the fragment
cache?  Doesn't it account their size? (Maybe it doesn't.... maybe it
should?).

I also don't see the value of tracking pages to see if they are
'reserve' pages or not.  The decision to drop an skb that is not for
an SK_MEMALLOC socket should be based on whether we are currently
short on memory.  Not whether we were short on memory when the skb was
allocated.

The second model that could fit is "total accounting". 
In this model we reserve memory at each stage including the transient
stages (packet that has arrived but isn't in fragment cache yet).
As memory moves around, we move the charging from one reserve to
another.  If the target reserve doesn't have an space, we drop the
message.
On the transmit side, that means putting the page back on a queue for
sending later.  On the receive side that means discarding the packet
and waiting for a resend.
This model makes it easy for the various limits to be very different
while under memory pressure than otherwise.  It also means they are
imposed differently which isn't so good.

So:
 - Why do you impose skb allocation limits beyond what is imposed
   by the fragment cache?
 - Why do you need to track whether each allocation is a reserve or
   not?

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-03 23:41                   ` Neil Brown
@ 2008-03-04 10:28                     ` Peter Zijlstra
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-03-04 10:28 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust


On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> Hi Peter,
> 
>  Thanks for trying to spell it out for me. :-)
> 
> On Monday March 3, a.p.zijlstra@chello.nl wrote:
> > 
> > From my POV there is a model, and I've tried to convey it, but clearly
> > I'm failing ^[$,3r_^[(Bhorribly. Let me try again:

Hmm, I wonder what went wrong there ^

> > Create a stable state where you can receive an unlimited amount of
> > network packets awaiting the one packet you need to move forward.
> 
> Yep.
> 
> > 
> > To do so we need to distinguish needed from unneeded packets; we do this
> > by means of SK_MEMALLOC. So we need to be able to receive packets up to
> > that point.
> 
> Yep.
> 
> > 
> > The unlimited amount of packets means unlimited time; which means that
> > our state must not consume memory, merely use memory. That is, the
> > amount of memory used must not grow unbounded over time.
> 
> Yes.  Good point.
> 
> > 
> > So we must guarantee that all memory allocated will be promptly freed
> > again, and never allocate more than available.
> 
> Definitely.
> 
> > 
> > Because this state is not the normal state, we need a trigger to enter
> > this state (and consequently a trigger to leave this state). We do that
> > by detecting a low memory situation just like you propose. We enter this
> > state once normal memory allocations fail and leave this state once they
> > start succeeding again.
> 
> Agreed.
> 
> > 
> > We need the accounting to ensure we never allocate more than is
> > available, but more importantly because we need to ensure progress for
> > those packets we already have allocated.
> 
> Maybe...
>  1/ Memory is used 
>      a/ in caches, such as the fragment cache and the route cache
>      b/ in transient allocations on their way from one place to
>         another. e.g. network card to fragment cache, frag cache to
>         socket. 
>     The caches can (do?) impose a natural limit on the amount of
>     memory they use.  The transient allocations should be satisfied
>     from the normal low watermark pool.  When we are in a low memory
>     conditions we can expect packet loss so we expect network streams
>     to slow down, so we expect there to be fewer bits in transit.
>     Also in low memory conditions the caches would be extra-cautious
>     not to use too much memory.
>     So it isn't completely clear (to me) that extra accounting is needed.
> 
>  2/ If we were to do accounting to "ensure progress for those packets
>     we already have allocated", then I would expect a reservation
>     (charge) of max_packet_size when a fragment arrives on the network
>     card - or at least when a new fragment is determined to not match
>     any packet already in the fragment cache.  But I didn't see that
>     in your code.  I saw incremental charges as each page arrived.
>     And that implementation does seem to fit the model.

Ah, the extra accounting I do is to count the number of bytes associated
with skb data, so that we don't exhaust the reserves with incoming
packets. Like you said, packets need a little more memory in their
travels up to the socket demux.

When you look at __alloc_skb(), you'll find we charge the data size to
the reserves, and if you look at __netdev_alloc_page(), you'll see
PAGE_SIZE being charged against the skb reserve.

If we did not do this, and the incoming packet rate were high
enough, we could exhaust the reserves and leave the packets no memory to
use on their travels to the socket demux.
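
As a stand-alone model of that charging (the names and the limit value
here are invented purely for illustration):

#include <stdlib.h>

struct mem_reserve { long limit; long usage; };

static struct mem_reserve net_rx_reserve = { .limit = 1 << 20 };

/* returns the data buffer, or NULL when the RX reserve would be exceeded */
void *alloc_skb_data(unsigned long size, int from_emergency_pool)
{
        if (from_emergency_pool) {
                if (net_rx_reserve.usage + (long)size > net_rx_reserve.limit)
                        return NULL;                    /* drop the packet */
                net_rx_reserve.usage += size;           /* charge          */
        }
        return malloc(size);
}

void free_skb_data(void *data, unsigned long size, int was_emergency)
{
        if (was_emergency)
                net_rx_reserve.usage -= size;           /* un-charge       */
        free(data);
}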
 
> > A packet is received, it can be a fragment, it will be placed in the
> > fragment cache for packet re-assembly.
> 
> Understood.
> 
> > 
> > We need to ensure we can overflow this fragment cache in order that
> > something will come out at the other end. If under a fragment attack,
> > the fragment cache limit will prune the oldest fragments, freeing up
> > memory to receive new ones.
> 
> I don't understand why we want to "overflow this fragment cache".
> I picture the cache having a target size.  When under this size,
> fragments might be allowed to live longer.  When at or over the target
> size, old fragments are pruned earlier.  When in a low memory
> situation it might be even more keen to prune old fragments, to keep
> beneath the target size.
> When you say "overflow this fragment cache", I picture deliberately
> allowing the cache to get bigger than the target size.  I don't
> understand why you would want to do that.

What I mean by overflowing is: by providing more than the cache can
handle we guarantee that forward progress is made, because either the
cache will prune items - giving us memory to continue - or we'll get out
a fully assembled packet which can continue its quest :-)

If we provide less memory than the cache can hold, all memory can be
tied up in the cache, waiting for something to happen - which won't,
because we're out of memory.

> > Eventually we'd be able to receive either a whole packet, or enough
> > fragments to assemble one.
> 
> That would be important, yes.
> 
> > 
> > Next comes routing the packet; we need to know where to process the
> > packet; local or non-local. This potentially involves filling the
> > route-cache.
> > 
> > If at this point there is no memory available because we forgot to limit
> > the amount of memory available for skb allocation we again are stuck.
> 
> Those skbs we allocated - they are either sitting in the fragment
> cache, or have been attached to a SK_MEMALLOC socket, or have been
> freed - correct?  If so, then there is already a limit to how much
> memory they can consume.

Not really, there is no natural limit to the amount of packets that can
be in transit between RX and socket demux. So we need the extra (skb)
accounting to impose that.

> > The route-cache, like the fragment assembly, is already accounted and
> > will prune old (unused) entries once the total memory usage exceeds a
> > pre-determined amount of memory.
> 
> Good.  So as long as the normal emergency reserves covers the size of
> the route cache plus the size of the fragment cache plus a little bit
> of slack, we should be safe - yes?

Basically (except for the last point), unless I've missed something in
the net-stack. 

> > Eventually we'll end up at socket demux, matching packets to sockets
> > which allows us to either toss the packet or consume it. Dropping
> > packets is allowed because network is assumed lossy, and we have not yet
> > acknowledged the receive.
> > 
> > Does this make sense?
> 
> Lots of it does, yes.

Good, making progress here :-)

> > Then we have TX, which like I said above needs to operate under certain
> > limits as well. We need to be able to send out packets when under
> > pressure in order to relieve said pressure.
> 
> Catch-22 ?? :-)

Yeah, swapping is such fun.. Which is why we have these reserves. We
only fake we're out of memory, but secretly we do have some left. But,
sssh don't tell user-space :-)

> > We need to ensure doing so will not exhaust our reserves.
> > 
> > Writing out a page typically takes a little memory, you fudge some
> > packets with protocol info, mtu size etc.. send them out, and wait for
> > an acknowledge from the other end, and drop the stuff and go on writing
> > other pages.
> 
> Yes, rate-limiting those write-outs should keep that moving.
> 
> > 
> > So sending out pages does not consume memory if we're able to receive
> > ACKs. Being able to receive packets is what all the previous was
> > about.
> > 
> > Now of course there is some RPC concurrency, TCP windows and other
> > funnies going on, but I assumed - and I don't think that's a wrong
> > assumption - that sending out pages will consume endless amounts of
>                                           ^not ??

Uhm yeah, sorry about that.

> > memory.
> 
> Sounds fair.
> 
> > 
> > Nor will it keep on sending pages, once there is a certain amount of
> > packets outstanding (nfs congestion logic), it will wait, at which point
> > it should have no memory in use at all.
> 
> Providing it frees any headers it attached to each page (or had
> allocated them from a private pool), it should have no memory in use.
> I'd have to check through the RPC code (I get lost in there too) to
> see how much memory is tied up by each outstanding page write.
> 
> > 
> > Anyway I did get lost in the RPC code, and I know I didn't fully account
> > everything, but under some (hopefully realistic) assumptions I think the
> > model is sound.
> > 
> > Does this make sense?
> 
> Yes.
> 
> So I can see two possible models here.
> 
> The first is the "bounded cache" or "locally bounded" model.
> At every step in the path from writepage to clear_page_writeback,
> the amount of extra memory used is bounded by some local rules.
> NFS and RPC uses congestion logic to limit the number of outstanding
> writes.  For incoming packets, the fragment cache and route cache
> impose their own limits.
> We simply need that the VM reserves a total amount of memory to meet
> the sum of those local limits.
> 
> Your code embodies this model with the tree of reservations.  The root
> of the tree stores the sum of all the reservations below, and this
> number is given to the VM.
> The value of the tree is that different components can register their
> needs independently, and the whole tree (or subtrees) can be attached
> or not depending on global conditions, such as whether there are any
> SK_MEMALLOC sockets or not.
> 
> However I don't see how the charging that you implemented fits into
> this model.
> You don't do any significant charging for the route cache.  But you do
> for skbs.  Why?  Don't the majority of those skbs live in the fragment
> cache?  Doesn't it account their size? (Maybe it doesn't.... maybe it
> should?).

To impose a limit on the amount of skb data in transit. Like stated
above, there is (afaik) no natural limit on this.

> I also don't see the value of tracking pages to see if they are
> 'reserve' pages or not.  The decision to drop an skb that is not for
> an SK_MEMALLOC socket should be based on whether we are currently
> short on memory.  Not whether we were short on memory when the skb was
> allocated.

That comes from accounting, once you need to account data you need to
know when to start accounting, and keep state so that you can properly
un-account.

Also, from a practical POV it's easier to detect our lack of memory from
an allocation site than outside of it.

> The second model that could fit is "total accounting". 
> In this model we reserve memory at each stage including the transient
> stages (packet that has arrived but isn't in fragment cache yet).
> As memory moves around, we move the charging from one reserve to
> another.  If the target reserve doesn't have an space, we drop the
> message.
> On the transmit side, that means putting the page back on a queue for
> sending later.  On the receive side that means discarding the packet
> and waiting for a resend.
> This model makes it easy for the various limits to be very different
> while under memory pressure than otherwise.  It also means they are
> imposed differently which isn't so good.
> 
> So:
>  - Why do you impose skb allocation limits beyond what is imposed
>    by the fragment cache?

To impose a limit on the amount of skbs in transit. The fragment cache
only imposes a limit on the amount of data held for packet assembly. Not
the total amount of skb data between receive and socket demux.

>  - Why do you need to track whether each allocation is a reserve or
>    not?

To do accounting.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
       [not found]           ` <1204626509.6241.39.camel@lappy>
@ 2008-03-07  3:33             ` Neil Brown
  2008-03-07 11:17               ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-03-07  3:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust

On Tuesday March 4, a.p.zijlstra@chello.nl wrote:
> 
> On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> > 
> > Those skbs we allocated - they are either sitting in the fragment
> > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > freed - correct?  If so, then there is already a limit to how much
> > memory they can consume.
> 
> Not really, there is no natural limit to the amount of packets that can
> be in transit between RX and socket demux. So we need the extra (skb)
> accounting to impose that.

Isn't there?  A brief look at the code suggests that (except for
fragment handling) there is a fairly straight path from
network-receive to socket demux.  No queues along the way.
That suggests the number of in-transit skbs should be limited by the
number of CPUs.  Did I miss something?  Or is the number of CPUs
potentially too large to be a suitable limit (seems unlikely).

While looking at the code it also occurred to me that:
  1/ tcpdump could be sent incoming packets.  Is there a limit
     to the number of packets that can be in-kernel waiting for
     tcpdump to collect them?  Should this limit be added to the base
     reserve?
  2/ If the host is routing network packets, then incoming packets
     might go on an outbound queue.  Is this space limited?  and
     included in the reserve?

Not major points, but I thought I would mention them.

> > I also don't see the value of tracking pages to see if they are
> > 'reserve' pages or not.  The decision to drop an skb that is not for
> > an SK_MEMALLOC socket should be based on whether we are currently
> > short on memory.  Not whether we were short on memory when the skb was
> > allocated.
> 
> That comes from accounting, once you need to account data you need to
> know when to start accounting, and keep state so that you can properly
> un-account.
> 

skbs are the main (only?) thing you do accounting on, so focusing on
those:

Suppose that every time you allocate memory for an skb, you check
if the allocation had to dip into emergency reserves, and account the
memory if so - releasing the memory and dropping the packet if we are
over the limit.
And any time you free memory associated with an skb, you check if the
accounts currently say '0', and if not subtract the size of the
allocation from the accounts.

Then you have quite workable accounting that doesn't need to tag every
piece of memory with its 'reserve' status, and only pays the
accounting cost (presumably a spinlock) when running out of memory, or
just recovering.
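
Or, as a small stand-alone sketch of that idea (names invented):

struct reserve_account { long usage; long limit; };

/* allocation path: only account when the allocation dipped into reserves */
int charge_if_emergency(struct reserve_account *a, long size, int was_emergency)
{
        if (!was_emergency)
                return 1;
        if (a->usage + size > a->limit)
                return 0;               /* over the limit: drop the skb   */
        a->usage += size;
        return 1;
}

/* free path: no per-allocation 'reserve' tag needed, just clamp at zero */
void uncharge(struct reserve_account *a, long size)
{
        if (a->usage > 0)
                a->usage = a->usage > size ? a->usage - size : 0;
}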

This more relaxed approach to accounting reserved vs non-reserved
memory has a strong parallel in your slub code (which I now
understand).  When sl[au]b first gets a ->reserve page, it sets the
->reserve flag on the memcache and leaves it set until it sometime
later gets a non-"->reserve" page.  Any memory freed in the mean time
(whether originally reserved or not) is treated as reserve memory in
that it will only be returned for ALLOC_NO_WATERMARKS allocations.
I think this is a good way of working with reserved memory.  It isn't
precise, but it is low-cost and good enough to get you through the
difficult patch.

Your netvm-skbuff-reserve.patch has some code to make sure that all
the allocations in an skb have the same 'reserve' status.   I don't
think that is needed and just makes the code messy - plus it requires
the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
a wart on the interface.

I would suggest getting rid of that.  Just flag the whole skb if any
part gets a 'reserve' allocation, and use that flag to decide to drop
packets arriving at non-SK_MEMALLOC sockets.



So: I think I now really understand what your code is doing, so I will
try to explain it in terms that even I understand... This text is
explicitly available under GPLv2 in case you want it.

It actually describes something a bit different to what your code
currently does, but I think it is very close to the spirit.  Some
differences follow from my observations above.  Others the way that
seemed to make sense while describing the problem and solution
differed slightly from what I saw the code doing.  Obviously the code
and the description should be aligned one way or another before being
finalised.  
The description is a bit long ... sorry about that.  But I wanted to
make sure I included motivation and various assumptions.  Some of my
understanding may well be wrong, but I present it here anyway.  It is
easier for you to correct if it is clearly visible:-)

Problem:
   When Linux needs to allocate memory it may find that there is
   insufficient free memory so it needs to reclaim space that is in
   use but not needed at the moment.  There are several options:

   1/ Shrink a kernel cache such as the inode or dentry cache.  This
      is fairly easy but provides limited returns.
   2/ Discard 'clean' pages from the page cache.  This is easy, and
      works well as long as there are clean pages in the page cache.
      Similarly clean 'anonymous' pages can be discarded - if there
      are any.
   3/ Write out some dirty page-cache pages so that they become clean.
      The VM limits the number of dirty page-cache pages to e.g. 40%
      of available memory so that (among other reasons) a "sync" will
      not take excessively long.  So there should never be excessive
      amounts of dirty pagecache.
      Writing out dirty page-cache pages involves work by the
      filesystem which may need to allocate memory itself.  To avoid
      deadlock, filesystems use GFP_NOFS when allocating memory on the
      write-out path.  When this is used, cleaning dirty page-cache
      pages is not an option so if the filesystem finds that  memory
      is tight, another option must be found.
   4/ Write out dirty anonymous pages to the "Swap" partition/file.
      This is the most interesting for a couple of reasons.
      a/ Unlike dirty page-cache pages, there is no need to write anon
         pages out unless we are actually short of memory.  Thus they
         tend to be left to last.
      b/ Anon pages tend to be updated randomly and unpredictably, and
         flushing them out of memory can have a very significant
         performance impact on the process using them.  This contrasts
         with page-cache pages which are often written sequentially
         and often treated as "write-once, read-many".
      So anon pages tend to be left until last to be cleaned, and may
      be the only cleanable pages while there are still some dirty
      page-cache pages (which are waiting on a GFP_NOFS allocation).

[I don't find the above wholly satisfying.  There seems to be too much
 hand-waving.  If someone can provide better text explaining why
 swapout is a special case, that would be great.]

So we need to be able to write to the swap file/partition without
needing to allocate any memory ... or only a small well controlled
amount.

The VM reserves a small amount of memory that can only be allocated
for use as part of the swap-out procedure.  It is only available to
processes with the PF_MEMALLOC flag set, which is typically just the
memory cleaner.

Traditionally swap-out is performed directly to block devices (swap
files on block-device filesystems are supported by examining the
mapping from file offset to device offset in advance, and then using
the device offsets to write directly to the device).  Block devices
are (required to be) written to pre-allocate any memory that might be
needed during write-out, and to block when the pre-allocated memory is
exhausted and no other memory is available.  They can be sure not to
block forever as the pre-allocated memory will be returned as soon as
the data it is being used for has been written out.  The primary
mechanism for pre-allocating memory is called "mempools".

This approach does not work for writing anonymous pages
(i.e. swapping) over a network, using e.g NFS or NBD or iSCSI.


The main reason that it does not work is that when data from an anon
page is written to the network, we must wait for a reply to confirm
the data is safe.  Receiving that reply will consume memory and,
significantly, we need to allocate memory to an incoming packet before
we can tell if it is the reply we are waiting for or not.

The secondary reason is that the network code is not written to use
mempools and in most cases does not need to use them.  Changing all
allocations in the networking layer to use mempools would be quite
intrusive, and would waste memory, and probably cause a slow-down in
the common case of not swapping over the network.

These problems are addressed by enhancing the system of memory
reserves used by PF_MEMALLOC and requiring any in-kernel networking
client that is used for swap-out to indicate which sockets are used
for swapout so they can be handled specially in low memory situations.

There are several major parts to this enhancement:

1/ PG_emergency, GFP_MEMALLOC

  To handle low memory conditions we need to know when those
  conditions exist.  Having a global "low on memory" flag seems easy,
  but its implementation is problematic.  Instead we make it possible
  to tell if a recent memory allocation required use of the emergency
  memory pool.
  For pages returned by alloc_page, the new page flag PG_emergency
  can be tested.  If this is set, then a low memory condition was
  current when the page was allocated, so the memory should be used
  carefully.

  For memory allocated using slab/slub: If a page that is added to a
  kmem_cache is found to have PG_emergency set, then a  ->reserve
  flag is set for the whole kmem_cache.  Further allocations will only
  be returned from that page (or any other page in the cache) if they
  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
  Non-emergency allocations will block in alloc_page until a
  non-reserve page is available.  Once a non-reserve page has been
  added to the cache, the ->reserve flag on the cache is removed.
  When memory is returned by slab/slub, PG_emergency is set on the page
  holding the memory to match the ->reserve flag on that cache.

  After memory has been returned by kmem_cache_alloc or kmalloc, the
  page's PG_emergency flag can be tested.  If it is set, then the most
  recent allocation from that cache required reserve memory, so this
  allocation should be used with care.

  It is not safe to test the cache's ->reserve flag immediately after
  an allocation as that flag is in per-cpu data, and the process could
  have been rescheduled to a different cpu if preemption is enabled.
  Thus the use of PG_emergency to carry this information.

  This allows us to
   a/ request use of the emergency pool when allocating memory
     (GFP_MEMALLOC), and 
   b/ to find out if the emergency pool was used.
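
  For example, a caller might do something like this (a sketch only;
  PageEmergency() and __GFP_MEMALLOC are the names used in this
  description, and account_emergency_usage() is a hypothetical helper):

        obj = kmalloc(size, gfp_mask | __GFP_MEMALLOC);
        if (obj && PageEmergency(virt_to_head_page(obj)))
                account_emergency_usage(size);  /* reserve was used       */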

2/ SK_MEMALLOC, sk_buff->emergency.

  When memory from the reserve is used to store incoming network
  packets, the memory must be freed (and the packet dropped) as soon
  as we find out that the packet is not for a socket that is used for
  swap-out. 
  To achieve this we have an ->emergency flag for skbs, and an
  SK_MEMALLOC flag for sockets.
  When memory is allocated for an skb, it is allocated with
  GFP_MEMALLOC (if we are currently swapping over the network at
  all).  If a subsequent test shows that the emergency pool was used,
  ->emergency is set.
  When the skb is finally attached to its destination socket, the
  SK_MEMALLOC flag on the socket is tested.  If the skb has
  ->emergency set, but the socket does not have SK_MEMALLOC set, then
  the skb is immediately freed and the packet is dropped.
  This ensures that reserve memory is never queued on a socket that is
  not used for swapout.
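
  Roughly (a sketch; the ->emergency field and SK_MEMALLOC follow this
  description, and sk_has_memalloc() is a hypothetical helper):

        if (unlikely(skb->emergency) && !sk_has_memalloc(sk)) {
                kfree_skb(skb);         /* reserve memory must not queue here */
                return;                 /* i.e. drop the packet */
        }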

  Similarly, if an skb is ever queued for delivery to user-space for
  example by netfilter, the ->emergency flag is tested and the skb is
  released if ->emergency is set.

  This ensures that memory from the emergency reserve can be used to
  allow swapout to proceed, but will not get caught up in any other
  network queue.


3/ pages_emergency

  The above would be sufficient if the total memory below the lowest
  memory watermark (i.e the size of the emergency reserve) were known
  to be enough to hold all transient allocations needed for writeout.
  I'm a little blurry on how big the current emergency pool is, but it
  isn't big and certainly hasn't been sized to allow network traffic
  to consume any.

  We could simply make the size of the reserve bigger. However in the
  common case that we are not swapping over the network, that would be
  a waste of memory.

  So a new "watermark" is defined: pages_emergency.  This is
  effectively added to the current low water marks, so that pages from
  this emergency pool can only be allocated if one of PF_MEMALLOC or
  GFP_MEMALLOC are set.
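
  Very roughly, inside the page allocator's watermark test (ignoring
  per-zone details; illustrative only):

        min = zone->pages_low + pages_emergency;
        if (alloc_flags & ALLOC_NO_WATERMARKS)
                min = 0;        /* PF_MEMALLOC / GFP_MEMALLOC may dip below */
        if (free_pages < min)
                return NULL;    /* everyone else fails (or reclaims) here */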

  pages_emergency can be changed dynamically based on need.  When
  swapout over the network is required, pages_emergency is increased
  to cover the maximum expected load.  When network swapout is
  disabled, pages_emergency is decreased.

  To determine how much to increase it by, we introduce reservation
  groups....

3a/ reservation groups

  The memory used transiently for swapout can be in a number of
  different places.  e.g. the network route cache, the network
  fragment cache, in transit between network card and socket, or (in
  the case of NFS) in sunrpc data structures awaiting a reply.
  We need to ensure each of these is limited in the amount of memory
  they use, and that the maximum is included in the reserve.

  The memory required by the network layer only needs to be reserved
  once, even if there are multiple swapout paths using the network
  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
  the same time would be unusual).

  So we create a tree of reservation groups.  The network might
  register a collection of reservations, but not mark them as being in
  use.  NFS and sunrpc might similarly register a collection of
  reservations, and attach it to the network reservations as it
  depends on them.
  When swapout over NFS is requested, the NFS/sunrpc reservations are
  activated which implicitly activates the network reservations.

  The total new reservation is added to pages_emergency.
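
  A stand-alone model of such a tree (names invented; activating
  dependencies such as "NFS implies network" is left out):

        struct mem_reserve_group {
                struct mem_reserve_group *parent;
                long pages;     /* this group's own reservation           */
                long total;     /* pages plus everything reserved below   */
        };

        /* activating a group propagates its reservation up to the root,
         * whose total is what gets added to pages_emergency */
        void reserve_group_activate(struct mem_reserve_group *g)
        {
                struct mem_reserve_group *p;

                for (p = g; p; p = p->parent)
                        p->total += g->pages;
        }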

  Provided each memory usage stays beneath the registered limit (at
  least when allocating memory from reserves), the system will never
  run out of emergency memory, and swapout will not deadlock.

  It is worth noting here that it is not critical that each usage
  stays beneath the limit 100% of the time.  Occasional excess is
  acceptable provided that the memory will be freed  again within a
  short amount of time that does *not* require waiting for any event
  that itself might require memory.
  This is because, at all stages of transmit and receive, it is
  acceptable to discard all transient memory associated with a
  particular writeout and try again later.  On transmit, the page can
  be re-queued for later transmission.  On receive, the packet can be
  dropped assuming that the peer will resend after a timeout.

  Thus allocations that are truly transient and will be freed without
  blocking do not strictly need to be reserved for.  Doing so might
  still be a good idea to ensure forward progress doesn't take too
  long. 

4/ lo-mem accounting

  Most places that might hold on to emergency memory (e.g. route
  cache, fragment cache etc) already place a limit on the amount of
  memory that they can use.  This limit can simply be reserved using
  the above mechanism and no more needs to be done.

  However some memory usage might not be accounted with sufficient
  firmness to allow an appropriate emergency reservation.  The
  in-flight skbs for incoming packets are (claimed to be) one such
  example.

  To support this, a low-overhead mechanism for accounting memory
  usage against the reserves is provided.  This mechanism uses the
  same data structure that is used to store the emergency memory
  reservations through the addition of a 'usage' field.

  When memory allocation for a particular purpose succeeds, the memory
  is checked to see if it is 'reserve' memory.  If it is, the size of
  the allocation is added to the 'usage'.  If this exceeds the
  reservation, the usage is reduced again and the memory that was
  allocated is freed.

  When memory that was allocated for that purpose is freed, the
  'usage' field is checked again.  If it is non-zero, then the size of
  the freed memory is subtracted from the usage, making sure the usage
  never becomes less than zero.

  This provides adequate accounting with minimal overheads when not in
  a low memory condition.  When a low memory condition is encountered
  it does add the cost of a spin lock necessary to serialise updates
  to 'usage'.
  


5/ swapfile/swap_out/swap_in

  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
  any network socket that it uses, and can know when to account
  reserve memory carefully, new address_space_operations are
  available.
  "swapfile" requests that an address space (i.e a file) be make ready
  for swapout.  swap_out and swap_in request the actual IO.  They
  together must ensure that each swap_out request can succeed without
  allocating more emergency memory than was reserved by swapfile.
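
  As a sketch, the new hooks might look like this (the signatures are
  guessed from the description above, not taken from the patches):

        struct address_space_operations {
                /* ... existing methods ... */
                int (*swapfile)(struct address_space *mapping, int enable);
                int (*swap_out)(struct file *file, struct page *page,
                                struct writeback_control *wbc);
                int (*swap_in)(struct file *file, struct page *page);
        };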


Thanks for reading this far.  I hope it made sense :-)

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-07  3:33             ` Neil Brown
@ 2008-03-07 11:17               ` Peter Zijlstra
  2008-03-07 11:55                 ` Peter Zijlstra
  2008-03-10  5:15                 ` Neil Brown
  0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-03-07 11:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg

Hi Neil,

I'm so glad you are working with me on this and writing this in human
English. It seems to be my eternal shortcoming to communicate my ideas
clearly :-/. Thanks for your effort!

On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> On Tuesday March 4, a.p.zijlstra@chello.nl wrote:
> > 
> > On Tue, 2008-03-04 at 10:41 +1100, Neil Brown wrote:
> > > 
> > > Those skbs we allocated - they are either sitting in the fragment
> > > cache, or have been attached to a SK_MEMALLOC socket, or have been
> > > freed - correct?  If so, then there is already a limit to how much
> > > memory they can consume.
> > 
> > Not really, there is no natural limit to the amount of packets that can
> > be in transit between RX and socket demux. So we need the extra (skb)
> > accounting to impose that.
> 
> Isn't there?  A brief look at the code suggests that (except for
> fragment handling) there is a fairly straight path from
> network-receive to socket demux.  No queues along the way.
> That suggests the number of in-transit skbs should be limited by the
> number of CPUs.  Did I miss something?  Or is the number of CPUs
> potentially too large to be a suitable limit (seems unlikely).

That would be so if the whole path from RX to socket demux had
hard-irqs disabled. However, I didn't see that. Moreover, I think the
whole purpose of the NetPoll interface is to allow some RX queueing to
cut down on softirq overhead.

> While looking at the code it also occurred to me that:
>   1/ tcpdump could be sent incoming packets.  Is there a limit
>      to the number of packets that can be in-kernel waiting for
>      tcpdump to collect them?  Should this limit be added to the base
>      reserve?

We could indeed do something like that; building a cache there that
recycles the oldest waiting skbs, allowing taps to continue working
under light pressure, seems like a reasonable thing.

Good suggestion for future work.

>   2/ If the host is routing network packets, then incoming packets
>      might go on an outbound queue.  Is this space limited?  and
>      included in the reserve?

Not sure, somewhere along the routing code I lost it again. Constructive
input from someone versed in that part of the kernel would be most
welcome.

> Not major points, but I thought I would mention them.
> 
> > > I also don't see the value of tracking pages to see if they are
> > > 'reserve' pages or not.  The decision to drop an skb that is not for
> > > an SK_MEMALLOC socket should be based on whether we are currently
> > > short on memory.  Not whether we were short on memory when the skb was
> > > allocated.
> > 
> > That comes from accounting, once you need to account data you need to
> > know when to start accounting, and keep state so that you can properly
> > un-account.
> > 
> 
> skbs are the main (only?) thing you do accounting on, so focusing on
> those:
> 
> Suppose that every time you allocate memory for an skb, you check
> if the allocation had to dip into emergency reserves, and account the
> memory if so - releasing the memory and dropping the packet if we are
> over the limit.
> And any time you free memory associated with an skb, you check if the
> accounts currently say '0', and if not subtract the size of the
> allocation from the accounts.
> 
> Then you have quite workable accounting that doesn't need to tag every
> piece of memory with its 'reserve' status, and only pays the
> accounting cost (presumably a spinlock) when running out of memory, or
> just recovering.

Quite so, that has been the intent.

> This more relaxed approach to accounting reserved vs non-reserved
> memory has a strong parallel in your slub code (which I now
> understand).  When sl[au]b first gets a ->reserve page, it sets the
> ->reserve flag on the memcache and leaves it set until it sometime
> later gets a non-"->reserve" page.  Any memory freed in the mean time
> (whether originally reserved or not) is treated as reserve memory in
> that it will only be returned for ALLOC_NO_WATERMARKS allocations.
> I think this is a good way of working with reserved memory.  It isn't
> precise, but it is low-cost and good enough to get you through the
> difficult patch.

Agreed.
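
Roughly, and purely as an illustrative sketch (the helper and field
names here are made up, this is not the actual sl[au]b patch code):

	static void cache_add_page(struct kmem_cache *s, struct page *page)
	{
		/* the cache mirrors the reserve status of its newest page */
		s->reserve = page_is_reserve(page);	/* e.g. tests PG_emergency */
	}

	static int cache_may_serve(struct kmem_cache *s, gfp_t gfp)
	{
		/* in reserve mode, cached objects go to emergency callers only */
		if (s->reserve && !(gfp & __GFP_MEMALLOC) &&
		    !(current->flags & PF_MEMALLOC))
			return 0;	/* fall back to alloc_page() and maybe block */
		return 1;
	}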

> Your netvm-skbuff-reserve.patch has some code to make sure that all
> the allocations in an skb have the same 'reserve' status.   I don't
> think that is needed and just makes the code messy - plus it requires
> the 'overcommit' flag to mem_reserve_kmalloc_charge which is a bit of
> a wart on the interface.

It is indeed.

> I would suggest getting rid of that.  Just flag the whole skb if any
> part gets a 'reserve' allocation, and use that flag to decide to drop
> packets arriving at non-SK_MEMALLOC sockets.

OK, I'll look into doing that. I must have been in pedantic mode when I
wrote that code.
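
As a minimal sketch of that simplification (the names below are
illustrative, not the actual patch interface): on the allocation side,
any allocation that dipped into the reserves just sets skb->emergency
on the whole skb, and then at socket demux:

	static int skb_emergency_ok(struct sk_buff *skb, struct sock *sk)
	{
		/* reserve-backed skbs may only be queued on SK_MEMALLOC
		 * sockets, i.e. sockets used for swap-out */
		return !skb->emergency || sk_is_memalloc(sk);
	}

	/* ... and in the demux path: */
	if (!skb_emergency_ok(skb, sk)) {
		kfree_skb(skb);		/* drop; the peer will retransmit */
		return NET_RX_DROP;
	}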

> So: I think I now really understand what your code is doing, so I will
> try to explain it in terms that even I understand... This text is
> explicitly available under GPLv2 in case you want it.

Great work! Thanks!

> It actually describes something a bit different to what your code
> currently does, but I think it is very close to the spirit.  Some
> differences follow from my observations above.  Others arise because the
> way that seemed to make sense while describing the problem and solution
> differed slightly from what I saw the code doing.  Obviously the code
> and the description should be aligned one way or another before being
> finalised.  
> The description is a bit long ... sorry about that.  But I wanted to
> make sure I included motivation and various assumptions.  Some of my
> understanding may well be wrong, but I present it here anyway.  It is
> easier for you to correct if it is clearly visible:-)
> 
> Problem:
>    When Linux needs to allocate memory it may find that there is
>    insufficient free memory so it needs to reclaim space that is in
>    use but not needed at the moment.  There are several options:
> 
>    1/ Shrink a kernel cache such as the inode or dentry cache.  This
>       is fairly easy but provides limited returns.
>    2/ Discard 'clean' pages from the page cache.  This is easy, and
>       works well as long as there are clean pages in the page cache.
>       Similarly clean 'anonymous' pages can be discarded - if there
>       are any.
>    3/ Write out some dirty page-cache pages so that they become clean.
>       The VM limits the number of dirty page-cache pages to e.g. 40%
>       of available memory so that (among other reasons) a "sync" will
>       not take excessively long.  So there should never be excessive
>       amounts of dirty pagecache.
>       Writing out dirty page-cache pages involves work by the
>       filesystem which may need to allocate memory itself.  To avoid
>       deadlock, filesystems use GFP_NOFS when allocating memory on the
>       write-out path.  When this is used, cleaning dirty page-cache
>       pages is not an option so if the filesystem finds that  memory
>       is tight, another option must be found.
>    4/ Write out dirty anonymous pages to the "Swap" partition/file.
>       This is the most interesting for a couple of reasons.
>       a/ Unlike dirty page-cache pages, there is no need to write anon
>          pages out unless we are actually short of memory.  Thus they
>          tend to be left to last.
>       b/ Anon pages tend to be updated randomly and unpredictably, and
>          flushing them out of memory can have a very significant
>          performance impact on the process using them.  This contrasts
>          with page-cache pages which are often written sequentially
>          and often treated as "write-once, read-many".
>       So anon pages tend to be left until last to be cleaned, and may
>       be the only cleanable pages while there are still some dirty
>       page-cache pages (which are waiting on a GFP_NOFS allocation).
> 
> [I don't find the above wholly satisfying.  There seems to be too much
>  hand-waving.  If someone can provide better text explaining why
>  swapout is a special case, that would be great.]

Anonymous pages are dirty by definition (except the zero page, but I
think we recently ditched it). So shrinking of the anonymous pool will
require swapping.

It is indeed the last refuge for those with GFP_NOFS. Along with the
strict limit on the amount of dirty file pages, it also ensures writing
those out will never deadlock the machine, as there are always clean file
pages and/or anonymous pages to launder.

Your observation about the difference in swap vs file disk access
patterns is the motivation (one of them, at least) for Rik van Riel's
split VM series; it is currently not of any consequence.

Swap is indeed special in that it requires 'atomic' writeout. This
requirement comes from the cyclic dependency you outlined, which we have
to break: we're out of memory, but need memory to write out pages to free
memory. So we must make it appear as if swap writes are indeed atomic.

> So we need to be able to write to the swap file/partition without
> needing to allocate any memory ... or only a small well controlled
> amount.
> 
> The VM reserves a small amount of memory that can only be allocated
> for use as part of the swap-out procedure.  It is only available to
> processes with the PF_MEMALLOC flag set, which is typically just the
> memory cleaner.
> 
> Traditionally swap-out is performed directly to block devices (swap
> files on block-device filesystems are supported by examining the
> mapping from file offset to device offset in advance, and then using
> the device offsets to write directly to the device).  Block device
> drivers are required to pre-allocate any memory that might be needed
> during write-out, and to block when the pre-allocated memory is
> exhausted and no other memory is available.  They can be sure not to
> block forever as the pre-allocated memory will be returned as soon as
> the data it is being used for has been written out.  The primary
> mechanism for pre-allocating memory is called "mempools".
> 
> This approach does not work for writing anonymous pages
> (i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
> 
> 
> The main reason that it does not work is that when data from an anon
> page is written to the network, we must wait for a reply to confirm
> the data is safe.  Receiving that reply will consume memory and,
> significantly, we need to allocate memory to an incoming packet before
> we can tell if it is the reply we are waiting for or not.
> 
> The secondary reason is that the network code is not written to use
> mempools and in most cases does not need to use them.  Changing all
> allocations in the networking layer to use mempools would be quite
> intrusive, and would waste memory, and probably cause a slow-down in
> the common case of not swapping over the network.
> 
> These problems are addressed by enhancing the system of memory
> reserves used by PF_MEMALLOC and requiring any in-kernel networking
> client that is used for swap-out to indicate which sockets are used
> for swapout so they can be handled specially in low memory situations.
> 
> There are several major parts to this enhancement:
> 
> 1/ PG_emergency, GFP_MEMALLOC
> 
>   To handle low memory conditions we need to know when those
>   conditions exist.  Having a global "low on memory" flag seems easy,
>   but its implementation is problematic.  Instead we make it possible
>   to tell if a recent memory allocation required use of the emergency
>   memory pool.
>   For pages returned by alloc_page, the new page flag PG_emergency
>   can be tested.  If this is set, then a low memory condition was
>   current when the page was allocated, so the memory should be used
>   carefully.
> 
>   For memory allocated using slab/slub: If a page that is added to a
>   kmem_cache is found to have PG_emergency set, then a  ->reserve
>   flag is set for the whole kmem_cache.  Further allocations will only
>   be returned from that page (or any other page in the cache) if they
>   are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
>   Non-emergency allocations will block in alloc_page until a
>   non-reserve page is available.  Once a non-reserve page has been
>   added to the cache, the ->reserve flag on the cache is removed.
>   When memory is returned by slab/slub, PG_emergency is set on the page
>   holding the memory to match the ->reserve flag on that cache.
> 
>   After memory has been returned by kmem_cache_alloc or kmalloc, the
>   page's PG_emergency flag can be tested.  If it is set, then the most
>   recent allocation from that cache required reserve memory, so this
>   allocation should be used with care.
> 
>   It is not safe to test the cache's ->reserve flag immediately after
>   an allocation as that flag is in per-cpu data, and the process could
>   have been rescheduled to a different cpu if preemption is enabled.
>   Thus the use of PG_emergency to carry this information.
> 
>   This allows us to
>    a/ request use of the emergency pool when allocating memory
>      (GFP_MEMALLOC), and 
>    b/ to find out if the emergency pool was used.

Right. I've had a long conversation on PG_emergency with Pekka. And I
think the conclusion was that PG_emergency will create more head-aches
than it solves. I probably have the conversation in my IRC logs and
could email it if you're interested (and Pekka doesn't object).

> 2/ SK_MEMALLOC, sk_buff->emergency.
> 
>   When memory from the reserve is used to store incoming network
>   packets, the memory must be freed (and the packet dropped) as soon
>   as we find out that the packet is not for a socket that is used for
>   swap-out. 
>   To achieve this we have an ->emergency flag for skbs, and an
>   SK_MEMALLOC flag for sockets.
>   When memory is allocated for an skb, it is allocated with
>   GFP_MEMALLOC (if we are currently swapping over the network at
>   all).  If a subsequent test shows that the emergency pool was used,
>   ->emergency is set.
>   When the skb is finally attached to its destination socket, the
>   SK_MEMALLOC flag on the socket is tested.  If the skb has
>   ->emergency set, but the socket does not have SK_MEMALLOC set, then
>   the skb is immediately freed and the packet is dropped.
>   This ensures that reserve memory is never queued on a socket that is
>   not used for swapout.
> 
>   Similarly, if an skb is ever queued for delivery to user-space, for
>   example by netfilter, the ->emergency flag is tested and the skb is
>   released if ->emergency is set.
> 
>   This ensures that memory from the emergency reserve can be used to
>   allow swapout to proceed, but will not get caught up in any other
>   network queue.
> 
> 
> 3/ pages_emergency
> 
>   The above would be sufficient if the total memory below the lowest
>   memory watermark (i.e the size of the emergency reserve) were known
>   to be enough to hold all transient allocations needed for writeout.
>   I'm a little blurry on how big the current emergency pool is, but it
>   isn't big and certainly hasn't been sized to allow network traffic
>   to consume any.
> 
>   We could simply make the size of the reserve bigger. However in the
>   common case that we are not swapping over the network, that would be
>   a waste of memory.
> 
>   So a new "watermark" is defined: pages_emergency.  This is
>   effectively added to the current low water marks, so that pages from
>   this emergency pool can only be allocated if one of PF_MEMALLOC or
>   GFP_MEMALLOC is set.
> 
>   pages_emergency can be changed dynamically based on need.  When
>   swapout over the network is required, pages_emergency is increased
>   to cover the maximum expected load.  When network swapout is
>   disabled, pages_emergency is decreased.
> 
>   To determine how much to increase it by, we introduce reservation
>   groups....
> 
> 3a/ reservation groups
> 
>   The memory used transiently for swapout can be in a number of
>   different places.  e.g. the network route cache, the network
>   fragment cache, in transit between network card and socket, or (in
>   the case of NFS) in sunrpc data structures awaiting a reply.
>   We need to ensure each of these is limited in the amount of memory
>   they use, and that the maximum is included in the reserve.
> 
>   The memory required by the network layer only needs to be reserved
>   once, even if there are multiple swapout paths using the network
>   (e.g. NFS and NBD and iSCSI, though using all three for swapout at
>   the same time would be unusual).
> 
>   So we create a tree of reservation groups.  The network might
>   register a collection of reservations, but not mark them as being in
>   use.  NFS and sunrpc might similarly register a collection of
>   reservations, and attach it to the network reservations as it
>   depends on them.
>   When swapout over NFS is requested, the NFS/sunrpc reservations are
>   activated which implicitly activates the network reservations.
> 
>   The total new reservation is added to pages_emergency.
> 
>   Provided each memory usage stays beneath the registered limit (at
>   least when allocating memory from reserves), the system will never
>   run out of emergency memory, and swapout will not deadlock.
> 
>   It is worth noting here that it is not critical that each usage
>   stays beneath the limit 100% of the time.  Occasional excess is
>   acceptable provided that the memory will be freed  again within a
>   short amount of time that does *not* require waiting for any event
>   that itself might require memory.
>   This is because, at all stages of transmit and receive, it is
>   acceptable to discard all transient memory associated with a
>   particular writeout and try again later.  On transmit, the page can
>   be re-queued for later transmission.  On receive, the packet can be
>   dropped assuming that the peer will resend after a timeout.
> 
>   Thus allocations that are truly transient and will be freed without
>   blocking do not strictly need to be reserved for.  Doing so might
>   still be a good idea to ensure forward progress doesn't take too
>   long. 
> 
> 4/ lo-mem accounting
> 
>   Most places that might hold on to emergency memory (e.g. route
>   cache, fragment cache etc) already place a limit on the amount of
>   memory that they can use.  This limit can simply be reserved using
>   the above mechanism and no more needs to be done.
> 
>   However some memory usage might not be accounted with sufficient
>   firmness to allow an appropriate emergency reservation.  The
>   in-flight skbs for incoming packets are (claimed to be) one such
>   example.

:-)

>   To support this, a low-overhead mechanism for accounting memory
>   usage against the reserves is provided.  This mechanism uses the
>   same data structure that is used to store the emergency memory
>   reservations through the addition of a 'usage' field.
> 
>   When memory allocation for a particular purpose succeeds, the memory
>   is checked to see if it is 'reserve' memory.  If it is, the size of
>   the allocation is added to the 'usage'.  If this exceeds the
>   reservation, the usage is reduced again and the memory that was
>   allocated is freed.
> 
>   When memory that was allocated for that purpose is freed, the
>   'usage' field is checked again.  If it is non-zero, then the size of
>   the freed memory is subtracted from the usage, making sure the usage
>   never becomes less than zero.
> 
>   This provides adequate accounting with minimal overheads when not in
>   a low memory condition.  When a low memory condition is encountered
>   it does add the cost of a spin lock necessary to serialise updates
>   to 'usage'.

Agreed, minimizing the overhead for the common !net_swap case has been
my goal.
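
As a minimal sketch of that charge/uncharge scheme - assuming a reserve
structure with 'limit' and 'usage' fields guarded by a spinlock; the
names are illustrative, not the actual patch interface:

	struct mem_reserve_sketch {
		long		limit;	/* registered reservation */
		long		usage;	/* current reserve usage */
		spinlock_t	lock;	/* serialises charge/uncharge */
	};

	/*
	 * Charge an allocation that was found to come from reserve memory.
	 * Returns 0 and undoes the charge if it would exceed the limit, in
	 * which case the caller frees the memory and drops the packet.
	 */
	static int reserve_charge(struct mem_reserve_sketch *r, long bytes)
	{
		int ok = 1;

		spin_lock(&r->lock);
		r->usage += bytes;
		if (r->usage > r->limit) {
			r->usage -= bytes;
			ok = 0;
		}
		spin_unlock(&r->lock);
		return ok;
	}

	/* Uncharge on free, but only while something is accounted at all. */
	static void reserve_uncharge(struct mem_reserve_sketch *r, long bytes)
	{
		spin_lock(&r->lock);
		if (r->usage) {
			r->usage -= bytes;
			if (r->usage < 0)
				r->usage = 0;
		}
		spin_unlock(&r->lock);
	}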

> 5/ swapfile/swap_out/swap_in
> 
>   So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
>   any network socket that it uses, and can know when to account
>   reserve memory carefully, new address_space_operations are
>   available.
>   "swapfile" requests that an address space (i.e a file) be make ready
>   for swapout.  swap_out and swap_in request the actual IO.  They
>   together must ensure that each swap_out request can succeed without
>   allocating more emergency memory that was reserved by swapfile.

Miklos kindly provided code that slightly outdates this piece, though
not in a conceptual way.

I've already heard interest from other people to use these hooks to
provide swap on other non-block filesystems such as jffs2, logfs and the
like.
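
For reference, the hooks as described in 5/ above would look roughly
like the following; this is reconstructed from the description, and the
exact names and signatures in the patches may differ:

	/* additions to struct address_space_operations */

	/* make an address space (file) ready for swap-out, or tear it down */
	int (*swapfile)(struct address_space *mapping, int enable);

	/* write a single swapcache page out to the file */
	int (*swap_out)(struct file *file, struct page *page,
			struct writeback_control *wbc);

	/* read a single swapcache page back in */
	int (*swap_in)(struct file *file, struct page *page);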

> Thanks for reading this far.  I hope it made sense :-)

It does, and I hope it does so for more people.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-07 11:17               ` Peter Zijlstra
@ 2008-03-07 11:55                 ` Peter Zijlstra
  2008-03-10  5:15                 ` Neil Brown
  1 sibling, 0 replies; 73+ messages in thread
From: Peter Zijlstra @ 2008-03-07 11:55 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg

On Fri, 2008-03-07 at 12:17 +0100, Peter Zijlstra wrote:

> That would be so if the whole path from RX to socket demux ran with
> hard-irqs disabled. However, I didn't see that. Moreover, I think the
> whole purpose of the NetPoll interface is to allow some RX queueing to
> cut down on softirq overhead.

s/NetPoll/NAPI/

More specifically, look at net/core/dev.c:netif_rx().
It has an input queue per device.

> >   2/ If the host is routing network packets, then incoming packets
> >      might go on an outbound queue.  Is this space limited?  and
> >      included in the reserve?
> 
> Not sure, somewhere along the routing code I lost it again. Constructive
> input from someone versed in that part of the kernel would be most
> welcome.

To clarify, I think we just send it on, as I saw no reason why that
could fail. However, the fancier stuff like egress or QoS might spoil
the party; that is where I lost track.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-07 11:17               ` Peter Zijlstra
  2008-03-07 11:55                 ` Peter Zijlstra
@ 2008-03-10  5:15                 ` Neil Brown
  2008-03-10  9:17                   ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Neil Brown @ 2008-03-10  5:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg

On Friday March 7, a.p.zijlstra@chello.nl wrote:
> Hi Neil,
> 
> I'm so glad you are working with me on this and writing this up in human
> English. It seems to be my eternal shortcoming that I fail to communicate
> my ideas clearly :-/. Thanks for your effort!

:-)
It always helps to have a second brain with a different perspective.


> 
> On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > 
> > [I don't find the above wholly satisfying.  There seems to be too much
> >  hand-waving.  If someone can provide better text explaining why
> >  swapout is a special case, that would be great.]
> 
> Anonymous pages are dirty by definition (except the zero page, but I
> think we recently ditched it). So shrinking of the anonymous pool will
> require swapping.

Well, there is the swap cache.  That's probably what I was thinking of
when I said "clean anonymous pages".  I suspect they are the first to
go!

> 
> It is indeed the last refuge for those with GFP_NOFS. Along with the
> strict limit on the amount of dirty file pages, it also ensures writing
> those out will never deadlock the machine, as there are always clean file
> pages and/or anonymous pages to launder.

The difficulty I have is justifying exactly why page-cache writeout
will not deadlock.  What if all the memory that is not dirty-pagecache
is anonymous, and if swap isn't enabled?
Maybe the number returned by "determine_dirtyable_memory" in
page-writeback.c excludes anonymous pages?  I wonder if the meaning of
NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

...
> 
> Right. I've had a long conversation on PG_emergency with Pekka. And I
> think the conclusion was that PG_emergency will create more head-aches
> than it solves. I probably have the conversation in my IRC logs and
> could email it if you're interested (and Pekka doesn't object).

Maybe that depends on the exact semantic of PG_emergency ??
I remember you being concerned that PG_emergency never changes between
allocation and freeing, and that wouldn't work well with slub.
My envisioned semantic has it possibly changing quite often.
What it means is:
   The last allocation done from this page was in a low-memory
   condition.

You really need some way to tell if the result of kmalloc/kmem_cache_alloc
should be treated as reserved.
I think you had code which first tried the allocation without
GFP_MEMALLOC and then if that failed, tried again *with*
GFP_MEMALLOC.  If that then succeeded, it is assumed to be an
allocation from reserves.  That seemed rather ugly, though I guess you
could wrap it in a function to hide the ugliness:

void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
{
	void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
	if (result) {
		*reserve = 0;
		return result;
	}
	result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
	if (result) {
		*reserve = 1;
		return result;
	}
	return NULL;
}
???

> 
> I've already heard interest from other people to use these hooks to
> provide swap on other non-block filesystems such as jffs2, logfs and the
> like.

I'm interested in the swap_in/swap_out interface for external
write-intent bitmaps for md/raid arrays.
You can have a write-intent bitmap which records which blocks might be
dirty if the host crashes, so that resync is much faster.
It can be stored in a file in a separate filesystem, but that is
currently implemented by using bmap to enumerate the blocks and then
reading/writing directly to the device (like swap).  Your interface
would be much nicer for that (not that I think having a
write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

I'll look forward to your next patch set....

One thing I had thought odd while reading the patches, but haven't
found an opportunity to mention before, is the "IS_SWAPFILE" test in
nfs-swapper.patch.
This seems like a layering violation.  It would be better if the test
was based on whether  ->swapfile had been called on the file.  That way
my write-intent-bitmaps would get the same benefit.

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-10  5:15                 ` Neil Brown
@ 2008-03-10  9:17                   ` Peter Zijlstra
  2008-03-14  5:22                     ` Neil Brown
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2008-03-10  9:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg

On Mon, 2008-03-10 at 16:15 +1100, Neil Brown wrote:

> > On Fri, 2008-03-07 at 14:33 +1100, Neil Brown wrote:
> > > 
> > > [I don't find the above wholly satisfying.  There seems to be too much
> > >  hand-waving.  If someone can provide better text explaining why
> > >  swapout is a special case, that would be great.]
> > 
> > Anonymous pages are dirty by definition (except the zero page, but I
> > think we recently ditched it). So shrinking of the anonymous pool will
> > require swapping.
> 
> Well, there is the swap cache.  That's probably what I was thinking of
> when I said "clean anonymous pages".  I suspect they are the first to
> go!

Ah, right, we could consider those clean anonymous. Alas, they are just
part of the aging lists and do not get special priority.

> > It is indeed the last refuge for those with GFP_NOFS. Along with the
> > strict limit on the amount of dirty file pages, it also ensures writing
> > those out will never deadlock the machine, as there are always clean file
> > pages and/or anonymous pages to launder.
> 
> The difficulty I have is justifying exactly why page-cache writeout
> will not deadlock.  What if all the memory that is not dirty-pagecache
> is anonymous, and if swap isn't enabled?

Ah, I never considered the !SWAP case.

> Maybe the number returned by "determine_dirtyable_memory" in
> page-writeback.c excludes anonymous pages?  I wonder if the meaning of
> NR_FREE_PAGES, NR_INACTIVE, etc is documented anywhere....

I don't think they are, but it should be obvious once you know the VM,
har har har :-)

NR_FREE_PAGES are the pages in the page allocator's free lists.
NR_INACTIVE are the pages on the inactive list.
NR_ACTIVE are the pages on the active list.

NR_INACTIVE+NR_ACTIVE are the number of pages on the page reclaim lists.
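
For example, they can be read with global_page_state():

	unsigned long free    = global_page_state(NR_FREE_PAGES);
	unsigned long reclaim = global_page_state(NR_ACTIVE) +
				global_page_state(NR_INACTIVE);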

So, if you consider !SWAP, we could get in a deadlock when all of memory
is anonymous except for a few (<=dirty limit) dirty file pages.

But I guess the !SWAP people know what they're doing; large anon usage
without swap is asking for trouble.
 
> > Right. I've had a long conversation on PG_emergency with Pekka. And I
> > think the conclusion was that PG_emergency will create more head-aches
> > than it solves. I probably have the conversation in my IRC logs and
> > could email it if you're interested (and Pekka doesn't object).
> 
> Maybe that depends on the exact semantic of PG_emergency ??
> I remember you being concerned that PG_emergency never changes between
> allocation and freeing, and that wouldn't work well with slub.
> My envisioned semantic has it possibly changing quite often.
> What it means is:
>    The last allocation done from this page was in a low-memory
>    condition.

Yes, that works, except that we'd need to iterate all pages and clear
PG_emergency - which would imply tracking all these pages etc..

Hence it would be better not to keep persistent state and do as we do
now; use some non-persistent state on allocation.

> You really need some way to tell if the result of kmalloc/kmem_cache_alloc
> should be treated as reserved.
> I think you had code which first tried the allocation without
> GFP_MEMALLOC and then if that failed, tried again *with*
> GFP_MEMALLOC.  If that then succeeded, it is assumed to be an
> allocation from reserves.  That seemed rather ugly, though I guess you
> could wrap it in a function to hide the ugliness:
> 
> void *kmalloc_reserve(size_t size, int *reserve, gfp_t gfp_flags)
> {
> 	void *result = kmalloc(size, gfp_flags & ~GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 0;
> 		return result;
> 	}
> 	result = kmalloc(size, gfp_flags | GFP_MEMALLOC);
> 	if (result) {
> 		*reserve = 1;
> 		return result;
> 	}
> 	return NULL;
> }
> ???

Yeah, I think this is the best we can do: just split this part out into
helper functions. I've been thinking of doing this - just haven't gotten
around to implementing it. I hope to do so this week and send out a new
series.
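
The free side would then get a matching helper along these lines (only
a sketch; the uncharge hook here is a stand-in name, not an interface
from the current patches):

	static void kfree_reserve(void *obj, size_t size, int reserve)
	{
		/* only allocations flagged by kmalloc_reserve() were charged */
		if (reserve)
			mem_reserve_uncharge(size);	/* stand-in name */
		kfree(obj);
	}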

> > I've already heard interest from other people to use these hooks to
> > provide swap on other non-block filesystems such as jffs2, logfs and the
> > like.
> 
> I'm interested in the swap_in/swap_out interface for external
> write-intent bitmaps for md/raid arrays.
> You can have a write-intent bitmap which records which blocks might be
> dirty if the host crashes, so that resync is much faster.
> It can be stored in a file in a separate filesystem, but that is
> currently implemented by using bmap to enumerate the blocks and then
> reading/writing directly to the device (like swap).  Your interface
> would be much nicer for that (not that I think having a
> write-intent-bitmap on an NFS filesystem would be a clever idea ;-)

Hmm, right. But for that purpose the names swap_* are a tad misleading.
I remember hch mentioning this at some point. What would be a more
suitable naming scheme so we can both use it?

> I'll look forward to your next patch set....
> 
> One thing I had thought odd while reading the patches, but haven't
> found an opportunity to mention before, is the "IS_SWAPFILE" test in
> nfs-swapper.patch.
> This seems like a layering violation.  It would be better if the test
> was based on whether  ->swapfile had been called on the file.  That way
> my write-intent-bitmaps would get the same benefit.

I'll look into this; I didn't think using an inode test inside a
filesystem implementation was too weird.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/28] Swap over NFS -v16
  2008-03-10  9:17                   ` Peter Zijlstra
@ 2008-03-14  5:22                     ` Neil Brown
  0 siblings, 0 replies; 73+ messages in thread
From: Neil Brown @ 2008-03-14  5:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg

On Monday March 10, a.p.zijlstra@chello.nl wrote:
> > 
> > Maybe that depends on the exact semantic of PG_emergency ??
> > I remember you being concerned that PG_emergency never changes between
> > allocation and freeing, and that wouldn't work well with slub.
> > My envisioned semantic has it possibly changing quite often.
> > What it means is:
> >    The last allocation done from this page was in a low-memory
> >    condition.
> 
> Yes, that works, except that we'd need to iterate all pages and clear
> PG_emergency - which would imply tracking all these pages etc..
> 

I don't see why you need to clear PG_emergency at all.
If the semantic is:

> >    The last allocation done from this page was in a low-memory
> >    condition.

Then you only need to (potentially) modify its value when you
allocate it, or an element within it.

But if it doesn't fit well in the overall picture, then by all means
get rid of it.

> 
> Hmm, right. But for that purpose the names swap_* are a tad misleading.
> I remember hch mentioning this at some point. What would be a more
> suitable naming scheme so we can both use it?

One could argue that "swap" is already a misleading term.
Seventh Edition (V7) Unix used to do swapping.  It would write one
process image out to swap space, and read a different one in.  Moving
whole processes at a time was called swapping.
When this clever idea of only moving pages at a time was introduced (I
think in 4BSD, but possibly in 2BSD and elsewhere) it was called
"demand paging" or just "paging".

So we don't have a swap partition any more.  We have a paging
partition.

But everyone calls it 'swap' and we know what it means.  I don't think
there would be a big cost in keeping the swap_ names but allowing them
to be used for occasional things other than swap.
And I suspect you would lose a lot if you tried to use a different
name that people didn't immediately identify with...

NeilBrown

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2008-03-14  5:23 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-20 14:46 [PATCH 00/28] Swap over NFS -v16 Peter Zijlstra
2008-02-20 14:46 ` [PATCH 01/28] mm: gfp_to_alloc_flags() Peter Zijlstra
2008-02-20 14:46 ` [PATCH 02/28] mm: tag reseve pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 03/28] mm: slb: add knowledge of reserve pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 04/28] mm: kmem_estimate_pages() Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 06/28] mm: serialize access to min_free_kbytes Peter Zijlstra
2008-02-20 14:46 ` [PATCH 07/28] mm: emergency pool Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 09/28] mm: __GFP_MEMALLOC Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 10/28] mm: memory reserve management Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 11/28] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2008-02-20 14:46 ` [PATCH 12/28] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2008-02-20 14:46 ` [PATCH 13/28] net: packet split receive api Peter Zijlstra
2008-02-20 14:46 ` [PATCH 14/28] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-24  6:52   ` Mike Snitzer
2008-02-20 14:46 ` [PATCH 16/28] netvm: INET reserves Peter Zijlstra
2008-02-20 14:46 ` [PATCH 17/28] netvm: hook skb allocation to reserves Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 18/28] netvm: filter emergency skbs Peter Zijlstra
2008-02-20 14:46 ` [PATCH 19/28] netvm: prevent a stream specific deadlock Peter Zijlstra
2008-02-20 14:46 ` [PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
2008-02-20 14:46 ` [PATCH 21/28] netvm: skb processing Peter Zijlstra
2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
2008-02-20 16:30   ` Randy Dunlap
2008-02-20 16:46     ` Peter Zijlstra
2008-02-26 12:45   ` Miklos Szeredi
2008-02-26 12:58     ` Peter Zijlstra
2008-02-20 14:46 ` [PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 24/28] nfs: remove mempools Peter Zijlstra
2008-02-20 14:46 ` [PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 26/28] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2008-02-20 14:46 ` [PATCH 27/28] nfs: enable swap on NFS Peter Zijlstra
2008-02-20 14:46 ` [PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2008-02-23  8:06 ` [PATCH 00/28] Swap over NFS -v16 Andrew Morton
2008-02-26  6:03   ` Neil Brown
2008-02-26 10:50     ` Peter Zijlstra
2008-02-26 12:00       ` Peter Zijlstra
2008-02-26 15:29       ` Miklos Szeredi
2008-02-26 15:41         ` Peter Zijlstra
2008-02-26 15:43         ` Peter Zijlstra
2008-02-26 15:47           ` Miklos Szeredi
2008-02-26 17:56       ` Andrew Morton
2008-02-27  5:51       ` Neil Brown
2008-02-27  7:58         ` Peter Zijlstra
2008-02-27  8:05           ` Pekka Enberg
2008-02-27  8:14             ` Peter Zijlstra
2008-02-27  8:33               ` Peter Zijlstra
2008-02-27  8:43                 ` Pekka J Enberg
2008-02-29 11:51             ` Peter Zijlstra
2008-02-29 11:58               ` Pekka Enberg
2008-02-29 12:18                 ` Peter Zijlstra
2008-02-29 12:29                   ` Pekka Enberg
2008-02-29  1:29           ` Neil Brown
2008-02-29 10:21             ` Peter Zijlstra
2008-03-02 22:18               ` Neil Brown
2008-03-02 23:33                 ` Peter Zijlstra
2008-03-03 23:41                   ` Neil Brown
2008-03-04 10:28                     ` Peter Zijlstra
     [not found]           ` <1204626509.6241.39.camel@lappy>
2008-03-07  3:33             ` Neil Brown
2008-03-07 11:17               ` Peter Zijlstra
2008-03-07 11:55                 ` Peter Zijlstra
2008-03-10  5:15                 ` Neil Brown
2008-03-10  9:17                   ` Peter Zijlstra
2008-03-14  5:22                     ` Neil Brown
