linux-mm.kvack.org archive mirror
* [PATCH 00/40] Swap over Networked storage -v12
@ 2007-05-04 10:26 Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 01/40] mm: page allocation rank Peter Zijlstra
                   ` (41 more replies)
  0 siblings, 42 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

There is a fundamental deadlock associated with paging: writing out a page to
free memory can itself require free memory to complete. The usual solution is
to keep a small amount of memory available at all times so we can overcome this
problem. This however assumes the amount of memory needed for writeout is
(constant and) smaller than the provided reserve.

It is this latter assumption that breaks when doing writeout over the network.
The network can take up an unspecified amount of memory while waiting for a
reply to our write request. This re-introduces the deadlock; we might never
complete the writeout, for we might not have enough memory to receive the
completion message.

The proposed solution is simple: only allow traffic servicing the VM to make
use of the reserves.

This however implies you know what packets are for whom, which generally
speaking you don't. Hence we need to receive all packets, but discard a packet
as soon as we find that it was allocated from the reserves and is not bound
for the VM.

Knowing that a packet is headed towards the VM also needs a little help, hence
we introduce the socket flag SOCK_VMIO to mark such sockets with.
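
As a rough illustration of the rule above (not code from this series), the
delivery decision could be sketched as follows. The types struct toy_pkt and
struct toy_sock and the function toy_may_deliver() are hypothetical stand-ins;
they are not the kernel's struct sk_buff or struct sock, and only the
SOCK_VMIO idea itself comes from the text.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a received packet buffer. */
struct toy_pkt {
	bool from_reserve;	/* buffer was allocated from the emergency reserve */
};

/* Hypothetical stand-in for the destination socket. */
struct toy_sock {
	bool vmio;		/* socket was marked SOCK_VMIO */
};

/*
 * Decide whether a packet may be delivered once its destination socket is
 * known. Packets backed by ordinary memory are always delivered; packets
 * backed by reserve memory are delivered only to sockets serving the VM,
 * and are dropped otherwise so the reserve is given back quickly.
 */
static bool toy_may_deliver(const struct toy_pkt *pkt,
			    const struct toy_sock *sk)
{
	if (!pkt->from_reserve)
		return true;
	return sk->vmio;
}
```

The point of the sketch is only that the check happens after demultiplexing:
the packet has to be received first, because the destination is not known
earlier.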

Of course, since we are paging, all of this has to happen in kernel-space;
user-space might just not be there.

Since packet processing might also require memory, this all also implies that
those auxiliary allocations may use the reserves when an emergency packet is
processed. This is accomplished by using PF_MEMALLOC.

How much memory to reserve is also an issue: the reserve must be large enough
to saturate both the route cache and IP fragment reassembly, plus various
constants.

This patch-set comes in 6 parts:

1) introduce the memory reserve and make the SLAB allocator play nice with it.
   patches 01-10

2) add some needed infrastructure to the network code
   patches 11-13

3) implement the idea outlined above
   patches 14-20

4) teach the swap machinery to use generic address_spaces
   patches 21-24

5) implement swap over NFS using all the new stuff
   patches 25-31

6) implement swap over iSCSI
   patches 32-40

Patches can also be found here:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12/

If I receive no feedback, I will assume the various maintainers do not object
and I will respin the series against -mm and submit for inclusion.

There is interest in this feature from the stateless linux world; that is both
the virtualization world, and the cluster world.

I have been contacted by various groups; some have just expressed their
interest, others have been testing this work in their environments.

Various hardware vendors have also expressed interest, and, of course, my
employer finds it important enough to have me work on it.

Also, while it doesn't present a full-fledged reserve-based allocator API yet,
it does lay most of the groundwork for it. There is a GFP_NOFAIL elimination
project wanting to use this as a foundation. Elimination of GFP_NOFAIL will
greatly improve the basic soundness and stability of the code that currently
uses that construct - most disk based filesystems.

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 01/40] mm: page allocation rank
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 02/40] mm: slab allocation fairness Peter Zijlstra
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-page_alloc-rank.patch --]
[-- Type: text/plain, Size: 7975 bytes --]

Introduce page allocation rank.

This allocation rank is a measure of the 'hardness' of the page allocation,
where hardness refers to how deep we have to reach (and thereby whether
reclaim was activated) to obtain the page.

It basically is a mapping from the ALLOC_/gfp flags into a scalar quantity,
which allows for comparisons of the kind: 
  'would this allocation have succeeded using these gfp flags'.

For the gfp -> alloc_flags mapping we use the 'hardest' possible, those
used by __alloc_pages() right before going into direct reclaim.

The alloc_flags -> rank mapping is given by: 2*2^wmark - harder - 2*high
where wmark = { min = 1, low, high } and harder, high are booleans.
This gives:
  0 is the hardest possible allocation - ALLOC_NO_WATERMARKS,
  1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
  ...
  15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
  16 is the softest allocation - ALLOC_WMARK_HIGH.
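
For illustration, the mapping can be checked in isolation with a user-space
copy of the flag values and the rank computation from the mm/internal.h hunk
of this patch. The name rank_of() is made up for this sketch; the flag values
are the ones this patch defines.

```c
/* Flag values as defined in the mm/internal.h hunk of this patch. */
#define ALLOC_HARDER		0x01	/* try to alloc harder */
#define ALLOC_HIGH		0x02	/* __GFP_HIGH set */
#define ALLOC_WMARK_MIN		0x04	/* use pages_min watermark */
#define ALLOC_WMARK_LOW		0x08	/* use pages_low watermark */
#define ALLOC_WMARK_HIGH	0x10	/* use pages_high watermark */
#define ALLOC_NO_WATERMARKS	0x20	/* don't check watermarks at all */

/*
 * User-space copy of alloc_flags_to_rank(): the watermark bit contributes
 * 4, 8 or 16 (that is 2*2^wmark), ALLOC_HARDER subtracts 1 and ALLOC_HIGH
 * subtracts 2. ALLOC_NO_WATERMARKS short-circuits to rank 0.
 */
static int rank_of(int alloc_flags)
{
	int rank;

	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return 0;

	rank = alloc_flags & (ALLOC_WMARK_MIN | ALLOC_WMARK_LOW | ALLOC_WMARK_HIGH);
	rank -= alloc_flags & (ALLOC_HARDER | ALLOC_HIGH);

	return rank;
}
```

The flag values were renumbered in this patch precisely so that the rank falls
out of two mask operations and a subtraction.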

Rank <= 4 will have woken up kswapd, and when also > 0 might have run into
direct reclaim.

Rank > 8 rarely happens and means lots of memory is free (due to a parallel
oom kill).

The allocation rank is stored in page->index for successful allocations.

'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.

The purpose of this measure is to introduce some fairness into the slab
allocator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   58 +++++++++++++---------------------------------
 2 files changed, 87 insertions(+), 41 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2007-02-22 14:08:41.000000000 +0100
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/hardirq.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,73 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK	16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+	int rank;
+
+	if (alloc_flags & ALLOC_NO_WATERMARKS)
+		return 0;
+
+	rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+	rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+	return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+	return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+}
+
 #endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-02-22 14:08:41.000000000 +0100
@@ -892,14 +892,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1190,6 +1182,7 @@ zonelist_scan:
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page)
+			page->index = alloc_flags_to_rank(alloc_flags);
 			break;
 this_zone_full:
 		if (NUMA_BUILD)
@@ -1263,48 +1256,27 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		if (page)
+			goto got_pg;
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1313,6 +1285,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */


* [PATCH 02/40] mm: slab allocation fairness
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 01/40] mm: page allocation rank Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-16 20:41   ` Christoph Lameter
  2007-05-04 10:26 ` [PATCH 03/40] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (39 subsequent siblings)
  41 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-slab-ranking.patch --]
[-- Type: text/plain, Size: 13313 bytes --]

The slab allocator has some unfairness wrt gfp flags; when the slab cache is
grown the gfp flags are used to allocate more memory, however when there is 
slab cache available (in partial or free slabs, per cpu caches or otherwise)
gfp flags are ignored.

Thus it is possible for less critical slab allocations to succeed and gobble
up precious memory when under memory pressure.

This patch solves that by using the newly introduced page allocation rank.

Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARKS) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).

When the slab space is grown the rank of the page allocation is stored. For
each slab allocation we test the given gfp flags against this rank, thereby
asking the question: would these flags have allowed the slab to grow?

If not, we need to test the current situation. This is done by forcing the
growth of the slab space. (Just testing the free page limits will not work due
to direct reclaim.) Failing this, we need to fail the slab allocation.

Thus if we grew the slab under great duress while PF_MEMALLOC was set and we
really did access the memalloc reserve, the rank would be set to 0. If the next
allocation to that slab were GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and is always > 0), we'd want to make sure that memory pressure
has decreased enough to allow an allocation with the given gfp flags.

So in this case we try to force grow the slab cache and on failure we fail the
slab allocation, thus preserving the available slab cache for more pressing
allocations.
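
A much-simplified user-space model of this decision, for illustration only:
struct toy_cache, toy_alloc() and TOY_OBJS_PER_SLAB are hypothetical names,
and the grow_result parameter merely stands in for the outcome of cache_grow()
(the rank of the page obtained, or -1 on failure). None of this is the slab
code itself.

```c
#include <stdbool.h>

#define TOY_OBJS_PER_SLAB	8	/* arbitrary number for the model */

/* Hypothetical, much-simplified slab cache. */
struct toy_cache {
	int rank;	/* rank of the page allocation that last grew the cache */
	int avail;	/* objects currently available */
};

/*
 * Try to allocate one object for a request of the given rank.
 *
 * If the request is softer (numerically larger rank) than the allocation
 * that last grew the cache, or the cache is empty, growth is forced. When
 * that growth fails the request is refused, even if objects are still
 * available, so they stay reserved for harder allocations.
 */
static bool toy_alloc(struct toy_cache *c, int rank, int grow_result)
{
	if (rank > c->rank || c->avail == 0) {
		if (grow_result < 0)
			return false;		/* could not grow: fail */
		c->rank = grow_result;		/* remember how hard it was */
		c->avail += TOY_OBJS_PER_SLAB;
	}

	c->avail--;
	return true;
}
```

In this model a cache grown at rank 0 refuses a rank 4 request while growth
keeps failing, yet still serves rank 0 requests from the objects it holds.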

If this newly allocated slab will be trimmed on the next kmem_cache_free
(not unlikely) this is no problem, since 1) it will free memory and 2) the
sole purpose of the allocation was to probe the allocation rank; we didn't
need the space itself.

[AIM9 results go here]

 AIM9 test          2.6.21-rc5            2.6.21-rc5-slab1    |diff|
                                         CONFIG_SLAB_FAIR=y

54 tcp_test      2124.48 +/-  10.85    2137.43 +/-   9.22     12.95
55 udp_test      5204.43 +/-  45.13    5231.59 +/-  56.66     27.16
56 fifo_test    20991.42 +/-  46.71   19675.97 +/-  56.35   1315.44
57 stream_pipe  10024.16 +/- 119.88    9912.53 +/-  75.52    111.63
58 dgram_pipe    9460.18 +/- 119.50    9502.75 +/-  89.06     42.57
59 pipe_cpy     30719.81 +/- 117.01   27885.52 +/-  46.81   2834.28

 AIM9 test          2.6.21-rc5            2.6.21-rc5-slab2    |diff|
                                         CONFIG_SLAB_FAIR=n

54 tcp_test      2124.48 +/-  10.85    2137.97 +/-  12.85     13.50
55 udp_test      5204.43 +/-  45.13    5268.21 +/-  83.38     63.78
56 fifo_test    20991.42 +/-  46.71   19394.42 +/-  65.15   1596.99
57 stream_pipe  10024.16 +/- 119.88   10042.49 +/- 132.13     18.33
58 dgram_pipe    9460.18 +/- 119.50    9575.97 +/- 111.86    115.80
59 pipe_cpy     30719.81 +/- 117.01   27226.52 +/- 120.15   3493.28

Given that the CONFIG_SLAB_FAIR=n numbers are worse than =y, I'm not sure
how to interpret these numbers.

Will work on getting =n equal. Also, will work on a SLUB version of
these patches.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/Kconfig |    3 ++
 mm/slab.c  |   81 ++++++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 57 insertions(+), 27 deletions(-)

Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c	2007-03-26 13:34:55.000000000 +0200
+++ linux-2.6-git/mm/slab.c	2007-03-26 14:18:59.000000000 +0200
@@ -114,6 +114,7 @@
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
+#include	"internal.h"
 
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_DEBUG_INITIAL,
@@ -380,6 +381,7 @@ static void kmem_list3_init(struct kmem_
 
 struct kmem_cache {
 /* 1) per-cpu data, touched during every alloc/free */
+	int rank;
 	struct array_cache *array[NR_CPUS];
 /* 2) Cache tunables. Protected by cache_chain_mutex */
 	unsigned int batchcount;
@@ -1023,21 +1025,21 @@ static inline int cache_free_alien(struc
 }
 
 static inline void *alternate_node_alloc(struct kmem_cache *cachep,
-		gfp_t flags)
+		gfp_t flags, int rank)
 {
 	return NULL;
 }
 
 static inline void *____cache_alloc_node(struct kmem_cache *cachep,
-		 gfp_t flags, int nodeid)
+		 gfp_t flags, int nodeid, int rank)
 {
 	return NULL;
 }
 
 #else	/* CONFIG_NUMA */
 
-static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int);
 
 static struct array_cache **alloc_alien_cache(int node, int limit)
 {
@@ -1628,6 +1630,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	cachep->rank = page->index;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2276,6 +2279,7 @@ kmem_cache_create (const char *name, siz
 	}
 #endif
 #endif
+	cachep->rank = MAX_ALLOC_RANK;
 
 	/*
 	 * Determine if the slab management is 'on' or 'off' slab.
@@ -2942,7 +2946,7 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int rank)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2954,6 +2958,8 @@ static void *cache_alloc_refill(struct k
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
 retry:
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
 		/*
@@ -3009,14 +3015,16 @@ must_grow:
 	l3->free_objects -= ac->avail;
 alloc_done:
 	spin_unlock(&l3->list_lock);
-
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || rank > cachep->rank))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3173,7 +3181,8 @@ static inline int should_failslab(struct
 
 #endif /* CONFIG_FAILSLAB */
 
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *____cache_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	void *objp;
 	struct array_cache *ac;
@@ -3184,17 +3193,29 @@ static inline void *____cache_alloc(stru
 		return NULL;
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && rank <= cachep->rank)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, rank);
 	}
 	return objp;
 }
 
+#ifdef CONFIG_SLAB_FAIR
+static inline int slab_alloc_rank(gfp_t flags)
+{
+	return gfp_to_rank(flags);
+}
+#else
+static inline int slab_alloc_rank(gfp_t flags)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Try allocating on another node if PF_SPREAD_SLAB|PF_MEMPOLICY.
@@ -3202,7 +3223,8 @@ static inline void *____cache_alloc(stru
  * If we are in_interrupt, then process context, including cpusets and
  * mempolicy, may not apply and should not be used for allocation policy.
  */
-static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+static void *alternate_node_alloc(struct kmem_cache *cachep,
+		gfp_t flags, int rank)
 {
 	int nid_alloc, nid_here;
 
@@ -3214,7 +3236,7 @@ static void *alternate_node_alloc(struct
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
 	if (nid_alloc != nid_here)
-		return ____cache_alloc_node(cachep, flags, nid_alloc);
+		return ____cache_alloc_node(cachep, flags, nid_alloc, rank);
 	return NULL;
 }
 
@@ -3226,7 +3248,7 @@ static void *alternate_node_alloc(struct
  * allocator to do its reclaim / fallback magic. We then insert the
  * slab into the proper nodelist and then allocate from it.
  */
-static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
 {
 	struct zonelist *zonelist;
 	gfp_t local_flags;
@@ -3253,7 +3275,7 @@ retry:
 			cache->nodelists[nid] &&
 			cache->nodelists[nid]->free_objects)
 				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
+					flags | GFP_THISNODE, nid, rank);
 	}
 
 	if (!obj && !(flags & __GFP_NO_GROW)) {
@@ -3276,7 +3298,7 @@ retry:
 			nid = page_to_nid(virt_to_page(obj));
 			if (cache_grow(cache, flags, nid, obj)) {
 				obj = ____cache_alloc_node(cache,
-					flags | GFP_THISNODE, nid);
+					flags | GFP_THISNODE, nid, rank);
 				if (!obj)
 					/*
 					 * Another processor may allocate the
@@ -3297,7 +3319,7 @@ retry:
  * A interface to enable slab creation on nodeid
  */
 static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
-				int nodeid)
+				int nodeid, int rank)
 {
 	struct list_head *entry;
 	struct slab *slabp;
@@ -3310,6 +3332,8 @@ static void *____cache_alloc_node(struct
 
 retry:
 	check_irq_off();
+	if (unlikely(rank > cachep->rank))
+		goto force_grow;
 	spin_lock(&l3->list_lock);
 	entry = l3->slabs_partial.next;
 	if (entry == &l3->slabs_partial) {
@@ -3345,11 +3369,12 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;
 
-	return fallback_alloc(cachep, flags);
+	return fallback_alloc(cachep, flags, rank);
 
 done:
 	return obj;
@@ -3373,6 +3398,7 @@ __cache_alloc_node(struct kmem_cache *ca
 {
 	unsigned long save_flags;
 	void *ptr;
+	int rank = slab_alloc_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
@@ -3382,7 +3408,7 @@ __cache_alloc_node(struct kmem_cache *ca
 
 	if (unlikely(!cachep->nodelists[nodeid])) {
 		/* Node not bootstrapped yet */
-		ptr = fallback_alloc(cachep, flags);
+		ptr = fallback_alloc(cachep, flags, rank);
 		goto out;
 	}
 
@@ -3393,12 +3419,12 @@ __cache_alloc_node(struct kmem_cache *ca
 		 * to other nodes. It may fail while we still have
 		 * objects on other nodes available.
 		 */
-		ptr = ____cache_alloc(cachep, flags);
+		ptr = ____cache_alloc(cachep, flags, rank);
 		if (ptr)
 			goto out;
 	}
 	/* ___cache_alloc_node can fall back to other nodes */
-	ptr = ____cache_alloc_node(cachep, flags, nodeid);
+	ptr = ____cache_alloc_node(cachep, flags, nodeid, rank);
   out:
 	local_irq_restore(save_flags);
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
@@ -3407,23 +3433,23 @@ __cache_alloc_node(struct kmem_cache *ca
 }
 
 static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
 {
 	void *objp;
 
 	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
-		objp = alternate_node_alloc(cache, flags);
+		objp = alternate_node_alloc(cache, flags, rank);
 		if (objp)
 			goto out;
 	}
-	objp = ____cache_alloc(cache, flags);
+	objp = ____cache_alloc(cache, flags, rank);
 
 	/*
 	 * We may just have run out of memory on the local node.
 	 * ____cache_alloc_node() knows how to locate memory on other nodes
 	 */
  	if (!objp)
- 		objp = ____cache_alloc_node(cache, flags, numa_node_id());
+ 		objp = ____cache_alloc_node(cache, flags, numa_node_id(), rank);
 
   out:
 	return objp;
@@ -3431,9 +3457,9 @@ __do_cache_alloc(struct kmem_cache *cach
 #else
 
 static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags, int rank)
 {
-	return ____cache_alloc(cachep, flags);
+	return ____cache_alloc(cachep, flags, rank);
 }
 
 #endif /* CONFIG_NUMA */
@@ -3443,10 +3469,11 @@ __cache_alloc(struct kmem_cache *cachep,
 {
 	unsigned long save_flags;
 	void *objp;
+	int rank = slab_alloc_rank(flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
-	objp = __do_cache_alloc(cachep, flags);
+	objp = __do_cache_alloc(cachep, flags, rank);
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	prefetchw(objp);
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig	2007-03-26 13:34:55.000000000 +0200
+++ linux-2.6-git/mm/Kconfig	2007-03-26 14:18:56.000000000 +0200
@@ -163,3 +163,6 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+config SLAB_FAIR
+	def_bool n
+	depends on SLAB


* [PATCH 03/40] mm: allow PF_MEMALLOC from softirq context
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 01/40] mm: page allocation rank Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 02/40] mm: slab allocation fairness Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 04/40] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2669 bytes --]

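Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save current->flags and clear PF_MEMALLOC for the duration
of softirq processing, then restore the saved PF_MEMALLOC state afterwards.
ksoftirqd will have its own task_struct.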
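
The save/clear/restore pattern can be sketched in user space as below. This is
an illustration, not the kernel code: struct toy_task, the TOY_PF_* values and
toy_do_softirq() are hypothetical, and the handler is reduced to a flag saying
whether it sets PF_MEMALLOC itself. Only the shape of the restore macro
mirrors the tsk_restore_flags() added by this patch.

```c
/* Hypothetical stand-in for task_struct; only the flags word matters here. */
struct toy_task {
	unsigned long flags;
};

#define TOY_PF_OTHER	0x0001UL	/* some unrelated flag (made up) */
#define TOY_PF_MEMALLOC	0x0800UL	/* illustrative value for this sketch */

/* Same shape as the tsk_restore_flags() macro added by this patch. */
#define toy_restore_flags(p, pflags, mask) \
	do {	(p)->flags &= ~(mask); \
		(p)->flags |= ((pflags) & (mask)); } while (0)

/*
 * Model of softirq processing on a borrowed task context. Returns the
 * flags the handler saw on entry, so a caller can verify that the
 * interrupted task's PF_MEMALLOC did not leak into the handler.
 */
static unsigned long toy_do_softirq(struct toy_task *cur,
				    int handler_sets_memalloc)
{
	unsigned long pflags = cur->flags;	/* save the borrowed state */
	unsigned long seen;

	cur->flags &= ~TOY_PF_MEMALLOC;		/* start the handler clean */
	seen = cur->flags;

	if (handler_sets_memalloc)		/* handler may set it itself */
		cur->flags |= TOY_PF_MEMALLOC;

	/* put back only the PF_MEMALLOC bit of the interrupted task */
	toy_restore_flags(cur, pflags, TOY_PF_MEMALLOC);
	return seen;
}
```

Restoring under a mask leaves every other flag exactly as the handler left it;
only the PF_MEMALLOC bit is forced back to its saved value.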

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    4 ++++
 kernel/softirq.c      |    3 +++
 mm/internal.h         |    7 ++++---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-02-22 14:44:37.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2007-02-22 15:16:58.000000000 +0100
@@ -75,9 +75,10 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
Index: linux-2.6-git/kernel/softirq.c
===================================================================
--- linux-2.6-git.orig/kernel/softirq.c	2007-02-22 14:44:35.000000000 +0100
+++ linux-2.6-git/kernel/softirq.c	2007-02-22 15:29:38.000000000 +0100
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -248,6 +250,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6-git/include/linux/sched.h
===================================================================
--- linux-2.6-git.orig/include/linux/sched.h	2007-02-22 15:17:39.000000000 +0100
+++ linux-2.6-git/include/linux/sched.h	2007-02-22 15:29:05.000000000 +0100
@@ -1185,6 +1185,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else


* [PATCH 04/40] mm: serialize access to min_free_kbytes
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 03/40] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 05/40] mm: emergency pool Peter Zijlstra
                   ` (37 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 2137 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.
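
The shape of the fix is the usual locked wrapper around an unlocked worker. A
user-space sketch of that pattern, under loud assumptions: all toy_* names are
hypothetical, a C11 atomic_flag spin loop stands in for the kernel spinlock,
and interrupt disabling (spin_lock_irqsave) is not modelled at all.

```c
#include <stdatomic.h>

#define TOY_PAGE_SHIFT	12	/* assumes 4 KiB pages for the sketch */

static atomic_flag toy_min_free_lock = ATOMIC_FLAG_INIT;
static int toy_min_free_kbytes = 1024;
static unsigned long toy_pages_min;

/*
 * Unlocked worker: recompute the derived value from toy_min_free_kbytes.
 * Callers must either hold toy_min_free_lock or run where no concurrent
 * caller can exist (for example, early initialisation).
 */
static void toy_setup_pages_min_unlocked(void)
{
	toy_pages_min = (unsigned long)toy_min_free_kbytes >> (TOY_PAGE_SHIFT - 10);
}

/* Locked entry point for all run-time callers. */
static void toy_setup_pages_min(void)
{
	while (atomic_flag_test_and_set_explicit(&toy_min_free_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	toy_setup_pages_min_unlocked();

	atomic_flag_clear_explicit(&toy_min_free_lock, memory_order_release);
}
```

Keeping the unlocked worker separate lets a caller that cannot race, such as
the init path in this patch, skip the lock, while every other caller goes
through the serialized entry point.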

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-01-15 09:58:49.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-01-15 09:58:51.000000000 +0100
@@ -95,6 +95,7 @@ static char * const zone_names[MAX_NR_ZO
 #endif
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -3074,12 +3075,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -3133,6 +3134,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -3168,7 +3178,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }

--


* [PATCH 05/40] mm: emergency pool
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 04/40] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 06/40] mm: __GFP_EMERGENCY Peter Zijlstra
                   ` (36 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6511 bytes --]

Provide means to reserve a specific amount of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
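
To make the "relative" point concrete, here is a small stand-alone model of
the first half of the watermark check. It is only a sketch: the toy_ names
and flag values are invented, the per-order loop is left out, and the
"min -= min / 2" step for ALLOC_HIGH is assumed from the mainline
zone_watermark_ok() of that era (only the ALLOC_HARDER step is visible in
the hunk below).

```c
#include <stdbool.h>

#define TOY_ALLOC_HARDER 0x1u	/* illustrative values, not the kernel's */
#define TOY_ALLOC_HIGH   0x2u

struct toy_zone {
	unsigned long pages_min;
	unsigned long pages_emerg;
	unsigned long lowmem_reserve;
};

/* Returns true when free_pages is strictly above the effective threshold. */
bool toy_watermark_ok(const struct toy_zone *z, unsigned long free_pages,
		      unsigned int alloc_flags)
{
	unsigned long min = z->pages_min;

	if (alloc_flags & TOY_ALLOC_HIGH)
		min -= min / 2;		/* assumed, see lead-in */
	if (alloc_flags & TOY_ALLOC_HARDER)
		min -= min / 4;

	/* pages_emerg is added after the scaling, so the flags never shrink it */
	return free_pages > min + z->lowmem_reserve + z->pages_emerg;
}
```

With pages_min = 1000 and pages_emerg = 200, both flags together scale the
min part down to 375, but the threshold stays at 575; the 200 emergency
pages are never eaten by the relative adjustments.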

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mmzone.h |    3 +-
 mm/page_alloc.c        |   52 ++++++++++++++++++++++++++++++++++++++++---------
 mm/vmstat.c            |    6 ++---
 3 files changed, 48 insertions(+), 13 deletions(-)

Index: linux-2.6-git/include/linux/mmzone.h
===================================================================
--- linux-2.6-git.orig/include/linux/mmzone.h	2007-02-12 09:40:51.000000000 +0100
+++ linux-2.6-git/include/linux/mmzone.h	2007-02-12 11:13:58.000000000 +0100
@@ -178,7 +178,7 @@ enum zone_type {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_emerg, pages_min, pages_low, pages_high;
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -562,6 +562,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+void adjust_memalloc_reserve(int pages);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-02-12 11:13:35.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-02-12 11:14:16.000000000 +0100
@@ -101,6 +101,7 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -995,7 +996,8 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx] +
+			z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1348,8 +1350,8 @@ nofail_alloc:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%lx\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -1562,9 +1564,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE)),
 			K(zone_page_state(zone, NR_INACTIVE)),
 			K(zone->present_pages),
@@ -3000,7 +3002,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -3057,7 +3059,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -3069,11 +3072,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -3092,12 +3097,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -3118,6 +3125,33 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks kswapd into action to
+ *	satisfy the higher watermarks.
+ *
+ *	NOTE: there is only a single caller, hence no locking.
+ */
+void adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+	if (pages > 0) {
+		struct zone *zone;
+		for_each_zone(zone)
+			wakeup_kswapd(zone, 0);
+	}
+	if (pages)
+		printk(KERN_DEBUG "Emergency reserve: %d\n",
+				var_free_kbytes);
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6-git/mm/vmstat.c
===================================================================
--- linux-2.6-git.orig/mm/vmstat.c	2007-02-12 09:40:51.000000000 +0100
+++ linux-2.6-git/mm/vmstat.c	2007-02-12 11:14:28.000000000 +0100
@@ -513,9 +513,9 @@ static int zoneinfo_show(struct seq_file
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone_page_state(zone, NR_FREE_PAGES),
-			   zone->pages_min,
-			   zone->pages_low,
-			   zone->pages_high,
+			   zone->pages_emerg + zone->pages_min,
+			   zone->pages_emerg + zone->pages_low,
+			   zone->pages_emerg + zone->pages_high,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,

--


* [PATCH 06/40] mm: __GFP_EMERGENCY
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 05/40] mm: emergency pool Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 07/40] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 3301 bytes --]

__GFP_EMERGENCY will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.
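
The decision order can be modelled outside the kernel roughly as follows.
This is a sketch under assumptions: the two gfp bit values are copied from
the gfp.h hunk below, while TOY_ALLOC_NO_WATERMARKS, struct toy_ctx and the
function name are invented, and the TIF_MEMDIE branch is omitted.

```c
#include <stdbool.h>

#define TOY_GFP_NOMEMALLOC      0x10000u
#define TOY_GFP_EMERGENCY       0x80000u
#define TOY_ALLOC_NO_WATERMARKS 0x1u	/* illustrative value */

/* Hypothetical snapshot of the calling context. */
struct toy_ctx {
	bool in_irq;		/* models in_irq() */
	bool pf_memalloc;	/* models current->flags & PF_MEMALLOC */
};

unsigned int toy_gfp_to_alloc_flags(unsigned int gfp_mask,
				    const struct toy_ctx *ctx)
{
	unsigned int alloc_flags = 0;

	/* __GFP_NOMEMALLOC wins over everything else */
	if (!(gfp_mask & TOY_GFP_NOMEMALLOC)) {
		if (gfp_mask & TOY_GFP_EMERGENCY)
			alloc_flags |= TOY_ALLOC_NO_WATERMARKS;
		else if (!ctx->in_irq && ctx->pf_memalloc)
			alloc_flags |= TOY_ALLOC_NO_WATERMARKS;
	}
	return alloc_flags;
}
```

Note that the flag works from hard interrupt context as well, where the
PF_MEMALLOC test is skipped, and that __GFP_NOMEMALLOC still overrides it.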

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |    7 ++++++-
 mm/internal.h       |   10 +++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

Index: linux-2.6-git/include/linux/gfp.h
===================================================================
--- linux-2.6-git.orig/include/linux/gfp.h	2006-12-14 10:02:18.000000000 +0100
+++ linux-2.6-git/include/linux/gfp.h	2006-12-14 10:02:52.000000000 +0100
@@ -35,17 +35,21 @@ struct vm_area_struct;
 #define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)0x40u)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)0x80u)	/* Can call down to low-level FS? */
+
 #define __GFP_COLD	((__force gfp_t)0x100u)	/* Cache-cold page required */
 #define __GFP_NOWARN	((__force gfp_t)0x200u)	/* Suppress page allocation failure warning */
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
+
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	((__force gfp_t)0x2000u)/* Slab internal usage */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
+
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x80000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
+			__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2006-12-14 10:02:52.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2006-12-14 10:02:52.000000000 +0100
@@ -75,7 +75,9 @@ static int inline gfp_to_alloc_flags(gfp
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_EMERGENCY)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

--


* [PATCH 07/40] mm: allow mempool to fall back to memalloc reserves
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 06/40] mm: __GFP_EMERGENCY Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-mempool_fixup.patch --]
[-- Type: text/plain, Size: 1391 bytes --]

Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.
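
The retry idea can be shown with a toy model: a first pass that refuses the
reserves, and a second pass without __GFP_NOMEMALLOC only for callers that
are entitled to them. Everything named toy_ is hypothetical, the "entitled"
argument replaces the gfp_to_alloc_flags() test, and the pool's own element
list and the sleeping path of mempool_alloc() are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_NOWARN     0x200u
#define TOY_GFP_NOMEMALLOC 0x10000u

static int reserve_object;	/* stands in for memory reachable only via reserves */

/* Toy backing allocator: normal memory is exhausted, reserves are not. */
static void *toy_backing_alloc(unsigned int gfp)
{
	return (gfp & TOY_GFP_NOMEMALLOC) ? NULL : &reserve_object;
}

void *toy_pool_alloc(unsigned int gfp_mask, bool entitled)
{
	/* first pass: stay away from the reserves */
	unsigned int gfp_temp = gfp_mask | TOY_GFP_NOMEMALLOC | TOY_GFP_NOWARN;
	void *element;

	for (;;) {
		element = toy_backing_alloc(gfp_temp);
		if (element)
			return element;
		if (!entitled)
			return NULL;
		if (!(gfp_temp & TOY_GFP_NOMEMALLOC))
			return NULL;	/* second pass failed too; the real code
					 * restores the flags and may wait */
		/* second pass: allow the reserves */
		gfp_temp &= ~(TOY_GFP_NOMEMALLOC | TOY_GFP_NOWARN);
	}
}
```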

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mempool.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6-git/mm/mempool.c
===================================================================
--- linux-2.6-git.orig/mm/mempool.c	2007-01-12 08:03:44.000000000 +0100
+++ linux-2.6-git/mm/mempool.c	2007-01-12 10:38:57.000000000 +0100
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include "internal.h"
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -229,6 +230,15 @@ repeat_alloc:
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
+	/* if we really had right to the emergency reserves try those */
+	if (gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS) {
+		if (gfp_temp & __GFP_NOMEMALLOC) {
+			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			goto repeat_alloc;
+		} else
+			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+	}
+
 	/* We must not sleep in the GFP_ATOMIC case */
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;

--


* [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 07/40] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-05-04 10:26 ` Peter Zijlstra
  2007-05-04 10:54   ` Pekka Enberg
  2007-05-04 16:36   ` Christoph Lameter
  2007-05-04 10:27 ` [PATCH 09/40] mm: optimize gfp_to_rank() Peter Zijlstra
                   ` (33 subsequent siblings)
  41 siblings, 2 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Pekka Enberg

[-- Attachment #1: mm-kmem_objsize.patch --]
[-- Type: text/plain, Size: 3619 bytes --]

Expose buffer_size in order to allow fair estimates of the actual space
used/needed.
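
As an illustration of what such an estimate looks like, the sketch below
follows the shape of the SLOB kobjsize() hunk in this patch: sub-page
requests are reported as-is, larger ones are rounded up to a power-of-two
number of pages. TOY_PAGE_SIZE (4 KiB) and toy_find_order() are assumptions
made for the example, not the kernel helpers.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
static unsigned int toy_find_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)TOY_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Estimated space actually consumed by an allocation of 'size' bytes. */
size_t toy_kobjsize(size_t size)
{
	if (size < TOY_PAGE_SIZE)
		return size;
	return (size_t)TOY_PAGE_SIZE << toy_find_order(size);
}
```

A reserve sized from the requested bytes alone would come up short for the
larger objects; a request of 10000 bytes, for instance, occupies four pages
in this model.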

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
---
 include/linux/slab.h |    2 ++
 mm/slab.c            |   16 ++++++++++++++--
 mm/slob.c            |   18 ++++++++++++++++++
 3 files changed, 34 insertions(+), 2 deletions(-)

Index: linux-2.6-git/include/linux/slab.h
===================================================================
--- linux-2.6-git.orig/include/linux/slab.h	2007-03-26 14:18:59.000000000 +0200
+++ linux-2.6-git/include/linux/slab.h	2007-03-26 18:33:58.000000000 +0200
@@ -54,6 +54,7 @@ void *kmem_cache_alloc(struct kmem_cache
 void *kmem_cache_zalloc(struct kmem_cache *, gfp_t);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
+unsigned int kmem_cache_objsize(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 
@@ -74,6 +75,7 @@ void *__kmalloc(size_t, gfp_t);
 void *__kzalloc(size_t, gfp_t);
 void kfree(const void *);
 unsigned int ksize(const void *);
+unsigned int kobjsize(size_t);
 
 /**
  * kcalloc - allocate memory for an array. The memory is set to zero.
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c	2007-03-26 15:44:34.000000000 +0200
+++ linux-2.6-git/mm/slab.c	2007-03-28 10:10:36.000000000 +0200
@@ -3205,12 +3205,12 @@ static inline void *____cache_alloc(stru
 }
 
 #ifdef CONFIG_SLAB_FAIR
-static inline int slab_alloc_rank(gfp_t flags)
+static __always_inline int slab_alloc_rank(gfp_t flags)
 {
 	return gfp_to_rank(flags);
 }
 #else
-static inline int slab_alloc_rank(gfp_t flags)
+static __always_inline int slab_alloc_rank(gfp_t flags)
 {
 	return 0;
 }
@@ -3815,6 +3815,12 @@ unsigned int kmem_cache_size(struct kmem
 }
 EXPORT_SYMBOL(kmem_cache_size);
 
+unsigned int kmem_cache_objsize(struct kmem_cache *cachep)
+{
+	return cachep->buffer_size;
+}
+EXPORT_SYMBOL_GPL(kmem_cache_objsize);
+
 const char *kmem_cache_name(struct kmem_cache *cachep)
 {
 	return cachep->name;
@@ -4512,3 +4518,9 @@ unsigned int ksize(const void *objp)
 
 	return obj_size(virt_to_cache(objp));
 }
+
+unsigned int kobjsize(size_t size)
+{
+	return kmem_cache_objsize(kmem_find_general_cachep(size, 0));
+}
+EXPORT_SYMBOL_GPL(kobjsize);
Index: linux-2.6-git/mm/slob.c
===================================================================
--- linux-2.6-git.orig/mm/slob.c	2007-03-26 14:18:59.000000000 +0200
+++ linux-2.6-git/mm/slob.c	2007-03-26 18:33:58.000000000 +0200
@@ -240,6 +240,15 @@ unsigned int ksize(const void *block)
 	return ((slob_t *)block - 1)->units * SLOB_UNIT;
 }
 
+unsigned int kobjsize(size_t size)
+{
+	if (size < PAGE_SIZE)
+		return size;
+
+	return PAGE_SIZE << find_order(size);
+}
+EXPORT_SYMBOL_GPL(kobjsize);
+
 struct kmem_cache {
 	unsigned int size, align;
 	const char *name;
@@ -321,6 +330,15 @@ unsigned int kmem_cache_size(struct kmem
 }
 EXPORT_SYMBOL(kmem_cache_size);
 
+unsigned int kmem_cache_objsize(struct kmem_cache *c)
+{
+	if (c->size < PAGE_SIZE)
+		return c->size + c->align;
+
+	return PAGE_SIZE << find_order(c->size);
+}
+EXPORT_SYMBOL_GPL(kmem_cache_objsize);
+
 const char *kmem_cache_name(struct kmem_cache *c)
 {
 	return c->name;

--


* [PATCH 09/40] mm: optimize gfp_to_rank()
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 10/40] selinux: tag avc cache alloc as non-critical Peter Zijlstra
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-optimize-gtp_to_rank.patch --]
[-- Type: text/plain, Size: 3137 bytes --]

The gfp_to_rank() call in the slab allocator severely impacts performance.
Hence reduce it to the bone, keeping only what is needed to make the reserve
work.
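
Stripped of kernel context, the reduced rank computation looks roughly like
the sketch below. The context is passed in as plain booleans (an assumption
made so the example stands alone), and the TOY_ bit values are copied from
the gfp.h definitions earlier in this series.

```c
#include <stdbool.h>

#define TOY_GFP_NOMEMALLOC 0x10000u
#define TOY_GFP_EMERGENCY  0x80000u

/* Rank 0: may use the reserves. Rank 1: everybody else, lumped together. */
int toy_gfp_to_rank(unsigned int gfp_mask, bool in_irq, bool pf_memalloc)
{
	if (!(gfp_mask & TOY_GFP_NOMEMALLOC)) {
		if (gfp_mask & TOY_GFP_EMERGENCY)
			return 0;
		if (!in_irq && pf_memalloc)
			return 0;
	}
	return 1;
}
```

Only the rank 0 / not rank 0 distinction is kept, which is all the reserve
logic needs.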

[more AIM9 results go here]

 AIM9 test          2.6.21-rc5            2.6.21-rc5-slab1             
                                         CONFIG_SLAB_FAIR=y            

54 tcp_test      2124.48 +/-  10.85    2137.43 +/-  9.22    12.95      
55 udp_test      5204.43 +/-  45.13    5231.59 +/- 56.66    27.16      
56 fifo_test    20991.42 +/-  46.71   19675.97 +/- 56.35  1315.44      
57 stream_pipe  10024.16 +/- 119.88    9912.53 +/- 75.52   111.63      
58 dgram_pipe    9460.18 +/- 119.50    9502.75 +/- 89.06    42.57      
59 pipe_cpy     30719.81 +/- 117.01   27885.52 +/- 46.81  2834.28      

                                          2.6.21-rc5-slab2    
                                         CONFIG_SLAB_FAIR=y   
                                                              
54 tcp_test      2124.48 +/-  10.85    2122.80 +/-   4.70     1.68
55 udp_test      5204.43 +/-  45.13    5136.98 +/-  62.31    67.45
56 fifo_test    20991.42 +/-  46.71   19646.81 +/-  53.61  1344.60
57 stream_pipe  10024.16 +/- 119.88    9940.87 +/- 280.73    83.29
58 dgram_pipe    9460.18 +/- 119.50    9432.69 +/- 250.27    27.49
59 pipe_cpy     30719.81 +/- 117.01   27870.70 +/-  65.50  2849.10

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h |   33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-02-22 14:09:39.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2007-02-22 14:24:34.000000000 +0100
@@ -105,9 +105,38 @@ static inline int alloc_flags_to_rank(in
 	return rank;
 }
 
-static inline int gfp_to_rank(gfp_t gfp_mask)
+static __always_inline int gfp_to_rank(gfp_t gfp_mask)
 {
-	return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+	/*
+	 * Although correct this full version takes a ~3% performance hit
+	 * on the network test in aim9.
+	 *
+	 * return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+	 *
+	 * So we go cheat a little. We'll only focus on the correctness of
+	 * rank 0.
+	 */
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (gfp_mask & __GFP_EMERGENCY)
+			return 0;
+		else if (!in_irq() && (current->flags & PF_MEMALLOC))
+			return 0;
+		/*
+		 * We skip the TIF_MEMDIE test:
+		 *
+		 * if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
+		 * 	return 0;
+		 *
+		 * this will force an alloc but since we are allowed the memory
+		 * that will succeed. This will make this very rare occurence
+		 * very expensive when under severe memory pressure, but it
+		 * seems a valid tradeoff.
+		 */
+	}
+
+	/* Cheat by lumping everybody else in rank 1. */
+	return 1;
 }
 
 #endif

--


* [PATCH 10/40] selinux: tag avc cache alloc as non-critical
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 09/40] mm: optimize gfp_to_rank() Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 11/40] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	James Morris

[-- Attachment #1: mm-selinux-emergency.patch --]
[-- Type: text/plain, Size: 1014 bytes --]

Failing to allocate a cache entry will only harm performance, not correctness.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: James Morris <jmorris@namei.org>
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c	2007-02-14 08:31:13.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c	2007-02-14 10:10:47.000000000 +0100
@@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 

--


* [PATCH 11/40] net: wrap sk->sk_backlog_rcv()
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 10/40] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 12/40] net: packet split receive api Peter Zijlstra
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: net-backlog.patch --]
[-- Type: text/plain, Size: 2969 bytes --]

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
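
The shape of the change, reduced to a stand-alone sketch: callers stop
dereferencing the function pointer themselves and go through one inline
helper, which then becomes the single place to hook extra behaviour. The
toy_ structures and toy_proto_rcv() are invented for the example.

```c
struct toy_skb {
	int len;
};

struct toy_sock {
	int (*backlog_rcv)(struct toy_sock *sk, struct toy_skb *skb);
	int received;	/* bytes seen by the protocol handler */
};

/* Single choke point; later extensions would go here. */
static inline int toy_backlog_rcv(struct toy_sock *sk, struct toy_skb *skb)
{
	return sk->backlog_rcv(sk, skb);
}

/* Hypothetical protocol handler used to exercise the wrapper. */
static int toy_proto_rcv(struct toy_sock *sk, struct toy_skb *skb)
{
	sk->received += skb->len;
	return 0;
}
```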

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    5 +++++
 net/core/sock.c      |    4 ++--
 net/ipv4/tcp.c       |    2 +-
 net/ipv4/tcp_timer.c |    2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 11:29:55.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 11:42:00.000000000 +0100
@@ -480,6 +480,11 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)		\
 ({	int rc;							\
 	release_sock(__sk);					\
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c	2007-02-14 11:29:55.000000000 +0100
+++ linux-2.6-git/net/core/sock.c	2007-02-14 11:42:00.000000000 +0100
@@ -290,7 +290,7 @@ int sk_receive_skb(struct sock *sk, stru
 		 */
 		mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-		rc = sk->sk_backlog_rcv(sk, skb);
+		rc = sk_backlog_rcv(sk, skb);
 
 		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	} else
@@ -1244,7 +1244,7 @@ static void __release_sock(struct sock *
 			struct sk_buff *next = skb->next;
 
 			skb->next = NULL;
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 			/*
 			 * We are in process context here with softirqs
Index: linux-2.6-git/net/ipv4/tcp.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp.c	2007-02-14 11:29:35.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp.c	2007-02-14 11:42:00.000000000 +0100
@@ -1002,7 +1002,7 @@ static void tcp_prequeue_process(struct 
 	 * necessary */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);
 	local_bh_enable();
 
 	/* Clear memory counter. */
Index: linux-2.6-git/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_timer.c	2007-02-14 11:29:36.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_timer.c	2007-02-14 11:42:00.000000000 +0100
@@ -198,7 +198,7 @@ static void tcp_delack_timer(unsigned lo
 		NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
 		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 		tp->ucopy.memory = 0;
 	}

--


* [PATCH 12/40] net: packet split receive api
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 11/40] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 13/40] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: net-ps_rx.patch --]
[-- Type: text/plain, Size: 6243 bytes --]

Add some packet-split receive hooks.

For one, this allows NUMA node-affine page allocations.  Later on, these hooks
will be extended to do emergency reserve allocations for fragments.
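
The bookkeeping that skb_add_rx_frag() folds into one call can be sketched
with a toy buffer structure. This is not the sk_buff layout: struct toy_skb,
struct toy_frag and TOY_MAX_FRAGS are made up, and only the accounting that
the e1000 hunk below used to open-code is modelled.

```c
#include <stddef.h>

#define TOY_MAX_FRAGS 4

struct toy_frag {
	void *page;
	int off;
	int size;
};

struct toy_skb {
	unsigned int len;	/* total bytes, linear part plus fragments */
	unsigned int data_len;	/* bytes held in fragments only */
	unsigned int truesize;	/* memory charged for this buffer */
	int nr_frags;
	struct toy_frag frags[TOY_MAX_FRAGS];
};

/* Attach a page fragment and update all three counters in one place. */
void toy_add_rx_frag(struct toy_skb *skb, int i, void *page, int off, int size)
{
	skb->frags[i].page = page;
	skb->frags[i].off = off;
	skb->frags[i].size = size;
	skb->nr_frags = i + 1;

	skb->len += size;
	skb->data_len += size;
	skb->truesize += size;
}
```

Keeping the three increments behind one helper means a driver cannot forget
one of them, and gives later patches one spot to account reserve pages.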

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/net/e1000/e1000_main.c |    8 ++------
 drivers/net/sky2.c             |   16 ++++++----------
 include/linux/skbuff.h         |   23 +++++++++++++++++++++++
 net/core/skbuff.c              |   20 ++++++++++++++++++++
 4 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6-git/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6-git.orig/drivers/net/e1000/e1000_main.c	2007-02-14 08:31:12.000000000 +0100
+++ linux-2.6-git/drivers/net/e1000/e1000_main.c	2007-02-14 11:42:07.000000000 +0100
@@ -4412,12 +4412,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
 			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
 					PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-			                   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
 			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
@@ -4623,7 +4619,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 			if (j < adapter->rx_ps_pages) {
 				if (likely(!ps_page->ps_page[j])) {
 					ps_page->ps_page[j] =
-						alloc_page(GFP_ATOMIC);
+						netdev_alloc_page(netdev);
 					if (unlikely(!ps_page->ps_page[j])) {
 						adapter->alloc_rx_buff_failed++;
 						goto no_buffers;
Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2007-02-14 11:29:54.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2007-02-14 11:59:04.000000000 +0100
@@ -813,6 +813,9 @@ static inline void skb_fill_page_desc(st
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1148,6 +1151,26 @@ static inline struct sk_buff *netdev_all
 	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ *	netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *
+ * 	Allocate a new page node local to the specified device.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+	return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+	__free_page(page);
+}
+
 /**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c	2007-02-14 11:29:54.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c	2007-02-14 12:01:40.000000000 +0100
@@ -279,6 +279,24 @@ struct sk_buff *__netdev_alloc_skb(struc
 	return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	struct page *page;
+
+	page = alloc_pages_node(node, gfp_mask, 0);
+	return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		int size)
+{
+	skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
@@ -2066,6 +2084,8 @@ EXPORT_SYMBOL(kfree_skb);
 EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
Index: linux-2.6-git/drivers/net/sky2.c
===================================================================
--- linux-2.6-git.orig/drivers/net/sky2.c	2007-02-14 08:31:12.000000000 +0100
+++ linux-2.6-git/drivers/net/sky2.c	2007-02-14 12:00:22.000000000 +0100
@@ -1083,7 +1083,7 @@ static struct sk_buff *sky2_rx_alloc(str
 	skb_reserve(skb, ALIGN(p, RX_SKB_ALIGN) - p);
 
 	for (i = 0; i < sky2->rx_nfrags; i++) {
-		struct page *page = alloc_page(GFP_ATOMIC);
+		struct page *page = netdev_alloc_page(sky2->netdev);
 
 		if (!page)
 			goto free_partial;
@@ -1972,8 +1972,8 @@ static struct sk_buff *receive_copy(stru
 }
 
 /* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
-			  unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+			  unsigned int hdr_space, unsigned int length)
 {
 	int i, num_frags;
 	unsigned int size;
@@ -1990,15 +1990,11 @@ static void skb_put_frags(struct sk_buff
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			netdev_free_page(sky2->netdev, frag->page);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
-
-			frag->size = size;
-			skb->data_len += size;
-			skb->truesize += size;
-			skb->len += size;
+			skb_add_rx_frag(skb, i, frag->page, 0, size);
 			length -= size;
 		}
 	}
@@ -2027,7 +2023,7 @@ static struct sk_buff *receive_new(struc
 	sky2_rx_map_skb(sky2->hw->pdev, re, hdr_space);
 
 	if (skb_shinfo(skb)->nr_frags)
-		skb_put_frags(skb, hdr_space, length);
+		skb_put_frags(sky2, skb, hdr_space, length);
 	else
 		skb_put(skb, length);
 	return skb;

--


* [PATCH 13/40] net: sk_allocation() - concentrate socket related allocations
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 12/40] net: packet split receive api Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 14/40] netvm: link network to vm layer Peter Zijlstra
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: net-sk_allocation.patch --]
[-- Type: text/plain, Size: 5293 bytes --]

Introduce sk_allocation(); this function allows injecting sock-specific
flags into each sock-related allocation.
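
A small user-space sketch of the idea follows. In this patch sk_allocation()
simply returns the mask it was given; patch 14/40 of this series later makes
it OR in the socket's __GFP_EMERGENCY bit. The model_* names and the bit
values below are hypothetical and chosen only for illustration.

```c
typedef unsigned int model_gfp_t;

/* Hypothetical bit values, not the kernel's real GFP encoding. */
#define MODEL_GFP_ATOMIC	0x01u
#define MODEL_GFP_KERNEL	0x02u
#define MODEL_GFP_EMERGENCY	0x80u

struct model_sock {
	model_gfp_t sk_allocation;	/* per-socket allocation flags */
};

/*
 * Models the later form of sk_allocation(): keep the caller's mask and
 * add the emergency bit only if the socket carries it.
 */
model_gfp_t model_sk_allocation(const struct model_sock *sk,
				model_gfp_t gfp_mask)
{
	return gfp_mask | (sk->sk_allocation & MODEL_GFP_EMERGENCY);
}
```

Routing every socket-related allocation through one function gives a single
place to change when the flag policy changes.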

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h    |    7 ++++++-
 net/ipv4/tcp_output.c |   11 ++++++-----
 net/ipv6/tcp_ipv6.c   |   14 +++++++++-----
 3 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6-git/net/ipv4/tcp_output.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_output.c
+++ linux-2.6-git/net/ipv4/tcp_output.c
@@ -2011,7 +2011,7 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
@@ -2044,7 +2044,7 @@ void tcp_send_active_reset(struct sock *
 	struct sk_buff *skb;
 
 	/* NOTE: No TCP options attached and we never retransmit this. */
-	skb = alloc_skb(MAX_TCP_HEADER, priority);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
 	if (!skb) {
 		NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
 		return;
@@ -2117,7 +2117,8 @@ struct sk_buff * tcp_make_synack(struct 
 	__u8 *md5_hash_location;
 #endif
 
-	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+			sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return NULL;
 
@@ -2376,7 +2377,7 @@ void tcp_send_ack(struct sock *sk)
 		 * tcp_transmit_skb() will set the ownership to this
 		 * sock.
 		 */
-		buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+		buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 		if (buff == NULL) {
 			inet_csk_schedule_ack(sk);
 			inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2418,7 +2419,7 @@ static int tcp_xmit_probe_skb(struct soc
 	struct sk_buff *skb;
 
 	/* We don't queue it, tcp_transmit_skb() sets ownership. */
-	skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return -1;
 
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -415,6 +415,11 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+	return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
@@ -1207,7 +1212,7 @@ static inline struct sk_buff *sk_stream_
 	int hdr_len;
 
 	hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
-	skb = alloc_skb_fclone(size + hdr_len, gfp);
+	skb = alloc_skb_fclone(size + hdr_len, sk_allocation(sk, gfp));
 	if (skb) {
 		skb->truesize += mem;
 		if (sk_stream_wmem_schedule(sk, skb->truesize)) {
Index: linux-2.6-git/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6-git/net/ipv6/tcp_ipv6.c
@@ -581,7 +581,8 @@ static int tcp_v6_md5_do_add(struct sock
 	} else {
 		/* reallocate new list if current one is full. */
 		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
 			if (!tp->md5sig_info) {
 				kfree(newkey);
 				return -ENOMEM;
@@ -590,7 +591,8 @@ static int tcp_v6_md5_do_add(struct sock
 		tcp_alloc_md5sig_pool();
 		if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
 			keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-				       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+				       (tp->md5sig_info->entries6 + 1)),
+				       sk_allocation(sk, GFP_ATOMIC));
 
 			if (!keys) {
 				tcp_free_md5sig_pool();
@@ -715,7 +717,7 @@ static int tcp_v6_parse_md5_keys (struct
 		struct tcp_sock *tp = tcp_sk(sk);
 		struct tcp_md5sig_info *p;
 
-		p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+		p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
 		if (!p)
 			return -ENOMEM;
 
@@ -1011,7 +1013,7 @@ static void tcp_v6_send_reset(struct soc
 	 */
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL)
 		return;
 
@@ -1090,10 +1092,12 @@ static void tcp_v6_send_ack(struct tcp_t
 	struct tcp_md5sig_key *key;
 	struct tcp_md5sig_key tw_key;
 #endif
+	gfp_t gfp_mask = GFP_ATOMIC;
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (!tw && skb->sk) {
 		key = tcp_v6_md5_do_lookup(skb->sk, &ipv6_hdr(skb)->daddr);
+		gfp_mask = sk_allocation(skb->sk, gfp_mask);
 	} else if (tw && tw->tw_md5_keylen) {
 		tw_key.key = tw->tw_md5_key;
 		tw_key.keylen = tw->tw_md5_keylen;
@@ -1111,7 +1115,7 @@ static void tcp_v6_send_ack(struct tcp_t
 #endif
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 gfp_mask);
 	if (buff == NULL)
 		return;
 

--


* [PATCH 14/40] netvm: link network to vm layer
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 13/40] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 15/40] netvm: INET reserves Peter Zijlstra
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm-reserve.patch --]
[-- Type: text/plain, Size: 7443 bytes --]

Hook up networking to the memory reserve.

There are two kinds of reserves: skb and aux.
 - skb reserves are used for incoming packets,
 - aux reserves are used for processing these packets.

The consumers for these reserves are sockets marked with:
  SOCK_VMIO

Such sockets are to be used to service the VM (i.e. to swap over). They
must be handled kernel-side; exposing such a socket to user-space is a BUG.
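
The skb reserve accounting can be sketched in user space roughly as follows.
This is a model only: the model_* names, the 4096-byte reserve and the use of
C11 atomics in place of the kernel's atomic_t are assumptions. The sketch
charges requests in power-of-two units, admits them while the total stays
below twice the reserve, and on refusal subtracts the same rounded size it
added.

```c
#include <stdatomic.h>

static atomic_int model_rx_emergency_bytes;
static int model_skb_reserve_bytes = 4096;	/* hypothetical reserve size */

/* Round n (>= 1) up to the next power of two. */
int model_roundup_pow_of_two(int n)
{
	int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Returns 1 if the request may use the reserve, 0 otherwise. */
int model_rx_emergency_get(int bytes, int overcommit)
{
	int size = model_roundup_pow_of_two(bytes);
	int nr = atomic_fetch_add(&model_rx_emergency_bytes, size) + size;
	int thresh = 2 * model_skb_reserve_bytes;

	if (nr < thresh || overcommit)
		return 1;

	/* Refused: undo the whole charge made above. */
	atomic_fetch_sub(&model_rx_emergency_bytes, size);
	return 0;
}

/* Release a charge previously taken with model_rx_emergency_get(). */
void model_rx_emergency_put(int bytes)
{
	atomic_fetch_sub(&model_rx_emergency_bytes,
			 model_roundup_pow_of_two(bytes));
}

/* Current number of bytes charged against the reserve. */
int model_rx_emergency_used(void)
{
	return atomic_load(&model_rx_emergency_bytes);
}
```

With a 4096-byte reserve the threshold is 8192 bytes, so one 3000-byte request
(charged as 4096) fits, while a second one is refused unless overcommit is
requested.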

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |   43 ++++++++++++++++
 net/Kconfig        |    3 +
 net/core/sock.c    |  135 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -49,6 +49,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/log2.h>
 
 #include <linux/filter.h>
 
@@ -393,6 +394,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -415,9 +417,48 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_vmio(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_VMIO);
+}
+
+/*
+ * Guesstimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to be bounced. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t vmio_socks;
+
+static inline int sk_vmio_socks(void)
+{
+	return atomic_read(&vmio_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+static inline
+int guess_kmem_cache_pages(struct kmem_cache *cachep, int nr_objs)
+{
+	int guess = DIV_ROUND_UP((kmem_cache_objsize(cachep) * nr_objs),
+			PAGE_SIZE);
+	guess += ilog2(guess);
+	return guess;
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void skb_reserve_memory(int skb_reserve_bytes);
+extern void aux_reserve_memory(int aux_reserve_pages);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-	return gfp_mask;
+	return gfp_mask | (sk->sk_allocation & __GFP_EMERGENCY);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c
+++ linux-2.6-git/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/log2.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -198,6 +199,139 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static atomic_t rx_emergency_bytes;
+
+static int skb_reserve_bytes;
+static int aux_reserve_pages;
+
+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+atomic_t vmio_socks;
+EXPORT_SYMBOL_GPL(vmio_socks);
+
+/*
+ * is there room for another emergency packet?
+ * we account in power of two units to approx the slab allocator.
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+	int size = roundup_pow_of_two(bytes);
+	int nr = atomic_add_return(size, &rx_emergency_bytes);
+	int thresh = 2 * skb_reserve_bytes;
+	if (nr < thresh || overcommit)
+		return 1;
+
+	atomic_sub(size, &rx_emergency_bytes);
+	return 0;
+}
+
+int rx_emergency_get(int bytes)
+{
+	return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+	return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+	int size = roundup_pow_of_two(bytes);
+	atomic_sub(size, &rx_emergency_bytes);
+}
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_VMIO sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+	unsigned long flags;
+	int reserve = tx_reserve_pages;
+	int nr_socks;
+
+	spin_lock_irqsave(&memalloc_lock, flags);
+	nr_socks = atomic_add_return(socks, &vmio_socks);
+	BUG_ON(nr_socks < 0);
+
+	if (nr_socks) {
+		int skb_reserve_pages =
+			DIV_ROUND_UP(skb_reserve_bytes, PAGE_SIZE);
+		int rx_pages = 2 * skb_reserve_pages + aux_reserve_pages;
+		reserve += rx_pages - rx_net_reserve;
+		rx_net_reserve = rx_pages;
+	} else {
+		reserve -= rx_net_reserve;
+		rx_net_reserve = 0;
+	}
+
+	if (reserve)
+		adjust_memalloc_reserve(reserve);
+	spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper functions to track the memory reserves
+ * needed because of modular ipv6
+ */
+void skb_reserve_memory(int bytes)
+{
+	skb_reserve_bytes += bytes;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(skb_reserve_memory);
+
+void aux_reserve_memory(int pages)
+{
+	aux_reserve_pages += pages;
+	sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(aux_reserve_memory);
+
+/**
+ *	sk_set_vmio - sets %SOCK_VMIO
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+#ifndef CONFIG_NETVM
+	BUG();
+#endif
+	if (!set) {
+		sk_adjust_memalloc(1, 0);
+		sock_set_flag(sk, SOCK_VMIO);
+		sk->sk_allocation |= __GFP_EMERGENCY;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_VMIO);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_VMIO);
+		sk->sk_allocation &= ~__GFP_EMERGENCY;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -879,6 +1013,7 @@ void sk_free(struct sock *sk)
 	struct sk_filter *filter;
 	struct module *owner = sk->sk_prot_creator->owner;
 
+	sk_clear_vmio(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
Index: linux-2.6-git/net/Kconfig
===================================================================
--- linux-2.6-git.orig/net/Kconfig
+++ linux-2.6-git/net/Kconfig
@@ -224,6 +224,9 @@ source "net/ieee80211/Kconfig"
 
 endmenu
 
+config NETVM
+	def_bool n
+
 endif   # if NET
 endmenu # Networking
 

--


* [PATCH 15/40] netvm: INET reserves.
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 14/40] netvm: link network to vm layer Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 16/40] netvm: hook skb allocation to reserves Peter Zijlstra
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm-reserve-inet.patch --]
[-- Type: text/plain, Size: 7136 bytes --]

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Account the route cache to the auxiliary reserve.
Account the fragments to the skb reserve so that one can at least
overflow the fragment cache (avoids fragment deadlocks).
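
The route cache reservation relies on guess_kmem_cache_pages() from the
previous patch. Its arithmetic can be sketched in user space as below; the
model_* names and the 4096-byte page size are assumptions, and model_ilog2()
is a plain loop standing in for the kernel's ilog2().

```c
#define MODEL_PAGE_SIZE 4096	/* assumed page size */

/* floor(log2(n)) for n >= 1; returns 0 for n == 0 in this model. */
int model_ilog2(unsigned int n)
{
	int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Models guess_kmem_cache_pages(): pages needed for nr_objs objects of
 * obj_size bytes, rounded up, plus a logarithmic slack term.
 */
int model_guess_cache_pages(int obj_size, int nr_objs)
{
	int guess = (obj_size * nr_objs + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

	guess += model_ilog2(guess);
	return guess;
}
```

For example, 100 objects of 200 bytes need 20000 bytes, which rounds up to 5
pages, and the slack term adds ilog2(5) = 2 for a total of 7 pages.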

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/ipv4/ip_fragment.c     |    1 +
 net/ipv4/route.c           |   19 ++++++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c |   14 +++++++++++++-
 net/ipv6/reassembly.c      |    1 +
 net/ipv6/route.c           |   19 ++++++++++++++++++-
 net/ipv6/sysctl_net_ipv6.c |   13 ++++++++++++-
 6 files changed, 63 insertions(+), 4 deletions(-)

Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c	2007-03-26 12:01:01.000000000 +0200
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c	2007-03-26 12:37:19.000000000 +0200
@@ -18,6 +18,7 @@
 #include <net/route.h>
 #include <net/tcp.h>
 #include <net/cipso_ipv4.h>
+#include <net/sock.h>
 
 /* From af_inet.c */
 extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c
 
 }
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	if (write)
+		skb_reserve_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+
 ctl_table ipv4_table[] = {
 	{
 		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
 		.data		= &sysctl_ipfrag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c	2007-03-26 12:01:01.000000000 +0200
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c	2007-03-26 12:37:52.000000000 +0200
@@ -15,6 +15,17 @@
 
 #ifdef CONFIG_SYSCTL
 
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int old_thresh = *(int *)table->data;
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	if (write)
+		skb_reserve_memory(*(int *)table->data - old_thresh);
+	return ret;
+}
+
 static ctl_table ipv6_table[] = {
 	{
 		.ctl_name	= NET_IPV6_ROUTE,
@@ -44,7 +55,7 @@ static ctl_table ipv6_table[] = {
 		.data		= &sysctl_ip6frag_high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c	2007-03-26 12:01:01.000000000 +0200
+++ linux-2.6-git/net/ipv4/ip_fragment.c	2007-03-26 12:03:07.000000000 +0200
@@ -743,6 +743,7 @@ void ipfrag_init(void)
 	ipfrag_secret_timer.function = ipfrag_secret_rebuild;
 	ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
 	add_timer(&ipfrag_secret_timer);
+	skb_reserve_memory(sysctl_ipfrag_high_thresh);
 }
 
 EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c	2007-03-26 12:01:01.000000000 +0200
+++ linux-2.6-git/net/ipv6/reassembly.c	2007-03-26 12:03:07.000000000 +0200
@@ -772,4 +772,5 @@ void __init ipv6_frag_init(void)
 	ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
 	ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
 	add_timer(&ip6_frag_secret_timer);
+	skb_reserve_memory(sysctl_ip6frag_high_thresh);
 }
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c	2007-03-26 12:01:01.000000000 +0200
+++ linux-2.6-git/net/ipv4/route.c	2007-03-26 12:31:43.000000000 +0200
@@ -2884,6 +2884,21 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_pages;
+	int old_pages = guess_kmem_cache_pages(ipv4_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	new_pages = guess_kmem_cache_pages(ipv4_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	if (write && (new_pages - old_pages))
+		aux_reserve_memory(new_pages - old_pages);
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
 	{
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2926,7 +2941,7 @@ ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_rt_size,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3153,6 +3168,8 @@ int __init ip_rt_init(void)
 
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	aux_reserve_memory(guess_kmem_cache_pages(ipv4_dst_ops.kmem_cachep,
+				ip_rt_max_size));
 
 	devinet_init();
 	ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c	2007-03-26 12:02:29.000000000 +0200
+++ linux-2.6-git/net/ipv6/route.c	2007-03-26 12:37:43.000000000 +0200
@@ -2370,6 +2370,21 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret;
+	int new_pages;
+	int old_pages = guess_kmem_cache_pages(ip6_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+	new_pages = guess_kmem_cache_pages(ip6_dst_ops.kmem_cachep,
+			*(int *)table->data);
+	if (write && (new_pages - old_pages))
+		aux_reserve_memory(new_pages - old_pages);
+	return ret;
+}
+
 ctl_table ipv6_route_table[] = {
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_FLUSH,
@@ -2393,7 +2408,7 @@ ctl_table ipv6_route_table[] = {
 		.data		=	&ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-		.proc_handler	=	&proc_dointvec,
+		.proc_handler	=	&proc_dointvec_rt_size,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2478,6 +2493,8 @@ void __init ip6_route_init(void)
 
 	proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
 #endif
+	aux_reserve_memory(guess_kmem_cache_pages(ip6_dst_ops.kmem_cachep,
+				ip6_rt_max_size));
 #ifdef CONFIG_XFRM
 	xfrm6_init();
 #endif

--


* [PATCH 16/40] netvm: hook skb allocation to reserves
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 15/40] netvm: INET reserves Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 14:07   ` Arnaldo Carvalho de Melo
  2007-05-04 10:27 ` [PATCH 17/40] netvm: filter emergency skbs Peter Zijlstra
                   ` (25 subsequent siblings)
  41 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm-skbuff-reserve.patch --]
[-- Type: text/plain, Size: 14283 bytes --]

Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. Skbs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref. 

(NOTE the extra atomic overhead is only for those pages allocated from the
reserves - it does not affect the normal fast path.)
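
The fallback can be sketched in user space as follows. Everything named
model_* is hypothetical: the knobs simulate memory pressure and the presence
of SOCK_VMIO sockets, malloc() stands in for the slab allocator, and the
reserve is charged in plain bytes rather than power-of-two units. Returning
the charge when an emergency buffer is freed is an assumption of this sketch.

```c
#include <stdlib.h>

#define MODEL_ALLOC_RX 0x02

struct model_buf {
	int emergency;	/* set when the buffer came from the reserve */
	int size;
};

/* Hypothetical knobs standing in for system state. */
int model_normal_pool_empty;	/* 1 simulates memory pressure */
int model_vmio_socks;		/* number of VM-critical sockets */
int model_reserve_used;
int model_reserve_limit = 8192;

/* Charge the reserve; returns 1 on success, 0 if it would overflow. */
int model_reserve_get(int bytes)
{
	if (model_reserve_used + bytes >= model_reserve_limit)
		return 0;
	model_reserve_used += bytes;
	return 1;
}

struct model_buf *model_alloc(int size, int flags)
{
	struct model_buf *buf = NULL;
	int emergency = 0;

retry:
	/* The normal path fails under pressure; the emergency path may not. */
	if (emergency || !model_normal_pool_empty)
		buf = malloc(sizeof(*buf));
	if (buf) {
		buf->emergency = emergency;
		buf->size = size;
		return buf;
	}

	/* Only RX allocations may fall back, and only if someone needs it. */
	if (!(flags & MODEL_ALLOC_RX) || !model_vmio_socks)
		return NULL;

	if (!emergency) {
		if (model_reserve_get(size)) {
			emergency = 1;
			goto retry;
		}
	} else {
		model_reserve_used -= size;	/* retry failed, undo charge */
	}
	return NULL;
}

void model_free(struct model_buf *buf)
{
	if (!buf)
		return;
	if (buf->emergency)
		model_reserve_used -= buf->size;
	free(buf);
}
```

The normal path pays nothing extra: the reserve is only touched after an
ordinary allocation has already failed.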

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/skbuff.h |   22 +++++-
 net/core/skbuff.c      |  161 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 157 insertions(+), 26 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h
+++ linux-2.6-git/include/linux/skbuff.h
@@ -277,7 +277,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -323,10 +324,19 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+#ifdef CONFIG_NETVM
+#define skb_emergency(skb)	unlikely((skb)->emergency)
+#else
+#define skb_emergency(skb)	false
+#endif
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -336,7 +346,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern void	       kfree_skbmem(struct sk_buff *skb);
@@ -1279,7 +1289,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -1325,6 +1336,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1341,7 +1353,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c
+++ linux-2.6-git/net/core/skbuff.c
@@ -144,21 +144,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+	if (flags & SKB_ALLOC_RX)
+		gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+#endif
 
+retry_alloc:
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
-		goto out;
+		goto noskb;
 
-	size = SKB_DATA_ALIGN(size);
 	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
 			gfp_mask, node);
 	if (!data)
@@ -168,6 +175,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	 * See comment in sk_buff definition, just before the 'tail' member
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -184,7 +192,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -192,12 +200,31 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
+
 nodata:
 	kmem_cache_free(cache, skb);
 	skb = NULL;
+noskb:
+#ifdef CONFIG_NETVM
+	/* Attempt emergency allocation when RX skb. */
+	if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+		goto out;
+
+	if (!emergency) {
+		if (rx_emergency_get(size)) {
+			gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			gfp_mask |= __GFP_EMERGENCY;
+			emergency = 1;
+			goto retry_alloc;
+		}
+	} else
+		rx_emergency_put(size);
+#endif
+
 	goto out;
 }
 
@@ -220,7 +247,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -233,10 +260,34 @@ struct page *__netdev_alloc_page(struct 
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+#ifdef CONFIG_NETVM
+	gfp_mask |= __GFP_NOMEMALLOC | __GFP_NOWARN;
+#endif
+
 	page = alloc_pages_node(node, gfp_mask, 0);
+
+#ifdef CONFIG_NETVM
+	if (!page && rx_emergency_get(PAGE_SIZE)) {
+		gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_NOWARN);
+		gfp_mask |= __GFP_EMERGENCY;
+		page = alloc_pages_node(node, gfp_mask, 0);
+		if (!page)
+			rx_emergency_put(PAGE_SIZE);
+	}
+#endif
+
 	return page;
 }
 
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+#ifdef CONFIG_NETVM
+	if (unlikely(page->index == 0))
+		rx_emergency_put(PAGE_SIZE);
+#endif
+	__free_page(page);
+}
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -244,6 +295,33 @@ void skb_add_rx_frag(struct sk_buff *skb
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+	/*
+	 * Fix-up the emergency accounting; make sure all pages match
+	 * skb->emergency.
+	 *
+	 * This relies on the page rank (page->index) to be preserved between
+	 * the call to __netdev_alloc_page() and this call.
+	 */
+	if (skb_emergency(skb)) {
+		/*
+		 * If the page rank wasn't 0 (ALLOC_NO_WATERMARK) we can use
+		 * overcommit accounting, since we already have the memory.
+		 */
+		if (page->index != 0)
+			rx_emergency_get_overcommit(PAGE_SIZE);
+		atomic_set((atomic_t *)&page->index, 1);
+	} else if (unlikely(page->index == 0)) {
+		/*
+		 * Rare case; the skb wasn't allocated under pressure but
+		 * the page was. We need to return the page. This can offset
+		 * the accounting a little, but it's a constant shift; it does
+		 * not accumulate.
+		 */
+		rx_emergency_put(PAGE_SIZE);
+	}
+#endif
 }
 
 static void skb_drop_list(struct sk_buff **listp)
@@ -272,21 +350,40 @@ static void skb_clone_fraglist(struct sk
 		skb_get(list);
 }
 
+static inline void skb_get_page(struct sk_buff *skb, struct page *page)
+{
+	get_page(page);
+	if (skb_emergency(skb))
+		atomic_inc((atomic_t *)&page->index);
+}
+
+static inline void skb_put_page(struct sk_buff *skb, struct page *page)
+{
+	if (skb_emergency(skb) &&
+			atomic_dec_and_test((atomic_t *)&page->index))
+		rx_emergency_put(PAGE_SIZE);
+	put_page(page);
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+		int size = skb->end - skb->head;
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+				skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 		}
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
 		kfree(skb->head);
+		if (skb_emergency(skb))
+			rx_emergency_put(size);
 	}
 }
 
@@ -405,6 +502,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_EMERGENCY;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -440,6 +540,7 @@ struct sk_buff *skb_clone(struct sk_buff
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
 #endif
+	C(emergency);
 	C(protocol);
 	n->destructor = NULL;
 	C(mark);
@@ -516,6 +617,8 @@ static void copy_skb_header(struct sk_bu
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+#define skb_alloc_rx(skb) (skb_emergency(skb) ? SKB_ALLOC_RX : 0)
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -536,15 +639,17 @@ static void copy_skb_header(struct sk_bu
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb->data - skb->head;
+	int size;
 	/*
 	 *	Allocate the copy buffer
 	 */
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end + skb->data_len, gfp_mask);
+	size = skb->end + skb->data_len;
 #else
-	n = alloc_skb(skb->end - skb->head + skb->data_len, gfp_mask);
+	size = skb->end - skb->head + skb->data_len;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx(skb), -1);
 	if (!n)
 		return NULL;
 
@@ -581,12 +686,14 @@ struct sk_buff *pskb_copy(struct sk_buff
 	/*
 	 *	Allocate the copy buffer
 	 */
+	int size;
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end, gfp_mask);
+	size = skb->end;
 #else
-	n = alloc_skb(skb->end - skb->head, gfp_mask);
+	size = skb->end - skb->head;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx(skb), -1);
 	if (!n)
 		goto out;
 
@@ -607,8 +714,9 @@ struct sk_buff *pskb_copy(struct sk_buff
 		int i;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			skb_shinfo(n)->frags[i] = *frag;
+			skb_get_page(n, frag->page);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -656,6 +764,14 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
+	if (skb_emergency(skb)) {
+		if (rx_emergency_get(size))
+			gfp_mask |= __GFP_EMERGENCY;
+		else
+			goto nodata;
+	} else
+		gfp_mask |= __GFP_NOMEMALLOC;
+
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
@@ -672,7 +788,7 @@ int pskb_expand_head(struct sk_buff *skb
 	       sizeof(struct skb_shared_info));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+		skb_get_page(skb, skb_shinfo(skb)->frags[i].page);
 
 	if (skb_shinfo(skb)->frag_list)
 		skb_clone_fraglist(skb);
@@ -752,8 +868,8 @@ struct sk_buff *skb_copy_expand(const st
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+				        gfp_mask, skb_alloc_rx(skb), -1);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
 	int off = 0;
@@ -869,7 +985,7 @@ drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
@@ -1038,7 +1154,7 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1599,6 +1715,7 @@ static inline void skb_split_no_header(s
 			skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
 
 			if (pos < len) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
 				/* Split frag.
 				 * We have two variants in this case:
 				 * 1. Move all the frag to the second
@@ -1607,7 +1724,7 @@ static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				skb_get_page(skb1, page);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -1933,7 +2050,8 @@ struct sk_buff *skb_segment(struct sk_bu
 		if (hsize > len || !sg)
 			hsize = len;
 
-		nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
+		nskb = __alloc_skb(hsize + doffset + headroom, GFP_ATOMIC,
+				   skb_alloc_rx(skb), -1);
 		if (unlikely(!nskb))
 			goto err;
 
@@ -1977,7 +2095,7 @@ struct sk_buff *skb_segment(struct sk_bu
 			BUG_ON(i >= nfrags);
 
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			skb_get_page(nskb, frag->page);
 			size = frag->size;
 
 			if (pos < offset) {
@@ -2222,6 +2340,7 @@ EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(__netdev_free_page);
 EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);

--
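
The retry logic that the patch above adds to __alloc_skb() can be sketched in
user space as follows. This is only a model under stated assumptions: struct
pool, struct buf, reserve_get(), reserve_put(), raw_alloc(), rx_alloc() and
rx_free() are hypothetical stand-ins, not kernel interfaces, and the
SKB_ALLOC_RX and sk_vmio_socks() checks are left out. It shows the order of
operations: try the ordinary allocation first, charge the reserve only after
that fails, retry once, and undo the charge if the retry fails as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the two memory sources and the reserve accounting. */
struct pool {
	size_t normal_left;	/* bytes the ordinary allocator can still give */
	size_t reserve_left;	/* bytes physically left in the reserve */
	size_t charged;		/* bytes currently charged to the reserve */
	size_t reserve_limit;	/* upper bound for 'charged' */
};

struct buf {
	void *data;
	size_t size;
	bool emergency;		/* plays the role of skb->emergency */
};

/* Stand-in for rx_emergency_get(): charge the reserve if the limit allows. */
static bool reserve_get(struct pool *p, size_t size)
{
	if (p->charged + size > p->reserve_limit)
		return false;
	p->charged += size;
	return true;
}

/* Stand-in for rx_emergency_put(): undo an earlier charge. */
static void reserve_put(struct pool *p, size_t size)
{
	assert(p->charged >= size);
	p->charged -= size;
}

/* 'emergency' plays the role of __GFP_EMERGENCY: it opens up the reserve. */
static void *raw_alloc(struct pool *p, size_t size, bool emergency)
{
	size_t *left = emergency ? &p->reserve_left : &p->normal_left;
	void *mem;

	if (*left < size)
		return NULL;
	mem = malloc(size);
	if (mem)
		*left -= size;
	return mem;
}

/* Normal attempt first; on failure charge the reserve and retry once. */
static bool rx_alloc(struct pool *p, size_t size, struct buf *out)
{
	bool emergency = false;
	void *data;

retry:
	data = raw_alloc(p, size, emergency);
	if (data) {
		out->data = data;
		out->size = size;
		out->emergency = emergency;
		return true;
	}

	if (!emergency) {
		if (reserve_get(p, size)) {
			emergency = true;
			goto retry;
		}
	} else {
		/* The retry failed too: give the charge back. */
		reserve_put(p, size);
	}
	return false;
}

/* Freeing an emergency buffer also releases its charge. */
static void rx_free(struct pool *p, struct buf *b)
{
	free(b->data);
	b->data = NULL;
	if (b->emergency) {
		p->reserve_left += b->size;
		reserve_put(p, b->size);
	} else {
		p->normal_left += b->size;
	}
}
```

The point of charging before the retry is that the reserve charge stays
bounded by reserve_limit even when several allocations fail at once.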

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 17/40] netvm: filter emergency skbs.
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 16/40] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 18/40] netvm: prevent a TCP specific deadlock Peter Zijlstra
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm-sk_filter.patch --]
[-- Type: text/plain, Size: 975 bytes --]

Toss all emergency packets not for a SOCK_VMIO socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    3 +++
 1 file changed, 3 insertions(+)
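
As an illustration of the rule described in the changelog, the check can be
modelled in user space like this. It is a sketch only: struct pkt,
struct sock_model and filter_gate() are hypothetical names, with the two bool
fields standing in for skb_emergency() and sk_has_vmio().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the skb and socket state the check looks at. */
struct pkt {
	bool emergency;		/* stands in for skb_emergency(skb) */
};

struct sock_model {
	bool vmio;		/* stands in for sk_has_vmio(sk) */
};

/*
 * Reject packets that were allocated from the reserve unless the receiving
 * socket services the VM; everything else passes on to the later checks.
 */
static int filter_gate(const struct sock_model *sk, const struct pkt *skb)
{
	if (skb->emergency && !sk->vmio)
		return -EPERM;
	return 0;
}
```

Only one of the four combinations is refused: an emergency packet arriving on
a socket that is not marked for VM I/O.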

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 16:15:49.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 16:16:27.000000000 +0100
@@ -926,6 +926,9 @@ static inline int sk_filter(struct sock 
 {
 	int err;
 	struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_vmio(sk))
+		return -EPERM;
 	
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 18/40] netvm: prevent a TCP specific deadlock
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 17/40] netvm: filter emergency skbs Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm-tcp-deadlock.patch --]
[-- Type: text/plain, Size: 2821 bytes --]

It could happen that all !SOCK_VMIO sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_VMIO sockets
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Let emergency skbs bypass the hard limit so that SOCK_VMIO sockets can keep
receiving.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h  |    7 ++++---
 net/core/stream.c   |    5 +++--
 net/ipv4/tcp_ipv4.c |    8 ++++++++
 net/ipv6/tcp_ipv6.c |    8 ++++++++
 4 files changed, 23 insertions(+), 5 deletions(-)
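
The hard-limit change in net/core/stream.c can be sketched in user space as
below. This is a simplified model, not the kernel function: struct
proto_model, struct pkt and passes_hard_limit() are hypothetical, and the
later "under pressure" checks of sk_stream_mem_schedule() are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-protocol memory accounting. */
struct proto_model {
	long allocated;		/* like *sk->sk_prot->memory_allocated */
	long hard_limit;	/* like sk->sk_prot->sysctl_mem[2] */
	bool under_pressure;	/* set by the enter_memory_pressure() stand-in */
};

struct pkt {
	bool emergency;		/* stands in for skb_emergency(skb) */
};

/*
 * Returns false when the allocation must be suppressed at the hard limit.
 * A NULL skb models the write side, which never gets the exemption.
 */
static bool passes_hard_limit(struct proto_model *prot, const struct pkt *skb)
{
	if (prot->allocated > prot->hard_limit) {
		prot->under_pressure = true;
		if (!skb || !skb->emergency)
			return false;
	}
	return true;
}
```

Note that memory pressure is still entered for emergency skbs; only the
suppression is skipped.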

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-02-14 12:09:05.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-02-14 12:09:21.000000000 +0100
@@ -730,7 +730,8 @@ static inline struct inode *SOCK_INODE(s
 }
 
 extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+		int size, int kind);
 
 #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
 
@@ -757,13 +758,13 @@ static inline void sk_stream_writequeue_
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	return (int)skb->truesize <= sk->sk_forward_alloc ||
-		sk_stream_mem_schedule(sk, skb->truesize, 1);
+		sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
 }
 
 static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
 {
 	return size <= sk->sk_forward_alloc ||
-	       sk_stream_mem_schedule(sk, size, 0);
+	       sk_stream_mem_schedule(sk, NULL, size, 0);
 }
 
 /* Used by processes to "lock" a socket state, so that
Index: linux-2.6-git/net/core/stream.c
===================================================================
--- linux-2.6-git.orig/net/core/stream.c	2007-02-14 12:09:05.000000000 +0100
+++ linux-2.6-git/net/core/stream.c	2007-02-14 12:09:21.000000000 +0100
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
 
 EXPORT_SYMBOL(__sk_stream_mem_reclaim);
 
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
 {
 	int amt = sk_stream_pages(size);
 
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
 	/* Over hard limit. */
 	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
 		sk->sk_prot->enter_memory_pressure();
-		goto suppress_allocation;
+		if (!skb || !skb_emergency(skb))
+			goto suppress_allocation;
 	}
 
 	/* Under pressure. */

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 18/40] netvm: prevent a TCP specific deadlock Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 20/40] netvm: skb processing Peter Zijlstra
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Patrick McHardy

[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 1210 bytes --]

Avoid memory getting stuck waiting for userspace: drop all emergency packets
that receive an NF_QUEUE verdict. This of course requires that the regular
storage route does not include an NF_QUEUE target ;-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/core.c |    3 +++
 1 file changed, 3 insertions(+)
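
A small user-space sketch of the verdict handling described above, under the
assumption that only the drop and queue verdicts matter here. The enum names
and apply_verdict() are hypothetical and deliberately do not reuse the
netfilter constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts and outcomes; not the netfilter definitions. */
enum verdict { V_ACCEPT, V_DROP, V_QUEUE };
enum outcome { OUT_CONTINUE, OUT_DROPPED, OUT_QUEUED };

/*
 * An emergency packet must never wait on a userspace queue, so a queue
 * verdict for such a packet is handled exactly like a drop verdict.
 */
static enum outcome apply_verdict(enum verdict v, bool emergency)
{
	if (v == V_QUEUE && emergency)
		v = V_DROP;

	switch (v) {
	case V_DROP:
		return OUT_DROPPED;
	case V_QUEUE:
		return OUT_QUEUED;
	default:
		return OUT_CONTINUE;
	}
}
```

Ordinary packets keep their queue verdict; only packets backed by the reserve
are turned away.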

Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c	2007-02-22 15:48:28.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c	2007-02-26 14:23:25.000000000 +0100
@@ -184,9 +184,12 @@ next_hook:
 		ret = 1;
 		goto unlock;
 	} else if (verdict == NF_DROP) {
+drop:
 		kfree_skb(*pskb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK)  == NF_QUEUE) {
+		if (skb_emergency(*pskb))
+			goto drop;
 		NFDEBUG("nf_hook: Verdict = QUEUE.\n");
 		if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 20/40] netvm: skb processing
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 21/40] uml: rename arch/um remove_mapping() Peter Zijlstra
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netvm.patch --]
[-- Type: text/plain, Size: 4796 bytes --]

In order to make sure emergency packets receive all the memory needed to
proceed, ensure processing of emergency skbs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    4 ++++
 net/core/dev.c     |   42 +++++++++++++++++++++++++++++++++++++-----
 net/core/sock.c    |   19 +++++++++++++++++++
 3 files changed, 60 insertions(+), 5 deletions(-)
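
The save/set/restore pattern used around PF_MEMALLOC can be sketched in user
space as follows. The sketch assumes that tsk_restore_flags() puts only the
masked bits back to their saved values (its definition is not part of this
patch). F_MEMALLOC, F_OTHER, struct task_model, restore_flags(),
process_packet() and observe_handler() are hypothetical names with arbitrary
bit values.

```c
#include <assert.h>
#include <stdbool.h>

#define F_MEMALLOC	0x1UL	/* hypothetical stand-in for PF_MEMALLOC */
#define F_OTHER		0x2UL	/* some unrelated task flag */

struct task_model {
	unsigned long flags;	/* stands in for current->flags */
};

/* Put only the bits in 'mask' back to the values they had in 'orig'. */
static void restore_flags(struct task_model *t, unsigned long orig,
			  unsigned long mask)
{
	t->flags &= ~mask;
	t->flags |= orig & mask;
}

typedef int (*rcv_fn)(struct task_model *t, void *arg);

/* Run the handler with the flag set for emergency packets only. */
static int process_packet(struct task_model *t, bool emergency,
			  rcv_fn handler, void *arg)
{
	unsigned long pflags = t->flags;
	int ret;

	if (emergency)
		t->flags |= F_MEMALLOC;

	ret = handler(t, arg);

	restore_flags(t, pflags, F_MEMALLOC);
	return ret;
}

/* Example handler: records whether it ran with the flag set. */
static int observe_handler(struct task_model *t, void *arg)
{
	*(bool *)arg = (t->flags & F_MEMALLOC) != 0;
	return 0;
}
```

Restoring the saved bit, rather than clearing it unconditionally, matters when
the caller was already running with the flag set: the nested call must not
strip it.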

Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c
+++ linux-2.6-git/net/core/dev.c
@@ -1756,10 +1756,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skbs are special; they should
+	 *  - be delivered to SOCK_VMIO sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor man's memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (skb_emergency(skb))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (skb->dev->poll && netpoll_rx(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
@@ -1770,7 +1783,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -1789,6 +1802,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (skb_emergency(skb))
+		goto skip_taps;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -1797,6 +1813,7 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	if (pt_prev) {
 		ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1809,16 +1826,28 @@ int netif_receive_skb(struct sk_buff *sk
 
 	if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
 		kfree_skb(skb);
-		goto out;
+		goto unlock;
 	}
 
 	skb->tc_verd = 0;
 ncls:
 #endif
 
+	if (skb_emergency(skb))
+		switch(skb->protocol) {
+			case __constant_htons(ETH_P_ARP):
+			case __constant_htons(ETH_P_IP):
+			case __constant_htons(ETH_P_IPV6):
+			case __constant_htons(ETH_P_8021Q):
+				break;
+
+			default:
+				goto drop;
+		}
+
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1833,6 +1862,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -1840,8 +1870,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 	return ret;
 }
 
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h
+++ linux-2.6-git/include/net/sock.h
@@ -527,10 +527,14 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+#ifndef CONFIG_NETVM
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	return sk->sk_backlog_rcv(sk, skb);
 }
+#else
+extern int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+#endif
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
 ({	int rc;							\
Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c
+++ linux-2.6-git/net/core/sock.c
@@ -332,6 +332,25 @@ int sk_clear_vmio(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_vmio);
 
+#ifdef CONFIG_NETVM
+int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	if (skb_emergency(skb)) {
+		int ret;
+		unsigned long pflags = current->flags;
+		/* these should have been dropped before queueing */
+		BUG_ON(!sk_has_vmio(sk));
+		current->flags |= PF_MEMALLOC;
+		ret = sk->sk_backlog_rcv(sk, skb);
+		tsk_restore_flags(current, pflags, PF_MEMALLOC);
+		return ret;
+	}
+
+	return sk->sk_backlog_rcv(sk, skb);
+}
+EXPORT_SYMBOL(sk_backlog_rcv);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 21/40] uml: rename arch/um remove_mapping()
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 20/40] netvm: skb processing Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 22/40] mm: prepare swap entry methods for use in page methods Peter Zijlstra
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Jeff Dike

[-- Attachment #1: uml_remove_mapping.patch --]
[-- Type: text/plain, Size: 1566 bytes --]

When 'include/linux/mm.h' includes 'include/linux/swap.h', the global
remove_mapping() definition clashes with the arch/um one.

Rename the arch/um one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jeff Dike <jdike@addtoit.com>
---
 arch/um/kernel/physmem.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-git/arch/um/kernel/physmem.c
===================================================================
--- linux-2.6-git.orig/arch/um/kernel/physmem.c	2007-02-12 09:40:47.000000000 +0100
+++ linux-2.6-git/arch/um/kernel/physmem.c	2007-02-12 11:17:47.000000000 +0100
@@ -160,7 +160,7 @@ int physmem_subst_mapping(void *virt, in
 
 static int physmem_fd = -1;
 
-static void remove_mapping(struct phys_desc *desc)
+static void um_remove_mapping(struct phys_desc *desc)
 {
 	void *virt = desc->virt;
 	int err;
@@ -184,7 +184,7 @@ int physmem_remove_mapping(void *virt)
 	if(desc == NULL)
 		return 0;
 
-	remove_mapping(desc);
+	um_remove_mapping(desc);
 	return 1;
 }
 
@@ -205,7 +205,7 @@ void physmem_forget_descriptor(int fd)
 		page = list_entry(ele, struct phys_desc, list);
 		offset = page->offset;
 		addr = page->virt;
-		remove_mapping(page);
+		um_remove_mapping(page);
 		err = os_seek_file(fd, offset);
 		if(err)
 			panic("physmem_forget_descriptor - failed to seek "

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 22/40] mm: prepare swap entry methods for use in page methods
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 21/40] uml: rename arch/um remove_mapping() Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 23/40] mm: add support for non block device backed swap files Peter Zijlstra
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-swap_entry_methods.patch --]
[-- Type: text/plain, Size: 5794 bytes --]

Move around the swap entry methods in preparation for use from
page methods.

Also provide a function to obtain the swap_info_struct backing
a swap cache page.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 include/linux/mm.h      |    8 ++++++++
 include/linux/swap.h    |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/swapops.h |   44 --------------------------------------------
 mm/swapfile.c           |    1 +
 4 files changed, 57 insertions(+), 44 deletions(-)
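
For reference, the encoding that the moved helpers implement can be exercised
in user space. The helper bodies below follow the ones in this patch; the
typedefs and the value 5 for MAX_SWAPFILES_SHIFT are assumptions made for the
sketch (the comment in the patch speaks of five high-order bits).

```c
#include <assert.h>

#define MAX_SWAPFILES_SHIFT	5	/* assumed: 'type' uses the top 5 bits */

/* User-space stand-ins for the kernel types. */
typedef unsigned long pgoff_t;
typedef struct {
	unsigned long val;
} swp_entry_t;

#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)

/* Pack type into the high-order bits and offset into the remaining bits. */
static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
	swp_entry_t ret;

	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
			(offset & SWP_OFFSET_MASK(ret));
	return ret;
}

/* Extract the type field. */
static inline unsigned swp_type(swp_entry_t entry)
{
	return (entry.val >> SWP_TYPE_SHIFT(entry));
}

/* Extract the offset field. */
static inline pgoff_t swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK(entry);
}
```

An offset that does not fit in the low-order bits is silently truncated by the
mask, while the type field is left untouched.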

Index: linux-2.6-git/include/linux/mm.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/mm.h	2007-02-21 12:15:01.000000000 +0100
@@ -17,6 +17,7 @@
 #include <linux/debug_locks.h>
 #include <linux/backing-dev.h>
 #include <linux/mm_types.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -586,6 +587,13 @@ static inline struct address_space *page
 	return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/swap.h	2007-02-21 12:15:01.000000000 +0100
@@ -79,6 +79,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+	swp_entry_t ret;
+
+	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+			(offset & SWP_OFFSET_MASK(ret));
+	return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+	return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+	return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
  * current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
 	return 0;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+	return NULL;
+}
 #define can_share_swap_page(p)			(page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6-git/include/linux/swapops.h
===================================================================
--- linux-2.6-git.orig/include/linux/swapops.h	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/include/linux/swapops.h	2007-02-21 12:15:01.000000000 +0100
@@ -1,48 +1,4 @@
 /*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)	(sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e)	((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-	swp_entry_t ret;
-
-	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-			(offset & SWP_OFFSET_MASK(ret));
-	return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-	return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-	return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
  */
Index: linux-2.6-git/mm/swapfile.c
===================================================================
--- linux-2.6-git.orig/mm/swapfile.c	2007-02-21 12:15:00.000000000 +0100
+++ linux-2.6-git/mm/swapfile.c	2007-02-21 12:15:01.000000000 +0100
@@ -1764,6 +1764,7 @@ get_swap_info_struct(unsigned type)
 {
 	return &swap_info[type];
 }
+EXPORT_SYMBOL_GPL(get_swap_info_struct);
 
 /*
  * swap_lock prevents swap_map being freed. Don't grab an extra

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 23/40] mm: add support for non block device backed swap files
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 22/40] mm: prepare swap entry methods for use in page methods Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-swapfile.patch --]
[-- Type: text/plain, Size: 8828 bytes --]

A new address_space_operations method is added:
  int swapfile(struct address_space *, int)

If this method is present and returns no error during sys_swapon(), the
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like 
reserving memory for mempools or the like).
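
As an illustration only, the swapon/swapoff handshake described above can be
modelled in user space roughly as follows. The struct layouts, the demo_*
names and the 'reserved' field are assumptions made for this sketch and are
not the kernel definitions; only the SWP_* flag values and the
swapfile(mapping, 1) / swapfile(mapping, 0) calling convention are taken from
the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as in include/linux/swap.h after this patch. */
enum { SWP_USED = 1 << 0, SWP_WRITEOK = 1 << 1, SWP_FILE = 1 << 2 };

struct address_space;

/* Reduced stand-in for the kernel structure: only the new method. */
struct address_space_operations {
	/* non-zero second argument: claim for swap; zero: release */
	int (*swapfile)(struct address_space *mapping, int enable);
};

struct address_space {
	const struct address_space_operations *a_ops;
	int reserved;	/* hypothetical: set while swap relies on this mapping */
};

/* Hypothetical, reduced swap_info_struct. */
struct swap_info {
	unsigned int flags;
	struct address_space *mapping;
};

/* Hypothetical filesystem hook: take or drop its reservation. */
static int demo_swapfile(struct address_space *mapping, int enable)
{
	mapping->reserved = enable ? 1 : 0;
	return 0;
}

/* Models the swapon side: ask the mapping, then mark the area SWP_FILE. */
static int demo_swapon(struct swap_info *sis)
{
	const struct address_space_operations *a_ops;
	int ret;

	assert(sis != NULL && sis->mapping != NULL);
	a_ops = sis->mapping->a_ops;
	sis->flags = SWP_USED;
	if (a_ops->swapfile) {
		ret = a_ops->swapfile(sis->mapping, 1);
		if (ret)
			return ret;	/* method present but refused: swapon fails */
		sis->flags |= SWP_FILE;
	}
	/* OR in SWP_WRITEOK; a plain assignment would clobber SWP_FILE. */
	sis->flags |= SWP_WRITEOK;
	return 0;
}

/* Models the swapoff side: release the mapping if it was claimed. */
static void demo_swapoff(struct swap_info *sis)
{
	if (sis->flags & SWP_FILE) {
		sis->flags &= ~SWP_FILE;
		sis->mapping->a_ops->swapfile(sis->mapping, 0);
	}
	sis->flags = 0;
}
```

The final OR in demo_swapon() is presumably why the sys_swapon() hunk replaces
the assignment of SWP_ACTIVE with an OR of SWP_WRITEOK: the SWP_FILE bit set
earlier must survive.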

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 Documentation/filesystems/Locking |    9 ++++++
 include/linux/fs.h                |    1 
 include/linux/swap.h              |    3 ++
 mm/Kconfig                        |    4 ++
 mm/page_io.c                      |   55 ++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c                   |    5 +++
 mm/swapfile.c                     |   22 ++++++++++++++-
 7 files changed, 98 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h	2007-03-26 13:40:38.000000000 +0200
+++ linux-2.6-git/include/linux/swap.h	2007-03-26 13:40:43.000000000 +0200
@@ -163,6 +163,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -264,6 +265,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern int end_swap_bio_read(struct bio *bio, unsigned int bytes_done, int err);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6-git/mm/page_io.c
===================================================================
--- linux-2.6-git.orig/mm/page_io.c	2007-03-26 13:34:51.000000000 +0200
+++ linux-2.6-git/mm/page_io.c	2007-03-26 13:40:43.000000000 +0200
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -110,6 +111,18 @@ int swap_writepage(struct page *page, st
 		unlock_page(page);
 		goto out;
 	}
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->writepage(page, wbc);
+			if (!ret)
+				count_vm_event(PSWPOUT);
+			return ret;
+		}
+	}
+#endif
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -128,6 +141,36 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations * a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->set_page_dirty)
+			return a_ops->set_page_dirty(page);
+		return __set_page_dirty_buffers(page);
+	}
+
+	return __set_page_dirty_nobuffers(page);
+}
+#endif
+
 int swap_readpage(struct file *file, struct page *page)
 {
 	struct bio *bio;
@@ -135,6 +178,18 @@ int swap_readpage(struct file *file, str
 
 	BUG_ON(!PageLocked(page));
 	ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+	{
+		struct swap_info_struct *sis = page_swap_info(page);
+		if (sis->flags & SWP_FILE) {
+			ret = sis->swap_file->f_mapping->
+				a_ops->readpage(sis->swap_file, page);
+			if (!ret)
+				count_vm_event(PSWPIN);
+			return ret;
+		}
+	}
+#endif
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
Index: linux-2.6-git/mm/swap_state.c
===================================================================
--- linux-2.6-git.orig/mm/swap_state.c	2007-03-26 13:34:51.000000000 +0200
+++ linux-2.6-git/mm/swap_state.c	2007-03-26 13:40:43.000000000 +0200
@@ -26,8 +26,13 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+	.sync_page	= swap_sync_page,
+	.set_page_dirty	= swap_set_page_dirty,
+#else
 	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
+#endif
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6-git/mm/swapfile.c
===================================================================
--- linux-2.6-git.orig/mm/swapfile.c	2007-03-26 13:40:38.000000000 +0200
+++ linux-2.6-git/mm/swapfile.c	2007-03-26 13:40:43.000000000 +0200
@@ -981,6 +981,13 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+#ifdef CONFIG_SWAP_FILE
+	if (sis->flags & SWP_FILE) {
+		sis->flags &= ~SWP_FILE;
+		sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 0);
+	}
+#endif
 }
 
 /*
@@ -1073,6 +1080,19 @@ static int setup_swap_extents(struct swa
 		goto done;
 	}
 
+#ifdef CONFIG_SWAP_FILE
+	if (sis->swap_file->f_mapping->a_ops->swapfile) {
+		ret = sis->swap_file->f_mapping->a_ops->
+			swapfile(sis->swap_file->f_mapping, 1);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+#endif
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1640,7 +1660,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
Index: linux-2.6-git/include/linux/fs.h
===================================================================
--- linux-2.6-git.orig/include/linux/fs.h	2007-03-26 13:34:51.000000000 +0200
+++ linux-2.6-git/include/linux/fs.h	2007-03-26 13:40:43.000000000 +0200
@@ -428,6 +428,7 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*swapfile)(struct address_space *, int);
 };
 
 struct backing_dev_info;
Index: linux-2.6-git/Documentation/filesystems/Locking
===================================================================
--- linux-2.6-git.orig/Documentation/filesystems/Locking	2007-03-26 13:34:51.000000000 +0200
+++ linux-2.6-git/Documentation/filesystems/Locking	2007-03-26 13:40:43.000000000 +0200
@@ -172,6 +172,7 @@ prototypes:
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
 	int (*launder_page) (struct page *);
+	int (*swapfile) (struct address_space *, int);
 
 locking rules:
 	All except set_page_dirty may block
@@ -190,6 +191,7 @@ invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
 launder_page:		no	yes
+swapfile		no
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -289,6 +291,13 @@ cleaned, or an error value if not. Note 
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
+	->swapfile() will be called with a non zero argument on address spaces
+backing non block device backed swapfiles. A return value of zero indicates
+success. In which case this address space can be used for backing swapspace.
+The swapspace operations will be proxied to the address space operations.
+Swapoff will call this method with a zero argument to release the address
+space.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig	2007-03-26 13:36:32.000000000 +0200
+++ linux-2.6-git/mm/Kconfig	2007-03-26 13:41:22.000000000 +0200
@@ -166,3 +166,6 @@ config ZONE_DMA_FLAG
 config SLAB_FAIR
 	def_bool n
 	depends on SLAB
+
+config SWAP_FILE
+	def_bool n

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 23/40] mm: add support for non block device backed swap files Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 25/40] nfs: remove mempools Peter Zijlstra
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: mm-page_file_methods.patch --]
[-- Type: text/plain, Size: 3067 bytes --]

In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file, in units of
PAGE_CACHE_SIZE. For regular page cache pages this is page->index; for
PG_swapcache pages it is the swap offset taken from page->private.

page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
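
A user-space sketch of the index lookup described above, for illustration
only. The reduced struct page, its 'swapcache' member (standing in for
PageSwapCache()) and the 4 KiB page size are assumptions of this sketch; the
swp_entry_t encoding follows the arch-independent layout from swapops.h, with
the type in the five high-order bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_CACHE_SHIFT	12	/* assumed: 4 KiB pages */
#define MAX_SWAPFILES_SHIFT	5	/* five type bits, as in swapops.h */

typedef unsigned long pgoff_t;
typedef struct { unsigned long val; } swp_entry_t;

#define SWP_TYPE_SHIFT	(sizeof(unsigned long) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)

/* Reduced, hypothetical page descriptor. */
struct page {
	int swapcache;		/* stand-in for PageSwapCache(page) */
	pgoff_t index;		/* file index for regular pagecache pages */
	unsigned long private;	/* swp_entry_t value for swapcache pages */
};

/* Pack type and offset: type in the high bits, offset right-aligned. */
static swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
	swp_entry_t ret;

	assert(type < (1UL << MAX_SWAPFILES_SHIFT));
	ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
	return ret;
}

static pgoff_t swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK;
}

/* Regular pages use ->index, swapcache pages use swp_offset(->private). */
static pgoff_t page_file_index(const struct page *page)
{
	if (page->swapcache) {
		swp_entry_t swap = { .val = page->private };
		return swp_offset(swap);
	}
	return page->index;
}

/* Byte offset in the backing file, valid for both kinds of page. */
static int64_t page_offset(const struct page *page)
{
	return (int64_t)page_file_index(page) << PAGE_CACHE_SHIFT;
}
```

For a swapcache page the ->index member is ignored altogether, which is why
callers that used page->index directly would compute the wrong file offset.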

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 include/linux/mm.h      |   25 +++++++++++++++++++++++++
 include/linux/pagemap.h |    2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/linux/mm.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm.h	2007-02-21 12:15:01.000000000 +0100
+++ linux-2.6-git/include/linux/mm.h	2007-02-21 12:15:07.000000000 +0100
@@ -594,6 +594,16 @@ static inline struct swap_info_struct *p
 	return get_swap_info_struct(swp_type(swap));
 }
 
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+	if (unlikely(PageSwapCache(page)))
+		return page_swap_info(page)->swap_file->f_mapping;
+#endif
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -611,6 +621,21 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+	if (unlikely(PageSwapCache(page))) {
+		swp_entry_t swap = { .val = page_private(page) };
+		return swp_offset(swap);
+	}
+#endif
+	return page->index;
+}
+
+/*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6-git/include/linux/pagemap.h
===================================================================
--- linux-2.6-git.orig/include/linux/pagemap.h	2007-02-21 12:14:54.000000000 +0100
+++ linux-2.6-git/include/linux/pagemap.h	2007-02-21 12:15:07.000000000 +0100
@@ -120,7 +120,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 25/40] nfs: remove mempools
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 26/40] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs-no-mempool.patch --]
[-- Type: text/plain, Size: 5217 bytes --]

With the introduction of shared dirty page accounting in 2.6.19, NFS should
no longer be able to surprise the VM with all pages dirty. Thus it should
always be able to free some memory. Hence no more need for mempools.
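
For readers unfamiliar with what is being removed, the following user-space
sketch models the behaviour a mempool adds on top of a plain allocator: a
small preallocated reserve that is only dipped into when the backing
allocation fails. All demo_* names, the fixed reserve size and the
'fail_backing' knob are assumptions of this sketch, not the kernel mempool
implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MIN_POOL	2	/* hypothetical reserve size */

struct demo_mempool {
	void *reserve[DEMO_MIN_POOL];	/* preallocated elements */
	int curr_nr;			/* elements currently in reserve */
	size_t size;			/* element size */
	int fail_backing;		/* test knob: simulate allocator failure */
};

/* Fill the reserve up front so it is available under memory pressure. */
static int demo_mempool_init(struct demo_mempool *pool, size_t size)
{
	assert(pool != NULL && size > 0);
	pool->size = size;
	pool->fail_backing = 0;
	pool->curr_nr = 0;
	while (pool->curr_nr < DEMO_MIN_POOL) {
		void *p = malloc(size);

		if (!p)
			return -1;	/* sketch: no unwinding of partial fill */
		pool->reserve[pool->curr_nr++] = p;
	}
	return 0;
}

/* Try the backing allocator first; fall back to the reserve on failure. */
static void *demo_mempool_alloc(struct demo_mempool *pool)
{
	void *p = pool->fail_backing ? NULL : malloc(pool->size);

	if (p)
		return p;
	if (pool->curr_nr > 0)
		return pool->reserve[--pool->curr_nr];
	return NULL;	/* a sleeping kernel allocation would wait here instead */
}

/* Refill the reserve before handing memory back to the allocator. */
static void demo_mempool_free(struct demo_mempool *pool, void *p)
{
	if (pool->curr_nr < DEMO_MIN_POOL)
		pool->reserve[pool->curr_nr++] = p;
	else
		free(p);
}
```

With a plain kmem_cache_alloc(), as used after this patch, the fallback step
is gone and an allocation failure is simply returned to the caller.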

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/read.c  |   15 +++------------
 fs/nfs/write.c |   27 +++++----------------------
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6-git/fs/nfs/read.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/read.c
+++ linux-2.6-git/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ	(32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_rdata_mempool);
+				kmem_cache_free(nfs_rdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
 	struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_rdata_mempool);
+	kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -590,16 +587,10 @@ int __init nfs_init_readpagecache(void)
 	if (nfs_rdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-						     nfs_rdata_cachep);
-	if (nfs_rdata_mempool == NULL)
-		return -ENOMEM;
-
 	return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-	mempool_destroy(nfs_rdata_mempool);
 	kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c
+++ linux-2.6-git/fs/nfs/write.c
@@ -29,9 +29,6 @@
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE		(32)
-#define MIN_POOL_COMMIT		(4)
-
 /*
  * Local function declarations
  */
@@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -64,7 +59,7 @@ void nfs_commit_rcu_free(struct rcu_head
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_commit_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -74,7 +69,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -85,7 +80,7 @@ struct nfs_write_data *nfs_writedata_all
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_wdata_mempool);
+				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -98,7 +93,7 @@ static void nfs_writedata_rcu_free(struc
 	struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_wdata_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1465,16 +1460,6 @@ int __init nfs_init_writepagecache(void)
 	if (nfs_wdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
-						     nfs_wdata_cachep);
-	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
-
-	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
-						      nfs_wdata_cachep);
-	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
-
 	/*
 	 * NFS congestion size, scale with available memory.
 	 *
@@ -1500,8 +1485,6 @@ int __init nfs_init_writepagecache(void)
 
 void nfs_destroy_writepagecache(void)
 {
-	mempool_destroy(nfs_commit_mempool);
-	mempool_destroy(nfs_wdata_mempool);
 	kmem_cache_destroy(nfs_wdata_cachep);
 }
 

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 26/40] nfs: teach the NFS client how to treat PG_swapcache pages
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 25/40] nfs: remove mempools Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 27/40] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs-swapcache.patch --]
[-- Type: text/plain, Size: 10164 bytes --]

Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.
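
To show why the index source matters, here is a user-space sketch of the
length calculation done by nfs_page_length(), written to take the file index
as a plain argument. The demo_page_length() name, its signature and the 4 KiB
page size are assumptions of this sketch; in the kernel the index would come
from page_file_index(page) and the size from i_size_read().

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT	12	/* assumed: 4 KiB pages */
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK		(~(PAGE_CACHE_SIZE - 1))

/*
 * Number of valid bytes in the page at 'file_index' of a file that is
 * 'i_size' bytes long: a full page before the last page, the remainder
 * on the last page, and nothing beyond end of file.
 */
static unsigned int demo_page_length(long long i_size, unsigned long file_index)
{
	assert(i_size >= 0);
	if (i_size > 0) {
		unsigned long end_index =
			(unsigned long)((i_size - 1) >> PAGE_CACHE_SHIFT);

		if (file_index < end_index)
			return (unsigned int)PAGE_CACHE_SIZE;
		if (file_index == end_index)
			return (unsigned int)(((i_size - 1) & ~PAGE_CACHE_MASK) + 1);
	}
	return 0;
}
```

If a swap cache page's page->index were passed here instead of its offset in
the swap file, the comparison against end_index would be made with an
unrelated value and the returned length could be wrong.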

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/file.c     |    4 ++--
 fs/nfs/internal.h |    7 ++++---
 fs/nfs/pagelist.c |    6 +++---
 fs/nfs/read.c     |    6 +++---
 fs/nfs/write.c    |   36 ++++++++++++++++++------------------
 5 files changed, 30 insertions(+), 29 deletions(-)

Index: linux-2.6-git/fs/nfs/file.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/file.c
+++ linux-2.6-git/fs/nfs/file.c
@@ -310,7 +310,7 @@ static void nfs_invalidate_page(struct p
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
-	nfs_wb_page_priority(page->mapping->host, page, FLUSH_INVALIDATE);
+	nfs_wb_page_priority(page_file_mapping(page)->host, page, FLUSH_INVALIDATE);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -321,7 +321,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-	return nfs_wb_page(page->mapping->host, page);
+	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
Index: linux-2.6-git/fs/nfs/pagelist.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/pagelist.c
+++ linux-2.6-git/fs/nfs/pagelist.c
@@ -78,11 +78,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -367,7 +367,7 @@ void nfs_pageio_complete(struct nfs_page
  * @nfsi: NFS inode
  * @head: One of the NFS inode request lists
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves elements from one of the inode request lists.
Index: linux-2.6-git/fs/nfs/read.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/read.c
+++ linux-2.6-git/fs/nfs/read.c
@@ -463,11 +463,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -514,7 +514,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 
Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c
+++ linux-2.6-git/fs/nfs/write.c
@@ -121,7 +121,7 @@ static struct nfs_page *nfs_page_find_re
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct nfs_page *req = NULL;
-	spinlock_t *req_lock = &NFS_I(page->mapping->host)->req_lock;
+	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
 
 	spin_lock(req_lock);
 	req = nfs_page_find_request_locked(page);
@@ -132,13 +132,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -149,7 +149,7 @@ static void nfs_grow_file(struct page *p
 static void nfs_set_pageerror(struct page *page)
 {
 	SetPageError(page);
-	nfs_zap_mapping(page->mapping->host, page->mapping);
+	nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
 }
 
 /* We can set the PG_uptodate flag if we see that a write request
@@ -181,7 +181,7 @@ static int nfs_writepage_setup(struct nf
 		ret = PTR_ERR(req);
 		if (ret != -EBUSY)
 			return ret;
-		ret = nfs_wb_page(page->mapping->host, page);
+		ret = nfs_wb_page(page_file_mapping(page)->host, page);
 		if (ret != 0)
 			return ret;
 	}
@@ -217,7 +217,7 @@ static int nfs_set_page_writeback(struct
 	int ret = test_set_page_writeback(page);
 
 	if (!ret) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_inc_return(&nfss->writeback) >
@@ -229,7 +229,7 @@ static int nfs_set_page_writeback(struct
 
 static void nfs_end_page_writeback(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
@@ -250,7 +250,7 @@ static int nfs_page_async_flush(struct n
 				struct page *page)
 {
 	struct nfs_page *req;
-	struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+	struct nfs_inode *nfsi = NFS_I(page_file_mapping(page)->host);
 	spinlock_t *req_lock = &nfsi->req_lock;
 	int ret;
 
@@ -303,7 +303,7 @@ static int nfs_writepage_locked(struct p
 {
 	struct nfs_pageio_descriptor mypgio, *pgio;
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	unsigned offset;
 	int err;
 
@@ -558,7 +558,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -595,7 +595,7 @@ static inline int nfs_scan_commit(struct
 static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
 		struct page *page, unsigned int offset, unsigned int bytes)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_file_mapping(page);
 	struct inode *inode = mapping->host;
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_page		*req, *new = NULL;
@@ -698,7 +698,7 @@ int nfs_flush_incompatible(struct file *
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status = nfs_wb_page(page->mapping->host, page);
+		status = nfs_wb_page(page_file_mapping(page)->host, page);
 	} while (status == 0);
 	return status;
 }
@@ -713,7 +713,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = (struct nfs_open_context *)file->private_data;
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	int		status = 0;
 
 	nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
@@ -964,7 +964,7 @@ static void nfs_writeback_done_partial(s
 	}
 
 	if (nfs_write_need_commit(data)) {
-		spinlock_t *req_lock = &NFS_I(page->mapping->host)->req_lock;
+		spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
 
 		spin_lock(req_lock);
 		if (test_bit(PG_NEED_RESCHED, &req->wb_flags)) {
@@ -1388,7 +1388,7 @@ int nfs_wb_page_priority(struct inode *i
 	loff_t range_start = page_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1404,7 +1404,7 @@ int nfs_wb_page_priority(struct inode *i
 	}
 	if (!PagePrivate(page))
 		return 0;
-	ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
+	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)
 		return 0;
 out:
@@ -1422,7 +1422,7 @@ int nfs_wb_page(struct inode *inode, str
 
 int nfs_set_page_dirty(struct page *page)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_file_mapping(page);
 	struct inode *inode;
 	spinlock_t *req_lock;
 	struct nfs_page *req;
Index: linux-2.6-git/fs/nfs/internal.h
===================================================================
--- linux-2.6-git.orig/fs/nfs/internal.h
+++ linux-2.6-git/fs/nfs/internal.h
@@ -220,13 +220,14 @@ void nfs_super_set_maxbytes(struct super
 static inline
 unsigned int nfs_page_length(struct page *page)
 {
-	loff_t i_size = i_size_read(page->mapping->host);
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
 
 	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
 		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
-		if (page->index < end_index)
+		if (page_index < end_index)
 			return PAGE_CACHE_SIZE;
-		if (page->index == end_index)
+		if (page_index == end_index)
 			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
 	}
 	return 0;

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 27/40] nfs: disable data cache revalidation for swapfiles
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (25 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 26/40] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 28/40] nfs: enable swap on NFS Peter Zijlstra
                   ` (14 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 5091 bytes --]

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.
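
The lookup order described above can be sketched in user space. This is only
a model under stated assumptions: struct model_page, struct model_inode and
model_find_request() are hypothetical names, and a small array stands in for
the nfs_page_tree radix tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are the kernel's. */
enum { PG_PRIVATE = 1u << 0, PG_SWAPCACHE = 1u << 1 };

struct model_req {
	int wb_count;			/* reference count taken by the lookup */
};

struct model_page {
	unsigned int flags;
	unsigned long index;		/* stands in for page_file_index() */
	struct model_req *private;	/* only meaningful with PG_PRIVATE */
};

#define MODEL_TREE_SLOTS 8

struct model_inode {
	/* stands in for nfsi->nfs_page_tree */
	struct model_req *tree[MODEL_TREE_SLOTS];
};

static struct model_req *model_find_request(struct model_inode *ino,
					    struct model_page *page)
{
	struct model_req *req = NULL;

	if (page->flags & PG_PRIVATE)
		req = page->private;		/* regular file page */
	else if ((page->flags & PG_SWAPCACHE) &&
		 page->index < MODEL_TREE_SLOTS)
		req = ino->tree[page->index];	/* swap page: per-inode lookup */

	if (req != NULL)
		req->wb_count++;		/* both paths take a reference */
	return req;
}
```

The reference is taken once, after either branch, which mirrors how the
patch hoists atomic_inc() out of the PagePrivate() block.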

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/nfs/inode.c |    6 ++++++
 fs/nfs/write.c |   42 ++++++++++++++++++++++++++----------------
 2 files changed, 32 insertions(+), 16 deletions(-)

Index: linux-2.6-git/fs/nfs/inode.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/inode.c
+++ linux-2.6-git/fs/nfs/inode.c
@@ -722,6 +722,12 @@ int nfs_revalidate_mapping_nolock(struct
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c
+++ linux-2.6-git/fs/nfs/write.c
@@ -106,25 +106,29 @@ void nfs_writedata_release(void *wdata)
 	nfs_writedata_free(wdata);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
 {
 	struct nfs_page *req = NULL;
 
-	if (PagePrivate(page)) {
+	if (PagePrivate(page))
 		req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			atomic_inc(&req->wb_count);
-	}
+	else if (unlikely(PageSwapCache(page)))
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+	if (req != NULL)
+		atomic_inc(&req->wb_count);
+
 	return req;
 }
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct nfs_page *req = NULL;
-	spinlock_t *req_lock = &NFS_I(page_file_mapping(page)->host)->req_lock;
+	struct nfs_inode *nfsi = NFS_I(page_file_mapping(page)->host);
+	spinlock_t *req_lock = &nfsi->req_lock;
 
 	spin_lock(req_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(nfsi, page);
 	spin_unlock(req_lock);
 	return req;
 }
@@ -256,7 +260,7 @@ static int nfs_page_async_flush(struct n
 
 	spin_lock(req_lock);
 	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
 		if (req == NULL) {
 			spin_unlock(req_lock);
 			return 1;
@@ -389,8 +393,14 @@ static int nfs_inode_add_request(struct 
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	if (PageDirty(req->wb_page))
 		set_bit(PG_NEED_FLUSH, &req->wb_flags);
 	nfsi->npages++;
@@ -409,8 +419,10 @@ static void nfs_inode_remove_request(str
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&nfsi->req_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
 	radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
 	if (test_and_clear_bit(PG_NEED_FLUSH, &req->wb_flags))
 		__set_page_dirty_nobuffers(req->wb_page);
@@ -608,7 +620,7 @@ static struct nfs_page * nfs_update_requ
 		 * A request for the page we wish to update
 		 */
 		spin_lock(&nfsi->req_lock);
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(nfsi, page);
 		if (req) {
 			if (!nfs_lock_request_dontget(req)) {
 				int error;
@@ -1402,8 +1414,6 @@ int nfs_wb_page_priority(struct inode *i
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
-		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 	if (ret >= 0)
 		return 0;
@@ -1435,7 +1445,7 @@ int nfs_set_page_dirty(struct page *page
 		goto out_raced;
 	req_lock = &NFS_I(inode)->req_lock;
 	spin_lock(req_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
 	if (req != NULL) {
 		/* Mark any existing write requests for flushing */
 		ret = !test_and_set_bit(PG_NEED_FLUSH, &req->wb_flags);

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 28/40] nfs: enable swap on NFS
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (26 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 27/40] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 29/40] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs-swapfile.patch --]
[-- Type: text/plain, Size: 7972 bytes --]

Provide an a_ops->swapfile() implementation for NFS. This will set the
NFS socket to SOCK_VMIO and run socket reconnect under PF_MEMALLOC as well
as reset SOCK_VMIO before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects
and the early (re)setting of SOCK_VMIO should allow us to receive the packets
required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)
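
The save/set/restore pattern the connect workers use for PF_MEMALLOC can be
modelled in user space. tsk_restore_flags() is defined elsewhere in this
series and is not shown in this mail, so model_restore_flags() below is an
assumption about its semantics (restore only the masked bits), and the bit
value is illustrative.

```c
#include <assert.h>

#define MODEL_PF_MEMALLOC 0x0800UL	/* illustrative bit value */

/* Assumed semantics: put only the bits in @mask back to their saved state. */
static void model_restore_flags(unsigned long *flags, unsigned long saved,
				unsigned long mask)
{
	*flags &= ~mask;
	*flags |= saved & mask;
}

/*
 * Model of a connect worker: returns the flags that were in effect while
 * the (omitted) connect work ran.
 */
static unsigned long model_connect_worker(unsigned long *task_flags,
					  int swapper)
{
	unsigned long pflags = *task_flags;	/* saved before any early exit */
	unsigned long during;

	if (swapper)
		*task_flags |= MODEL_PF_MEMALLOC;

	during = *task_flags;			/* connect work would run here */

	model_restore_flags(task_flags, pflags, MODEL_PF_MEMALLOC);
	return during;
}
```

Restoring from the saved value rather than clearing the bit keeps
PF_MEMALLOC set for a caller that already had it set on entry.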

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
---
 fs/Kconfig                  |   19 +++++++++++++++
 fs/nfs/file.c               |   10 ++++++++
 include/linux/sunrpc/xprt.h |    5 +++-
 net/sunrpc/sched.c          |    7 ++++-
 net/sunrpc/xprtsock.c       |   54 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 92 insertions(+), 3 deletions(-)

Index: linux-2.6-git/fs/nfs/file.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/file.c
+++ linux-2.6-git/fs/nfs/file.c
@@ -324,6 +324,13 @@ static int nfs_launder_page(struct page 
 	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -338,6 +345,9 @@ const struct address_space_operations nf
 	.direct_IO = nfs_direct_IO,
 #endif
 	.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+	.swapfile = nfs_swapfile,
+#endif
 };
 
 static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6-git/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6-git.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6-git/include/linux/sunrpc/xprt.h
@@ -151,7 +151,9 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
 	unsigned int		bind_index;	/* bind function index */
 
 	/*
@@ -244,6 +246,7 @@ void			xprt_disconnect(struct rpc_xprt *
  */
 struct rpc_xprt *	xs_setup_udp(struct sockaddr *addr, size_t addrlen, struct rpc_timeout *to);
 struct rpc_xprt *	xs_setup_tcp(struct sockaddr *addr, size_t addrlen, struct rpc_timeout *to);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
Index: linux-2.6-git/net/sunrpc/sched.c
===================================================================
--- linux-2.6-git.orig/net/sunrpc/sched.c
+++ linux-2.6-git/net/sunrpc/sched.c
@@ -755,7 +755,10 @@ static void rpc_async_schedule(struct wo
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
 	size_t *buf;
-	gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+	gfp_t gfp = GFP_NOWAIT;
+
+	if (RPC_IS_SWAPPER(task))
+		gfp |= __GFP_EMERGENCY;
 
 	size += sizeof(size_t);
 	if (size <= RPC_BUFFER_MAXSIZE)
@@ -837,7 +840,7 @@ void rpc_init_task(struct rpc_task *task
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 static void rpc_free_task(struct rcu_head *rcu)
Index: linux-2.6-git/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6-git.orig/net/sunrpc/xprtsock.c
+++ linux-2.6-git/net/sunrpc/xprtsock.c
@@ -1215,11 +1215,15 @@ static void xs_udp_connect_worker(struct
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1257,6 +1261,9 @@ static void xs_udp_connect_worker(struct
 		transport->sock = sock;
 		transport->inet = sk;
 
+		if (xprt->swapper)
+			sk_set_vmio(sk);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1264,6 +1271,7 @@ static void xs_udp_connect_worker(struct
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1302,11 +1310,15 @@ static void xs_tcp_connect_worker(struct
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1356,6 +1368,10 @@ static void xs_tcp_connect_worker(struct
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 
+
+	if (xprt->swapper)
+		sk_set_vmio(transport->inet);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -1383,6 +1399,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -1642,6 +1659,43 @@ int init_socket_xprt(void)
 	return 0;
 }
 
+#ifdef CONFIG_SUNRPC_SWAP
+#define RPC_BUF_RESERVE_PAGES	\
+	DIV_ROUND_UP((RPC_MAX_SLOT_TABLE * \
+				kobjsize(sizeof(struct rpc_rqst))), \
+			PAGE_SIZE)
+#define RPC_RESERVE_PAGES	(RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	int err = 0;
+
+	if (enable) {
+		/*
+		 * keep one extra sock reference so the reserve won't dip
+		 * when the socket gets reconnected.
+		 */
+		sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
+		sk_set_vmio(transport->inet);
+		xprt->swapper = 1;
+	} else if (xprt->swapper) {
+		xprt->swapper = 0;
+		sk_clear_vmio(transport->inet);
+		sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+#endif
+
 /**
  * cleanup_socket_xprt - remove xprtsock's sysctls
  *
Index: linux-2.6-git/fs/Kconfig
===================================================================
--- linux-2.6-git.orig/fs/Kconfig
+++ linux-2.6-git/fs/Kconfig
@@ -1621,6 +1621,18 @@ config NFS_DIRECTIO
 	  causes open() to return EINVAL if a file residing in NFS is
 	  opened with the O_DIRECT flag.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
+	  For more details, see Documentation/vm_deadlock.txt
+
+	  If unsure, say N.
+
 config NFSD
 	tristate "NFS server support"
 	depends on INET
@@ -1746,6 +1758,13 @@ config SUNRPC_BIND34
 	  If unsure, say N to get traditional behavior (version 2 rpcbind
 	  requests only).
 
+config SUNRPC_SWAP
+	def_bool n
+	depends on SUNRPC
+	select SLAB_FAIR
+	select NETVM
+	select SWAP_FILE
+
 config RPCSEC_GSS_KRB5
 	tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
 	depends on SUNRPC && EXPERIMENTAL

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 29/40] nfs: fix various memory recursions possible with swap over NFS.
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (27 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 28/40] nfs: enable swap on NFS Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 30/40] nfs: fixup missing error code Peter Zijlstra
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs-alloc-recursions.patch --]
[-- Type: text/plain, Size: 2110 bytes --]

GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.
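
As a rough illustration of why GFP_NOFS is insufficient here: in kernels of
this era GFP_NOFS still carries __GFP_IO, while GFP_NOIO does not. The
MODEL_* names and bit values below are made up for the sketch; only the
subset relation between the masks is the point.

```c
#include <assert.h>

/* Illustrative bit values, not the kernel's. */
#define MODEL_GFP_WAIT	0x1u
#define MODEL_GFP_IO	0x2u
#define MODEL_GFP_FS	0x4u

#define MODEL_GFP_NOIO		(MODEL_GFP_WAIT)
#define MODEL_GFP_NOFS		(MODEL_GFP_WAIT | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL	(MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* May reclaim triggered by this allocation start physical I/O? */
static int model_may_start_io(unsigned int gfp)
{
	return (gfp & MODEL_GFP_IO) != 0;
}

/* May reclaim triggered by this allocation call into the filesystem? */
static int model_may_enter_fs(unsigned int gfp)
{
	return (gfp & MODEL_GFP_FS) != 0;
}
```

An allocation made on the swap-out path with the NOFS mask could still
start I/O, which is the recursion this patch avoids.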

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/write.c
+++ linux-2.6-git/fs/nfs/write.c
@@ -45,7 +45,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -69,7 +69,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -78,7 +78,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
Index: linux-2.6-git/fs/nfs/pagelist.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/pagelist.c
+++ linux-2.6-git/fs/nfs/pagelist.c
@@ -28,7 +28,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 30/40] nfs: fixup missing error code
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (28 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 29/40] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 13:10   ` Peter Staubach
  2007-05-04 10:27 ` [PATCH 31/40] mm: balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
                   ` (11 subsequent siblings)
  41 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs_fix.patch --]
[-- Type: text/plain, Size: 1114 bytes --]

Commit 0b67130149b006628389ff3e8f46be9957af98aa lost the setting of tk_status
to -EIO when there was no progress with short reads.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/read.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/read.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/read.c	2007-03-13 14:35:53.000000000 +0100
+++ linux-2.6-git/fs/nfs/read.c	2007-03-13 14:36:05.000000000 +0100
@@ -384,8 +384,10 @@ static int nfs_readpage_retry(struct rpc
 	/* This is a short read! */
 	nfs_inc_stats(data->inode, NFSIOS_SHORTREAD);
 	/* Has the server at least made some progress? */
-	if (resp->count == 0)
+	if (resp->count == 0) {
+		task->tk_status = -EIO;
 		return 0;
+	}
 
 	/* Yes, so retry the read at the end of the data */
 	argp->offset += resp->count;

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 31/40] mm: balance_dirty_pages() vs throttle_vm_writeout() deadlock
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (29 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 30/40] nfs: fixup missing error code Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 32/40] block: add a swapdev callback to the request_queue Peter Zijlstra
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: nfs_mm-throttle_vm_writeout.patch --]
[-- Type: text/plain, Size: 1840 bytes --]

If we have a lot of dirty memory and hit the throttle in balance_dirty_pages()
we (potentially) generate a lot of writeback and unstable pages. If, however,
we need to reclaim a bit during this writeback, we might hit
throttle_vm_writeout(), which might delay us until the combined total of
NR_UNSTABLE_NFS + NR_WRITEBACK falls below the dirty limit.

However, unstable pages don't go away automagically; they need a push. While
balance_dirty_pages() does this push, throttle_vm_writeout() doesn't. So we can
sit here ad infinitum.

Hence I propose to remove the NR_UNSTABLE_NFS count from throttle_vm_writeout().

Akpm's recent GFP checks don't change this picture much; any __GFP_IO|__GFP_FS
alloc can still get stalled by this. It turns into a deadlock when swapping
over NFS.
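
The effect on the loop exit test can be shown with a small user-space model.
The function names are hypothetical; the arithmetic follows the hunk below
(the threshold gets 10% headroom before the comparison).

```c
#include <assert.h>

/* Exit test before the patch: unstable NFS pages count against the limit. */
static int model_throttle_done_old(unsigned long unstable,
				   unsigned long writeback,
				   unsigned long dirty_thresh)
{
	dirty_thresh += dirty_thresh / 10;
	return unstable + writeback <= dirty_thresh;
}

/* Exit test after the patch: only pages under writeback are counted. */
static int model_throttle_done_new(unsigned long unstable,
				   unsigned long writeback,
				   unsigned long dirty_thresh)
{
	(void)unstable;		/* nothing in this loop pushes these out */
	dirty_thresh += dirty_thresh / 10;
	return writeback <= dirty_thresh;
}
```

With, say, 2000 unstable pages, no writeback and a threshold of 1000, the
old test never becomes true unless something else issues a commit, while
the new test passes immediately.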

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2007-03-06 17:44:23.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2007-03-15 15:09:16.000000000 +0100
@@ -320,8 +320,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
+                if (global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(WRITE, HZ/10);
         }

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 32/40] block: add a swapdev callback to the request_queue
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (30 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 31/40] mm: balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 33/40] uml: enable scsi and add iscsi config Peter Zijlstra
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Jens Axboe

[-- Attachment #1: blk_queue_swapdev.patch --]
[-- Type: text/plain, Size: 2956 bytes --]

Networked storage devices need a swap-on/off callback in order to set up
some state and reserve memory. Place the block device callback in the
request_queue as suggested by James Bottomley.
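
A user-space sketch of the optional-callback pattern follows. struct
model_queue and struct model_dev are hypothetical stand-ins; what a real
driver does inside its callback (reserving memory, tagging sockets) is not
shown in this mail and is only hinted at by a counter here.

```c
#include <assert.h>
#include <stddef.h>

typedef int (model_swapdev_fn)(void *obj, int enable);

struct model_queue {
	model_swapdev_fn *swapdev_fn;	/* NULL when the driver has no hook */
	void *swapdev_obj;		/* opaque cookie passed back to the hook */
};

/* Driver side: register the hook and its cookie. */
static void model_queue_swapdev(struct model_queue *q,
				model_swapdev_fn *fn, void *obj)
{
	q->swapdev_fn = fn;
	q->swapdev_obj = obj;
}

/* swapon/swapoff side: call the hook if present, succeed otherwise. */
static int model_queue_swapdev_fn(struct model_queue *q, int enable)
{
	if (q->swapdev_fn)
		return q->swapdev_fn(q->swapdev_obj, enable);
	return 0;
}

/* Hypothetical driver state: counts outstanding swap users. */
struct model_dev {
	int reserved;
};

static int model_dev_swapdev(void *obj, int enable)
{
	struct model_dev *dev = obj;

	if (enable)
		dev->reserved++;
	else if (dev->reserved > 0)
		dev->reserved--;
	return 0;
}
```

Queues without a hook return 0, so ordinary local block devices pass
through swapon unchanged.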

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
---
 include/linux/blkdev.h |   19 +++++++++++++++++++
 mm/swapfile.c          |    4 ++++
 2 files changed, 23 insertions(+)

Index: linux-2.6-git/include/linux/blkdev.h
===================================================================
--- linux-2.6-git.orig/include/linux/blkdev.h	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/include/linux/blkdev.h	2007-01-16 14:14:50.000000000 +0100
@@ -341,6 +341,7 @@ typedef int (merge_bvec_fn) (request_que
 typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *);
 typedef void (prepare_flush_fn) (request_queue_t *, struct request *);
 typedef void (softirq_done_fn)(struct request *);
+typedef int (swapdev_fn)(void*, int);
 
 enum blk_queue_state {
 	Queue_down,
@@ -379,6 +380,8 @@ struct request_queue
 	issue_flush_fn		*issue_flush_fn;
 	prepare_flush_fn	*prepare_flush_fn;
 	softirq_done_fn		*softirq_done_fn;
+	swapdev_fn		*swapdev_fn;
+	void			*swapdev_obj;
 
 	/*
 	 * Dispatch queue sorting
@@ -766,6 +769,22 @@ request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(request_queue_t *);
 
+static inline
+void blk_queue_swapdev(struct request_queue *rq,
+		       swapdev_fn *swapdev_fn, void *swapdev_obj)
+{
+	rq->swapdev_fn = swapdev_fn;
+	rq->swapdev_obj = swapdev_obj;
+}
+
+static inline
+int blk_queue_swapdev_fn(struct request_queue *rq, int enable)
+{
+	if (rq->swapdev_fn)
+		return rq->swapdev_fn(rq->swapdev_obj, enable);
+	return 0;
+}
+
 /*
  * tag stuff
  */
Index: linux-2.6-git/mm/swapfile.c
===================================================================
--- linux-2.6-git.orig/mm/swapfile.c	2007-01-15 09:59:02.000000000 +0100
+++ linux-2.6-git/mm/swapfile.c	2007-01-16 14:14:50.000000000 +0100
@@ -1305,6 +1305,7 @@ asmlinkage long sys_swapoff(const char _
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		struct block_device *bdev = I_BDEV(inode);
+		blk_queue_swapdev_fn(bdev->bd_disk->queue, 0);
 		set_blocksize(bdev, p->old_block_size);
 		bd_release(bdev);
 	} else {
@@ -1524,6 +1525,9 @@ asmlinkage long sys_swapon(const char __
 		error = set_blocksize(bdev, PAGE_SIZE);
 		if (error < 0)
 			goto bad_swap;
+		error = blk_queue_swapdev_fn(bdev->bd_disk->queue, 1);
+		if (error < 0)
+			goto bad_swap;
 		p->bdev = bdev;
 	} else if (S_ISREG(inode->i_mode)) {
 		p->bdev = inode->i_sb->s_bdev;

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 33/40] uml: enable scsi and add iscsi config
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (31 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 32/40] block: add a swapdev callback to the request_queue Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 34/40] sock: safely expose kernel sockets to userspace Peter Zijlstra
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Jeff Dike

[-- Attachment #1: uml_iscsi.patch --]
[-- Type: text/plain, Size: 2875 bytes --]

Enable (i)SCSI on UML, dunno why SCSI was deemed broken, it works like a charm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jeff Dike <jdike@addtoit.com>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
---
 arch/um/Kconfig      |   16 --------------
 arch/um/Kconfig.scsi |   58 ---------------------------------------------------
 2 files changed, 1 insertion(+), 73 deletions(-)

Index: linux-2.6-git/arch/um/Kconfig
===================================================================
--- linux-2.6-git.orig/arch/um/Kconfig	2006-12-11 14:39:09.000000000 +0100
+++ linux-2.6-git/arch/um/Kconfig	2007-01-16 14:14:45.000000000 +0100
@@ -317,21 +317,7 @@ source "crypto/Kconfig"
 
 source "lib/Kconfig"
 
-menu "SCSI support"
-depends on BROKEN
-
-config SCSI
-	tristate "SCSI support"
-
-# This gives us free_dma, which scsi.c wants.
-config GENERIC_ISA_DMA
-	bool
-	depends on SCSI
-	default y
-
-source "arch/um/Kconfig.scsi"
-
-endmenu
+source "drivers/scsi/Kconfig"
 
 source "drivers/md/Kconfig"
 
Index: linux-2.6-git/arch/um/Kconfig.scsi
===================================================================
--- linux-2.6-git.orig/arch/um/Kconfig.scsi	2006-09-05 15:30:39.000000000 +0200
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,58 +0,0 @@
-comment "SCSI support type (disk, tape, CD-ROM)"
-	depends on SCSI
-
-config BLK_DEV_SD
-	tristate "SCSI disk support"
-	depends on SCSI
-
-config SD_EXTRA_DEVS
-	int "Maximum number of SCSI disks that can be loaded as modules"
-	depends on BLK_DEV_SD
-	default "40"
-
-config CHR_DEV_ST
-	tristate "SCSI tape support"
-	depends on SCSI
-
-config BLK_DEV_SR
-	tristate "SCSI CD-ROM support"
-	depends on SCSI
-
-config BLK_DEV_SR_VENDOR
-	bool "Enable vendor-specific extensions (for SCSI CDROM)"
-	depends on BLK_DEV_SR
-
-config SR_EXTRA_DEVS
-	int "Maximum number of CDROM devices that can be loaded as modules"
-	depends on BLK_DEV_SR
-	default "2"
-
-config CHR_DEV_SG
-	tristate "SCSI generic support"
-	depends on SCSI
-
-comment "Some SCSI devices (e.g. CD jukebox) support multiple LUNs"
-	depends on SCSI
-
-#if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
-config SCSI_DEBUG_QUEUES
-	bool "Enable extra checks in new queueing code"
-	depends on SCSI
-
-#fi
-config SCSI_MULTI_LUN
-	bool "Probe all LUNs on each SCSI device"
-	depends on SCSI
-
-config SCSI_CONSTANTS
-	bool "Verbose SCSI error reporting (kernel size +=12K)"
-	depends on SCSI
-
-config SCSI_LOGGING
-	bool "SCSI logging facility"
-	depends on SCSI
-
-config SCSI_DEBUG
-	tristate "SCSI debugging host simulator (EXPERIMENTAL)"
-	depends on SCSI
-

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 34/40] sock: safely expose kernel sockets to userspace
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (32 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 33/40] uml: enable scsi and add iscsi config Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 35/40] From: Mike Christie <mchristi@redhat.com> Peter Zijlstra
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Mike Christie

[-- Attachment #1: net-SOCK_KERNEL.patch --]
[-- Type: text/plain, Size: 2426 bytes --]

SOCK_KERNEL - prevents user-space from actually using this socket for anything.
This enables sticking kernel sockets into the files_table for identifying and
reference counting purposes.

(iSCSI wants to do this)
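
The two checks in this patch can be modelled in user space as below. All
MODEL_* constants and struct names are hypothetical; the sketch only shows
that a flagged socket gets a file with no access mode and that ioctl is
refused on it.

```c
#include <assert.h>

#define MODEL_FMODE_READ	0x1u
#define MODEL_FMODE_WRITE	0x2u
#define MODEL_EBADF		9	/* illustrative errno value */

struct model_sock {
	int kernel;		/* stands in for sock_flag(sk, SOCK_KERNEL) */
};

struct model_file {
	unsigned int f_mode;
	struct model_sock *sock;
};

/* Attach a socket to a file; kernel-only sockets get no access mode. */
static void model_attach_fd(struct model_file *file, struct model_sock *sk)
{
	file->f_mode = MODEL_FMODE_READ | MODEL_FMODE_WRITE;
	if (sk->kernel)
		file->f_mode = 0;
	file->sock = sk;
}

/* ioctl entry point: refuse kernel-only sockets outright. */
static int model_sock_ioctl(const struct model_file *file)
{
	if (file->sock->kernel)
		return -MODEL_EBADF;
	return 0;		/* normal ioctl handling would go here */
}
```

The file still exists in the table, so it can be used for identification
and reference counting, while reads, writes and ioctls are refused.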

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <mchristi@redhat.com>
---
 include/net/sock.h |    1 +
 net/socket.c       |   10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2007-03-22 11:29:07.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2007-03-22 11:29:08.000000000 +0100
@@ -394,6 +394,7 @@ enum sock_flags {
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
 	SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
+	SOCK_KERNEL, /* userspace cannot touch this socket */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
Index: linux-2.6-git/net/socket.c
===================================================================
--- linux-2.6-git.orig/net/socket.c	2007-03-22 11:28:58.000000000 +0100
+++ linux-2.6-git/net/socket.c	2007-03-26 12:00:36.000000000 +0200
@@ -353,7 +353,7 @@ static int sock_alloc_fd(struct file **f
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static noinline int sock_attach_fd(struct socket *sock, struct file *file)
 {
 	struct qstr this;
 	char name[32];
@@ -381,6 +381,10 @@ static int sock_attach_fd(struct socket 
 	file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops;
 	file->f_mode = FMODE_READ | FMODE_WRITE;
 	file->f_flags = O_RDWR;
+	if (unlikely(sock->sk && sock_flag(sock->sk, SOCK_KERNEL))) {
+		file->f_mode = 0;
+		file->f_flags = 0;
+	}
 	file->f_pos = 0;
 	file->private_data = sock;
 
@@ -806,6 +810,10 @@ static long sock_ioctl(struct file *file
 	int pid, err;
 
 	sock = file->private_data;
+
+	if (unlikely(sock_flag(sock->sk, SOCK_KERNEL)))
+		return -EBADF;
+
 	if (cmd >= SIOCDEVPRIVATE && cmd <= (SIOCDEVPRIVATE + 15)) {
 		err = dev_ioctl(cmd, argp);
 	} else

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 35/40] From: Mike Christie <mchristi@redhat.com>
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (33 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 34/40] sock: safely expose kernel sockets to userspace Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 36/40] iscsi: fixup of the ep_connect patch Peter Zijlstra
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Mike Christie

[-- Attachment #1: iscsi_ep_connect.patch --]
[-- Type: text/plain, Size: 12040 bytes --]

This patch has iscsi_tcp implement an ep_connect callback. We only do the
connect for now and let userspace do the poll and close. I do not like
the lack of symmetry but doing sys_close in iscsi_tcp felt a little creepy.

This patch also fixes a bug where we leak the ep object when iscsid restarts
while sessions are running. This occurs because iscsid, when it restarts,
does not know the connection and ep relationship. To fix this, I just export
the ep handle in sysfs. I also converted iser in this patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
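Exporting the handle relies on the transport's param_mask, where each
parameter index owns one bit. A minimal user-space sketch of that mapping
follows; the enum and both helpers are hypothetical (the real enum
iscsi_param is longer). It assumes the shift is done in a 64-bit type, which
is what a uint64_t mask needs once an index reaches 32; the macros in this
patch shift a plain 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter indices; the real enum iscsi_param is longer. */
enum model_param {
	MODEL_PARAM_CONN_PORT,
	MODEL_PARAM_CONN_ADDRESS,
	MODEL_PARAM_EP_HANDLE,
	MODEL_PARAM_MAX,
};

/* Turn a parameter index into its mask bit, shifting in 64 bits. */
uint64_t model_param_bit(unsigned int param)
{
	assert(param < 64);
	return UINT64_C(1) << param;
}

/* True when the transport advertises the given parameter. */
bool model_exports(uint64_t param_mask, unsigned int param)
{
	return (param_mask & model_param_bit(param)) != 0;
}
```

The mask holds the shifted bits, never the raw enum indices; OR-ing an index
into the mask would set unrelated bits.
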
 drivers/infiniband/ulp/iser/iscsi_iser.c |    4 -
 drivers/scsi/iscsi_tcp.c                 |   99 +++++++++++++++++--------------
 drivers/scsi/libiscsi.c                  |    8 ++
 drivers/scsi/scsi_transport_iscsi.c      |    4 -
 include/scsi/iscsi_if.h                  |    4 -
 include/scsi/libiscsi.h                  |    3 
 include/scsi/scsi_transport_iscsi.h      |    2 
 7 files changed, 75 insertions(+), 49 deletions(-)

Index: linux-2.6-git/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6-git.orig/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-01-25 11:57:44.000000000 +0100
+++ linux-2.6-git/drivers/infiniband/ulp/iser/iscsi_iser.c	2007-01-25 13:58:45.000000000 +0100
@@ -332,7 +332,8 @@ iscsi_iser_conn_bind(struct iscsi_cls_se
 	struct iser_conn *ib_conn;
 	int error;
 
-	error = iscsi_conn_bind(cls_session, cls_conn, is_leading);
+	error = iscsi_conn_bind(cls_session, cls_conn, is_leading,
+				transport_eph);
 	if (error)
 		return error;
 
@@ -572,6 +573,7 @@ static struct iscsi_transport iscsi_iser
 				  ISCSI_PDU_INORDER_EN |
 				  ISCSI_DATASEQ_INORDER_EN |
 				  ISCSI_EXP_STATSN |
+				  ISCSI_EP_HANDLE |
 				  ISCSI_PERSISTENT_PORT |
 				  ISCSI_PERSISTENT_ADDRESS |
 				  ISCSI_TARGET_NAME |
Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c	2007-01-25 13:29:02.000000000 +0100
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c	2007-01-25 13:58:45.000000000 +0100
@@ -35,6 +35,8 @@
 #include <linux/kfifo.h>
 #include <linux/scatterlist.h>
 #include <linux/mutex.h>
+#include <linux/net.h>
+#include <linux/file.h>
 #include <net/tcp.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_host.h>
@@ -1064,21 +1066,6 @@ iscsi_conn_set_callbacks(struct iscsi_co
 	write_unlock_bh(&sk->sk_callback_lock);
 }
 
-static void
-iscsi_conn_restore_callbacks(struct iscsi_tcp_conn *tcp_conn)
-{
-	struct sock *sk = tcp_conn->sock->sk;
-
-	/* restore socket callbacks, see also: iscsi_conn_set_callbacks() */
-	write_lock_bh(&sk->sk_callback_lock);
-	sk->sk_user_data    = NULL;
-	sk->sk_data_ready   = tcp_conn->old_data_ready;
-	sk->sk_state_change = tcp_conn->old_state_change;
-	sk->sk_write_space  = tcp_conn->old_write_space;
-	sk->sk_no_check	 = 0;
-	write_unlock_bh(&sk->sk_callback_lock);
-}
-
 /**
  * iscsi_send - generic send routine
  * @sk: kernel's socket
@@ -1747,6 +1734,51 @@ iscsi_tcp_ctask_xmit(struct iscsi_conn *
 	return rc;
 }
 
+static int
+iscsi_tcp_ep_connect(struct sockaddr *dst_addr, int non_blocking,
+		     uint64_t *ep_handle)
+{
+	struct socket *sock;
+	int rc, size;
+
+	rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
+			      &sock);
+	if (rc < 0) {
+		printk(KERN_ERR "Could not create socket %d.\n", rc);
+		return rc;
+	}
+	/* TODO: test this with GFP_NOIO */
+	sock->sk->sk_allocation = GFP_ATOMIC;
+
+	if (dst_addr->sa_family == PF_INET)
+		size = sizeof(struct sockaddr_in);
+	else if (dst_addr->sa_family == PF_INET6)
+		size = sizeof(struct sockaddr_in6);
+	else {
+		rc = -EINVAL;
+		goto release_sock;
+	}
+
+	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
+				O_NONBLOCK);
+	if (rc == -EINPROGRESS)
+		rc = 0;
+	else if (rc) {
+		printk(KERN_ERR "Could not connect %d\n", rc);
+		goto release_sock;
+	}
+
+	rc = sock_map_fd(sock);
+	if (rc < 0)
+		goto release_sock;
+	*ep_handle = (uint64_t)rc;
+	return 0;
+
+release_sock:
+	sock_release(sock);
+	return rc;
+}
+
 static struct iscsi_cls_conn *
 iscsi_tcp_conn_create(struct iscsi_cls_session *cls_session, uint32_t conn_idx)
 {
@@ -1798,31 +1830,12 @@ tcp_conn_alloc_fail:
 }
 
 static void
-iscsi_tcp_release_conn(struct iscsi_conn *conn)
-{
-	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
-
-	if (!tcp_conn->sock)
-		return;
-
-	sock_hold(tcp_conn->sock->sk);
-	iscsi_conn_restore_callbacks(tcp_conn);
-	sock_put(tcp_conn->sock->sk);
-
-	sock_release(tcp_conn->sock);
-	tcp_conn->sock = NULL;
-	conn->recv_lock = NULL;
-}
-
-static void
 iscsi_tcp_conn_destroy(struct iscsi_cls_conn *cls_conn)
 {
 	struct iscsi_conn *conn = cls_conn->dd_data;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
-	iscsi_tcp_release_conn(conn);
 	iscsi_conn_teardown(cls_conn);
-
 	if (tcp_conn->tx_hash.tfm)
 		crypto_free_hash(tcp_conn->tx_hash.tfm);
 	if (tcp_conn->rx_hash.tfm)
@@ -1838,7 +1851,6 @@ iscsi_tcp_conn_stop(struct iscsi_cls_con
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
 	iscsi_conn_stop(cls_conn, flag);
-	iscsi_tcp_release_conn(conn);
 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
 }
 
@@ -1860,9 +1872,9 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 		return -EEXIST;
 	}
 
-	err = iscsi_conn_bind(cls_session, cls_conn, is_leading);
+	err = iscsi_conn_bind(cls_session, cls_conn, is_leading, transport_eph);
 	if (err)
-		return err;
+		goto done;
 
 	/* bind iSCSI connection and socket */
 	tcp_conn->sock = sock;
@@ -1871,7 +1883,6 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	sk = sock->sk;
 	sk->sk_reuse = 1;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
-	sk->sk_allocation = GFP_ATOMIC;
 
 	/* FIXME: disable Nagle's algorithm */
 
@@ -1887,7 +1898,9 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	 */
 	tcp_conn->in_progress = IN_PROGRESS_WAIT_HEADER;
 
-	return 0;
+done:
+	sockfd_put(sock);
+	return err;
 }
 
 /* called with host lock */
@@ -2163,6 +2176,7 @@ static struct iscsi_transport iscsi_tcp_
 				  ISCSI_PDU_INORDER_EN |
 				  ISCSI_DATASEQ_INORDER_EN |
 				  ISCSI_ERL |
+				  ISCSI_EP_HANDLE |
 				  ISCSI_CONN_PORT |
 				  ISCSI_CONN_ADDRESS |
 				  ISCSI_EXP_STATSN |
@@ -2186,6 +2200,7 @@ static struct iscsi_transport iscsi_tcp_
 	.get_session_param	= iscsi_session_get_param,
 	.start_conn		= iscsi_conn_start,
 	.stop_conn		= iscsi_tcp_conn_stop,
+	.ep_connect		= iscsi_tcp_ep_connect,
 	/* IO */
 	.send_pdu		= iscsi_conn_send_pdu,
 	.get_stats		= iscsi_conn_get_stats,
Index: linux-2.6-git/drivers/scsi/libiscsi.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/libiscsi.c	2007-01-25 13:29:02.000000000 +0100
+++ linux-2.6-git/drivers/scsi/libiscsi.c	2007-01-25 13:58:45.000000000 +0100
@@ -1793,7 +1793,8 @@ void iscsi_conn_stop(struct iscsi_cls_co
 EXPORT_SYMBOL_GPL(iscsi_conn_stop);
 
 int iscsi_conn_bind(struct iscsi_cls_session *cls_session,
-		    struct iscsi_cls_conn *cls_conn, int is_leading)
+		    struct iscsi_cls_conn *cls_conn, int is_leading,
+		    uint64_t transport_eph)
 {
 	struct iscsi_session *session = class_to_transport_session(cls_session);
 	struct iscsi_conn *conn = cls_conn->dd_data;
@@ -1803,6 +1804,8 @@ int iscsi_conn_bind(struct iscsi_cls_ses
 		session->leadconn = conn;
 	spin_unlock_bh(&session->lock);
 
+	conn->ep_handle = transport_eph;
+
 	/*
 	 * Unblock xmitworker(), Login Phase will pass through.
 	 */
@@ -1983,6 +1986,9 @@ int iscsi_conn_get_param(struct iscsi_cl
 	case ISCSI_PARAM_PERSISTENT_ADDRESS:
 		len = sprintf(buf, "%s\n", conn->persistent_address);
 		break;
+	case ISCSI_PARAM_EP_HANDLE:
+		len = sprintf(buf, "%llu\n", conn->ep_handle);
+		break;
 	default:
 		return -ENOSYS;
 	}
Index: linux-2.6-git/drivers/scsi/scsi_transport_iscsi.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/scsi_transport_iscsi.c	2007-01-25 13:29:02.000000000 +0100
+++ linux-2.6-git/drivers/scsi/scsi_transport_iscsi.c	2007-01-25 13:58:45.000000000 +0100
@@ -31,7 +31,7 @@
 #include <scsi/iscsi_if.h>
 
 #define ISCSI_SESSION_ATTRS 11
-#define ISCSI_CONN_ATTRS 11
+#define ISCSI_CONN_ATTRS 12
 #define ISCSI_HOST_ATTRS 0
 #define ISCSI_TRANSPORT_VERSION "2.0-724"
 
@@ -1153,6 +1153,7 @@ iscsi_conn_attr(port, ISCSI_PARAM_CONN_P
 iscsi_conn_attr(exp_statsn, ISCSI_PARAM_EXP_STATSN);
 iscsi_conn_attr(persistent_address, ISCSI_PARAM_PERSISTENT_ADDRESS);
 iscsi_conn_attr(address, ISCSI_PARAM_CONN_ADDRESS);
+iscsi_conn_attr(ep_handle, ISCSI_PARAM_EP_HANDLE);
 
 #define iscsi_cdev_to_session(_cdev) \
 	iscsi_dev_to_session(_cdev->dev)
@@ -1343,6 +1344,7 @@ iscsi_register_transport(struct iscsi_tr
 	SETUP_CONN_RD_ATTR(exp_statsn, ISCSI_EXP_STATSN);
 	SETUP_CONN_RD_ATTR(persistent_address, ISCSI_PERSISTENT_ADDRESS);
 	SETUP_CONN_RD_ATTR(persistent_port, ISCSI_PERSISTENT_PORT);
+	SETUP_CONN_RD_ATTR(ep_handle, ISCSI_EP_HANDLE);
 
 	BUG_ON(count > ISCSI_CONN_ATTRS);
 	priv->conn_attrs[count] = NULL;
Index: linux-2.6-git/include/scsi/iscsi_if.h
===================================================================
--- linux-2.6-git.orig/include/scsi/iscsi_if.h	2007-01-25 11:57:44.000000000 +0100
+++ linux-2.6-git/include/scsi/iscsi_if.h	2007-01-25 13:58:45.000000000 +0100
@@ -219,9 +219,10 @@ enum iscsi_param {
 	ISCSI_PARAM_PERSISTENT_PORT,
 	ISCSI_PARAM_SESS_RECOVERY_TMO,
 
-	/* pased in through bind conn using transport_fd */
+	/* passed in through bind or ep callbacks */
 	ISCSI_PARAM_CONN_PORT,
 	ISCSI_PARAM_CONN_ADDRESS,
+	ISCSI_PARAM_EP_HANDLE,
 
 	/* must always be last */
 	ISCSI_PARAM_MAX,
@@ -249,6 +250,7 @@ enum iscsi_param {
 #define ISCSI_SESS_RECOVERY_TMO		(1 << ISCSI_PARAM_SESS_RECOVERY_TMO)
 #define ISCSI_CONN_PORT			(1 << ISCSI_PARAM_CONN_PORT)
 #define ISCSI_CONN_ADDRESS		(1 << ISCSI_PARAM_CONN_ADDRESS)
+#define ISCSI_EP_HANDLE			(1 << ISCSI_PARAM_EP_HANDLE)
 
 #define iscsi_ptr(_handle) ((void*)(unsigned long)_handle)
 #define iscsi_handle(_ptr) ((uint64_t)(unsigned long)_ptr)
Index: linux-2.6-git/include/scsi/libiscsi.h
===================================================================
--- linux-2.6-git.orig/include/scsi/libiscsi.h	2007-01-25 11:57:44.000000000 +0100
+++ linux-2.6-git/include/scsi/libiscsi.h	2007-01-25 13:58:45.000000000 +0100
@@ -123,6 +123,7 @@ struct iscsi_conn {
 	struct iscsi_cls_conn	*cls_conn;	/* ptr to class connection */
 	void			*dd_data;	/* iscsi_transport data */
 	struct iscsi_session	*session;	/* parent session */
+	uint64_t		ep_handle;	/* ep handle */
 	/*
 	 * LLDs should set this lock. It protects the transport recv
 	 * code
@@ -281,7 +282,7 @@ extern void iscsi_conn_teardown(struct i
 extern int iscsi_conn_start(struct iscsi_cls_conn *);
 extern void iscsi_conn_stop(struct iscsi_cls_conn *, int);
 extern int iscsi_conn_bind(struct iscsi_cls_session *, struct iscsi_cls_conn *,
-			   int);
+			   int, uint64_t transport_eph);
 extern void iscsi_conn_failure(struct iscsi_conn *conn, enum iscsi_err err);
 extern int iscsi_conn_get_param(struct iscsi_cls_conn *cls_conn,
 				enum iscsi_param param, char *buf);
Index: linux-2.6-git/include/scsi/scsi_transport_iscsi.h
===================================================================
--- linux-2.6-git.orig/include/scsi/scsi_transport_iscsi.h	2007-01-25 11:57:44.000000000 +0100
+++ linux-2.6-git/include/scsi/scsi_transport_iscsi.h	2007-01-25 13:58:45.000000000 +0100
@@ -79,7 +79,7 @@ struct iscsi_transport {
 	char *name;
 	unsigned int caps;
 	/* LLD sets this to indicate what values it can export to sysfs */
-	unsigned int param_mask;
+	uint64_t param_mask;
 	struct scsi_host_template *host_template;
 	/* LLD connection data size */
 	int conndata_size;

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 36/40] iscsi: fixup of the ep_connect patch
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (34 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 35/40] From: Mike Christie <mchristi@redhat.com> Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 37/40] iscsi: ensure the iscsi kernel fd is not usable in userspace Peter Zijlstra
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: iscsi_ep_connect_fix.patch --]
[-- Type: text/plain, Size: 2128 bytes --]

Make sure a malicious user-space program cannot crash the kernel module
by prematurely closing the file descriptor.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
---
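The idea is that the driver keeps its own reference on the socket file until
the connection is released, so a close() from user space cannot free it
early. A minimal user-space sketch of that lifetime rule, assuming the bind
path obtained its reference through a lookup on the fd (not shown in these
hunks); model_file, model_get() and model_put() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted socket file. */
struct model_file {
	int refs;
	bool released;
};

/* Take an extra reference, as a lookup on behalf of the driver would. */
void model_get(struct model_file *f)
{
	assert(!f->released);
	f->refs++;
}

/* Drop a reference; the object is torn down only when the last one goes. */
void model_put(struct model_file *f)
{
	assert(f->refs > 0);
	if (--f->refs == 0)
		f->released = true;
}
```

With the put moved from the end of bind to the release path, the socket
survives for as long as the connection uses it.
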
 drivers/scsi/iscsi_tcp.c |   23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c	2007-01-16 14:15:50.000000000 +0100
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c	2007-01-16 14:24:05.000000000 +0100
@@ -1830,11 +1830,25 @@ tcp_conn_alloc_fail:
 }
 
 static void
+iscsi_tcp_release_conn(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+
+	if (!tcp_conn->sock)
+		return;
+
+	sockfd_put(tcp_conn->sock);
+	tcp_conn->sock = NULL;
+	conn->recv_lock = NULL;
+}
+
+static void
 iscsi_tcp_conn_destroy(struct iscsi_cls_conn *cls_conn)
 {
 	struct iscsi_conn *conn = cls_conn->dd_data;
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
+	iscsi_tcp_release_conn(conn);
 	iscsi_conn_teardown(cls_conn);
 	if (tcp_conn->tx_hash.tfm)
 		crypto_free_hash(tcp_conn->tx_hash.tfm);
@@ -1851,6 +1865,7 @@ iscsi_tcp_conn_stop(struct iscsi_cls_con
 	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
 
 	iscsi_conn_stop(cls_conn, flag);
+	iscsi_tcp_release_conn(conn);
 	tcp_conn->hdr_size = sizeof(struct iscsi_hdr);
 }
 
@@ -1873,8 +1888,10 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	}
 
 	err = iscsi_conn_bind(cls_session, cls_conn, is_leading, transport_eph);
-	if (err)
-		goto done;
+	if (err) {
+		sockfd_put(sock);
+		return err;
+	}
 
 	/* bind iSCSI connection and socket */
 	tcp_conn->sock = sock;
@@ -1898,8 +1915,6 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	 */
 	tcp_conn->in_progress = IN_PROGRESS_WAIT_HEADER;
 
-done:
-	sockfd_put(sock);
 	return err;
 }
 

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 37/40] iscsi: ensure the iscsi kernel fd is not usable in userspace
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (35 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 36/40] iscsi: fixup of the ep_connect patch Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK Peter Zijlstra
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips,
	Mike Christie

[-- Attachment #1: iscsi_ep_connect_SOCK_KERNEL.patch --]
[-- Type: text/plain, Size: 1286 bytes --]

We expose the iSCSI connection fd to userspace for reference tracking, but we
do not want userspace to actually have access to the data; mark it with
SOCK_KERNEL.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <mchristi@redhat.com>
---
 drivers/scsi/iscsi_tcp.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c	2007-03-22 11:29:08.000000000 +0100
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c	2007-03-22 12:00:14.000000000 +0100
@@ -1759,6 +1759,13 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 		goto release_sock;
 	}
 
+	/*
+	 * Even though we're going to expose this socket to user-space
+	 * (as an identifier for the connection and for tracking life times)
+	 * we don't want it used by user-space at all.
+	 */
+	sock_set_flag(sock->sk, SOCK_KERNEL);
+
 	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
 				O_NONBLOCK);
 	if (rc == -EINPROGRESS)

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (36 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 37/40] iscsi: ensure the iscsi kernel fd is not usable in userspace Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 39/40] mm: a process flags to avoid blocking allocations Peter Zijlstra
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: netlink_vmio.patch --]
[-- Type: text/plain, Size: 4883 bytes --]

Modify the netlink code so that SOCK_VMIO has the desired effect on the
user-space side of the connection.

Modify sys_{send,recv}msg to use sk->sk_allocation instead of GFP_KERNEL;
this should not change existing behaviour because the default of
sk->sk_allocation is GFP_KERNEL, and no user-space exposed socket would
have it any different at this time.

This change allows the system calls to succeed for SOCK_VMIO sockets
(which have sk->sk_allocation |= __GFP_EMERGENCY) even under extreme memory
pressure.

Since netlink_sendmsg is used to transfer msgs from user- to kernel-space,
treat the skb allocation there as a receive allocation.

Also export netlink_lookup; this is needed to locate the kernel side struct
sock object associated with the user-space netlink socket.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
---
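The broadcast path derives its allocation mask from the socket and from
whether the skb itself came out of the reserves. A minimal user-space sketch
of that selection; model_gfp_t and the two constants are hypothetical
stand-ins with illustrative values, not the kernel's gfp bits:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_gfp_t;

/* Illustrative values only; not the kernel's gfp flag definitions. */
#define MODEL_GFP_KERNEL    0x0010u
#define MODEL_GFP_EMERGENCY 0x1000u

/*
 * Start from the socket's own allocation mask and add the emergency bit
 * only when the skb being forwarded was an emergency allocation.
 */
model_gfp_t model_broadcast_mask(model_gfp_t sk_allocation, bool skb_emergency)
{
	model_gfp_t mask = sk_allocation;

	if (skb_emergency)
		mask |= MODEL_GFP_EMERGENCY;
	return mask;
}
```

For an ordinary socket and an ordinary skb the result is the socket's
default mask, so existing behaviour is unchanged.
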
 include/linux/netlink.h  |    1 +
 net/compat.c             |    2 +-
 net/netlink/af_netlink.c |   12 +++++++++---
 net/socket.c             |    6 +++---
 4 files changed, 14 insertions(+), 7 deletions(-)

Index: linux-2.6-git/net/netlink/af_netlink.c
===================================================================
--- linux-2.6-git.orig/net/netlink/af_netlink.c
+++ linux-2.6-git/net/netlink/af_netlink.c
@@ -203,7 +203,7 @@ netlink_unlock_table(void)
 		wake_up(&nl_table_wait);
 }
 
-static __inline__ struct sock *netlink_lookup(int protocol, u32 pid)
+struct sock *netlink_lookup(int protocol, u32 pid)
 {
 	struct nl_pid_hash *hash = &nl_table[protocol].hash;
 	struct hlist_head *head;
@@ -1157,7 +1157,7 @@ static int netlink_sendmsg(struct kiocb 
 	if (len > sk->sk_sndbuf - 32)
 		goto out;
 	err = -ENOBUFS;
-	skb = alloc_skb(len, GFP_KERNEL);
+	skb = __alloc_skb(len, GFP_KERNEL, SKB_ALLOC_RX, -1);
 	if (skb==NULL)
 		goto out;
 
@@ -1186,8 +1186,13 @@ static int netlink_sendmsg(struct kiocb 
 	}
 
 	if (dst_group) {
+		gfp_t gfp_mask = sk->sk_allocation;
+
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_EMERGENCY;
+
 		atomic_inc(&skb->users);
-		netlink_broadcast(sk, skb, dst_pid, dst_group, GFP_KERNEL);
+		netlink_broadcast(sk, skb, dst_pid, dst_group, gfp_mask);
 	}
 	err = netlink_unicast(sk, skb, dst_pid, msg->msg_flags&MSG_DONTWAIT);
 
@@ -1850,6 +1855,7 @@ panic:
 
 core_initcall(netlink_proto_init);
 
+EXPORT_SYMBOL(netlink_lookup);
 EXPORT_SYMBOL(netlink_ack);
 EXPORT_SYMBOL(netlink_run_queue);
 EXPORT_SYMBOL(netlink_broadcast);
Index: linux-2.6-git/net/socket.c
===================================================================
--- linux-2.6-git.orig/net/socket.c
+++ linux-2.6-git/net/socket.c
@@ -1817,7 +1817,7 @@ asmlinkage long sys_sendmsg(int fd, stru
 	err = -ENOMEM;
 	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
 	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+		iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
 		if (!iov)
 			goto out_put;
 	}
@@ -1846,7 +1846,7 @@ asmlinkage long sys_sendmsg(int fd, stru
 		ctl_len = msg_sys.msg_controllen;
 	} else if (ctl_len) {
 		if (ctl_len > sizeof(ctl)) {
-			ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
+			ctl_buf = sock_kmalloc(sock->sk, ctl_len, sock->sk->sk_allocation);
 			if (ctl_buf == NULL)
 				goto out_freeiov;
 		}
@@ -1922,7 +1922,7 @@ asmlinkage long sys_recvmsg(int fd, stru
 	err = -ENOMEM;
 	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
 	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
-		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
+		iov = sock_kmalloc(sock->sk, iov_size, sock->sk->sk_allocation);
 		if (!iov)
 			goto out_put;
 	}
Index: linux-2.6-git/include/linux/netlink.h
===================================================================
--- linux-2.6-git.orig/include/linux/netlink.h
+++ linux-2.6-git/include/linux/netlink.h
@@ -157,6 +157,7 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
 
 
+extern struct sock *netlink_lookup(int protocol, __u32 pid);
 extern struct sock *netlink_kernel_create(int unit, unsigned int groups,
 					  void (*input)(struct sock *sk, int len),
 					  struct mutex *cb_mutex,
Index: linux-2.6-git/net/compat.c
===================================================================
--- linux-2.6-git.orig/net/compat.c
+++ linux-2.6-git/net/compat.c
@@ -169,7 +169,7 @@ int cmsghdr_from_user_compat_to_kern(str
 	 * from the user.
 	 */
 	if (kcmlen > stackbuf_size)
-		kcmsg_base = kcmsg = sock_kmalloc(sk, kcmlen, GFP_KERNEL);
+		kcmsg_base = kcmsg = sock_kmalloc(sk, kcmlen, sk->sk_allocation);
 	if (kcmsg == NULL)
 		return -ENOBUFS;
 

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 39/40] mm: a process flags to avoid blocking allocations
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (37 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 10:27 ` [PATCH 40/40] iscsi: support for swapping over iSCSI Peter Zijlstra
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: pf_mem_nowait.patch --]
[-- Type: text/plain, Size: 3832 bytes --]

PF_MEM_NOWAIT - will make allocations fail before blocking. This is useful
to convert process behaviour to non-blocking.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
---
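The decision made by gfp_wait() in the mm/internal.h hunk can be restated as
a pure function. This is a user-space sketch: the flag values are
illustrative, and the task flags and interrupt state are passed in as
parameters instead of being read from current and in_irq().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; not the kernel's definitions. */
#define MODEL_GFP_WAIT      0x10u
#define MODEL_PF_MEM_NOWAIT 0x40000000u

/*
 * An allocation may block only if the caller asked for it and the task has
 * not opted out. The opt-out is ignored in hard interrupt context, where
 * the flags of the interrupted task must not influence the allocation.
 */
bool model_gfp_wait(unsigned int gfp_mask, unsigned int task_flags, bool in_irq)
{
	bool wait = (gfp_mask & MODEL_GFP_WAIT) != 0;

	if (wait && !in_irq && (task_flags & MODEL_PF_MEM_NOWAIT))
		wait = false;
	return wait;
}
```
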
 include/linux/sched.h |    1 +
 kernel/softirq.c      |    4 ++--
 mm/internal.h         |   11 ++++++++++-
 mm/page_alloc.c       |    4 ++--
 4 files changed, 15 insertions(+), 5 deletions(-)

Index: linux-2.6-git/include/linux/sched.h
===================================================================
--- linux-2.6-git.orig/include/linux/sched.h	2007-03-26 12:03:07.000000000 +0200
+++ linux-2.6-git/include/linux/sched.h	2007-03-26 12:03:09.000000000 +0200
@@ -1158,6 +1158,7 @@ static inline void put_task_struct(struc
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
+#define PF_MEM_NOWAIT	0x40000000	/* Make allocations fail instead of blocking */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-03-26 12:03:07.000000000 +0200
+++ linux-2.6-git/mm/page_alloc.c	2007-03-26 12:03:09.000000000 +0200
@@ -1234,11 +1234,11 @@ struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct task_struct *p = current;
+	const bool wait = gfp_wait(gfp_mask);
 	struct zone **z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
-	struct task_struct *p = current;
 	int do_retry;
 	int alloc_flags;
 	int did_some_progress;
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-03-26 12:03:07.000000000 +0200
+++ linux-2.6-git/mm/internal.h	2007-03-26 12:03:09.000000000 +0200
@@ -46,6 +46,15 @@ extern void fastcall __init __free_pages
 #define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 
+static bool inline gfp_wait(gfp_t gfp_mask)
+{
+	bool wait = gfp_mask & __GFP_WAIT;
+	if (wait && !in_irq() && (current->flags & PF_MEM_NOWAIT))
+		wait = false;
+
+	return wait;
+}
+
 /*
  * get the deepest reaching allocation flags for the given gfp_mask
  */
@@ -53,7 +62,7 @@ static int inline gfp_to_alloc_flags(gfp
 {
 	struct task_struct *p = current;
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const bool wait = gfp_wait(gfp_mask);
 
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
Index: linux-2.6-git/kernel/softirq.c
===================================================================
--- linux-2.6-git.orig/kernel/softirq.c	2007-03-26 12:03:07.000000000 +0200
+++ linux-2.6-git/kernel/softirq.c	2007-03-26 12:12:58.000000000 +0200
@@ -211,7 +211,7 @@ asmlinkage void __do_softirq(void)
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
 	unsigned long pflags = current->flags;
-	current->flags &= ~PF_MEMALLOC;
+	current->flags &= ~(PF_MEMALLOC|PF_MEM_NOWAIT);
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -250,7 +250,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
-	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+	tsk_restore_flags(current, pflags, (PF_MEMALLOC|PF_MEM_NOWAIT));
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ

--


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 40/40] iscsi: support for swapping over iSCSI.
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (38 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 39/40] mm: a process flags to avoid blocking allocations Peter Zijlstra
@ 2007-05-04 10:27 ` Peter Zijlstra
  2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
  2007-05-04 19:27 ` David Miller, Peter Zijlstra
  41 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 10:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev
  Cc: Peter Zijlstra, Trond Myklebust, Thomas Graf, David Miller,
	James Bottomley, Mike Christie, Andrew Morton, Daniel Phillips

[-- Attachment #1: iscsi_vmio.patch --]
[-- Type: text/plain, Size: 16495 bytes --]

Set blk_queue_swapdev for iSCSI. This method takes care of reserving the
extra memory needed and marking all relevant sockets with SOCK_VMIO.

When used for swapping, TCP socket creation is done under PF_MEMALLOC and
the TCP connect is done with SOCK_VMIO to ensure their success.

The netlink userspace interface is also marked SOCK_VMIO; this ensures
that even under pressure we can still communicate with the daemon (which
runs under mlockall() and needs no additional memory to operate).

Netlink requests are handled under the new PF_MEM_NOWAIT when a swapper is
present. This ensures that the netlink socket will not block. User-space
will need to retry failed requests.

The TCP receive path is handled under PF_MEMALLOC for SOCK_VMIO sockets.
This makes sure we do not block the critical socket, and that we do not
fail to process incoming data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Christie <michaelc@cs.wisc.edu>
---
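Several paths in this patch run a section of code with an extra task flag
set and put the previous state back afterwards. A minimal user-space sketch
of that save/set/restore pattern follows. model_restore_flags() is a
hypothetical helper whose body is an assumption, modelled on how
tsk_restore_flags() is called in the softirq hunk of patch 39; the flag
value is illustrative.

```c
#include <assert.h>

/* Illustrative value only; not the kernel's PF_MEMALLOC definition. */
#define MODEL_PF_MEMALLOC 0x800ul

/*
 * Restore only the bits selected by mask to their saved state, leaving any
 * other flag that changed in the meantime untouched.
 */
void model_restore_flags(unsigned long *flags, unsigned long saved,
			 unsigned long mask)
{
	assert(flags != 0);
	*flags &= ~mask;
	*flags |= saved & mask;
}
```

A caller would save the flags, set the bit on the live flags word (not on
the saved copy), do the work, and then restore with the same mask.
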
 drivers/scsi/Kconfig                |   17 ++++++++
 drivers/scsi/iscsi_tcp.c            |   70 ++++++++++++++++++++++++++++++++---
 drivers/scsi/libiscsi.c             |   18 ++++++---
 drivers/scsi/qla4xxx/ql4_os.c       |    2 -
 drivers/scsi/scsi_transport_iscsi.c |   72 ++++++++++++++++++++++++++++++++----
 include/scsi/scsi_transport_iscsi.h |   12 +++++-
 6 files changed, 170 insertions(+), 21 deletions(-)

Index: linux-2.6-git/drivers/scsi/iscsi_tcp.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/iscsi_tcp.c	2007-03-26 12:59:39.000000000 +0200
+++ linux-2.6-git/drivers/scsi/iscsi_tcp.c	2007-03-26 13:07:54.000000000 +0200
@@ -42,6 +42,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_device.h>
 
 #include "iscsi_tcp.h"
 
@@ -1740,15 +1741,19 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 {
 	struct socket *sock;
 	int rc, size;
+	int swapper = sk_vmio_socks();
+	unsigned long pflags = current->flags;
+
+	if (swapper)
+		current->flags |= PF_MEMALLOC;
 
 	rc = sock_create_kern(dst_addr->sa_family, SOCK_STREAM, IPPROTO_TCP,
 			      &sock);
 	if (rc < 0) {
 		printk(KERN_ERR "Could not create socket %d.\n", rc);
-		return rc;
+		goto out;
 	}
-	/* TODO: test this with GFP_NOIO */
-	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_allocation = GFP_NOIO;
 
 	if (dst_addr->sa_family == PF_INET)
 		size = sizeof(struct sockaddr_in);
@@ -1765,6 +1770,8 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 	 * we don't want it used by user-space at all.
 	 */
 	sock_set_flag(sock->sk, SOCK_KERNEL);
+	if (swapper)
+		sk_set_vmio(sock->sk);
 
 	rc = sock->ops->connect(sock, (struct sockaddr *)dst_addr, size,
 				O_NONBLOCK);
@@ -1779,11 +1786,14 @@ iscsi_tcp_ep_connect(struct sockaddr *ds
 	if (rc < 0)
 		goto release_sock;
 	*ep_handle = (uint64_t)rc;
-	return 0;
+	rc = 0;
+out:
+	current->flags = pflags;
+	return rc;
 
 release_sock:
 	sock_release(sock);
-	return rc;
+	goto out;
 }
 
 static struct iscsi_cls_conn *
@@ -1908,8 +1918,13 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses
 	sk->sk_reuse = 1;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 
+	if (!cls_session->swapper && sk_has_vmio(sk))
+		sk_clear_vmio(sk);
+
 	/* FIXME: disable Nagle's algorithm */
 
+	BUG_ON(!sk_has_vmio(sk) && cls_session->swapper);
+
 	/*
 	 * Intercept TCP callbacks for sendfile like receive
 	 * processing.
@@ -2167,6 +2182,50 @@ static void iscsi_tcp_session_destroy(st
 	iscsi_session_teardown(cls_session);
 }
 
+#ifdef CONFIG_ISCSI_TCP_SWAP
+
+#define ISCSI_TCP_RESERVE_PAGES	(TX_RESERVE_PAGES)
+
+static int iscsi_tcp_swapdev(void *objp, int enable)
+{
+	int error = 0;
+	struct scsi_device *sdev = objp;
+	struct Scsi_Host *shost = sdev->host;
+	struct iscsi_session *session = iscsi_hostdata(shost->hostdata);
+
+	if (enable) {
+		iscsi_swapdev(session->tt, session_to_cls(session), 1);
+		sk_adjust_memalloc(1, ISCSI_TCP_RESERVE_PAGES);
+	}
+
+	spin_lock(&session->lock);
+	if (session->leadconn) {
+		struct iscsi_tcp_conn *tcp_conn = session->leadconn->dd_data;
+		if (enable)
+			sk_set_vmio(tcp_conn->sock->sk);
+		else
+			sk_clear_vmio(tcp_conn->sock->sk);
+	}
+	spin_unlock(&session->lock);
+
+	if (!enable) {
+		sk_adjust_memalloc(-1, -ISCSI_TCP_RESERVE_PAGES);
+		iscsi_swapdev(session->tt, session_to_cls(session), 0);
+	}
+
+	return error;
+}
+#endif
+
+static int iscsi_tcp_slave_configure(struct scsi_device *sdev)
+{
+#ifdef CONFIG_ISCSI_TCP_SWAP
+	if (sdev->type == TYPE_DISK)
+		blk_queue_swapdev(sdev->request_queue, iscsi_tcp_swapdev, sdev);
+#endif
+	return 0;
+}
+
 static struct scsi_host_template iscsi_sht = {
 	.name			= "iSCSI Initiator over TCP/IP",
 	.queuecommand           = iscsi_queuecommand,
@@ -2174,6 +2233,7 @@ static struct scsi_host_template iscsi_s
 	.can_queue		= ISCSI_XMIT_CMDS_MAX - 1,
 	.sg_tablesize		= ISCSI_SG_TABLESIZE,
 	.cmd_per_lun		= ISCSI_DEF_CMD_PER_LUN,
+	.slave_configure	= iscsi_tcp_slave_configure,
 	.eh_abort_handler       = iscsi_eh_abort,
 	.eh_host_reset_handler	= iscsi_eh_host_reset,
 	.use_clustering         = DISABLE_CLUSTERING,
Index: linux-2.6-git/drivers/scsi/scsi_transport_iscsi.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/scsi_transport_iscsi.c	2007-03-26 12:59:39.000000000 +0200
+++ linux-2.6-git/drivers/scsi/scsi_transport_iscsi.c	2007-03-26 13:15:15.000000000 +0200
@@ -498,6 +498,47 @@ iscsi_if_transport_lookup(struct iscsi_t
 	return NULL;
 }
 
+#ifdef CONFIG_ISCSI_SWAP
+static int iscsi_netlink_sk_vmio(u32 pid, int enable)
+{
+	int rc = -EINVAL;
+	struct sock *sk = netlink_lookup(NETLINK_ISCSI, pid);
+	if (sk) {
+		if (enable)
+			rc = sk_set_vmio(sk);
+		else
+			rc = sk_clear_vmio(sk);
+		sock_put(sk);
+	}
+	return rc;
+}
+
+#define ISCSI_NETLINK_RESERVE_PAGES	(5 + 2 * (5 + 31))
+
+int iscsi_swapdev(struct iscsi_transport *tt,
+		  struct iscsi_cls_session *cls_session, int enable)
+{
+	int pid = iscsi_if_transport_lookup(tt)->daemon_pid;
+
+	if (enable)
+		sk_adjust_memalloc(0, ISCSI_NETLINK_RESERVE_PAGES);
+	else
+		cls_session->swapper = 0;
+
+	iscsi_netlink_sk_vmio(0, enable);
+	iscsi_netlink_sk_vmio(pid, enable);
+
+	if (!enable)
+		sk_adjust_memalloc(0, -ISCSI_NETLINK_RESERVE_PAGES);
+	else
+		cls_session->swapper = 1;
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(iscsi_swapdev);
+#endif
+
 static int
 iscsi_broadcast_skb(struct sk_buff *skb, gfp_t gfp)
 {
@@ -527,7 +568,7 @@ iscsi_unicast_skb(struct sk_buff *skb, i
 }
 
 int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-		   char *data, uint32_t data_size)
+		   char *data, uint32_t data_size, gfp_t gfp_mask)
 {
 	struct nlmsghdr	*nlh;
 	struct sk_buff *skb;
@@ -541,9 +582,9 @@ int iscsi_recv_pdu(struct iscsi_cls_conn
 	if (!priv)
 		return -EINVAL;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, gfp_mask);
 	if (!skb) {
-		iscsi_conn_error(conn, ISCSI_ERR_CONN_FAILED);
+		iscsi_conn_error(conn, ISCSI_ERR_CONN_FAILED, gfp_mask);
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: can not deliver "
 			   "control PDU: OOM\n");
 		return -ENOMEM;
@@ -564,7 +605,8 @@ int iscsi_recv_pdu(struct iscsi_cls_conn
 }
 EXPORT_SYMBOL_GPL(iscsi_recv_pdu);
 
-void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error)
+void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error,
+		      gfp_t gfp_mask)
 {
 	struct nlmsghdr	*nlh;
 	struct sk_buff	*skb;
@@ -576,7 +618,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	if (!priv)
 		return;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, gfp_mask);
 	if (!skb) {
 		dev_printk(KERN_ERR, &conn->dev, "iscsi: gracefully ignored "
 			  "conn error (%d)\n", error);
@@ -591,7 +633,7 @@ void iscsi_conn_error(struct iscsi_cls_c
 	ev->r.connerror.cid = conn->cid;
 	ev->r.connerror.sid = iscsi_conn_get_sid(conn);
 
-	iscsi_broadcast_skb(skb, GFP_ATOMIC);
+	iscsi_broadcast_skb(skb, gfp_mask);
 
 	dev_printk(KERN_INFO, &conn->dev, "iscsi: detected conn error (%d)\n",
 		   error);
@@ -608,7 +650,7 @@ iscsi_if_send_reply(int pid, int seq, in
 	int flags = multi ? NLM_F_MULTI : 0;
 	int t = done ? NLMSG_DONE : type;
 
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = alloc_skb(len, nls->sk_allocation);
 	/*
 	 * FIXME:
 	 * user is supposed to react on iferror == -ENOMEM;
@@ -686,6 +728,7 @@ iscsi_if_get_stats(struct iscsi_transpor
 	return err;
 }
 
+#if 0
 /**
  * iscsi_if_destroy_session_done - send session destr. completion event
  * @conn: last connection for session
@@ -806,6 +849,7 @@ int iscsi_if_create_session_done(struct 
 	return rc;
 }
 EXPORT_SYMBOL_GPL(iscsi_if_create_session_done);
+#endif
 
 static int
 iscsi_if_create_session(struct iscsi_internal *priv, struct iscsi_uevent *ev)
@@ -968,6 +1012,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	struct iscsi_cls_session *session;
 	struct iscsi_cls_conn *conn;
 	unsigned long flags;
+	int pid;
 
 	priv = iscsi_if_transport_lookup(iscsi_ptr(ev->transport_handle));
 	if (!priv)
@@ -977,7 +1022,15 @@ iscsi_if_recv_msg(struct sk_buff *skb, s
 	if (!try_module_get(transport->owner))
 		return -EINVAL;
 
-	priv->daemon_pid = NETLINK_CREDS(skb)->pid;
+	pid = NETLINK_CREDS(skb)->pid;
+	if (priv->daemon_pid > 0 && priv->daemon_pid != pid) {
+		if (sk_has_vmio(nls)) {
+			struct sock * sk = netlink_lookup(NETLINK_ISCSI, pid);
+			BUG_ON(!sk);
+			WARN_ON(!sk_set_vmio(sk));
+		}
+	}
+	priv->daemon_pid = pid;
 
 	switch (nlh->nlmsg_type) {
 	case ISCSI_UEVENT_CREATE_SESSION:
@@ -1092,7 +1145,10 @@ iscsi_if_rx(struct sock *sk, int len)
 			if (rlen > skb->len)
 				rlen = skb->len;
 
+			if (sk_has_vmio(sk))
+				current->flags |= PF_MEM_NOWAIT;
 			err = iscsi_if_recv_msg(skb, nlh);
+			current->flags &= ~PF_MEM_NOWAIT;
 			if (err) {
 				ev->type = ISCSI_KEVENT_IF_ERROR;
 				ev->iferror = err;
Index: linux-2.6-git/include/scsi/scsi_transport_iscsi.h
===================================================================
--- linux-2.6-git.orig/include/scsi/scsi_transport_iscsi.h	2007-03-26 12:59:39.000000000 +0200
+++ linux-2.6-git/include/scsi/scsi_transport_iscsi.h	2007-03-26 12:59:39.000000000 +0200
@@ -137,9 +137,10 @@ extern int iscsi_unregister_transport(st
 /*
  * control plane upcalls
  */
-extern void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error);
+extern void iscsi_conn_error(struct iscsi_cls_conn *conn, enum iscsi_err error,
+			     gfp_t gfp_mask);
 extern int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
-			  char *data, uint32_t data_size);
+			  char *data, uint32_t data_size, gfp_t gfp_mask);
 
 
 /* Connection's states */
@@ -183,6 +184,7 @@ struct iscsi_cls_session {
 	int sid;				/* session id */
 	void *dd_data;				/* LLD private data */
 	struct device dev;	/* sysfs transport/container device */
+	int swapper;				/* we are used to swap on */
 };
 
 #define iscsi_dev_to_session(_dev) \
@@ -194,6 +196,10 @@ struct iscsi_cls_session {
 #define starget_to_session(_stgt) \
 	iscsi_dev_to_session(_stgt->dev.parent)
 
+#define iscsi_session_gfp(_session) \
+	((in_interrupt() ? GFP_ATOMIC : GFP_NOIO) | \
+	 ((_session)->swapper ? __GFP_EMERGENCY : 0))
+
 struct iscsi_host {
 	struct list_head sessions;
 	struct mutex mutex;
@@ -217,6 +223,8 @@ extern int iscsi_destroy_session(struct 
 extern struct iscsi_cls_conn *iscsi_create_conn(struct iscsi_cls_session *sess,
 					    uint32_t cid);
 extern int iscsi_destroy_conn(struct iscsi_cls_conn *conn);
+extern int iscsi_swapdev(struct iscsi_transport *tt, struct iscsi_cls_session *,
+			 int enable);
 extern void iscsi_unblock_session(struct iscsi_cls_session *session);
 extern void iscsi_block_session(struct iscsi_cls_session *session);
 
Index: linux-2.6-git/drivers/scsi/libiscsi.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/libiscsi.c	2007-03-26 12:59:39.000000000 +0200
+++ linux-2.6-git/drivers/scsi/libiscsi.c	2007-03-26 12:59:39.000000000 +0200
@@ -361,10 +361,12 @@ int __iscsi_complete_pdu(struct iscsi_co
 			 char *data, int datalen)
 {
 	struct iscsi_session *session = conn->session;
+	struct iscsi_cls_session *cls_session = session_to_cls(session);
 	int opcode = hdr->opcode & ISCSI_OPCODE_MASK, rc = 0;
 	struct iscsi_cmd_task *ctask;
 	struct iscsi_mgmt_task *mtask;
 	uint32_t itt;
+	gfp_t gfp_mask = iscsi_session_gfp(cls_session);
 
 	if (hdr->itt != RESERVED_ITT)
 		itt = get_itt(hdr->itt);
@@ -423,7 +425,8 @@ int __iscsi_complete_pdu(struct iscsi_co
 			 * login related PDU's exp_statsn is handled in
 			 * userspace
 			 */
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen,
+						gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -445,7 +448,8 @@ int __iscsi_complete_pdu(struct iscsi_co
 			}
 			conn->exp_statsn = be32_to_cpu(hdr->statsn) + 1;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen,
+						gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			list_del(&mtask->running);
 			if (conn->login_mtask != mtask)
@@ -472,7 +476,8 @@ int __iscsi_complete_pdu(struct iscsi_co
 			if (hdr->ttt == cpu_to_be32(ISCSI_RESERVED_TAG))
 				break;
 
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, NULL, 0,
+						gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			break;
 		case ISCSI_OP_REJECT:
@@ -480,7 +485,8 @@ int __iscsi_complete_pdu(struct iscsi_co
 			break;
 		case ISCSI_OP_ASYNC_EVENT:
 			conn->exp_statsn = be32_to_cpu(hdr->statsn) + 1;
-			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen))
+			if (iscsi_recv_pdu(conn->cls_conn, hdr, data, datalen,
+						gfp_mask))
 				rc = ISCSI_ERR_CONN_FAILED;
 			break;
 		default:
@@ -560,7 +566,9 @@ EXPORT_SYMBOL_GPL(iscsi_verify_itt);
 void iscsi_conn_failure(struct iscsi_conn *conn, enum iscsi_err err)
 {
 	struct iscsi_session *session = conn->session;
+	struct iscsi_cls_session *cls_session = session_to_cls(session);
 	unsigned long flags;
+	gfp_t gfp_mask = iscsi_session_gfp(cls_session);
 
 	spin_lock_irqsave(&session->lock, flags);
 	if (session->state == ISCSI_STATE_FAILED) {
@@ -573,7 +581,7 @@ void iscsi_conn_failure(struct iscsi_con
 	spin_unlock_irqrestore(&session->lock, flags);
 	set_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx);
 	set_bit(ISCSI_SUSPEND_BIT, &conn->suspend_rx);
-	iscsi_conn_error(conn->cls_conn, err);
+	iscsi_conn_error(conn->cls_conn, err, gfp_mask);
 }
 EXPORT_SYMBOL_GPL(iscsi_conn_failure);
 
Index: linux-2.6-git/drivers/scsi/qla4xxx/ql4_os.c
===================================================================
--- linux-2.6-git.orig/drivers/scsi/qla4xxx/ql4_os.c	2007-03-26 12:38:34.000000000 +0200
+++ linux-2.6-git/drivers/scsi/qla4xxx/ql4_os.c	2007-03-26 12:59:39.000000000 +0200
@@ -340,7 +340,7 @@ void qla4xxx_mark_device_missing(struct 
 	DEBUG3(printk("scsi%d:%d:%d: index [%d] marked MISSING\n",
 		      ha->host_no, ddb_entry->bus, ddb_entry->target,
 		      ddb_entry->fw_ddb_index));
-	iscsi_conn_error(ddb_entry->conn, ISCSI_ERR_CONN_FAILED);
+	iscsi_conn_error(ddb_entry->conn, ISCSI_ERR_CONN_FAILED, GFP_ATOMIC);
 }
 
 static struct srb* qla4xxx_get_new_srb(struct scsi_qla_host *ha,
Index: linux-2.6-git/drivers/scsi/Kconfig
===================================================================
--- linux-2.6-git.orig/drivers/scsi/Kconfig	2007-03-26 13:00:05.000000000 +0200
+++ linux-2.6-git/drivers/scsi/Kconfig	2007-03-26 13:14:25.000000000 +0200
@@ -268,6 +268,12 @@ config SCSI_ISCSI_ATTRS
 	  each attached iSCSI device to sysfs, say Y.
 	  Otherwise, say N.
 
+config ISCSI_SWAP
+	def_bool n
+	depends on SCSI_ISCSI_ATTRS
+	select SLAB_FAIR
+	select NETVM
+
 config SCSI_SAS_ATTRS
 	tristate "SAS Transport Attributes"
 	depends on SCSI
@@ -306,6 +312,17 @@ config ISCSI_TCP
 
 	 http://linux-iscsi.sf.net
 
+config ISCSI_TCP_SWAP
+	bool "Provide swap over iSCSI over TCP/IP"
+	default n
+	depends on ISCSI_TCP
+	select ISCSI_SWAP
+	help
+	  This option enables swapon to safely work with iSCSI over TCP/IP
+	  devices.
+
+	  If unsure, say N.
+
 config SGIWD93_SCSI
 	tristate "SGI WD93C93 SCSI Driver"
 	depends on SGI_IP22 && SCSI

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
@ 2007-05-04 10:54   ` Pekka Enberg
  2007-05-04 16:09     ` Christoph Lameter
  2007-05-04 16:36   ` Christoph Lameter
  1 sibling, 1 reply; 78+ messages in thread
From: Pekka Enberg @ 2007-05-04 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Christoph Lameter

On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Expose buffer_size in order to allow fair estimates on the actual space
> used/needed.

[snip]

>  #ifdef CONFIG_SLAB_FAIR
> -static inline int slab_alloc_rank(gfp_t flags)
> +static __always_inline int slab_alloc_rank(gfp_t flags)
>  {
>         return gfp_to_rank(flags);
>  }
>  #else
> -static inline int slab_alloc_rank(gfp_t flags)
> +static __always_inline int slab_alloc_rank(gfp_t flags)
>  {
>         return 0;
>  }

Me thinks this hunk doesn't belong in this patch.

> @@ -3815,6 +3815,12 @@ unsigned int kmem_cache_size(struct kmem
>  }
>  EXPORT_SYMBOL(kmem_cache_size);
>
> +unsigned int kmem_cache_objsize(struct kmem_cache *cachep)
> +{
> +       return cachep->buffer_size;
> +}
> +EXPORT_SYMBOL_GPL(kmem_cache_objsize);
> +
>  const char *kmem_cache_name(struct kmem_cache *cachep)
>  {
>         return cachep->name;
> @@ -4512,3 +4518,9 @@ unsigned int ksize(const void *objp)
>
>         return obj_size(virt_to_cache(objp));
>  }
> +
> +unsigned int kobjsize(size_t size)
> +{
> +       return kmem_cache_objsize(kmem_find_general_cachep(size, 0));
> +}
> +EXPORT_SYMBOL_GPL(kobjsize);

Looks good to me. Unfortunately, you need to do SLUB as well. Aah, the
wonders of three kernel memory allocators... ;-)

                                     Pekka

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 30/40] nfs: fixup missing error code
  2007-05-04 10:27 ` [PATCH 30/40] nfs: fixup missing error code Peter Zijlstra
@ 2007-05-04 13:10   ` Peter Staubach
  2007-05-04 13:18     ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Staubach @ 2007-05-04 13:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

Peter Zijlstra wrote:
> Commit 0b67130149b006628389ff3e8f46be9957af98aa lost the setting of tk_status
> to -EIO when there was no progress with short reads.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/nfs/read.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-git/fs/nfs/read.c
> ===================================================================
> --- linux-2.6-git.orig/fs/nfs/read.c	2007-03-13 14:35:53.000000000 +0100
> +++ linux-2.6-git/fs/nfs/read.c	2007-03-13 14:36:05.000000000 +0100
> @@ -384,8 +384,10 @@ static int nfs_readpage_retry(struct rpc
>  	/* This is a short read! */
>  	nfs_inc_stats(data->inode, NFSIOS_SHORTREAD);
>  	/* Has the server at least made some progress? */
> -	if (resp->count == 0)
> +	if (resp->count == 0) {
> +		task->tk_status = -EIO;
>  		return 0;
> +	}
>  
>  	/* Yes, so retry the read at the end of the data */
>  	argp->offset += resp->count;

This doesn't look right to me.  It is not an error for the NFS server
to return 0 bytes.  It is usually an indication of EOF.  If an error
occurred, then the NFS server would have returned an error.
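
The same convention exists for POSIX read(): a return of zero means end
of file, and errors are reported separately. A small user-space sketch of
a retry loop that follows this rule is shown here as an analogy; it is
not the NFS client code, and read_full() is a name made up for this
example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd, retrying after short reads.
 * Returns the number of bytes read (fewer than len if EOF is reached),
 * or -1 with errno set on a real error.
 */
static ssize_t read_full(int fd, char *buf, size_t len)
{
	size_t done = 0;

	assert(buf);
	while (done < len) {
		ssize_t n = read(fd, buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;		/* genuine error, reported explicitly */
		}
		if (n == 0)
			break;			/* zero bytes: end of file, not an error */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Only the n < 0 branch is treated as a failure; a zero-length result simply
ends the loop.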

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 30/40] nfs: fixup missing error code
  2007-05-04 13:10   ` Peter Staubach
@ 2007-05-04 13:18     ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 13:18 UTC (permalink / raw)
  To: Peter Staubach
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On Fri, 2007-05-04 at 09:10 -0400, Peter Staubach wrote:
> Peter Zijlstra wrote:
> > Commit 0b67130149b006628389ff3e8f46be9957af98aa lost the setting of tk_status
> > to -EIO when there was no progress with short reads.
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  fs/nfs/read.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6-git/fs/nfs/read.c
> > ===================================================================
> > --- linux-2.6-git.orig/fs/nfs/read.c	2007-03-13 14:35:53.000000000 +0100
> > +++ linux-2.6-git/fs/nfs/read.c	2007-03-13 14:36:05.000000000 +0100
> > @@ -384,8 +384,10 @@ static int nfs_readpage_retry(struct rpc
> >  	/* This is a short read! */
> >  	nfs_inc_stats(data->inode, NFSIOS_SHORTREAD);
> >  	/* Has the server at least made some progress? */
> > -	if (resp->count == 0)
> > +	if (resp->count == 0) {
> > +		task->tk_status = -EIO;
> >  		return 0;
> > +	}
> >  
> >  	/* Yes, so retry the read at the end of the data */
> >  	argp->offset += resp->count;
> 
> This doesn't look right to me.  It is not an error for the NFS server
> to return 0 bytes.  It is usually an indication of EOF.  If an error
> occurred, then the NFS server would have returned an error.

Ah, ok; I found this when looking through NFS changelogs, and this
change was not changelogged. Consider it dropped.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/40] netvm: hook skb allocation to reserves
  2007-05-04 10:27 ` [PATCH 16/40] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2007-05-04 14:07   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 78+ messages in thread
From: Arnaldo Carvalho de Melo @ 2007-05-04 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Change the skb allocation api to indicate RX usage and use this to fall back to
> the reserve when needed. Skbs allocated from the reserve are tagged in
> skb->emergency.
>
> Teach all other skb ops about emergency skbs and the reserve accounting.
>
> Use the (new) packet split API to allocate and track fragment pages from the
> emergency reserve. Do this using an atomic counter in page->index. This is
> needed because the fragments have a different sharing semantic than that
> indicated by skb_shinfo()->dataref.
>
> (NOTE the extra atomic overhead is only for those pages allocated from the
> reserves - it does not affect the normal fast path.)
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/skbuff.h |   22 +++++-
>  net/core/skbuff.c      |  161 ++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 157 insertions(+), 26 deletions(-)

>
> +#define skb_alloc_rx(skb) (skb_emergency(skb) ? SKB_ALLOC_RX : 0)

skb_alloc_rx seems to imply "alloc an skb for rx", not "gimme the
right flags to allocate a skb for rx". Can this be changed to
"skb_alloc_rx_flag(skb)", similar to the existing sock_flag() for
socks?
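
A sketch of what the suggested rename could look like, using stub types so
it compiles outside the kernel. struct sk_buff_stub, rx_flags_for() and
the value of SKB_ALLOC_RX are assumptions for illustration; only the
macro body comes from the quoted patch.

```c
#include <assert.h>

#define SKB_ALLOC_RX	0x02	/* flag value assumed for illustration */

/* Minimal stand-in for struct sk_buff; only the field the macro needs. */
struct sk_buff_stub {
	unsigned int emergency:1;
};

#define skb_emergency(skb)	((skb)->emergency)

/* The name now says what it yields: an allocation flag, not a new skb. */
#define skb_alloc_rx_flag(skb)	(skb_emergency(skb) ? SKB_ALLOC_RX : 0)

/* Hypothetical caller: derive allocation flags from an existing skb. */
static int rx_flags_for(const struct sk_buff_stub *skb)
{
	assert(skb);
	return skb_alloc_rx_flag(skb);
}
```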

- Arnaldo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (39 preceding siblings ...)
  2007-05-04 10:27 ` [PATCH 40/40] iscsi: support for swapping over iSCSI Peter Zijlstra
@ 2007-05-04 15:22 ` Daniel Walker
  2007-05-04 15:38   ` Peter Zijlstra
  2007-05-04 21:36   ` Arnaldo Carvalho de Melo
  2007-05-04 19:27 ` David Miller, Peter Zijlstra
  41 siblings, 2 replies; 78+ messages in thread
From: Daniel Walker @ 2007-05-04 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:

> 1) introduce the memory reserve and make the SLAB allocator play nice with it.
>    patches 01-10
> 
> 2) add some needed infrastructure to the network code
>    patches 11-13
> 
> 3) implement the idea outlined above
>    patches 14-20
> 
> 4) teach the swap machinery to use generic address_spaces
>    patches 21-24
> 
> 5) implement swap over NFS using all the new stuff
>    patches 25-31
> 
> 6) implement swap over iSCSI
>    patches 32-40

This is kind of a lot of patches all at once .. Have you released any of
these patch sets prior to this release?

Daniel

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
@ 2007-05-04 15:38   ` Peter Zijlstra
  2007-05-04 15:59     ` Daniel Walker
  2007-05-04 21:36   ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 15:38 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On Fri, 2007-05-04 at 08:22 -0700, Daniel Walker wrote:
> On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:
> 
> > 1) introduce the memory reserve and make the SLAB allocator play nice with it.
> >    patches 01-10
> > 
> > 2) add some needed infrastructure to the network code
> >    patches 11-13
> > 
> > 3) implement the idea outlined above
> >    patches 14-20
> > 
> > 4) teach the swap machinery to use generic address_spaces
> >    patches 21-24
> > 
> > 5) implement swap over NFS using all the new stuff
> >    patches 25-31
> > 
> > 6) implement swap over iSCSI
> >    patches 32-40
> 
> This is kind of a lot of patches all at once .. Have you released any of
> these patch sets prior to this release?

Like the -v12 suggests, this is the 12th posting of this patch set.
Some is the same, some has changed.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 15:38   ` Peter Zijlstra
@ 2007-05-04 15:59     ` Daniel Walker
  2007-05-04 18:09       ` Mike Snitzer
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel Walker @ 2007-05-04 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> > 
> > This is kind of a lot of patches all at once .. Have you released any of
> > these patch sets prior to this release?
> 
> Like the -v12 suggests, this is the 12th posting of this patch set.
> Some is the same, some has changed.

I can find one prior release with this subject (-v11); what was the
subject prior to that release? It's not a hard rule, but usually >15
patches is too many (check Documentation/SubmittingPatches under
references). You might want to consider submitting a URL instead.

I think it's a benefit to release fewer patches, since a developer (like
myself) might know very little about "Swap over Networked storage", but if
you submit 10 patches that developer might still review them; with 40
patches they likely wouldn't.

Daniel

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 10:54   ` Pekka Enberg
@ 2007-05-04 16:09     ` Christoph Lameter
  2007-05-04 16:15       ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 16:09 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Pekka Enberg wrote:

> On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Expose buffer_size in order to allow fair estimates on the actual space
> > used/needed.

We already have ksize?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 16:09     ` Christoph Lameter
@ 2007-05-04 16:15       ` Peter Zijlstra
  2007-05-04 16:23         ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 16:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 2007-05-04 at 09:09 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Pekka Enberg wrote:
> 
> > On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > Expose buffer_size in order to allow fair estimates on the actual space
> > > used/needed.
> 
> We already have ksize?

ksize gives the internal size, whereas these give the external size.

I need to know how much space I need to reserve, hence I need the
external size; whereas normally you want to know how much space you have
available, which is what ksize gives.
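
To make the difference concrete, here is a rough sketch of a reserve
estimate. The page size, the helper name and the assumption that objects
never straddle a page are all illustrative and not taken from the patch
set.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES	4096u	/* assumed page size */

/*
 * Pages needed to hold nr objects when each one occupies slot_size
 * bytes in its slab and objects do not straddle page boundaries.
 */
static size_t reserve_pages(size_t nr, size_t slot_size)
{
	size_t per_page;

	assert(slot_size > 0 && slot_size <= PAGE_SIZE_BYTES);
	per_page = PAGE_SIZE_BYTES / slot_size;
	return (nr + per_page - 1) / per_page;	/* round up */
}
```

With 100 objects whose usable size is 100 bytes but which each occupy a
128-byte slot, sizing by the usable size gives 3 pages, while sizing by
the slot gives 4. The smaller figure would leave the reserve one page
short.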

Didn't we have this discussion last time?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 16:15       ` Peter Zijlstra
@ 2007-05-04 16:23         ` Christoph Lameter
  2007-05-04 16:30           ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Peter Zijlstra wrote:

> On Fri, 2007-05-04 at 09:09 -0700, Christoph Lameter wrote:
> > On Fri, 4 May 2007, Pekka Enberg wrote:
> > 
> > > On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > Expose buffer_size in order to allow fair estimates on the actual space
> > > > used/needed.
> > 
> > We already have ksize?
> 
> ksize gives the internal size, whereas these give the external size.
> 
> I need to know how much space I need to reserve, hence I need the
> external size; whereas normally you want to know how much space you have
> available, which is what ksize gives.
> 
> Didn't we have this discussion last time?

I was not cced on that as far as I can tell.

The name objsize suggests the size of the object not the slab size.
If you want this then maybe call it kmem_cache_slab_size. SLUB 
distinguishes between obj_size which is the size of the struct that is 
used and slab_size which is the size of the object after alignment, adding 
debug information etc etc. See also slabinfo.c for a way to calculate 
these sizes from user space.

If we really drop SLAB then we won't need this. SLUB's data structures are
not opaque.
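
The obj_size versus slab_size distinction can be sketched as follows. The
struct and its field names are hypothetical, loosely following the terms
used above; this is not SLUB's actual layout code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a cache, named after the terms in this mail. */
struct cache_layout {
	size_t obj_size;	/* size of the struct the caller uses */
	size_t align;		/* required alignment, a power of two */
	size_t debug_bytes;	/* red zones, tracking data; 0 when debugging is off */
};

/* Per-object footprint in the slab: payload plus debug data, rounded up to align. */
static size_t slab_size(const struct cache_layout *c)
{
	size_t raw;

	assert(c && c->align && (c->align & (c->align - 1)) == 0);
	raw = c->obj_size + c->debug_bytes;
	return (raw + c->align - 1) & ~(c->align - 1);
}
```

For a 100-byte object with 64-byte alignment and no debug data this
yields 128, which is the figure a reserve calculation would need.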

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 16:23         ` Christoph Lameter
@ 2007-05-04 16:30           ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 16:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 2007-05-04 at 09:23 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Peter Zijlstra wrote:
> 
> > On Fri, 2007-05-04 at 09:09 -0700, Christoph Lameter wrote:
> > > On Fri, 4 May 2007, Pekka Enberg wrote:
> > > 
> > > > On 5/4/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > > Expose buffer_size in order to allow fair estimates on the actual space
> > > > > used/needed.
> > > 
> > > We already have ksize?
> > 
> > ksize gives the internal size, whereas these give the external size.
> > 
> > I need to know how much space I need to reserve, hence I need the
> > external size; whereas normally you want to know how much space you have
> > available, which is what ksize gives.
> > 
> > Didn't we have this discussion last time?
> 
> I was not cced on that as far as I can tell.

Ah, that might have been, I was collecting Cc's but must've overlooked
you. My bad.

> The name objsize suggests the size of the object not the slab size.
> If you want this then maybe call it kmem_cache_slab_size. SLUB 
> distinguishes between obj_size which is the size of the struct that is 
> used and slab_size which is the size of the object after alignment, adding 
> debug information etc etc. See also slabinfo.c for a way to calculate 
> these sizes from user space.

I'm open to renames, this is what Pekka suggested IIRC.

> If we really drop SLAB then we won't need this. SLUB's data structures are 
> not opaque.

Yeah, I know, I still have to add SLUB support, it's high on my TODO list
though.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
  2007-05-04 10:54   ` Pekka Enberg
@ 2007-05-04 16:36   ` Christoph Lameter
  2007-05-04 17:59     ` Peter Zijlstra
  1 sibling, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 4 May 2007, Peter Zijlstra wrote:

> Expose buffer_size in order to allow fair estimates on the actual space 
> used/needed.

If it's just an estimate that you are after then I think ksize is 
sufficient.

The buffer size does not include the other per slab overhead that SLAB 
needs, nor the alignment overhead or the padding. For SLUB you'd be 
luckier, but there it does not include the per slab padding that exists.

Need to check how this is going to be used. It is difficult to estimate 
slab use because this depends on the availability of object slots in 
partial slabs.

I could add a function that tells you how many objects you could allocate 
from a slab without the page allocator becoming involved? It would count 
the object slots available on the partial slabs.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 16:36   ` Christoph Lameter
@ 2007-05-04 17:59     ` Peter Zijlstra
  2007-05-04 18:04       ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 17:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 2007-05-04 at 09:36 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Peter Zijlstra wrote:
> 
> > Expose buffer_size in order to allow fair estimates on the actual space 
> > used/needed.
> 
> If it's just an estimate that you are after then I think ksize is 
> sufficient.
> 
> The buffer size does not include the other per slab overhead that SLAB 
> needs nor the alignment overhead or the padding. For SLUB you'd be more 
> lucky but there it does not include the per slab padding that exist.
> 
> Need to check how this is going to be used. It is difficult to estimate 
> slab use because this depends on the availability of object slots in 
> partial slabs.
> 
> I could add a function that tells you how many object you could allocate 
> from a slab without the page allocator becoming involved? It would count 
> the object slots available on the partial slabs.

I need to know how many pages to reserve to allocate a given number of
items from a given slab; assuming the partial slabs are empty. That is,
I need a worst case upper bound.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 17:59     ` Peter Zijlstra
@ 2007-05-04 18:04       ` Christoph Lameter
  2007-05-04 18:21         ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 18:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 4 May 2007, Peter Zijlstra wrote:

> > I could add a function that tells you how many object you could allocate 
> > from a slab without the page allocator becoming involved? It would count 
> > the object slots available on the partial slabs.
> 
> I need to know how many pages to reserve to allocate a given number of
> items from a given slab; assuming the partial slabs are empty. That is,
> I need a worst case upper bound.

Ok so you really need the number of objects per page? If you know the 
number of objects then you can calculate the pages needed which would be 
the maximum memory needed?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 15:59     ` Daniel Walker
@ 2007-05-04 18:09       ` Mike Snitzer
  2007-05-04 19:31         ` Daniel Walker
  2007-05-04 19:54         ` David Miller, Mike Snitzer
  0 siblings, 2 replies; 78+ messages in thread
From: Mike Snitzer @ 2007-05-04 18:09 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On 5/4/07, Daniel Walker <dwalker@mvista.com> wrote:
> On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> > >
> > > This is kind of a lot of patches all at once .. Have you released any of
> > > these patch sets prior to this release ?
> >
> > Like the -v12 suggests, this is the 12th posting of this patch set.
> > Some is the same, some has changed.
>
> I can find one prior release with this subject (-v11) , what was the
> subject prior to that release? It's not a hard rule, but usually >15
> patches is too many (check Documentation/SubmittingPatches under
> references).. You might want to consider submitting a URL instead.

Previous subjects were like:
[PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)

A URL doesn't allow for true discussion about a particular patch
unless the reviewer takes the initiative to create a new thread to
discuss the Nth patch in a patchset; thereby taking on the burden of a
structured subject and so on.  It would get out of control on a large
patchset that actually got a lot of simultaneous feedback... reviewers
don't have a forum to talk about each individual change without
stepping on each others' toes.

> I think it's a benefit to release less since a developer (like myself)
> might know very little about "Swap over Networked storage", but if you
> submit 10 patches that developer might still review it, 40 patches they
> likely wouldn't review it.

The _suggestions_ in Documentation/SubmittingPatches are nice and all
but the quantity of patches shouldn't _really_ matter.

Documentation/SubmittingPatches actually doesn't cover how to post a
large change because it first states:
"Separate _logical changes_ into a single patch file."
then:
"If you cannot condense your patch set into a smaller set of patches,
then only post say 15 or so at a time and wait for review and integration."

These suggestions conflict in the case of a large patchset: the second
can't be met if you honor the first (more important suggestion IMHO).
Unless you leave something out... and I can't see the value in leaving
out the auxiliary consumers of the core changes.

Reviewing 10 patches that are quite large/overloaded is actually
harder than 40 broken-out/well-documented patches.  But maybe others
disagree.

*shrug*


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:04       ` Christoph Lameter
@ 2007-05-04 18:21         ` Peter Zijlstra
  2007-05-04 18:30           ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 18:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 2007-05-04 at 11:04 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Peter Zijlstra wrote:
> 
> > > I could add a function that tells you how many object you could allocate 
> > > from a slab without the page allocator becoming involved? It would count 
> > > the object slots available on the partial slabs.
> > 
> > I need to know how many pages to reserve to allocate a given number of
> > items from a given slab; assuming the partial slabs are empty. That is,
> > I need a worst case upper bound.
> 
> Ok so you really need the number of objects per page? If you know the 
> number of objects then you can calculate the pages needed which would be 
> the maximum memory needed?

Yes, that would work.




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:21         ` Peter Zijlstra
@ 2007-05-04 18:30           ` Christoph Lameter
  2007-05-04 18:32             ` Peter Zijlstra
  2007-05-04 18:41             ` Pekka Enberg
  0 siblings, 2 replies; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 4 May 2007, Peter Zijlstra wrote:

> > Ok so you really need the number of objects per page? If you know the 
> > number of objects then you can calculate the pages needed which would be 
> > the maximum memory needed?
> 
> Yes, that would work.

Hmmm... Maybe let's have

unsigned kmem_estimate_pages(struct kmem_cache *slab_cache, int objects)

which would calculate the worst case memory scenario for allocating the 
indicated number of objects?
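
A rough userspace sketch of the calculation such a function might do, assuming the partial slabs are empty so every object needs a fresh slot. The name estimate_pages and the parameters objs_per_slab and slab_order are assumptions for illustration; a real implementation would read the equivalent values from the cache itself, and this sketch ignores any off-slab management structures.

```c
#include <assert.h>

/*
 * Worst-case number of pages needed to allocate `objects` objects
 * when no free slots are available in partial slabs.
 *
 * objs_per_slab: object slots in one slab (hypothetical input)
 * slab_order:    a slab spans 2^slab_order pages (hypothetical input)
 */
static unsigned estimate_pages(unsigned objects, unsigned objs_per_slab,
			       unsigned slab_order)
{
	unsigned slabs;

	assert(objs_per_slab > 0);
	/* Round up: a partially used slab still costs all of its pages. */
	slabs = (objects + objs_per_slab - 1) / objs_per_slab;
	return slabs << slab_order;
}
```

With 8 objects per order-1 slab, 100 objects need 13 slabs and therefore 26 pages in this model.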



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:30           ` Christoph Lameter
@ 2007-05-04 18:32             ` Peter Zijlstra
  2007-05-04 18:45               ` Pekka Enberg
  2007-05-04 18:41             ` Pekka Enberg
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 18:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips, Pekka Enberg

On Fri, 2007-05-04 at 11:30 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Peter Zijlstra wrote:
> 
> > > Ok so you really need the number of objects per page? If you know the 
> > > number of objects then you can calculate the pages needed which would be 
> > > the maximum memory needed?
> > 
> > Yes, that would work.
> 
> Hmmm... Maybe lets have
> 
> unsigned kmem_estimate_pages(struct kmem_cache *slab_cache, int objects)
> 
> which would calculate the worst case memory scenario for allocation the 
> number of indicated objects?

Perfectly fine with me, Pekka, any objections?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:30           ` Christoph Lameter
  2007-05-04 18:32             ` Peter Zijlstra
@ 2007-05-04 18:41             ` Pekka Enberg
  2007-05-04 18:46               ` Christoph Lameter
  1 sibling, 1 reply; 78+ messages in thread
From: Pekka Enberg @ 2007-05-04 18:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

Christoph Lameter wrote:
> Hmmm... Maybe lets have
> 
> unsigned kmem_estimate_pages(struct kmem_cache *slab_cache, int objects)
> 
> which would calculate the worst case memory scenario for allocation the 
> number of indicated objects?

IIRC this looks more or less like what Peter had initially. I don't like the 
API because there's no way for slab (perhaps this is different for slub) 
to know how many pages you really need due to per-node and per-cpu caches, etc.

It's better that the slab tells you what it actually knows and lets the 
callers figure out what a worst-case upper bound is.

				Pekka



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:32             ` Peter Zijlstra
@ 2007-05-04 18:45               ` Pekka Enberg
  2007-05-04 18:47                 ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Pekka Enberg @ 2007-05-04 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, linux-kernel, linux-mm, netdev,
	Trond Myklebust, Thomas Graf, David Miller, James Bottomley,
	Mike Christie, Andrew Morton, Daniel Phillips

On Fri, 2007-05-04 at 11:30 -0700, Christoph Lameter wrote:
> > Hmmm... Maybe lets have
> >
> > unsigned kmem_estimate_pages(struct kmem_cache *slab_cache, int objects)
> >
> > which would calculate the worst case memory scenario for allocation the 
> > number of indicated objects?

On Fri, 4 May 2007, Peter Zijlstra wrote:
> Perfectly fine with me, Pekka, any objections?

Again, slab has no way of actually estimating how many pages you need 
for a given number of objects. So we end up calculating some upper bound 
which doesn't belong in mm/slab.c. I am perfectly okay with:

   (1) kmem_nr_bytes_per_object which is what Peter has now

or alternatively,

   (2) kmem_nr_objects_per_page which I think Christoph suggested

For both of them, the slab knows the answer and doesn't need to guess. It's 
up to the caller to figure out what the acceptable upper bound is.

			Pekka
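
A small sketch of how a caller might turn either number into a page count, assuming single-page slabs and nothing else sharing the page. The function names and parameters are hypothetical, not existing kernel interfaces. It also shows why the two options can disagree: dividing total bytes by the page size ignores the tail of each page that is too small to hold another object.

```c
#include <assert.h>
#include <stddef.h>

/* Option (1): caller only knows the bytes used per object. */
static size_t pages_from_bytes(size_t objects, size_t bytes_per_object,
			       size_t page_size)
{
	assert(page_size > 0);
	return (objects * bytes_per_object + page_size - 1) / page_size;
}

/* Option (2): caller knows how many objects fit in one page. */
static size_t pages_from_objects_per_page(size_t objects,
					  size_t objects_per_page)
{
	assert(objects_per_page > 0);
	return (objects + objects_per_page - 1) / objects_per_page;
}
```

For ten 1500 byte objects and 4096 byte pages, the byte-based estimate gives 4 pages, while only 2 objects fit per page, so 5 pages are really needed. A caller using option (1) would have to account for that rounding itself.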


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:41             ` Pekka Enberg
@ 2007-05-04 18:46               ` Christoph Lameter
  2007-05-04 18:53                 ` Pekka Enberg
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 18:46 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Pekka Enberg wrote:

> > which would calculate the worst case memory scenario for allocation the
> > number of indicated objects?
> 
> IIRC this looks more or less what Peter had initially. I don't like the API
> because there's no way for slab (perhaps this is different for slub) how many
> pages you really need due to per-node and per-cpu caches, etc.

SLAB can calculate exactly how many pages are needed. The per 
cpu and per node stuff is set up at boot and does not change. We are 
talking about the worst case scenario here. True, in the off-slab case
we have additional overhead that would also have to go into the worst case 
scenario.

> It's better that the slab tells you what it actually knows and lets the
> callers figure out what a worst-case upper bound is.

They do not have the data. For that they would need to know how to deal 
with alignments, (in case of SLAB) the location of the struct slab, the 
distinction between the different sizes, padding etc. I think this has to 
be done by the allocator. If we ever have another allocator with another 
structure then this will nicely isolate that functionality. Otherwise we 
may have to change the callers depending on how the slab organizes its 
data.

SLUB organizes its data more effectively so SLUB will return a lower 
number than SLAB f.e.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:45               ` Pekka Enberg
@ 2007-05-04 18:47                 ` Christoph Lameter
  2007-05-04 18:54                   ` Pekka Enberg
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 18:47 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Pekka Enberg wrote:

> Again, slab has no way of actually estimating how many pages you need for a
> given number of objects. So we end up calculating some upper bound which
> doesn't belong in mm/slab.c. I am perfectly okay with:

It can give a worst case number and that is what he wants.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:46               ` Christoph Lameter
@ 2007-05-04 18:53                 ` Pekka Enberg
  2007-05-04 19:58                   ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Pekka Enberg @ 2007-05-04 18:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

Christoph Lameter wrote:
> SLAB can calculate exactly how many pages are needed. The per 
> cpu and per node stuff is setup at boot and does not change. We are 
> talking about the worst case scenario here. True in case of an off slab
> we have additional overhead that would also have to go into worst case 
> scenario.

Fair enough. But there's no way it can take into account any slab 
management structures it needs to allocate. The slab simply doesn't know 
how many pages are needed to _allocate n amount of objects_.

Peter is interested in a _rough estimate_ so I don't see the point of 
adding that kind of logic in the slab. It's an API that simply cannot 
satisfy all its callers which is why I suggested exposing buffer size in 
the first place (the slab certainly knows how many bytes it needs for 
one object).

			Pekka


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:47                 ` Christoph Lameter
@ 2007-05-04 18:54                   ` Pekka Enberg
  2007-05-04 19:59                     ` Christoph Lameter
  0 siblings, 1 reply; 78+ messages in thread
From: Pekka Enberg @ 2007-05-04 18:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

Christoph Lameter wrote:
> On Fri, 4 May 2007, Pekka Enberg wrote:
> 
>> Again, slab has no way of actually estimating how many pages you need for a
>> given number of objects. So we end up calculating some upper bound which
>> doesn't belong in mm/slab.c. I am perfectly okay with:
> 
> It can give a worst case number and that is what he wants.

Sure. But he can calculate that elsewhere instead of bringing it in 
mm/slab.c where it's no use for anyone else...



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
                   ` (40 preceding siblings ...)
  2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
@ 2007-05-04 19:27 ` David Miller, Peter Zijlstra
  2007-05-04 19:41   ` Peter Zijlstra
  2007-05-05  9:43   ` Christoph Hellwig
  41 siblings, 2 replies; 78+ messages in thread
From: David Miller, Peter Zijlstra @ 2007-05-04 19:27 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: linux-kernel, linux-mm, netdev, trond.myklebust, tgraf,
	James.Bottomley, michaelc, akpm, phillips

> There is a fundamental deadlock associated with paging;

I know you'd really like people like myself to review this work, but a
set of 40 patches is just too much to try and digest at once
especially when I have other things going on.  When I have lots of
other things already on my plate, when I see a huge patch set like
this I have to just say "delete" because I don't kid myself since
I know I'll never get to it.

Sorry, there's no way I can review this with my current workload.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 18:09       ` Mike Snitzer
@ 2007-05-04 19:31         ` Daniel Walker
  2007-05-04 19:54         ` David Miller, Mike Snitzer
  1 sibling, 0 replies; 78+ messages in thread
From: Daniel Walker @ 2007-05-04 19:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 2007-05-04 at 14:09 -0400, Mike Snitzer wrote:
> On 5/4/07, Daniel Walker <dwalker@mvista.com> wrote:
> > On Fri, 2007-05-04 at 17:38 +0200, Peter Zijlstra wrote:
> > > >
> > > > This is kind of a lot of patches all at once .. Have you released any of
> > > > these patch sets prior to this release ?
> > >
> > > Like the -v12 suggests, this is the 12th posting of this patch set.
> > > Some is the same, some has changed.
> >
> > I can find one prior release with this subject (-v11) , what was the
> > subject prior to that release? It's not a hard rule, but usually >15
> > patches is too many (check Documentation/SubmittingPatches under
> > references).. You might want to consider submitting a URL instead.
> 
> Previous subjects were like:
> [PATCH 00/20] vm deadlock avoidance for NFS, NBD and iSCSI (take 7)
> 
> A URL doesn't allow for true discussion about a particular patch
> unless the reviewer takes the initiative to create a new thread to
> discuss the Nth patch in a patchset; thereby taking on the burden of a
> structured subject and so on.  It would get out of control on a large
> patchset that actually got a lot of simultaneous feedback... reviewers
> don't have a forum to talk about each individual change without
> stepping on each others' toes.

True ..

> > I think it's a benefit to release less since a developer (like myself)
> > might know very little about "Swap over Networked storage", but if you
> > submit 10 patches that developer might still review it, 40 patches they
> > likely wouldn't review it.
> 
> The _suggestions_ in Documentation/SubmittingPatches are nice and all
> but the quantity of patches shouldn't _really_ matter.

I guess I take the documentation more seriously than you do. It's
clearly not mandatory, but for my reviewing I appreciate fewer than 15
sets of "logical changes".

Daniel


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 19:27 ` David Miller, Peter Zijlstra
@ 2007-05-04 19:41   ` Peter Zijlstra
  2007-05-04 20:02     ` David Miller, Peter Zijlstra
  2007-05-05  9:43   ` Christoph Hellwig
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-05-04 19:41 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, linux-mm, netdev, trond.myklebust, tgraf,
	James.Bottomley, michaelc, akpm, phillips

On Fri, 2007-05-04 at 12:27 -0700, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri, 04 May 2007 12:26:51 +0200
> 
> > There is a fundamental deadlock associated with paging;
> 
> I know you'd really like people like myself to review this work, but a
> set of 40 patches is just too much to try and digest at once
> especially when I have other things going on.

I realize this, however I expected you to mainly look at the 10
network related patches, namely: 11/40 - 20/40.

I know they build upon the previous 10 patches, which are mostly VM, and
you seem to have an interest in that as well, so that would be 20
patches to look at. Still a sizable set.

How would you prefer I present these?

The other patches are NFS and iSCSI, I'd not expect you to review those
in depth.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 18:09       ` Mike Snitzer
  2007-05-04 19:31         ` Daniel Walker
@ 2007-05-04 19:54         ` David Miller, Mike Snitzer
  1 sibling, 0 replies; 78+ messages in thread
From: David Miller, Mike Snitzer @ 2007-05-04 19:54 UTC (permalink / raw)
  To: snitzer
  Cc: dwalker, a.p.zijlstra, linux-kernel, linux-mm, netdev,
	trond.myklebust, tgraf, James.Bottomley, michaelc, akpm,
	phillips

> These suggestions conflict in the case of a large patchset: the second
> can't be met if you honor the first (more important suggestion IMHO).
> Unless you leave something out... and I can't see the value in leaving
> out the auxiliary consumers of the core changes.

They do not conflict.

If you say you're setting up infrastructure for a well defined
purpose, then each and every one of the patches can all stand on their
own just fine.  You can even post them one at a time and the review
process would work just fine.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:53                 ` Pekka Enberg
@ 2007-05-04 19:58                   ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 19:58 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > SLAB can calculate exactly how many pages are needed. The per cpu and per
> > node stuff is setup at boot and does not change. We are talking about the
> > worst case scenario here. True in case of an off slab
> > we have additional overhead that would also have to go into worst case
> > scenario.
> 
> Fair enough. But there's no way it can take into account any slab management
> structures it needs to allocate. The slab simply doesn't know how many pages
> are needed to _allocate n amount of objects_.

In the worst case we will need nr_objects / nr_object_per_slab off slab management 
structures. There is one off slab management object per allocated slab.
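
As a hedged sketch of that bound: one management object per slab, with the slab count rounded up. The struct slab_need, the function offslab_need and the mgmt_per_page parameter are hypothetical names used only for illustration, and the sketch assumes the management objects are themselves packed mgmt_per_page to a page.

```c
#include <assert.h>

/* Hypothetical result type, for illustration only. */
struct slab_need {
	unsigned slabs;		/* slabs needed for the objects */
	unsigned mgmt_objs;	/* off slab management objects, one per slab */
	unsigned mgmt_pages;	/* pages needed to hold those objects */
};

static struct slab_need offslab_need(unsigned nr_objects,
				     unsigned objs_per_slab,
				     unsigned mgmt_per_page)
{
	struct slab_need n;

	assert(objs_per_slab > 0 && mgmt_per_page > 0);
	/* Worst case: every object lands in a newly allocated slab. */
	n.slabs = (nr_objects + objs_per_slab - 1) / objs_per_slab;
	n.mgmt_objs = n.slabs;
	n.mgmt_pages = (n.mgmt_objs + mgmt_per_page - 1) / mgmt_per_page;
	return n;
}
```

For 1000 objects at 4 per slab this gives 250 slabs and 250 management objects; at 32 management objects per page those cost another 8 pages on top of the slabs themselves.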
 
> Peter is interested in a _rough estimate_ so I don't see the point of adding
> that kind of logic in the slab. It's an API that simply cannot satisfy all its
> callers which is why I suggested exposing buffer size in the first place (the
> slab certainly knows how many bytes it needs for one object).

But the slab size is not useful to the caller since the caller does not 
know about off slab structures etc. It is only the SLAB that can 
calculate the worst case.
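
A minimal sketch of the worst-case estimate discussed above, written as
plain userspace C. Everything here is an assumption made for illustration:
struct cache_geom and worst_case_pages() are hypothetical names, not part of
mm/slab.c, and the off slab management objects are pessimistically charged
to freshly allocated pages instead of being modelled as objects of another
cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache description; NOT the kernel's struct kmem_cache. */
struct cache_geom {
	size_t objs_per_slab;	/* objects that fit in one slab */
	size_t pages_per_slab;	/* pages backing one slab */
	int off_slab;		/* nonzero: management struct kept off slab */
	size_t mgmt_size;	/* bytes per off slab management object */
	size_t page_size;	/* bytes per page */
};

static size_t div_round_up(size_t n, size_t d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/*
 * Worst case: no partially filled slab can be reused, so all
 * nr_objects objects have to come from newly allocated slabs.
 */
static size_t worst_case_pages(const struct cache_geom *g, size_t nr_objects)
{
	size_t slabs = div_round_up(nr_objects, g->objs_per_slab);
	size_t pages = slabs * g->pages_per_slab;

	if (g->off_slab) {
		/* One management object per allocated slab. */
		pages += div_round_up(slabs * g->mgmt_size, g->page_size);
	}
	return pages;
}
```

With 10 objects per one-page slab, 25 objects need 3 pages on slab, and one
page more if a 64 byte management object per slab is kept off slab. An
allocator without a slab management structure (the SLUB case) corresponds
to off_slab == 0, which is why the caller cannot do this calculation from
the object size alone.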


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 18:54                   ` Pekka Enberg
@ 2007-05-04 19:59                     ` Christoph Lameter
  2007-05-05  9:00                       ` Pekka J Enberg
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Lameter @ 2007-05-04 19:59 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > On Fri, 4 May 2007, Pekka Enberg wrote:
> > 
> > > Again, slab has no way of actually estimating how many pages you need for
> > > a
> > > given number of objects. So we end up calculating some upper bound which
> > > doesn't belong in mm/slab.c. I am perfectly okay with:
> > 
> > It can give a worst case number and that is what he wants.
> 
> Sure. But he can calculate that elsewhere instead of bringing it in mm/slab.c
> where it's no use for anyone else...

He is not able to calculate it just using the object size since he does 
not know where the slab put the slab management structure. And in case of 
SLUB there is no slab management structure... Which means he would have to 
special case based on the slab allocator selected.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 19:41   ` Peter Zijlstra
@ 2007-05-04 20:02     ` David Miller, Peter Zijlstra
  2007-05-04 20:29       ` Jeff Garzik
  0 siblings, 1 reply; 78+ messages in thread
From: David Miller, Peter Zijlstra @ 2007-05-04 20:02 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: linux-kernel, linux-mm, netdev, trond.myklebust, tgraf,
	James.Bottomley, michaelc, akpm, phillips

> How would you prefer I present these?

How about 8 or 9 at a time?  You are building infrastructure
and therefore you could post them 1 at a time for review
since each patch should be able to stand on its own.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 20:02     ` David Miller, Peter Zijlstra
@ 2007-05-04 20:29       ` Jeff Garzik
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff Garzik @ 2007-05-04 20:29 UTC (permalink / raw)
  To: David Miller, a.p.zijlstra
  Cc: linux-kernel, linux-mm, netdev, trond.myklebust, tgraf,
	James.Bottomley, michaelc, akpm, phillips

David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri, 04 May 2007 21:41:49 +0200
> 
>> How would you prefer I present these?
> 
> How about 8 or 9 at a time?  You are building infrastructure
> and therefore you could post them 1 at a time for review
> since each patch should be able to stand on its own.

Indeed.  Just glancing over the patchset, there are quite a few "easy to 
apply" cleanup patches that could be fast-forwarded to upstream, without 
requiring deep thought on the swap-over-storage MM changes or net 
allocator changes.

	Jeff




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
  2007-05-04 15:38   ` Peter Zijlstra
@ 2007-05-04 21:36   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 78+ messages in thread
From: Arnaldo Carvalho de Melo @ 2007-05-04 21:36 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

Daniel Walker wrote:
> On Fri, 2007-05-04 at 12:26 +0200, Peter Zijlstra wrote:
>
>   
>> 1) introduce the memory reserve and make the SLAB allocator play nice with it.
>>    patches 01-10
>>
>> 2) add some needed infrastructure to the network code
>>    patches 11-13
>>
>> 3) implement the idea outlined above
>>    patches 14-20
>>
>> 4) teach the swap machinery to use generic address_spaces
>>    patches 21-24
>>
>> 5) implement swap over NFS using all the new stuff
>>    patches 25-31
>>
>> 6) implement swap over iSCSI
>>    patches 32-40
>>     
>
> This is kind of a lot of patches all at once .. Have you released any of
> these patch sets prior to this release ? 
>   

Yes, several times AFAIK.

- Arnaldo


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/40] mm: kmem_cache_objsize
  2007-05-04 19:59                     ` Christoph Lameter
@ 2007-05-05  9:00                       ` Pekka J Enberg
  0 siblings, 0 replies; 78+ messages in thread
From: Pekka J Enberg @ 2007-05-05  9:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-kernel, linux-mm, netdev, Trond Myklebust,
	Thomas Graf, David Miller, James Bottomley, Mike Christie,
	Andrew Morton, Daniel Phillips

On Fri, 4 May 2007, Christoph Lameter wrote:
> He is not able to calculate it just using the object size since he does 
> not know where the slab put the slab management structure. And in case of 
> SLUB there is no slab management structure... Which means he would have to 
> special case based on the slab allocator selected.

Let me state this once more: he is interested in _rough approximation_. It 
makes no sense to me to add this kind of fuzzy logic in the slab. Now, as 
the slab clearly cannot give a _precise number_ either, it shouldn't be 
added there.

But, if both of you really want to stick it in mm/slab.c, I guess I won't 
be too violently opposed to it. It just doesn't make any sense to me.

			Pekka


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-04 19:27 ` David Miller, Peter Zijlstra
  2007-05-04 19:41   ` Peter Zijlstra
@ 2007-05-05  9:43   ` Christoph Hellwig
  2007-05-05  9:55     ` William Lee Irwin III
  1 sibling, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2007-05-05  9:43 UTC (permalink / raw)
  To: David Miller
  Cc: a.p.zijlstra, linux-kernel, linux-mm, netdev, trond.myklebust,
	tgraf, James.Bottomley, michaelc, akpm, phillips

On Fri, May 04, 2007 at 12:27:16PM -0700, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri, 04 May 2007 12:26:51 +0200
> 
> > There is a fundamental deadlock associated with paging;
> 
> I know you'd really like people like myself to review this work, but a
> set of 40 patches is just too much to try and digest at once
> especially when I have other things going on.  When I have lots of
> other things already on my plate, when I see a huge patch set like
> this I have to just say "delete" because I don't kid myself since
> I know I'll never get to it.
> 
> Sorry there's no way I can review this with my current workload.

There are also quite a lot of only semi-related things in there.  It would
be much better to only do the network stack and iscsi parts first
and leave nfs out for a while.  Especially as the former are definitely
useful while I strongly doubt that for swap over nfs.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 00/40] Swap over Networked storage -v12
  2007-05-05  9:43   ` Christoph Hellwig
@ 2007-05-05  9:55     ` William Lee Irwin III
  0 siblings, 0 replies; 78+ messages in thread
From: William Lee Irwin III @ 2007-05-05  9:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Miller, a.p.zijlstra, linux-kernel, linux-mm, netdev,
	trond.myklebust, tgraf, James.Bottomley, michaelc, akpm,
	phillips

On Fri, 04 May 2007 12:26:51 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>> There is a fundamental deadlock associated with paging;

On Fri, May 04, 2007 at 12:27:16PM -0700, David Miller wrote:
>> I know you'd really like people like myself to review this work, but a
>> set of 40 patches is just too much to try and digest at once
>> especially when I have other things going on.  When I have lots of
>> other things already on my plate, when I see a huge patch set like
>> this I have to just say "delete" because I don't kid myself since
>> I know I'll never get to it.
>> Sorry there's no way I can review this with my current workload.

On Sat, May 05, 2007 at 10:43:00AM +0100, Christoph Hellwig wrote:
> There are also quite a lot of only semi-related things in there.  It would
> be much better to only do the network stack and iscsi parts first
> and leave nfs out for a while.  Especially as the former are definitely
> useful while I strongly doubt that for swap over nfs.

This is backward. As much as we hate it, the common case is swap over
nfs, essentially because that is/was how things were commonly set up
for other operating systems. I'm not a Solaris administrator, though,
so various disclaimers apply.


-- wli


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/40] mm: slab allocation fairness
  2007-05-04 10:26 ` [PATCH 02/40] mm: slab allocation fairness Peter Zijlstra
@ 2007-05-16 20:41   ` Christoph Lameter
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2007-05-16 20:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, netdev, Trond Myklebust, Thomas Graf,
	David Miller, James Bottomley, Mike Christie, Andrew Morton,
	Daniel Phillips

On Fri, 4 May 2007, Peter Zijlstra wrote:

> Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
> represents how deep we had to reach into our reserves when allocating a page. 
> Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most 
> shallow allocation possible (ALLOC_WMARK_HIGH).
> 
> When the slab space is grown the rank of the page allocation is stored. For
> each slab allocation we test the given gfp flags against this rank, thereby
> asking the question: would these flags have allowed the slab to grow?
> 
> If not so, we need to test the current situation. This is done by forcing the
> growth of the slab space. (Just testing the free page limits will not work due
> to direct reclaim) Failing this we need to fail the slab allocation.

This implies that an allocation at time t2 must be aware of the result of 
an allocation at time t1. It assumes a linear ordering of allocations that 
is not possible on large systems. Ordering of events is a very expensive 
endeavor in particular on NUMA systems given the potentially large 
latencies between various portions of the system.

Maybe you need to restrict the ordering per cpu or per node? Per zone? 

Then we would need to store the ranks somewhere which raises scalability 
issues if these are global.
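
A minimal sketch of what per node rank storage could look like, in plain
userspace C. All names here (node_rank, rank_init, record_growth,
may_use_existing_slab, MAX_NODES) are hypothetical and not taken from the
patch. The translation from gfp flags to a rank is not shown; the caller's
rank is passed in directly. The only facts used from the description are
that rank 0 is the deepest reach into the reserves and 16 the most shallow.

```c
#include <assert.h>
#include <stdbool.h>

#define RANK_DEEPEST	0	/* ALLOC_NO_WATERMARK in the description */
#define RANK_SHALLOWEST	16	/* ALLOC_WMARK_HIGH in the description */
#define MAX_NODES	4	/* hypothetical node count */

/* Rank of the page allocation that last grew the slab, one slot per node. */
static int node_rank[MAX_NODES];

static void rank_init(void)
{
	for (int i = 0; i < MAX_NODES; i++)
		node_rank[i] = RANK_SHALLOWEST;
}

/* Called when the slab on this node is grown by a page allocation. */
static void record_growth(int node, int page_rank)
{
	assert(node >= 0 && node < MAX_NODES);
	node_rank[node] = page_rank;
}

/*
 * Would the caller's flags have allowed the slab on this node to grow?
 * A caller may reach as deep as caller_rank; the last growth needed
 * node_rank[node]. If this returns false, the caller has to force a
 * slab growth and fail the allocation when that does not succeed.
 */
static bool may_use_existing_slab(int node, int caller_rank)
{
	assert(node >= 0 && node < MAX_NODES);
	return caller_rank <= node_rank[node];
}
```

Keeping one slot per node means an allocation only compares against state
that is local to its node, so no ordering across nodes is implied. Whether
the granularity should be per cpu, per node or per zone is left open, as in
the question above.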



^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2007-05-16 20:41 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
2007-05-04 10:26 ` [PATCH 01/40] mm: page allocation rank Peter Zijlstra
2007-05-04 10:26 ` [PATCH 02/40] mm: slab allocation fairness Peter Zijlstra
2007-05-16 20:41   ` Christoph Lameter
2007-05-04 10:26 ` [PATCH 03/40] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2007-05-04 10:26 ` [PATCH 04/40] mm: serialize access to min_free_kbytes Peter Zijlstra
2007-05-04 10:26 ` [PATCH 05/40] mm: emergency pool Peter Zijlstra
2007-05-04 10:26 ` [PATCH 06/40] mm: __GFP_EMERGENCY Peter Zijlstra
2007-05-04 10:26 ` [PATCH 07/40] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
2007-05-04 10:54   ` Pekka Enberg
2007-05-04 16:09     ` Christoph Lameter
2007-05-04 16:15       ` Peter Zijlstra
2007-05-04 16:23         ` Christoph Lameter
2007-05-04 16:30           ` Peter Zijlstra
2007-05-04 16:36   ` Christoph Lameter
2007-05-04 17:59     ` Peter Zijlstra
2007-05-04 18:04       ` Christoph Lameter
2007-05-04 18:21         ` Peter Zijlstra
2007-05-04 18:30           ` Christoph Lameter
2007-05-04 18:32             ` Peter Zijlstra
2007-05-04 18:45               ` Pekka Enberg
2007-05-04 18:47                 ` Christoph Lameter
2007-05-04 18:54                   ` Pekka Enberg
2007-05-04 19:59                     ` Christoph Lameter
2007-05-05  9:00                       ` Pekka J Enberg
2007-05-04 18:41             ` Pekka Enberg
2007-05-04 18:46               ` Christoph Lameter
2007-05-04 18:53                 ` Pekka Enberg
2007-05-04 19:58                   ` Christoph Lameter
2007-05-04 10:27 ` [PATCH 09/40] mm: optimize gfp_to_rank() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 10/40] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2007-05-04 10:27 ` [PATCH 11/40] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 12/40] net: packet split receive api Peter Zijlstra
2007-05-04 10:27 ` [PATCH 13/40] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2007-05-04 10:27 ` [PATCH 14/40] netvm: link network to vm layer Peter Zijlstra
2007-05-04 10:27 ` [PATCH 15/40] netvm: INET reserves Peter Zijlstra
2007-05-04 10:27 ` [PATCH 16/40] netvm: hook skb allocation to reserves Peter Zijlstra
2007-05-04 14:07   ` Arnaldo Carvalho de Melo
2007-05-04 10:27 ` [PATCH 17/40] netvm: filter emergency skbs Peter Zijlstra
2007-05-04 10:27 ` [PATCH 18/40] netvm: prevent a TCP specific deadlock Peter Zijlstra
2007-05-04 10:27 ` [PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
2007-05-04 10:27 ` [PATCH 20/40] netvm: skb processing Peter Zijlstra
2007-05-04 10:27 ` [PATCH 21/40] uml: rename arch/um remove_mapping() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 22/40] mm: prepare swap entry methods for use in page methods Peter Zijlstra
2007-05-04 10:27 ` [PATCH 23/40] mm: add support for non block device backed swap files Peter Zijlstra
2007-05-04 10:27 ` [PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2007-05-04 10:27 ` [PATCH 25/40] nfs: remove mempools Peter Zijlstra
2007-05-04 10:27 ` [PATCH 26/40] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2007-05-04 10:27 ` [PATCH 27/40] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2007-05-04 10:27 ` [PATCH 28/40] nfs: enable swap on NFS Peter Zijlstra
2007-05-04 10:27 ` [PATCH 29/40] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2007-05-04 10:27 ` [PATCH 30/40] nfs: fixup missing error code Peter Zijlstra
2007-05-04 13:10   ` Peter Staubach
2007-05-04 13:18     ` Peter Zijlstra
2007-05-04 10:27 ` [PATCH 31/40] mm: balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
2007-05-04 10:27 ` [PATCH 32/40] block: add a swapdev callback to the request_queue Peter Zijlstra
2007-05-04 10:27 ` [PATCH 33/40] uml: enable scsi and add iscsi config Peter Zijlstra
2007-05-04 10:27 ` [PATCH 34/40] sock: safely expose kernel sockets to userspace Peter Zijlstra
2007-05-04 10:27 ` [PATCH 35/40] From: Mike Christie <mchristi@redhat.com> Peter Zijlstra
2007-05-04 10:27 ` [PATCH 36/40] iscsi: fixup of the ep_connect patch Peter Zijlstra
2007-05-04 10:27 ` [PATCH 37/40] iscsi: ensure the iscsi kernel fd is not usable in userspace Peter Zijlstra
2007-05-04 10:27 ` [PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK Peter Zijlstra
2007-05-04 10:27 ` [PATCH 39/40] mm: a process flags to avoid blocking allocations Peter Zijlstra
2007-05-04 10:27 ` [PATCH 40/40] iscsi: support for swapping over iSCSI Peter Zijlstra
2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
2007-05-04 15:38   ` Peter Zijlstra
2007-05-04 15:59     ` Daniel Walker
2007-05-04 18:09       ` Mike Snitzer
2007-05-04 19:31         ` Daniel Walker
2007-05-04 19:54         ` David Miller, Mike Snitzer
2007-05-04 21:36   ` Arnaldo Carvalho de Melo
2007-05-04 19:27 ` David Miller, Peter Zijlstra
2007-05-04 19:41   ` Peter Zijlstra
2007-05-04 20:02     ` David Miller, Peter Zijlstra
2007-05-04 20:29       ` Jeff Garzik
2007-05-05  9:43   ` Christoph Hellwig
2007-05-05  9:55     ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).