* [PATCH 00/30] Swap over NFS -v18
@ 2008-07-24 14:00 Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 01/30] swap over network documentation Peter Zijlstra
                   ` (30 more replies)
  0 siblings, 31 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

Latest version of the swap over NFS work.

Patches are against: v2.6.26-rc8-mm1

I still need to write some more comments in the reservation code.

Pekka, it uses ksize(), please have a look.

This version also deals with network namespaces.
Two things where I could do with some suggestions:

  - currently the sysctl code uses current->nsproxy->net_ns to obtain
    the current network namespace

  - the ipv6 route cache code has some initialization order issues

Thanks,

Peter - who hopes we can someday merge this - Zijlstra

--
 Documentation/filesystems/Locking |   22 +
 Documentation/filesystems/vfs.txt |   18 +
 Documentation/network-swap.txt    |  270 +++++++++++++++++
 drivers/net/bnx2.c                |    8
 drivers/net/e1000/e1000_main.c    |    8
 drivers/net/e1000e/netdev.c       |    7
 drivers/net/igb/igb_main.c        |    8
 drivers/net/ixgbe/ixgbe_main.c    |   10
 drivers/net/sky2.c                |   16 -
 fs/Kconfig                        |   17 +
 fs/nfs/file.c                     |   24 +
 fs/nfs/inode.c                    |    6
 fs/nfs/internal.h                 |    7
 fs/nfs/pagelist.c                 |    8
 fs/nfs/read.c                     |   21 -
 fs/nfs/write.c                    |  175 +++++++----
 include/linux/buffer_head.h       |    2
 include/linux/fs.h                |    9
 include/linux/gfp.h               |    3
 include/linux/mm.h                |   25 +
 include/linux/mm_types.h          |    2
 include/linux/mmzone.h            |    6
 include/linux/nfs_fs.h            |    2
 include/linux/pagemap.h           |    5
 include/linux/reserve.h           |  146 +++++++++
 include/linux/sched.h             |    7
 include/linux/skbuff.h            |   46 ++
 include/linux/slab.h              |   24 -
 include/linux/slub_def.h          |    8
 include/linux/sunrpc/xprt.h       |    5
 include/linux/swap.h              |    4
 include/net/inet_frag.h           |    7
 include/net/netns/ipv6.h          |    4
 include/net/sock.h                |   63 +++-
 include/net/tcp.h                 |    2
 kernel/softirq.c                  |    3
 mm/Makefile                       |    2
 mm/internal.h                     |   10
 mm/page_alloc.c                   |  207 +++++++++----
 mm/page_io.c                      |   52 +++
 mm/reserve.c                      |  594 ++++++++++++++++++++++++++++++++++++++
 mm/slab.c                         |  135 ++++++++
 mm/slub.c                         |  159 ++++++++--
 mm/swap_state.c                   |    4
 mm/swapfile.c                     |   51 +++
 mm/vmstat.c                       |    6
 net/Kconfig                       |    3
 net/core/dev.c                    |   59 +++
 net/core/filter.c                 |    3
 net/core/skbuff.c                 |  147 +++++++--
 net/core/sock.c                   |  129 ++++++++
 net/ipv4/inet_fragment.c          |    3
 net/ipv4/ip_fragment.c            |   89 +++++
 net/ipv4/route.c                  |   72 ++++
 net/ipv4/tcp.c                    |    5
 net/ipv4/tcp_input.c              |   12
 net/ipv4/tcp_output.c             |   12
 net/ipv4/tcp_timer.c              |    2
 net/ipv6/af_inet6.c               |   20 +
 net/ipv6/reassembly.c             |   88 +++++
 net/ipv6/route.c                  |   66 ++++
 net/ipv6/tcp_ipv6.c               |   17 -
 net/netfilter/core.c              |    3
 net/sctp/ulpevent.c               |    2
 net/sunrpc/sched.c                |    9
 net/sunrpc/xprtsock.c             |   73 ++++
 security/selinux/avc.c            |    2
 67 files changed, 2720 insertions(+), 314 deletions(-)




* [PATCH 01/30] swap over network documentation
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 02/30] mm: gfp_to_alloc_flags() Peter Zijlstra
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: doc.patch --]
[-- Type: text/plain, Size: 13686 bytes --]

From: Neil Brown <neilb@suse.de>

Document describing the problem and proposed solution

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/network-swap.txt |  270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: linux-2.6/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+   When Linux needs to allocate memory it may find that there is
+   insufficient free memory so it needs to reclaim space that is in
+   use but not needed at the moment.  There are several options:
+
+   1/ Shrink a kernel cache such as the inode or dentry cache.  This
+      is fairly easy but provides limited returns.
+   2/ Discard 'clean' pages from the page cache.  This is easy, and
+      works well as long as there are clean pages in the page cache.
+      Similarly clean 'anonymous' pages can be discarded - if there
+      are any.
+   3/ Write out some dirty page-cache pages so that they become clean.
+      The VM limits the number of dirty page-cache pages to e.g. 40%
+      of available memory so that (among other reasons) a "sync" will
+      not take excessively long.  So there should never be excessive
+      amounts of dirty pagecache.
+      Writing out dirty page-cache pages involves work by the
+      filesystem which may need to allocate memory itself.  To avoid
+      deadlock, filesystems use GFP_NOFS when allocating memory on the
+      write-out path.  When this is used, cleaning dirty page-cache
+      pages is not an option so if the filesystem finds that  memory
+      is tight, another option must be found.
+   4/ Write out dirty anonymous pages to the "Swap" partition/file.
+      This is the most interesting for a couple of reasons.
+      a/ Unlike dirty page-cache pages, there is no need to write anon
+         pages out unless we are actually short of memory.  Thus they
+         tend to be left to last.
+      b/ Anon pages tend to be updated randomly and unpredictably, and
+         flushing them out of memory can have a very significant
+         performance impact on the process using them.  This contrasts
+         with page-cache pages which are often written sequentially
+         and often treated as "write-once, read-many".
+      So anon pages tend to be left until last to be cleaned, and may
+      be the only cleanable pages while there are still some dirty
+      page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying.  There seems to be too much
+ hand-waving.  If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure.  It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device).  Block device
+drivers are (required to be) written so that they pre-allocate any
+memory that might be needed during write-out, and block when the
+pre-allocated memory is exhausted and no other memory is available.
+They can be sure not to
+block forever as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out.  The primary
+mechanism for pre-allocating memory is called "mempools".
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS, NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe.  Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them.  Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+  To handle low memory conditions we need to know when those
+  conditions exist.  Having a global "low on memory" flag seems easy,
+  but its implementation is problematic.  Instead we make it possible
+  to tell if a recent memory allocation required use of the emergency
+  memory pool.
+  For pages returned by alloc_page, the new page->reserve flag
+  can be tested.  If this is set, then a low memory condition was
+  current when the page was allocated, so the memory should be used
+  carefully. (Because low memory conditions are transient, this
+  state is kept in an overloaded member instead of in page flags, which
+  would suggest a more permanent state.)
+
+  For memory allocated using slab/slub: If a page that is added to a
+  kmem_cache is found to have page->reserve set, then a  s->reserve
+  flag is set for the whole kmem_cache.  Further allocations will only
+  be returned from that page (or any other page in the cache) if they
+  are emergency allocation (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+  Non-emergency allocations will block in alloc_page until a
+  non-reserve page is available.  Once a non-reserve page has been
+  added to the cache, the s->reserve flag on the cache is removed.
+
+  Because slab objects have no individual state, it is hard to pass
+  the reserve state along; instead the current code relies on a
+  regular allocation failing.  Various allocation wrappers help here.
+
+  This allows us to
+   a/ request use of the emergency pool when allocating memory
+     (GFP_MEMALLOC), and
+   b/ to find out if the emergency pool was used.
+
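+  For illustration only (not part of Neil's original text), a minimal
+  sketch combining the two: request access to the reserves with
+  GFP_MEMALLOC and afterwards test page->reserve, both introduced by
+  later patches in this series:
+
+	static struct page *swap_alloc_page(void)
+	{
+		/* ask for access to the emergency reserves */
+		struct page *page = alloc_page(GFP_MEMALLOC);
+
+		/* a non-zero ->reserve means the reserves were used */
+		if (page && page->reserve)
+			pr_debug("allocation came from the emergency pool\n");
+
+		return page;
+	}
+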
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+  When memory from the reserve is used to store incoming network
+  packets, the memory must be freed (and the packet dropped) as soon
+  as we find out that the packet is not for a socket that is used for
+  swap-out.
+  To achieve this we have an ->emergency flag for skbs, and an
+  SK_MEMALLOC flag for sockets.
+  When memory is allocated for an skb, it is allocated with
+  GFP_MEMALLOC (if we are currently swapping over the network at
+  all).  If a subsequent test shows that the emergency pool was used,
+  ->emergency is set.
+  When the skb is finally attached to its destination socket, the
+  SK_MEMALLOC flag on the socket is tested.  If the skb has
+  ->emergency set, but the socket does not have SK_MEMALLOC set, then
+  the skb is immediately freed and the packet is dropped.
+  This ensures that reserve memory is never queued on a socket that is
+  not used for swapout.
+
+  Similarly, if an skb is ever queued for delivery to user-space for
+  example by netfilter, the ->emergency flag is tested and the skb is
+  released if ->emergency is set. (so obviously the storage route may
+  not pass through a userspace helper, otherwise the packets will never
+  arrive and we'll deadlock)
+
+  This ensures that memory from the emergency reserve can be used to
+  allow swapout to proceed, but will not get caught up in any other
+  network queue.
+
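+  A conceptual sketch of the receive-side test described above; the
+  helper name sk_is_memalloc() is made up here, the actual patches
+  test the SK_MEMALLOC flag on the socket directly:
+
+	static int queue_emergency_skb(struct sock *sk, struct sk_buff *skb)
+	{
+		if (skb->emergency && !sk_is_memalloc(sk)) {
+			/* reserve memory must not queue on ordinary sockets */
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		return sock_queue_rcv_skb(sk, skb);
+	}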
+
+3/ pages_emergency
+
+  The above would be sufficient if the total memory below the lowest
+  memory watermark (i.e. the size of the emergency reserve) were known
+  to be enough to hold all transient allocations needed for writeout.
+  I'm a little blurry on how big the current emergency pool is, but it
+  isn't big and certainly hasn't been sized to allow network traffic
+  to consume any.
+
+  We could simply make the size of the reserve bigger. However in the
+  common case that we are not swapping over the network, that would be
+  a waste of memory.
+
+  So a new "watermark" is defined: pages_emergency.  This is
+  effectively added to the current low water marks, so that pages from
+  this emergency pool can only be allocated if one of PF_MEMALLOC or
+  GFP_MEMALLOC is set.
+
+  pages_emergency can be changed dynamically based on need.  When
+  swapout over the network is required, pages_emergency is increased
+  to cover the maximum expected load.  When network swapout is
+  disabled, pages_emergency is decreased.
+
+  To determine how much to increase it by, we introduce reservation
+  groups....
+
+3a/ reservation groups
+
+  The memory used transiently for swapout can be in a number of
+  different places.  e.g. the network route cache, the network
+  fragment cache, in transit between network card and socket, or (in
+  the case of NFS) in sunrpc data structures awaiting a reply.
+  We need to ensure each of these is limited in the amount of memory
+  they use, and that the maximum is included in the reserve.
+
+  The memory required by the network layer only needs to be reserved
+  once, even if there are multiple swapout paths using the network
+  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+  the same time would be unusual).
+
+  So we create a tree of reservation groups.  The network might
+  register a collection of reservations, but not mark them as being in
+  use.  NFS and sunrpc might similarly register a collection of
+  reservations, and attach it to the network reservations as it
+  depends on them.
+  When swapout over NFS is requested, the NFS/sunrpc reservations are
+  activated which implicitly activates the network reservations.
+
+  The total new reservation is added to pages_emergency.
+
+  Provided each memory usage stays beneath the registered limit (at
+  least when allocating memory from reserves), the system will never
+  run out of emergency memory, and swapout will not deadlock.
+
+  It is worth noting here that it is not critical that each usage
+  stays beneath the limit 100% of the time.  Occasional excess is
+  acceptable provided that the memory will be freed  again within a
+  short amount of time that does *not* require waiting for any event
+  that itself might require memory.
+  This is because, at all stages of transmit and receive, it is
+  acceptable to discard all transient memory associated with a
+  particular writeout and try again later.  On transmit, the page can
+  be re-queued for later transmission.  On receive, the packet can be
+  dropped assuming that the peer will resend after a timeout.
+
+  Thus allocations that are truly transient and will be freed without
+  blocking do not strictly need to be reserved for.  Doing so might
+  still be a good idea to ensure forward progress doesn't take too
+  long.
+
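+  Purely as an illustration of the shape such a tree might take (the
+  function and variable names below are hypothetical; the real
+  interface is added in include/linux/reserve.h later in this series):
+
+	/* the network registers its reservations once */
+	mem_reserve_init(&net_reserve, "total network reserve", NULL);
+
+	/* NFS/sunrpc attach their reservations underneath it */
+	mem_reserve_init(&nfs_reserve, "nfs reserve", &net_reserve);
+
+	/*
+	 * swapon over NFS: activating the NFS group implicitly activates
+	 * the network group it depends on, and the total is added to
+	 * pages_emergency.
+	 */
+	mem_reserve_activate(&nfs_reserve);
+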
+4/ low-mem accounting
+
+  Most places that might hold on to emergency memory (e.g. route
+  cache, fragment cache etc) already place a limit on the amount of
+  memory that they can use.  This limit can simply be reserved using
+  the above mechanism and no more needs to be done.
+
+  However some memory usage might not be accounted with sufficient
+  firmness to allow an appropriate emergency reservation.  The
+  in-flight skbs for incoming packets is one such example.
+
+  To support this, a low-overhead mechanism for accounting memory
+  usage against the reserves is provided.  This mechanism uses the
+  same data structure that is used to store the emergency memory
+  reservations through the addition of a 'usage' field.
+
+  Before we attempt allocation from the memory reserves, we must check
+  if the resulting 'usage' is below the reservation. If so, we increase
+  the usage and attempt the allocation (which should succeed). If
+  the projected 'usage' exceeds the reservation we'll either fail the
+  allocation, or wait for 'usage' to decrease enough so that it would
+  succeed, depending on __GFP_WAIT.
+
+  When memory that was allocated for that purpose is freed, the
+  'usage' field is checked again.  If it is non-zero, then the size of
+  the freed memory is subtracted from the usage, making sure the usage
+  never becomes less than zero.
+
+  This provides adequate accounting with minimal overheads when not in
+  a low memory condition.  When a low memory condition is encountered
+  it does add the cost of a spin lock necessary to serialise updates
+  to 'usage'.
+
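+  In pseudo-C, with hypothetical type and field names, the charge side
+  of the accounting sketched above looks roughly like this (the real
+  code takes the lock only when the reserves are actually in use):
+
+	static int reserve_charge(struct mem_reserve *res, unsigned long size)
+	{
+		unsigned long flags;
+		int charged = 0;
+
+		spin_lock_irqsave(&res->lock, flags);
+		if (res->usage + size <= res->limit) {
+			res->usage += size;
+			charged = 1;
+		}
+		spin_unlock_irqrestore(&res->lock, flags);
+
+		/* on failure the caller fails or waits, based on __GFP_WAIT */
+		return charged;
+	}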
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+  any network socket that it uses, and can know when to account
+  reserve memory carefully, new address_space_operations are
+  available.
+  "swapon" requests that an address space (i.e a file) be make ready
+  for swapout.  swap_out and swap_in request the actual IO.  They
+  together must ensure that each swap_out request can succeed without
+  allocating more emergency memory that was reserved by swapon. swapoff
+  is used to reverse the state changes caused by swapon when we disable
+  the swap file.
+
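+  For example, an NFS-like filesystem would wire these up in its
+  address_space_operations; the method names are the ones introduced
+  above, the nfs_swap* implementations are only placeholders:
+
+	static const struct address_space_operations nfs_file_aops = {
+		.writepage	= nfs_writepage,
+		.swapon		= nfs_swapon,
+		.swapoff	= nfs_swapoff,
+		.swap_out	= nfs_swap_out,
+		.swap_in	= nfs_swap_in,
+	};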
+
+Thanks for reading this far.  I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+

-- 



* [PATCH 02/30] mm: gfp_to_alloc_flags()
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 01/30] swap over network documentation Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-08-12  5:01   ` Neil Brown
  2008-07-24 14:00 ` [PATCH 03/30] mm: tag reserve pages Peter Zijlstra
                   ` (28 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-gfp-to-alloc_flags.patch --]
[-- Type: text/plain, Size: 5589 bytes --]

Factor out the gfp to alloc_flags mapping so it can be used in other places.
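
For reference, a later patch in this series uses the new helper roughly
like the sketch below to decide whether an allocation context is
entitled to the reserves (the wrapper name is illustrative only):

	static inline int can_use_reserves(gfp_t gfp_mask)
	{
		return gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS;
	}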

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   10 +++++
 mm/page_alloc.c |   95 +++++++++++++++++++++++++++++++-------------------------
 2 files changed, 64 insertions(+), 41 deletions(-)

Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -187,6 +187,16 @@ static inline void free_page_mlock(struc
 #define __paginginit __init
 #endif
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1122,14 +1122,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1510,6 +1502,44 @@ static void set_page_owner(struct page *
 #endif /* CONFIG_PAGE_OWNER */
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
@@ -1567,49 +1597,28 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
+	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags);
+			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, nodemask, order,
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		if (page)
+			goto got_pg;
+
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1618,6 +1627,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */

-- 



* [PATCH 03/30] mm: tag reserve pages
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 01/30] swap over network documentation Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 02/30] mm: gfp_to_alloc_flags() Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 04/30] mm: slub: trivial cleanups Peter Zijlstra
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: page_alloc-reserve.patch --]
[-- Type: text/plain, Size: 1272 bytes --]

Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -70,6 +70,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1433,8 +1433,10 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);

-- 



* [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 03/30] mm: tag reserve pages Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-28  9:43   ` Pekka Enberg
  2008-07-29 22:15   ` Pekka Enberg
  2008-07-24 14:00 ` [PATCH 05/30] mm: slb: add knowledge of reserve pages Peter Zijlstra
                   ` (26 subsequent siblings)
  30 siblings, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: cleanup-slub.patch --]
[-- Type: text/plain, Size: 4894 bytes --]

Some cleanups....

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slub_def.h |    7 ++++++-
 mm/slub.c                |   40 ++++++++++++++++++----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -27,7 +27,7 @@
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -163,11 +163,11 @@ static struct notifier_block slab_notifi
 #endif
 
 static enum {
-	DOWN,		/* No slab functionality available */
+	DOWN = 0,	/* No slab functionality available */
 	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
 	UP,		/* Everything works but does not show up in sysfs */
 	SYSFS		/* Sysfs up */
-} slab_state = DOWN;
+} slab_state;
 
 /* A list of all slab caches on the system */
 static DECLARE_RWSEM(slub_lock);
@@ -288,21 +288,22 @@ static inline int slab_index(void *p, st
 static inline struct kmem_cache_order_objects oo_make(int order,
 						unsigned long size)
 {
-	struct kmem_cache_order_objects x = {
-		(order << 16) + (PAGE_SIZE << order) / size
-	};
+	struct kmem_cache_order_objects x;
+
+	x.order = order;
+	x.objects = (PAGE_SIZE << order) / size;
 
 	return x;
 }
 
 static inline int oo_order(struct kmem_cache_order_objects x)
 {
-	return x.x >> 16;
+	return x.order;
 }
 
 static inline int oo_objects(struct kmem_cache_order_objects x)
 {
-	return x.x & ((1 << 16) - 1);
+	return x.objects;
 }
 
 #ifdef CONFIG_SLUB_DEBUG
@@ -1076,8 +1077,7 @@ static struct page *allocate_slab(struct
 
 	flags |= s->allocflags;
 
-	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
-									oo);
+	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
 	if (unlikely(!page)) {
 		oo = s->min;
 		/*
@@ -1099,8 +1099,7 @@ static struct page *allocate_slab(struct
 	return page;
 }
 
-static void setup_object(struct kmem_cache *s, struct page *page,
-				void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)
 {
 	setup_object_debug(s, page, object);
 	if (unlikely(s->ctor))
@@ -1157,8 +1156,7 @@ static void __free_slab(struct kmem_cach
 		void *p;
 
 		slab_pad_check(s, page);
-		for_each_object(p, s, page_address(page),
-						page->objects)
+		for_each_object(p, s, page_address(page), page->objects)
 			check_object(s, page, p, 0);
 		__ClearPageSlubDebug(page);
 	}
@@ -1224,8 +1222,7 @@ static __always_inline int slab_trylock(
 /*
  * Management of partially allocated slabs
  */
-static void add_partial(struct kmem_cache_node *n,
-				struct page *page, int tail)
+static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	spin_lock(&n->list_lock);
 	n->nr_partial++;
@@ -1251,8 +1248,8 @@ static void remove_partial(struct kmem_c
  *
  * Must hold list_lock.
  */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
-							struct page *page)
+static inline
+int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
@@ -1799,11 +1796,11 @@ static int slub_nomerge;
  * slub_max_order specifies the order where we begin to stop considering the
  * number of objects in a slab as critical. If we reach slub_max_order then
  * we try to keep the page order as low as possible. So we accept more waste
- * of space in favor of a small page order.
+ * of space in favour of a small page order.
  *
  * Higher order allocations also allow the placement of more objects in a
  * slab and thereby reduce object handling overhead. If the user has
- * requested a higher mininum order then we start with that one instead of
+ * requested a higher minimum order then we start with that one instead of
  * the smallest order which will fit the object.
  */
 static inline int slab_order(int size, int min_objects,
@@ -1816,8 +1813,7 @@ static inline int slab_order(int size, i
 	if ((PAGE_SIZE << min_order) / size > 65535)
 		return get_order(size * 65535) - 1;
 
-	for (order = max(min_order,
-				fls(min_objects * size - 1) - PAGE_SHIFT);
+	for (order = max(min_order, fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {
 
 		unsigned long slab_size = PAGE_SIZE << order;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -60,7 +60,12 @@ struct kmem_cache_node {
  * given order would contain.
  */
 struct kmem_cache_order_objects {
-	unsigned long x;
+	union {
+		u32 x;
+		struct {
+			u16 order, objects;
+		};
+	};
 };
 
 /*

-- 



* [PATCH 05/30] mm: slb: add knowledge of reserve pages
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 04/30] mm: slub: trivial cleanups Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-08-12  5:35   ` Neil Brown
  2008-07-24 14:00 ` [PATCH 06/30] mm: kmem_alloc_estimate() Peter Zijlstra
                   ` (25 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: reserve-slub.patch --]
[-- Type: text/plain, Size: 9643 bytes --]

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slub_def.h |    1 
 mm/slab.c                |   60 +++++++++++++++++++++++++++++++++++++++--------
 mm/slub.c                |   28 ++++++++++++++++++---
 3 files changed, 75 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -23,6 +23,7 @@
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
 #include <linux/math64.h>
+#include "internal.h"
 
 /*
  * Lock order:
@@ -1106,7 +1107,8 @@ static void setup_object(struct kmem_cac
 		s->ctor(s, object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static
+struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	void *start;
@@ -1120,6 +1122,8 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
+
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
@@ -1509,10 +1513,20 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+	int reserve;
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves we
+		 * must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto grow_slab;
+	}
 	if (!c->page)
 		goto new_slab;
 
@@ -1526,7 +1540,7 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (unlikely(PageSlubDebug(c->page) || c->reserve))
 		goto debug;
 
 	c->freelist = object[c->offset];
@@ -1549,16 +1563,18 @@ new_slab:
 		goto load_freelist;
 	}
 
+grow_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	new = new_slab(s, gfpflags, node, &reserve);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
 	if (new) {
 		c = get_cpu_slab(s, smp_processor_id());
+		c->reserve = reserve;
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1569,7 +1585,8 @@ new_slab:
 	}
 	return NULL;
 debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (PageSlubDebug(c->page) &&
+			!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
 	c->page->inuse++;
@@ -2068,10 +2085,11 @@ static struct kmem_cache_node *early_kme
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -38,6 +38,7 @@ struct kmem_cache_cpu {
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
+	int reserve;		/* Did the current page come from the reserve */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -116,6 +116,8 @@
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
 
+#include 	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -265,7 +267,8 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	unsigned int touched:1,
+		     reserve:1;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
@@ -761,6 +764,27 @@ static inline struct array_cache *cpu_ca
 	return cachep->array[smp_processor_id()];
 }
 
+/*
+ * If the last page came from the reserves, and the current allocation context
+ * does not have access to them, force an allocation to test the watermarks.
+ */
+static inline int slab_force_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+	if (unlikely(cpu_cache_get(cachep)->reserve) &&
+			!(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		return 1;
+
+	return 0;
+}
+
+static inline void slab_set_reserve(struct kmem_cache *cachep, int reserve)
+{
+	struct array_cache *ac = cpu_cache_get(cachep);
+
+	if (unlikely(ac->reserve != reserve))
+		ac->reserve = reserve;
+}
+
 static inline struct kmem_cache *__find_general_cachep(size_t size,
 							gfp_t gfpflags)
 {
@@ -960,6 +984,7 @@ static struct array_cache *alloc_arrayca
 		nc->limit = entries;
 		nc->batchcount = batchcount;
 		nc->touched = 0;
+		nc->reserve = 0;
 		spin_lock_init(&nc->lock);
 	}
 	return nc;
@@ -1662,7 +1687,8 @@ __initcall(cpucache_init);
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		int *reserve)
 {
 	struct page *page;
 	int nr_pages;
@@ -1684,6 +1710,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	*reserve = page->reserve;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2112,6 +2139,7 @@ static int __init_refok setup_cpu_cache(
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
 	cpu_cache_get(cachep)->touched = 0;
+	cpu_cache_get(cachep)->reserve = 0;
 	cachep->batchcount = 1;
 	cachep->limit = BOOT_CPUCACHE_ENTRIES;
 	return 0;
@@ -2767,6 +2795,7 @@ static int cache_grow(struct kmem_cache 
 	size_t offset;
 	gfp_t local_flags;
 	struct kmem_list3 *l3;
+	int reserve;
 
 	/*
 	 * Be lazy and only check for valid flags here,  keeping it out of the
@@ -2805,7 +2834,7 @@ static int cache_grow(struct kmem_cache 
 	 * 'nodeid'.
 	 */
 	if (!objp)
-		objp = kmem_getpages(cachep, local_flags, nodeid);
+		objp = kmem_getpages(cachep, local_flags, nodeid, &reserve);
 	if (!objp)
 		goto failed;
 
@@ -2822,6 +2851,7 @@ static int cache_grow(struct kmem_cache 
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
+	slab_set_reserve(cachep, reserve);
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */
@@ -2967,7 +2997,8 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep,
+		gfp_t flags, int must_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2977,6 +3008,8 @@ static void *cache_alloc_refill(struct k
 retry:
 	check_irq_off();
 	node = numa_node_id();
+	if (unlikely(must_refill))
+		goto force_grow;
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3044,11 +3077,14 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || must_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3203,17 +3239,18 @@ static inline void *____cache_alloc(stru
 {
 	void *objp;
 	struct array_cache *ac;
+	int must_refill = slab_force_alloc(cachep, flags);
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && !must_refill)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, must_refill);
 	}
 	return objp;
 }
@@ -3257,7 +3294,7 @@ static void *fallback_alloc(struct kmem_
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
-	int nid;
+	int nid, reserve;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
@@ -3293,10 +3330,11 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, -1);
+		obj = kmem_getpages(cache, local_flags, -1, &reserve);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
+			slab_set_reserve(cache, reserve);
 			/*
 			 * Insert into the appropriate per node queues
 			 */
@@ -3335,6 +3373,9 @@ static void *____cache_alloc_node(struct
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
+	if (unlikely(slab_force_alloc(cachep, flags)))
+		goto force_grow;
+
 retry:
 	check_irq_off();
 	spin_lock(&l3->list_lock);
@@ -3372,6 +3413,7 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;

-- 



* [PATCH 06/30] mm: kmem_alloc_estimate()
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 05/30] mm: slb: add knowledge of reserve pages Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-30 12:21   ` Pekka Enberg
  2008-07-24 14:00 ` [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
                   ` (24 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-kmem_estimate_pages.patch --]
[-- Type: text/plain, Size: 6807 bytes --]

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.
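
As a hypothetical example of the intended use, converting a demand of
256 kmalloc() objects of up to 4000 bytes each into a page count for a
reserve would look like this (the numbers are made up):

	unsigned pages = kmalloc_estimate_fixed(4000, GFP_ATOMIC, 256);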

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slab.h |    4 ++
 mm/slab.c            |   75 +++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c            |   87 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 166 insertions(+)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -65,6 +65,8 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+			gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -99,6 +101,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kmalloc_estimate_fixed(size_t, gfp_t, int);
+unsigned kmalloc_estimate_variable(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2412,6 +2412,42 @@ const char *kmem_cache_name(struct kmem_
 }
 EXPORT_SYMBOL(kmem_cache_name);
 
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ *
+ * We should use s->min_objects because those are the least efficient.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long pages;
+	struct kmem_cache_order_objects x;
+
+	if (WARN_ON(!s) || WARN_ON(!oo_objects(s->min)))
+		return 0;
+
+	x = s->min;
+	pages = DIV_ROUND_UP(objects, oo_objects(x)) << oo_order(x);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more that
+	 * one object. Use s->max_objects because that's the worst case.
+	 */
+	x = s->oo;
+	if (oo_objects(x) > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		pages += num_online_cpus() << oo_order(x);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmem_alloc_estimate);
+
 static void list_slab_objects(struct kmem_cache *s, struct page *page,
 							const char *text)
 {
@@ -2789,6 +2825,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_fixed(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_fixed);
+
+/*
+ * Calculate the upper bound of pages requires to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = &kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_variable);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -3854,6 +3854,81 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+		gfp_t flags, int objects)
+{
+	/*
+	 * (1) memory for objects,
+	 */
+	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
+	unsigned nr_pages = nr_slabs << cachep->gfporder;
+
+	/*
+	 * (2) memory for each per-cpu queue (nr_cpu_ids),
+	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
+	 * (4) some amount of memory for the slab management structures
+	 *
+	 * XXX: truely account these
+	 */
+	nr_pages += 1 + ilog2(nr_pages);
+
+	return nr_pages;
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_fixed(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = kmem_find_general_cachep(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_fixed);
+
+/*
+ * Calculate the upper bound of pages requires to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+	struct cache_sizes *csizep = malloc_sizes;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (csizep = malloc_sizes; csizep->cs_cachep; csizep++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & __GFP_DMA))
+			s = csizep->cs_dmacachep;
+		else
+#endif
+			s = csizep->cs_cachep;
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_variable);
+
+/*
  * This initializes kmem_list3 or resizes various caches for all nodes.
  */
 static int alloc_kmemlist(struct kmem_cache *cachep)

-- 



* [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 06/30] mm: kmem_alloc_estimate() Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 08/30] mm: serialize access to min_free_kbytes Peter Zijlstra
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2233 bytes --]

Allow PF_MEMALLOC to be set in softirq context. 

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.
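
The intended pattern (the call site here is made up) is to save the task
flags, set PF_MEMALLOC around the processing of emergency packets, and
restore the old value with the helper added below:

	unsigned long pflags = current->flags;

	current->flags |= PF_MEMALLOC;
	/* ... receive processing that may allocate from the reserves ... */
	tsk_restore_flags(current, pflags, PF_MEMALLOC);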

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 14 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1449,9 +1449,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -213,6 +213,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -251,6 +253,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1533,6 +1533,13 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+static inline void tsk_restore_flags(struct task_struct *p,
+				     unsigned long pflags, unsigned long mask)
+{
+	p->flags &= ~mask;
+	p->flags |= pflags & mask;
+}
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const cpumask_t *new_mask);

-- 



* [PATCH 08/30] mm: serialize access to min_free_kbytes
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-30 12:36   ` Pekka Enberg
  2008-07-24 14:00 ` [PATCH 09/30] mm: emergency pool Peter Zijlstra
                   ` (22 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 1844 bytes --]

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,7 @@ static char * const zone_names[MAX_NR_ZO
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4333,12 +4334,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4433,6 +4434,15 @@ void setup_per_zone_inactive_ratio(void)
 	}
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4468,7 +4478,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	setup_per_zone_inactive_ratio();
 	return 0;

-- 



* [PATCH 09/30] mm: emergency pool
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 08/30] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 10/30] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 7056 bytes --]

Provide a means to reserve a specific number of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
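
For illustration only (editorial sketch, not part of the patch; the function
name is made up), the resulting watermark test looks roughly like this:

/*
 * ALLOC_HIGH and ALLOC_HARDER scale pages_min down, so pages_min alone
 * is not a strict floor.  pages_emerg is added outside that scaling.
 */
static int watermark_ok_sketch(long free_pages, long pages_min,
			       long pages_emerg, int high, int harder)
{
	long min = pages_min;

	if (high)
		min -= min / 2;		/* relative adjustment */
	if (harder)
		min -= min / 4;		/* relative adjustment */

	/*
	 * e.g. pages_min = 128 with both flags set gives min = 48;
	 * pages_emerg = 32 still enforces a hard floor of 48 + 32 = 80.
	 */
	return free_pages > min + pages_emerg;
}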

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mmzone.h |    6 ++-
 mm/page_alloc.c        |   84 +++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 +--
 3 files changed, 82 insertions(+), 14 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -265,7 +265,10 @@ enum zone_type {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		pages_high;	/* we stop kswapd */
+	unsigned long		pages_low;	/* we wake up kswapd */
+	unsigned long		pages_min;	/* we enter direct reclaim */
+	unsigned long		pages_emerg;	/* emergency pool */
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -751,6 +754,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 			struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -120,6 +120,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1235,7 +1237,7 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min+z->lowmem_reserve[classzone_idx]+z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1558,7 +1560,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
 	int do_retry;
-	int alloc_flags;
+	int alloc_flags = 0;
 	unsigned long did_some_progress;
 	unsigned long pages_reclaimed = 0;
 
@@ -1724,8 +1726,8 @@ nofail_alloc:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -2008,9 +2010,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(zone->pages_min),
-			K(zone->pages_low),
-			K(zone->pages_high),
+			K(zone->pages_emerg + zone->pages_min),
+			K(zone->pages_emerg + zone->pages_low),
+			K(zone->pages_emerg + zone->pages_high),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -4284,7 +4286,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat pages_high as reserved pages. */
-			max += zone->pages_high;
+			max += zone->pages_high + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4341,7 +4343,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -4353,11 +4356,13 @@ static void __setup_per_zone_pages_min(v
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -4376,12 +4381,14 @@ static void __setup_per_zone_pages_min(v
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->pages_min = min_pages;
+			zone->pages_emerg = 0;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->pages_min = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->pages_low   = zone->pages_min + (tmp >> 2);
@@ -4443,6 +4450,63 @@ void setup_per_zone_pages_min(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+static void __adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_pages_min();
+}
+
+static int test_reserve_limits(void)
+{
+	struct zone *zone;
+	int node;
+
+	for_each_zone(zone)
+		wakeup_kswapd(zone, 0);
+
+	for_each_online_node(node) {
+		struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+		if (!page)
+			return -ENOMEM;
+
+		__free_page(page);
+	}
+
+	return 0;
+}
+
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks reclaim into action to
+ *	satisfy the higher watermarks.
+ *
+ *	returns -ENOMEM when it failed to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int err = 0;
+
+	mutex_lock(&var_free_mutex);
+	__adjust_memalloc_reserve(pages);
+	if (pages > 0) {
+		err = test_reserve_limits();
+		if (err) {
+			__adjust_memalloc_reserve(-pages);
+			goto unlock;
+		}
+	}
+	printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+	mutex_unlock(&var_free_mutex);
+	return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -785,9 +785,9 @@ static void zoneinfo_show_print(struct s
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   zone->pages_min,
-		   zone->pages_low,
-		   zone->pages_high,
+		   zone->pages_emerg + zone->pages_min,
+		   zone->pages_emerg + zone->pages_low,
+		   zone->pages_emerg + zone->pages_high,
 		   zone->pages_scanned,
 		   zone->lru[LRU_ACTIVE_ANON].nr_scan,
 		   zone->lru[LRU_INACTIVE_ANON].nr_scan,

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 10/30] mm: system wide ALLOC_NO_WATERMARK
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 09/30] mm: emergency pool Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 11/30] mm: __GFP_MEMALLOC Peter Zijlstra
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: global-ALLOC_NO_WATERMARKS.patch --]
[-- Type: text/plain, Size: 1276 bytes --]

The reserve is proportionally distributed over all (!highmem) zones in the
system. So we need to allow an emergency allocation access to all zones. In
order to do that we need to break out of any mempolicy boundaries we might
have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed to be
within mempolicy boundaries. For instance IRQs don't even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user allocations), is OK.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page_alloc.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1533,6 +1533,11 @@ restart:
 rebalance:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
 		/* go through the zonelist yet again, ignoring mins */
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 11/30] mm: __GFP_MEMALLOC
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 10/30] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-25  9:29   ` KOSAKI Motohiro
  2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
                   ` (19 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 2134 bytes --]

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.

It allows the memalloc state to be passed along in object-related allocation
flags (such as sk->sk_allocation) rather than in task-related flags.
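
As a hedged illustration (not part of the patch; the helper below and its use
of sk->sk_allocation are hypothetical), the flag lets a single allocation site
opt into the reserves without touching current->flags:

/* Hypothetical example: only this allocation may dip into the reserves,
 * regardless of the calling task's PF_MEMALLOC state. */
static struct sk_buff *rx_skb_alloc_sketch(struct sock *sk, unsigned int len)
{
	return alloc_skb(len, sk->sk_allocation | __GFP_MEMALLOC);
}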

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1452,7 +1452,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 12/30] mm: memory reserve management
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 11/30] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-28 10:06   ` Pekka Enberg
                     ` (2 more replies)
  2008-07-24 14:00 ` [PATCH 13/30] selinux: tag avc cache alloc as non-critical Peter Zijlstra
                   ` (18 subsequent siblings)
  30 siblings, 3 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-reserve.patch --]
[-- Type: text/plain, Size: 23621 bytes --]

Generic reserve management code. 

It provides methods to reserve and charge. On top of these, generic alloc/free
style reserve pools can be built, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.
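
A minimal usage sketch against the API declared in reserve.h below (editorial
illustration; the reserve name, object size and node value are made up):

static struct mem_reserve my_reserve;

static int my_subsys_init(void)
{
	int err;

	mem_reserve_init(&my_reserve, "my RX reserve", NULL);
	err = mem_reserve_connect(&my_reserve, &mem_reserve_root);
	if (err)
		return err;

	/* back 16 objects of 2K each with actual reserve pages */
	return mem_reserve_kmalloc_set(&my_reserve, 16 * 2048);
}

static void *my_object_alloc(gfp_t gfp)
{
	int emerg = 0;

	/*
	 * Falls back to the reserve (and charges it) only when a normal
	 * allocation fails and the caller may use the reserves; free with
	 * kfree_reserve(obj, &my_reserve, emerg) so the charge is undone
	 * iff the object really came from the reserve.
	 */
	return kmalloc_reserve(2048, gfp, -1, &my_reserve, &emerg);
}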

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/reserve.h |  146 +++++++++++
 include/linux/slab.h    |   20 -
 mm/Makefile             |    2 
 mm/reserve.c            |  594 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c               |    4 
 5 files changed, 755 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/reserve.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,146 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+
+struct mem_reserve {
+	struct mem_reserve *parent;
+	struct list_head children;
+	struct list_head siblings;
+
+	const char *name;
+
+	long pages;
+	long limit;
+	long usage;
+	spinlock_t lock;	/* protects limit and usage */
+
+	wait_queue_head_t waitqueue;
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+			struct mem_reserve *node);
+void mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+			       struct kmem_cache *s,
+			       int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+				  struct kmem_cache *s, long objs);
+
+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			 struct mem_reserve *res, int *emerg);
+
+static inline
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+
+	obj = __kmalloc_node_track_caller(size,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node, ip);
+	if (!obj)
+		obj = ___kmalloc_reserve(size, flags, node, ip, res, emerg);
+
+	return obj;
+}
+
+#define kmalloc_reserve(size, gfp, node, res, emerg) 			\
+	__kmalloc_reserve(size, gfp, node, 				\
+			  __builtin_return_address(0), res, emerg)
+
+void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg);
+
+static inline
+void kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
+{
+	if (unlikely(obj && res && emerg))
+		__kfree_reserve(obj, res, emerg);
+	else
+		kfree(obj);
+}
+
+void *__kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+				 struct mem_reserve *res, int *emerg);
+static inline
+void *kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+			       struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+
+	obj = kmem_cache_alloc_node(s,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
+	if (!obj)
+		obj = __kmem_cache_alloc_reserve(s, flags, node, res, emerg);
+
+	return obj;
+}
+
+void __kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			       struct mem_reserve *res, int emerg);
+
+static inline
+void kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			     struct mem_reserve *res, int emerg)
+{
+	if (unlikely(obj && res && emerg))
+		__kmem_cache_free_reserve(s, obj, res, emerg);
+	else
+		kmem_cache_free(s, obj);
+}
+
+struct page *__alloc_pages_reserve(int node, gfp_t flags, int order,
+				  struct mem_reserve *res, int *emerg);
+
+static inline
+struct page *alloc_pages_reserve(int node, gfp_t flags, int order,
+				 struct mem_reserve *res, int *emerg)
+{
+	struct page *page;
+
+	page = alloc_pages_node(node,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, order);
+	if (!page)
+		page = __alloc_pages_reserve(node, flags, order, res, emerg);
+
+	return page;
+}
+
+void __free_pages_reserve(struct page *page, int order,
+			  struct mem_reserve *res, int emerg);
+
+static inline
+void free_pages_reserve(struct page *page, int order,
+			struct mem_reserve *res, int emerg)
+{
+	if (unlikely(page && res && emerg))
+		__free_pages_reserve(page, order, res, emerg);
+	else
+		__free_pages(page, order);
+}
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   maccess.o page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-			   page_isolation.o $(mmu-y)
+			   page_isolation.o reserve.o $(mmu-y)
 
 obj-$(CONFIG_PAGE_WALKER) += pagewalk.o
 obj-$(CONFIG_BOUNCE)	+= bounce.o
Index: linux-2.6/mm/reserve.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,594 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of object of specified
+ * size. Since memory is managed in pages, this reserve demand is then
+ * translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the unit starts mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+	.children = LIST_HEAD_INIT(mem_reserve_root.children),
+	.siblings = LIST_HEAD_INIT(mem_reserve_root.siblings),
+	.name = "total reserve",
+	.lock = __SPIN_LOCK_UNLOCKED(mem_reserve_root.lock),
+	.waitqueue = __WAIT_QUEUE_HEAD_INITIALIZER(mem_reserve_root.waitqueue),
+};
+EXPORT_SYMBOL_GPL(mem_reserve_root);
+
+/**
+ * mem_reserve_init() - initialize a memory reserve object
+ * @res - the new reserve object
+ * @name - a name for this reserve
+ * @parent - when non NULL, the parent to connect to.
+ */
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent)
+{
+	memset(res, 0, sizeof(*res));
+	INIT_LIST_HEAD(&res->children);
+	INIT_LIST_HEAD(&res->siblings);
+	res->name = name;
+	spin_lock_init(&res->lock);
+	init_waitqueue_head(&res->waitqueue);
+
+	if (parent)
+		mem_reserve_connect(res, parent);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_init);
+
+/*
+ * propagate the pages and limit changes up the (sub)tree.
+ */
+static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
+{
+	unsigned long flags;
+
+	for ( ; res; res = res->parent) {
+		res->pages += pages;
+
+		if (limit) {
+			spin_lock_irqsave(&res->lock, flags);
+			res->limit += limit;
+			spin_unlock_irqrestore(&res->lock, flags);
+		}
+	}
+}
+
+/**
+ * __mem_reserve_add() - primitive to change the size of a reserve
+ * @res - reserve to change
+ * @pages - page delta
+ * @limit - usage limit delta
+ *
+ * Returns -ENOMEM when a size increase is not possible atm.
+ */
+static int __mem_reserve_add(struct mem_reserve *res, long pages, long limit)
+{
+	int ret = 0;
+	long reserve;
+
+	/*
+	 * This looks more complex than need be, that is because we handle
+	 * the case where @res isn't actually connected to mem_reserve_root.
+	 *
+	 * So, by propagating the new pages up the (sub)tree and computing
+	 * the difference in mem_reserve_root.pages we find if this action
+	 * affects the actual reserve.
+	 *
+	 * The (partial) propagation also makes that mem_reserve_connect()
+	 * needs only look at the direct child, since each disconnected
+	 * sub-tree is fully up-to-date.
+	 */
+	reserve = mem_reserve_root.pages;
+	__calc_reserve(res, pages, 0);
+	reserve = mem_reserve_root.pages - reserve;
+
+	if (reserve) {
+		ret = adjust_memalloc_reserve(reserve);
+		if (ret)
+			__calc_reserve(res, -pages, 0);
+	}
+
+	/*
+	 * Delay updating the limits until we've acquired the resources to
+	 * back it.
+	 */
+	if (!ret)
+		__calc_reserve(res, 0, limit);
+
+	return ret;
+}
+
+/**
+ * __mem_reserve_charge() - primitive to charge object usage of a reserve
+ * @res - reserve to charge
+ * @charge - size of the charge
+ *
+ * Returns non-zero on success, zero on failure.
+ */
+static
+int __mem_reserve_charge(struct mem_reserve *res, long charge)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&res->lock, flags);
+	if (charge < 0 || res->usage + charge < res->limit) {
+		res->usage += charge;
+		if (unlikely(res->usage < 0))
+			res->usage = 0;
+		ret = 1;
+	}
+	if (charge < 0)
+		wake_up_all(&res->waitqueue);
+	spin_unlock_irqrestore(&res->lock, flags);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_connect() - connect a reserve to another in a child-parent relation
+ * @new_child - the reserve node to connect (child)
+ * @node - the reserve node to connect to (parent)
+ *
+ * Connecting a node results in an increase of the reserve by the amount of
+ * pages in @new_child->pages if @node has a connection to mem_reserve_root.
+ *
+ * Returns -ENOMEM when the new connection would increase the reserve (parent
+ * is connected to mem_reserve_root) and there is no memory to do so.
+ *
+ * On error, the child is _NOT_ connected.
+ */
+int mem_reserve_connect(struct mem_reserve *new_child, struct mem_reserve *node)
+{
+	int ret;
+
+	WARN_ON(!new_child->name);
+
+	mutex_lock(&mem_reserve_mutex);
+	if (new_child->parent) {
+		ret = -EEXIST;
+		goto unlock;
+	}
+	new_child->parent = node;
+	list_add(&new_child->siblings, &node->children);
+	ret = __mem_reserve_add(node, new_child->pages, new_child->limit);
+	if (ret) {
+		new_child->parent = NULL;
+		list_del_init(&new_child->siblings);
+	}
+unlock:
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_connect);
+
+/**
+ * mem_reserve_disconnect() - sever a nodes connection to the reserve tree
+ * @node - the node to disconnect
+ *
+ * Disconnecting a node results in a reduction of the reserve by @node->pages
+ * if node had a connection to mem_reserve_root.
+ */
+void mem_reserve_disconnect(struct mem_reserve *node)
+{
+	int ret;
+
+	BUG_ON(!node->parent);
+
+	mutex_lock(&mem_reserve_mutex);
+	if (!node->parent) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+	ret = __mem_reserve_add(node->parent, -node->pages, -node->limit);
+	if (!ret) {
+		node->parent = NULL;
+		list_del_init(&node->siblings);
+	}
+unlock:
+	mutex_unlock(&mem_reserve_mutex);
+
+	/*
+	 * We cannot fail to shrink the reserves, can we?
+	 */
+	WARN_ON(ret);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_disconnect);
+
+#ifdef CONFIG_PROC_FS
+
+/*
+ * Simple output of the reserve tree in: /proc/reserve_info
+ * Example:
+ *
+ * localhost ~ # cat /proc/reserve_info
+ * 1:0 "total reserve" 6232K 0/278581
+ * 2:1 "total network reserve" 6232K 0/278581
+ * 3:2 "network TX reserve" 212K 0/53
+ * 4:3 "protocol TX pages" 212K 0/53
+ * 5:2 "network RX reserve" 6020K 0/278528
+ * 6:5 "IPv4 route cache" 5508K 0/16384
+ * 7:5 "SKB data reserve" 512K 0/262144
+ * 8:7 "IPv4 fragment cache" 512K 0/262144
+ */
+
+static void mem_reserve_show_item(struct seq_file *m, struct mem_reserve *res,
+				  unsigned int parent, unsigned int *id)
+{
+	struct mem_reserve *child;
+	unsigned int my_id = ++*id;
+
+	seq_printf(m, "%d:%d \"%s\" %ldK %ld/%ld\n",
+			my_id, parent, res->name,
+			res->pages << (PAGE_SHIFT - 10),
+			res->usage, res->limit);
+
+	list_for_each_entry(child, &res->children, siblings)
+		mem_reserve_show_item(m, child, my_id, id);
+}
+
+static int mem_reserve_show(struct seq_file *m, void *v)
+{
+	unsigned int ident = 0;
+
+	mutex_lock(&mem_reserve_mutex);
+	mem_reserve_show_item(m, &mem_reserve_root, ident, &ident);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return 0;
+}
+
+static int mem_reserve_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mem_reserve_show, NULL);
+}
+
+static const struct file_operations mem_reserve_opterations = {
+	.open = mem_reserve_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static __init int mem_reserve_proc_init(void)
+{
+	proc_create("reserve_info", S_IRUSR, NULL, &mem_reserve_opterations);
+	return 0;
+}
+
+module_init(mem_reserve_proc_init);
+
+#endif
+
+/*
+ * alloc_page helpers
+ */
+
+/**
+ * mem_reserve_pages_set() - set reserves size in pages
+ * @res - reserve to set
+ * @pages - size in pages to set it to
+ *
+ * Returns -ENOMEM when it fails to set the reserve. On failure the old size
+ * is preserved.
+ */
+int mem_reserve_pages_set(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages -= res->pages;
+	ret = __mem_reserve_add(res, pages, pages * PAGE_SIZE);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_set);
+
+/**
+ * mem_reserve_pages_add() - change the size in a relative way
+ * @res - reserve to change
+ * @pages - number of pages to add (or subtract when negative)
+ *
+ * Similar to mem_reserve_pages_set, except that the argument is relative
+ * instead of absolute.
+ *
+ * Returns -ENOMEM when it fails to increase.
+ */
+int mem_reserve_pages_add(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	ret = __mem_reserve_add(res, pages, pages * PAGE_SIZE);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_pages_charge() - charge page usage to a reserve
+ * @res - reserve to charge
+ * @pages - size to charge
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages)
+{
+	return __mem_reserve_charge(res, pages * PAGE_SIZE);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_charge);
+
+/*
+ * kmalloc helpers
+ */
+
+/**
+ * mem_reserve_kmalloc_set() - set this reserve to bytes worth of kmalloc
+ * @res - reserve to change
+ * @bytes - size in bytes to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes)
+{
+	int ret;
+	long pages;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kmalloc_estimate_variable(GFP_ATOMIC, bytes);
+	pages -= res->pages;
+	bytes -= res->limit;
+	ret = __mem_reserve_add(res, pages, bytes);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_set);
+
+/**
+ * mem_reserve_kmalloc_charge() - charge bytes to a reserve
+ * @res - reserve to charge
+ * @bytes - bytes to charge
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes)
+{
+	if (bytes < 0)
+		bytes = -roundup_pow_of_two(-bytes);
+	else
+		bytes = roundup_pow_of_two(bytes);
+
+	return __mem_reserve_charge(res, bytes);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_charge);
+
+/*
+ * kmem_cache helpers
+ */
+
+/**
+ * mem_reserve_kmem_cache_set() - set reserve to @objects worth of kmem_cache_alloc of @s
+ * @res - reserve to set
+ * @s - kmem_cache to reserve from
+ * @objects - number of objects to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmem_cache_set(struct mem_reserve *res, struct kmem_cache *s,
+			       int objects)
+{
+	int ret;
+	long pages, bytes;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kmem_alloc_estimate(s, GFP_ATOMIC, objects);
+	pages -= res->pages;
+	bytes = objects * kmem_cache_size(s) - res->limit;
+	ret = __mem_reserve_add(res, pages, bytes);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_set);
+
+/**
+ * mem_reserve_kmem_cache_charge() - charge (or uncharge) usage of objs
+ * @res - reserve to charge
+ * @objs - objects to charge for
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res, struct kmem_cache *s,
+				  long objs)
+{
+	return __mem_reserve_charge(res, objs * kmem_cache_size(s));
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_charge);
+
+/*
+ * alloc wrappers
+ */
+
+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			 struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+	gfp_t gfp;
+
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+
+	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	if (res && !mem_reserve_kmalloc_charge(res, size)) {
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		wait_event(res->waitqueue,
+				mem_reserve_kmalloc_charge(res, size));
+
+		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+		if (obj) {
+			mem_reserve_kmalloc_charge(res, -size);
+			goto out;
+		}
+	}
+
+	obj = __kmalloc_node_track_caller(size, flags, node, ip);
+	WARN_ON(!obj);
+	if (emerg)
+		*emerg |= 1;
+
+out:
+	return obj;
+}
+
+void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
+{
+	size_t size = ksize(obj);
+
+	kfree(obj);
+	/*
+	 * ksize gives the full allocated size vs the requested size we used to
+	 * charge; however since we round up to the nearest power of two, this
+	 * should all work nicely.
+	 */
+	mem_reserve_kmalloc_charge(res, -size);
+}
+
+
+void *__kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+				 struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+	gfp_t gfp;
+
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	obj = kmem_cache_alloc_node(s, gfp, node);
+
+	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	if (res && !mem_reserve_kmem_cache_charge(res, s, 1)) {
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		wait_event(res->waitqueue,
+				mem_reserve_kmem_cache_charge(res, s, 1));
+
+		obj = kmem_cache_alloc_node(s, gfp, node);
+		if (obj) {
+			mem_reserve_kmem_cache_charge(res, s, -1);
+			goto out;
+		}
+	}
+
+	obj = kmem_cache_alloc_node(s, flags, node);
+	WARN_ON(!obj);
+	if (emerg)
+		*emerg |= 1;
+
+out:
+	return obj;
+}
+
+void __kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			       struct mem_reserve *res, int emerg)
+{
+	kmem_cache_free(s, obj);
+	mem_reserve_kmem_cache_charge(res, s, -1);
+}
+
+
+struct page *__alloc_pages_reserve(int node, gfp_t flags, int order,
+				   struct mem_reserve *res, int *emerg)
+{
+	struct page *page;
+	gfp_t gfp;
+	long pages = 1 << order;
+
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	page = alloc_pages_node(node, gfp, order);
+
+	if (page || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	if (res && !mem_reserve_pages_charge(res, pages)) {
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		wait_event(res->waitqueue,
+				mem_reserve_pages_charge(res, pages));
+
+		page = alloc_pages_node(node, gfp, order);
+		if (page) {
+			mem_reserve_pages_charge(res, -pages);
+			goto out;
+		}
+	}
+
+	page = alloc_pages_node(node, flags, order);
+	WARN_ON(!page);
+	if (emerg)
+		*emerg |= 1;
+
+out:
+	return page;
+}
+
+void __free_pages_reserve(struct page *page, int order,
+			  struct mem_reserve *res, int emerg)
+{
+	__free_pages(page, order);
+	mem_reserve_pages_charge(res, -(1 << order));
+}
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -229,13 +229,14 @@ static inline void *kmem_cache_alloc_nod
  */
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
 extern void *__kmalloc_track_caller(size_t, gfp_t, void*);
-#define kmalloc_track_caller(size, flags) \
-	__kmalloc_track_caller(size, flags, __builtin_return_address(0))
 #else
-#define kmalloc_track_caller(size, flags) \
+#define __kmalloc_track_caller(size, flags, ip) \
 	__kmalloc(size, flags)
 #endif /* DEBUG_SLAB */
 
+#define kmalloc_track_caller(size, flags) \
+	__kmalloc_track_caller(size, flags, __builtin_return_address(0))
+
 #ifdef CONFIG_NUMA
 /*
  * kmalloc_node_track_caller is a special version of kmalloc_node that
@@ -247,21 +248,22 @@ extern void *__kmalloc_track_caller(size
  */
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, void *);
-#define kmalloc_node_track_caller(size, flags, node) \
-	__kmalloc_node_track_caller(size, flags, node, \
-			__builtin_return_address(0))
 #else
-#define kmalloc_node_track_caller(size, flags, node) \
+#define __kmalloc_node_track_caller(size, flags, node, ip) \
 	__kmalloc_node(size, flags, node)
 #endif
 
 #else /* CONFIG_NUMA */
 
-#define kmalloc_node_track_caller(size, flags, node) \
-	kmalloc_track_caller(size, flags)
+#define __kmalloc_node_track_caller(size, flags, node, ip) \
+	__kmalloc_track_caller(size, flags, ip)
 
 #endif /* DEBUG_SLAB */
 
+#define kmalloc_node_track_caller(size, flags, node) \
+	__kmalloc_node_track_caller(size, flags, node, \
+			__builtin_return_address(0))
+
 /*
  * Shortcuts
  */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2706,6 +2706,7 @@ void *__kmalloc(size_t size, gfp_t flags
 }
 EXPORT_SYMBOL(__kmalloc);
 
+#ifdef CONFIG_NUMA
 static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 {
 	struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
@@ -2717,7 +2718,6 @@ static void *kmalloc_large_node(size_t s
 		return NULL;
 }
 
-#ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
@@ -3303,6 +3303,7 @@ void *__kmalloc_track_caller(size_t size
 	return slab_alloc(s, gfpflags, -1, caller);
 }
 
+#ifdef CONFIG_NUMA
 void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 					int node, void *caller)
 {
@@ -3318,6 +3319,7 @@ void *__kmalloc_node_track_caller(size_t
 
 	return slab_alloc(s, gfpflags, node, caller);
 }
+#endif
 
 #ifdef CONFIG_SLUB_DEBUG
 static unsigned long count_partial(struct kmem_cache_node *n,

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 13/30] selinux: tag avc cache alloc as non-critical
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 14/30] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown
  Cc: James Morris

[-- Attachment #1: mm-selinux-emergency.patch --]
[-- Type: text/plain, Size: 769 bytes --]

Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: James Morris <jmorris@namei.org>
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/security/selinux/avc.c
===================================================================
--- linux-2.6.orig/security/selinux/avc.c
+++ linux-2.6/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 14/30] net: wrap sk->sk_backlog_rcv()
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 13/30] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 15/30] net: packet split receive api Peter Zijlstra
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: net-backlog.patch --]
[-- Type: text/plain, Size: 2910 bytes --]

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
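
A hedged sketch of the kind of extension this enables (editorial;
skb_is_emergency() is a hypothetical predicate, tsk_restore_flags() is the
helper added earlier in this series):

static inline int sk_backlog_rcv_sketch(struct sock *sk, struct sk_buff *skb)
{
	unsigned long pflags = current->flags;
	int ret;

	/* allow packets backed by the emergency reserve to be processed
	 * with access to the reserves */
	if (skb_is_emergency(skb))		/* hypothetical */
		current->flags |= PF_MEMALLOC;

	ret = sk->sk_backlog_rcv(sk, skb);

	tsk_restore_flags(current, pflags, PF_MEMALLOC);
	return ret;
}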

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    5 +++++
 include/net/tcp.h    |    2 +-
 net/core/sock.c      |    4 ++--
 net/ipv4/tcp.c       |    2 +-
 net/ipv4/tcp_timer.c |    2 +-
 5 files changed, 10 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -482,6 +482,11 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)			\
 	({	int __rc;						\
 		release_sock(__sk);					\
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -326,7 +326,7 @@ int sk_receive_skb(struct sock *sk, stru
 		 */
 		mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-		rc = sk->sk_backlog_rcv(sk, skb);
+		rc = sk_backlog_rcv(sk, skb);
 
 		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	} else
@@ -1373,7 +1373,7 @@ static void __release_sock(struct sock *
 			struct sk_buff *next = skb->next;
 
 			skb->next = NULL;
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 			/*
 			 * We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1161,7 +1161,7 @@ static void tcp_prequeue_process(struct 
 	 * necessary */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);
 	local_bh_enable();
 
 	/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -203,7 +203,7 @@ static void tcp_delack_timer(unsigned lo
 		NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
 		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-			sk->sk_backlog_rcv(sk, skb);
+			sk_backlog_rcv(sk, skb);
 
 		tp->ucopy.memory = 0;
 	}
Index: linux-2.6/include/net/tcp.h
===================================================================
--- linux-2.6.orig/include/net/tcp.h
+++ linux-2.6/include/net/tcp.h
@@ -893,7 +893,7 @@ static inline int tcp_prequeue(struct so
 			BUG_ON(sock_owned_by_user(sk));
 
 			while ((skb1 = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) {
-				sk->sk_backlog_rcv(sk, skb1);
+				sk_backlog_rcv(sk, skb1);
 				NET_INC_STATS_BH(LINUX_MIB_TCPPREQUEUEDROPPED);
 			}
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 15/30] net: packet split receive api
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 14/30] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 16/30] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: net-ps_rx.patch --]
[-- Type: text/plain, Size: 9727 bytes --]

Add some packet-split receive hooks.

For one, this allows NUMA-node-affine page allocations. Later on, these hooks
will be extended to do emergency reserve allocations for fragments.
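
A condensed driver-side sketch of the two hooks (editorial; the wrapper names
are made up, the calls are the ones introduced below):

static struct page *rx_refill_page(struct net_device *dev)
{
	/* page allocation affine to the device's NUMA node */
	return netdev_alloc_page(dev);
}

static void rx_attach_page(struct sk_buff *skb, int i,
			   struct page *page, unsigned int len)
{
	/* fills the frag descriptor and updates skb->len, data_len and
	 * truesize in one go */
	skb_add_rx_frag(skb, i, page, 0, len);
}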

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/net/bnx2.c             |    8 +++-----
 drivers/net/e1000/e1000_main.c |    8 ++------
 drivers/net/e1000e/netdev.c    |    7 ++-----
 drivers/net/igb/igb_main.c     |    8 ++------
 drivers/net/ixgbe/ixgbe_main.c |   10 +++-------
 drivers/net/sky2.c             |   16 ++++++----------
 include/linux/skbuff.h         |   23 +++++++++++++++++++++++
 net/core/skbuff.c              |   20 ++++++++++++++++++++
 8 files changed, 61 insertions(+), 39 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4492,12 +4492,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
 			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
 					PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-			                   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
 			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
@@ -4705,7 +4701,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 			if (j < adapter->rx_ps_pages) {
 				if (likely(!ps_page->ps_page[j])) {
 					ps_page->ps_page[j] =
-						alloc_page(GFP_ATOMIC);
+						netdev_alloc_page(netdev);
 					if (unlikely(!ps_page->ps_page[j])) {
 						adapter->alloc_rx_buff_failed++;
 						goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -824,6 +824,9 @@ static inline void skb_fill_page_desc(st
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
@@ -1238,6 +1241,26 @@ static inline struct sk_buff *netdev_all
 	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ *	netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *
+ * 	Allocate a new page node local to the specified device.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+	return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+	__free_page(page);
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,26 @@ struct sk_buff *__netdev_alloc_skb(struc
 	return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	struct page *page;
+
+	page = alloc_pages_node(node, gfp_mask, 0);
+	return page;
+}
+EXPORT_SYMBOL(__netdev_alloc_page);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+		int size)
+{
+	skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+EXPORT_SYMBOL(skb_add_rx_frag);
+
 /**
  *	dev_alloc_skb - allocate an skbuff for receiving
  *	@length: length to allocate
Index: linux-2.6/drivers/net/sky2.c
===================================================================
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1343,7 +1343,7 @@ static struct sk_buff *sky2_rx_alloc(str
 	}
 
 	for (i = 0; i < sky2->rx_nfrags; i++) {
-		struct page *page = alloc_page(GFP_ATOMIC);
+		struct page *page = netdev_alloc_page(sky2->netdev);
 
 		if (!page)
 			goto free_partial;
@@ -2217,8 +2217,8 @@ static struct sk_buff *receive_copy(stru
 }
 
 /* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
-			  unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+			  unsigned int hdr_space, unsigned int length)
 {
 	int i, num_frags;
 	unsigned int size;
@@ -2235,15 +2235,11 @@ static void skb_put_frags(struct sk_buff
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			netdev_free_page(sky2->netdev, frag->page);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
-
-			frag->size = size;
-			skb->data_len += size;
-			skb->truesize += size;
-			skb->len += size;
+			skb_add_rx_frag(skb, i, frag->page, 0, size);
 			length -= size;
 		}
 	}
@@ -2270,7 +2266,7 @@ static struct sk_buff *receive_new(struc
 	sky2_rx_map_skb(sky2->hw->pdev, re, hdr_space);
 
 	if (skb_shinfo(skb)->nr_frags)
-		skb_put_frags(skb, hdr_space, length);
+		skb_put_frags(sky2, skb, hdr_space, length);
 	else
 		skb_put(skb, length);
 	return skb;
Index: linux-2.6/drivers/net/bnx2.c
===================================================================
--- linux-2.6.orig/drivers/net/bnx2.c
+++ linux-2.6/drivers/net/bnx2.c
@@ -2466,7 +2466,7 @@ bnx2_alloc_rx_page(struct bnx2 *bp, stru
 	struct sw_pg *rx_pg = &rxr->rx_pg_ring[index];
 	struct rx_bd *rxbd =
 		&rxr->rx_pg_desc_ring[RX_RING(index)][RX_IDX(index)];
-	struct page *page = alloc_page(GFP_ATOMIC);
+	struct page *page = netdev_alloc_page(bp->dev);
 
 	if (!page)
 		return -ENOMEM;
@@ -2491,7 +2491,7 @@ bnx2_free_rx_page(struct bnx2 *bp, struc
 	pci_unmap_page(bp->pdev, pci_unmap_addr(rx_pg, mapping), PAGE_SIZE,
 		       PCI_DMA_FROMDEVICE);
 
-	__free_page(page);
+	netdev_free_page(bp->dev, page);
 	rx_pg->page = NULL;
 }
 
@@ -2816,9 +2816,7 @@ bnx2_rx_skb(struct bnx2 *bp, struct bnx2
 			}
 
 			frag_size -= frag_len;
-			skb->data_len += frag_len;
-			skb->truesize += frag_len;
-			skb->len += frag_len;
+			skb_add_rx_frag(skb, i, rx_pg->page, 0, frag_len);
 
 			pg_prod = NEXT_RX_BD(pg_prod);
 			pg_cons = RX_PG_RING_IDX(NEXT_RX_BD(pg_cons));
Index: linux-2.6/drivers/net/e1000e/netdev.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000e/netdev.c
+++ linux-2.6/drivers/net/e1000e/netdev.c
@@ -257,7 +257,7 @@ static void e1000_alloc_rx_buffers_ps(st
 				continue;
 			}
 			if (!ps_page->page) {
-				ps_page->page = alloc_page(GFP_ATOMIC);
+				ps_page->page = netdev_alloc_page(netdev);
 				if (!ps_page->page) {
 					adapter->alloc_rx_buff_failed++;
 					goto no_buffers;
@@ -816,11 +816,8 @@ static bool e1000_clean_rx_irq_ps(struct
 			pci_unmap_page(pdev, ps_page->dma, PAGE_SIZE,
 				       PCI_DMA_FROMDEVICE);
 			ps_page->dma = 0;
-			skb_fill_page_desc(skb, j, ps_page->page, 0, length);
+			skb_add_rx_frag(skb, j, ps_page->page, 0, length);
 			ps_page->page = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 copydone:
Index: linux-2.6/drivers/net/igb/igb_main.c
===================================================================
--- linux-2.6.orig/drivers/net/igb/igb_main.c
+++ linux-2.6/drivers/net/igb/igb_main.c
@@ -3526,13 +3526,9 @@ static bool igb_clean_rx_irq_adv(struct 
 			pci_unmap_page(pdev, buffer_info->page_dma,
 				PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			buffer_info->page_dma = 0;
-			skb_fill_page_desc(skb, j, buffer_info->page,
-						0, length);
+			skb_add_rx_frag(skb, j, buffer_info->page, 0, length);
 			buffer_info->page = NULL;
 
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 			rx_desc->wb.upper.status_error = 0;
 			if (staterr & E1000_RXD_STAT_EOP)
 				break;
@@ -3634,7 +3630,7 @@ static void igb_alloc_rx_buffers_adv(str
 		rx_desc = E1000_RX_DESC_ADV(*rx_ring, i);
 
 		if (adapter->rx_ps_hdr_size && !buffer_info->page) {
-			buffer_info->page = alloc_page(GFP_ATOMIC);
+			buffer_info->page = netdev_alloc_page(netdev);
 			if (!buffer_info->page) {
 				adapter->alloc_rx_buff_failed++;
 				goto no_buffers;
Index: linux-2.6/drivers/net/ixgbe/ixgbe_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ixgbe/ixgbe_main.c
+++ linux-2.6/drivers/net/ixgbe/ixgbe_main.c
@@ -486,7 +486,7 @@ static void ixgbe_alloc_rx_buffers(struc
 
 		if (!rx_buffer_info->page &&
 				(adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
-			rx_buffer_info->page = alloc_page(GFP_ATOMIC);
+			rx_buffer_info->page = netdev_alloc_page(netdev);
 			if (!rx_buffer_info->page) {
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
@@ -607,13 +607,9 @@ static bool ixgbe_clean_rx_irq(struct ix
 			pci_unmap_page(pdev, rx_buffer_info->page_dma,
 				       PAGE_SIZE, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
-			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
-					   rx_buffer_info->page, 0, upper_len);
+			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+					rx_buffer_info->page, 0, upper_len);
 			rx_buffer_info->page = NULL;
-
-			skb->len += upper_len;
-			skb->data_len += upper_len;
-			skb->truesize += upper_len;
 		}
 
 		i++;

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 16/30] net: sk_allocation() - concentrate socket related allocations
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 15/30] net: packet split receive api Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:00 ` [PATCH 17/30] netvm: network reserve infrastructure Peter Zijlstra
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: net-sk_allocation.patch --]
[-- Type: text/plain, Size: 5244 bytes --]

Introduce sk_allocation(); this function allows sock-specific flags to be
injected into each sock-related allocation.
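
In this patch sk_allocation() simply returns gfp_mask; a hedged sketch of how
it could later be specialised per socket (hypothetical, shown only to motivate
the call-site changes below):

/* A possible future body of sk_allocation() -- not what this patch adds. */
static inline gfp_t sk_allocation_sketch(struct sock *sk, gfp_t gfp_mask)
{
	if (sk && sock_flag(sk, SOCK_MEMALLOC))	/* hypothetical socket flag */
		gfp_mask |= __GFP_MEMALLOC;
	return gfp_mask;
}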

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h    |    5 +++++
 net/ipv4/tcp.c        |    3 ++-
 net/ipv4/tcp_output.c |   12 +++++++-----
 net/ipv6/tcp_ipv6.c   |   17 ++++++++++++-----
 4 files changed, 26 insertions(+), 11 deletions(-)

Index: linux-2.6/net/ipv4/tcp_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2088,7 +2088,8 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk_allocation(sk, GFP_KERNEL));
 			if (skb)
 				break;
 			yield();
@@ -2114,7 +2115,7 @@ void tcp_send_active_reset(struct sock *
 	struct sk_buff *skb;
 
 	/* NOTE: No TCP options attached and we never retransmit this. */
-	skb = alloc_skb(MAX_TCP_HEADER, priority);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
 	if (!skb) {
 		NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
 		return;
@@ -2183,7 +2184,8 @@ struct sk_buff *tcp_make_synack(struct s
 	__u8 *md5_hash_location;
 #endif
 
-	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+			sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return NULL;
 
@@ -2441,7 +2443,7 @@ void tcp_send_ack(struct sock *sk)
 	 * tcp_transmit_skb() will set the ownership to this
 	 * sock.
 	 */
-	buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL) {
 		inet_csk_schedule_ack(sk);
 		inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2476,7 +2478,7 @@ static int tcp_xmit_probe_skb(struct soc
 	struct sk_buff *skb;
 
 	/* We don't queue it, tcp_transmit_skb() sets ownership. */
-	skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return -1;
 
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -435,6 +435,11 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+	return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -580,7 +580,8 @@ static int tcp_v6_md5_do_add(struct sock
 	} else {
 		/* reallocate new list if current one is full. */
 		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
 			if (!tp->md5sig_info) {
 				kfree(newkey);
 				return -ENOMEM;
@@ -593,7 +594,8 @@ static int tcp_v6_md5_do_add(struct sock
 		}
 		if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
 			keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-				       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+				       (tp->md5sig_info->entries6 + 1)),
+				       sk_allocation(sk, GFP_ATOMIC));
 
 			if (!keys) {
 				tcp_free_md5sig_pool();
@@ -717,7 +719,8 @@ static int tcp_v6_parse_md5_keys (struct
 		struct tcp_sock *tp = tcp_sk(sk);
 		struct tcp_md5sig_info *p;
 
-		p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+		p = kzalloc(sizeof(struct tcp_md5sig_info),
+				   sk_allocation(sk, GFP_KERNEL));
 		if (!p)
 			return -ENOMEM;
 
@@ -920,6 +923,7 @@ static void tcp_v6_send_reset(struct soc
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key *key;
 #endif
+	gfp_t gfp_mask = GFP_ATOMIC;
 
 	if (th->rst)
 		return;
@@ -937,13 +941,16 @@ static void tcp_v6_send_reset(struct soc
 		tot_len += TCPOLEN_MD5SIG_ALIGNED;
 #endif
 
+	if (sk)
+		gfp_mask = sk_allocation(skb->sk, gfp_mask);
+
 	/*
 	 * We need to grab some memory, and put together an RST,
 	 * and then put it into the queue to be sent.
 	 */
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL)
 		return;
 
@@ -1032,7 +1039,7 @@ static void tcp_v6_send_ack(struct sk_bu
 #endif
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 gfp_mask);
 	if (buff == NULL)
 		return;
 
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -637,7 +637,8 @@ struct sk_buff *sk_stream_alloc_skb(stru
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);
 
-	skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+	skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+			       sk_allocation(sk, gfp));
 	if (skb) {
 		if (sk_wmem_schedule(sk, skb->truesize)) {
 			/*

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 17/30] netvm: network reserve infrastructure
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 16/30] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
@ 2008-07-24 14:00 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 18/30] netvm: INET reserves Peter Zijlstra
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm-reserve.patch --]
[-- Type: text/plain, Size: 7327 bytes --]

Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed bounded, since it is the upper bound of
memory that can be used for sending pages (not quite true, but good enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (i.e. to swap over). They
must be handled kernel side; exposing such a socket to user-space is a BUG.
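
A rough sketch of how a kernel-side consumer (for example the socket backing
the swap transport, wired up later in this series) is expected to use this
API; the surrounding error handling is an assumption, only the two calls come
from this patch:

  /* mark the socket as servicing the VM; connects the reserve tree */
  err = sk_set_memalloc(sk);
  if (err < 0)
          return err;

  /* ... use the socket from kernel context only ... */

  /* on teardown, clear the flag and drop the reserve again */
  sk_clear_memalloc(sk);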

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |   43 ++++++++++++++++++++-
 net/Kconfig        |    3 +
 net/core/sock.c    |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -51,6 +51,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 
@@ -406,6 +407,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -428,9 +430,48 @@ static inline int sock_flag(struct sock 
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+#ifdef CONFIG_NETVM
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to bounce it. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern int memalloc_socks;
+
+static inline int sk_memalloc_socks(void)
+{
+	return memalloc_socks;
+}
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+#else
+static inline int sk_memalloc_socks(void)
+{
+	return 0;
+}
+
+static inline int sk_clear_memalloc(struct sock *sk)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-	return gfp_mask;
+	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/reserve.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -213,6 +214,105 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+struct mem_reserve net_skb_reserve;
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+#ifdef CONFIG_NETVM
+static DEFINE_MUTEX(memalloc_socks_lock);
+int memalloc_socks;
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_MEMALLOC sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+	int err;
+
+	mutex_lock(&memalloc_socks_lock);
+	err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
+	if (err)
+		goto unlock;
+
+	/*
+	 * either socks is positive and we need to check for 0 -> !0
+	 * transition and connect the reserve tree when we observe it.
+	 */
+	if (!memalloc_socks && socks > 0) {
+		err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
+		if (err) {
+			/*
+			 * if we failed to connect the tree, undo the tx
+			 * reserve so that failure has no side effects.
+			 */
+			mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
+			goto unlock;
+		}
+	}
+	memalloc_socks += socks;
+	/*
+	 * or socks is negative and we must observe the !0 -> 0 transition
+	 * and disconnect the reserve tree.
+	 */
+	if (!memalloc_socks && socks)
+		mem_reserve_disconnect(&net_reserve);
+
+unlock:
+	mutex_unlock(&memalloc_socks_lock);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ *	sk_set_memalloc - sets %SOCK_MEMALLOC
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+
+	if (!set) {
+		int err = sk_adjust_memalloc(1, 0);
+		if (err)
+			return err;
+
+		sock_set_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation |= __GFP_MEMALLOC;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+int sk_clear_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation &= ~__GFP_MEMALLOC;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -968,6 +1068,7 @@ void sk_free(struct sock *sk)
 {
 	struct sk_filter *filter;
 
+	sk_clear_memalloc(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
@@ -1095,6 +1196,12 @@ void __init sk_init(void)
 		sysctl_wmem_max = 131071;
 		sysctl_rmem_max = 131071;
 	}
+
+	mem_reserve_init(&net_reserve, "total network reserve", NULL);
+	mem_reserve_init(&net_rx_reserve, "network RX reserve", &net_reserve);
+	mem_reserve_init(&net_skb_reserve, "SKB data reserve", &net_rx_reserve);
+	mem_reserve_init(&net_tx_reserve, "network TX reserve", &net_reserve);
+	mem_reserve_init(&net_tx_pages, "protocol TX pages", &net_tx_reserve);
 }
 
 /*
Index: linux-2.6/net/Kconfig
===================================================================
--- linux-2.6.orig/net/Kconfig
+++ linux-2.6/net/Kconfig
@@ -250,6 +250,9 @@ endmenu
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config NETVM
+	def_bool n
+
 endif   # if NET
 endmenu # Networking
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 18/30] netvm: INET reserves.
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2008-07-24 14:00 ` [PATCH 17/30] netvm: network reserve infrastructure Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-10-01 11:38   ` Daniel Lezcano
  2008-07-24 14:01 ` [PATCH 19/30] netvm: hook skb allocation to reserves Peter Zijlstra
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm-reserve-inet.patch --]
[-- Type: text/plain, Size: 16501 bytes --]

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under the generic RX reserve; its usage is bounded by
the high reclaim watermark and thus does not need further accounting.

Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can receive at least as much data as fits in
the reassembly line we avoid fragment-attack deadlocks.

Adds to the reserve tree:

  total network reserve      
    network TX reserve       
      protocol TX pages      
    network RX reserve       
+     IPv6 route cache       
+     IPv4 route cache       
      SKB data reserve       
+       IPv6 fragment cache  
+       IPv4 fragment cache  
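
The cache limits are exposed through sysctls, so the high_thresh / max_size
handlers are wrapped to resize the matching reserve before publishing the new
limit; the pattern, condensed from the fragment-cache handlers below (the
route-cache handlers do the same via mem_reserve_kmem_cache_set(); lock,
reserve and high_thresh stand for the respective per-cache fields):

  mutex_lock(&lock);
  if (write)
          table->data = &new_bytes;               /* write into a temporary */

  ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);

  if (!ret && write) {
          ret = mem_reserve_kmalloc_set(&reserve, new_bytes);
          if (!ret)
                  high_thresh = new_bytes;        /* commit only on success */
  }

  if (write)
          table->data = &high_thresh;             /* restore the table */
  mutex_unlock(&lock);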

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/inet_frag.h  |    7 +++
 include/net/netns/ipv6.h |    4 ++
 net/ipv4/inet_fragment.c |    3 +
 net/ipv4/ip_fragment.c   |   89 +++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/route.c         |   72 +++++++++++++++++++++++++++++++++++++-
 net/ipv6/af_inet6.c      |   20 +++++++++-
 net/ipv6/reassembly.c    |   88 +++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv6/route.c         |   66 ++++++++++++++++++++++++++++++++++
 8 files changed, 341 insertions(+), 8 deletions(-)

Index: linux-2.6/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_fragment.c
+++ linux-2.6/net/ipv4/ip_fragment.c
@@ -42,6 +42,8 @@
 #include <linux/udp.h>
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
+#include <linux/reserve.h>
+#include <linux/nsproxy.h>
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -596,6 +598,66 @@ int ip_defrag(struct sk_buff *skb, u32 u
 }
 
 #ifdef CONFIG_SYSCTL
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv4.frags.lock);
+
+	if (write)
+		table->data = &new_bytes;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+				new_bytes);
+		if (!ret)
+			net->ipv4.frags.high_thresh = new_bytes;
+	}
+
+	if (write)
+		table->data = &net->ipv4.frags.high_thresh;
+
+	mutex_unlock(&net->ipv4.frags.lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int write = (newval && newlen);
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv4.frags.lock);
+
+	if (write)
+		table->data = &new_bytes;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+				new_bytes);
+		if (!ret)
+			net->ipv4.frags.high_thresh = new_bytes;
+	}
+
+	if (write)
+		table->data = &net->ipv4.frags.high_thresh;
+
+	mutex_unlock(&net->ipv4.frags.lock);
+
+	return ret;
+}
+
 static int zero;
 
 static struct ctl_table ip4_frags_ns_ctl_table[] = {
@@ -605,7 +667,8 @@ static struct ctl_table ip4_frags_ns_ctl
 		.data		= &init_net.ipv4.frags.high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
@@ -708,6 +771,8 @@ static inline void ip4_frags_ctl_registe
 
 static int ipv4_frags_init_net(struct net *net)
 {
+	int ret;
+
 	/*
 	 * Fragment cache limits. We will commit 256K at one time. Should we
 	 * cross that limit we will prune down to 192K. This should cope with
@@ -725,11 +790,31 @@ static int ipv4_frags_init_net(struct ne
 
 	inet_frags_init_net(&net->ipv4.frags);
 
-	return ip4_frags_ns_ctl_register(net);
+	ret = ip4_frags_ns_ctl_register(net);
+	if (ret)
+		goto out_reg;
+
+	mem_reserve_init(&net->ipv4.frags.reserve, "IPv4 fragment cache",
+			&net_skb_reserve);
+	ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+			net->ipv4.frags.high_thresh);
+	if (ret)
+		goto out_reserve;
+
+	return 0;
+
+out_reserve:
+	mem_reserve_disconnect(&net->ipv4.frags.reserve);
+	ip4_frags_ns_ctl_unregister(net);
+out_reg:
+	inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+
+	return ret;
 }
 
 static void ipv4_frags_exit_net(struct net *net)
 {
+	mem_reserve_disconnect(&net->ipv4.frags.reserve);
 	ip4_frags_ns_ctl_unregister(net);
 	inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
 }
Index: linux-2.6/net/ipv6/reassembly.c
===================================================================
--- linux-2.6.orig/net/ipv6/reassembly.c
+++ linux-2.6/net/ipv6/reassembly.c
@@ -41,6 +41,7 @@
 #include <linux/random.h>
 #include <linux/jhash.h>
 #include <linux/skbuff.h>
+#include <linux/reserve.h>
 
 #include <net/sock.h>
 #include <net/snmp.h>
@@ -632,6 +633,66 @@ static struct inet6_protocol frag_protoc
 };
 
 #ifdef CONFIG_SYSCTL
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv6.frags.lock);
+
+	if (write)
+		table->data = &new_bytes;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+					      new_bytes);
+		if (!ret)
+			net->ipv6.frags.high_thresh = new_bytes;
+	}
+
+	if (write)
+		table->data = &net->ipv6.frags.high_thresh;
+
+	mutex_unlock(&net->ipv6.frags.lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int write = (newval && newlen);
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv6.frags.lock);
+
+	if (write)
+		table->data = &new_bytes;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+					      new_bytes);
+		if (!ret)
+			net->ipv6.frags.high_thresh = new_bytes;
+	}
+
+	if (write)
+		table->data = &net->ipv6.frags.high_thresh;
+
+	mutex_unlock(&net->ipv6.frags.lock);
+
+	return ret;
+}
+
 static struct ctl_table ip6_frags_ns_ctl_table[] = {
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_HIGH_THRESH,
@@ -639,7 +700,8 @@ static struct ctl_table ip6_frags_ns_ctl
 		.data		= &init_net.ipv6.frags.high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
@@ -748,17 +810,39 @@ static inline void ip6_frags_sysctl_unre
 
 static int ipv6_frags_init_net(struct net *net)
 {
+	int ret;
+
 	net->ipv6.frags.high_thresh = 256 * 1024;
 	net->ipv6.frags.low_thresh = 192 * 1024;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
 
 	inet_frags_init_net(&net->ipv6.frags);
 
-	return ip6_frags_ns_sysctl_register(net);
+	ret = ip6_frags_sysctl_register(net);
+	if (ret)
+		goto out_reg;
+
+	mem_reserve_init(&net->ipv6.frags.reserve, "IPv6 fragment cache",
+			 &net_skb_reserve);
+	ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+				      net->ipv6.frags.high_thresh);
+	if (ret)
+		goto out_reserve;
+
+	return 0;
+
+out_reserve:
+	mem_reserve_disconnect(&net->ipv6.frags.reserve);
+	ip6_frags_sysctl_unregister(net);
+out_reg:
+	inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+
+	return ret;
 }
 
 static void ipv6_frags_exit_net(struct net *net)
 {
+	mem_reserve_disconnect(&net->ipv6.frags.reserve);
 	ip6_frags_ns_sysctl_unregister(net);
 	inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
 }
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -107,6 +107,7 @@
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
+#include <linux/reserve.h>
 
 #define RT_FL_TOS(oldflp) \
     ((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
@@ -2828,7 +2829,10 @@ void ip_rt_multicast_event(struct in_dev
 	rt_cache_flush(0);
 }
 
+static struct mem_reserve ipv4_route_reserve;
+
 #ifdef CONFIG_SYSCTL
+static struct mutex ipv4_route_lock;
 static int flush_delay;
 
 static int ipv4_sysctl_rtcache_flush(ctl_table *ctl, int write,
@@ -2861,6 +2865,64 @@ static int ipv4_sysctl_rtcache_flush_str
 	return 0;
 }
 
+static int proc_dointvec_route(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int new_size, ret;
+
+	mutex_lock(&ipv4_route_lock);
+
+	if (write)
+		table->data = &new_size;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			ip_rt_max_size = new_size;
+	}
+
+	if (write)
+		table->data = &ip_rt_max_size;
+
+	mutex_unlock(&ipv4_route_lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int write = (newval && newlen);
+	int new_size, ret;
+
+	mutex_lock(&ipv4_route_lock);
+
+	if (write)
+		table->data = &new_size;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			ip_rt_max_size = new_size;
+	}
+
+	if (write)
+		table->data = &ip_rt_max_size;
+
+	mutex_unlock(&ipv4_route_lock);
+
+	return ret;
+}
+
 ctl_table ipv4_route_table[] = {
 	{
 		.ctl_name 	= NET_IPV4_ROUTE_FLUSH,
@@ -2885,7 +2947,8 @@ ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_dointvec_route,
+		.strategy	= &sysctl_intvec_route,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3060,6 +3123,13 @@ int __init ip_rt_init(void)
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
 
+	mutex_init(&ipv4_route_lock);
+
+	mem_reserve_init(&ipv4_route_reserve, "IPv4 route cache",
+			&net_rx_reserve);
+	mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+			ipv4_dst_ops.kmem_cachep, ip_rt_max_size);
+
 	devinet_init();
 	ip_fib_init();
 
Index: linux-2.6/net/ipv6/route.c
===================================================================
--- linux-2.6.orig/net/ipv6/route.c
+++ linux-2.6/net/ipv6/route.c
@@ -37,6 +37,7 @@
 #include <linux/mroute6.h>
 #include <linux/init.h>
 #include <linux/if_arp.h>
+#include <linux/reserve.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/nsproxy.h>
@@ -2498,6 +2499,66 @@ int ipv6_sysctl_rtcache_flush(ctl_table 
 		return -EINVAL;
 }
 
+static int proc_dointvec_route(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int new_size, ret;
+
+	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	if (write)
+		table->data = &new_size;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+				net->ipv6.ip6_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			net->ipv6.sysctl.ip6_rt_max_size = new_size;
+	}
+
+	if (write)
+		table->data = &net->ipv6.sysctl.ip6_rt_max_size;
+
+	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		int __user *name, int nlen,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int write = (newval && newlen);
+	int new_size, ret;
+
+	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	if (write)
+		table->data = &new_size;
+
+	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+				net->ipv6.ip6_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			net->ipv6.sysctl.ip6_rt_max_size = new_size;
+	}
+
+	if (write)
+		table->data = &net->ipv6.sysctl.ip6_rt_max_size;
+
+	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	return ret;
+}
+
 ctl_table ipv6_route_table_template[] = {
 	{
 		.procname	=	"flush",
@@ -2520,7 +2581,8 @@ ctl_table ipv6_route_table_template[] = 
 		.data		=	&init_net.ipv6.sysctl.ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-		.proc_handler	=	&proc_dointvec,
+		.proc_handler	=	&proc_dointvec_route,
+		.strategy	= 	&sysctl_intvec_route,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2608,6 +2670,8 @@ struct ctl_table *ipv6_route_sysctl_init
 		table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
 	}
 
+	mutex_init(&net->ipv6.sysctl.ip6_rt_lock);
+
 	return table;
 }
 #endif
Index: linux-2.6/include/net/inet_frag.h
===================================================================
--- linux-2.6.orig/include/net/inet_frag.h
+++ linux-2.6/include/net/inet_frag.h
@@ -1,6 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+
 struct netns_frags {
 	int			nqueues;
 	atomic_t		mem;
@@ -10,6 +13,10 @@ struct netns_frags {
 	int			timeout;
 	int			high_thresh;
 	int			low_thresh;
+
+	/* reserves */
+	struct mutex		lock;
+	struct mem_reserve	reserve;
 };
 
 struct inet_frag_queue {
Index: linux-2.6/net/ipv4/inet_fragment.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_fragment.c
+++ linux-2.6/net/ipv4/inet_fragment.c
@@ -19,6 +19,7 @@
 #include <linux/random.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <linux/reserve.h>
 
 #include <net/inet_frag.h>
 
@@ -74,6 +75,8 @@ void inet_frags_init_net(struct netns_fr
 	nf->nqueues = 0;
 	atomic_set(&nf->mem, 0);
 	INIT_LIST_HEAD(&nf->lru_list);
+	mutex_init(&nf->lock);
+	mem_reserve_init(&nf->reserve, "IP fragement cache", NULL);
 }
 EXPORT_SYMBOL(inet_frags_init_net);
 
Index: linux-2.6/include/net/netns/ipv6.h
===================================================================
--- linux-2.6.orig/include/net/netns/ipv6.h
+++ linux-2.6/include/net/netns/ipv6.h
@@ -24,6 +24,8 @@ struct netns_sysctl_ipv6 {
 	int ip6_rt_mtu_expires;
 	int ip6_rt_min_advmss;
 	int icmpv6_time;
+
+	struct mutex ip6_rt_lock;
 };
 
 struct netns_ipv6 {
@@ -55,5 +57,7 @@ struct netns_ipv6 {
 	struct sock             *ndisc_sk;
 	struct sock             *tcp_sk;
 	struct sock             *igmp_sk;
+
+	struct mem_reserve	ip6_rt_reserve;
 };
 #endif
Index: linux-2.6/net/ipv6/af_inet6.c
===================================================================
--- linux-2.6.orig/net/ipv6/af_inet6.c
+++ linux-2.6/net/ipv6/af_inet6.c
@@ -851,6 +851,20 @@ static int inet6_net_init(struct net *ne
 	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
 	net->ipv6.sysctl.icmpv6_time = 1*HZ;
 
+	mem_reserve_init(&net->ipv6.ip6_rt_reserve, "IPv6 route cache",
+			 &net_rx_reserve);
+	/*
+	 * XXX: requires that net->ipv6.ip6_dst_ops is already set-up
+	 *      but afaict it's impossible to order the various
+	 *      pernet_subsys calls so that this one is done after
+	 *      ip6_route_net_init().
+	 */
+	err = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+			net->ipv6.ip6_dst_ops.kmem_cachep,
+			net->ipv6.sysctl.ip6_rt_max_size);
+	if (err)
+		goto reserve_fail;
+
 #ifdef CONFIG_PROC_FS
 	err = udp6_proc_init(net);
 	if (err)
@@ -861,8 +875,8 @@ static int inet6_net_init(struct net *ne
 	err = ac6_proc_init(net);
 	if (err)
 		goto proc_ac6_fail;
-out:
 #endif
+out:
 	return err;
 
 #ifdef CONFIG_PROC_FS
@@ -870,8 +884,10 @@ proc_ac6_fail:
 	tcp6_proc_exit(net);
 proc_tcp6_fail:
 	udp6_proc_exit(net);
-	goto out;
 #endif
+reserve_fail:
+	mem_reserve_disconnect(&net->ipv6.ip6_rt_reserve);
+	goto out;
 }
 
 static void inet6_net_exit(struct net *net)

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 19/30] netvm: hook skb allocation to reserves
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 18/30] netvm: INET reserves Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 20/30] netvm: filter emergency skbs Peter Zijlstra
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm-skbuff-reserve.patch --]
[-- Type: text/plain, Size: 13721 bytes --]

Change the skb allocation API to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref. 

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.
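
Condensed from the hunks below, the RX allocation path now looks roughly like
this: callers pass SKB_ALLOC_RX, the data buffer may dip into the reserve, and
the skb remembers whether it did so the free path can return the charge:

  int emergency = 0;

  if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
          gfp_mask |= __GFP_MEMALLOC;

  data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
                         gfp_mask, node, &net_skb_reserve, &emergency);
  ...
  skb->emergency = emergency;   /* consulted by skb_release_data() and friends */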

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm_types.h |    1 
 include/linux/skbuff.h   |   25 +++++++--
 net/core/skbuff.c        |  129 +++++++++++++++++++++++++++++++++++++----------
 3 files changed, 124 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -317,7 +317,8 @@ struct sk_buff {
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif
-	/* 14 bit hole */
+	__u8 			emergency:1;
+	/* 13 bit hole */
 
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
@@ -348,10 +349,22 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+	return unlikely(skb->emergency);
+#else
+	return false;
+#endif
+}
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -361,7 +374,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
@@ -1211,7 +1224,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -1242,6 +1256,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1258,7 +1273,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,23 +179,29 @@ EXPORT_SYMBOL(skb_truesize_bug);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
+	int memalloc = sk_memalloc_socks();
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+
+	if (memalloc && (flags & SKB_ALLOC_RX))
+		gfp_mask |= __GFP_MEMALLOC;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
 		goto out;
 
-	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+	data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+			gfp_mask, node, &net_skb_reserve, &emergency);
 	if (!data)
 		goto nodata;
 
@@ -205,6 +211,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	 * the tail pointer in struct sk_buff!
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->emergency = emergency;
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -221,7 +228,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -229,6 +236,7 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->emergency = skb->emergency;
 	}
 out:
 	return skb;
@@ -257,7 +265,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -270,11 +278,19 @@ struct page *__netdev_alloc_page(struct 
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
-	page = alloc_pages_node(node, gfp_mask, 0);
+	page = alloc_pages_reserve(node, gfp_mask | __GFP_MEMALLOC, 0,
+			&net_skb_reserve, NULL);
+
 	return page;
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	free_pages_reserve(page, 0, &net_skb_reserve, page->reserve);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -282,6 +298,27 @@ void skb_add_rx_frag(struct sk_buff *skb
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+	/*
+	 * In the rare case that skb_emergency() != page->reserved we'll
+	 * skew the accounting slightly, but since it's only a 'small' constant
+	 * shift it's ok.
+	 */
+	if (skb_emergency(skb)) {
+		/*
+		 * We need to track fragment pages so that we properly
+		 * release their reserve in skb_put_page().
+		 */
+		atomic_set(&page->frag_count, 1);
+	} else if (unlikely(page->reserve)) {
+		/*
+		 * Release the reserve now, because normal skbs don't
+		 * do the emergency accounting.
+		 */
+		mem_reserve_pages_charge(&net_skb_reserve, -1);
+	}
+#endif
 }
 EXPORT_SYMBOL(skb_add_rx_frag);
 
@@ -333,21 +370,38 @@ static void skb_clone_fraglist(struct sk
 		skb_get(list);
 }
 
+static void skb_get_page(struct sk_buff *skb, struct page *page)
+{
+	get_page(page);
+	if (skb_emergency(skb))
+		atomic_inc(&page->frag_count);
+}
+
+static void skb_put_page(struct sk_buff *skb, struct page *page)
+{
+	if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
+		mem_reserve_pages_charge(&net_skb_reserve, -1);
+	put_page(page);
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
-			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+				skb_put_page(skb,
+					     skb_shinfo(skb)->frags[i].page);
+			}
 		}
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		kfree_reserve(skb->head, &net_skb_reserve, skb_emergency(skb));
 	}
 }
 
@@ -468,6 +522,7 @@ static void __copy_skb_header(struct sk_
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	new->ipvs_property	= old->ipvs_property;
 #endif
+	new->emergency		= old->emergency;
 	new->protocol		= old->protocol;
 	new->mark		= old->mark;
 	__nf_copy(new, old);
@@ -556,6 +611,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_MEMALLOC;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -587,6 +645,14 @@ static void copy_skb_header(struct sk_bu
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		return SKB_ALLOC_RX;
+
+	return 0;
+}
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -607,15 +673,17 @@ static void copy_skb_header(struct sk_bu
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb->data - skb->head;
+	int size;
 	/*
 	 *	Allocate the copy buffer
 	 */
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end + skb->data_len, gfp_mask);
+	size = skb->end + skb->data_len;
 #else
-	n = alloc_skb(skb->end - skb->head + skb->data_len, gfp_mask);
+	size = skb->end - skb->head + skb->data_len;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		return NULL;
 
@@ -650,12 +718,14 @@ struct sk_buff *pskb_copy(struct sk_buff
 	/*
 	 *	Allocate the copy buffer
 	 */
+	int size;
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end, gfp_mask);
+	size = skb->end;
 #else
-	n = alloc_skb(skb->end - skb->head, gfp_mask);
+	size = skb->end - skb->head;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		goto out;
 
@@ -674,8 +744,9 @@ struct sk_buff *pskb_copy(struct sk_buff
 		int i;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			skb_shinfo(n)->frags[i] = *frag;
+			skb_get_page(n, frag->page);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -723,7 +794,11 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (skb_emergency(skb))
+		gfp_mask |= __GFP_MEMALLOC;
+
+	data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+			gfp_mask, -1, &net_skb_reserve, NULL);
 	if (!data)
 		goto nodata;
 
@@ -738,7 +813,7 @@ int pskb_expand_head(struct sk_buff *skb
 	       sizeof(struct skb_shared_info));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+		skb_get_page(skb, skb_shinfo(skb)->frags[i].page);
 
 	if (skb_shinfo(skb)->frag_list)
 		skb_clone_fraglist(skb);
@@ -817,8 +892,8 @@ struct sk_buff *skb_copy_expand(const st
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+					gfp_mask, skb_alloc_rx_flag(skb), -1);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
 	int off;
@@ -1007,7 +1082,7 @@ drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
@@ -1176,7 +1251,7 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1956,6 +2031,7 @@ static inline void skb_split_no_header(s
 			skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
 
 			if (pos < len) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
 				/* Split frag.
 				 * We have two variants in this case:
 				 * 1. Move all the frag to the second
@@ -1964,7 +2040,7 @@ static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				skb_get_page(skb1, page);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -2294,7 +2370,8 @@ struct sk_buff *skb_segment(struct sk_bu
 		if (hsize > len || !sg)
 			hsize = len;
 
-		nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
+		nskb = __alloc_skb(hsize + doffset + headroom, GFP_ATOMIC,
+				   skb_alloc_rx_flag(skb), -1);
 		if (unlikely(!nskb))
 			goto err;
 
@@ -2339,7 +2416,7 @@ struct sk_buff *skb_segment(struct sk_bu
 			BUG_ON(i >= nfrags);
 
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			skb_get_page(nskb, frag->page);
 			size = frag->size;
 
 			if (pos < offset) {
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -74,6 +74,7 @@ struct page {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
 		int reserve;		/* page_alloc: page is a reserve page */
+		atomic_t frag_count;	/* skb fragment use count */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 20/30] netvm: filter emergency skbs.
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 19/30] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 21/30] netvm: prevent a stream specific deadlock Peter Zijlstra
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm-sk_filter.patch --]
[-- Type: text/plain, Size: 771 bytes --]

Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/core/filter.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/core/filter.c
===================================================================
--- linux-2.6.orig/net/core/filter.c
+++ linux-2.6/net/core/filter.c
@@ -81,6 +81,9 @@ int sk_filter(struct sock *sk, struct sk
 	int err;
 	struct sk_filter *filter;
 
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 21/30] netvm: prevent a stream specific deadlock
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 20/30] netvm: filter emergency skbs Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm-tcp-deadlock.patch --]
[-- Type: text/plain, Size: 3559 bytes --]

It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This would prevent SOCK_MEMALLOC sockets
from receiving data, which in turn would prevent userspace from running, which
is needed to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
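
The core of the fix, as seen in the sock.h hunk below: sk_rmem_schedule()
takes the skb itself so the limit can be waived for emergency skbs:

  static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
  {
          if (!sk_has_account(sk))
                  return 1;
          return skb->truesize <= sk->sk_forward_alloc ||
                  __sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
                  skb_emergency(skb);     /* never refuse an emergency skb */
  }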

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h   |    7 ++++---
 net/core/sock.c      |    2 +-
 net/ipv4/tcp_input.c |   12 ++++++------
 net/sctp/ulpevent.c  |    2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -788,12 +788,13 @@ static inline int sk_wmem_schedule(struc
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
-static inline int sk_rmem_schedule(struct sock *sk, int size)
+static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
-		__sk_mem_schedule(sk, size, SK_MEM_RECV);
+	return skb->truesize <= sk->sk_forward_alloc ||
+		__sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
+		skb_emergency(skb);
 }
 
 static inline void sk_mem_reclaim(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -383,7 +383,7 @@ int sock_queue_rcv_skb(struct sock *sk, 
 	if (err)
 		goto out;
 
-	if (!sk_rmem_schedule(sk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb)) {
 		err = -ENOBUFS;
 		goto out;
 	}
Index: linux-2.6/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_input.c
+++ linux-2.6/net/ipv4/tcp_input.c
@@ -3877,19 +3877,19 @@ static void tcp_ofo_queue(struct sock *s
 static int tcp_prune_ofo_queue(struct sock *sk);
 static int tcp_prune_queue(struct sock *sk);
 
-static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
+static inline int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, size)) {
+	    !sk_rmem_schedule(sk, skb)) {
 
 		if (tcp_prune_queue(sk) < 0)
 			return -1;
 
-		if (!sk_rmem_schedule(sk, size)) {
+		if (!sk_rmem_schedule(sk, skb)) {
 			if (!tcp_prune_ofo_queue(sk))
 				return -1;
 
-			if (!sk_rmem_schedule(sk, size))
+			if (!sk_rmem_schedule(sk, skb))
 				return -1;
 		}
 	}
@@ -3945,7 +3945,7 @@ static void tcp_data_queue(struct sock *
 		if (eaten <= 0) {
 queue_and_out:
 			if (eaten < 0 &&
-			    tcp_try_rmem_schedule(sk, skb->truesize))
+			    tcp_try_rmem_schedule(sk, skb))
 				goto drop;
 
 			skb_set_owner_r(skb, sk);
@@ -4016,7 +4016,7 @@ drop:
 
 	TCP_ECN_check_ce(tp, skb);
 
-	if (tcp_try_rmem_schedule(sk, skb->truesize))
+	if (tcp_try_rmem_schedule(sk, skb))
 		goto drop;
 
 	/* Disable header prediction. */
Index: linux-2.6/net/sctp/ulpevent.c
===================================================================
--- linux-2.6.orig/net/sctp/ulpevent.c
+++ linux-2.6/net/sctp/ulpevent.c
@@ -701,7 +701,7 @@ struct sctp_ulpevent *sctp_ulpevent_make
 	if (rx_count >= asoc->base.sk->sk_rcvbuf) {
 
 		if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
-		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb->truesize)))
+		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb)))
 			goto fail;
 	}
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 21/30] netvm: prevent a stream specific deadlock Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 23/30] netvm: skb processing Peter Zijlstra
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 836 bytes --]

Avoid memory getting stuck waiting for userspace: drop all emergency packets.
This of course requires the regular storage route not to include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/netfilter/core.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===================================================================
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
 		ret = 1;
 		goto unlock;
 	} else if (verdict == NF_DROP) {
+drop:
 		kfree_skb(skb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+		if (skb_emergency(*pskb))
+			goto drop;
 		if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
 			goto next_hook;

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 23/30] netvm: skb processing
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 24/30] mm: add support for non block device backed swap files Peter Zijlstra
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: netvm.patch --]
[-- Type: text/plain, Size: 4804 bytes --]

In order to make sure emergency packets receive all the memory needed to
proceed, ensure processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.
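
Condensed from the netif_receive_skb() hunk below, the receive path brackets
emergency skb processing in PF_MEMALLOC and restores the caller's flags
afterwards:

  unsigned long pflags = current->flags;

  if (skb_emergency(skb))
          current->flags |= PF_MEMALLOC;  /* poor man's memory pool */

  /* ... regular receive processing; taps are skipped for emergency skbs ... */

  tsk_restore_flags(current, pflags, PF_MEMALLOC);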

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/net/sock.h |    5 ++++
 net/core/dev.c     |   59 +++++++++++++++++++++++++++++++++++++++++++++++------
 net/core/sock.c    |   16 ++++++++++++++
 3 files changed, 74 insertions(+), 6 deletions(-)

Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2008,6 +2008,30 @@ out:
 }
 #endif
 
+/*
+ * Filter the protocols for which the reserves are adequate.
+ *
+ * Before adding a protocol make sure that it is either covered by the existing
+ * reserves, or add reserves covering the memory need of the new protocol's
+ * packet processing.
+ */
+static int skb_emergency_protocol(struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		switch (skb->protocol) {
+		case __constant_htons(ETH_P_ARP):
+		case __constant_htons(ETH_P_IP):
+		case __constant_htons(ETH_P_IPV6):
+		case __constant_htons(ETH_P_8021Q):
+			break;
+
+		default:
+			return 0;
+		}
+
+	return 1;
+}
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2029,10 +2053,23 @@ int netif_receive_skb(struct sk_buff *sk
 	struct net_device *orig_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
+
+	/* Emergency skb are special, they should
+	 *  - be delivered to SOCK_MEMALLOC sockets only
+	 *  - stay away from userspace
+	 *  - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as a poor man's memory pool - the grouping kind.
+	 * This saves us from propagating the allocation context down to all
+	 * allocation sites.
+	 */
+	if (skb_emergency(skb))
+		current->flags |= PF_MEMALLOC;
 
 	/* if we've gotten here through NAPI, check netpoll */
 	if (netpoll_receive_skb(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
@@ -2043,7 +2080,7 @@ int netif_receive_skb(struct sk_buff *sk
 	orig_dev = skb_bond(skb);
 
 	if (!orig_dev)
-		return NET_RX_DROP;
+		goto out;
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
@@ -2062,6 +2099,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (skb_emergency(skb))
+		goto skip_taps;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -2070,19 +2110,23 @@ int netif_receive_skb(struct sk_buff *sk
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 ncls:
 #endif
 
+	if (!skb_emergency_protocol(skb))
+		goto drop;
+
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
@@ -2098,6 +2142,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
 		 * me how you were going to use this. :-)
@@ -2105,8 +2150,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 	return ret;
 }
 
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -521,8 +521,13 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	if (skb_emergency(skb))
+		return __sk_backlog_rcv(sk, skb);
+
 	return sk->sk_backlog_rcv(sk, skb);
 }
 
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -313,6 +313,22 @@ int sk_clear_memalloc(struct sock *sk)
 	return set;
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+	unsigned long pflags = current->flags;
+
+	/* these should have been dropped before queueing */
+	BUG_ON(!sk_has_memalloc(sk));
+
+	current->flags |= PF_MEMALLOC;
+	ret = sk->sk_backlog_rcv(sk, skb);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+	return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
 #endif
 
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 24/30] mm: add support for non block device backed swap files
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 23/30] netvm: skb processing Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 25/30] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-swapfile.patch --]
[-- Type: text/plain, Size: 12786 bytes --]

New address_space_operations methods are added:
  int swapon(struct file *);
  int swapoff(struct file *);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When, during sys_swapon(), the ->swapon() method is found and returns no error,
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops and
make use of ->swap_{out,in}() to write/read swapcache pages.

The ->swapon() method will be used to communicate to the file that the VM
relies on it, and the address_space should take adequate measures (such as
reserving memory for mempools). The ->swapoff() method will be
called on sys_swapoff() when ->swapon() was found and returned no error.

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapon() so that
->swap_{out,in}() have instant access to it. It can be released on ->swapoff().

The reason to provide ->swap_{out,in}() instead of using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) provide a struct file * for credential context (normally not needed
    in the context of writepage, as the page content is normally dirtied
    using one of the following interfaces:
      write_{begin,end}()
      {prepare,commit}_write()
      page_mkwrite()
    which do have the file context).
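
For illustration only (not part of this patch): a minimal sketch of how a
filesystem could wire these methods up. All myfs_* names are hypothetical;
the read side simply reuses an existing ->readpage style implementation, as
NFS does later in this series.

static int myfs_swapon(struct file *file)
{
	/* pin whatever state ->swap_{out,in}() will need, e.g. the block map */
	return 0;
}

static int myfs_swapoff(struct file *file)
{
	/* release the state pinned in ->swapon() */
	return 0;
}

static int myfs_swap_out(struct file *file, struct page *page,
			 struct writeback_control *wbc)
{
	/* like ->writepage(), but with a struct file for credentials;
	 * must unlock the page (see the Locking changes below) */
	unlock_page(page);
	return 0;
}

static const struct address_space_operations myfs_aops = {
	/* ... the usual methods ... */
	.swapon		= myfs_swapon,
	.swapoff	= myfs_swapoff,
	.swap_out	= myfs_swap_out,
	.swap_in	= myfs_readpage,	/* assumed existing readpage */
};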

[miklos@szeredi.hu: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/filesystems/Locking |   22 ++++++++++++++++
 Documentation/filesystems/vfs.txt |   18 +++++++++++++
 include/linux/buffer_head.h       |    2 -
 include/linux/fs.h                |    9 ++++++
 include/linux/swap.h              |    4 ++
 mm/page_io.c                      |   52 ++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c                   |    4 +-
 mm/swapfile.c                     |   32 +++++++++++++++++++++--
 8 files changed, 137 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -121,6 +121,7 @@ enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
 	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
 	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -274,6 +275,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
@@ -306,6 +309,7 @@ extern unsigned int count_swap_pages(int
 extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
 extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
+extern struct swap_info_struct *page_swap_info(struct page *);
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 extern int remove_exclusive_swap_page_ref(struct page *);
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -97,11 +98,23 @@ int swap_writepage(struct page *page, st
 {
 	struct bio *bio;
 	int ret = 0, rw = WRITE;
+	struct swap_info_struct *sis = page_swap_info(page);
 
 	if (remove_exclusive_swap_page(page)) {
 		unlock_page(page);
 		goto out;
 	}
+
+	if (sis->flags & SWP_FILE) {
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_out(swap_file, page, wbc);
+		if (!ret)
+			count_vm_event(PSWPOUT);
+		return ret;
+	}
+
 	bio = get_swap_bio(GFP_NOIO, page_private(page), page,
 				end_swap_bio_write);
 	if (bio == NULL) {
@@ -120,13 +133,52 @@ out:
 	return ret;
 }
 
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		struct address_space *mapping = sis->swap_file->f_mapping;
+
+		if (mapping->a_ops->sync_page)
+			mapping->a_ops->sync_page(page);
+	} else {
+		block_sync_page(page);
+	}
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		struct address_space *mapping = sis->swap_file->f_mapping;
+
+		return mapping->a_ops->set_page_dirty(page);
+	} else {
+		return __set_page_dirty_nobuffers(page);
+	}
+}
+
 int swap_readpage(struct file *file, struct page *page)
 {
 	struct bio *bio;
 	int ret = 0;
+	struct swap_info_struct *sis = page_swap_info(page);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageUptodate(page));
+
+	if (sis->flags & SWP_FILE) {
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_in(swap_file, page);
+		if (!ret)
+			count_vm_event(PSWPIN);
+		return ret;
+	}
+
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -27,8 +27,8 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.sync_page	= block_sync_page,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.sync_page	= swap_sync_page,
+	.set_page_dirty	= swap_set_page_dirty,
 	.migratepage	= migrate_page,
 };
 
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1031,6 +1031,14 @@ static void destroy_swap_extents(struct 
 		list_del(&se->list);
 		kfree(se);
 	}
+
+	if (sis->flags & SWP_FILE) {
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		sis->flags &= ~SWP_FILE;
+		mapping->a_ops->swapoff(swap_file);
+	}
 }
 
 /*
@@ -1105,7 +1113,9 @@ add_swap_extent(struct swap_info_struct 
  */
 static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 {
-	struct inode *inode;
+	struct file *swap_file = sis->swap_file;
+	struct address_space *mapping = swap_file->f_mapping;
+	struct inode *inode = mapping->host;
 	unsigned blocks_per_page;
 	unsigned long page_no;
 	unsigned blkbits;
@@ -1116,13 +1126,22 @@ static int setup_swap_extents(struct swa
 	int nr_extents = 0;
 	int ret;
 
-	inode = sis->swap_file->f_mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		ret = add_swap_extent(sis, 0, sis->max, 0);
 		*span = sis->pages;
 		goto done;
 	}
 
+	if (mapping->a_ops->swapon) {
+		ret = mapping->a_ops->swapon(swap_file);
+		if (!ret) {
+			sis->flags |= SWP_FILE;
+			ret = add_swap_extent(sis, 0, sis->max, 0);
+			*span = sis->pages;
+		}
+		goto done;
+	}
+
 	blkbits = inode->i_blkbits;
 	blocks_per_page = PAGE_SIZE >> blkbits;
 
@@ -1691,7 +1710,7 @@ asmlinkage long sys_swapon(const char __
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 
@@ -1816,6 +1835,13 @@ get_swap_info_struct(unsigned type)
 	return &swap_info[type];
 }
 
+struct swap_info_struct *page_swap_info(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	BUG_ON(!PageSwapCache(page));
+	return &swap_info[swp_type(swap)];
+}
+
 /*
  * swap_lock prevents swap_map being freed. Don't grab an extra
  * reference on the swaphandle, it doesn't matter if it becomes unused.
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -509,6 +509,15 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
+
+	/*
+	 * swapfile support
+	 */
+	int (*swapon)(struct file *file);
+	int (*swapoff)(struct file *file);
+	int (*swap_out)(struct file *file, struct page *page,
+			struct writeback_control *wbc);
+	int (*swap_in)(struct file *file, struct page *page);
 };
 
 /*
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -169,6 +169,10 @@ prototypes:
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
 	int (*launder_page) (struct page *);
+	int (*swapon) (struct file *);
+	int (*swapoff) (struct file *);
+	int (*swap_out) (struct file *, struct page *, struct writeback_control *);
+	int (*swap_in)  (struct file *, struct page *);
 
 locking rules:
 	All except set_page_dirty may block
@@ -190,6 +194,10 @@ invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
 launder_page:		no	yes
+swapon			no
+swapoff			no
+swap_out		no	yes, unlocks
+swap_in			no	yes, unlocks
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -289,6 +297,20 @@ cleaned, or an error value if not. Note 
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
+	->swapon() will be called with a non-zero argument on files backing
+(non block device backed) swapfiles. A return value of zero indicates success,
+in which case this file can be used for backing swapspace. The swapspace
+operations will be proxied to the address space operations.
+
+	->swapoff() will be called in the sys_swapoff() path when ->swapon()
+returned success.
+
+	->swap_out() is used, once ->swapon() has returned success, to write
+out a swapcache page.
+
+	->swap_in() is used, once ->swapon() has returned success, to read
+in a swapcache page.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -336,7 +336,7 @@ static inline void invalidate_inode_buff
 static inline int remove_inode_buffers(struct inode *inode) { return 1; }
 static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
 static inline void invalidate_bdev(struct block_device *bdev) {}
-
+static inline void block_sync_page(struct page *) { }
 
 #endif /* CONFIG_BLOCK */
 #endif /* _LINUX_BUFFER_HEAD_H */
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -539,6 +539,11 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*swapon)(struct file *);
+	int (*swapoff)(struct file *);
+	int (*swap_out)(struct file *file, struct page *page,
+			struct writeback_control *wbc);
+	int (*swap_in)(struct file *file, struct page *page);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -724,6 +729,19 @@ struct address_space_operations {
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
 
+  swapon: Called when swapon is used on a file. A
+	return value of zero indicates success, in which case this
+	file can be used to back swapspace. The swapspace operations
+	will be proxied to this address space's ->swap_{out,in} methods.
+
+  swapoff: Called during swapoff on files where swapon was successful.
+
+  swap_out: Called to write a swapcache page to a backing store, similar to
+	writepage.
+
+  swap_in: Called to read a swapcache page from a backing store, similar to
+	readpage.
+
 The File Object
 ===============
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 25/30] mm: methods for teaching filesystems about PG_swapcache pages
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 24/30] mm: add support for non block device backed swap files Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 26/30] nfs: remove mempools Peter Zijlstra
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: mm-page_file_methods.patch --]
[-- Type: text/plain, Size: 3525 bytes --]

In order to teach filesystems to handle swap cache pages, three new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  loff_t page_file_offset(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index() - gives the offset of this page in the file in
PAGE_CACHE_SIZE blocks. Like page->index does for pagecache pages, this
function also gives the correct index for PG_swapcache pages.

page_file_offset() - uses page_file_index(), so that it will give the expected
result, even for PG_swapcache pages.

page_file_mapping() - gives the mapping backing the actual page; that is,
for swapcache pages it will give swap_file->f_mapping.
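
For illustration only (these helper names are made up): code that must work
for both pagecache and swapcache pages switches from the bare page->mapping /
page->index pattern to the new accessors like so:

static struct inode *example_page_inode(struct page *page)
{
	/* page->mapping points at swapper_space for PG_swapcache pages;
	 * page_file_mapping() returns the backing file's mapping instead */
	return page_file_mapping(page)->host;
}

static loff_t example_page_pos(struct page *page)
{
	/* correct offset in the backing file, even for swapcache pages */
	return page_file_offset(page);
}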

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm.h      |   25 +++++++++++++++++++++++++
 include/linux/pagemap.h |    5 +++++
 mm/swapfile.c           |   19 +++++++++++++++++++
 3 files changed, 49 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -598,6 +598,17 @@ static inline struct address_space *page
 	return mapping;
 }
 
+extern struct address_space *__page_file_mapping(struct page *);
+
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_mapping(page);
+
+	return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -614,6 +625,20 @@ static inline pgoff_t page_index(struct 
 	return page->index;
 }
 
+extern pgoff_t __page_file_index(struct page *page);
+
+/*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_index(page);
+
+	return page->index;
+}
+
 /*
  * The atomic page->_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -148,6 +148,11 @@ static inline loff_t page_offset(struct 
 	return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
 }
 
+static inline loff_t page_file_offset(struct page *page)
+{
+	return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
+}
+
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 					unsigned long address)
 {
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1828,6 +1828,25 @@ struct swap_info_struct *page_swap_info(
 }
 
 /*
+ * out-of-line __page_file_ methods to avoid include hell.
+ */
+
+struct address_space *__page_file_mapping(struct page *page)
+{
+	VM_BUG_ON(!PageSwapCache(page));
+	return page_swap_info(page)->swap_file->f_mapping;
+}
+EXPORT_SYMBOL_GPL(__page_file_mapping);
+
+pgoff_t __page_file_index(struct page *page)
+{
+	swp_entry_t swap = { .val = page_private(page) };
+	VM_BUG_ON(!PageSwapCache(page));
+	return swp_offset(swap);
+}
+EXPORT_SYMBOL_GPL(__page_file_index);
+
+/*
  * swap_lock prevents swap_map being freed. Don't grab an extra
  * reference on the swaphandle, it doesn't matter if it becomes unused.
  */

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 26/30] nfs: remove mempools
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 25/30] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:46   ` Nick Piggin
  2008-07-24 14:01 ` [PATCH 27/30] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
                   ` (4 subsequent siblings)
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: nfs-no-mempool.patch --]
[-- Type: text/plain, Size: 4520 bytes --]

With the introduction of the shared dirty page accounting in .19, NFS should
not be able to surprise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence no more need for mempools.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/read.c  |   15 +++------------
 fs/nfs/write.c |   27 +++++----------------------
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ	(32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_rdata_mempool);
+				kmem_cache_free(nfs_rdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -62,7 +59,7 @@ static void nfs_readdata_free(struct nfs
 {
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_rdata_mempool);
+	kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 void nfs_readdata_release(void *data)
@@ -614,16 +611,10 @@ int __init nfs_init_readpagecache(void)
 	if (nfs_rdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-						     nfs_rdata_cachep);
-	if (nfs_rdata_mempool == NULL)
-		return -ENOMEM;
-
 	return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-	mempool_destroy(nfs_rdata_mempool);
 	kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE		(32)
-#define MIN_POOL_COMMIT		(4)
-
 /*
  * Local function declarations
  */
@@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commitdata_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -63,12 +58,12 @@ void nfs_commit_free(struct nfs_write_da
 {
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_commit_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -79,7 +74,7 @@ struct nfs_write_data *nfs_writedata_all
 		else {
 			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
 			if (!p->pagevec) {
-				mempool_free(p, nfs_wdata_mempool);
+				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
 			}
 		}
@@ -91,7 +86,7 @@ static void nfs_writedata_free(struct nf
 {
 	if (p && (p->pagevec != &p->page_array[0]))
 		kfree(p->pagevec);
-	mempool_free(p, nfs_wdata_mempool);
+	kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_writedata_release(void *data)
@@ -1552,16 +1547,6 @@ int __init nfs_init_writepagecache(void)
 	if (nfs_wdata_cachep == NULL)
 		return -ENOMEM;
 
-	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
-						     nfs_wdata_cachep);
-	if (nfs_wdata_mempool == NULL)
-		return -ENOMEM;
-
-	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
-						      nfs_wdata_cachep);
-	if (nfs_commit_mempool == NULL)
-		return -ENOMEM;
-
 	/*
 	 * NFS congestion size, scale with available memory.
 	 *
@@ -1587,8 +1572,6 @@ int __init nfs_init_writepagecache(void)
 
 void nfs_destroy_writepagecache(void)
 {
-	mempool_destroy(nfs_commit_mempool);
-	mempool_destroy(nfs_wdata_mempool);
 	kmem_cache_destroy(nfs_wdata_cachep);
 }
 

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 27/30] nfs: teach the NFS client how to treat PG_swapcache pages
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (25 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 26/30] nfs: remove mempools Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 28/30] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: nfs-swapcache.patch --]
[-- Type: text/plain, Size: 12409 bytes --]

Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/file.c     |    6 +++---
 fs/nfs/internal.h |    7 ++++---
 fs/nfs/pagelist.c |    6 +++---
 fs/nfs/read.c     |    6 +++---
 fs/nfs/write.c    |   53 +++++++++++++++++++++++++++--------------------------
 5 files changed, 40 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -413,7 +413,7 @@ static void nfs_invalidate_page(struct p
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
-	nfs_wb_page_cancel(page->mapping->host, page);
+	nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -426,7 +426,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 
 	dfprintk(PAGECACHE, "NFS: launder_page(%ld, %llu)\n",
 		inode->i_ino, (long long)page_offset(page));
@@ -462,7 +462,7 @@ static int nfs_vm_page_mkwrite(struct vm
 		(long long)page_offset(page));
 
 	lock_page(page);
-	mapping = page->mapping;
+	mapping = page_file_mapping(page);
 	if (mapping != dentry->d_inode->i_mapping)
 		goto out_unlock;
 
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -76,11 +76,11 @@ nfs_create_request(struct nfs_open_conte
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
 	atomic_set(&req->wb_complete, 0);
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	BUG_ON(PagePrivate(page));
 	BUG_ON(!PageLocked(page));
-	BUG_ON(page->mapping->host != inode);
+	BUG_ON(page_file_mapping(page)->host != inode);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
 	req->wb_bytes   = count;
@@ -376,7 +376,7 @@ void nfs_pageio_cond_complete(struct nfs
  * nfs_scan_list - Scan a list for matching requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  * @tag: tag to scan for
  *
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -474,11 +474,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
 	struct nfs_open_context *ctx;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	int		error;
 
 	dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
-		page, PAGE_CACHE_SIZE, page->index);
+		page, PAGE_CACHE_SIZE, page_file_index(page));
 	nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
 	nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -525,7 +525,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
 	struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *new;
 	unsigned int len;
 	int error;
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -115,7 +115,7 @@ static struct nfs_page *nfs_page_find_re
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
@@ -127,13 +127,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	loff_t end, i_size = i_size_read(inode);
 	pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
 
-	if (i_size > 0 && page->index < end_index)
+	if (i_size > 0 && page_file_index(page) < end_index)
 		return;
-	end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+	end = page_file_offset(page) + ((loff_t)offset+count);
 	if (i_size >= end)
 		return;
 	nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -146,7 +146,7 @@ static void nfs_grow_file(struct page *p
 static void nfs_set_pageerror(struct page *page)
 {
 	SetPageError(page);
-	nfs_zap_mapping(page->mapping->host, page->mapping);
+	nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
 }
 
 /* We can set the PG_uptodate flag if we see that a write request
@@ -187,7 +187,7 @@ static int nfs_set_page_writeback(struct
 	int ret = test_set_page_writeback(page);
 
 	if (!ret) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_long_inc_return(&nfss->writeback) >
@@ -199,7 +199,7 @@ static int nfs_set_page_writeback(struct
 
 static void nfs_end_page_writeback(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
@@ -214,7 +214,7 @@ static void nfs_end_page_writeback(struc
 static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
 				struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req;
 	int ret;
 
@@ -257,12 +257,12 @@ static int nfs_page_async_flush(struct n
 
 static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
 	nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);
 
-	nfs_pageio_cond_complete(pgio, page->index);
+	nfs_pageio_cond_complete(pgio, page_file_index(page));
 	return nfs_page_async_flush(pgio, page);
 }
 
@@ -274,7 +274,7 @@ static int nfs_writepage_locked(struct p
 	struct nfs_pageio_descriptor pgio;
 	int err;
 
-	nfs_pageio_init_write(&pgio, page->mapping->host, wb_priority(wbc));
+	nfs_pageio_init_write(&pgio, page_file_mapping(page)->host, wb_priority(wbc));
 	err = nfs_do_writepage(page, wbc, &pgio);
 	nfs_pageio_complete(&pgio);
 	if (err < 0)
@@ -403,7 +403,8 @@ nfs_mark_request_commit(struct nfs_page 
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+	inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
+			BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -414,7 +415,7 @@ nfs_clear_request_commit(struct nfs_page
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+		dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
 	}
 	return 0;
@@ -520,7 +521,7 @@ static void nfs_cancel_commit_list(struc
  * nfs_scan_commit - Scan an inode for commit requests
  * @inode: NFS inode to scan
  * @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
  * @npages: idx_start + npages sets the upper bound to scan.
  *
  * Moves requests from the inode's 'commit' request list.
@@ -631,7 +632,7 @@ out_err:
 static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
 		struct page *page, unsigned int offset, unsigned int bytes)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page	*req;
 	int error;
 
@@ -686,7 +687,7 @@ int nfs_flush_incompatible(struct file *
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status = nfs_wb_page(page->mapping->host, page);
+		status = nfs_wb_page(page_file_mapping(page)->host, page);
 	} while (status == 0);
 	return status;
 }
@@ -712,7 +713,7 @@ int nfs_updatepage(struct file *file, st
 		unsigned int offset, unsigned int count)
 {
 	struct nfs_open_context *ctx = nfs_file_open_context(file);
-	struct inode	*inode = page->mapping->host;
+	struct inode	*inode = page_file_mapping(page)->host;
 	int		status = 0;
 
 	nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
@@ -720,7 +721,7 @@ int nfs_updatepage(struct file *file, st
 	dprintk("NFS:       nfs_updatepage(%s/%s %d@%lld)\n",
 		file->f_path.dentry->d_parent->d_name.name,
 		file->f_path.dentry->d_name.name, count,
-		(long long)(page_offset(page) + offset));
+		(long long)(page_file_offset(page) + offset));
 
 	/* If we're not using byte range locks, and we know the page
 	 * is up to date, it may be more efficient to extend the write
@@ -995,7 +996,7 @@ static void nfs_writeback_release_partia
 	}
 
 	if (nfs_write_need_commit(data)) {
-		struct inode *inode = page->mapping->host;
+		struct inode *inode = page_file_mapping(page)->host;
 
 		spin_lock(&inode->i_lock);
 		if (test_bit(PG_NEED_RESCHED, &req->wb_flags)) {
@@ -1256,7 +1257,7 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+		dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
 				BDI_RECLAIMABLE);
 		nfs_clear_page_tag_locked(req);
 	}
@@ -1447,10 +1448,10 @@ int nfs_wb_nocommit(struct inode *inode)
 int nfs_wb_page_cancel(struct inode *inode, struct page *page)
 {
 	struct nfs_page *req;
-	loff_t range_start = page_offset(page);
+	loff_t range_start = page_file_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1483,7 +1484,7 @@ int nfs_wb_page_cancel(struct inode *ino
 	}
 	if (!PagePrivate(page))
 		return 0;
-	ret = nfs_sync_mapping_wait(page->mapping, &wbc, FLUSH_INVALIDATE);
+	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
 out:
 	return ret;
 }
@@ -1491,10 +1492,10 @@ out:
 static int nfs_wb_page_priority(struct inode *inode, struct page *page,
 				int how)
 {
-	loff_t range_start = page_offset(page);
+	loff_t range_start = page_file_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
-		.bdi = page->mapping->backing_dev_info,
+		.bdi = page_file_mapping(page)->backing_dev_info,
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = LONG_MAX,
 		.range_start = range_start,
@@ -1509,7 +1510,7 @@ static int nfs_wb_page_priority(struct i
 				goto out_error;
 		} else if (!PagePrivate(page))
 			break;
-		ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
+		ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
 		if (ret < 0)
 			goto out_error;
 	} while (PagePrivate(page));
Index: linux-2.6/fs/nfs/internal.h
===================================================================
--- linux-2.6.orig/fs/nfs/internal.h
+++ linux-2.6/fs/nfs/internal.h
@@ -253,13 +253,14 @@ void nfs_super_set_maxbytes(struct super
 static inline
 unsigned int nfs_page_length(struct page *page)
 {
-	loff_t i_size = i_size_read(page->mapping->host);
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
 
 	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
 		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
-		if (page->index < end_index)
+		if (page_index < end_index)
 			return PAGE_CACHE_SIZE;
-		if (page->index == end_index)
+		if (page_index == end_index)
 			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
 	}
 	return 0;

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 28/30] nfs: disable data cache revalidation for swapfiles
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (26 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 27/30] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 29/30] nfs: enable swap on NFS Peter Zijlstra
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 5018 bytes --]

Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we no longer set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. The
nfs_page_find_request() logic is therefore extended to fall back to a radix
tree lookup for swapcache pages.
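
The lookup change boils down to the following (simplified sketch of the code
in the patch below; the real version also takes a reference on the request
and runs under the inode's i_lock):

static struct nfs_page *find_request_locked(struct nfs_inode *nfsi,
					    struct page *page)
{
	if (PagePrivate(page))		/* ordinary pagecache page */
		return (struct nfs_page *)page_private(page);

	if (PageSwapCache(page))	/* no PG_private: consult the radix tree */
		return radix_tree_lookup(&nfsi->nfs_page_tree,
					 page_file_index(page));

	return NULL;
}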

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/inode.c |    6 ++++
 fs/nfs/write.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 64 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -824,6 +824,12 @@ int nfs_revalidate_mapping_nolock(struct
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -101,25 +101,62 @@ static void nfs_context_set_write_error(
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
 	struct nfs_page *req = NULL;
 
-	if (PagePrivate(page)) {
+	if (PagePrivate(page))
 		req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			kref_get(&req->wb_kref);
-	}
+	else if (unlikely(PageSwapCache(page)))
+		req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+	if (get && req)
+		kref_get(&req->wb_kref);
+
 	return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+	return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+	struct inode *inode = page_file_mapping(page)->host;
+	struct nfs_page *req = NULL;
+
+	spin_lock(&inode->i_lock);
+	req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+	spin_unlock(&inode->i_lock);
+
+	/*
+	 * hole here plugged by the caller holding onto PG_locked
+	 */
+
+	return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+	if (PagePrivate(page))
+		return 1;
+
+	if (unlikely(PageSwapCache(page)))
+		return __nfs_page_has_request(page);
+
+	return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
 	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
 	spin_unlock(&inode->i_lock);
 	return req;
 }
@@ -220,7 +257,7 @@ static int nfs_page_async_flush(struct n
 
 	spin_lock(&inode->i_lock);
 	for(;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req == NULL) {
 			spin_unlock(&inode->i_lock);
 			return 0;
@@ -343,8 +380,14 @@ static int nfs_inode_add_request(struct 
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	nfsi->npages++;
 	kref_get(&req->wb_kref);
 	radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
@@ -366,8 +409,10 @@ static void nfs_inode_remove_request(str
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&inode->i_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+	}
 	radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
 	nfsi->npages--;
 	if (!nfsi->npages) {
@@ -571,7 +616,7 @@ static struct nfs_page *nfs_try_to_updat
 	spin_lock(&inode->i_lock);
 
 	for (;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req == NULL)
 			goto out_unlock;
 
@@ -1482,7 +1527,7 @@ int nfs_wb_page_cancel(struct inode *ino
 		if (ret < 0)
 			goto out;
 	}
-	if (!PagePrivate(page))
+	if (!nfs_page_has_request(page))
 		return 0;
 	ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
 out:

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 29/30] nfs: enable swap on NFS
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (27 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 28/30] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-24 14:01 ` [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
  2008-09-30 12:41 ` [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
  30 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: nfs-swap_ops.patch --]
[-- Type: text/plain, Size: 11193 bytes --]

Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset
SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects,
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)
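
All four connect workers follow the same pattern; condensed here for
illustration (the real changes are in the xs_*_connect_worker* functions
below):

static void example_connect_worker(struct rpc_xprt *xprt)
{
	unsigned long pflags = current->flags;

	if (xprt->swapper)
		current->flags |= PF_MEMALLOC;	/* may dip into the reserves */

	/* ... tear down and re-establish the socket ... */

	tsk_restore_flags(current, pflags, PF_MEMALLOC);
}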

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/Kconfig                  |   17 ++++++++++
 fs/nfs/file.c               |   18 ++++++++++
 fs/nfs/write.c              |   22 +++++++++++++
 include/linux/nfs_fs.h      |    2 +
 include/linux/sunrpc/xprt.h |    5 ++-
 net/sunrpc/sched.c          |    9 ++++-
 net/sunrpc/xprtsock.c       |   73 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 143 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -434,6 +434,18 @@ static int nfs_launder_page(struct page 
 	return nfs_wb_page(inode, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapon(struct file *file)
+{
+	return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 1);
+}
+
+static int nfs_swapoff(struct file *file)
+{
+	return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 0);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -446,6 +458,12 @@ const struct address_space_operations nf
 	.releasepage = nfs_release_page,
 	.direct_IO = nfs_direct_IO,
 	.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+	.swapon = nfs_swapon,
+	.swapoff = nfs_swapoff,
+	.swap_out = nfs_swap_out,
+	.swap_in = nfs_readpage,
+#endif
 };
 
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -330,6 +330,28 @@ int nfs_writepage(struct page *page, str
 	return ret;
 }
 
+static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
+		unsigned int offset, unsigned int count);
+
+int nfs_swap_out(struct file *file, struct page *page,
+		 struct writeback_control *wbc)
+{
+	struct nfs_open_context *ctx = nfs_file_open_context(file);
+	int status;
+
+	status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+	if (status < 0) {
+		nfs_set_pageerror(page);
+		goto out;
+	}
+
+	status = nfs_writepage_locked(page, wbc);
+
+out:
+	unlock_page(page);
+	return status;
+}
+
 static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
 {
 	int ret;
Index: linux-2.6/include/linux/nfs_fs.h
===================================================================
--- linux-2.6.orig/include/linux/nfs_fs.h
+++ linux-2.6/include/linux/nfs_fs.h
@@ -463,6 +463,8 @@ extern int  nfs_flush_incompatible(struc
 extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
+extern int  nfs_swap_out(struct file *file, struct page *page,
+			 struct writeback_control *wbc);
 
 /*
  * Try to write back everything synchronously (but check the
Index: linux-2.6/fs/Kconfig
===================================================================
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1692,6 +1692,18 @@ config ROOT_NFS
 
 	  Most people say N here.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
+	  For more details, see Documentation/network-swap.txt
+
+	  If unsure, say N.
+
 config NFSD
 	tristate "NFS server support"
 	depends on INET
@@ -1813,6 +1825,11 @@ config SUNRPC_XPRT_RDMA
 
 	  If unsure, say N.
 
+config SUNRPC_SWAP
+	def_bool n
+	depends on SUNRPC
+	select NETVM
+
 config RPCSEC_GSS_KRB5
 	tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
 	depends on SUNRPC && EXPERIMENTAL
Index: linux-2.6/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -147,7 +147,9 @@ struct rpc_xprt {
 	unsigned int		max_reqs;	/* total slots */
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
-				resvport   : 1; /* use a reserved port */
+				resvport   : 1, /* use a reserved port */
+				swapper    : 1; /* we're swapping over this
+						   transport */
 	unsigned int		bind_index;	/* bind function index */
 
 	/*
@@ -249,6 +251,7 @@ void			xprt_release_rqst_cong(struct rpc
 void			xprt_disconnect_done(struct rpc_xprt *xprt);
 void			xprt_force_disconnect(struct rpc_xprt *xprt);
 void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
Index: linux-2.6/net/sunrpc/sched.c
===================================================================
--- linux-2.6.orig/net/sunrpc/sched.c
+++ linux-2.6/net/sunrpc/sched.c
@@ -729,7 +729,10 @@ struct rpc_buffer {
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
 	struct rpc_buffer *buf;
-	gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+	gfp_t gfp = GFP_NOWAIT;
+
+	if (RPC_IS_SWAPPER(task))
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
@@ -800,6 +803,8 @@ static void rpc_init_task(struct rpc_tas
 		kref_get(&task->tk_client->cl_kref);
 		if (task->tk_client->cl_softrtry)
 			task->tk_flags |= RPC_TASK_SOFT;
+		if (task->tk_client->cl_xprt->swapper)
+			task->tk_flags |= RPC_TASK_SWAPPER;
 	}
 
 	if (task->tk_ops->rpc_call_prepare != NULL)
@@ -825,7 +830,7 @@ static void rpc_init_task(struct rpc_tas
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 /*
Index: linux-2.6/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1445,6 +1445,55 @@ static inline void xs_reclassify_socket6
 }
 #endif
 
+#ifdef CONFIG_SUNRPC_SWAP
+static void xs_set_memalloc(struct rpc_xprt *xprt)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+
+	if (xprt->swapper)
+		sk_set_memalloc(transport->inet);
+}
+
+#define RPC_BUF_RESERVE_PAGES \
+	kmalloc_estimate_fixed(sizeof(struct rpc_rqst), GFP_KERNEL, RPC_MAX_SLOT_TABLE)
+#define RPC_RESERVE_PAGES	(RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	int err = 0;
+
+	if (enable) {
+		/*
+		 * keep one extra sock reference so the reserve won't dip
+		 * when the socket gets reconnected.
+		 */
+		err = sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
+		if (!err) {
+			xprt->swapper = 1;
+			xs_set_memalloc(xprt);
+		}
+	} else if (xprt->swapper) {
+		xprt->swapper = 0;
+		sk_clear_memalloc(transport->inet);
+		sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+#else
+static void xs_set_memalloc(struct rpc_xprt *xprt)
+{
+}
+#endif
+
 static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 {
 	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -1469,6 +1518,8 @@ static void xs_udp_finish_connecting(str
 		transport->sock = sock;
 		transport->inet = sk;
 
+		xs_set_memalloc(xprt);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1486,11 +1537,15 @@ static void xs_udp_connect_worker4(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1513,6 +1568,7 @@ static void xs_udp_connect_worker4(struc
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -1527,11 +1583,15 @@ static void xs_udp_connect_worker6(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
@@ -1554,6 +1614,7 @@ static void xs_udp_connect_worker6(struc
 out:
 	xprt_wake_pending_tasks(xprt, status);
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1613,6 +1674,8 @@ static int xs_tcp_finish_connecting(stru
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 
+	xs_set_memalloc(xprt);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -1631,11 +1694,15 @@ static void xs_tcp_connect_worker4(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1677,6 +1744,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
@@ -1691,11 +1759,15 @@ static void xs_tcp_connect_worker6(struc
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int err, status = -EIO;
 
 	if (xprt->shutdown || !xprt_bound(xprt))
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		/* start from scratch */
 		if ((err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1736,6 +1808,7 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 out_clear:
 	xprt_clear_connecting(xprt);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS.
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (28 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 29/30] nfs: enable swap on NFS Peter Zijlstra
@ 2008-07-24 14:01 ` Peter Zijlstra
  2008-07-25 10:46   ` KOSAKI Motohiro
  2008-09-30 12:41 ` [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Peter Zijlstra,
	Neil Brown

[-- Attachment #1: nfs-alloc-recursions.patch --]
[-- Type: text/plain, Size: 1866 bytes --]

GFP_NOFS is not enough, since swap traffic is I/O; hence fall back to GFP_NOIO.
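
For reference, an illustrative recap (flag composition as in
include/linux/gfp.h of this kernel):

/*
 *   GFP_KERNEL = __GFP_WAIT | __GFP_IO | __GFP_FS
 *   GFP_NOFS   = __GFP_WAIT | __GFP_IO    -- reclaim may still start block I/O
 *   GFP_NOIO   = __GFP_WAIT               -- reclaim may not start any I/O
 *
 * Allocations made while a swap page is being written out are themselves
 * part of the I/O path, hence:
 */
struct nfs_page *p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);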

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -45,7 +45,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commitdata_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -63,7 +63,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -72,7 +72,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);

-- 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 26/30] nfs: remove mempools
  2008-07-24 14:01 ` [PATCH 26/30] nfs: remove mempools Peter Zijlstra
@ 2008-07-24 14:46   ` Nick Piggin
  2008-07-24 14:53     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Nick Piggin @ 2008-07-24 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Neil Brown

On Friday 25 July 2008 00:01, Peter Zijlstra wrote:
> With the introduction of the shared dirty page accounting in .19, NFS
> should not be able to surprise the VM with all dirty pages. Thus it should
> always be able to free some memory. Hence no more need for mempools.

It seems like a very backward step to me to go from a hard guarantee
to some heuristic that could break for someone in some particular
setup.

Filling with dirty pages isn't the only way to exhaust free reclaimable
memory, remember; it can also happen due to, say, mlock or kernel
allocations.

Is there a pressing reason to remove them?

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/nfs/read.c  |   15 +++------------
>  fs/nfs/write.c |   27 +++++----------------------
>  2 files changed, 8 insertions(+), 34 deletions(-)
>
> Index: linux-2.6/fs/nfs/read.c
> ===================================================================
> --- linux-2.6.orig/fs/nfs/read.c
> +++ linux-2.6/fs/nfs/read.c
> @@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
>  static const struct rpc_call_ops nfs_read_full_ops;
>
>  static struct kmem_cache *nfs_rdata_cachep;
> -static mempool_t *nfs_rdata_mempool;
> -
> -#define MIN_POOL_READ	(32)
>
>  struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
>  {
> -	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
> +	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
>
>  	if (p) {
>  		memset(p, 0, sizeof(*p));
> @@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
>  		else {
>  			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
>  			if (!p->pagevec) {
> -				mempool_free(p, nfs_rdata_mempool);
> +				kmem_cache_free(nfs_rdata_cachep, p);
>  				p = NULL;
>  			}
>  		}
> @@ -62,7 +59,7 @@ static void nfs_readdata_free(struct nfs
>  {
>  	if (p && (p->pagevec != &p->page_array[0]))
>  		kfree(p->pagevec);
> -	mempool_free(p, nfs_rdata_mempool);
> +	kmem_cache_free(nfs_rdata_cachep, p);
>  }
>
>  void nfs_readdata_release(void *data)
> @@ -614,16 +611,10 @@ int __init nfs_init_readpagecache(void)
>  	if (nfs_rdata_cachep == NULL)
>  		return -ENOMEM;
>
> -	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
> -						     nfs_rdata_cachep);
> -	if (nfs_rdata_mempool == NULL)
> -		return -ENOMEM;
> -
>  	return 0;
>  }
>
>  void nfs_destroy_readpagecache(void)
>  {
> -	mempool_destroy(nfs_rdata_mempool);
>  	kmem_cache_destroy(nfs_rdata_cachep);
>  }
> Index: linux-2.6/fs/nfs/write.c
> ===================================================================
> --- linux-2.6.orig/fs/nfs/write.c
> +++ linux-2.6/fs/nfs/write.c
> @@ -28,9 +28,6 @@
>
>  #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
>
> -#define MIN_POOL_WRITE		(32)
> -#define MIN_POOL_COMMIT		(4)
> -
>  /*
>   * Local function declarations
>   */
> @@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
>  static const struct rpc_call_ops nfs_commit_ops;
>
>  static struct kmem_cache *nfs_wdata_cachep;
> -static mempool_t *nfs_wdata_mempool;
> -static mempool_t *nfs_commit_mempool;
>
>  struct nfs_write_data *nfs_commitdata_alloc(void)
>  {
> -	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
> +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
>
>  	if (p) {
>  		memset(p, 0, sizeof(*p));
> @@ -63,12 +58,12 @@ void nfs_commit_free(struct nfs_write_da
>  {
>  	if (p && (p->pagevec != &p->page_array[0]))
>  		kfree(p->pagevec);
> -	mempool_free(p, nfs_commit_mempool);
> +	kmem_cache_free(nfs_wdata_cachep, p);
>  }
>
>  struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
>  {
> -	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
> +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
>
>  	if (p) {
>  		memset(p, 0, sizeof(*p));
> @@ -79,7 +74,7 @@ struct nfs_write_data *nfs_writedata_all
>  		else {
>  			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
>  			if (!p->pagevec) {
> -				mempool_free(p, nfs_wdata_mempool);
> +				kmem_cache_free(nfs_wdata_cachep, p);
>  				p = NULL;
>  			}
>  		}
> @@ -91,7 +86,7 @@ static void nfs_writedata_free(struct nf
>  {
>  	if (p && (p->pagevec != &p->page_array[0]))
>  		kfree(p->pagevec);
> -	mempool_free(p, nfs_wdata_mempool);
> +	kmem_cache_free(nfs_wdata_cachep, p);
>  }
>
>  void nfs_writedata_release(void *data)
> @@ -1552,16 +1547,6 @@ int __init nfs_init_writepagecache(void)
>  	if (nfs_wdata_cachep == NULL)
>  		return -ENOMEM;
>
> -	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
> -						     nfs_wdata_cachep);
> -	if (nfs_wdata_mempool == NULL)
> -		return -ENOMEM;
> -
> -	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
> -						      nfs_wdata_cachep);
> -	if (nfs_commit_mempool == NULL)
> -		return -ENOMEM;
> -
>  	/*
>  	 * NFS congestion size, scale with available memory.
>  	 *
> @@ -1587,8 +1572,6 @@ int __init nfs_init_writepagecache(void)
>
>  void nfs_destroy_writepagecache(void)
>  {
> -	mempool_destroy(nfs_commit_mempool);
> -	mempool_destroy(nfs_wdata_mempool);
>  	kmem_cache_destroy(nfs_wdata_cachep);
>  }

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 26/30] nfs: remove mempools
  2008-07-24 14:46   ` Nick Piggin
@ 2008-07-24 14:53     ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-24 14:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Neil Brown

On Fri, 2008-07-25 at 00:46 +1000, Nick Piggin wrote:
> On Friday 25 July 2008 00:01, Peter Zijlstra wrote:
> > With the introduction of the shared dirty page accounting in .19, NFS
> > should not be able to surprise the VM with all dirty pages. Thus it should
> > always be able to free some memory. Hence no more need for mempools.
> 
> It seems like a very backward step to me to go from a hard guarantee
> to some heuristic that could break for someone in some particular
> setup.
> 
> Filling with dirty pages isn't the only way to exhaust free reclaimable
> memory, remember; it can also happen due to, say, mlock or kernel
> allocations.
> 
> Is there a pressing reason to remove them?

There was some funny interaction between mempools and PF_MEMALLOC as
used to write out swap pages, IIRC.

Specifically, the mempool would strip it and not provide memory even
though we needed it.
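
(For context: the "strip" above refers to the gfp adjustment mempool_alloc()
makes before falling through to the underlying allocator; roughly, in
mm/mempool.c of this era:

	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
	gfp_mask |= __GFP_NOWARN;	/* failures are OK */

so even a PF_MEMALLOC task that allocates through a mempool is, via
__GFP_NOMEMALLOC, kept away from the very reserves it is entitled to.)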

I've never hit a deadlock due to this patch in my testing, although
admittedly there might be a hole in my testing.

I can either 'fix' mempools, or replace it with the reserve framework
introduced earlier in this patch-set.

> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  fs/nfs/read.c  |   15 +++------------
> >  fs/nfs/write.c |   27 +++++----------------------
> >  2 files changed, 8 insertions(+), 34 deletions(-)
> >
> > Index: linux-2.6/fs/nfs/read.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfs/read.c
> > +++ linux-2.6/fs/nfs/read.c
> > @@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
> >  static const struct rpc_call_ops nfs_read_full_ops;
> >
> >  static struct kmem_cache *nfs_rdata_cachep;
> > -static mempool_t *nfs_rdata_mempool;
> > -
> > -#define MIN_POOL_READ	(32)
> >
> >  struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
> >  {
> > -	struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
> > +	struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
> >
> >  	if (p) {
> >  		memset(p, 0, sizeof(*p));
> > @@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
> >  		else {
> >  			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
> >  			if (!p->pagevec) {
> > -				mempool_free(p, nfs_rdata_mempool);
> > +				kmem_cache_free(nfs_rdata_cachep, p);
> >  				p = NULL;
> >  			}
> >  		}
> > @@ -62,7 +59,7 @@ static void nfs_readdata_free(struct nfs
> >  {
> >  	if (p && (p->pagevec != &p->page_array[0]))
> >  		kfree(p->pagevec);
> > -	mempool_free(p, nfs_rdata_mempool);
> > +	kmem_cache_free(nfs_rdata_cachep, p);
> >  }
> >
> >  void nfs_readdata_release(void *data)
> > @@ -614,16 +611,10 @@ int __init nfs_init_readpagecache(void)
> >  	if (nfs_rdata_cachep == NULL)
> >  		return -ENOMEM;
> >
> > -	nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
> > -						     nfs_rdata_cachep);
> > -	if (nfs_rdata_mempool == NULL)
> > -		return -ENOMEM;
> > -
> >  	return 0;
> >  }
> >
> >  void nfs_destroy_readpagecache(void)
> >  {
> > -	mempool_destroy(nfs_rdata_mempool);
> >  	kmem_cache_destroy(nfs_rdata_cachep);
> >  }
> > Index: linux-2.6/fs/nfs/write.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfs/write.c
> > +++ linux-2.6/fs/nfs/write.c
> > @@ -28,9 +28,6 @@
> >
> >  #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
> >
> > -#define MIN_POOL_WRITE		(32)
> > -#define MIN_POOL_COMMIT		(4)
> > -
> >  /*
> >   * Local function declarations
> >   */
> > @@ -45,12 +42,10 @@ static const struct rpc_call_ops nfs_wri
> >  static const struct rpc_call_ops nfs_commit_ops;
> >
> >  static struct kmem_cache *nfs_wdata_cachep;
> > -static mempool_t *nfs_wdata_mempool;
> > -static mempool_t *nfs_commit_mempool;
> >
> >  struct nfs_write_data *nfs_commitdata_alloc(void)
> >  {
> > -	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
> > +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
> >
> >  	if (p) {
> >  		memset(p, 0, sizeof(*p));
> > @@ -63,12 +58,12 @@ void nfs_commit_free(struct nfs_write_da
> >  {
> >  	if (p && (p->pagevec != &p->page_array[0]))
> >  		kfree(p->pagevec);
> > -	mempool_free(p, nfs_commit_mempool);
> > +	kmem_cache_free(nfs_wdata_cachep, p);
> >  }
> >
> >  struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
> >  {
> > -	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
> > +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
> >
> >  	if (p) {
> >  		memset(p, 0, sizeof(*p));
> > @@ -79,7 +74,7 @@ struct nfs_write_data *nfs_writedata_all
> >  		else {
> >  			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
> >  			if (!p->pagevec) {
> > -				mempool_free(p, nfs_wdata_mempool);
> > +				kmem_cache_free(nfs_wdata_cachep, p);
> >  				p = NULL;
> >  			}
> >  		}
> > @@ -91,7 +86,7 @@ static void nfs_writedata_free(struct nf
> >  {
> >  	if (p && (p->pagevec != &p->page_array[0]))
> >  		kfree(p->pagevec);
> > -	mempool_free(p, nfs_wdata_mempool);
> > +	kmem_cache_free(nfs_wdata_cachep, p);
> >  }
> >
> >  void nfs_writedata_release(void *data)
> > @@ -1552,16 +1547,6 @@ int __init nfs_init_writepagecache(void)
> >  	if (nfs_wdata_cachep == NULL)
> >  		return -ENOMEM;
> >
> > -	nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
> > -						     nfs_wdata_cachep);
> > -	if (nfs_wdata_mempool == NULL)
> > -		return -ENOMEM;
> > -
> > -	nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
> > -						      nfs_wdata_cachep);
> > -	if (nfs_commit_mempool == NULL)
> > -		return -ENOMEM;
> > -
> >  	/*
> >  	 * NFS congestion size, scale with available memory.
> >  	 *
> > @@ -1587,8 +1572,6 @@ int __init nfs_init_writepagecache(void)
> >
> >  void nfs_destroy_writepagecache(void)
> >  {
> > -	mempool_destroy(nfs_commit_mempool);
> > -	mempool_destroy(nfs_wdata_mempool);
> >  	kmem_cache_destroy(nfs_wdata_cachep);
> >  }


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 11/30] mm: __GFP_MEMALLOC
  2008-07-24 14:00 ` [PATCH 11/30] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2008-07-25  9:29   ` KOSAKI Motohiro
  2008-07-25  9:35     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2008-07-25  9:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Pekka Enberg,
	Neil Brown

Hi Peter,

> __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
> much like PF_MEMALLOC.
> 
> It allows one to pass along the memalloc state in object related allocation
> flags as opposed to task related flags, such as sk->sk_allocation.

Is this the proper name?
Page alloc is always "mem alloc".

You wrote the comment as "Use emergency reserves", and
this flag works by turning on ALLOC_NO_WATERMARKS.

Then, wouldn't __GFP_NO_WATERMARK or __GFP_EMERGENCY be better?
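
(For reference, a sketch of how that presumably looks inside the series'
gfp_to_alloc_flags(); the helper is used later in this thread but its body is
not quoted here, so the exact shape below is an assumption:

	/* sketch only */
	if (gfp_mask & __GFP_MEMALLOC)
		alloc_flags |= ALLOC_NO_WATERMARKS;
	else if (!(gfp_mask & __GFP_NOMEMALLOC) && (current->flags & PF_MEMALLOC))
		alloc_flags |= ALLOC_NO_WATERMARKS;

i.e. the new gfp flag requests the same no-watermark treatment a PF_MEMALLOC
task already gets.)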





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 11/30] mm: __GFP_MEMALLOC
  2008-07-25  9:29   ` KOSAKI Motohiro
@ 2008-07-25  9:35     ` Peter Zijlstra
  2008-07-25  9:39       ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-25  9:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Neil Brown

On Fri, 2008-07-25 at 18:29 +0900, KOSAKI Motohiro wrote:
> Hi Peter,
> 
> > __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
> > much like PF_MEMALLOC.
> > 
> > It allows one to pass along the memalloc state in object related allocation
> > flags as opposed to task related flags, such as sk->sk_allocation.
> 
> Is this the proper name?
> Page alloc is always "mem alloc".
> 
> You wrote the comment as "Use emergency reserves", and
> this flag works by turning on ALLOC_NO_WATERMARKS.
> 
> Then, wouldn't __GFP_NO_WATERMARK or __GFP_EMERGENCY be better?

We've been through this 'pick a better name' thing several times :-/

Yes, I agree, __GFP_MEMALLOC is a misnomer; however, it's consistent with
PF_MEMALLOC and __GFP_NOMEMALLOC - of which people know the semantics.

Creating a new name with similar semantics can only serve to confuse.

So unless enough people think it's worth renaming all of them, I think
we're better off with this name.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 11/30] mm: __GFP_MEMALLOC
  2008-07-25  9:35     ` Peter Zijlstra
@ 2008-07-25  9:39       ` KOSAKI Motohiro
  0 siblings, 0 replies; 74+ messages in thread
From: KOSAKI Motohiro @ 2008-07-25  9:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Pekka Enberg,
	Neil Brown

> On Fri, 2008-07-25 at 18:29 +0900, KOSAKI Motohiro wrote:
> > Hi Peter,
> > 
> > > __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
> > > much like PF_MEMALLOC.
> > > 
> > > It allows one to pass along the memalloc state in object related allocation
> > > flags as opposed to task related flags, such as sk->sk_allocation.
> > 
> > Is this the proper name?
> > Page alloc is always "mem alloc".
> > 
> > You wrote the comment as "Use emergency reserves", and
> > this flag works by turning on ALLOC_NO_WATERMARKS.
> > 
> > Then, wouldn't __GFP_NO_WATERMARK or __GFP_EMERGENCY be better?
> 
> We've been through this 'pick a better name' thing several times :-/
> 
> Yes, I agree, __GFP_MEMALLOC is a misnomer; however, it's consistent with
> PF_MEMALLOC and __GFP_NOMEMALLOC - of which people know the semantics.
> 
> Creating a new name with similar semantics can only serve to confuse.

Ah, I understand.
Thanks.

Agreed on __GFP_MEMALLOC.

> 
> So unless enough people think it's worth renaming all of them, I think
> we're better off with this name.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS.
  2008-07-24 14:01 ` [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
@ 2008-07-25 10:46   ` KOSAKI Motohiro
  2008-07-25 10:57     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2008-07-25 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Pekka Enberg,
	Neil Brown

> GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

This comment implies turning on GFP_NOIO, but the code does s/NOFS/NOIO/. Why?



>  struct nfs_write_data *nfs_commitdata_alloc(void)
>  {
> -	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
> +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS.
  2008-07-25 10:46   ` KOSAKI Motohiro
@ 2008-07-25 10:57     ` Peter Zijlstra
  2008-07-25 11:15       ` KOSAKI Motohiro
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-25 10:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Neil Brown

On Fri, 2008-07-25 at 19:46 +0900, KOSAKI Motohiro wrote:
> > GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.
> 
> This comment implies turning on GFP_NOIO, but the code does s/NOFS/NOIO/. Why?

Does the misunderstanding stem from the use of 'enough'?

GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
just not of any filesystem data.

The problem is that previously NOFS was correct because it avoided
recursion into the NFS code; it now is not, because IO (swap) can also
lead to this recursion.

Therefore we must tighten the restriction and use NOIO.
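
(For reference, the flag nesting that makes NOIO the stricter choice; these
are roughly the stock definitions from include/linux/gfp.h of this era:

	#define GFP_NOIO	(__GFP_WAIT)
	#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
	#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)

dropping __GFP_IO as well is what keeps reclaim from writing out swap pages,
and hence from recursing back into the NFS write path.)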

> >  struct nfs_write_data *nfs_commitdata_alloc(void)
> >  {
> > -	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
> > +	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS.
  2008-07-25 10:57     ` Peter Zijlstra
@ 2008-07-25 11:15       ` KOSAKI Motohiro
  2008-07-25 11:19         ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: KOSAKI Motohiro @ 2008-07-25 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Pekka Enberg,
	Neil Brown

> On Fri, 2008-07-25 at 19:46 +0900, KOSAKI Motohiro wrote:
> > > GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.
> > 
> > This comment implies turning on GFP_NOIO, but the code does s/NOFS/NOIO/. Why?
> 
> Does the misunderstanding stem from the use of 'enough'?
> 
> GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
> just not of any filesystem data.
> 


> The problem is that previously NOFS was correct because it avoided
> recursion into the NFS code; it now is not, because IO (swap) can also
> lead to this recursion.


Thanks for the nicer explanation.
So, I hope you'll add the above three lines to the patch description.



Cheers!



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS.
  2008-07-25 11:15       ` KOSAKI Motohiro
@ 2008-07-25 11:19         ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-25 11:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg, Neil Brown

On Fri, 2008-07-25 at 20:15 +0900, KOSAKI Motohiro wrote:
> > On Fri, 2008-07-25 at 19:46 +0900, KOSAKI Motohiro wrote:
> > > > GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.
> > > 
> > > This comment implies turning on GFP_NOIO, but the code does s/NOFS/NOIO/. Why?
> > 
> > Does the misunderstanding stem from the use of 'enough'?
> > 
> > GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
> > just not of any filesystem data.
> > 
> 
> 
> > The problem is that previously NOFS was correct because it avoided
> > recursion into the NFS code; it now is not, because IO (swap) can also
> > lead to this recursion.
> 
> 
> Thanks for the nicer explanation.
> So, I hope you'll add the above three lines to the patch description.

Done, thanks!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-24 14:00 ` [PATCH 04/30] mm: slub: trivial cleanups Peter Zijlstra
@ 2008-07-28  9:43   ` Pekka Enberg
  2008-07-28 10:19     ` Peter Zijlstra
  2008-07-29 22:15   ` Pekka Enberg
  1 sibling, 1 reply; 74+ messages in thread
From: Pekka Enberg @ 2008-07-28  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, cl

Hi Peter,

Could you perhaps send this patch separately?

On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (cleanup-slub.patch)
> Some cleanups....
> 

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

Christoph?

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/slub_def.h |    7 ++++++-
>  mm/slub.c                |   40 ++++++++++++++++++----------------------
>  2 files changed, 24 insertions(+), 23 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c
> +++ linux-2.6/mm/slub.c
> @@ -27,7 +27,7 @@
>  /*
>   * Lock order:
>   *   1. slab_lock(page)
> - *   2. slab->list_lock
> + *   2. node->list_lock
>   *
>   *   The slab_lock protects operations on the object of a particular
>   *   slab and its metadata in the page struct. If the slab lock
> @@ -163,11 +163,11 @@ static struct notifier_block slab_notifi
>  #endif
>  
>  static enum {
> -	DOWN,		/* No slab functionality available */
> +	DOWN = 0,	/* No slab functionality available */
>  	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
>  	UP,		/* Everything works but does not show up in sysfs */
>  	SYSFS		/* Sysfs up */
> -} slab_state = DOWN;
> +} slab_state;
>  
>  /* A list of all slab caches on the system */
>  static DECLARE_RWSEM(slub_lock);
> @@ -288,21 +288,22 @@ static inline int slab_index(void *p, st
>  static inline struct kmem_cache_order_objects oo_make(int order,
>  						unsigned long size)
>  {
> -	struct kmem_cache_order_objects x = {
> -		(order << 16) + (PAGE_SIZE << order) / size
> -	};
> +	struct kmem_cache_order_objects x;
> +
> +	x.order = order;
> +	x.objects = (PAGE_SIZE << order) / size;
>  
>  	return x;
>  }
>  
>  static inline int oo_order(struct kmem_cache_order_objects x)
>  {
> -	return x.x >> 16;
> +	return x.order;
>  }
>  
>  static inline int oo_objects(struct kmem_cache_order_objects x)
>  {
> -	return x.x & ((1 << 16) - 1);
> +	return x.objects;
>  }
>  
>  #ifdef CONFIG_SLUB_DEBUG
> @@ -1076,8 +1077,7 @@ static struct page *allocate_slab(struct
>  
>  	flags |= s->allocflags;
>  
> -	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
> -									oo);
> +	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
>  	if (unlikely(!page)) {
>  		oo = s->min;
>  		/*
> @@ -1099,8 +1099,7 @@ static struct page *allocate_slab(struct
>  	return page;
>  }
>  
> -static void setup_object(struct kmem_cache *s, struct page *page,
> -				void *object)
> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)
>  {
>  	setup_object_debug(s, page, object);
>  	if (unlikely(s->ctor))
> @@ -1157,8 +1156,7 @@ static void __free_slab(struct kmem_cach
>  		void *p;
>  
>  		slab_pad_check(s, page);
> -		for_each_object(p, s, page_address(page),
> -						page->objects)
> +		for_each_object(p, s, page_address(page), page->objects)
>  			check_object(s, page, p, 0);
>  		__ClearPageSlubDebug(page);
>  	}
> @@ -1224,8 +1222,7 @@ static __always_inline int slab_trylock(
>  /*
>   * Management of partially allocated slabs
>   */
> -static void add_partial(struct kmem_cache_node *n,
> -				struct page *page, int tail)
> +static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
>  {
>  	spin_lock(&n->list_lock);
>  	n->nr_partial++;
> @@ -1251,8 +1248,8 @@ static void remove_partial(struct kmem_c
>   *
>   * Must hold list_lock.
>   */
> -static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> -							struct page *page)
> +static inline
> +int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
>  {
>  	if (slab_trylock(page)) {
>  		list_del(&page->lru);
> @@ -1799,11 +1796,11 @@ static int slub_nomerge;
>   * slub_max_order specifies the order where we begin to stop considering the
>   * number of objects in a slab as critical. If we reach slub_max_order then
>   * we try to keep the page order as low as possible. So we accept more waste
> - * of space in favor of a small page order.
> + * of space in favour of a small page order.
>   *
>   * Higher order allocations also allow the placement of more objects in a
>   * slab and thereby reduce object handling overhead. If the user has
> - * requested a higher mininum order then we start with that one instead of
> + * requested a higher minimum order then we start with that one instead of
>   * the smallest order which will fit the object.
>   */
>  static inline int slab_order(int size, int min_objects,
> @@ -1816,8 +1813,7 @@ static inline int slab_order(int size, i
>  	if ((PAGE_SIZE << min_order) / size > 65535)
>  		return get_order(size * 65535) - 1;
>  
> -	for (order = max(min_order,
> -				fls(min_objects * size - 1) - PAGE_SHIFT);
> +	for (order = max(min_order, fls(min_objects * size - 1) - PAGE_SHIFT);
>  			order <= max_order; order++) {
>  
>  		unsigned long slab_size = PAGE_SIZE << order;
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h
> +++ linux-2.6/include/linux/slub_def.h
> @@ -60,7 +60,12 @@ struct kmem_cache_node {
>   * given order would contain.
>   */
>  struct kmem_cache_order_objects {
> -	unsigned long x;
> +	union {
> +		u32 x;
> +		struct {
> +			u16 order, objects;
> +		};
> +	};
>  };
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
@ 2008-07-28 10:06   ` Pekka Enberg
  2008-07-28 10:17     ` Peter Zijlstra
  2008-07-28 16:49     ` Matt Mackall
  2008-08-12  6:23   ` Neil Brown
  2008-08-12  7:46   ` Neil Brown
  2 siblings, 2 replies; 74+ messages in thread
From: Pekka Enberg @ 2008-07-28 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, mpm, cl

Hi Peter,

On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> +/*
> + * alloc wrappers
> + */
> +

Hmm, I'm not sure I like the use of __kmalloc_track_caller() (even
though you do add the wrappers for SLUB). The functions really are SLAB
internals so I'd prefer to see kmalloc_reserve() moved to the
allocators.

> +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +			 struct mem_reserve *res, int *emerg)
> +{

This function could use some comments...

> +	void *obj;
> +	gfp_t gfp;
> +
> +	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> +	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +
> +	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> +		goto out;
> +
> +	if (res && !mem_reserve_kmalloc_charge(res, size)) {
> +		if (!(flags & __GFP_WAIT))
> +			goto out;
> +
> +		wait_event(res->waitqueue,
> +				mem_reserve_kmalloc_charge(res, size));
> +
> +		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +		if (obj) {
> +			mem_reserve_kmalloc_charge(res, -size);

Why do we discharge here?

> +			goto out;
> +		}

If the allocation fails, we try again (but nothing has changed, right?).
Why?

> +	}
> +
> +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> +	WARN_ON(!obj);

Why don't we discharge from the reserve here if !obj?

> +	if (emerg)
> +		*emerg |= 1;
> +
> +out:
> +	return obj;
> +}
> +
> +void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg)

I don't see 'emerg' used anywhere.

> +{
> +	size_t size = ksize(obj);
> +
> +	kfree(obj);

We're trying to get rid of kfree() so I'd say __kfree_reserve() could go to
mm/sl?b.c. Matt, thoughts?

> +	/*
> +	 * ksize gives the full allocated size vs the requested size we used to
> +	 * charge; however since we round up to the nearest power of two, this
> +	 * should all work nicely.
> +	 */
> +	mem_reserve_kmalloc_charge(res, -size);
> +}
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:06   ` Pekka Enberg
@ 2008-07-28 10:17     ` Peter Zijlstra
  2008-07-28 10:29       ` Pekka Enberg
  2008-07-28 16:49     ` Matt Mackall
  1 sibling, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-28 10:17 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, mpm, cl

On Mon, 2008-07-28 at 13:06 +0300, Pekka Enberg wrote:
> Hi Peter,
> 
> On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> > +/*
> > + * alloc wrappers
> > + */
> > +
> 
> Hmm, I'm not sure I like the use of __kmalloc_track_caller() (even
> though you do add the wrappers for SLUB). The functions really are SLAB
> internals so I'd prefer to see kmalloc_reserve() moved to the
> allocators.

See below..

> > +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > +			 struct mem_reserve *res, int *emerg)
> > +{
> 
> This function could use some comments...

Yes, my latest does have those.. let me paste the relevant bit:

+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+                        struct mem_reserve *res, int *emerg)
+{
+       void *obj;
+       gfp_t gfp;
+
+       /*
+        * Try a regular allocation, when that fails and we're not entitled
+        * to the reserves, fail.
+        */
+       gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+       obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+
+       if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+               goto out;
+
+       /*
+        * If we were given a reserve to charge against, try that.
+        */
+       if (res && !mem_reserve_kmalloc_charge(res, size)) {
+               /*
+                * If we failed to charge and we're not allowed to wait for
+                * it to succeed, bail.
+                */
+               if (!(flags & __GFP_WAIT))
+                       goto out;
+
+               /*
+                * Wait for a successful charge against the reserve. All
+                * uncharge operations against this reserve will wake us up.
+                */
+               wait_event(res->waitqueue,
+                               mem_reserve_kmalloc_charge(res, size));
+
+               /*
+                * After waiting for it, again try a regular allocation.
+                * Pressure could have lifted during our sleep. If this
+                * succeeds, uncharge the reserve.
+                */
+               obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+               if (obj) {
+                       mem_reserve_kmalloc_charge(res, -size);
+                       goto out;
+               }
+       }
+
+       /*
+        * Regular allocation failed, and we've successfully charged our
+        * requested usage against the reserve. Do the emergency allocation.
+        */
+       obj = __kmalloc_node_track_caller(size, flags, node, ip);
+       WARN_ON(!obj);
+       if (emerg)
+               *emerg |= 1;
+
+out:
+       return obj;
+}


> > +	void *obj;
> > +	gfp_t gfp;
> > +
> > +	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> > +	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > +
> > +	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> > +		goto out;
> > +
> > +	if (res && !mem_reserve_kmalloc_charge(res, size)) {
> > +		if (!(flags & __GFP_WAIT))
> > +			goto out;
> > +
> > +		wait_event(res->waitqueue,
> > +				mem_reserve_kmalloc_charge(res, size));
> > +
> > +		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > +		if (obj) {
> > +			mem_reserve_kmalloc_charge(res, -size);
> 
> Why do we discharge here?

because a regular allocation succeeded.

> > +			goto out;
> > +		}
> 
> If the allocation fails, we try again (but nothing has changed, right?).
> Why?

Note the different allocation flags for the two allocations.

> > +	}
> > +
> > +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> > +	WARN_ON(!obj);
> 
> Why don't we discharge from the reserve here if !obj?

Well, this allocation should never fail:
  - we reserved memory
  - we accounted/throttled its usage

Thus this allocation should always succeed.

> > +	if (emerg)
> > +		*emerg |= 1;
> > +
> > +out:
> > +	return obj;
> > +}
> > +
> > +void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
> 
> I don't see 'emerg' used anywhere.

Patch 19/30 has:

-       data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-                       gfp_mask, node);
+       data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+                       gfp_mask, node, &net_skb_reserve, &emergency);
        if (!data)
                goto nodata;

@@ -205,6 +211,7 @@ struct sk_buff *__alloc_skb(unsigned int
         * the tail pointer in struct sk_buff!
         */
        memset(skb, 0, offsetof(struct sk_buff, tail));
+       skb->emergency = emergency;
        skb->truesize = size + sizeof(struct sk_buff);
        atomic_set(&skb->users, 1);
        skb->head = data;


> > +{
> > +	size_t size = ksize(obj);
> > +
> > +	kfree(obj);
> 
> > We're trying to get rid of kfree() so I'd say __kfree_reserve() could go to
> > mm/sl?b.c. Matt, thoughts?

My issue with moving these helpers into mm/sl?b.c is that it would
require replicating all this code 3 times, even though the functionality
is (or should be) invariant to the actual slab implementation.

> > +	/*
> > +	 * ksize gives the full allocated size vs the requested size we used to
> > +	 * charge; however since we round up to the nearest power of two, this
> > +	 * should all work nicely.
> > +	 */
> > +	mem_reserve_kmalloc_charge(res, -size);
> > +}
> > 
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-28  9:43   ` Pekka Enberg
@ 2008-07-28 10:19     ` Peter Zijlstra
  2008-07-30 13:59       ` Christoph Lameter
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-28 10:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, cl

On Mon, 2008-07-28 at 12:43 +0300, Pekka Enberg wrote:
> Hi Peter,
> 
> Could you perhaps send this patch separately?

I think this version applies to mainline without modifications, so you
could just take it as is, but sure I can resend..

> On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> > plain text document attachment (cleanup-slub.patch)
> > Some cleanups....
> > 
> 
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> 
> Christoph?
> 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  include/linux/slub_def.h |    7 ++++++-
> >  mm/slub.c                |   40 ++++++++++++++++++----------------------
> >  2 files changed, 24 insertions(+), 23 deletions(-)
> > 
> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c
> > +++ linux-2.6/mm/slub.c
> > @@ -27,7 +27,7 @@
> >  /*
> >   * Lock order:
> >   *   1. slab_lock(page)
> > - *   2. slab->list_lock
> > + *   2. node->list_lock
> >   *
> >   *   The slab_lock protects operations on the object of a particular
> >   *   slab and its metadata in the page struct. If the slab lock
> > @@ -163,11 +163,11 @@ static struct notifier_block slab_notifi
> >  #endif
> >  
> >  static enum {
> > -	DOWN,		/* No slab functionality available */
> > +	DOWN = 0,	/* No slab functionality available */
> >  	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
> >  	UP,		/* Everything works but does not show up in sysfs */
> >  	SYSFS		/* Sysfs up */
> > -} slab_state = DOWN;
> > +} slab_state;
> >  
> >  /* A list of all slab caches on the system */
> >  static DECLARE_RWSEM(slub_lock);
> > @@ -288,21 +288,22 @@ static inline int slab_index(void *p, st
> >  static inline struct kmem_cache_order_objects oo_make(int order,
> >  						unsigned long size)
> >  {
> > -	struct kmem_cache_order_objects x = {
> > -		(order << 16) + (PAGE_SIZE << order) / size
> > -	};
> > +	struct kmem_cache_order_objects x;
> > +
> > +	x.order = order;
> > +	x.objects = (PAGE_SIZE << order) / size;
> >  
> >  	return x;
> >  }
> >  
> >  static inline int oo_order(struct kmem_cache_order_objects x)
> >  {
> > -	return x.x >> 16;
> > +	return x.order;
> >  }
> >  
> >  static inline int oo_objects(struct kmem_cache_order_objects x)
> >  {
> > -	return x.x & ((1 << 16) - 1);
> > +	return x.objects;
> >  }
> >  
> >  #ifdef CONFIG_SLUB_DEBUG
> > @@ -1076,8 +1077,7 @@ static struct page *allocate_slab(struct
> >  
> >  	flags |= s->allocflags;
> >  
> > -	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
> > -									oo);
> > +	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
> >  	if (unlikely(!page)) {
> >  		oo = s->min;
> >  		/*
> > @@ -1099,8 +1099,7 @@ static struct page *allocate_slab(struct
> >  	return page;
> >  }
> >  
> > -static void setup_object(struct kmem_cache *s, struct page *page,
> > -				void *object)
> > +static void setup_object(struct kmem_cache *s, struct page *page, void *object)
> >  {
> >  	setup_object_debug(s, page, object);
> >  	if (unlikely(s->ctor))
> > @@ -1157,8 +1156,7 @@ static void __free_slab(struct kmem_cach
> >  		void *p;
> >  
> >  		slab_pad_check(s, page);
> > -		for_each_object(p, s, page_address(page),
> > -						page->objects)
> > +		for_each_object(p, s, page_address(page), page->objects)
> >  			check_object(s, page, p, 0);
> >  		__ClearPageSlubDebug(page);
> >  	}
> > @@ -1224,8 +1222,7 @@ static __always_inline int slab_trylock(
> >  /*
> >   * Management of partially allocated slabs
> >   */
> > -static void add_partial(struct kmem_cache_node *n,
> > -				struct page *page, int tail)
> > +static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
> >  {
> >  	spin_lock(&n->list_lock);
> >  	n->nr_partial++;
> > @@ -1251,8 +1248,8 @@ static void remove_partial(struct kmem_c
> >   *
> >   * Must hold list_lock.
> >   */
> > -static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> > -							struct page *page)
> > +static inline
> > +int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page)
> >  {
> >  	if (slab_trylock(page)) {
> >  		list_del(&page->lru);
> > @@ -1799,11 +1796,11 @@ static int slub_nomerge;
> >   * slub_max_order specifies the order where we begin to stop considering the
> >   * number of objects in a slab as critical. If we reach slub_max_order then
> >   * we try to keep the page order as low as possible. So we accept more waste
> > - * of space in favor of a small page order.
> > + * of space in favour of a small page order.
> >   *
> >   * Higher order allocations also allow the placement of more objects in a
> >   * slab and thereby reduce object handling overhead. If the user has
> > - * requested a higher mininum order then we start with that one instead of
> > + * requested a higher minimum order then we start with that one instead of
> >   * the smallest order which will fit the object.
> >   */
> >  static inline int slab_order(int size, int min_objects,
> > @@ -1816,8 +1813,7 @@ static inline int slab_order(int size, i
> >  	if ((PAGE_SIZE << min_order) / size > 65535)
> >  		return get_order(size * 65535) - 1;
> >  
> > -	for (order = max(min_order,
> > -				fls(min_objects * size - 1) - PAGE_SHIFT);
> > +	for (order = max(min_order, fls(min_objects * size - 1) - PAGE_SHIFT);
> >  			order <= max_order; order++) {
> >  
> >  		unsigned long slab_size = PAGE_SIZE << order;
> > Index: linux-2.6/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/slub_def.h
> > +++ linux-2.6/include/linux/slub_def.h
> > @@ -60,7 +60,12 @@ struct kmem_cache_node {
> >   * given order would contain.
> >   */
> >  struct kmem_cache_order_objects {
> > -	unsigned long x;
> > +	union {
> > +		u32 x;
> > +		struct {
> > +			u16 order, objects;
> > +		};
> > +	};
> >  };
> >  
> >  /*
> > 
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:17     ` Peter Zijlstra
@ 2008-07-28 10:29       ` Pekka Enberg
  2008-07-28 10:39         ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Pekka Enberg @ 2008-07-28 10:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, mpm, cl

Hi Peter,

On Mon, 2008-07-28 at 12:17 +0200, Peter Zijlstra wrote:
> On Mon, 2008-07-28 at 13:06 +0300, Pekka Enberg wrote:
> > Hi Peter,
> > 
> > On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> > > +/*
> > > + * alloc wrappers
> > > + */
> > > +
> > 
> > Hmm, I'm not sure I like the use of __kmalloc_track_caller() (even
> > though you do add the wrappers for SLUB). The functions really are SLAB
> > internals so I'd prefer to see kmalloc_reserve() moved to the
> > allocators.
> 
> See below..
> 
> > > +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > > +			 struct mem_reserve *res, int *emerg)
> > > +{
> > 
> > This function could use some comments...
> 
> Yes, my latest does have those.. let me paste the relevant bit:
> 
> +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +                        struct mem_reserve *res, int *emerg)
> +{
> +       void *obj;
> +       gfp_t gfp;
> +
> +       /*
> +        * Try a regular allocation, when that fails and we're not entitled
> +        * to the reserves, fail.
> +        */
> +       gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> +       obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +
> +       if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> +               goto out;
> +
> +       /*
> +        * If we were given a reserve to charge against, try that.
> +        */
> +       if (res && !mem_reserve_kmalloc_charge(res, size)) {
> +               /*
> +                * If we failed to charge and we're not allowed to wait for
> +                * it to succeed, bail.
> +                */
> +               if (!(flags & __GFP_WAIT))
> +                       goto out;
> +
> +               /*
> +                * Wait for a successful charge against the reserve. All
> +                * uncharge operations against this reserve will wake us up.
> +                */
> +               wait_event(res->waitqueue,
> +                               mem_reserve_kmalloc_charge(res, size));
> +
> +               /*
> +                * After waiting for it, again try a regular allocation.
> +                * Pressure could have lifted during our sleep. If this
> +                * succeeds, uncharge the reserve.
> +                */
> +               obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +               if (obj) {
> +                       mem_reserve_kmalloc_charge(res, -size);
> +                       goto out;
> +               }
> +       }
> +
> +       /*
> +        * Regular allocation failed, and we've successfully charged our
> +        * requested usage against the reserve. Do the emergency allocation.
> +        */
> +       obj = __kmalloc_node_track_caller(size, flags, node, ip);
> +       WARN_ON(!obj);
> +       if (emerg)
> +               *emerg |= 1;
> +
> +out:
> +       return obj;
> +}

Heh, indeed, looks much better :-).

> 
> > > +	void *obj;
> > > +	gfp_t gfp;
> > > +
> > > +	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> > > +	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > > +
> > > +	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> > > +		goto out;
> > > +
> > > +	if (res && !mem_reserve_kmalloc_charge(res, size)) {
> > > +		if (!(flags & __GFP_WAIT))
> > > +			goto out;
> > > +
> > > +		wait_event(res->waitqueue,
> > > +				mem_reserve_kmalloc_charge(res, size));
> > > +
> > > +		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > > +		if (obj) {
> > > +			mem_reserve_kmalloc_charge(res, -size);
> > 
> > Why do we discharge here?
> 
> because a regular allocation succeeded.
> 
> > > +			goto out;
> > > +		}
> > 
> > If the allocation fails, we try again (but nothing has changed, right?).
> > Why?
> 
> Note the different allocation flags for the two allocations.

Uhm, yeah. I missed that.

> > > +	}
> > > +
> > > +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> > > +	WARN_ON(!obj);
> > 
> > Why don't we discharge from the reserve here if !obj?
> 
> Well, this allocation should never fail:
>   - we reserved memory
>   - we accounted/throttled its usage
> 
> Thus this allocation should always succeed.

But if it *does* fail, it doesn't help that we mess up the reservation
counts, no?

> > > +	if (emerg)
> > > +		*emerg |= 1;
> > > +
> > > +out:
> > > +	return obj;
> > > +}
> > > +
> > > +void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
> > 
> > I don't see 'emerg' used anywhere.
> 
> Patch 19/30 has:
> 
> -       data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
> -                       gfp_mask, node);
> +       data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
> +                       gfp_mask, node, &net_skb_reserve, &emergency);
>         if (!data)
>                 goto nodata;
> 
> @@ -205,6 +211,7 @@ struct sk_buff *__alloc_skb(unsigned int
>          * the tail pointer in struct sk_buff!
>          */
>         memset(skb, 0, offsetof(struct sk_buff, tail));
> +       skb->emergency = emergency;
>         skb->truesize = size + sizeof(struct sk_buff);
>         atomic_set(&skb->users, 1);
>         skb->head = data;
> 
> > > +{
> > > +	size_t size = ksize(obj);
> > > +
> > > +	kfree(obj);
> > 
> > > We're trying to get rid of kfree() so I'd say __kfree_reserve() could go to
> > > mm/sl?b.c. Matt, thoughts?
> 
> > My issue with moving these helpers into mm/sl?b.c is that it would
> > require replicating all this code 3 times, even though the functionality
> > is (or should be) invariant to the actual slab implementation.

Right, I guess we could just rename ksize() to something else then and
keep it internal to mm/.

> > > +	/*
> > > +	 * ksize gives the full allocated size vs the requested size we used to
> > > +	 * charge; however since we round up to the nearest power of two, this
> > > +	 * should all work nicely.
> > > +	 */
> > > +	mem_reserve_kmalloc_charge(res, -size);
> > > +}
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:29       ` Pekka Enberg
@ 2008-07-28 10:39         ` Peter Zijlstra
  2008-07-28 10:41           ` Pekka Enberg
  2008-07-28 16:59           ` Matt Mackall
  0 siblings, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-28 10:39 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, mpm, cl

On Mon, 2008-07-28 at 13:29 +0300, Pekka Enberg wrote:

> > > > +	}
> > > > +
> > > > +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> > > > +	WARN_ON(!obj);
> > > 
> > > Why don't we discharge from the reserve here if !obj?
> > 
> > Well, this allocation should never fail:
> >   - we reserved memory
> >   - we accounted/throttled its usage
> > 
> > Thus this allocation should always succeed.
> 
> But if it *does* fail, it doesn't help that we mess up the reservation
> counts, no?

I guess you're right there. Will fix. Thanks!
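
(A minimal sketch of such a fix, hypothetical rather than taken from a later
posting of the series:

	obj = __kmalloc_node_track_caller(size, flags, node, ip);
	WARN_ON(!obj);
	if (!obj) {
		/* the emergency allocation failed after we charged the
		 * reserve; undo the charge so the accounting stays sane */
		if (res)
			mem_reserve_kmalloc_charge(res, -size);
		goto out;
	}
	if (emerg)
		*emerg |= 1;

the uncharge is guarded by 'res' because the emergency path can also be
reached without a reserve having been charged.)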

> > > > +{
> > > > +	size_t size = ksize(obj);
> > > > +
> > > > +	kfree(obj);
> > > 
> > > We're trying to get rid of kfree() so I'd __kfree_reserve() could to
> > > mm/sl?b.c. Matt, thoughts?
> > 
> > My issue with moving these helpers into mm/sl?b.c is that it would
> > require replicating all this code 3 times. Even though the functionality
> > is (or should) be invariant to the actual slab implementation.
> 
> Right, I guess we could just rename ksize() to something else then and
> keep it internal to mm/.

That would be nice - we can stuff it into mm/internal.h or somesuch.

Also, you might have noticed, I still need to do everything SLOB. The
last time I rewrote all this code I was still hoping Linux would 'soon'
have a single slab allocator, but evidently we're still going with 3 for
now.. :-/

So I guess I can no longer hide behind that and will have to bite the
bullet and write the SLOB bits..

> > > > +	/*
> > > > +	 * ksize gives the full allocated size vs the requested size we used to
> > > > +	 * charge; however since we round up to the nearest power of two, this
> > > > +	 * should all work nicely.
> > > > +	 */
> > > > +	mem_reserve_kmalloc_charge(res, -size);
> > > > +}


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:39         ` Peter Zijlstra
@ 2008-07-28 10:41           ` Pekka Enberg
  2008-07-28 16:59           ` Matt Mackall
  1 sibling, 0 replies; 74+ messages in thread
From: Pekka Enberg @ 2008-07-28 10:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, mpm, cl

Hi Peter,

On Mon, 2008-07-28 at 12:39 +0200, Peter Zijlstra wrote:
> Also, you might have noticed, I still need to do everything SLOB. The
> last time I rewrote all this code I was still hoping Linux would 'soon'
> have a single slab allocator, but evidently we're still going with 3 for
> now.. :-/
> 
> So I guess I can no longer hide behind that and will have to bite the
> bullet and write the SLOB bits..

Oh, I don't expect SLOB to go away anytime soon. We are still trying to
get rid of SLAB, but there are some TPC regressions that we
don't have a reproducible test case for, so that effort has stalled a
bit.

		Pekka


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:06   ` Pekka Enberg
  2008-07-28 10:17     ` Peter Zijlstra
@ 2008-07-28 16:49     ` Matt Mackall
  2008-07-28 17:13       ` Peter Zijlstra
  1 sibling, 1 reply; 74+ messages in thread
From: Matt Mackall @ 2008-07-28 16:49 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown,
	cl


On Mon, 2008-07-28 at 13:06 +0300, Pekka Enberg wrote:
> We're trying to get rid of kfree() so I'd say __kfree_reserve() could go to
> mm/sl?b.c. Matt, thoughts?

I think you mean ksize there. My big issue is that we need to make it
clear that ksize pairs -only- with kmalloc and that
ksize(kmem_cache_alloc(...)) is a categorical error. Preferably, we do
this by giving it a distinct name, like kmalloc_size(). We can stick an
underbar in front of it to suggest you ought not be using it too.

> > +	/*
> > +	 * ksize gives the full allocated size vs the requested size we
> used to
> > +	 * charge; however since we round up to the nearest power of two,
> this
> > +	 * should all work nicely.
> > +	 */

SLOB doesn't do this, of course. But does that matter? I think you want
to charge the actual allocation size to the reserve in all cases, no?
That probably means calling ksize() on both alloc and free.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 10:39         ` Peter Zijlstra
  2008-07-28 10:41           ` Pekka Enberg
@ 2008-07-28 16:59           ` Matt Mackall
  2008-07-28 17:13             ` Peter Zijlstra
  1 sibling, 1 reply; 74+ messages in thread
From: Matt Mackall @ 2008-07-28 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown,
	cl


On Mon, 2008-07-28 at 12:39 +0200, Peter Zijlstra wrote:
> Also, you might have noticed, I still need to do everything SLOB. The
> last time I rewrote all this code I was still hoping Linux would 'soon'
> have a single slab allocator, but evidently we're still going with 3 for
> now.. :-/
>
> So I guess I can no longer hide behind that and will have to bite the
> bullet and write the SLOB bits..

I haven't seen the rest of this thread, but I presume this is part of
your OOM-avoidance for network I/O framework?

SLOB can be pretty easily expanded to handle a notion of independent
allocation arenas as there are only a couple global variables to switch
between. kfree will also return allocations to the page list (and
therefore arena) from whence they came. That may make it pretty simple
to create and prepopulate reserve pools.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 16:49     ` Matt Mackall
@ 2008-07-28 17:13       ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-28 17:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown,
	cl

On Mon, 2008-07-28 at 11:49 -0500, Matt Mackall wrote:
> On Mon, 2008-07-28 at 13:06 +0300, Pekka Enberg wrote:
> > We're trying to get rid of kfree() so I'd say __kfree_reserve() could go to
> > mm/sl?b.c. Matt, thoughts?
> 
> I think you mean ksize there. My big issue is that we need to make it
> clear that ksize pairs -only- with kmalloc and that
> ksize(kmem_cache_alloc(...)) is a categorical error. Preferably, we do
> this by giving it a distinct name, like kmalloc_size(). We can stick an
> underbar in front of it to suggest you ought not be using it too.

Right, both make sense, so _kmalloc_size() has my vote.

> > > +	/*
> > > +	 * ksize gives the full allocated size vs the requested size we
> > used to
> > > +	 * charge; however since we round up to the nearest power of two,
> > this
> > > +	 * should all work nicely.
> > > +	 */
> 
> SLOB doesn't do this, of course. But does that matter? I think you want
> to charge the actual allocation size to the reserve in all cases, no?
> That probably means calling ksize() on both alloc and free.

Like I said, I still need to do all the SLOB reservation stuff. That
includes coming up with an upper bound on the fragmentation loss.

For SL[UA]B I use roundup_power_of_two for kmalloc sizes. Thus with the
above ksize(), if we did p=kmalloc(x), then we'd account
roundup_power_of_two(x), and that should be equal to
roundup_power_of_two(ksize(p)), as ksize will always be smaller than or
equal to the roundup.
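
(Worked example of that claim, using roundup_pow_of_two() from <linux/log2.h>;
illustration only, and specific to the power-of-two kmalloc buckets of
SL[UA]B:

	void *p = kmalloc(100, GFP_KERNEL);	/* lands in the 128-byte bucket */

	roundup_pow_of_two(100);		/* 128: charged at alloc time   */
	roundup_pow_of_two(ksize(p));		/* 128: recovered at kfree time */

	kfree(p);

so charging the rounded-up request and uncharging the rounded-up ksize() stay
in balance.)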

I'm guessing the power-of-two upper bound is good for SLOB too -
although I haven't tried proving it wrong or tightening it.

Only the kmem_cache_* reservation stuff would need some extra attention
with SLOB.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-28 16:59           ` Matt Mackall
@ 2008-07-28 17:13             ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-28 17:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown,
	cl

On Mon, 2008-07-28 at 11:59 -0500, Matt Mackall wrote:
> On Mon, 2008-07-28 at 12:39 +0200, Peter Zijlstra wrote:
> > Also, you might have noticed, I still need to do everything SLOB. The
> > last time I rewrote all this code I was still hoping Linux would 'soon'
> > have a single slab allocator, but evidently we're still going with 3 for
> > now.. :-/
> >
> > So I guess I can no longer hide behind that and will have to bite the
> > bullet and write the SLOB bits..
> 
> I haven't seen the rest of this thread, but I presume this is part of
> your OOM-avoidance for network I/O framework?

Yes indeed.

> SLOB can be pretty easily expanded to handle a notion of independent
> allocation arenas as there are only a couple global variables to switch
> between. kfree will also return allocations to the page list (and
> therefore arena) from whence they came. That may make it pretty simple
> to create and prepopulate reserve pools.

Right - currently we let all the reserves sit on the free page list. The
advantage there is that it also helps the anti-frag stuff, due to having
larger free lists.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-24 14:00 ` [PATCH 04/30] mm: slub: trivial cleanups Peter Zijlstra
  2008-07-28  9:43   ` Pekka Enberg
@ 2008-07-29 22:15   ` Pekka Enberg
  1 sibling, 0 replies; 74+ messages in thread
From: Pekka Enberg @ 2008-07-29 22:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown

Peter Zijlstra wrote:
> Some cleanups....
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Applied, thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 06/30] mm: kmem_alloc_estimate()
  2008-07-24 14:00 ` [PATCH 06/30] mm: kmem_alloc_estimate() Peter Zijlstra
@ 2008-07-30 12:21   ` Pekka Enberg
  2008-07-30 13:31     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Pekka Enberg @ 2008-07-30 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, cl

Hi Peter,

On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (mm-kmem_estimate_pages.patch)
> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.
> 
> This lays the foundation for a generic reserve framework as presented in
> a later patch in this series. This framework needs to convert object demand
> (kmalloc() bytes, kmem_cache_alloc() objects) to pages.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/slab.h |    4 ++
>  mm/slab.c            |   75 +++++++++++++++++++++++++++++++++++++++++++
>  mm/slub.c            |   87 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 166 insertions(+)
> 
> Index: linux-2.6/include/linux/slab.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slab.h
> +++ linux-2.6/include/linux/slab.h
> @@ -65,6 +65,8 @@ void kmem_cache_free(struct kmem_cache *
>  unsigned int kmem_cache_size(struct kmem_cache *);
>  const char *kmem_cache_name(struct kmem_cache *);
>  int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
> +			gfp_t flags, int objects);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -99,6 +101,8 @@ int kmem_ptr_validate(struct kmem_cache 
>  void * __must_check krealloc(const void *, size_t, gfp_t);
>  void kfree(const void *);
>  size_t ksize(const void *);

Just a nitpick, but:

> +unsigned kmalloc_estimate_fixed(size_t, gfp_t, int);

kmalloc_estimate_objs()?

> +unsigned kmalloc_estimate_variable(gfp_t, size_t);

kmalloc_estimate_bytes()?

>  
>  /*
>   * Allocator specific definitions. These are mainly used to establish optimized
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c
> +++ linux-2.6/mm/slub.c
> @@ -2412,6 +2412,42 @@ const char *kmem_cache_name(struct kmem_
>  }
>  EXPORT_SYMBOL(kmem_cache_name);
>  
> +/*
> + * Calculate the upper bound of pages required to sequentially allocate
> + * @objects objects from @cachep.
> + *
> + * We should use s->min_objects because those are the least efficient.
> + */
> +unsigned kmem_alloc_estimate(struct kmem_cache *s, gfp_t flags, int objects)
> +{
> +	unsigned long pages;
> +	struct kmem_cache_order_objects x;
> +
> +	if (WARN_ON(!s) || WARN_ON(!oo_objects(s->min)))
> +		return 0;
> +
> +	x = s->min;
> +	pages = DIV_ROUND_UP(objects, oo_objects(x)) << oo_order(x);
> +
> +	/*
> +	 * Account the possible additional overhead if the slab holds more that
> +	 * one object. Use s->max_objects because that's the worst case.
> +	 */
> +	x = s->oo;
> +	if (oo_objects(x) > 1) {

Hmm, I'm not sure why slab with just one object is treated separately
here. Surely you have per-CPU slabs then as well?

> +		/*
> +		 * Account the possible additional overhead if per cpu slabs
> +		 * are currently empty and have to be allocated. This is very
> +		 * unlikely but a possible scenario immediately after
> +		 * kmem_cache_shrink.
> +		 */
> +		pages += num_online_cpus() << oo_order(x);

Isn't this problematic with CPU hotplug? Shouldn't we use
num_possible_cpus() here?

> +	}
> +
> +	return pages;
> +}
> +EXPORT_SYMBOL_GPL(kmem_alloc_estimate);
> +
>  static void list_slab_objects(struct kmem_cache *s, struct page *page,
>  							const char *text)
>  {
> @@ -2789,6 +2825,57 @@ void kfree(const void *x)
>  EXPORT_SYMBOL(kfree);
>  
>  /*
> + * Calculate the upper bound of pages required to sequentially allocate
> + * @count objects of @size bytes from kmalloc given @flags.
> + */
> +unsigned kmalloc_estimate_fixed(size_t size, gfp_t flags, int count)
> +{
> +	struct kmem_cache *s = get_slab(size, flags);
> +	if (!s)
> +		return 0;
> +
> +	return kmem_alloc_estimate(s, flags, count);
> +
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_estimate_fixed);
> +
> +/*
> + * Calculate the upper bound of pages requires to sequentially allocate @bytes
> + * from kmalloc in an unspecified number of allocations of nonuniform size.
> + */
> +unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
> +{
> +	int i;
> +	unsigned long pages;
> +
> +	/*
> +	 * multiply by two, in order to account the worst case slack space
> +	 * due to the power-of-two allocation sizes.
> +	 */
> +	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);

For bytes > PAGE_SIZE this doesn't look right (for SLUB). We do page
allocator pass-through which means that we'll be grabbing high order
pages which can be bigger than what 'pages' is here.

> +
> +	/*
> +	 * add the kmem_cache overhead of each possible kmalloc cache
> +	 */
> +	for (i = 1; i < PAGE_SHIFT; i++) {
> +		struct kmem_cache *s;
> +
> +#ifdef CONFIG_ZONE_DMA
> +		if (unlikely(flags & SLUB_DMA))
> +			s = dma_kmalloc_cache(i, flags);
> +		else
> +#endif
> +			s = &kmalloc_caches[i];
> +
> +		if (s)
> +			pages += kmem_alloc_estimate(s, flags, 0);
> +	}
> +
> +	return pages;
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_estimate_variable);
> +
> +/*
>   * kmem_cache_shrink removes empty slabs from the partial lists and sorts
>   * the remaining slabs by the number of items in use. The slabs with the
>   * most items in use come first. New allocations will then fill those up
> Index: linux-2.6/mm/slab.c
> ===================================================================
> --- linux-2.6.orig/mm/slab.c
> +++ linux-2.6/mm/slab.c
> @@ -3854,6 +3854,81 @@ const char *kmem_cache_name(struct kmem_
>  EXPORT_SYMBOL_GPL(kmem_cache_name);
>  
>  /*
> + * Calculate the upper bound of pages required to sequentially allocate
> + * @objects objects from @cachep.
> + */
> +unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
> +		gfp_t flags, int objects)
> +{
> +	/*
> +	 * (1) memory for objects,
> +	 */
> +	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
> +	unsigned nr_pages = nr_slabs << cachep->gfporder;
> +
> +	/*
> +	 * (2) memory for each per-cpu queue (nr_cpu_ids),
> +	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
> +	 * (4) some amount of memory for the slab management structures
> +	 *
> +	 * XXX: truely account these

Heh, yes please. Or add a comment why it doesn't matter.

> +	 */
> +	nr_pages += 1 + ilog2(nr_pages);
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * Calculate the upper bound of pages required to sequentially allocate
> + * @count objects of @size bytes from kmalloc given @flags.
> + */
> +unsigned kmalloc_estimate_fixed(size_t size, gfp_t flags, int count)
> +{
> +	struct kmem_cache *s = kmem_find_general_cachep(size, flags);
> +	if (!s)
> +		return 0;
> +
> +	return kmem_alloc_estimate(s, flags, count);
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_estimate_fixed);
> +
> +/*
> + * Calculate the upper bound of pages requires to sequentially allocate @bytes
> + * from kmalloc in an unspecified number of allocations of nonuniform size.
> + */
> +unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
> +{
> +	unsigned long pages;
> +	struct cache_sizes *csizep = malloc_sizes;
> +
> +	/*
> +	 * multiply by two, in order to account the worst case slack space
> +	 * due to the power-of-two allocation sizes.
> +	 */
> +	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
> +
> +	/*
> +	 * add the kmem_cache overhead of each possible kmalloc cache
> +	 */
> +	for (csizep = malloc_sizes; csizep->cs_cachep; csizep++) {
> +		struct kmem_cache *s;
> +
> +#ifdef CONFIG_ZONE_DMA
> +		if (unlikely(flags & __GFP_DMA))
> +			s = csizep->cs_dmacachep;
> +		else
> +#endif
> +			s = csizep->cs_cachep;
> +
> +		if (s)
> +			pages += kmem_alloc_estimate(s, flags, 0);
> +	}
> +
> +	return pages;
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_estimate_variable);
> +
> +/*
>   * This initializes kmem_list3 or resizes various caches for all nodes.
>   */
>  static int alloc_kmemlist(struct kmem_cache *cachep)
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 08/30] mm: serialize access to min_free_kbytes
  2008-07-24 14:00 ` [PATCH 08/30] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2008-07-30 12:36   ` Pekka Enberg
  0 siblings, 0 replies; 74+ messages in thread
From: Pekka Enberg @ 2008-07-30 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown

On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (mm-setup_per_zone_pages_min.patch)
> There is a small race between the procfs caller and the memory hotplug caller
> of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
> another caller. Time to close the gap.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Looks good to me.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

> ---
>  mm/page_alloc.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -118,6 +118,7 @@ static char * const zone_names[MAX_NR_ZO
>  	 "Movable",
>  };
>  
> +static DEFINE_SPINLOCK(min_free_lock);
>  int min_free_kbytes = 1024;
>  
>  unsigned long __meminitdata nr_kernel_pages;
> @@ -4333,12 +4334,12 @@ static void setup_per_zone_lowmem_reserv
>  }
>  
>  /**
> - * setup_per_zone_pages_min - called when min_free_kbytes changes.
> + * __setup_per_zone_pages_min - called when min_free_kbytes changes.
>   *
>   * Ensures that the pages_{min,low,high} values for each zone are set correctly
>   * with respect to min_free_kbytes.
>   */
> -void setup_per_zone_pages_min(void)
> +static void __setup_per_zone_pages_min(void)
>  {
>  	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
>  	unsigned long lowmem_pages = 0;
> @@ -4433,6 +4434,15 @@ void setup_per_zone_inactive_ratio(void)
>  	}
>  }
>  
> +void setup_per_zone_pages_min(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&min_free_lock, flags);
> +	__setup_per_zone_pages_min();
> +	spin_unlock_irqrestore(&min_free_lock, flags);
> +}
> +
>  /*
>   * Initialise min_free_kbytes.
>   *
> @@ -4468,7 +4478,7 @@ static int __init init_per_zone_pages_mi
>  		min_free_kbytes = 128;
>  	if (min_free_kbytes > 65536)
>  		min_free_kbytes = 65536;
> -	setup_per_zone_pages_min();
> +	__setup_per_zone_pages_min();
>  	setup_per_zone_lowmem_reserve();
>  	setup_per_zone_inactive_ratio();
>  	return 0;
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 06/30] mm: kmem_alloc_estimate()
  2008-07-30 12:21   ` Pekka Enberg
@ 2008-07-30 13:31     ` Peter Zijlstra
  2008-07-30 20:02       ` Christoph Lameter
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-30 13:31 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Neil Brown, cl

On Wed, 2008-07-30 at 15:21 +0300, Pekka Enberg wrote:
> Hi Peter,
> 
> On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:

> Just a nitpick, but:
> 
> > +unsigned kmalloc_estimate_fixed(size_t, gfp_t, int);
> 
> kmalloc_estimate_objs()?
> 
> > +unsigned kmalloc_estimate_variable(gfp_t, size_t);
> 
> kmalloc_estimate_bytes()?

Sounds good, I'll do some sed magic on the patch-set to make it happen.

> >  
> >  /*
> >   * Allocator specific definitions. These are mainly used to establish optimized
> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c
> > +++ linux-2.6/mm/slub.c
> > @@ -2412,6 +2412,42 @@ const char *kmem_cache_name(struct kmem_
> >  }
> >  EXPORT_SYMBOL(kmem_cache_name);
> >  
> > +/*
> > + * Calculate the upper bound of pages required to sequentially allocate
> > + * @objects objects from @cachep.
> > + *
> > + * We should use s->min_objects because those are the least efficient.
> > + */
> > +unsigned kmem_alloc_estimate(struct kmem_cache *s, gfp_t flags, int objects)
> > +{
> > +	unsigned long pages;
> > +	struct kmem_cache_order_objects x;
> > +
> > +	if (WARN_ON(!s) || WARN_ON(!oo_objects(s->min)))
> > +		return 0;
> > +
> > +	x = s->min;
> > +	pages = DIV_ROUND_UP(objects, oo_objects(x)) << oo_order(x);
> > +
> > +	/*
> > +	 * Account the possible additional overhead if the slab holds more that
> > +	 * one object. Use s->max_objects because that's the worst case.
> > +	 */
> > +	x = s->oo;
> > +	if (oo_objects(x) > 1) {
> 
> Hmm, I'm not sure why slab with just one object is treated separately
> here. Surely you have per-CPU slabs then as well?

The thought was that if the slab only contains 1 obj, then the per-cpu
slabs are always full (or empty but already there), so you don't lose
memory to other CPUs having half-filled slabs.

Say you want to reserve memory for 10 objects.

In the 1-object-per-slab case, you will always allocate a slab, no
matter which CPU you do the allocation on.

With, say, 16 objects per slab and allocations spread across 2 CPUs, you
have to allow for the per-cpu slabs to be half-filled.
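
As a worked example (illustrative numbers only): with order-0 slabs,
oo_objects(s->min) = oo_objects(s->oo) = 16 and 2 possible CPUs,
reserving 10 objects gives:

	/* 1 page to hold the 10 objects themselves */
	pages  = DIV_ROUND_UP(10, 16) << 0;	/* = 1 */
	/* plus one order-0 slab per CPU that may sit half-filled */
	pages += 2 << 0;			/* + 2, total 3 pages */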

> > +		/*
> > +		 * Account the possible additional overhead if per cpu slabs
> > +		 * are currently empty and have to be allocated. This is very
> > +		 * unlikely but a possible scenario immediately after
> > +		 * kmem_cache_shrink.
> > +		 */
> > +		pages += num_online_cpus() << oo_order(x);
> 
> Isn't this problematic with CPU hotplug? Shouldn't we use
> num_possible_cpus() here?

ACK, thanks!

> > +/*
> > + * Calculate the upper bound of pages requires to sequentially allocate @bytes
> > + * from kmalloc in an unspecified number of allocations of nonuniform size.
> > + */
> > +unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
> > +{
> > +	int i;
> > +	unsigned long pages;
> > +
> > +	/*
> > +	 * multiply by two, in order to account the worst case slack space
> > +	 * due to the power-of-two allocation sizes.
> > +	 */
> > +	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
> 
> For bytes > PAGE_SIZE this doesn't look right (for SLUB). We do page
> allocator pass-through which means that we'll be grabbing high order
> pages which can be bigger than what 'pages' is here.

Hehe - you actually made me think here.

Satisfying allocations from a bucket distribution with power-of-two
(which page alloc order satisfies) has a worst case slack space of:

S(x) = 2^n - (2^(n-1)) - 1, n = ceil(log2(x))

This can be seen for the cases where x = 2^i + 1.

If we approximate S(x) by 2^(n-1) and compute the slack ratio for any
given x:

 R(x) ~ 2^n / 2^(n-1) = 2

We see that, in the worst case, only about half of any allocated amount
x is usable; the rest is lost to slack space.

Therefore, by multiplying the demand @bytes by 2 we'll always have
enough to cover the worst case slack considering the power-of-two
allocation buckets.

For example, if @bytes asks for 4 pages + 1 byte = 16385 bytes (assuming
4k pages), then the above will request 8 pages + 2 bytes, which rounded
up to pages is 9 pages. That is enough to satisfy the order-3 allocation
needed for the 8 contiguous pages to store the requested 16385 bytes.
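
A quick user-space sanity check of that bound against the order-based
pass-through need (illustrative only: 4k pages hard-coded, and
order_pages() merely mimics 1 << get_order(x)):

	#include <assert.h>
	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)

	/* smallest power-of-two number of pages that holds x bytes */
	static unsigned long order_pages(unsigned long x)
	{
		unsigned long pages = 1;

		while ((pages << PAGE_SHIFT) < x)
			pages <<= 1;
		return pages;
	}

	int main(void)
	{
		unsigned long x;

		for (x = PAGE_SIZE + 1; x < (32UL << PAGE_SHIFT); x++) {
			/* the estimate: DIV_ROUND_UP(2 * bytes, PAGE_SIZE) */
			unsigned long est = (2 * x + PAGE_SIZE - 1) / PAGE_SIZE;

			assert(est >= order_pages(x));
		}
		printf("2*bytes covers the order-based worst case\n");
		return 0;
	}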

> > Index: linux-2.6/mm/slab.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slab.c
> > +++ linux-2.6/mm/slab.c
> > @@ -3854,6 +3854,81 @@ const char *kmem_cache_name(struct kmem_
> >  EXPORT_SYMBOL_GPL(kmem_cache_name);
> >  
> >  /*
> > + * Calculate the upper bound of pages required to sequentially allocate
> > + * @objects objects from @cachep.
> > + */
> > +unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
> > +		gfp_t flags, int objects)
> > +{
> > +	/*
> > +	 * (1) memory for objects,
> > +	 */
> > +	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
> > +	unsigned nr_pages = nr_slabs << cachep->gfporder;
> > +
> > +	/*
> > +	 * (2) memory for each per-cpu queue (nr_cpu_ids),
> > +	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
> > +	 * (4) some amount of memory for the slab management structures
> > +	 *
> > +	 * XXX: truely account these
> 
> Heh, yes please. Or add a comment why it doesn't matter.

Since you were the one I cribbed that comment from some (long) time ago,
can you advise how close the below approximation is to an upper bound
on the above factors - assuming SLAB will live long enough to make it
worth the effort?

> > +	 */
> > +	nr_pages += 1 + ilog2(nr_pages);
> > +
> > +	return nr_pages;
> > +}


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-28 10:19     ` Peter Zijlstra
@ 2008-07-30 13:59       ` Christoph Lameter
  2008-07-30 14:13         ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter @ 2008-07-30 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown

Peter Zijlstra wrote:

>> Christoph?

Sorry for the delay but this moving stuff is unending....


>>> Index: linux-2.6/mm/slub.c
>>> ===================================================================
>>> --- linux-2.6.orig/mm/slub.c
>>> +++ linux-2.6/mm/slub.c
>>> @@ -27,7 +27,7 @@
>>>  /*
>>>   * Lock order:
>>>   *   1. slab_lock(page)
>>> - *   2. slab->list_lock
>>> + *   2. node->list_lock
>>>   *
>>>   *   The slab_lock protects operations on the object of a particular
>>>   *   slab and its metadata in the page struct. If the slab lock

Hmmm..... node? Maybe use the struct name? kmem_cache_node?

>>> @@ -163,11 +163,11 @@ static struct notifier_block slab_notifi
>>>  #endif
>>>  
>>>  static enum {
>>> -	DOWN,		/* No slab functionality available */
>>> +	DOWN = 0,	/* No slab functionality available */
>>>  	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
>>>  	UP,		/* Everything works but does not show up in sysfs */
>>>  	SYSFS		/* Sysfs up */
>>> -} slab_state = DOWN;
>>> +} slab_state;
>>>  
>>>  /* A list of all slab caches on the system */
>>>  static DECLARE_RWSEM(slub_lock);

It defaults to the first enum value. We also do not initialize statics with zero.

>>> @@ -288,21 +288,22 @@ static inline int slab_index(void *p, st
>>>  static inline struct kmem_cache_order_objects oo_make(int order,
>>>  						unsigned long size)
>>>  {
>>> -	struct kmem_cache_order_objects x = {
>>> -		(order << 16) + (PAGE_SIZE << order) / size
>>> -	};
>>> +	struct kmem_cache_order_objects x;
>>> +
>>> +	x.order = order;
>>> +	x.objects = (PAGE_SIZE << order) / size;
>>>  
>>>  	return x;
>>>  }
>>>  

Another width limitation that will limit the number of objects in a slab to 64k.
Also gcc does not deal well with the 16-bit fields: it won't be able to optimize this as well and it will emit conversions for 16-bit loads.


>>> @@ -1076,8 +1077,7 @@ static struct page *allocate_slab(struct
>>>  
>>>  	flags |= s->allocflags;
>>>  
>>> -	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
>>> -									oo);
>>> +	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
>>>  	if (unlikely(!page)) {
>>>  		oo = s->min;
>>>  		/*

ok.

>>> @@ -1099,8 +1099,7 @@ static struct page *allocate_slab(struct
>>>  	return page;
>>>  }
>>>  
>>> -static void setup_object(struct kmem_cache *s, struct page *page,
>>> -				void *object)
>>> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)
>>>  {
>>>  	setup_object_debug(s, page, object);
>>>  	if (unlikely(s->ctor))

Hmmm. You are moving it back on one line and Andrew will cut it up again later? This seems to be oscillating...

>>> @@ -1799,11 +1796,11 @@ static int slub_nomerge;
>>>   * slub_max_order specifies the order where we begin to stop considering the
>>>   * number of objects in a slab as critical. If we reach slub_max_order then
>>>   * we try to keep the page order as low as possible. So we accept more waste
>>> - * of space in favor of a small page order.
>>> + * of space in favour of a small page order.
>>>   *
>>>   * Higher order allocations also allow the placement of more objects in a
>>>   * slab and thereby reduce object handling overhead. If the user has
>>> - * requested a higher mininum order then we start with that one instead of
>>> + * requested a higher minimum order then we start with that one instead of
>>>   * the smallest order which will fit the object.
>>>   */
>>>  static inline int slab_order(int size, int min_objects,

Ack.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 04/30] mm: slub: trivial cleanups
  2008-07-30 13:59       ` Christoph Lameter
@ 2008-07-30 14:13         ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-07-30 14:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown

On Wed, 2008-07-30 at 08:59 -0500, Christoph Lameter wrote:
> Peter Zijlstra wrote:
> 
> >> Christoph?
> 
> Sorry for the delay but this moving stuff is unending....
> 
> 
> >>> Index: linux-2.6/mm/slub.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/mm/slub.c
> >>> +++ linux-2.6/mm/slub.c
> >>> @@ -27,7 +27,7 @@
> >>>  /*
> >>>   * Lock order:
> >>>   *   1. slab_lock(page)
> >>> - *   2. slab->list_lock
> >>> + *   2. node->list_lock
> >>>   *
> >>>   *   The slab_lock protects operations on the object of a particular
> >>>   *   slab and its metadata in the page struct. If the slab lock
> 
> Hmmm..... node? Maybe use the struct name? kmem_cache_node?
> 
> >>> @@ -163,11 +163,11 @@ static struct notifier_block slab_notifi
> >>>  #endif
> >>>  
> >>>  static enum {
> >>> -	DOWN,		/* No slab functionality available */
> >>> +	DOWN = 0,	/* No slab functionality available */
> >>>  	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
> >>>  	UP,		/* Everything works but does not show up in sysfs */
> >>>  	SYSFS		/* Sysfs up */
> >>> -} slab_state = DOWN;
> >>> +} slab_state;
> >>>  
> >>>  /* A list of all slab caches on the system */
> >>>  static DECLARE_RWSEM(slub_lock);
> 
> It defaults to the first enum value. We also do not initialize statics with zero.

I seem to recall differently, I thought we explicitly init global stuff
to 0.

> >>> @@ -288,21 +288,22 @@ static inline int slab_index(void *p, st
> >>>  static inline struct kmem_cache_order_objects oo_make(int order,
> >>>  						unsigned long size)
> >>>  {
> >>> -	struct kmem_cache_order_objects x = {
> >>> -		(order << 16) + (PAGE_SIZE << order) / size
> >>> -	};
> >>> +	struct kmem_cache_order_objects x;
> >>> +
> >>> +	x.order = order;
> >>> +	x.objects = (PAGE_SIZE << order) / size;
> >>>  
> >>>  	return x;
> >>>  }
> >>>  
> 
> Another width limitation that will limit the number of objects in a slab to 64k.
> Also gcc does not deal well with the 16-bit fields: it won't be able to optimize this as well and it will emit conversions for 16-bit loads.

Well, it was already 16 bits by virtue of the old code.

GCC emitting rubbish code due to the union would be sad - I can respin
without this.

> >>> @@ -1076,8 +1077,7 @@ static struct page *allocate_slab(struct
> >>>  
> >>>  	flags |= s->allocflags;
> >>>  
> >>> -	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
> >>> -									oo);
> >>> +	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
> >>>  	if (unlikely(!page)) {
> >>>  		oo = s->min;
> >>>  		/*
> 
> ok.
> 
> >>> @@ -1099,8 +1099,7 @@ static struct page *allocate_slab(struct
> >>>  	return page;
> >>>  }
> >>>  
> >>> -static void setup_object(struct kmem_cache *s, struct page *page,
> >>> -				void *object)
> >>> +static void setup_object(struct kmem_cache *s, struct page *page, void *object)
> >>>  {
> >>>  	setup_object_debug(s, page, object);
> >>>  	if (unlikely(s->ctor))
> 
> Hmmm. You are moving it back on one line and Andrew will cut it up again later? This seems to be oscillating...

Gah, somehow I got convinced the result was <= 80..

> >>> @@ -1799,11 +1796,11 @@ static int slub_nomerge;
> >>>   * slub_max_order specifies the order where we begin to stop considering the
> >>>   * number of objects in a slab as critical. If we reach slub_max_order then
> >>>   * we try to keep the page order as low as possible. So we accept more waste
> >>> - * of space in favor of a small page order.
> >>> + * of space in favour of a small page order.
> >>>   *
> >>>   * Higher order allocations also allow the placement of more objects in a
> >>>   * slab and thereby reduce object handling overhead. If the user has
> >>> - * requested a higher mininum order then we start with that one instead of
> >>> + * requested a higher minimum order then we start with that one instead of
> >>>   * the smallest order which will fit the object.
> >>>   */
> >>>  static inline int slab_order(int size, int min_objects,
> 
> Ack.

OK, will send a new one without the bad bits..


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 06/30] mm: kmem_alloc_estimate()
  2008-07-30 13:31     ` Peter Zijlstra
@ 2008-07-30 20:02       ` Christoph Lameter
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Lameter @ 2008-07-30 20:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pekka Enberg, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, trond.myklebust, Daniel Lezcano, Neil Brown

Peter Zijlstra wrote:

>>> +/*
>>> + * Calculate the upper bound of pages requires to sequentially allocate @bytes
>>> + * from kmalloc in an unspecified number of allocations of nonuniform size.
>>> + */
>>> +unsigned kmalloc_estimate_variable(gfp_t flags, size_t bytes)
>>> +{
>>> +	int i;
>>> +	unsigned long pages;
>>> +
>>> +	/*
>>> +	 * multiply by two, in order to account the worst case slack space
>>> +	 * due to the power-of-two allocation sizes.
>>> +	 */
>>> +	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
>> For bytes > PAGE_SIZE this doesn't look right (for SLUB). We do page
>> allocator pass-through which means that we'll be grabbing high order
>> pages which can be bigger than what 'pages' is here.
> 
> Satisfying allocations from a bucket distribution with power-of-two
> (which page alloc order satisfies) has a worst case slack space of:
> 
> S(x) = 2^n - (2^(n-1)) - 1, n = ceil(log2(x))
> 
> This can be seen for the cases where x = 2^i + 1.

The needed bytes for a kmalloc allocation with size > PAGE_SIZE are

PAGE_SIZE << get_order(size) bytes.

See kmalloc_large().




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 02/30] mm: gfp_to_alloc_flags()
  2008-07-24 14:00 ` [PATCH 02/30] mm: gfp_to_alloc_flags() Peter Zijlstra
@ 2008-08-12  5:01   ` Neil Brown
  2008-08-12  7:33     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Neil Brown @ 2008-08-12  5:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> Factor out the gfp to alloc_flags mapping so it can be used in other places.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/internal.h   |   10 +++++
>  mm/page_alloc.c |   95 +++++++++++++++++++++++++++++++-------------------------
>  2 files changed, 64 insertions(+), 41 deletions(-)

This patch all looks "obviously correct" and a nice factorisation of
code, except the last little bit:

> @@ -1618,6 +1627,10 @@ nofail_alloc:
>  	if (!wait)
>  		goto nopage;
>  
> +	/* Avoid recursion of direct reclaim */
> +	if (p->flags & PF_MEMALLOC)
> +		goto nopage;
> +
>  	cond_resched();
>  
>  	/* We now go into synchronous reclaim */
> 
> -- 

I don't remember seeing it before (though my memory is imperfect) and
it doesn't seem to fit with the rest of the patch (except spatially).

There is a test above for PF_MEMALLOC which will result in a "goto"
somewhere else unless "in_interrupt()".
There is immediately above a test for "!wait".
So the only way this test can fire is when in_interrupt and wait.
But if that happens, then the
	might_sleep_if(wait)
at the top should have thrown a warning...  It really shouldn't happen.

So it looks like it is useless code:  there is already protection
against recursion in this case.

Did I miss something?
If I did, maybe more text in the changelog entry (or the comment)
would help.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 05/30] mm: slb: add knowledge of reserve pages
  2008-07-24 14:00 ` [PATCH 05/30] mm: slb: add knowledge of reserve pages Peter Zijlstra
@ 2008-08-12  5:35   ` Neil Brown
  2008-08-12  7:22     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Neil Brown @ 2008-08-12  5:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it. This is done to ensure reserve pages don't
> leak out and get consumed.

This looks good (we are still missing slob though, aren't we :-( )

> @@ -1526,7 +1540,7 @@ load_freelist:
>  	object = c->page->freelist;
>  	if (unlikely(!object))
>  		goto another_slab;
> -	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
> +	if (unlikely(PageSlubDebug(c->page) || c->reserve))
>  		goto debug;

This looks suspiciously like debugging code that you have left in.
Is it??

> @@ -265,7 +267,8 @@ struct array_cache {
>  	unsigned int avail;
>  	unsigned int limit;
>  	unsigned int batchcount;
> -	unsigned int touched;
> +	unsigned int touched:1,
> +		     reserve:1;

This sort of thing always worries me.
It is a per-cpu data structure so you won't get SMP races corrupting
fields.  But you do get read-modify-write in place of simple updates.
I guess it's not a problem..  But it worries me :-)


NeilBrown

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
  2008-07-28 10:06   ` Pekka Enberg
@ 2008-08-12  6:23   ` Neil Brown
  2008-08-12  8:10     ` Peter Zijlstra
  2008-08-12  7:46   ` Neil Brown
  2 siblings, 1 reply; 74+ messages in thread
From: Neil Brown @ 2008-08-12  6:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> Generic reserve management code. 
> 
> It provides methods to reserve and charge. Upon this, generic alloc/free style
> reserve pools could be build, which could fully replace mempool_t
> functionality.

This looks quite different to last time I looked at the code (I
think).

You now have a more structured "kmalloc_reserve" interface which
returns a flag to say if the allocation was from an emergency pool.  I
think this will be a distinct improvement at the call sites, though I
haven't looked at them yet. :-)

> +
> +struct mem_reserve {
> +	struct mem_reserve *parent;
> +	struct list_head children;
> +	struct list_head siblings;
> +
> +	const char *name;
> +
> +	long pages;
> +	long limit;
> +	long usage;
> +	spinlock_t lock;	/* protects limit and usage */
                                            ^^^^^
> +
> +	wait_queue_head_t waitqueue;
> +};

....
> +static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
> +{
> +	unsigned long flags;
> +
> +	for ( ; res; res = res->parent) {
> +		res->pages += pages;
> +
> +		if (limit) {
> +			spin_lock_irqsave(&res->lock, flags);
> +			res->limit += limit;
> +			spin_unlock_irqrestore(&res->lock, flags);
> +		}
> +	}
> +}

I cannot figure out why the spinlock is being used to protect updates
to 'limit'.
As far as I can see, mem_reserve_mutex already protects all those
updates.
Certainly we need the spinlock for usage, but why for limit??

> +
> +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +			 struct mem_reserve *res, int *emerg)
> +{
....
> +	if (emerg)
> +		*emerg |= 1;

Why not just

	if (emerg)
		*emerg = 1.

I can't see where '*emerg' can have any value but 0 or 1, so the '|' is
pointless ???

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 05/30] mm: slb: add knowledge of reserve pages
  2008-08-12  5:35   ` Neil Brown
@ 2008-08-12  7:22     ` Peter Zijlstra
  2008-08-12  9:35       ` Neil Brown
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-08-12  7:22 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tue, 2008-08-12 at 15:35 +1000, Neil Brown wrote:
> On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > contexts that are entitled to it. This is done to ensure reserve pages don't
> > leak out and get consumed.
> 
> This looks good (we are still missing slob though, aren't we :-( )

I actually have that now, just needs some testing..

> > @@ -1526,7 +1540,7 @@ load_freelist:
> >  	object = c->page->freelist;
> >  	if (unlikely(!object))
> >  		goto another_slab;
> > -	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
> > +	if (unlikely(PageSlubDebug(c->page) || c->reserve))
> >  		goto debug;
> 
> This looks suspiciously like debugging code that you have left in.
> Is it??

It's not; we need to force slub into the debug slow path when we have a
reserve page, otherwise we cannot do the permission check on each
allocation.

> > @@ -265,7 +267,8 @@ struct array_cache {
> >  	unsigned int avail;
> >  	unsigned int limit;
> >  	unsigned int batchcount;
> > -	unsigned int touched;
> > +	unsigned int touched:1,
> > +		     reserve:1;
> 
> This sort of thing always worries me.
> It is a per-cpu data structure so you won't get SMP races corrupting
> fields.  But you do get read-modify-write in place of simple updates.
> I guess it's not a problem..  But it worries me :-)

Right,.. do people prefer I just add another int?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 02/30] mm: gfp_to_alloc_flags()
  2008-08-12  5:01   ` Neil Brown
@ 2008-08-12  7:33     ` Peter Zijlstra
  2008-08-12  9:33       ` Neil Brown
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-08-12  7:33 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tue, 2008-08-12 at 15:01 +1000, Neil Brown wrote:
> On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > Factor out the gfp to alloc_flags mapping so it can be used in other places.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  mm/internal.h   |   10 +++++
> >  mm/page_alloc.c |   95 +++++++++++++++++++++++++++++++-------------------------
> >  2 files changed, 64 insertions(+), 41 deletions(-)
> 
> This patch all looks "obviously correct" and a nice factorisation of
> code, except the last little bit:
> 
> > @@ -1618,6 +1627,10 @@ nofail_alloc:
> >  	if (!wait)
> >  		goto nopage;
> >  
> > +	/* Avoid recursion of direct reclaim */
> > +	if (p->flags & PF_MEMALLOC)
> > +		goto nopage;
> > +
> >  	cond_resched();
> >  
> >  	/* We now go into synchronous reclaim */
> > 
> > -- 
> 
> I don't remember seeing it before (though my memory is imperfect) and
> it doesn't seem to fit with the rest of the patch (except spatially).
> 
> There is a test above for PF_MEMALLOC which will result in a "goto"
> somewhere else unless "in_interrupt()".
> There is immediately above a test for "!wait".
> So the only way this test can fire is when in_interrupt and wait.
> But if that happens, then the
> 	might_sleep_if(wait)
> at the top should have thrown a warning...  It really shouldn't happen.
> 
> So it looks like it is useless code:  there is already protection
> against recursion in this case.
> 
> Did I miss something?
> If I did, maybe more text in the changelog entry (or the comment)
> would help.

Ok, so the old code did:

  if (((p->flags & PF_MEMALLOC) || ...) && !in_interrupt) {
    ....
    goto nopage;
  }

which avoids anything that has PF_MEMALLOC set from entering direct
reclaim, right?

Now, the new code reads:

  if (alloc_flags & ALLOC_NO_WATERMARKS) {
  }

Which might be false, even though we have PF_MEMALLOC set -
__GFP_NOMEMALLOC comes to mind.

So we have to stop that recursion from happening.

so we add:

  if (p->flags & PF_MEMALLOC)
    goto nopage;

Now, if it were done before the !wait check, we'd have to consider
atomic contexts, but as those are - as you rightly pointed out - handled
by the !wait case, we can plainly do this check.
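
Putting it together, the slow path then ends up ordered roughly like
this (sketch only, not the literal patched code):

	if (alloc_flags & ALLOC_NO_WATERMARKS) {
		/* ... try the allocation without watermarks ... */
	}

	if (!wait)			/* atomic contexts bail out here */
		goto nopage;

	/* Avoid recursion of direct reclaim */
	if (p->flags & PF_MEMALLOC)
		goto nopage;

	cond_resched();

	/* We now go into synchronous reclaim */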




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
  2008-07-28 10:06   ` Pekka Enberg
  2008-08-12  6:23   ` Neil Brown
@ 2008-08-12  7:46   ` Neil Brown
  2008-08-12  8:12     ` Peter Zijlstra
  2 siblings, 1 reply; 74+ messages in thread
From: Neil Brown @ 2008-08-12  7:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> Generic reserve management code. 
> 
> It provides methods to reserve and charge. Upon this, generic alloc/free style
> reserve pools could be build, which could fully replace mempool_t
> functionality.

More comments on this patch .....

> +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +			 struct mem_reserve *res, int *emerg);
> +
> +static inline
> +void *__kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +			struct mem_reserve *res, int *emerg)
> +{
> +	void *obj;
> +
> +	obj = __kmalloc_node_track_caller(size,
> +			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node, ip);
> +	if (!obj)
> +		obj = ___kmalloc_reserve(size, flags, node, ip, res, emerg);
> +
> +	return obj;
> +}
> +
> +#define kmalloc_reserve(size, gfp, node, res, emerg) 			\
> +	__kmalloc_reserve(size, gfp, node, 				\
> +			  __builtin_return_address(0), res, emerg)
> +
.....
> +/*
> + * alloc wrappers
> + */
> +
> +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> +			 struct mem_reserve *res, int *emerg)
> +{
> +	void *obj;
> +	gfp_t gfp;
> +
> +	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> +	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +
> +	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> +		goto out;
> +
> +	if (res && !mem_reserve_kmalloc_charge(res, size)) {
> +		if (!(flags & __GFP_WAIT))
> +			goto out;
> +
> +		wait_event(res->waitqueue,
> +				mem_reserve_kmalloc_charge(res, size));
> +
> +		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> +		if (obj) {
> +			mem_reserve_kmalloc_charge(res, -size);
> +			goto out;
> +		}
> +	}
> +
> +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> +	WARN_ON(!obj);
> +	if (emerg)
> +		*emerg |= 1;
> +
> +out:
> +	return obj;
> +}

Two comments to be precise.

1/ __kmalloc_reserve attempts a __GFP_NOMEMALLOC allocation, and then
   if that fails, ___kmalloc_reserve immediately tries again.
   Is that pointless?  Should the second one be removed?

2/ mem_reserve_kmalloc_charge appears to assume that the 'mem_reserve'
   has been 'connected' and so is active.
   While callers probably only set GFP_MEMALLOC in cases where the
   mem_reserve is connected, ALLOC_NO_WATERMARKS could be set via
   PF_MEMALLOC so we could end up calling mem_reserve_kmalloc_charge
   when the mem_reserve is not connected.
   That seems to be 'odd' at least.
   It might even be 'wrong' as mem_reserve_connect doesn't add the
   usage of the child to the parent - only the ->pages and ->limit.

   What is your position on this?  Mine is "still slightly confused".

NeilBrown

  

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-08-12  6:23   ` Neil Brown
@ 2008-08-12  8:10     ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-08-12  8:10 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tue, 2008-08-12 at 16:23 +1000, Neil Brown wrote:
> On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > Generic reserve management code. 
> > 
> > It provides methods to reserve and charge. Upon this, generic alloc/free style
> > reserve pools could be build, which could fully replace mempool_t
> > functionality.
> 
> This looks quite different to last time I looked at the code (I
> think).
> 
> You now have a more structured "kmalloc_reserve" interface which
> returns a flag to say if the allocation was from an emergency pool.  I
> think this will be a distinct improvement at the call sites, though I
> haven't looked at them yet. :-)
> 
> > +
> > +struct mem_reserve {
> > +	struct mem_reserve *parent;
> > +	struct list_head children;
> > +	struct list_head siblings;
> > +
> > +	const char *name;
> > +
> > +	long pages;
> > +	long limit;
> > +	long usage;
> > +	spinlock_t lock;	/* protects limit and usage */
>                                             ^^^^^
> > +
> > +	wait_queue_head_t waitqueue;
> > +};
> 
> .....
> > +static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
> > +{
> > +	unsigned long flags;
> > +
> > +	for ( ; res; res = res->parent) {
> > +		res->pages += pages;
> > +
> > +		if (limit) {
> > +			spin_lock_irqsave(&res->lock, flags);
> > +			res->limit += limit;
> > +			spin_unlock_irqrestore(&res->lock, flags);
> > +		}
> > +	}
> > +}
> 
> I cannot figure out why the spinlock is being used to protect updates
> to 'limit'.
> As far as I can see, mem_reserve_mutex already protects all those
> updates.
> Certainly we need the spinlock for usage, but why for limit??

Against __mem_reserve_charge(); granted, the race would be minimal at
best - but it seemed better this way.

> > +
> > +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > +			 struct mem_reserve *res, int *emerg)
> > +{
> .....
> > +	if (emerg)
> > +		*emerg |= 1;
> 
> Why not just
> 
> 	if (emerg)
> 		*emerg = 1.
> 
> I can't see where '*emerg' can have any value but 0 or 1, so the '|' is
> pointless ???

Weirdness in my brain when I wrote that, I guess; shall amend!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 12/30] mm: memory reserve management
  2008-08-12  7:46   ` Neil Brown
@ 2008-08-12  8:12     ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-08-12  8:12 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tue, 2008-08-12 at 17:46 +1000, Neil Brown wrote:
> On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > Generic reserve management code. 
> > 
> > It provides methods to reserve and charge. Upon this, generic alloc/free style
> > reserve pools could be build, which could fully replace mempool_t
> > functionality.
> 
> More comments on this patch .....
> 
> > +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > +			 struct mem_reserve *res, int *emerg);
> > +
> > +static inline
> > +void *__kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > +			struct mem_reserve *res, int *emerg)
> > +{
> > +	void *obj;
> > +
> > +	obj = __kmalloc_node_track_caller(size,
> > +			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node, ip);
> > +	if (!obj)
> > +		obj = ___kmalloc_reserve(size, flags, node, ip, res, emerg);
> > +
> > +	return obj;
> > +}
> > +
> > +#define kmalloc_reserve(size, gfp, node, res, emerg) 			\
> > +	__kmalloc_reserve(size, gfp, node, 				\
> > +			  __builtin_return_address(0), res, emerg)
> > +
> ......
> > +/*
> > + * alloc wrappers
> > + */
> > +
> > +void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
> > +			 struct mem_reserve *res, int *emerg)
> > +{
> > +	void *obj;
> > +	gfp_t gfp;
> > +
> > +	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
> > +	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > +
> > +	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
> > +		goto out;
> > +
> > +	if (res && !mem_reserve_kmalloc_charge(res, size)) {
> > +		if (!(flags & __GFP_WAIT))
> > +			goto out;
> > +
> > +		wait_event(res->waitqueue,
> > +				mem_reserve_kmalloc_charge(res, size));
> > +
> > +		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
> > +		if (obj) {
> > +			mem_reserve_kmalloc_charge(res, -size);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	obj = __kmalloc_node_track_caller(size, flags, node, ip);
> > +	WARN_ON(!obj);
> > +	if (emerg)
> > +		*emerg |= 1;
> > +
> > +out:
> > +	return obj;
> > +}
> 
> Two comments to be precise.
> 
> 1/ __kmalloc_reserve attempts a __GFP_NOMEMALLOC allocation, and then
>    if that fails, ___kmalloc_reserve immediately tries again.
>    Is that pointless?  Should the second one be removed?

Pretty pointless, yes, except that it made ___kmalloc_reserve a nicer
function to read, and as it's an utter slow path I couldn't be arsed to
optimize :-)

> 2/ mem_reserve_kmalloc_charge appears to assume that the 'mem_reserve'
>    has been 'connected' and so is active.

Hmm, that would be __mem_reserve_charge() then, because the callers
don't do much.

>    While callers probably only set GFP_MEMALLOC in cases where the
>    mem_reserve is connected, ALLOC_NO_WATERMARKS could get via
>    PF_MEMALLOC so we could end up calling mem_reserve_kmalloc_charge
>    when the mem_reserve is not connected.

Right..

>    That seems to be 'odd' at least.
>    It might even be 'wrong' as mem_reserve_connect doesn't add the
>    usage of the child to the parent - only the ->pages and ->limit.
> 
>    What is your position on this?  Mine is "still slightly confused".

Uhmm,. good point. Let me ponder this while I go for breakfast ;-)




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 02/30] mm: gfp_to_alloc_flags()
  2008-08-12  7:33     ` Peter Zijlstra
@ 2008-08-12  9:33       ` Neil Brown
  0 siblings, 0 replies; 74+ messages in thread
From: Neil Brown @ 2008-08-12  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tuesday August 12, a.p.zijlstra@chello.nl wrote:
> On Tue, 2008-08-12 at 15:01 +1000, Neil Brown wrote:
> > Did I miss something?
> > If I did, maybe more text in the changelog entry (or the comment)
> > would help.
> 
> Ok, so the old code did:
> 
>   if (((p->flags & PF_MEMALLOC) || ...) && !in_interrupt) {
>     ....
>     goto nopage;
>   }
> 
> which avoids anything that has PF_MEMALLOC set from entering direct
> reclaim, right?
> 
> Now, the new code reads:
> 
>   if (alloc_flags & ALLOC_NO_WATERMARKS) {
>   }
> 
> Which might be false, even though we have PF_MEMALLOC set -
> __GFP_NOMEMALLOC comes to mind.
> 
> So we have to stop that recursion from happening.
> 
> so we add:
> 
>   if (p->flags & PF_MEMALLOC)
>     goto nopage;
> 
> Now, if it were done before the !wait check, we'd have to consider
> atomic contexts, but as those are - as you rightly pointed out - handled
> by the !wait case, we can plainly do this check.
> 
> 

Oh yes, obvious when you explain it, thanks.

cat << END >> Changelog

As the test
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
has been replaced with the slightly stronger
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {

we need to ensure we don't recurse when PF_MEMALLOC is set

END

??

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 05/30] mm: slb: add knowledge of reserve pages
  2008-08-12  7:22     ` Peter Zijlstra
@ 2008-08-12  9:35       ` Neil Brown
  2008-08-12 10:23         ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Neil Brown @ 2008-08-12  9:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tuesday August 12, a.p.zijlstra@chello.nl wrote:
> On Tue, 2008-08-12 at 15:35 +1000, Neil Brown wrote:
> > On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > > contexts that are entitled to it. This is done to ensure reserve pages don't
> > > leak out and get consumed.
> > 
> > This looks good (we are still missing slob though, aren't we :-( )
> 
> I actually have that now, just needs some testing..

Cool!

> 
> > > @@ -1526,7 +1540,7 @@ load_freelist:
> > >  	object = c->page->freelist;
> > >  	if (unlikely(!object))
> > >  		goto another_slab;
> > > -	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
> > > +	if (unlikely(PageSlubDebug(c->page) || c->reserve))
> > >  		goto debug;
> > 
> > This looks suspiciously like debugging code that you have left in.
> > Is it??
> 
> It's not; we need to force slub into the debug slow path when we have a
> reserve page, otherwise we cannot do the permission check on each
> allocation.

I see.... a little.  I'm trying to avoid understanding slub too
deeply, I don't want to use up valuable brain cell :-)
Would we be justified in changing the label from 'debug:' to
'slow_path:'  or something?  And if it is just c->reserve, should
we avoid the call to alloc_debug_processing?


Thanks,
NeilBrown

> 
> > > @@ -265,7 +267,8 @@ struct array_cache {
> > >  	unsigned int avail;
> > >  	unsigned int limit;
> > >  	unsigned int batchcount;
> > > -	unsigned int touched;
> > > +	unsigned int touched:1,
> > > +		     reserve:1;
> > 
> > This sort of thing always worries me.
> > It is a per-cpu data structure so you won't get SMP races corrupting
> > fields.  But you do get read-modify-write in place of simple updates.
> > I guess it's not a problem..  But it worries me :-)
> 
> Right,.. do people prefer I just add another int?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 05/30] mm: slb: add knowledge of reserve pages
  2008-08-12  9:35       ` Neil Brown
@ 2008-08-12 10:23         ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-08-12 10:23 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Daniel Lezcano, Pekka Enberg

On Tue, 2008-08-12 at 19:35 +1000, Neil Brown wrote:
> On Tuesday August 12, a.p.zijlstra@chello.nl wrote:
> > On Tue, 2008-08-12 at 15:35 +1000, Neil Brown wrote:
> > > On Thursday July 24, a.p.zijlstra@chello.nl wrote:
> > > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > > > contexts that are entitled to it. This is done to ensure reserve pages don't
> > > > leak out and get consumed.
> > > 
> > > This looks good (we are still missing slob though, aren't we :-( )
> > 
> > I actually have that now, just needs some testing..
> 
> Cool!
> 
> > 
> > > > @@ -1526,7 +1540,7 @@ load_freelist:
> > > >  	object = c->page->freelist;
> > > >  	if (unlikely(!object))
> > > >  		goto another_slab;
> > > > -	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
> > > > +	if (unlikely(PageSlubDebug(c->page) || c->reserve))
> > > >  		goto debug;
> > > 
> > > This looks suspiciously like debugging code that you have left in.
> > > Is it??
> > 
> > It's not; we need to force slub into the debug slow path when we have a
> > reserve page, otherwise we cannot do the permission check on each
> > allocation.
> 
> I see.... a little.  I'm trying to avoid understanding slub too
> deeply, I don't want to use up valuable brain cell :-)

:-)

> Would we be justified in changing the label from 'debug:' to
> 'slow_path:'  or something?  

Could do I guess.

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1543,7 +1543,7 @@ load_freelist:
 	if (unlikely(!object))
 		goto another_slab;
 	if (unlikely(PageSlubDebug(c->page) || c->reserve))
-		goto debug;
+		goto slow_path;
 
 	c->freelist = object[c->offset];
 	c->page->inuse = c->page->objects;
@@ -1586,11 +1586,21 @@ grow_slab:
 		goto load_freelist;
 	}
 	return NULL;
-debug:
+
+slow_path:
 	if (PageSlubDebug(c->page) &&
 			!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
+	/*
+	 * Avoid the slub fast path in slab_alloc by not setting
+	 * c->freelist and the fast path in slab_free by making
+	 * node_match() fail by setting c->node to -1.
+	 *
+	 * We use this for debug checks and reserve handling,
+	 * which needs to do permission checks on each allocation.
+	 */
+
 	c->page->inuse++;
 	c->page->freelist = object[c->offset];
 	c->node = -1;


> And if it is just c->reserve, should
> we avoid the call to alloc_debug_processing?

We already do:

	if (PageSlubDebug(c->page) &&
			!alloc_debug_processing(s, c->page, object, addr))
		goto another_slab;

since in that case PageSlubDebug() will be false.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 00/30] Swap over NFS -v18
  2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
                   ` (29 preceding siblings ...)
  2008-07-24 14:01 ` [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
@ 2008-09-30 12:41 ` Peter Zijlstra
  2008-09-30 15:46   ` Daniel Lezcano
  30 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2008-09-30 12:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, netdev, trond.myklebust,
	Daniel Lezcano, Pekka Enberg, Neil Brown

On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
> Latest version of the swap over nfs work.
> 
> Patches are against: v2.6.26-rc8-mm1
> 
> I still need to write some more comments in the reservation code.
> 
> Pekka, it uses ksize(), please have a look.
> 
> This version also deals with network namespaces.
> Two things where I could do with some suggestsion:
> 
>   - currently the sysctl code uses current->nrproxy.net_ns to obtain
>     the current network namespace
> 
>   - the ipv6 route cache code has some initialization order issues

Daniel, have you ever found time to look at my namespace issues?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 00/30] Swap over NFS -v18
  2008-09-30 12:41 ` [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
@ 2008-09-30 15:46   ` Daniel Lezcano
  0 siblings, 0 replies; 74+ messages in thread
From: Daniel Lezcano @ 2008-09-30 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg, Neil Brown

Peter Zijlstra wrote:
> On Thu, 2008-07-24 at 16:00 +0200, Peter Zijlstra wrote:
>> Latest version of the swap over nfs work.
>>
>> Patches are against: v2.6.26-rc8-mm1
>>
>> I still need to write some more comments in the reservation code.
>>
>> Pekka, it uses ksize(), please have a look.
>>
>> This version also deals with network namespaces.
>> Two things where I could do with some suggestsion:
>>
>>   - currently the sysctl code uses current->nrproxy.net_ns to obtain
>>     the current network namespace
>>
>>   - the ipv6 route cache code has some initialization order issues
> 
> Daniel, have you ever found time to look at my namespace issues?

Oops, no. I was busy and I forgot, sorry.
Let me review the sysctl vs namespace part.

Thanks
   -- Daniel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 18/30] netvm: INET reserves.
  2008-07-24 14:01 ` [PATCH 18/30] netvm: INET reserves Peter Zijlstra
@ 2008-10-01 11:38   ` Daniel Lezcano
  2008-10-01 18:56     ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Daniel Lezcano @ 2008-10-01 11:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg, Neil Brown

> Add reserves for INET.
> 
> The two big users seem to be the route cache and ip-fragment cache.
> 
> Reserve the route cache under the generic RX reserve; its usage is bounded by
> the high reclaim watermark, and thus does not need further accounting.
> 
> Reserve the ip-fragment caches under the SKB data reserve; these add to the
> SKB RX limit. By ensuring we can receive at least as much data as fits in
> the reassembly line, we avoid fragment-attack deadlocks.
> 
> Adds to the reserve tree:
> 
>   total network reserve      
>     network TX reserve       
>       protocol TX pages      
>     network RX reserve       
> +     IPv6 route cache       
> +     IPv4 route cache       
>       SKB data reserve       
> +       IPv6 fragment cache  
> +       IPv4 fragment cache  
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/net/inet_frag.h  |    7 +++
>  include/net/netns/ipv6.h |    4 ++
>  net/ipv4/inet_fragment.c |    3 +
>  net/ipv4/ip_fragment.c   |   89 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/ipv4/route.c         |   72 +++++++++++++++++++++++++++++++++++++-
>  net/ipv6/af_inet6.c      |   20 +++++++++-
>  net/ipv6/reassembly.c    |   88 +++++++++++++++++++++++++++++++++++++++++++++-
>  net/ipv6/route.c         |   66 ++++++++++++++++++++++++++++++++++
>  8 files changed, 341 insertions(+), 8 deletions(-)
> 

Sorry for the delay ...

[ cut ]

I removed a big portion of code because the remarks below apply to the 
rest of the code.

> +static int sysctl_intvec_route(struct ctl_table *table,
> +		int __user *name, int nlen,
> +		void __user *oldval, size_t __user *oldlenp,
> +		void __user *newval, size_t newlen)
> +{
> +	struct net *net = current->nsproxy->net_ns;

I think you can use container_of() and get rid of
current->nsproxy->net_ns:

	struct net *net = container_of(table->data, struct net,
				ipv6.sysctl.ip6_rt_max_size);

Another solution could be to pass the per-namespace sysctl structure pointer
directly in the table data, instead of
".data = &init_net.ipv6.sysctl.ip6_rt_max_size", when initializing the
sysctl table below. But then you have to set the right field value yourself.

> +	int write = (newval && newlen);
> +	int new_size, ret;
> +
> +	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
> +
> +	if (write)
> +		table->data = &new_size;
> +
> +	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
> +
> +	if (!ret && write) {
> +		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
> +				net->ipv6.ip6_dst_ops.kmem_cachep, new_size);
> +		if (!ret)
> +			net->ipv6.sysctl.ip6_rt_max_size = new_size;
> +	}
> +
> +	if (write)
> +		table->data = &net->ipv6.sysctl.ip6_rt_max_size;
> +
> +	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
> +
> +	return ret;
> +}

Dancing with table->data looks safe, but it is not very nice.
Isn't it possible to use a temporary table, as in the function
"ipv4_sysctl_local_port_range"?

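For illustration, a rough (untested) sketch of what the handler could look
like with both of the above applied: container_of() to recover the struct net,
and a stack copy of the ctl_table so table->data is never rewritten. The
mem_reserve_*, ip6_rt_* and sysctl_intvec() names are taken from the quoted
patch; everything else is an assumption:

	static int sysctl_intvec_route(struct ctl_table *table,
			int __user *name, int nlen,
			void __user *oldval, size_t __user *oldlenp,
			void __user *newval, size_t newlen)
	{
		struct net *net = container_of(table->data, struct net,
				ipv6.sysctl.ip6_rt_max_size);
		struct ctl_table tmp = *table;	/* private copy; .data set below */
		int new_size, ret;

		mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);

		new_size = net->ipv6.sysctl.ip6_rt_max_size;
		tmp.data = &new_size;

		ret = sysctl_intvec(&tmp, name, nlen, oldval, oldlenp,
				newval, newlen);

		if (!ret && newval && newlen) {
			ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
					net->ipv6.ip6_dst_ops.kmem_cachep, new_size);
			if (!ret)
				net->ipv6.sysctl.ip6_rt_max_size = new_size;
		}

		mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);

		return ret;
	}
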
>  ctl_table ipv6_route_table_template[] = {
>  	{
>  		.procname	=	"flush",
> @@ -2520,7 +2581,8 @@ ctl_table ipv6_route_table_template[] = 
>  		.data		=	&init_net.ipv6.sysctl.ip6_rt_max_size,
>  		.maxlen		=	sizeof(int),
>  		.mode		=	0644,
> -		.proc_handler	=	&proc_dointvec,
> +		.proc_handler	=	&proc_dointvec_route,
> +		.strategy	= 	&sysctl_intvec_route,
>  	},
>  	{
>  		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
> @@ -2608,6 +2670,8 @@ struct ctl_table *ipv6_route_sysctl_init
>  		table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
>  	}
> 
> +	mutex_init(&net->ipv6.sysctl.ip6_rt_lock);
> +
>  	return table;
>  }
>  #endif
> Index: linux-2.6/include/net/inet_frag.h
> ===================================================================
> --- linux-2.6.orig/include/net/inet_frag.h
> +++ linux-2.6/include/net/inet_frag.h
> @@ -1,6 +1,9 @@
>  #ifndef __NET_FRAG_H__
>  #define __NET_FRAG_H__
> 
> +#include <linux/reserve.h>
> +#include <linux/mutex.h>
> +
>  struct netns_frags {
>  	int			nqueues;
>  	atomic_t		mem;
> @@ -10,6 +13,10 @@ struct netns_frags {
>  	int			timeout;
>  	int			high_thresh;
>  	int			low_thresh;
> +
> +	/* reserves */
> +	struct mutex		lock;
> +	struct mem_reserve	reserve;
>  };
> 
>  struct inet_frag_queue {
> Index: linux-2.6/net/ipv4/inet_fragment.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/inet_fragment.c
> +++ linux-2.6/net/ipv4/inet_fragment.c
> @@ -19,6 +19,7 @@
>  #include <linux/random.h>
>  #include <linux/skbuff.h>
>  #include <linux/rtnetlink.h>
> +#include <linux/reserve.h>
> 
>  #include <net/inet_frag.h>
> 
> @@ -74,6 +75,8 @@ void inet_frags_init_net(struct netns_fr
>  	nf->nqueues = 0;
>  	atomic_set(&nf->mem, 0);
>  	INIT_LIST_HEAD(&nf->lru_list);
> +	mutex_init(&nf->lock);
> +	mem_reserve_init(&nf->reserve, "IP fragment cache", NULL);
>  }
>  EXPORT_SYMBOL(inet_frags_init_net);
> 
> Index: linux-2.6/include/net/netns/ipv6.h
> ===================================================================
> --- linux-2.6.orig/include/net/netns/ipv6.h
> +++ linux-2.6/include/net/netns/ipv6.h
> @@ -24,6 +24,8 @@ struct netns_sysctl_ipv6 {
>  	int ip6_rt_mtu_expires;
>  	int ip6_rt_min_advmss;
>  	int icmpv6_time;
> +
> +	struct mutex ip6_rt_lock;
>  };
> 
>  struct netns_ipv6 {
> @@ -55,5 +57,7 @@ struct netns_ipv6 {
>  	struct sock             *ndisc_sk;
>  	struct sock             *tcp_sk;
>  	struct sock             *igmp_sk;
> +
> +	struct mem_reserve	ip6_rt_reserve;
>  };
>  #endif
> Index: linux-2.6/net/ipv6/af_inet6.c
> ===================================================================
> --- linux-2.6.orig/net/ipv6/af_inet6.c
> +++ linux-2.6/net/ipv6/af_inet6.c
> @@ -851,6 +851,20 @@ static int inet6_net_init(struct net *ne
>  	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
>  	net->ipv6.sysctl.icmpv6_time = 1*HZ;
> 
> +	mem_reserve_init(&net->ipv6.ip6_rt_reserve, "IPv6 route cache",
> +			 &net_rx_reserve);
> +	/*
> +	 * XXX: requires that net->ipv6.ip6_dst_ops is already set up,
> +	 *      but AFAICT it's impossible to order the various
> +	 *      pernet_subsys calls so that this one is done after
> +	 *      ip6_route_net_init().
> +	 */

As this code seems related to the routes, is there a particular reason
not to put it at the end of the "ip6_route_net_init" function? Then you
would be sure "net->ipv6.ip6_dst_ops" is already set up, no?

> +	err = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
> +			net->ipv6.ip6_dst_ops.kmem_cachep,
> +			net->ipv6.sysctl.ip6_rt_max_size);
> +	if (err)
> +		goto reserve_fail;
> +
>  #ifdef CONFIG_PROC_FS
>  	err = udp6_proc_init(net);
>  	if (err)
> @@ -861,8 +875,8 @@ static int inet6_net_init(struct net *ne
>  	err = ac6_proc_init(net);
>  	if (err)
>  		goto proc_ac6_fail;
> -out:
>  #endif
> +out:
>  	return err;
> 
>  #ifdef CONFIG_PROC_FS
> @@ -870,8 +884,10 @@ proc_ac6_fail:
>  	tcp6_proc_exit(net);
>  proc_tcp6_fail:
>  	udp6_proc_exit(net);
> -	goto out;
>  #endif
> +reserve_fail:
> +	mem_reserve_disconnect(&net->ipv6.ip6_rt_reserve);

Idem.

> +	goto out;
>  }
> 
>  static void inet6_net_exit(struct net *net)

Isn't a "mem_reserve_disconnect" missing here? (Though it should probably
go into ip6_route_net_exit.)


I hope this review helped :)

Thanks
	--Daniel



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 18/30] netvm: INET reserves.
  2008-10-01 11:38   ` Daniel Lezcano
@ 2008-10-01 18:56     ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-10-01 18:56 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, Pekka Enberg, Neil Brown

On Wed, 2008-10-01 at 13:38 +0200, Daniel Lezcano wrote:

> I removed a big portion of code because the remarks below apply to the 
> rest of the code.
> 
> > +static int sysctl_intvec_route(struct ctl_table *table,
> > +		int __user *name, int nlen,
> > +		void __user *oldval, size_t __user *oldlenp,
> > +		void __user *newval, size_t newlen)
> > +{
> > +	struct net *net = current->nsproxy->net_ns;
> 
> I think you can use container_of() and get rid of
> current->nsproxy->net_ns:
> 
> 	struct net *net = container_of(table->data, struct net,
> 				ipv6.sysctl.ip6_rt_max_size);

D'oh - why didn't I think of that... yes very nice.


> > +	int write = (newval && newlen);
> > +	int new_size, ret;
> > +
> > +	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
> > +
> > +	if (write)
> > +		table->data = &new_size;
> > +
> > +	ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
> > +
> > +	if (!ret && write) {
> > +		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
> > +				net->ipv6.ip6_dst_ops.kmem_cachep, new_size);
> > +		if (!ret)
> > +			net->ipv6.sysctl.ip6_rt_max_size = new_size;
> > +	}
> > +
> > +	if (write)
> > +		table->data = &net->ipv6.sysctl.ip6_rt_max_size;
> > +
> > +	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
> > +
> > +	return ret;
> > +}
> 
> Dancing with table->data looks safe, but it is not very nice.
> Isn't it possible to use a temporary table, as in the function
> "ipv4_sysctl_local_port_range"?

Ah, nice solution. Thanks!

> > Index: linux-2.6/net/ipv6/af_inet6.c
> > ===================================================================
> > --- linux-2.6.orig/net/ipv6/af_inet6.c
> > +++ linux-2.6/net/ipv6/af_inet6.c
> > @@ -851,6 +851,20 @@ static int inet6_net_init(struct net *ne
> >  	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
> >  	net->ipv6.sysctl.icmpv6_time = 1*HZ;
> > 
> > +	mem_reserve_init(&net->ipv6.ip6_rt_reserve, "IPv6 route cache",
> > +			 &net_rx_reserve);
> > +	/*
> > +	 * XXX: requires that net->ipv6.ip6_dst_ops is already set up,
> > +	 *      but AFAICT it's impossible to order the various
> > +	 *      pernet_subsys calls so that this one is done after
> > +	 *      ip6_route_net_init().
> > +	 */
> 
> As this code seems related to the routes, is there a particular reason
> not to put it at the end of the "ip6_route_net_init" function? Then you
> would be sure "net->ipv6.ip6_dst_ops" is already set up, no?

Ah, the problem is that I need both dst_ops and ip6_rt_max_size set.

The former is set in ip6_route_net_init() while the latter is set in
inet6_net_init(), and both are registered as pernet_ops without a
specified order.

So where exactly do I hook in?
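
For reference, the two init hooks involved are registered as independent
pernet_operations, roughly like this (names as in 2.6.26, quoted from memory
and simplified):

	/* net/ipv6/route.c: sets up net->ipv6.ip6_dst_ops */
	static struct pernet_operations ip6_route_net_ops = {
		.init = ip6_route_net_init,
		.exit = ip6_route_net_exit,
	};

	/* net/ipv6/af_inet6.c: sets net->ipv6.sysctl.ip6_rt_max_size and,
	 * in this series, sizes the route-cache reserve */
	static struct pernet_operations inet6_net_ops = {
		.init = inet6_net_init,
		.exit = inet6_net_exit,
	};

Each is registered with register_pernet_subsys(), and per-namespace ->init
callbacks run in registration order; as the XXX comment above says, there is
no obvious way to guarantee that inet6_net_init() runs after
ip6_route_net_init().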

> > +	err = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
> > +			net->ipv6.ip6_dst_ops.kmem_cachep,
> > +			net->ipv6.sysctl.ip6_rt_max_size);
> > +	if (err)
> > +		goto reserve_fail;
> > +
> >  #ifdef CONFIG_PROC_FS
> >  	err = udp6_proc_init(net);
> >  	if (err)
> > @@ -861,8 +875,8 @@ static int inet6_net_init(struct net *ne
> >  	err = ac6_proc_init(net);
> >  	if (err)
> >  		goto proc_ac6_fail;
> > -out:
> >  #endif
> > +out:
> >  	return err;
> > 
> >  #ifdef CONFIG_PROC_FS
> > @@ -870,8 +884,10 @@ proc_ac6_fail:
> >  	tcp6_proc_exit(net);
> >  proc_tcp6_fail:
> >  	udp6_proc_exit(net);
> > -	goto out;
> >  #endif
> > +reserve_fail:
> > +	mem_reserve_disconnect(&net->ipv6.ip6_rt_reserve);
> 
> Idem.
> 
> > +	goto out;
> >  }
> > 
> >  static void inet6_net_exit(struct net *net)
> 
> Isn't a "mem_reserve_disconnect" missing here? (Though it should probably
> go into ip6_route_net_exit.)

Probably, I'll go over the exit paths once I get the init path ;-)

> I hope this review helped :)

It did, much appreciated!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs
  2008-03-20 20:10 [PATCH 00/30] Swap over NFS -v17 Peter Zijlstra
@ 2008-03-20 20:11 ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2008-03-20 20:11 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
	trond.myklebust, neilb, miklos, penberg, a.p.zijlstra

[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 835 bytes --]

Avoid memory getting stuck waiting for userspace: drop all emergency packets.
This of course requires that the regular storage route does not include an
NF_QUEUE target ;-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 net/netfilter/core.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===================================================================
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
 		ret = 1;
 		goto unlock;
 	} else if (verdict == NF_DROP) {
+drop:
 		kfree_skb(skb);
 		ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+		if (skb_emergency(skb))
+			goto drop;
 		if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
 			goto next_hook;

--
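
The skb_emergency() test used above is introduced earlier in the series (the
patch hooking skb allocation into the reserves); its definition is not quoted
in this message. Presumably it boils down to testing a per-skb flag that is
set when the skb was allocated from the emergency reserves, along these lines
(a hypothetical sketch only; the field and config names are guesses, not the
series' actual code):

	#ifdef CONFIG_NETVM
	static inline int skb_emergency(const struct sk_buff *skb)
	{
		return unlikely(skb->emergency);
	}
	#else
	static inline int skb_emergency(const struct sk_buff *skb)
	{
		return 0;
	}
	#endif

With that, the hunk above simply refuses to hand reserve-backed packets to a
userspace queue, so, as the changelog says, reserve memory cannot get stuck
waiting for userspace.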


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2008-10-01 18:57 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-24 14:00 [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
2008-07-24 14:00 ` [PATCH 01/30] swap over network documentation Peter Zijlstra
2008-07-24 14:00 ` [PATCH 02/30] mm: gfp_to_alloc_flags() Peter Zijlstra
2008-08-12  5:01   ` Neil Brown
2008-08-12  7:33     ` Peter Zijlstra
2008-08-12  9:33       ` Neil Brown
2008-07-24 14:00 ` [PATCH 03/30] mm: tag reseve pages Peter Zijlstra
2008-07-24 14:00 ` [PATCH 04/30] mm: slub: trivial cleanups Peter Zijlstra
2008-07-28  9:43   ` Pekka Enberg
2008-07-28 10:19     ` Peter Zijlstra
2008-07-30 13:59       ` Christoph Lameter
2008-07-30 14:13         ` Peter Zijlstra
2008-07-29 22:15   ` Pekka Enberg
2008-07-24 14:00 ` [PATCH 05/30] mm: slb: add knowledge of reserve pages Peter Zijlstra
2008-08-12  5:35   ` Neil Brown
2008-08-12  7:22     ` Peter Zijlstra
2008-08-12  9:35       ` Neil Brown
2008-08-12 10:23         ` Peter Zijlstra
2008-07-24 14:00 ` [PATCH 06/30] mm: kmem_alloc_estimate() Peter Zijlstra
2008-07-30 12:21   ` Pekka Enberg
2008-07-30 13:31     ` Peter Zijlstra
2008-07-30 20:02       ` Christoph Lameter
2008-07-24 14:00 ` [PATCH 07/30] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2008-07-24 14:00 ` [PATCH 08/30] mm: serialize access to min_free_kbytes Peter Zijlstra
2008-07-30 12:36   ` Pekka Enberg
2008-07-24 14:00 ` [PATCH 09/30] mm: emergency pool Peter Zijlstra
2008-07-24 14:00 ` [PATCH 10/30] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
2008-07-24 14:00 ` [PATCH 11/30] mm: __GFP_MEMALLOC Peter Zijlstra
2008-07-25  9:29   ` KOSAKI Motohiro
2008-07-25  9:35     ` Peter Zijlstra
2008-07-25  9:39       ` KOSAKI Motohiro
2008-07-24 14:00 ` [PATCH 12/30] mm: memory reserve management Peter Zijlstra
2008-07-28 10:06   ` Pekka Enberg
2008-07-28 10:17     ` Peter Zijlstra
2008-07-28 10:29       ` Pekka Enberg
2008-07-28 10:39         ` Peter Zijlstra
2008-07-28 10:41           ` Pekka Enberg
2008-07-28 16:59           ` Matt Mackall
2008-07-28 17:13             ` Peter Zijlstra
2008-07-28 16:49     ` Matt Mackall
2008-07-28 17:13       ` Peter Zijlstra
2008-08-12  6:23   ` Neil Brown
2008-08-12  8:10     ` Peter Zijlstra
2008-08-12  7:46   ` Neil Brown
2008-08-12  8:12     ` Peter Zijlstra
2008-07-24 14:00 ` [PATCH 13/30] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2008-07-24 14:00 ` [PATCH 14/30] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2008-07-24 14:00 ` [PATCH 15/30] net: packet split receive api Peter Zijlstra
2008-07-24 14:00 ` [PATCH 16/30] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2008-07-24 14:00 ` [PATCH 17/30] netvm: network reserve infrastructure Peter Zijlstra
2008-07-24 14:01 ` [PATCH 18/30] netvm: INET reserves Peter Zijlstra
2008-10-01 11:38   ` Daniel Lezcano
2008-10-01 18:56     ` Peter Zijlstra
2008-07-24 14:01 ` [PATCH 19/30] netvm: hook skb allocation to reserves Peter Zijlstra
2008-07-24 14:01 ` [PATCH 20/30] netvm: filter emergency skbs Peter Zijlstra
2008-07-24 14:01 ` [PATCH 21/30] netvm: prevent a stream specific deadlock Peter Zijlstra
2008-07-24 14:01 ` [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
2008-07-24 14:01 ` [PATCH 23/30] netvm: skb processing Peter Zijlstra
2008-07-24 14:01 ` [PATCH 24/30] mm: add support for non block device backed swap files Peter Zijlstra
2008-07-24 14:01 ` [PATCH 25/30] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2008-07-24 14:01 ` [PATCH 26/30] nfs: remove mempools Peter Zijlstra
2008-07-24 14:46   ` Nick Piggin
2008-07-24 14:53     ` Peter Zijlstra
2008-07-24 14:01 ` [PATCH 27/30] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2008-07-24 14:01 ` [PATCH 28/30] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2008-07-24 14:01 ` [PATCH 29/30] nfs: enable swap on NFS Peter Zijlstra
2008-07-24 14:01 ` [PATCH 30/30] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2008-07-25 10:46   ` KOSAKI Motohiro
2008-07-25 10:57     ` Peter Zijlstra
2008-07-25 11:15       ` KOSAKI Motohiro
2008-07-25 11:19         ` Peter Zijlstra
2008-09-30 12:41 ` [PATCH 00/30] Swap over NFS -v18 Peter Zijlstra
2008-09-30 15:46   ` Daniel Lezcano
  -- strict thread matches above, loose matches on Subject: below --
2008-03-20 20:10 [PATCH 00/30] Swap over NFS -v17 Peter Zijlstra
2008-03-20 20:11 ` [PATCH 22/30] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).