From: Mel Gorman <mgorman@suse.de>
To: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux-Netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	David Miller <davem@davemloft.net>, Neil Brown <neilb@suse.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Pekka Enberg <penberg@cs.helsinki.fi>
Subject: Re: [PATCH 02/15] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
Date: Thu, 9 Feb 2012 12:50:18 +0000
Message-ID: <20120209125018.GN5938@suse.de>
In-Reply-To: <alpine.DEB.2.00.1202081557540.5970@router.home>

On Wed, Feb 08, 2012 at 04:13:15PM -0600, Christoph Lameter wrote:
> On Wed, 8 Feb 2012, Mel Gorman wrote:
> 
> > On Wed, Feb 08, 2012 at 01:49:05PM -0600, Christoph Lameter wrote:
> > > On Wed, 8 Feb 2012, Mel Gorman wrote:
> > >
> > > > Ok, I looked into what is necessary to replace these with checking a page
> > > > flag and the cost shifts quite a bit and ends up being more expensive.
> > >
> > > That is only true if you go the slab route.
> >
> > Well, yes but both slab and slub have to be supported. I see no reason
> > why I would choose to make this a slab-only or slub-only feature. Slob is
> > not supported because it's not expected that a platform using slob is also
> > going to use network-based swap.
> 
> I think so far the patches in particular to slab.c are pretty significant
> in impact.
> 

Ok, I am working on a solution that does not affect any of the existing
slab structures. Between that and the fact that we check whether there are
any memalloc_socks after patch 12, the impact for normal systems is an
additional branch in ac_get_obj() and ac_put_obj().
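
Roughly, the fast path would end up looking something like this (a
sketch only; sk_memalloc_socks() stands in for whatever helper the
memalloc_socks check from patch 12 ends up using):

static inline void *ac_get_obj(struct kmem_cache *cachep,
			struct array_cache *ac, gfp_t flags, bool force_refill)
{
	void *objp;

	/* Single branch; false unless network-based swap is configured */
	if (unlikely(sk_memalloc_socks()))
		objp = __ac_get_obj(cachep, ac, flags, force_refill);
	else
		objp = ac->entry[--ac->avail];

	return objp;
}

ac_put_obj() would get the equivalent treatment on the free side.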

> > > Slab suffers from not having
> > > the page struct pointer readily available. The changes are likely already
> > > impacting slab performance without the virt_to_page patch.
> > >
> >
> > The performance impact only comes into play when swap is on a network
> > device and pfmemalloc reserves are in use. The rest of the time the check
> > on ac avoids all the cost and there is a micro-optimisation later to avoid
> > calling a function (patch 12).
> 
> We have been down this road too many times. Logic is added to critical
> paths and memory structures grow. This is not free. And for NBD swap
> support? Pretty exotic use case.
> 

NFS support is the real target. NBD is the logical starting point and
NFS needs the same support.

> > Ok, are you asking that I use the page flag for slub and leave kmem_cache_cpu
> > alone in the slub case? I can certainly check it out if that's what you
> > are asking for.
> 
> No I am not asking for something. Still thinking about the best way to
> address the issues. I think we can easily come up with a minimally
> invasive patch for slub. Not sure about slab at this point. I think we
> could avoid most of the new fields but this requires some tinkering. I
> have a day @ home tomorrow which hopefully gives me a chance to
> put some focus on this issue.
>

I think we can avoid adding any additional fields, but the per-array_cache
flag has to be replaced with a new read-mostly global. Also, when network
storage is in use and the system is under memory pressure, it may be slower
because we lose granularity on which slabs are using pfmemalloc pages. That
is an acceptable compromise as it moves the cost to users of network-based
swap instead of normal usage.
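
The global itself is nothing more than a single flag, along these lines
(a sketch; the draft below does not carry the __read_mostly annotation
yet):

/*
 * Set when any slab page has been allocated from pfmemalloc reserves
 * for network-based swap. Written rarely, checked on the alloc/free
 * paths, hence __read_mostly.
 */
static bool pfmemalloc_active __read_mostly;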
 
> > I did come up with a way: the necessary information is in ac and slabp
> > on slab :/ . There are not exactly many ways that the information can
> > be recorded.
> 
> Wish we had something that would not involve increasing the number of
> fields in these slab structures.
> 

This is what I currently have. It's untested but builds. It reverts the
structures back to the way they were and uses page flags instead.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..f96fa87 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -432,6 +432,28 @@ static inline int PageTransCompound(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 1d9ae40..a32bcfd 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -46,7 +46,6 @@ struct kmem_cache_cpu {
 	struct page *page;	/* The slab from which we are allocating */
 	struct page *partial;	/* Partially allocated frozen slabs */
 	int node;		/* The node of the page (or -1 for debug) */
-	bool pfmemalloc;	/* Slab page had pfmemalloc set */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slab.c b/mm/slab.c
index 268cd96..783a92e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -155,6 +155,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -233,7 +239,6 @@ struct slab {
 			unsigned int inuse;	/* num of objs active in slab */
 			kmem_bufctl_t free;
 			unsigned short nodeid;
-			bool pfmemalloc;	/* Slab had pfmemalloc set */
 		};
 		struct slab_rcu __slab_cover_slab_rcu;
 	};
@@ -255,8 +260,7 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	bool touched;
-	bool pfmemalloc;
+	unsigned int touched;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
@@ -978,6 +982,13 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
-/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
+/* Clears pfmemalloc_active if no slabs have pfmemalloc set */
 static void check_ac_pfmemalloc(struct kmem_cache *cachep,
 						struct array_cache *ac)
@@ -985,22 +996,22 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
 	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
 	struct slab *slabp;
 
-	if (!ac->pfmemalloc)
+	if (!pfmemalloc_active)
 		return;
 
 	list_for_each_entry(slabp, &l3->slabs_full, list)
-		if (slabp->pfmemalloc)
+		if (is_slab_pfmemalloc(slabp))
 			return;
 
 	list_for_each_entry(slabp, &l3->slabs_partial, list)
-		if (slabp->pfmemalloc)
+		if (is_slab_pfmemalloc(slabp))
 			return;
 
 	list_for_each_entry(slabp, &l3->slabs_free, list)
-		if (slabp->pfmemalloc)
+		if (is_slab_pfmemalloc(slabp))
 			return;
 
-	ac->pfmemalloc = false;
+	pfmemalloc_active = false;
 }
 
 static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
@@ -1036,7 +1047,7 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 		l3 = cachep->nodelists[numa_mem_id()];
 		if (!list_empty(&l3->slabs_free) && force_refill) {
 			struct slab *slabp = virt_to_slab(objp);
-			slabp->pfmemalloc = false;
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
 			clear_obj_pfmemalloc(&objp);
 			check_ac_pfmemalloc(cachep, ac);
 			return objp;
@@ -1066,13 +1077,10 @@ static inline void *ac_get_obj(struct kmem_cache *cachep,
 static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 								void *objp)
 {
-	struct slab *slabp;
-
-	/* If there are pfmemalloc slabs, check if the object is part of one */
-	if (unlikely(ac->pfmemalloc)) {
-		slabp = virt_to_slab(objp);
-
-		if (slabp->pfmemalloc)
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
 			set_obj_pfmemalloc(&objp);
 	}
 
@@ -1906,9 +1914,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (*pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -2888,7 +2900,6 @@ static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
 	slabp->s_mem = objp + colour_off;
 	slabp->nodeid = nodeid;
 	slabp->free = 0;
-	slabp->pfmemalloc = false;
 	return slabp;
 }
 
@@ -3075,11 +3086,8 @@ static int cache_grow(struct kmem_cache *cachep,
 		goto opps1;
 
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
-	if (pfmemalloc) {
-		struct array_cache *ac = cpu_cache_get(cachep);
-		slabp->pfmemalloc = true;
-		ac->pfmemalloc = true;
-	}
+	if (unlikely(pfmemalloc))
+		pfmemalloc_active = pfmemalloc;
 
 	slab_map_pages(cachep, slabp, objp);
 
diff --git a/mm/slub.c b/mm/slub.c
index ea04994..f9b0f35 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2126,7 +2126,8 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 		stat(s, ALLOC_SLAB);
 		c->node = page_to_nid(page);
 		c->page = page;
-		c->pfmemalloc = pfmemalloc;
+		if (pfmemalloc)
+			SetPageSlabPfmemalloc(page);
 		*pc = c;
 	} else
 		object = NULL;
@@ -2136,7 +2137,7 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 
 static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
 {
-	if (unlikely(c->pfmemalloc))
+	if (unlikely(PageSlabPfmemalloc(c->page)))
 		return gfp_pfmemalloc_allowed(gfpflags);
 
 	return true;
