* [PATCH 0/5] make slab gfp fair
@ 2007-05-14 13:19 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
In the interest of creating a reserve-based allocator, we need to make the slab
allocator (*sigh*, all three) fair with respect to GFP flags.
That is, we need to protect memory from being used by allocations with easier gfp
flags than those it was allocated under. If our reserve is placed below GFP_ATOMIC,
we do not want a GFP_KERNEL allocation to walk away with it - a scenario that is
perfectly possible with the current allocators.
* [PATCH 1/5] mm: page allocation rank
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 13:19 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-page_alloc-rank.patch --]
[-- Type: text/plain, Size: 7868 bytes --]
Introduce page allocation rank.
This allocation rank is a measure of the 'hardness' of the page allocation,
where hardness refers to how deep we had to reach (and thereby whether reclaim
was activated) to obtain the page.
It is basically a mapping from the ALLOC_/gfp flags onto a scalar quantity,
which allows for comparisons of the kind:
'would this allocation have succeeded using these gfp flags?'.
For the gfp -> alloc_flags mapping we use the 'hardest' possible flags, those
used by __alloc_pages() right before going into direct reclaim.
The alloc_flags -> rank mapping is given by: 2*2^wmark - harder - 2*high
where wmark = { min = 1, low, high } and harder, high are booleans.
This gives:
0 is the hardest possible allocation - ALLOC_NO_WATERMARKS,
1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
...
15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
16 is the softest allocation - ALLOC_WMARK_HIGH.
Rank <= 4 will have woken up kswapd, and when also > 0 it might have run into
direct reclaim.
Rank > 8 rarely happens and means lots of memory is free (due to a parallel oom kill).
The allocation rank is stored in page->index for successful allocations.
'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.
The purpose of this measure is to introduce some fairness into the slab
allocator.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
---
mm/internal.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 58 +++++++++++++---------------------------------
2 files changed, 87 insertions(+), 41 deletions(-)
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h 2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/internal.h 2007-02-22 14:08:41.000000000 +0100
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H
#include <linux/mm.h>
+#include <linux/hardirq.h>
static inline void set_page_count(struct page *page, int v)
{
@@ -37,4 +38,73 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static inline int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK 16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+ int rank;
+
+ if (alloc_flags & ALLOC_NO_WATERMARKS)
+ return 0;
+
+ rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+ rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+ return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+ return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+}
+
#endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c 2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c 2007-02-22 14:08:41.000000000 +0100
@@ -892,14 +892,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1190,6 +1182,7 @@ zonelist_scan:
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
+ page->index = alloc_flags_to_rank(alloc_flags);
break;
this_zone_full:
if (NUMA_BUILD)
@@ -1263,48 +1256,27 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ if (page)
+ goto got_pg;
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1313,6 +1285,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
--
* [PATCH 2/5] mm: slab allocation fairness
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 13:19 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-slab-ranking.patch --]
[-- Type: text/plain, Size: 13560 bytes --]
The slab allocator has some unfairness wrt gfp flags: when the slab cache is
grown, the gfp flags are used to allocate more memory; however, when there is
slab cache available (in partial or free slabs, per-cpu caches or otherwise),
the gfp flags are ignored.
Thus it is possible for less critical slab allocations to succeed and gobble
up precious memory while under memory pressure.
This patch solves that by using the newly introduced page allocation rank.
Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARKS) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).
When the slab space is grown, the rank of the page allocation is stored. For
each slab allocation we test the given gfp flags against this rank, thereby
asking the question: would these flags have allowed the slab to grow?
If not, we need to test the current situation. This is done by forcing growth
of the slab space (just testing the free page watermarks will not work because
of direct reclaim). If that fails, we fail the slab allocation.
Thus if we grew the slab under great duress while PF_MEMALLOC was set and we
really did dip into the memalloc reserve, the rank would be set to 0. If the next
allocation to that slab were GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and is always > 0), we'd want to make sure that memory pressure
has decreased enough to allow an allocation with the given gfp flags.
So in this case we try to force-grow the slab cache, and on failure we fail the
slab allocation, thus preserving the available slab cache for more pressing
allocations.
If this newly allocated slab is trimmed on the next kmem_cache_free
(not unlikely), that is no problem, since 1) it frees memory and 2) the
sole purpose of the allocation was to probe the allocation rank; we didn't
need the space itself.
[netperf results]
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 (127.0.0.1) port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
v2.6.21
114688 114688 4096 60.00 5424.17
114688 114688 4096 60.00 5486.54
114688 114688 4096 60.00 6460.20
114688 114688 4096 60.00 6457.82
114688 114688 4096 60.00 6468.99
114688 114688 4096 60.00 6326.81
114688 114688 4096 60.00 6476.74
114688 114688 4096 60.00 6473.61
6196.86 460.58
v2.6.21-slab CONFIG_SLAB_FAIR=n
114688 114688 4096 60.00 6987.93
114688 114688 4096 60.00 6214.82
114688 114688 4096 60.00 5539.82
114688 114688 4096 60.00 5597.57
114688 114688 4096 60.00 6192.22
114688 114688 4096 60.00 6306.76
114688 114688 4096 60.00 5492.49
114688 114688 4096 60.00 5607.24
5992.36 526.21
v2.6.21-slab CONFIG_SLAB_FAIR=y
114688 114688 4096 60.00 5475.34
114688 114688 4096 60.00 6464.61
114688 114688 4096 60.00 6457.15
114688 114688 4096 60.00 6465.70
114688 114688 4096 60.00 6404.30
114688 114688 4096 60.00 6474.61
114688 114688 4096 60.00 6461.68
114688 114688 4096 60.00 6453.63
6332.13 346.86
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
---
mm/Kconfig | 4 +++
mm/internal.h | 8 ++++++
mm/slab.c | 70 +++++++++++++++++++++++++++++++++++-----------------------
3 files changed, 55 insertions(+), 27 deletions(-)
Index: linux-2.6-git/mm/slab.c
===================================================================
--- linux-2.6-git.orig/mm/slab.c
+++ linux-2.6-git/mm/slab.c
@@ -114,6 +114,7 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
+#include "internal.h"
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
@@ -380,6 +381,7 @@ static void kmem_list3_init(struct kmem_
struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
+ int rank;
struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
unsigned int batchcount;
@@ -1030,21 +1032,21 @@ static inline int cache_free_alien(struc
}
static inline void *alternate_node_alloc(struct kmem_cache *cachep,
- gfp_t flags)
+ gfp_t flags, int rank)
{
return NULL;
}
static inline void *____cache_alloc_node(struct kmem_cache *cachep,
- gfp_t flags, int nodeid)
+ gfp_t flags, int nodeid, int rank)
{
return NULL;
}
#else /* CONFIG_NUMA */
-static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
-static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
+static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int);
static struct array_cache **alloc_alien_cache(int node, int limit)
{
@@ -1648,6 +1650,7 @@ static void *kmem_getpages(struct kmem_c
if (!page)
return NULL;
+ cachep->rank = page->index;
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
add_zone_page_state(page_zone(page),
@@ -2292,6 +2295,7 @@ kmem_cache_create (const char *name, siz
}
#endif
#endif
+ cachep->rank = MAX_ALLOC_RANK;
/*
* Determine if the slab management is 'on' or 'off' slab.
@@ -2941,7 +2945,7 @@ bad:
#define check_slabp(x,y) do { } while(0)
#endif
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int rank)
{
int batchcount;
struct kmem_list3 *l3;
@@ -2953,6 +2957,8 @@ static void *cache_alloc_refill(struct k
check_irq_off();
ac = cpu_cache_get(cachep);
retry:
+ if (slab_insufficient_rank(cachep, rank))
+ goto force_grow;
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
/*
@@ -3016,14 +3022,17 @@ must_grow:
l3->free_objects -= ac->avail;
alloc_done:
spin_unlock(&l3->list_lock);
-
if (unlikely(!ac->avail)) {
int x;
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
- if (!x && ac->avail == 0) /* no objects in sight? abort */
+
+ /* no objects in sight? abort */
+ if (!x && (ac->avail == 0 ||
+ slab_insufficient_rank(cachep, rank)))
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
@@ -3174,7 +3183,8 @@ static inline int should_failslab(struct
#endif /* CONFIG_FAILSLAB */
-static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+static inline void *____cache_alloc(struct kmem_cache *cachep,
+ gfp_t flags, int rank)
{
void *objp;
struct array_cache *ac;
@@ -3182,13 +3192,13 @@ static inline void *____cache_alloc(stru
check_irq_off();
ac = cpu_cache_get(cachep);
- if (likely(ac->avail)) {
+ if (likely(ac->avail) && !slab_insufficient_rank(cachep, rank)) {
STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
objp = ac->entry[--ac->avail];
} else {
STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ objp = cache_alloc_refill(cachep, flags, rank);
}
return objp;
}
@@ -3200,7 +3210,8 @@ static inline void *____cache_alloc(stru
* If we are in_interrupt, then process context, including cpusets and
* mempolicy, may not apply and should not be used for allocation policy.
*/
-static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+static void *alternate_node_alloc(struct kmem_cache *cachep,
+ gfp_t flags, int rank)
{
int nid_alloc, nid_here;
@@ -3212,7 +3223,7 @@ static void *alternate_node_alloc(struct
else if (current->mempolicy)
nid_alloc = slab_node(current->mempolicy);
if (nid_alloc != nid_here)
- return ____cache_alloc_node(cachep, flags, nid_alloc);
+ return ____cache_alloc_node(cachep, flags, nid_alloc, rank);
return NULL;
}
@@ -3224,7 +3235,7 @@ static void *alternate_node_alloc(struct
* allocator to do its reclaim / fallback magic. We then insert the
* slab into the proper nodelist and then allocate from it.
*/
-static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
{
struct zonelist *zonelist;
gfp_t local_flags;
@@ -3251,7 +3262,7 @@ retry:
cache->nodelists[nid] &&
cache->nodelists[nid]->free_objects)
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ flags | GFP_THISNODE, nid, rank);
}
if (!obj) {
@@ -3274,7 +3285,7 @@ retry:
nid = page_to_nid(virt_to_page(obj));
if (cache_grow(cache, flags, nid, obj)) {
obj = ____cache_alloc_node(cache,
- flags | GFP_THISNODE, nid);
+ flags | GFP_THISNODE, nid, rank);
if (!obj)
/*
* Another processor may allocate the
@@ -3295,7 +3306,7 @@ retry:
* A interface to enable slab creation on nodeid
*/
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
- int nodeid)
+ int nodeid, int rank)
{
struct list_head *entry;
struct slab *slabp;
@@ -3308,6 +3319,8 @@ static void *____cache_alloc_node(struct
retry:
check_irq_off();
+ if (slab_insufficient_rank(cachep, rank))
+ goto force_grow;
spin_lock(&l3->list_lock);
entry = l3->slabs_partial.next;
if (entry == &l3->slabs_partial) {
@@ -3343,11 +3356,12 @@ retry:
must_grow:
spin_unlock(&l3->list_lock);
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
if (x)
goto retry;
- return fallback_alloc(cachep, flags);
+ return fallback_alloc(cachep, flags, rank);
done:
return obj;
@@ -3371,6 +3385,7 @@ __cache_alloc_node(struct kmem_cache *ca
{
unsigned long save_flags;
void *ptr;
+ int rank = slab_alloc_rank(flags);
if (should_failslab(cachep, flags))
return NULL;
@@ -3383,7 +3398,7 @@ __cache_alloc_node(struct kmem_cache *ca
if (unlikely(!cachep->nodelists[nodeid])) {
/* Node not bootstrapped yet */
- ptr = fallback_alloc(cachep, flags);
+ ptr = fallback_alloc(cachep, flags, rank);
goto out;
}
@@ -3394,12 +3409,12 @@ __cache_alloc_node(struct kmem_cache *ca
* to other nodes. It may fail while we still have
* objects on other nodes available.
*/
- ptr = ____cache_alloc(cachep, flags);
+ ptr = ____cache_alloc(cachep, flags, rank);
if (ptr)
goto out;
}
/* ___cache_alloc_node can fall back to other nodes */
- ptr = ____cache_alloc_node(cachep, flags, nodeid);
+ ptr = ____cache_alloc_node(cachep, flags, nodeid, rank);
out:
local_irq_restore(save_flags);
ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
@@ -3408,23 +3423,23 @@ __cache_alloc_node(struct kmem_cache *ca
}
static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cache, gfp_t flags, int rank)
{
void *objp;
if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
- objp = alternate_node_alloc(cache, flags);
+ objp = alternate_node_alloc(cache, flags, rank);
if (objp)
goto out;
}
- objp = ____cache_alloc(cache, flags);
+ objp = ____cache_alloc(cache, flags, rank);
/*
* We may just have run out of memory on the local node.
* ____cache_alloc_node() knows how to locate memory on other nodes
*/
if (!objp)
- objp = ____cache_alloc_node(cache, flags, numa_node_id());
+ objp = ____cache_alloc_node(cache, flags, numa_node_id(), rank);
out:
return objp;
@@ -3432,9 +3447,9 @@ __do_cache_alloc(struct kmem_cache *cach
#else
static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags, int rank)
{
- return ____cache_alloc(cachep, flags);
+ return ____cache_alloc(cachep, flags, rank);
}
#endif /* CONFIG_NUMA */
@@ -3444,13 +3459,14 @@ __cache_alloc(struct kmem_cache *cachep,
{
unsigned long save_flags;
void *objp;
+ int rank = slab_alloc_rank(flags);
if (should_failslab(cachep, flags))
return NULL;
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
- objp = __do_cache_alloc(cachep, flags);
+ objp = __do_cache_alloc(cachep, flags, rank);
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
prefetchw(objp);
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -163,6 +163,10 @@ config ZONE_DMA_FLAG
default "0" if !ZONE_DMA
default "1"
+config SLAB_FAIR
+ def_bool n
+ depends on SLAB
+
config NR_QUICK
int
depends on QUICKLIST
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h
+++ linux-2.6-git/mm/internal.h
@@ -107,4 +107,12 @@ static inline int gfp_to_rank(gfp_t gfp_
return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
}
+#ifdef CONFIG_SLAB_FAIR
+#define slab_alloc_rank(gfp) gfp_to_rank(gfp)
+#define slab_insufficient_rank(s, _rank) unlikely((_rank) > (s)->rank)
+#else
+#define slab_alloc_rank(gfp) 0
+#define slab_insufficient_rank(s, _rank) false
+#endif
+
#endif
--
^ permalink raw reply [flat|nested] 138+ messages in thread
* [PATCH 3/5] mm: slub allocation fairness
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 13:19 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-slub-ranking.patch --]
[-- Type: text/plain, Size: 4800 bytes --]
The slub allocator has some unfairness wrt gfp flags; when the slub cache is
grown the gfp flags are used to allocate more memory, however when there is
slub cache available (in partial or free slabs) gfp flags are ignored.
Thus it is possible for less critical slub allocations to succeed and gobble
up precious memory when under memory pressure.
This patch solves that by using the newly introduced page allocation rank.
Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).
When the slub space is grown the rank of the page allocation is stored. For
each slub allocation we test the given gfp flags against this rank, thereby
asking the question: would these flags have allowed the slub to grow?
If not, we need to test the current situation. This is done by forcing the
growth of the slub space (just testing the free page limits will not work due
to direct reclaim). Failing that, we fail the slub allocation.
Thus if we grew the slub under great duress while PF_MEMALLOC was set and we
really did access the memalloc reserve the rank would be set to 0. If the next
allocation to that slub would be GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and always > 0) we'd want to make sure that memory pressure has
decreased enough to allow an allocation with the given gfp flags.
So in this case we try to force grow the slub cache and on failure we fail the
slub allocation. Thus preserving the available slub cache for more pressing
allocations.
[netperf results]
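The decision described above can be condensed into a small userspace sketch. This is a model under assumed names, not the kernel code: force_grow() stands in for allocating a fresh slab with the request's own gfp flags, and pressure_rank models the rank at which a page allocation would currently succeed (negative meaning it would fail):

```c
#include <stdbool.h>

struct fake_cache { int rank; bool has_objects; };

/* Stand-in for growing the cache: on success the page allocator reports
 * the rank of the new page; pressure_rank < 0 models allocation failure. */
static bool force_grow(struct fake_cache *s, int pressure_rank)
{
    if (pressure_rank < 0)
        return false;
    s->rank = pressure_rank;
    s->has_objects = true;
    return true;
}

/* Returns true when the allocation may proceed. */
static bool fair_slab_alloc(struct fake_cache *s, int request_rank,
                            int pressure_rank)
{
    if (request_rank <= s->rank && s->has_objects)
        return true;    /* these flags would have allowed the growth */
    /* Probe current conditions by forcing growth with the request's own
     * flags; failure means these flags must not eat the cached objects. */
    return force_grow(s, pressure_rank);
}
```

This reproduces the example from the text: a cache grown at rank 0 under PF_MEMALLOC refuses a rank-4 request until a forced grow shows that pressure has eased.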
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 1 +
mm/Kconfig | 2 +-
mm/slub.c | 24 +++++++++++++++++++++---
3 files changed, 23 insertions(+), 4 deletions(-)
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -52,6 +52,7 @@ struct kmem_cache {
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
struct page *cpu_slab[NR_CPUS];
+ int rank;
};
/*
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,6 +20,7 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
@@ -961,6 +962,8 @@ static struct page *allocate_slab(struct
if (!page)
return NULL;
+ s->rank = page->index;
+
mod_zone_page_state(page_zone(page),
(s->flags & SLAB_RECLAIM_ACCOUNT) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
@@ -1350,6 +1353,8 @@ static void flush_all(struct kmem_cache
#endif
}
+#define FORCE_PAGE ((void *)~0UL)
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
@@ -1371,8 +1376,12 @@ static void *__slab_alloc(struct kmem_ca
gfp_t gfpflags, int node, void *addr, struct page *page)
{
void **object;
- int cpu = smp_processor_id();
+ int cpu;
+
+ if (page == FORCE_PAGE)
+ goto force_new;
+ cpu = smp_processor_id();
if (!page)
goto new_slab;
@@ -1405,6 +1414,7 @@ have_slab:
goto load_freelist;
}
+force_new:
page = new_slab(s, gfpflags, node);
if (page) {
cpu = smp_processor_id();
@@ -1465,15 +1475,22 @@ static void __always_inline *slab_alloc(
struct page *page;
void **object;
unsigned long flags;
+ int rank = slab_alloc_rank(gfpflags);
local_irq_save(flags);
+ if (slab_insufficient_rank(s, rank)) {
+ page = FORCE_PAGE;
+ goto force_alloc;
+ }
+
page = s->cpu_slab[smp_processor_id()];
if (unlikely(!page || !page->lockless_freelist ||
- (node != -1 && page_to_nid(page) != node)))
+ (node != -1 && page_to_nid(page) != node))) {
+force_alloc:
object = __slab_alloc(s, gfpflags, node, addr, page);
- else {
+ } else {
object = page->lockless_freelist;
page->lockless_freelist = object[page->offset];
}
@@ -1993,6 +2010,7 @@ static int kmem_cache_open(struct kmem_c
s->flags = flags;
s->align = align;
kmem_cache_open_debug_check(s);
+ s->rank = MAX_ALLOC_RANK;
if (!calculate_sizes(s))
goto error;
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -165,7 +165,7 @@ config ZONE_DMA_FLAG
config SLAB_FAIR
def_bool n
- depends on SLAB
+ depends on SLAB || SLUB
config NR_QUICK
int
--
^ permalink raw reply [flat|nested] 138+ messages in thread
* [PATCH 4/5] mm: slob allocation fairness
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 13:19 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-slob-ranking.patch --]
[-- Type: text/plain, Size: 4451 bytes --]
The slob allocator has some unfairness wrt gfp flags; when the slob space is
grown the gfp flags are used to allocate more memory, however when there is
slob space available gfp flags are ignored.
Thus it is possible for less critical slob allocations to succeed and gobble
up precious memory when under memory pressure.
This patch solves that by using the newly introduced page allocation rank.
Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).
When the slob space is grown the rank of the page allocation is stored. For
each slob allocation we test the given gfp flags against this rank, thereby
asking the question: would these flags have allowed the slob to grow?
If not, we need to test the current situation. This is done by forcing the
growth of the slob space (just testing the free page limits will not work due
to direct reclaim). Failing that, we fail the slob allocation.
Thus if we grew the slob under great duress while PF_MEMALLOC was set and we
really did access the memalloc reserve the rank would be set to 0. If the next
allocation to that slob would be GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and always > 0) we'd want to make sure that memory pressure has
decreased enough to allow an allocation with the given gfp flags.
So in this case we try to force grow the slob space and on failure we fail the
slob allocation. Thus preserving the available slob space for more pressing
allocations.
[netperf results]
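Since slob keeps a single global arena, the patch tracks one global rank. A hedged sketch of the probe it performs follows; probe_page_rank() is a hypothetical stand-in for alloc_page() reporting the rank via page->index, with a negative pressure_rank modelling a failed allocation:

```c
#include <stdbool.h>

enum { MAX_ALLOC_RANK = 16 };
static int slobrank = MAX_ALLOC_RANK;

/* Hypothetical stand-in for alloc_page(): returns the rank the page
 * allocator would record in page->index, or -1 if the allocation fails. */
static int probe_page_rank(int pressure_rank)
{
    return pressure_rank;
}

/* Mirrors the check at the top of slob_alloc(): when the request's rank
 * exceeds the rank the arena was last grown with, allocate and free one
 * page purely to re-sample current memory pressure. */
static bool slob_rank_ok(int request_rank, int pressure_rank)
{
    if (request_rank > slobrank) {
        int r = probe_page_rank(pressure_rank);
        if (r < 0)
            return false;   /* probe failed: fail the allocation */
        slobrank = r;       /* remember the freshly sampled rank */
    }
    return true;
}
```

The probe page is freed immediately; its sole purpose is sampling the rank, matching the alloc_page()/__free_page() pair in slob_alloc() above.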
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
---
mm/Kconfig | 1 -
mm/slob.c | 25 ++++++++++++++++++++++---
2 files changed, 22 insertions(+), 4 deletions(-)
Index: linux-2.6-git/mm/slob.c
===================================================================
--- linux-2.6-git.orig/mm/slob.c
+++ linux-2.6-git/mm/slob.c
@@ -35,6 +35,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/timer.h>
+#include "internal.h"
struct slob_block {
int units;
@@ -53,6 +54,7 @@ struct bigblock {
};
typedef struct bigblock bigblock_t;
+static struct { int rank; } slobrank = { .rank = MAX_ALLOC_RANK };
static slob_t arena = { .next = &arena, .units = 1 };
static slob_t *slobfree = &arena;
static bigblock_t *bigblocks;
@@ -62,12 +64,29 @@ static DEFINE_SPINLOCK(block_lock);
static void slob_free(void *b, int size);
static void slob_timer_cbk(void);
+static unsigned long slob_get_free_pages(gfp_t flags, int order)
+{
+ struct page *page = alloc_pages(flags, order);
+ if (!page)
+ return 0;
+ slobrank.rank = page->index;
+ return (unsigned long)page_address(page);
+}
static void *slob_alloc(size_t size, gfp_t gfp, int align)
{
slob_t *prev, *cur, *aligned = 0;
int delta = 0, units = SLOB_UNITS(size);
unsigned long flags;
+ int rank = slab_alloc_rank(gfp);
+
+ if (slab_insufficient_rank(&slobrank, rank)) {
+ struct page *page = alloc_page(gfp);
+ if (!page)
+ return NULL;
+ slobrank.rank = page->index;
+ __free_page(page);
+ }
spin_lock_irqsave(&slob_lock, flags);
prev = slobfree;
@@ -105,7 +124,7 @@ static void *slob_alloc(size_t size, gfp
if (size == PAGE_SIZE) /* trying to shrink arena? */
return 0;
- cur = (slob_t *)__get_free_page(gfp);
+ cur = (slob_t *)slob_get_free_pages(gfp, 0);
if (!cur)
return 0;
@@ -166,7 +185,7 @@ void *__kmalloc(size_t size, gfp_t gfp)
return 0;
bb->order = get_order(size);
- bb->pages = (void *)__get_free_pages(gfp, bb->order);
+ bb->pages = (void *)slob_get_free_pages(gfp, bb->order);
if (bb->pages) {
spin_lock_irqsave(&block_lock, flags);
@@ -309,7 +328,7 @@ void *kmem_cache_alloc(struct kmem_cache
if (c->size < PAGE_SIZE)
b = slob_alloc(c->size, flags, c->align);
else
- b = (void *)__get_free_pages(flags, get_order(c->size));
+ b = (void *)slob_get_free_pages(flags, get_order(c->size));
if (c->ctor)
c->ctor(b, c, SLAB_CTOR_CONSTRUCTOR);
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -165,7 +165,6 @@ config ZONE_DMA_FLAG
config SLAB_FAIR
def_bool n
- depends on SLAB || SLUB
config NR_QUICK
int
--
^ permalink raw reply [flat|nested] 138+ messages in thread
* [PATCH 4/5] mm: slob allocation fairness
@ 2007-05-14 13:19 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-slob-ranking.patch --]
[-- Type: text/plain, Size: 4676 bytes --]
The slob allocator has some unfairness wrt gfp flags; when the slob space is
grown the gfp flags are used to allocate more memory, however when there is
slob space available gfp flags are ignored.
Thus it is possible for less critical slob allocations to succeed and gobble
up precious memory when under memory pressure.
This patch solves that by using the newly introduced page allocation rank.
Page allocation rank is a scalar quantity connecting ALLOC_ and gfp flags which
represents how deep we had to reach into our reserves when allocating a page.
Rank 0 is the deepest we can reach (ALLOC_NO_WATERMARK) and 16 is the most
shallow allocation possible (ALLOC_WMARK_HIGH).
When the slob space is grown the rank of the page allocation is stored. For
each slob allocation we test the given gfp flags against this rank. Thereby
asking the question: would these flags have allowed the slob to grow.
If not so, we need to test the current situation. This is done by forcing the
growth of the slob space. (Just testing the free page limits will not work due
to direct reclaim) Failing this we need to fail the slob allocation.
Thus if we grew the slob under great duress while PF_MEMALLOC was set and we
really did access the memalloc reserve the rank would be set to 0. If the next
allocation to that slob would be GFP_NOFS|__GFP_NOMEMALLOC (which ordinarily
maps to rank 4 and always > 0) we'd want to make sure that memory pressure has
decreased enough to allow an allocation with the given gfp flags.
So in this case we try to force grow the slob space and on failure we fail the
slob allocation. Thus preserving the available slob space for more pressing
allocations.
[netperf results]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
---
mm/Kconfig | 1 -
mm/slob.c | 25 ++++++++++++++++++++++---
2 files changed, 22 insertions(+), 4 deletions(-)
Index: linux-2.6-git/mm/slob.c
===================================================================
--- linux-2.6-git.orig/mm/slob.c
+++ linux-2.6-git/mm/slob.c
@@ -35,6 +35,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/timer.h>
+#include "internal.h"
struct slob_block {
int units;
@@ -53,6 +54,7 @@ struct bigblock {
};
typedef struct bigblock bigblock_t;
+static struct { int rank; } slobrank = { .rank = MAX_ALLOC_RANK };
static slob_t arena = { .next = &arena, .units = 1 };
static slob_t *slobfree = &arena;
static bigblock_t *bigblocks;
@@ -62,12 +64,29 @@ static DEFINE_SPINLOCK(block_lock);
static void slob_free(void *b, int size);
static void slob_timer_cbk(void);
+static unsigned long slob_get_free_pages(gfp_t flags, int order)
+{
+ struct page *page = alloc_pages(gfp, order);
+ if (!page)
+ return 0;
+ slobrank.rank = page->index;
+ return (unsigned long)page_address(page);
+}
static void *slob_alloc(size_t size, gfp_t gfp, int align)
{
slob_t *prev, *cur, *aligned = 0;
int delta = 0, units = SLOB_UNITS(size);
unsigned long flags;
+ int rank = slab_alloc_rank(gfp);
+
+ if (slab_insufficient_rank(&slobrank, rank)) {
+ struct page *page = alloc_page(gfp);
+ if (!page)
+ return NULL;
+ slobrank.rank = page->index;
+ __free_page(page);
+ }
spin_lock_irqsave(&slob_lock, flags);
prev = slobfree;
@@ -105,7 +124,7 @@ static void *slob_alloc(size_t size, gfp
if (size == PAGE_SIZE) /* trying to shrink arena? */
return 0;
- cur = (slob_t *)__get_free_page(gfp);
+ cur = (slob_t *)slob_get_free_pages(gfp, 0);
if (!cur)
return 0;
@@ -166,7 +185,7 @@ void *__kmalloc(size_t size, gfp_t gfp)
return 0;
bb->order = get_order(size);
- bb->pages = (void *)__get_free_pages(gfp, bb->order);
+ bb->pages = (void *)slob_get_free_pages(gfp, bb->order);
if (bb->pages) {
spin_lock_irqsave(&block_lock, flags);
@@ -309,7 +328,7 @@ void *kmem_cache_alloc(struct kmem_cache
if (c->size < PAGE_SIZE)
b = slob_alloc(c->size, flags, c->align);
else
- b = (void *)__get_free_pages(flags, get_order(c->size));
+ b = (void *)slob_get_free_pages(flags, get_order(c->size));
if (c->ctor)
c->ctor(b, c, SLAB_CTOR_CONSTRUCTOR);
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -165,7 +165,6 @@ config ZONE_DMA_FLAG
config SLAB_FAIR
def_bool n
- depends on SLAB || SLUB
config NR_QUICK
int
--
^ permalink raw reply [flat|nested] 138+ messages in thread
* [PATCH 5/5] mm: allow mempool to fall back to memalloc reserves
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 13:19 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 13:19 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Christoph Lameter, Matt Mackall
[-- Attachment #1: mm-mempool_fixup.patch --]
[-- Type: text/plain, Size: 1335 bytes --]
Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/mempool.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6-git/mm/mempool.c
===================================================================
--- linux-2.6-git.orig/mm/mempool.c
+++ linux-2.6-git/mm/mempool.c
@@ -14,6 +14,7 @@
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
+#include "internal.h"
static void add_element(mempool_t *pool, void *element)
{
@@ -205,7 +206,7 @@ void * mempool_alloc(mempool_t *pool, gf
void *element;
unsigned long flags;
wait_queue_t wait;
- gfp_t gfp_temp;
+ gfp_t gfp_temp, gfp_orig = gfp_mask;
might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -229,6 +230,15 @@ repeat_alloc:
}
spin_unlock_irqrestore(&pool->lock, flags);
+ /* if we really had a right to the emergency reserves, try those */
+ if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
+ if (gfp_temp & __GFP_NOMEMALLOC) {
+ gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ goto repeat_alloc;
+ } else
+ gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+ }
+
/* We must not sleep in the GFP_ATOMIC case */
if (!(gfp_mask & __GFP_WAIT))
return NULL;
--
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 3/5] mm: slub allocation fairness
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 15:49 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 15:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
> Index: linux-2.6-git/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/slub_def.h
> +++ linux-2.6-git/include/linux/slub_def.h
> @@ -52,6 +52,7 @@ struct kmem_cache {
> struct kmem_cache_node *node[MAX_NUMNODES];
> #endif
> struct page *cpu_slab[NR_CPUS];
> + int rank;
> };
Ranks as part of the kmem_cache structure? I thought this is a temporary
thing?
> * Lock order:
> @@ -961,6 +962,8 @@ static struct page *allocate_slab(struct
> if (!page)
> return NULL;
>
> + s->rank = page->index;
> +
Argh.... Setting a cache structure field from a page struct field? What
about concurrency?
> @@ -1371,8 +1376,12 @@ static void *__slab_alloc(struct kmem_ca
> gfp_t gfpflags, int node, void *addr, struct page *page)
> {
> void **object;
> - int cpu = smp_processor_id();
> + int cpu;
> +
> + if (page == FORCE_PAGE)
> + goto force_new;
>
> + cpu = smp_processor_id();
> if (!page)
> goto new_slab;
>
> @@ -1405,6 +1414,7 @@ have_slab:
> goto load_freelist;
> }
>
> +force_new:
> page = new_slab(s, gfpflags, node);
> if (page) {
> cpu = smp_processor_id();
> @@ -1465,15 +1475,22 @@ static void __always_inline *slab_alloc(
> struct page *page;
> void **object;
> unsigned long flags;
> + int rank = slab_alloc_rank(gfpflags);
>
> local_irq_save(flags);
> + if (slab_insufficient_rank(s, rank)) {
> + page = FORCE_PAGE;
> + goto force_alloc;
> + }
> +
> page = s->cpu_slab[smp_processor_id()];
> if (unlikely(!page || !page->lockless_freelist ||
> - (node != -1 && page_to_nid(page) != node)))
> + (node != -1 && page_to_nid(page) != node))) {
>
> +force_alloc:
> object = __slab_alloc(s, gfpflags, node, addr, page);
>
> - else {
> + } else {
> object = page->lockless_freelist;
> page->lockless_freelist = object[page->offset];
> }
This is the hot path. No modifications please.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 2/5] mm: slab allocation fairness
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 15:51 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 15:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
> @@ -3182,13 +3192,13 @@ static inline void *____cache_alloc(stru
> check_irq_off();
>
> ac = cpu_cache_get(cachep);
> - if (likely(ac->avail)) {
> + if (likely(ac->avail) && !slab_insufficient_rank(cachep, rank)) {
> STATS_INC_ALLOCHIT(cachep);
> ac->touched = 1;
Hotpath modifications.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-14 15:53 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 15:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
> In the interest of creating a reserve based allocator; we need to make the slab
> allocator (*sigh*, all three) fair with respect to GFP flags.
I am not sure what the point of all of this is.
> That is, we need to protect memory from being used by easier gfp flags than it
> was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
> GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
> possible with the current allocators.
Why does this have to be handled by the slab allocators at all? If you have
free pages in the page allocator, then the slab allocators will be able to
use that reserve.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 15:53 ` Christoph Lameter
@ 2007-05-14 16:10 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 16:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 2007-05-14 at 08:53 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > In the interest of creating a reserve based allocator; we need to make the slab
> > allocator (*sigh*, all three) fair with respect to GFP flags.
>
> I am not sure what the point of all of this is.
>
> > That is, we need to protect memory from being used by easier gfp flags than it
> > was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
> > GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
> > possible with the current allocators.
>
> Why does this have to be handled by the slab allocators at all? If you have
> free pages in the page allocator then the slab allocators will be able to
> use that reserve.
Yes, too freely. GFP flags are only ever checked when you allocate a new
page. Hence, if a low-reaching alloc allocates a slab page, subsequent
non-critical GFP_KERNEL allocs can fill up that slab, and you would need to
reserve a slab per object instead of relying on the normal packing.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 15:53 ` Christoph Lameter
@ 2007-05-14 16:12 ` Matt Mackall
-1 siblings, 0 replies; 138+ messages in thread
From: Matt Mackall @ 2007-05-14 16:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, linux-kernel, linux-mm, Thomas Graf,
David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, May 14, 2007 at 08:53:21AM -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > In the interest of creating a reserve based allocator; we need to make the slab
> > allocator (*sigh*, all three) fair with respect to GFP flags.
>
> I am not sure what the point of all of this is.
>
> > That is, we need to protect memory from being used by easier gfp flags than it
> > was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
> > GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
> > possible with the current allocators.
>
> Why does this have to be handled by the slab allocators at all? If you have
> free pages in the page allocator then the slab allocators will be able to
> use that reserve.
If I understand this correctly:
privileged thread                     unprivileged greedy process
kmem_cache_alloc(...)
  adds new slab page from lowmem pool
do_io()
                                      kmem_cache_alloc(...)
                                      kmem_cache_alloc(...)
                                      kmem_cache_alloc(...)
                                      kmem_cache_alloc(...)
                                      kmem_cache_alloc(...)
                                      ...
                                      eats it all
kmem_cache_alloc(...) -> ENOMEM
who ate my donuts?!
But I think this solution is somewhat overkill. If we only care about this
issue in the OOM-avoidance case, then our rank reduces to a boolean.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 3/5] mm: slub allocation fairness
2007-05-14 15:49 ` Christoph Lameter
@ 2007-05-14 16:14 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 16:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 2007-05-14 at 08:49 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > Index: linux-2.6-git/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6-git.orig/include/linux/slub_def.h
> > +++ linux-2.6-git/include/linux/slub_def.h
> > @@ -52,6 +52,7 @@ struct kmem_cache {
> > struct kmem_cache_node *node[MAX_NUMNODES];
> > #endif
> > struct page *cpu_slab[NR_CPUS];
> > + int rank;
> > };
>
> Ranks as part of the kmem_cache structure? I thought this is a temporary
> thing?
No; it needs to store the current state to verify subsequent allocations'
gfp flags against.
> > * Lock order:
> > @@ -961,6 +962,8 @@ static struct page *allocate_slab(struct
> > if (!page)
> > return NULL;
> >
> > + s->rank = page->index;
> > +
>
> Argh.... Setting a cache structure field from a page struct field? What
> about concurrency?
Oh, right; allocate_slab is not serialized itself.
> > @@ -1371,8 +1376,12 @@ static void *__slab_alloc(struct kmem_ca
> > gfp_t gfpflags, int node, void *addr, struct page *page)
> > {
> > void **object;
> > - int cpu = smp_processor_id();
> > + int cpu;
> > +
> > + if (page == FORCE_PAGE)
> > + goto force_new;
> >
> > + cpu = smp_processor_id();
> > if (!page)
> > goto new_slab;
> >
> > @@ -1405,6 +1414,7 @@ have_slab:
> > goto load_freelist;
> > }
> >
> > +force_new:
> > page = new_slab(s, gfpflags, node);
> > if (page) {
> > cpu = smp_processor_id();
> > @@ -1465,15 +1475,22 @@ static void __always_inline *slab_alloc(
> > struct page *page;
> > void **object;
> > unsigned long flags;
> > + int rank = slab_alloc_rank(gfpflags);
> >
> > local_irq_save(flags);
> > + if (slab_insufficient_rank(s, rank)) {
> > + page = FORCE_PAGE;
> > + goto force_alloc;
> > + }
> > +
> > page = s->cpu_slab[smp_processor_id()];
> > if (unlikely(!page || !page->lockless_freelist ||
> > - (node != -1 && page_to_nid(page) != node)))
> > + (node != -1 && page_to_nid(page) != node))) {
> >
> > +force_alloc:
> > object = __slab_alloc(s, gfpflags, node, addr, page);
> >
> > - else {
> > + } else {
> > object = page->lockless_freelist;
> > page->lockless_freelist = object[page->offset];
> > }
>
> This is the hot path. No modifications please.
Yes it is, but sorry, I have to. I really need to validate each slab alloc
against its GFP flags. That's what the whole thing is about; I thought you
understood that.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 16:12 ` Matt Mackall
@ 2007-05-14 16:29 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 16:29 UTC (permalink / raw)
To: Matt Mackall
Cc: Peter Zijlstra, linux-kernel, linux-mm, Thomas Graf,
David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007, Matt Mackall wrote:
> privileged thread unprivileged greedy process
> kmem_cache_alloc(...)
> adds new slab page from lowmem pool
Yes, but it returns an object for the privileged thread. Is that not
enough?
> do_io()
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> ...
> eats it all
> kmem_cache_alloc(...) -> ENOMEM
> who ate my donuts?!
>
> But I think this solution is somehow overkill. If we only care about
> this issue in the OOM avoidance case, then our rank reduces to a
> boolean.
>
> --
> Mathematics is the supreme nostalgia of our time.
>
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 3/5] mm: slub allocation fairness
2007-05-14 16:14 ` Peter Zijlstra
@ 2007-05-14 16:35 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 16:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
> On Mon, 2007-05-14 at 08:49 -0700, Christoph Lameter wrote:
> > On Mon, 14 May 2007, Peter Zijlstra wrote:
> >
> > > Index: linux-2.6-git/include/linux/slub_def.h
> > > ===================================================================
> > > --- linux-2.6-git.orig/include/linux/slub_def.h
> > > +++ linux-2.6-git/include/linux/slub_def.h
> > > @@ -52,6 +52,7 @@ struct kmem_cache {
> > > struct kmem_cache_node *node[MAX_NUMNODES];
> > > #endif
> > > struct page *cpu_slab[NR_CPUS];
> > > + int rank;
> > > };
> >
> > Ranks as part of the kmem_cache structure? I thought this is a temporary
> > thing?
>
> No it needs to store the current state to verity subsequent allocations
> their gfp flags against.
What state? This is a global state? The kmem_cache struct is rarely
written to after setting up the slab. Any writes could create a serious
performance problem on large-scale systems.
> > > * Lock order:
> > > @@ -961,6 +962,8 @@ static struct page *allocate_slab(struct
> > > if (!page)
> > > return NULL;
> > >
> > > + s->rank = page->index;
> > > +
> >
> > Argh.... Setting a cache structure field from a page struct field? What
> > about concurrency?
>
> Oh, right; allocate_slab is not serialized itself.
Nor should you ever write to the kmem_cache structure concurrently at all.
> > >
> > > - else {
> > > + } else {
> > > object = page->lockless_freelist;
> > > page->lockless_freelist = object[page->offset];
> > > }
> >
> > This is the hot path. No modifications please.
>
> Yes it is, but sorry, I have to. I really need to validate each slab
> alloc its GFP flags. Thats what the whole thing is about, I thought you
> understood that.
You are accessing a kmem_cache structure field in the hot path. That
cacheline is otherwise never touched in the hot path. Sorry, this is way too
intrusive for the problem you are trying to solve.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 16:10 ` Peter Zijlstra
@ 2007-05-14 16:37 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 16:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
> > Why does this have to be handled by the slab allocators at all? If you have
> > free pages in the page allocator then the slab allocators will be able to
> > use that reserve.
>
> Yes, too freely. GFP flags are only ever checked when you allocate a new
> page. Hence, if you have a low-reaching alloc allocating a slab page,
> subsequent non-critical GFP_KERNEL allocs can fill up that slab, so
> you would need to reserve a slab per object instead of the normal
> packing.
This is all about making one thread fail rather than another? Note that
the allocations are a rather complex affair in the slab allocators. Per
node and per cpu structures play a big role.
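The failure mode Peter describes in the quote above can be reduced to a small userspace sketch. All names and the rank convention below are illustrative, not the actual patch code; lower "rank" here stands for a harder allocation context (e.g. GFP_ATOMIC reaching below the normal watermarks):

```c
#include <assert.h>

/* Illustrative only: lower rank = harder allocation context. */
enum { RANK_ATOMIC = 0, RANK_KERNEL = 1 };

struct fake_slab {
	int rank;		/* rank the backing page was allocated at */
	int free_objects;	/* objects left on this slab page */
};

/* Current behaviour: flags are only checked when the page is allocated,
 * so any later request can take objects from a reserve-backed page. */
static int alloc_object_unfair(struct fake_slab *s, int rank)
{
	(void)rank;		/* gfp flags ignored on the object fast path */
	if (s->free_objects > 0) {
		s->free_objects--;
		return 1;
	}
	return 0;
}

/* "Fair" behaviour: a request easier than the page's rank is pushed
 * back to the page allocator instead of draining the reserve. */
static int alloc_object_fair(struct fake_slab *s, int rank)
{
	if (rank > s->rank)
		return 0;	/* caller must get its own page */
	if (s->free_objects > 0) {
		s->free_objects--;
		return 1;
	}
	return 0;
}
```

With the unfair variant a GFP_KERNEL caller drains a page obtained under GFP_ATOMIC pressure; the fair variant refuses it.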
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 16:29 ` Christoph Lameter
@ 2007-05-14 17:40 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 17:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 09:29 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Matt Mackall wrote:
>
> > privileged thread unprivileged greedy process
> > kmem_cache_alloc(...)
> > adds new slab page from lowmem pool
>
> Yes but it returns an object for the privileged thread. Is that not
> enough?
No, because we reserved memory for n objects, and as Matt illustrates,
most of those will be eaten by the greedy process.
We could reserve one page per object, but that rather bloats the reserve.
> > do_io()
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > ...
> > eats it all
> > kmem_cache_alloc(...) -> ENOMEM
> > who ate my donuts?!
> >
> > But I think this solution is somehow overkill. If we only care about
> > this issue in the OOM avoidance case, then our rank reduces to a
> > boolean.
I tried to slim it down to a two-state affair, but last time I tried,
performance runs showed that it actually slowed things down some.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 17:40 ` Peter Zijlstra
@ 2007-05-14 17:57 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 17:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007, Peter Zijlstra wrote:
> On Mon, 2007-05-14 at 09:29 -0700, Christoph Lameter wrote:
> > On Mon, 14 May 2007, Matt Mackall wrote:
> >
> > > privileged thread unprivileged greedy process
> > > kmem_cache_alloc(...)
> > > adds new slab page from lowmem pool
> >
> > Yes but it returns an object for the privileged thread. Is that not
> > enough?
>
> No, because we reserved memory for n objects, and like matt illustrates
> most of those that will be eaten by the greedy process.
> We could reserve 1 page per object but that rather bloats the reserve.
One slab per object, not one page. But yes, that's some bloat.
You can pull the big switch (only on a SLUB slab, I fear) to switch
off the fast path. Do SetSlabDebug() when allocating a precious
allocation that should not be gobbled up by lower-level processes.
Then you can do whatever you want in the __slab_alloc debug section and we
won't care, because it's not the hot path.
SLAB is a bit different. There we already have issues with the fast path
due to the attempt to handle NUMA policies at the object level. SLUB fixes
that issue (if we can avoid your hot path patch). It intentionally
defers all special object handling to the slab level to increase NUMA
performance. If you do the same to SLAB then you will get the NUMA
troubles propagated to the SMP and UP level.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 17:57 ` Christoph Lameter
@ 2007-05-14 19:28 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 19:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 10:57 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > On Mon, 2007-05-14 at 09:29 -0700, Christoph Lameter wrote:
> > > On Mon, 14 May 2007, Matt Mackall wrote:
> > >
> > > > privileged thread unprivileged greedy process
> > > > kmem_cache_alloc(...)
> > > > adds new slab page from lowmem pool
> > >
> > > Yes but it returns an object for the privileged thread. Is that not
> > > enough?
> >
> > No, because we reserved memory for n objects, and like matt illustrates
> > most of those that will be eaten by the greedy process.
> > We could reserve 1 page per object but that rather bloats the reserve.
>
> 1 slab per object not one page. But yes thats some bloat.
>
> You can pull the big switch (only on a SLUB slab I fear) to switch
> off the fast path. Do SetSlabDebug() when allocating a precious
> allocation that should not be gobbled up by lower level processes.
> Then you can do whatever you want in the __slab_alloc debug section and we
> wont care because its not the hot path.
One allocator is all I need; it would just be grand if all could be
supported.
So what you suggest is not placing the 'emergency' slab in the regular
place, so that normal allocations will not be able to find it. Then, if an
emergency allocation cannot be satisfied by the regular path, we fall
back to the slow path and find the emergency slab.
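A minimal sketch of that fallback scheme, with entirely hypothetical names (the real SLUB structures and lists differ):

```c
#include <assert.h>

/* The emergency slab is kept off the regular lists, so the fast path
 * can never find it; only a failed emergency allocation reaches it. */
struct fake_cache {
	int regular_free;	/* objects on regular slabs */
	int emergency_free;	/* objects on the hidden emergency slab */
};

static int cache_alloc(struct fake_cache *c, int emergency)
{
	if (c->regular_free > 0) {	/* regular path, tried first */
		c->regular_free--;
		return 1;
	}
	if (emergency && c->emergency_free > 0) {	/* slow-path fallback */
		c->emergency_free--;
		return 1;
	}
	return 0;	/* ordinary allocations never drain the reserve */
}
```

An ordinary allocation that misses the regular slabs simply fails, while an emergency one falls through to the hidden slab.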
> SLAB is a bit different. There we already have issues with the fast path
> due to the attempt to handle numa policies at the object level. SLUB fixes
> that issue (if we can avoid you hot path patch). It intentionally does
> defer all special object handling to the slab level to increase NUMA
> performance. If you do the same to SLAB then you will get the NUMA
> troubles propagated to the SMP and UP level.
I could hack in a similar reserve slab by catching the failure of the
regular allocation path. It'd not make it prettier, though.
The thing is, I don't need any speed; as long as the machine stays
alive I'm good. However, others are planning to build a full reserve-based
allocator to properly fix the places that now use __GFP_NOFAIL and
situations such as in add_to_swap().
Ah well, one thing at a time. I'll hack this up.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 16:12 ` Matt Mackall
@ 2007-05-14 19:44 ` Andrew Morton
-1 siblings, 0 replies; 138+ messages in thread
From: Andrew Morton @ 2007-05-14 19:44 UTC (permalink / raw)
To: Matt Mackall
Cc: Christoph Lameter, Peter Zijlstra, linux-kernel, linux-mm,
Thomas Graf, David Miller, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007 11:12:24 -0500
Matt Mackall <mpm@selenic.com> wrote:
> If I understand this correctly:
>
> privileged thread unprivileged greedy process
> kmem_cache_alloc(...)
> adds new slab page from lowmem pool
> do_io()
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> kmem_cache_alloc(...)
> ...
> eats it all
> kmem_cache_alloc(...) -> ENOMEM
> who ate my donuts?!
Yes, that's my understanding also.
I can see why it's a problem in theory, but I don't think Peter has yet
revealed to us why it's a problem in practice. I got all excited when
Christoph asked "I am not sure what the point of all of this is.", but
Peter cunningly avoided answering that ;)
What observed problem is being fixed here?
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 19:28 ` Peter Zijlstra
@ 2007-05-14 19:56 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 19:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007, Peter Zijlstra wrote:
> > You can pull the big switch (only on a SLUB slab I fear) to switch
> > off the fast path. Do SetSlabDebug() when allocating a precious
> > allocation that should not be gobbled up by lower level processes.
> > Then you can do whatever you want in the __slab_alloc debug section and we
> > wont care because its not the hot path.
>
> One allocator is all I need; it would just be grand if all could be
> supported.
>
> So what you suggest is not placing the 'emergency' slab into the regular
> place so that normal allocations will not be able to find it. Then if an
> emergency allocation cannot be satified by the regular path, we fall
> back to the slow path and find the emergency slab.
Hmmm.. Maybe we could do that.... But what I had in mind was simply to
set a page flag (DebugSlab()) if you know in alloc_slab that the slab
should only be used for emergency allocation. If DebugSlab is set then the
fastpath will not be called. You can trap all allocation attempts and
insert whatever fancy logic you want in the debug path, since it's not
performance critical.
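The trap Christoph describes can be sketched as follows. SlabDebug really is a page flag in SLUB, but everything else here (names, the -1 convention, the emergency policy) is invented for illustration:

```c
#include <assert.h>

#define FAKE_DEBUG_FLAG 0x1u

struct fake_page {
	unsigned int flags;
	int free_objects;
};

/* Fast path: refuses flagged pages outright, returning -1 to force
 * the caller into the slow path. */
static int fast_path_alloc(struct fake_page *pg)
{
	if (pg->flags & FAKE_DEBUG_FLAG)
		return -1;		/* trapped: take the slow path */
	if (pg->free_objects > 0) {
		pg->free_objects--;
		return 1;
	}
	return 0;
}

/* Slow path: free to run arbitrary policy, e.g. only letting callers
 * marked "emergency" take objects from the flagged slab. */
static int slab_alloc(struct fake_page *pg, int emergency)
{
	int ret = fast_path_alloc(pg);

	if (ret != -1)
		return ret;		/* ordinary page, normal result */
	if (emergency && pg->free_objects > 0) {
		pg->free_objects--;
		return 1;
	}
	return 0;
}
```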
> The thing is; I'm not needing any speed, as long as the machine stay
> alive I'm good. However others are planing to build a full reserve based
> allocator to properly fix the places that now use __GFP_NOFAIL and
> situation such as in add_to_swap().
Well, I have a version of SLUB here that allows you to redirect the alloc
calls at will. It adds a kmem_cache_ops structure, and in the kmem_cache_ops
structure you can redirect allocation and freeing of slabs (not objects!)
at will. Would that help?
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 19:44 ` Andrew Morton
@ 2007-05-14 20:01 ` Matt Mackall
-1 siblings, 0 replies; 138+ messages in thread
From: Matt Mackall @ 2007-05-14 20:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Peter Zijlstra, linux-kernel, linux-mm,
Thomas Graf, David Miller, Daniel Phillips, Pekka Enberg
On Mon, May 14, 2007 at 12:44:51PM -0700, Andrew Morton wrote:
> On Mon, 14 May 2007 11:12:24 -0500
> Matt Mackall <mpm@selenic.com> wrote:
>
> > If I understand this correctly:
> >
> > privileged thread unprivileged greedy process
> > kmem_cache_alloc(...)
> > adds new slab page from lowmem pool
> > do_io()
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > kmem_cache_alloc(...)
> > ...
> > eats it all
> > kmem_cache_alloc(...) -> ENOMEM
> > who ate my donuts?!
>
> Yes, that's my understanding also.
>
> I can see why it's a problem in theory, but I don't think Peter has yet
> revealed to us why it's a problem in practice. I got all excited when
> Christoph asked "I am not sure what the point of all of this is.", but
> Peter cunningly avoided answering that ;)
>
> What observed problem is being fixed here?
(From my recollection of looking at this problem a few years ago:)
There are various critical I/O paths that aren't protected by
mempools that need to dip into reserves when we approach OOM.
If, say, we need some number of SKBs in the critical I/O cleaning path
while something else is cheerfully sending non-I/O data, that second
stream can eat the SKBs that the first had budgeted for in its
reserve.
I think the simplest thing to do is to make everyone either fail or
sleep if they're not marked critical and a global memory crisis flag
is set.
To make this not impact the fast path, we could pull some trick like
swapping out and hiding all the real slab caches when turning the crisis
flag on.
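The boolean scheme Matt sketches could look roughly like this; the flag and function are entirely hypothetical, nothing like this exists in the allocators:

```c
#include <assert.h>

static int memory_crisis;	/* global crisis flag, set near OOM */

/* Returns 1 on success, 0 when the caller should fail or sleep.
 * Non-critical callers are refused for the duration of the crisis,
 * so the remaining objects are kept for the I/O cleaning paths. */
static int crisis_gated_alloc(int critical, int *objects_left)
{
	if (memory_crisis && !critical)
		return 0;	/* fail (or sleep) until the crisis clears */
	if (*objects_left > 0) {
		(*objects_left)--;
		return 1;
	}
	return 0;
}
```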
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 19:56 ` Christoph Lameter
@ 2007-05-14 20:03 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 20:03 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 12:56 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > > You can pull the big switch (only on a SLUB slab I fear) to switch
> > > off the fast path. Do SetSlabDebug() when allocating a precious
> > > allocation that should not be gobbled up by lower level processes.
> > > Then you can do whatever you want in the __slab_alloc debug section and we
> > > wont care because its not the hot path.
> >
> > One allocator is all I need; it would just be grand if all could be
> > supported.
> >
> > So what you suggest is not placing the 'emergency' slab into the regular
> > place so that normal allocations will not be able to find it. Then if an
> > emergency allocation cannot be satified by the regular path, we fall
> > back to the slow path and find the emergency slab.
>
> Hmmm.. Maybe we could do that.... But what I had in mind was simply to
> set a page flag (DebugSlab()) if you know in alloc_slab that the slab
> should be only used for emergency allocation. If DebugSlab is set then the
> fastpath will not be called. You can trap all allocation attempts and
> insert whatever fancy logic you want in the debug path since its not
> performance critical.
I might have missed some detail when I looked at SLUB, but I did not see
how setting SlabDebug would trap subsequent allocations to that slab.
> > The thing is; I'm not needing any speed, as long as the machine stay
> > alive I'm good. However others are planing to build a full reserve based
> > allocator to properly fix the places that now use __GFP_NOFAIL and
> > situation such as in add_to_swap().
>
> Well I have version of SLUB here that allows you do redirect the alloc
> calls at will. Adds a kmem_cache_ops structure and in the kmem_cache_ops
> structure you can redirect allocation and freeing of slabs (not objects!)
> at will. Would that help?
I'm not sure; I need kmalloc as well.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 19:44 ` Andrew Morton
@ 2007-05-14 20:05 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 20:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Matt Mackall, Christoph Lameter, linux-kernel, linux-mm,
Thomas Graf, David Miller, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 12:44 -0700, Andrew Morton wrote:
> On Mon, 14 May 2007 11:12:24 -0500
> Matt Mackall <mpm@selenic.com> wrote:
>
> > If I understand this correctly:
> >
> > privileged thread                      unprivileged greedy process
> > kmem_cache_alloc(...)
> >   adds new slab page from lowmem pool
> > do_io()
> >                                        kmem_cache_alloc(...)
> >                                        kmem_cache_alloc(...)
> >                                        kmem_cache_alloc(...)
> >                                        kmem_cache_alloc(...)
> >                                        kmem_cache_alloc(...)
> >                                        ...
> >                                        eats it all
> > kmem_cache_alloc(...) -> ENOMEM
> >   who ate my donuts?!
>
> Yes, that's my understanding also.
>
> I can see why it's a problem in theory, but I don't think Peter has yet
> revealed to us why it's a problem in practice. I got all excited when
> Christoph asked "I am not sure what the point of all of this is.", but
> Peter cunningly avoided answering that ;)
>
> What observed problem is being fixed here?
I'm moving towards swapping over networked storage. Admittedly a new
feature.
As with pretty much all other swap solutions, there is the fundamental
VM deadlock: freeing memory requires memory. Current block devices get
around that by using mempools. This works well.
However, with network traffic mempools are not easily usable; the network
stack uses kmalloc. By using reserve based allocation we can keep
operating in a similar manner.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 20:03 ` Peter Zijlstra
@ 2007-05-14 20:06 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 20:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007, Peter Zijlstra wrote:
> > Hmmm.. Maybe we could do that.... But what I had in mind was simply to
> > set a page flag (DebugSlab()) if you know in alloc_slab that the slab
> > should be only used for emergency allocation. If DebugSlab is set then the
> > fastpath will not be called. You can trap all allocation attempts and
> > insert whatever fancy logic you want in the debug path since its not
> > performance critical.
>
> I might have missed some detail when I looked at SLUB, but I did not see
> how setting SlabDebug would trap subsequent allocations to that slab.
OK, it's not evident in slab_alloc. But if SlabDebug is set then
page->lockless_list is always NULL and we always fall back to
__slab_alloc. There we check for SlabDebug and go to the debug: label,
where you can insert any fancy processing you want.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 20:06 ` Christoph Lameter
@ 2007-05-14 20:12 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-14 20:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 13:06 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> > > Hmmm.. Maybe we could do that.... But what I had in mind was simply to
> > > set a page flag (DebugSlab()) if you know in alloc_slab that the slab
> > > should be only used for emergency allocation. If DebugSlab is set then the
> > > fastpath will not be called. You can trap all allocation attempts and
> > > insert whatever fancy logic you want in the debug path since its not
> > > performance critical.
> >
> > I might have missed some detail when I looked at SLUB, but I did not see
> > how setting SlabDebug would trap subsequent allocations to that slab.
>
> Ok its not evident in slab_alloc. But if SlabDebug is set then
> page->lockless_list is always NULL and we always fall back to
> __slab_alloc.
Ah, indeed, that is the detail I missed. Yes that would work out.
> There we check for SlabDebug and go to the debug: label.
> There you can insert any fancy processing you want.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 20:03 ` Peter Zijlstra
@ 2007-05-14 20:25 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-14 20:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 14 May 2007, Peter Zijlstra wrote:
> > > The thing is; I'm not needing any speed, as long as the machine stay
> > > alive I'm good. However others are planing to build a full reserve based
> > > allocator to properly fix the places that now use __GFP_NOFAIL and
> > > situation such as in add_to_swap().
> >
> > Well I have version of SLUB here that allows you do redirect the alloc
> > calls at will. Adds a kmem_cache_ops structure and in the kmem_cache_ops
> > structure you can redirect allocation and freeing of slabs (not objects!)
> > at will. Would that help?
>
> I'm not sure; I need kmalloc as well.
We could add a kmalloc_ops structure to allow redirects?
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 19:28 ` Peter Zijlstra
@ 2007-05-15 17:27 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-15 17:27 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Mon, 2007-05-14 at 21:28 +0200, Peter Zijlstra wrote:
> One allocator is all I need; it would just be grand if all could be
> supported.
>
> So what you suggest is not placing the 'emergency' slab into the regular
> place so that normal allocations will not be able to find it. Then if an
> emergency allocation cannot be satisfied by the regular path, we fall
> back to the slow path and find the emergency slab.
How about something like this; it seems to sustain a little stress.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/slub_def.h | 3 +
mm/slub.c | 73 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 8 deletions(-)
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -47,6 +47,9 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
+ spinlock_t reserve_lock;
+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. slab->reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -981,7 +983,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, SLAB_CTOR_CONSTRUCTOR);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
{
struct page *page;
struct kmem_cache_node *n;
@@ -999,6 +1001,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *rank = page->rank;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1286,7 +1289,7 @@ static void putback_slab(struct kmem_cac
/*
* Remove the cpu slab
*/
-static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+static void __deactivate_slab(struct kmem_cache *s, struct page *page)
{
/*
* Merge cpu freelist into freelist. Typically we get here
@@ -1305,8 +1308,13 @@ static void deactivate_slab(struct kmem_
page->freelist = object;
page->inuse--;
}
- s->cpu_slab[cpu] = NULL;
ClearPageActive(page);
+}
+
+static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+{
+ __deactivate_slab(s, page);
+ s->cpu_slab[cpu] = NULL;
putback_slab(s, page);
}
@@ -1372,6 +1380,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int rank = 0;
if (!page)
goto new_slab;
@@ -1403,10 +1412,42 @@ have_slab:
s->cpu_slab[cpu] = page;
SetPageActive(page);
goto load_freelist;
+ } else if (gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS) {
+ spin_lock(&s->reserve_lock);
+ page = s->reserve_slab;
+ if (page) {
+ if (page->freelist) {
+ slab_lock(page);
+ spin_unlock(&s->reserve_lock);
+ goto load_freelist;
+ } else
+ s->reserve_slab = NULL;
+ }
+ spin_unlock(&s->reserve_lock);
+
+ if (page) {
+ slab_lock(page);
+ __deactivate_slab(s, page);
+ putback_slab(s, page);
+ }
}
- page = new_slab(s, gfpflags, node);
- if (page) {
+ page = new_slab(s, gfpflags, node, &rank);
+ if (page && rank) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&s->reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&s->reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ __deactivate_slab(s, reserve);
+ putback_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1432,6 +1473,18 @@ have_slab:
}
slab_lock(page);
goto have_slab;
+ } else if (page) {
+ spin_lock(&s->reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ }
+ slab_lock(page);
+ SetPageActive(page);
+ s->reserve_slab = page;
+ spin_unlock(&s->reserve_lock);
+
+ goto load_freelist;
}
return NULL;
debug:
@@ -1788,10 +1841,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int rank;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &rank);
/* new_slab() disables interupts */
local_irq_enable();
@@ -2002,6 +2056,9 @@ static int kmem_cache_open(struct kmem_c
s->defrag_ratio = 100;
#endif
+ spin_lock_init(&s->reserve_lock);
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-15 17:27 ` Peter Zijlstra
@ 2007-05-15 22:02 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-15 22:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Tue, 15 May 2007, Peter Zijlstra wrote:
> How about something like this; it seems to sustain a little stress.
Argh again mods to kmem_cache.
Could we do this with a new slab page flag? F.e. SlabEmergPool.
in alloc_slab() do:

	if (is_emergency_pool_page(page)) {
		SetSlabDebug(page);
		SetSlabEmerg(page);
	}

So now you can intercept allocs to the SlabEmerg slab in __slab_alloc:

debug:
	if (SlabEmergPool(page)) {
		if (mem_no_longer_critical()) {
			/* Avoid future trapping */
			ClearSlabDebug(page);
			ClearSlabEmergPool(page);
		} else if (process_not_allowed_this_memory()) {
			do_something_bad_to_the_caller();
		} else {
			/* Allocation permitted */
		}
	}
	....
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-15 22:02 ` Christoph Lameter
@ 2007-05-16 6:59 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 6:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Tue, 2007-05-15 at 15:02 -0700, Christoph Lameter wrote:
> On Tue, 15 May 2007, Peter Zijlstra wrote:
>
> > How about something like this; it seems to sustain a little stress.
>
> Argh again mods to kmem_cache.
Hmm, I had not understood you minded that very much; I did stay away
from all the fast paths this time.
The thing is, I wanted to fold all the emergency allocs into a single
slab, not a per cpu thing. And once you lose the per cpu thing, you
need some extra serialization. Currently the top level lock is
slab_lock(page), but that only works because we have interrupts disabled
and work per cpu.
Why is it bad to extend kmem_cache a bit?
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 6:59 ` Peter Zijlstra
@ 2007-05-16 18:43 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 18:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> On Tue, 2007-05-15 at 15:02 -0700, Christoph Lameter wrote:
> > On Tue, 15 May 2007, Peter Zijlstra wrote:
> >
> > > How about something like this; it seems to sustain a little stress.
> >
> > Argh again mods to kmem_cache.
>
> Hmm, I had not understood you minded that very much; I did stay away
> from all the fast paths this time.
Well you added a new locking level and changed the locking hierarchy!
> The thing is, I wanted to fold all the emergency allocs into a single
> slab, not a per cpu thing. And once you lose the per cpu thing, you
> need some extra serialization. Currently the top level lock is
> slab_lock(page), but that only works because we have interrupts disabled
> and work per cpu.
SLUB can only allocate from a per cpu slab. You will have to reserve one
slab per cpu anyway unless we flush the cpu slab after each access. Same
thing is true for SLAB. It wants objects in its per cpu queues.
> Why is it bad to extend kmem_cache a bit?
Because it is, for all practical purposes, a heavily accessed read-only
structure. Modifications only occur to per node and per cpu structures.
In a 4k-processor system any write will kick the kmem_cache cacheline
out of the caches of 4k processors.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 18:43 ` Christoph Lameter
@ 2007-05-16 19:25 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 19:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 11:43 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > On Tue, 2007-05-15 at 15:02 -0700, Christoph Lameter wrote:
> > > On Tue, 15 May 2007, Peter Zijlstra wrote:
> > >
> > > > How about something like this; it seems to sustain a little stress.
> > >
> > > Argh again mods to kmem_cache.
> >
> > Hmm, I had not understood you minded that very much; I did stay away
> > from all the fast paths this time.
>
> Well you added a new locking level and changed the locking hierachy!
>
> > The thing is, I wanted to fold all the emergency allocs into a single
> > slab, not a per cpu thing. And once you lose the per cpu thing, you
> > need some extra serialization. Currently the top level lock is
> > slab_lock(page), but that only works because we have interrupts disabled
> > and work per cpu.
>
> SLUB can only allocate from a per cpu slab. You will have to reserve one
> slab per cpu anyways unless we flush the cpu slab after each access. Same
> thing is true for SLAB. It wants objects in its per cpu queues.
>
> > Why is it bad to extend kmem_cache a bit?
>
> Because it is for all practical purposes a heavily accessed read only
> structure. Modifications only occur to per node and per cpu structures.
> In a 4k-processor system any write will kick out the kmem_cache cacheline
> in 4k processors.
If this 4k-cpu system ever gets to touch the new lock it has way
deeper problems than a bouncing cache-line.
Please look at it more carefully.
We differentiate pages allocated at the level where GFP_ATOMIC starts to
fail. Because those pages are never installed as percpu slabs, such
allocations are retried every time, except for ALLOC_NO_WATERMARKS
allocations; those are served from the ->reserve_slab.
Once a regular slab allocation succeeds again, the ->reserve_slab is
cleaned up and not looked at again until we're next in distress.
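The lifecycle described above can be sketched in userspace C. Everything here is an illustrative stand-in for the page allocator and SLUB internals (watermarks_ok, slab_alloc_sketch, the page variables), not real kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page { int objects; };

/* Minimal model of the cache: one shared emergency slab, as in the patch. */
struct kmem_cache { struct page *reserve_slab; };

static struct page emergency_page = { 4 };
static struct page regular_page   = { 8 };

/*
 * While regular page allocation fails, only ALLOC_NO_WATERMARKS callers
 * are served, from the shared reserve; once a regular allocation succeeds
 * the reserve is dropped and not looked at again until the next distress.
 */
static struct page *slab_alloc_sketch(struct kmem_cache *s,
                                      bool watermarks_ok,
                                      bool alloc_no_watermarks)
{
    if (watermarks_ok) {
        s->reserve_slab = NULL;   /* distress over: clean up the reserve */
        return &regular_page;
    }
    if (!alloc_no_watermarks)
        return NULL;              /* GFP_KERNEL must not eat the reserve */
    if (!s->reserve_slab)
        s->reserve_slab = &emergency_page;
    return s->reserve_slab;
}
```

The key property is that a weaker allocation fails outright rather than consuming the reserve.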
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/slub_def.h | 2 +
mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 78 insertions(+), 9 deletions(-)
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -46,6 +46,8 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
static void sysfs_slab_remove(struct kmem_cache *s) {}
#endif
+static DEFINE_SPINLOCK(reserve_lock);
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *rank = page->rank;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1311,7 +1316,7 @@ static void unfreeze_slab(struct kmem_ca
/*
* Remove the cpu slab
*/
-static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+static void __deactivate_slab(struct kmem_cache *s, struct page *page)
{
/*
* Merge cpu freelist into freelist. Typically we get here
@@ -1330,10 +1335,15 @@ static void deactivate_slab(struct kmem_
page->freelist = object;
page->inuse--;
}
- s->cpu_slab[cpu] = NULL;
unfreeze_slab(s, page);
}
+static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+{
+ __deactivate_slab(s, page);
+ s->cpu_slab[cpu] = NULL;
+}
+
static void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
{
slab_lock(page);
@@ -1395,6 +1405,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int rank = 0;
if (!page)
goto new_slab;
@@ -1424,10 +1435,26 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;
- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &rank);
+ if (page && rank) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ __deactivate_slab(s, reserve);
+ putback_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1455,6 +1482,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ }
+ slab_lock(page);
+ SetPageActive(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+
+ goto got_reserve;
}
return NULL;
debug:
@@ -1470,6 +1509,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ __deactivate_slab(s, page);
+ putback_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+got_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}
/*
@@ -1807,10 +1871,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int rank;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &rank);
/* new_slab() disables interrupts */
local_irq_enable();
@@ -2018,6 +2083,8 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 19:25 ` Peter Zijlstra
@ 2007-05-16 19:53 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 19:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> If this 4k cpu system ever gets to touch the new lock it is in way
> deeper problems than a bouncing cache-line.
So it's no use on NUMA?
> Please look at it more carefully.
>
> We differentiate pages allocated at the level where GFP_ATOMIC starts to
> fail. By not updating the percpu slabs those are retried every time,
> except for ALLOC_NO_WATERMARKS allocations; those are served from the
> ->reserve_slab.
>
> Once a regular slab allocation succeeds again, the ->reserve_slab is
> cleaned up and not looked at again until we're next in distress.
A single slab? This may only give you a single object in an extreme
case. Are you sure that this solution is generic enough?
The problem here is that you may take the spinlock and take out the slab for
one cpu, but then (AFAICT) other cpus can still not get their high-priority
allocs satisfied. Some comments follow.
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> include/linux/slub_def.h | 2 +
> mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 78 insertions(+), 9 deletions(-)
>
> Index: linux-2.6-git/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/slub_def.h
> +++ linux-2.6-git/include/linux/slub_def.h
> @@ -46,6 +46,8 @@ struct kmem_cache {
> struct list_head list; /* List of slab caches */
> struct kobject kobj; /* For sysfs */
>
> + struct page *reserve_slab;
> +
> #ifdef CONFIG_NUMA
> int defrag_ratio;
> struct kmem_cache_node *node[MAX_NUMNODES];
> Index: linux-2.6-git/mm/slub.c
> ===================================================================
> --- linux-2.6-git.orig/mm/slub.c
> +++ linux-2.6-git/mm/slub.c
> @@ -20,11 +20,13 @@
> #include <linux/mempolicy.h>
> #include <linux/ctype.h>
> #include <linux/kallsyms.h>
> +#include "internal.h"
>
> /*
> * Lock order:
> - * 1. slab_lock(page)
> - * 2. slab->list_lock
> + * 1. reserve_lock
> + * 2. slab_lock(page)
> + * 3. node->list_lock
> *
> * The slab_lock protects operations on the object of a particular
> * slab and its metadata in the page struct. If the slab lock
> @@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
> static void sysfs_slab_remove(struct kmem_cache *s) {}
> #endif
>
> +static DEFINE_SPINLOCK(reserve_lock);
> +
> /********************************************************************
> * Core slab cache functions
> *******************************************************************/
> @@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
> s->ctor(object, s, 0);
> }
>
> -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
> {
> struct page *page;
> struct kmem_cache_node *n;
> @@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
> if (!page)
> goto out;
>
> + *rank = page->rank;
> n = get_node(s, page_to_nid(page));
> if (n)
> atomic_long_inc(&n->nr_slabs);
> @@ -1311,7 +1316,7 @@ static void unfreeze_slab(struct kmem_ca
> /*
> * Remove the cpu slab
> */
> -static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
> +static void __deactivate_slab(struct kmem_cache *s, struct page *page)
> {
> /*
> * Merge cpu freelist into freelist. Typically we get here
> @@ -1330,10 +1335,15 @@ static void deactivate_slab(struct kmem_
> page->freelist = object;
> page->inuse--;
> }
> - s->cpu_slab[cpu] = NULL;
> unfreeze_slab(s, page);
> }
So you want to spill back the lockless_freelist without deactivating the
slab? Why are you using the lockless_freelist at all? If you do not use it
then you can call unfreeze_slab. No need for this split.
> @@ -1395,6 +1405,7 @@ static void *__slab_alloc(struct kmem_ca
> {
> void **object;
> int cpu = smp_processor_id();
> + int rank = 0;
>
> if (!page)
> goto new_slab;
> @@ -1424,10 +1435,26 @@ new_slab:
> if (page) {
> s->cpu_slab[cpu] = page;
> goto load_freelist;
> - }
> + } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
> + goto try_reserve;
Ok, so we are trying to allocate a slab and do not get one, thus ->
try_reserve. But this only works if we use the slab after explicitly
flushing the cpu slabs. Otherwise the slab may be full and we get to
alloc_slab.
>
> - page = new_slab(s, gfpflags, node);
> - if (page) {
> +alloc_slab:
> + page = new_slab(s, gfpflags, node, &rank);
> + if (page && rank) {
Huh? You mean !page?
> + if (unlikely(s->reserve_slab)) {
> + struct page *reserve;
> +
> + spin_lock(&reserve_lock);
> + reserve = s->reserve_slab;
> + s->reserve_slab = NULL;
> + spin_unlock(&reserve_lock);
> +
> + if (reserve) {
> + slab_lock(reserve);
> + __deactivate_slab(s, reserve);
> + putback_slab(s, reserve);
Remove the above two lines (they are wrong regardless) and simply make
this the cpu slab.
> + }
> + }
> cpu = smp_processor_id();
> if (s->cpu_slab[cpu]) {
> /*
> @@ -1455,6 +1482,18 @@ new_slab:
> SetSlabFrozen(page);
> s->cpu_slab[cpu] = page;
> goto load_freelist;
> + } else if (page) {
> + spin_lock(&reserve_lock);
> + if (s->reserve_slab) {
> + discard_slab(s, page);
> + page = s->reserve_slab;
> + }
> + slab_lock(page);
> + SetPageActive(page);
> + s->reserve_slab = page;
> + spin_unlock(&reserve_lock);
> +
> + goto got_reserve;
> }
> return NULL;
> debug:
> @@ -1470,6 +1509,31 @@ debug:
> page->freelist = object[page->offset];
> slab_unlock(page);
> return object;
> +
> +try_reserve:
> + spin_lock(&reserve_lock);
> + page = s->reserve_slab;
> + if (!page) {
> + spin_unlock(&reserve_lock);
> + goto alloc_slab;
> + }
> +
> + slab_lock(page);
> + if (!page->freelist) {
> + s->reserve_slab = NULL;
> + spin_unlock(&reserve_lock);
> + __deactivate_slab(s, page);
Replace with unfreeze_slab().
> + putback_slab(s, page);
Putting back the slab twice.
> + goto alloc_slab;
> + }
> + spin_unlock(&reserve_lock);
> +
> +got_reserve:
> + object = page->freelist;
> + page->inuse++;
> + page->freelist = object[page->offset];
> + slab_unlock(page);
> + return object;
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 19:53 ` Christoph Lameter
@ 2007-05-16 20:18 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 20:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 12:53 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> If this 4k-cpu system ever gets to touch the new lock it has way
> deeper problems than a bouncing cache-line.
>
> So its no use on NUMA?
It is; it's just that we're swapping very heavily at that point. A
bouncing cache-line will not significantly slow down the box compared to
waiting for block IO, will it?
> > Please look at it more carefully.
> >
> > We differentiate pages allocated at the level where GFP_ATOMIC starts to
> > fail. By not updating the percpu slabs those are retried every time,
> > except for ALLOC_NO_WATERMARKS allocations; those are served from the
> > ->reserve_slab.
> >
> > Once a regular slab allocation succeeds again, the ->reserve_slab is
> > cleaned up and not looked at again until we're next in distress.
>
> A single slab? This may only give you a single object in an extreme
> case. Are you sure that this solution is generic enough?
Well, single as in a single active one; it gets spilled onto the full list
and a new one is instantiated if more is needed.
> The problem here is that you may spinlock and take out the slab for one
> cpu but then (AFAICT) other cpus can still not get their high priority
> allocs satisfied. Some comments follow.
All cpus are redirected to ->reserve_slab when the regular allocations
start to fail.
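That funnelling of all cpus through one reserve can be modelled in userspace, with threads standing in for cpus and a pthread mutex standing in for reserve_lock (the object and thread counts are arbitrary illustration values, not anything from the patch):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One shared reserve of objects, protected by a single lock. */
#define RESERVE_OBJECTS 1000

static pthread_mutex_t reserve_lock = PTHREAD_MUTEX_INITIALIZER;
static int reserve_left = RESERVE_OBJECTS;

/* Returns 1 if an object was handed out, 0 once the reserve is exhausted. */
static int take_from_reserve(void)
{
    int got = 0;
    pthread_mutex_lock(&reserve_lock);
    if (reserve_left > 0) {
        reserve_left--;
        got = 1;
    }
    pthread_mutex_unlock(&reserve_lock);
    return got;
}

/* Each "cpu" drains the reserve until it runs dry. */
static void *worker(void *arg)
{
    long *taken = arg;
    while (take_from_reserve())
        (*taken)++;
    return NULL;
}
```

Every thread contends on the same lock, which is the cache-line bounce being discussed; the argument in the mail is that this only happens when the box is already in distress.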
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > include/linux/slub_def.h | 2 +
> > mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 78 insertions(+), 9 deletions(-)
> >
> > Index: linux-2.6-git/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6-git.orig/include/linux/slub_def.h
> > +++ linux-2.6-git/include/linux/slub_def.h
> > @@ -46,6 +46,8 @@ struct kmem_cache {
> > struct list_head list; /* List of slab caches */
> > struct kobject kobj; /* For sysfs */
> >
> > + struct page *reserve_slab;
> > +
> > #ifdef CONFIG_NUMA
> > int defrag_ratio;
> > struct kmem_cache_node *node[MAX_NUMNODES];
> > Index: linux-2.6-git/mm/slub.c
> > ===================================================================
> > --- linux-2.6-git.orig/mm/slub.c
> > +++ linux-2.6-git/mm/slub.c
> > @@ -20,11 +20,13 @@
> > #include <linux/mempolicy.h>
> > #include <linux/ctype.h>
> > #include <linux/kallsyms.h>
> > +#include "internal.h"
> >
> > /*
> > * Lock order:
> > - * 1. slab_lock(page)
> > - * 2. slab->list_lock
> > + * 1. reserve_lock
> > + * 2. slab_lock(page)
> > + * 3. node->list_lock
> > *
> > * The slab_lock protects operations on the object of a particular
> > * slab and its metadata in the page struct. If the slab lock
> > @@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
* Re: [PATCH 0/5] make slab gfp fair
@ 2007-05-16 20:18 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 20:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 12:53 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > If this 4k cpu system ever gets to touch the new lock it is in way
> > deeper problems than a bouncing cache-line.
>
> So its no use on NUMA?
It is; it's just that we're swapping very heavily at that point. A
bouncing cache-line will not significantly slow down the box compared to
waiting for block IO, will it?
> > Please look at it more carefully.
> >
> > We differentiate pages allocated at the level where GFP_ATOMIC starts to
> > fail. By not updating the percpu slabs those are retried every time,
> > except for ALLOC_NO_WATERMARKS allocations; those are served from the
> > ->reserve_slab.
> >
> > Once a regular slab allocation succeeds again, the ->reserve_slab is
> > cleaned up and never again looked at it until we're in distress again.
>
> A single slab? This may only give you a single object in an extreme
> case. Are you sure that this solution is generic enough?
Well, single as in a single active slab; it gets spilled onto the full list
and a new one is instantiated if more is needed.
> The problem here is that you may spinlock and take out the slab for one
> cpu but then (AFAICT) other cpus can still not get their high priority
> allocs satisfied. Some comments follow.
All cpus are redirected to ->reserve_slab when the regular allocations
start to fail.
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > include/linux/slub_def.h | 2 +
> > mm/slub.c | 85 ++++++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 78 insertions(+), 9 deletions(-)
> >
> > Index: linux-2.6-git/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6-git.orig/include/linux/slub_def.h
> > +++ linux-2.6-git/include/linux/slub_def.h
> > @@ -46,6 +46,8 @@ struct kmem_cache {
> > struct list_head list; /* List of slab caches */
> > struct kobject kobj; /* For sysfs */
> >
> > + struct page *reserve_slab;
> > +
> > #ifdef CONFIG_NUMA
> > int defrag_ratio;
> > struct kmem_cache_node *node[MAX_NUMNODES];
> > Index: linux-2.6-git/mm/slub.c
> > ===================================================================
> > --- linux-2.6-git.orig/mm/slub.c
> > +++ linux-2.6-git/mm/slub.c
> > @@ -20,11 +20,13 @@
> > #include <linux/mempolicy.h>
> > #include <linux/ctype.h>
> > #include <linux/kallsyms.h>
> > +#include "internal.h"
> >
> > /*
> > * Lock order:
> > - * 1. slab_lock(page)
> > - * 2. slab->list_lock
> > + * 1. reserve_lock
> > + * 2. slab_lock(page)
> > + * 3. node->list_lock
> > *
> > * The slab_lock protects operations on the object of a particular
> > * slab and its metadata in the page struct. If the slab lock
> > @@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
> > static void sysfs_slab_remove(struct kmem_cache *s) {}
> > #endif
> >
> > +static DEFINE_SPINLOCK(reserve_lock);
> > +
> > /********************************************************************
> > * Core slab cache functions
> > *******************************************************************/
> > @@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
> > s->ctor(object, s, 0);
> > }
> >
> > -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> > +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
> > {
> > struct page *page;
> > struct kmem_cache_node *n;
> > @@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
> > if (!page)
> > goto out;
> >
> > + *rank = page->rank;
> > n = get_node(s, page_to_nid(page));
> > if (n)
> > atomic_long_inc(&n->nr_slabs);
> > @@ -1311,7 +1316,7 @@ static void unfreeze_slab(struct kmem_ca
> > /*
> > * Remove the cpu slab
> > */
> > -static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
> > +static void __deactivate_slab(struct kmem_cache *s, struct page *page)
> > {
> > /*
> > * Merge cpu freelist into freelist. Typically we get here
> > @@ -1330,10 +1335,15 @@ static void deactivate_slab(struct kmem_
> > page->freelist = object;
> > page->inuse--;
> > }
> > - s->cpu_slab[cpu] = NULL;
> > unfreeze_slab(s, page);
> > }
>
> So you want to spill back the lockless_freelist without deactivating the
> slab? Why are you using the lockless_freelist at all? If you do not use it
> then you can call unfreeze_slab. No need for this split.
Ah, quite right. I do indeed not use the lockless_freelist.
> > @@ -1395,6 +1405,7 @@ static void *__slab_alloc(struct kmem_ca
> > {
> > void **object;
> > int cpu = smp_processor_id();
> > + int rank = 0;
> >
> > if (!page)
> > goto new_slab;
> > @@ -1424,10 +1435,26 @@ new_slab:
> > if (page) {
> > s->cpu_slab[cpu] = page;
> > goto load_freelist;
> > - }
> > + } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
> > + goto try_reserve;
>
> Ok so we are trying to allocate a slab and do not get one thus ->
> try_reserve.
Right, so the cpu-slab is NULL, and we need a new slab.
> But this is only working if we are using the slab after
> explicitly flushing the cpuslabs. Otherwise the slab may be full and we
> get to alloc_slab.
/me fails to parse.
When we need a new_slab:
- we try the partial lists,
- we try the reserve (if ALLOC_NO_WATERMARKS)
otherwise alloc_slab
> >
> > - page = new_slab(s, gfpflags, node);
> > - if (page) {
>
> > +alloc_slab:
> > + page = new_slab(s, gfpflags, node, &rank);
> > + if (page && rank) {
>
> Huh? You mean !page?
No, no, we did get a page, and rank being non-zero means it did not
take ALLOC_NO_WATERMARKS to get it.
> > + if (unlikely(s->reserve_slab)) {
> > + struct page *reserve;
> > +
> > + spin_lock(&reserve_lock);
> > + reserve = s->reserve_slab;
> > + s->reserve_slab = NULL;
> > + spin_unlock(&reserve_lock);
> > +
> > + if (reserve) {
> > + slab_lock(reserve);
> > + __deactivate_slab(s, reserve);
> > + putback_slab(s, reserve);
>
> Remove the above two lines (they are wrong regardless) and simply make
> this the cpu slab.
It need not be the same node; the reserve_slab is node agnostic.
So here the free page watermarks are good again, and we can forget all
about the ->reserve_slab. We just push it on the free/partial lists and
forget about it.
But like you said above: unfreeze_slab() should be good, since I don't
use the lockless_freelist.
> > + }
> > + }
> > cpu = smp_processor_id();
> > if (s->cpu_slab[cpu]) {
> > /*
> > @@ -1455,6 +1482,18 @@ new_slab:
> > SetSlabFrozen(page);
> > s->cpu_slab[cpu] = page;
> > goto load_freelist;
> > + } else if (page) {
> > + spin_lock(&reserve_lock);
> > + if (s->reserve_slab) {
> > + discard_slab(s, page);
> > + page = s->reserve_slab;
> > + }
> > + slab_lock(page);
> > + SetPageActive(page);
> > + s->reserve_slab = page;
> > + spin_unlock(&reserve_lock);
> > +
> > + goto got_reserve;
So this is the case where we do get a page, but it took ALLOC_NO_WATERMARKS
to get it. Instead of updating the cpu_slab we leave that unset, so that
subsequent allocations will try to allocate a slab again thereby testing
the current free pages limit (and not gobble up the reserve).
> > }
> > return NULL;
> > debug:
> > @@ -1470,6 +1509,31 @@ debug:
> > page->freelist = object[page->offset];
> > slab_unlock(page);
> > return object;
> > +
> > +try_reserve:
> > + spin_lock(&reserve_lock);
> > + page = s->reserve_slab;
> > + if (!page) {
> > + spin_unlock(&reserve_lock);
> > + goto alloc_slab;
> > + }
> > +
> > + slab_lock(page);
> > + if (!page->freelist) {
> > + s->reserve_slab = NULL;
> > + spin_unlock(&reserve_lock);
> > + __deactivate_slab(s, page);
> replace with unfreeze slab.
>
> > + putback_slab(s, page);
>
> Putting back the slab twice.
__deactivate_slab() doesn't do putback_slab(), and now I see that whole
function isn't there anymore. unfreeze_slab() it is.
> > + goto alloc_slab;
> > + }
> > + spin_unlock(&reserve_lock);
> > +
> > +got_reserve:
> > + object = page->freelist;
> > + page->inuse++;
> > + page->freelist = object[page->offset];
> > + slab_unlock(page);
> > + return object;
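The pop at the end (object = page->freelist; page->freelist = object[page->offset]) works because a free object stores the link to the next free object inside itself, page->offset words in. A minimal userspace sketch of that encoding follows; the sizes and names are illustrative, this is not the actual slub.c:

```c
#include <stddef.h>

/* Toy model of SLUB's in-object freelist: each free object holds a
 * pointer to the next free object at word index 'offset' inside itself,
 * so an allocation is a two-line pop. Sizes and names are made up. */
#define NOBJ    4
#define OBJSIZE 32              /* bytes per object */

static char slab[NOBJ * OBJSIZE];
static void *freelist;
static const size_t offset = 0; /* link stored in the object's first word */

static void slab_init(void)
{
    size_t i;
    for (i = 0; i < NOBJ; i++) {
        void **obj = (void **)(slab + i * OBJSIZE);
        obj[offset] = (i + 1 < NOBJ) ? (void *)(slab + (i + 1) * OBJSIZE)
                                     : NULL;
    }
    freelist = slab;
}

static void *slab_pop(void)
{
    void **object = freelist;
    if (!object)
        return NULL;            /* slab exhausted */
    freelist = object[offset];  /* mirrors object[page->offset] above */
    return object;
}
```

The point of the encoding is that a slab needs no separate free-object array: the free objects themselves are the list.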
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:18 ` Peter Zijlstra
@ 2007-05-16 20:27 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 20:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> > So its no use on NUMA?
>
> It is, its just that we're swapping very heavily at that point, a
> bouncing cache-line will not significantly slow down the box compared to
> waiting for block IO, will it?
How does all of this interact with
1. cpusets
2. dma allocations and highmem?
3. Containers?
> > The problem here is that you may spinlock and take out the slab for one
> > cpu but then (AFAICT) other cpus can still not get their high priority
> > allocs satisfied. Some comments follow.
>
> All cpus are redirected to ->reserve_slab when the regular allocations
> start to fail.
And the reserve slab is refilled from page allocator reserves if needed?
> > But this is only working if we are using the slab after
> > explicitly flushing the cpuslabs. Otherwise the slab may be full and we
> > get to alloc_slab.
>
> /me fails to parse.
s->cpu[cpu] is only NULL if the cpu slab was flushed. This is a pretty
rare case likely not worth checking.
> > Remove the above two lines (they are wrong regardless) and simply make
> > this the cpu slab.
>
> It need not be the same node; the reserve_slab is node agnostic.
> So here the free page watermarks are good again, and we can forget all
> about the ->reserve_slab. We just push it on the free/partial lists and
> forget about it.
>
> But like you said above: unfreeze_slab() should be good, since I don't
> use the lockless_freelist.
You could completely bypass the regular allocation functions and do
object = s->reserve_slab->freelist;
s->reserve_slab->freelist = object[s->reserve_slab->offset];
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:27 ` Christoph Lameter
@ 2007-05-16 20:40 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 20:40 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 13:27 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > > So its no use on NUMA?
> >
> > It is, its just that we're swapping very heavily at that point, a
> > bouncing cache-line will not significantly slow down the box compared to
> > waiting for block IO, will it?
>
> How does all of this interact with
>
> 1. cpusets
>
> 2. dma allocations and highmem?
>
> 3. Containers?
Much like the normal kmem_cache would do; I'm not changing any of the
page allocation semantics.
For containers it could be that the machine is not actually swapping, but
the container will be in dire straits.
> > > The problem here is that you may spinlock and take out the slab for one
> > > cpu but then (AFAICT) other cpus can still not get their high priority
> > > allocs satisfied. Some comments follow.
> >
> > All cpus are redirected to ->reserve_slab when the regular allocations
> > start to fail.
>
> And the reserve slab is refilled from page allocator reserves if needed?
Yes, using new_slab(), exactly as it would normally be.
> > > But this is only working if we are using the slab after
> > > explicitly flushing the cpuslabs. Otherwise the slab may be full and we
> > > get to alloc_slab.
> >
> > /me fails to parse.
>
> s->cpu[cpu] is only NULL if the cpu slab was flushed. This is a pretty
> rare case likely not worth checking.
Ah, right:
- !page || !page->freelist,
- and no available partial slabs;
then we try the reserve (if we're entitled).
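The resulting fallback order can be condensed into a small decision function; a hedged userspace sketch with invented names, not the patched slub.c:

```c
#include <stdbool.h>

/* Order of preference when __slab_alloc needs objects, as discussed
 * above. All names here are illustrative. */
enum alloc_path { CPU_SLAB, PARTIAL, RESERVE, NEW_SLAB, FAIL };

static enum alloc_path pick_path(bool cpu_usable,    /* page && page->freelist */
                                 bool partial_avail, /* partial lists non-empty */
                                 bool entitled,      /* ALLOC_NO_WATERMARKS set */
                                 bool reserve_usable,/* reserve_slab has objects */
                                 bool new_slab_ok)   /* page allocator succeeds */
{
    if (cpu_usable)
        return CPU_SLAB;
    if (partial_avail)
        return PARTIAL;
    if (entitled && reserve_usable)
        return RESERVE;     /* only watermark-ignoring callers get here */
    if (new_slab_ok)
        return NEW_SLAB;    /* an empty/absent reserve falls through here */
    return FAIL;
}
```

Note how non-entitled callers skip straight from the partial lists to a fresh slab, so the reserve is never consumed by ordinary allocations.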
> > > Remove the above two lines (they are wrong regardless) and simply make
> > > this the cpu slab.
> >
> > It need not be the same node; the reserve_slab is node agnostic.
> > So here the free page watermarks are good again, and we can forget all
> > about the ->reserve_slab. We just push it on the free/partial lists and
> > forget about it.
> >
> > But like you said above: unfreeze_slab() should be good, since I don't
> > use the lockless_freelist.
>
> You could completely bypass the regular allocation functions and do
>
> object = s->reserve_slab->freelist;
> s->reserve_slab->freelist = object[s->reserve_slab->offset];
That is basically what happens at the end; if an object is returned from
the reserve slab.
But its wanted to try the normal cpu_slab path first to detect that the
situation has subsided and we can resume normal operation.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:40 ` Peter Zijlstra
@ 2007-05-16 20:44 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 20:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> > How does all of this interact with
> >
> > 1. cpusets
> >
> > 2. dma allocations and highmem?
> >
> > 3. Containers?
>
> Much like the normal kmem_cache would do; I'm not changing any of the
> page allocation semantics.
So if we run out of memory on a cpuset then network I/O will still fail?
I do not see any distinction between DMA and regular memory. If we need
DMA memory to complete the transaction then this won't work?
> But its wanted to try the normal cpu_slab path first to detect that the
> situation has subsided and we can resume normal operation.
Is there some indicator somewhere that indicates that we are in trouble? I
just see the ranks.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:44 ` Christoph Lameter
@ 2007-05-16 20:54 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 20:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 13:44 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > > How does all of this interact with
> > >
> > > 1. cpusets
> > >
> > > 2. dma allocations and highmem?
> > >
> > > 3. Containers?
> >
> > Much like the normal kmem_cache would do; I'm not changing any of the
> > page allocation semantics.
>
> So if we run out of memory on a cpuset then network I/O will still fail?
>
> I do not see any distinction between DMA and regular memory. If we need
> DMA memory to complete the transaction then this wont work?
If network relies on slabs that are cpuset constrained and the page
allocator reserves do not match that, then yes, it goes bang.
> > But its wanted to try the normal cpu_slab path first to detect that the
> > situation has subsided and we can resume normal operation.
>
> Is there some indicator somewhere that indicates that we are in trouble? I
> just see the ranks.
Yes, and page->rank will only ever be 0 if the page was allocated with
ALLOC_NO_WATERMARKS, and that only ever happens if we're in dire
straits and entitled to it.
Otherwise it'll be ALLOC_WMARK_MIN or somesuch.
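In other words, the distress indicator is carried per page by its rank; a toy restatement of that rule, where the flag values are invented purely for illustration:

```c
/* Toy encoding of the rule above: rank 0 <=> the page could only be
 * had by ignoring the watermarks entirely. Flag values are made up. */
#define ALLOC_NO_WATERMARKS 0x04
#define ALLOC_WMARK_MIN     0x01

static int rank_of(int alloc_flags)
{
    if (alloc_flags & ALLOC_NO_WATERMARKS)
        return 0;               /* emergency: reserves were tapped */
    return alloc_flags;         /* e.g. ALLOC_WMARK_MIN: normal path */
}

static int in_distress(int rank)
{
    return rank == 0;
}
```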
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:54 ` Peter Zijlstra
@ 2007-05-16 20:59 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 20:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> > I do not see any distinction between DMA and regular memory. If we need
> > DMA memory to complete the transaction then this wont work?
>
> If network relies on slabs that are cpuset constrained and the page
> allocator reserves do not match that, then yes, it goes bang.
So if I put a 32 bit network card in a 64 bit system -> bang?
> > Is there some indicator somewhere that indicates that we are in trouble? I
> > just see the ranks.
>
> Yes, and page->rank will only ever be 0 if the page was allocated with
> ALLOC_NO_WATERMARKS, and that only ever happens if we're in dire
> straits and entitled to it.
>
> Otherwise it'll be ALLOC_WMARK_MIN or somesuch.
How do we know that we are out of trouble? Just try another alloc and see? If
that is the case then we may be failing allocations after the memory
situation has cleared up.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 20:59 ` Christoph Lameter
@ 2007-05-16 21:04 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 21:04 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 13:59 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > > I do not see any distinction between DMA and regular memory. If we need
> > > DMA memory to complete the transaction then this wont work?
> >
> > If network relies on slabs that are cpuset constrained and the page
> > allocator reserves do not match that, then yes, it goes bang.
>
> So if I put a 32 bit network card in a 64 bit system -> bang?
I hope the network stack already uses the appropriate allocator flags.
If the slab was GFP_DMA that doesn't change, the ->reserve_slab will
still be GFP_DMA.
> > > Is there some indicator somewhere that indicates that we are in trouble? I
> > > just see the ranks.
> >
> > Yes, and page->rank will only ever be 0 if the page was allocated with
> > ALLOC_NO_WATERMARKS, and that only ever happens if we're in dire
> > straights and entitled to it.
> >
> > Otherwise it'll be ALLOC_WMARK_MIN or somesuch.
>
> How we know that we are out of trouble? Just try another alloc and see? If
> that is the case then we may be failing allocations after the memory
> situation has cleared up.
No, no: on each regular allocation we retry populating ->cpu_slab with
a new slab. If that works, we're out of the woods and the ->reserve_slab
is cleaned up.
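That recovery rule (drop the reserve the moment a regular allocation succeeds again) can be modeled as a tiny state machine; a hypothetical userspace sketch, not the kernel code:

```c
#include <stdbool.h>

/* Minimal model: the cache holds a reserve only while regular
 * (watermark-obeying) allocations keep failing; the first regular
 * success cleans it up. Names are illustrative. */
struct toy_cache { bool have_reserve; };

/* returns true if the allocation could be satisfied */
static bool toy_alloc(struct toy_cache *s, bool regular_alloc_ok)
{
    if (regular_alloc_ok) {
        s->have_reserve = false; /* pressure is over: release the reserve */
        return true;
    }
    /* regular path failed: fall back to the reserve slab, if present */
    return s->have_reserve;
}
```

Because the regular path is retried on every allocation, recovery is detected immediately rather than on some timer.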
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 21:04 ` Peter Zijlstra
@ 2007-05-16 21:13 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 21:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> > How do we know that we are out of trouble? Just try another alloc and see? If
> > that is the case then we may be failing allocations after the memory
> > situation has cleared up.
> No, no, for each regular allocation we retry to populate ->cpu_slab with
> a new slab. If that works we're out of the woods and the ->reserve_slab
> is cleaned up.
Hmmm.. so we could simplify the scheme by storing the last rank
somewhere.
If the alloc has less priority and we can extend the slab then
clear up the situation.
If we cannot extend the slab then the alloc must fail.
Could you put the rank into the page flags? On 64 bit at least there
should be enough space.
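The proposed simplification can be sketched as a rough user-space model (the rank values, rank_alloc() and extend_slab() here are hypothetical, chosen only to illustrate the compare/extend/fail logic):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: remember the "rank" (watermark level) of the current slab
 * page instead of keeping a separate reserve slab. Rank 0 models
 * ALLOC_NO_WATERMARKS; higher ranks are less privileged. */

#define RANK_RESERVE 0          /* page obtained below the watermarks */
#define RANK_NORMAL  1          /* page met the normal watermarks */

static int cache_rank = RANK_RESERVE;  /* rank of the current slab page */

/* extend_slab() returns the rank a fresh page was allocated at, or -1
 * when even the reserves cannot supply one. */
static void *rank_alloc(int request_rank, int (*extend_slab)(void))
{
    static char object;

    if (cache_rank >= request_rank)
        return &object;         /* rank is okay: give them the object */

    int r = extend_slab();      /* less priority: try to extend the slab */
    if (r >= request_rank) {
        cache_rank = r;         /* situation cleared up: clear the rank */
        return &object;
    }
    return NULL;                /* cannot extend: the alloc must fail */
}

static int extend_fails(void) { return -1; }
static int extend_works(void) { return RANK_NORMAL; }
```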
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 21:13 ` Christoph Lameter
@ 2007-05-16 21:20 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-16 21:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 14:13 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
> > > How we know that we are out of trouble? Just try another alloc and see? If
> > > that is the case then we may be failing allocations after the memory
> > > situation has cleared up.
> > No, no, for each regular allocation we retry to populate ->cpu_slab with
> > a new slab. If that works we're out of the woods and the ->reserve_slab
> > is cleaned up.
>
> Hmmm.. so we could simplify the scheme by storing the last rank
> > somewhere.
Not sure how that would help..
> If the alloc has less priority and we can extend the slab then
> clear up the situation.
>
> If we cannot extend the slab then the alloc must fail.
That is exactly what is done; and as mpm remarked the other day, it's a
binary system; we don't need full gfp fairness, just ALLOC_NO_WATERMARKS.
And that is already found in ->reserve_slab; if present the last
allocation needed it; if not the last allocation was good.
> Could you put the rank into the page flags? On 64 bit at least there
> should be enough space.
Currently I stick the newly allocated page's rank in page->rank (yet
another overload of page->index). I've not yet seen the need to keep it
around longer.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 21:20 ` Peter Zijlstra
@ 2007-05-16 21:42 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-16 21:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 16 May 2007, Peter Zijlstra wrote:
> > Hmmm.. so we could simplify the scheme by storing the last rank
> > somewhere.
>
> Not sure how that would help..
One does not have a way of determining the current process's
priority? Just need to do an alloc?
If we had the current process's "rank" then we could simply compare.
If the rank is okay, give them the object. If not, try to extend the slab.
If that succeeds, clear the rank. If extending fails, fail the alloc. There
would be no need for a reserve slab.
What worries me about this whole thing is
1. It is designed to fail an allocation rather than guarantee that all
succeed. Is it not possible to figure out which processes are not
essential and simply put them to sleep until the situation clears up?
2. It seems to be based on global ordering of allocations which is
not possible given large systems and the relativistic constraints
of physics. Ordering of events gets more expensive the bigger the
system is.
How does this system work if you can only order events within
a processor? Or within a node? Within a zone?
3. I do not see how this integrates with other allocation constraints:
DMA constraints, cpuset constraints, memory node constraints,
GFP_THISNODE, MEMALLOC, GFP_HIGH.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-14 13:19 ` Peter Zijlstra
@ 2007-05-17 3:02 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 3:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> In the interest of creating a reserve based allocator; we need to make the slab
> allocator (*sigh*, all three) fair with respect to GFP flags.
>
> That is, we need to protect memory from being used by easier gfp flags than it
> was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
> GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
> possible with the current allocators.
And the solution is to fail the allocation of the process which tries to
walk away with it. The failing allocation will lead to the killing of the
process, right?
We already have an OOM killer which potentially kills random processes. We
hate it.
Could you please modify the patchset to *avoid* failure conditions. This
patchset here only manages failure conditions. The system should not get
into the failure conditions in the first place! For that purpose you may
want to put processes to sleep etc. But in order to do so you need to
figure out which processes you need to make progress.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 3:02 ` Christoph Lameter
@ 2007-05-17 7:08 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 7:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Wed, 2007-05-16 at 20:02 -0700, Christoph Lameter wrote:
> On Mon, 14 May 2007, Peter Zijlstra wrote:
>
> >
> > In the interest of creating a reserve based allocator; we need to make the slab
> > allocator (*sigh*, all three) fair with respect to GFP flags.
> >
> > That is, we need to protect memory from being used by easier gfp flags than it
> > was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
> > GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
> > possible with the current allocators.
>
> And the solution is to fail the allocation of the process which tries to
> walk away with it. The failing allocation will lead to the killing of the
> process right?
Not necessarily, we have this fault injection system that can fail
allocations; that doesn't bring the processes down, now does it?
> Could you please modify the patchset to *avoid* failure conditions. This
> patchset here only manages failure conditions. The system should not get
> into the failure conditions in the first place! For that purpose you may
> want to put processes to sleep etc. But in order to do so you need to
> figure out which processes you need to make progress.
Those that have __GFP_WAIT set will go to sleep - or do whatever
__GFP_WAIT allocations do best; the other allocations must handle
failure anyway. (even __GFP_WAIT allocations must handle failure for
that matter)
I'm really not seeing why you're making such a fuss about it; normally
when you push the system this hard we're failing allocations left, right
and center too. It's just that the block IO path has some mempools which
allow it to write out some (swap) pages and slowly get back to sanity.
This really is not much different; the system is in dire need of
memory; those allocations that cannot sleep will fail, simple.
All I'm wanting to do is limit the reserve to PF_MEMALLOC processes;
those that are in charge of cleaning memory; not every other random
process that just wants to do its thing - that doesn't seem like a weird
thing to do at all.
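The entitlement rule being argued for can be modeled in a few lines (the flag values and may_use_page() are illustrative; the real kernel checks differ in detail):

```c
#include <assert.h>

/* Toy model of the entitlement test: memory allocated with
 * ALLOC_NO_WATERMARKS should only reach PF_MEMALLOC processes,
 * i.e. those in charge of cleaning memory. */

#define PF_MEMALLOC          0x1u  /* task is reclaiming/cleaning memory */
#define ALLOC_NO_WATERMARKS  0x1u  /* page taken from below the watermarks */

struct task { unsigned int flags; };

static int may_use_page(const struct task *t, unsigned int alloc_flags)
{
    if (!(alloc_flags & ALLOC_NO_WATERMARKS))
        return 1;               /* not reserve memory: anyone may have it */
    return !!(t->flags & PF_MEMALLOC);  /* reserve: memory cleaners only */
}
```

Under this rule a reclaiming task keeps access to the reserve, while an ordinary task allocating with the same cache simply fails once only reserve pages are left.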
* Re: [PATCH 0/5] make slab gfp fair
2007-05-16 21:42 ` Christoph Lameter
@ 2007-05-17 7:28 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 7:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Wed, 2007-05-16 at 14:42 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
>
> > > Hmmm.. so we could simplify the scheme by storing the last rank
> > > somewheres.
> >
> > Not sure how that would help..
>
> One does not have a way of determining the current processes
> priority? Just need to do an alloc?
We need that alloc anyway, to gauge the current memory pressure.
Sure, you could perhaps skip that for allocations that are entitled to
the reserve if we still have one; but I'm not sure that is worth the
bother.
> If we had the current process's "rank" then we could simply compare.
> If the rank is okay, give them the object. If not, try to extend the slab.
> If that succeeds, clear the rank. If extending fails, fail the alloc. There
> would be no need for a reserve slab.
>
> What worries me about this whole thing is
>
>
> 1. It is designed to fail an allocation rather than guarantee that all
> succeed. Is it not possible to figure out which processes are not
> essential and simply put them to sleep until the situation clears up?
Well, that is currently not done either (insofar as __GFP_WAIT
doesn't sleep indefinitely). When you run very low on memory, some
allocations just need to fail, there is nothing very magical about that,
the system seems to cope just fine. It happens today.
Disable the __GFP_NOWARN logic and create a swap storm, see what
happens.
> 2. It seems to be based on global ordering of allocations which is
> not possible given large systems and the relativistic constraints
> > of physics. Ordering of events gets more expensive the bigger the
> system is.
>
> > How does this system work if you can only order events within
> a processor? Or within a node? Within a zone?
/me fails again..
It's about ensuring ALLOC_NO_WATERMARKS memory only reaches PF_MEMALLOC
processes, not joe random's pi calculator.
> 3. I do not see how this integrates with other allocation constraints:
> DMA constraints, cpuset constraints, memory node constraints,
> GFP_THISNODE, MEMALLOC, GFP_HIGH.
It works exactly as it used to; if you can currently get out of a swap
storm you still can.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 7:08 ` Peter Zijlstra
@ 2007-05-17 17:29 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 17:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Thu, 17 May 2007, Peter Zijlstra wrote:
> > I'm really not seeing why you're making such a fuss about it; normally
> > when you push the system this hard we're failing allocations left, right
> > and center too. It's just that the block IO path has some mempools which
> allow it to write out some (swap) pages and slowly get back to sanity.
I am weirdly confused by these patches. Among other things you told me
that the performance does not matter since it's never (or rarely) being
used (why do it then?). Then we do these strange swizzles with reserve
slabs that may contain an indeterminate number of objects.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 7:28 ` Peter Zijlstra
@ 2007-05-17 17:30 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 17:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 17 May 2007, Peter Zijlstra wrote:
> > 2. It seems to be based on global ordering of allocations which is
> > not possible given large systems and the relativistic constraints
> > of physics. Ordering of events gets more expensive the bigger the
> > system is.
> >
> > How does this system work if you can only order events within
> > a processor? Or within a node? Within a zone?
>
> /me fails again..
>
> It's about ensuring ALLOC_NO_WATERMARKS memory only reaches PF_MEMALLOC
> processes, not joe random's pi calculator.
Watermarks are per zone?
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:29 ` Christoph Lameter
@ 2007-05-17 17:52 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 17:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Thu, 2007-05-17 at 10:29 -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Peter Zijlstra wrote:
>
> > I'm really not seeing why you're making such a fuss about it; normally
> > when you push the system this hard we're failing allocations left, right
> > and center too. It's just that the block IO path has some mempools which
> > allow it to write out some (swap) pages and slowly get back to sanity.
>
> I am weirdly confused by these patches. Among other things you told me
> that the performance does not matter since its never (or rarely) being
> used (why do it then?).
When we are very low on memory and do access the reserves by means of
ALLOC_NO_WATERMARKS, we want to prevent processes that are not entitled
to such memory from running away with the little we have.
That is the whole and only point; restrict memory allocated under
ALLOC_NO_WATERMARKS to those processes that are entitled to it.
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:29 ` Christoph Lameter
@ 2007-05-17 17:53 ` Matt Mackall
-1 siblings, 0 replies; 138+ messages in thread
From: Matt Mackall @ 2007-05-17 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, linux-kernel, linux-mm, Thomas Graf,
David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, May 17, 2007 at 10:29:06AM -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Peter Zijlstra wrote:
>
> > I'm really not seeing why you're making such a fuss about it; normally
> > when you push the system this hard we're failing allocations left, right
> > and center too. It's just that the block I/O path has some mempools which
> > allow it to write out some (swap) pages and slowly get back to sanity.
>
> I am weirdly confused by these patches. Among other things you told me
> that the performance does not matter since it's never (or rarely) being
> used (why do it then?).
Because it's a failsafe.
Simply stated, the problem is sometimes it's impossible to free memory
without allocating more memory. Thus we must keep enough protected
reserve that we can guarantee progress. This is what mempools are for
in the regular I/O stack. Unfortunately, mempools are a bad match for
network I/O.
It's absolutely correct that performance doesn't matter in the case
this patch is addressing. All that matters is digging ourselves out of
OOM. The box either survives the crisis or it doesn't.
It's also correct that we should hardly ever get into a situation
where we trigger this problem. But such cases are still fairly easy to
trigger in some workloads. Swap over network is an excellent example,
because we typically don't start swapping heavily until we're quite
low on freeable memory.
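The mempool idea described above, a pre-filled stash consulted only when the normal allocator fails, can be sketched as below. The struct and functions are toy inventions for illustration; the real kernel interface is the mempool API in mm/mempool.c.

```c
#include <stdlib.h>
#include <stddef.h>

#define RESERVE 4 /* objects held back to guarantee forward progress */

/* Toy mempool: a stash filled while memory is plentiful. */
struct mempool {
    void *stash[RESERVE];
    int   navail;
};

void mempool_init(struct mempool *p, size_t objsz)
{
    p->navail = 0;
    for (int i = 0; i < RESERVE; i++) {
        void *obj = malloc(objsz);
        if (obj)
            p->stash[p->navail++] = obj;
    }
}

/* Try the regular allocator first; when it fails (simulated here by
 * the 'oom' flag), fall back to the reserved stash so the I/O needed
 * to free memory can still make progress. */
void *mempool_get(struct mempool *p, size_t objsz, int oom)
{
    void *obj = oom ? NULL : malloc(objsz);
    if (!obj && p->navail > 0)
        obj = p->stash[--p->navail];
    return obj;
}
```

The guarantee is exactly the "failsafe" property: as long as the stash holds at least one object, an allocation on the writeout path cannot deadlock waiting for memory it is itself supposed to free.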
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:30 ` Christoph Lameter
@ 2007-05-17 17:53 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 2007-05-17 at 10:30 -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Peter Zijlstra wrote:
>
> > > 2. It seems to be based on global ordering of allocations which is
> > > not possible given large systems and the relativistic constraints
> > > of physics. Ordering of events gets more expensive the bigger the
> > > system is.
> > >
> > > How does this system work if you can just order events within
> > > a processor? Or within a node? Within a zone?
> >
> > /me fails again..
> >
> > It's about ensuring ALLOC_NO_WATERMARKS memory only reaches PF_MEMALLOC
> > processes, not joe random's pi calculator.
>
> Watermarks are per zone?
Yes, but the page allocator might address multiple zones in order to
obtain a page.
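The point that watermarks are per zone while the allocator walks a list of zones can be modelled in a few lines. struct zone here is a two-field toy, not the kernel's; the -1 return marks the point where the real allocator would consider ALLOC_NO_WATERMARKS.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy per-zone state: free pages versus the zone's watermark. */
struct zone { long free_pages; long watermark; };

bool zone_ok(const struct zone *z)
{
    return z->free_pages > z->watermark;
}

/* Walk the zonelist in order and return the index of the first zone
 * above its watermark, or -1 if every candidate refused; only then
 * would the allocator fall back to dipping below watermarks. */
int pick_zone(const struct zone *zonelist, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (zone_ok(&zonelist[i]))
            return (int)i;
    return -1;
}
```

So even though each watermark is zone-local, the decision to touch reserves is made only after the whole zonelist for the allocation has been exhausted.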
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:52 ` Peter Zijlstra
@ 2007-05-17 17:59 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 17:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Thomas Graf, David Miller, Andrew Morton,
Daniel Phillips, Pekka Enberg, Matt Mackall
On Thu, 17 May 2007, Peter Zijlstra wrote:
> > I am weirdly confused by these patches. Among other things you told me
> > that the performance does not matter since it's never (or rarely) being
> > used (why do it then?).
>
> When we are very low on memory and do access the reserves by means of
> ALLOC_NO_WATERMARKS, we want to prevent processes that are not entitled to
> such memory from running away with the little we have.
For me, low-memory conditions are node- or zone-specific and may be
particular to certain allocation constraints. For some reason you have
this simplified global picture in mind.
The other statement is weird. It is bad to fail allocation attempts; they
may lead to a process being terminated. Memory should be reclaimed
earlier to avoid these situations.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:53 ` Peter Zijlstra
@ 2007-05-17 18:01 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 18:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 17 May 2007, Peter Zijlstra wrote:
> > > It's about ensuring ALLOC_NO_WATERMARKS memory only reaches PF_MEMALLOC
> > > processes, not joe random's pi calculator.
> >
> > Watermarks are per zone?
>
> Yes, but the page allocator might address multiple zones in order to
> obtain a page.
And then again it may not, because the allocation is constrained to a
particular node, a NORMAL zone or a DMA zone. One zone may be below the
watermark and another may not. Different allocations may be allowed to
tap into various zones for various reasons.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 17:53 ` Matt Mackall
@ 2007-05-17 18:02 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 18:02 UTC (permalink / raw)
To: Matt Mackall
Cc: Peter Zijlstra, linux-kernel, linux-mm, Thomas Graf,
David Miller, Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 17 May 2007, Matt Mackall wrote:
> Simply stated, the problem is sometimes it's impossible to free memory
> without allocating more memory. Thus we must keep enough protected
> reserve that we can guarantee progress. This is what mempools are for
> in the regular I/O stack. Unfortunately, mempools are a bad match for
> network I/O.
>
> It's absolutely correct that performance doesn't matter in the case
> this patch is addressing. All that matters is digging ourselves out of
> OOM. The box either survives the crisis or it doesn't.
Well, we fail allocations in order to do so, and these allocations may
even be nonatomic allocs. A pretty dangerous approach.
> It's also correct that we should hardly ever get into a situation
> where we trigger this problem. But such cases are still fairly easy to
> trigger in some workloads. Swap over network is an excellent example,
> because we typically don't start swapping heavily until we're quite
> low on freeable memory.
Is it not possible to avoid failing allocs? Instead put processes to
sleep? Run synchronous reclaim?
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 18:02 ` Christoph Lameter
@ 2007-05-17 19:18 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 19:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 2007-05-17 at 11:02 -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Matt Mackall wrote:
>
> > Simply stated, the problem is sometimes it's impossible to free memory
> > without allocating more memory. Thus we must keep enough protected
> > reserve that we can guarantee progress. This is what mempools are for
> > in the regular I/O stack. Unfortunately, mempools are a bad match for
> > network I/O.
> >
> > It's absolutely correct that performance doesn't matter in the case
> > this patch is addressing. All that matters is digging ourselves out of
> > OOM. The box either survives the crisis or it doesn't.
>
> Well, we fail allocations in order to do so, and these allocations may
> even be nonatomic allocs. A pretty dangerous approach.
These allocations didn't have a right to the memory they would otherwise
get. Also, they will end up in the page allocator just as they normally
would. So from that point, it's no different from what happens now; only
they will not eat away the very last bit of memory that could be used to
avoid deadlocking.
> > It's also correct that we should hardly ever get into a situation
> > where we trigger this problem. But such cases are still fairly easy to
> > trigger in some workloads. Swap over network is an excellent example,
> > because we typically don't start swapping heavily until we're quite
> > low on freeable memory.
>
> Is it not possible to avoid failing allocs? Instead put processes to
> sleep? Run synchronous reclaim?
That would radically change the way we do reclaim and would be much
harder to get right. Such things could be done independently of this.
The proposed patch doesn't change how the kernel functions at this
point; it just enforces an existing rule better.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 19:18 ` Peter Zijlstra
@ 2007-05-17 19:24 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 19:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 17 May 2007, Peter Zijlstra wrote:
> The proposed patch doesn't change how the kernel functions at this
> point; it just enforces an existing rule better.
Well, I'd say it controls the allocation failures. And that only works if
one can consider the system as having a single zone.
Let's say the system has two cpusets, A and B. A allocs from node 1 and B
allocs from node 2. Two processes, one in A and one in B, run on the same
processor.
Node 1 gets very low in memory so your patch kicks in and sets up the
global memory emergency situation with the reserve slab.
Now the process in B will either fail although it has plenty of memory on
node 2.
Or it may just clear the emergency slab and then the next critical alloc
of the process in A that is low on memory will fail.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 19:24 ` Christoph Lameter
@ 2007-05-17 21:26 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-17 21:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 2007-05-17 at 12:24 -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Peter Zijlstra wrote:
>
> > The proposed patch doesn't change how the kernel functions at this
> > point; it just enforces an existing rule better.
>
> Well I'd say it controls the allocation failures. And that only works if
> one can consider the system having a single zone.
>
> Lets say the system has two cpusets A and B. A allocs from node 1 and B
> allocs from node 2. Two processes one in A and one in B run on the same
> processor.
>
> Node 1 gets very low in memory so your patch kicks in and sets up the
> global memory emergency situation with the reserve slab.
>
> Now the process in B will either fail although it has plenty of memory on
> node 2.
>
> Or it may just clear the emergency slab and then the next critical alloc
> of the process in A that is low on memory will fail.
The way I read the cpuset page allocator, it will only respect the
cpuset if there is memory aplenty. Otherwise it will grab whatever. So
still, it will only ever use ALLOC_NO_WATERMARKS if the whole system is
in distress.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 21:26 ` Peter Zijlstra
@ 2007-05-17 21:44 ` Paul Jackson
-1 siblings, 0 replies; 138+ messages in thread
From: Paul Jackson @ 2007-05-17 21:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: clameter, mpm, linux-kernel, linux-mm, tgraf, davem, akpm,
phillips, penberg
> The way I read the cpuset page allocator, it will only respect the
> cpuset if there is memory aplenty. Otherwise it will grab whatever. So
> still, it will only ever use ALLOC_NO_WATERMARKS if the whole system is
> in distress.
Wrong. Well, only a little right.
For allocations that can't fail (the kernel could die otherwise),
then yes, the kernel will eventually take any damn page it can find,
regardless of cpusets.
Allocations for user space are hardwall-enforced to be in the current
task's cpuset.
Allocations from interrupts ignore the current task's cpuset (such allocations
don't have a valid current context).
Most kernel-space allocations will try to fit in the
current task's cpuset, but may come from the possibly larger context of
the closest ancestor cpuset that is marked memory_exclusive.
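The four cases above can be summarized as a small decision function. The enum values and function name are invented for illustration; they are not kernel identifiers.

```c
/* Who is asking for the page, in terms of the cases above. */
enum alloc_ctx { USER_ALLOC, INTERRUPT_ALLOC, KERNEL_ALLOC, MUST_SUCCEED };

/* How far outside the task's cpuset the allocation may reach. */
enum scope { CURRENT_CPUSET, EXCLUSIVE_ANCESTOR, ANY_NODE };

enum scope cpuset_scope(enum alloc_ctx ctx)
{
    switch (ctx) {
    case USER_ALLOC:      return CURRENT_CPUSET;     /* hardwall enforced */
    case INTERRUPT_ALLOC: return ANY_NODE;           /* no valid task context */
    case KERNEL_ALLOC:    return EXCLUSIVE_ANCESTOR; /* may widen to ancestor */
    case MUST_SUCCEED:    return ANY_NODE;           /* kernel would die otherwise */
    }
    return ANY_NODE;
}
```

The point of contention in the thread is visible here: only the can't-fail and interrupt cases ignore cpusets entirely, so "the whole system is in distress" is not the only condition under which boundaries hold.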
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 21:26 ` Peter Zijlstra
@ 2007-05-17 22:27 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-17 22:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 17 May 2007, Peter Zijlstra wrote:
> The way I read the cpuset page allocator, it will only respect the
> cpuset if there is memory aplenty. Otherwise it will grab whatever. So
> still, it will only ever use ALLOC_NO_WATERMARKS if the whole system is
> in distress.
Sorry, no. The purpose of a cpuset is to limit memory for an application.
If the boundaries were fluid, we would not need cpusets.
But the same principles also apply to allocations from different zones in an
SMP system. There are 4 zones, DMA, DMA32, NORMAL and HIGHMEM, and we have
general slabs for DMA and NORMAL. A slab that uses zone NORMAL falls back
to DMA32 and DMA depending on the watermarks of the 3 zones. So a
ZONE_NORMAL slab can exhaust memory available for ZONE_DMA.
Again, the question is: the watermarks of which zone? In the case of a
ZONE_NORMAL allocation you have 3 to pick from. Is it the last one? Then it's
the same as ZONE_DMA, and you get a collision with the corresponding
DMA slab. Depending on which zone the system decides to allocate the
page from, you may get a different watermark situation.
On x86_64 systems you have the additional complication that there are
even multiple DMA32 or NORMAL zones per node. Some will have DMA32 and
NORMAL, others DMA32 alone or NORMAL alone. Which watermarks are we
talking about?
The use of ALLOC_NO_WATERMARKS depends on the constraints of the allocation
in all cases. You can only compare the stress level (rank?) of allocations
that have the same allocation constraints. The allocation constraints are
a result of gfp flags, cpuset configuration and memory policies in effect.
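The objection that ranks are only comparable under equal constraints can be made concrete. The constraint tuple and the -2 "incomparable" convention below are invented for illustration, not part of the patch under discussion.

```c
/* Toy constraint tuple: the three inputs named above. */
struct alloc_constraints {
    unsigned gfp_zone_mask; /* which zones the gfp flags allow */
    int      cpuset_id;     /* cpuset the task is confined to */
    int      mempolicy_id;  /* NUMA memory policy in effect */
};

int constraints_equal(const struct alloc_constraints *a,
                      const struct alloc_constraints *b)
{
    return a->gfp_zone_mask == b->gfp_zone_mask &&
           a->cpuset_id == b->cpuset_id &&
           a->mempolicy_id == b->mempolicy_id;
}

/* Compare two allocation ranks; returns -1/0/1 like a comparator,
 * or -2 when the constraints differ and the comparison is meaningless. */
int compare_rank(const struct alloc_constraints *a, int rank_a,
                 const struct alloc_constraints *b, int rank_b)
{
    if (!constraints_equal(a, b))
        return -2;
    return (rank_a > rank_b) - (rank_a < rank_b);
}
```

A single global "emergency" state collapses all of these tuples into one, which is exactly what the objection says cannot work on a zoned NUMA system.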
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-17 22:27 ` Christoph Lameter
@ 2007-05-18 9:54 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-18 9:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Thu, 2007-05-17 at 15:27 -0700, Christoph Lameter wrote:
> On Thu, 17 May 2007, Peter Zijlstra wrote:
>
> > The way I read the cpuset page allocator, it will only respect the
> > cpuset if there is memory aplenty. Otherwise it will grab whatever. So
> > still, it will only ever use ALLOC_NO_WATERMARKS if the whole system is
> > in distress.
>
> Sorry no. The purpose of the cpuset is to limit memory for an application.
> If the boundaries would be fluid then we would not need cpusets.
Right, I see that I missed an ALLOC_CPUSET yesterday; but like Paul
said, cpusets are ignored when in dire straits for a kernel alloc.
Just not enough to make inter-cpuset interaction on slabs go away wrt
ALLOC_NO_WATERMARKS :-/
> But the same principles also apply for allocations to different zones in a
> SMP system. There are 4 zones DMA DMA32 NORMAL and HIGHMEM and we have
> general slabs for DMA and NORMAL. A slab that uses zone NORMAL falls back
> to DMA32 and DMA depending on the watermarks of the 3 zones. So a
> ZONE_NORMAL slab can exhaust memory available for ZONE_DMA.
>
> Again the question is: the watermarks of which zone? In the case of a
> ZONE_NORMAL allocation you have 3 to pick from. Is it the last one? Then it is
> the same as ZONE_DMA, and you get a collision with the corresponding
> DMA slab. Depending on which zone the system decides to allocate the
> page from, you may get a different watermark situation.
Isn't the zone mask the same for all allocations from a specific slab?
If so, then the slab-wide ->reserve_slab will still do the right thing
(barring cpusets).
> On x86_64 systems you have the additional complication that there are
> even multiple DMA32 or NORMAL zones per node. Some will have DMA32 and
> NORMAL, others DMA32 alone or NORMAL alone. Which watermarks are we
> talking about?
Watermarks as used by the page allocator, given the slab's zone mask.
The page allocator will only fall back to ALLOC_NO_WATERMARKS when all
target zones are exhausted.
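That ordering can be sketched as a two-pass attempt (a toy model with made-up numbers, not the actual page allocator): first try every target zone within its watermark, and only if all of them refuse retry with watermarks ignored, so reserve pages are touched last.

```c
#include <assert.h>
#include <stdbool.h>

#define ALLOC_WMARK_MIN		0x04
#define ALLOC_NO_WATERMARKS	0x20

struct zone { long free_pages; long pages_min; };

static bool zone_ok(const struct zone *z, int alloc_flags)
{
	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return z->free_pages > 0;	/* dip into the reserve */
	return z->free_pages > z->pages_min;	/* respect the watermark */
}

/* Return the index of the zone the page came from, or -1 on failure. */
static int get_page(struct zone *zones, int nr, int alloc_flags)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (zone_ok(&zones[i], alloc_flags)) {
			zones[i].free_pages--;
			return i;
		}
	}
	return -1;
}

static int allocate(struct zone *zones, int nr)
{
	/* normal attempt first: all zones judged against pages_min */
	int i = get_page(zones, nr, ALLOC_WMARK_MIN);

	if (i < 0)	/* every target zone exhausted: last resort */
		i = get_page(zones, nr, ALLOC_NO_WATERMARKS);
	return i;
}
```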
> The use of ALLOC_NO_WATERMARKS depends on the constraints of the allocation
> in all cases. You can only compare the stress level (rank?) of allocations
> that have the same allocation constraints. The allocation constraints are
> a result of gfp flags,
The gfp zone mask is constant per slab, no? It has to be, because the zone
mask is only used when the slab is extended; other allocations live off
whatever was there before them.
> cpuset configuration and memory policies in effect.
Yes, I see now that these might become an issue, I will have to think on
this.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-18 9:54 ` Peter Zijlstra
@ 2007-05-18 17:11 ` Paul Jackson
-1 siblings, 0 replies; 138+ messages in thread
From: Paul Jackson @ 2007-05-18 17:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: clameter, mpm, linux-kernel, linux-mm, tgraf, davem, akpm,
phillips, penberg
Peter wrote:
> cpusets are ignored when in dire straits for a kernel alloc.
No - most kernel allocations never ignore cpusets.
The ones marked NOFAIL or ATOMIC can ignore cpusets in dire straits,
and the ones off interrupts lack an applicable cpuset context.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-18 9:54 ` Peter Zijlstra
@ 2007-05-18 17:11 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-18 17:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg
On Fri, 18 May 2007, Peter Zijlstra wrote:
> On Thu, 2007-05-17 at 15:27 -0700, Christoph Lameter wrote:
> Isn't the zone mask the same for all allocations from a specific slab?
> If so, then the slab wide ->reserve_slab will still dtrt (barring
> cpusets).
All allocations from a single slab have the same set of allowed zone
types. I.e. a DMA slab can access only ZONE_DMA; a regular slab
ZONE_NORMAL, ZONE_DMA32 and ZONE_DMA.
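In other words, the allowed zones are a fixed function of the cache's creation flags, so every slab page of a given cache has the same candidate zones. A minimal sketch of that invariant (`SLUB_DMA` stands in here for the real DMA-cache flag handling; the bit values are illustrative):

```c
#include <assert.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };
#define ZONE_BIT(z)	(1u << (z))
#define SLUB_DMA	0x1u	/* illustrative flag value */

/* The allowed-zone set depends only on the cache flags, never on the
 * individual allocation, so it is uniform across a kmem_cache. */
static unsigned int cache_allowed_zones(unsigned int cache_flags)
{
	if (cache_flags & SLUB_DMA)
		return ZONE_BIT(ZONE_DMA);	/* DMA slabs: ZONE_DMA only */
	return ZONE_BIT(ZONE_NORMAL) | ZONE_BIT(ZONE_DMA32) | ZONE_BIT(ZONE_DMA);
}
```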
> > On x86_64 systems you have the additional complication that there are
> > even multiple DMA32 or NORMAL zones per node. Some will have DMA32 and
> > NORMAL, others DMA32 alone or NORMAL alone. Which watermarks are we
> > talking about?
>
> Watermarks like used by the page allocator given the slabs zone mask.
> The page allocator will only fall back to ALLOC_NO_WATERMARKS when all
> target zones are exhausted.
That works if zones do not vary between slab requests. So on SMP (without
extra gfp flags) we may be fine. But see other concerns below.
> > The use of ALLOC_NO_WATERMARKS depends on the constraints of the allocation
> > in all cases. You can only compare the stress level (rank?) of allocations
> > that have the same allocation constraints. The allocation constraints are
> > a result of gfp flags,
>
> The gfp zone mask is constant per slab, no? It has to, because the zone
> mask is only used when the slab is extended, other allocations live off
> whatever was there before them.
The gfp zone mask is used to select the zones in an SMP config. But not in
a NUMA configuration, where the zones can come from multiple nodes.
Ok, in an SMP configuration the zones are determined by the allocation
flags. But then there are also the gfp flags that influence reclaim
behavior. These also have an influence on the memory pressure.
These are
__GFP_IO
__GFP_FS
__GFP_NOMEMALLOC
__GFP_NOFAIL
__GFP_NORETRY
__GFP_REPEAT
An allocation that can call into a filesystem or do I/O will have much
less memory pressure to contend with. Are the ranks for an allocation
with __GFP_IO|__GFP_FS really comparable to those of an allocation that
does not have these set?
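One way to honor that objection would be to only compare ranks between allocations whose reclaim-related gfp bits match, by folding those bits into a context key. This is a purely hypothetical helper, not part of the patch, and the flag values below are illustrative only:

```c
#include <assert.h>

/* illustrative bit values, not the kernel's */
#define __GFP_IO	0x40u
#define __GFP_FS	0x80u
#define __GFP_NOFAIL	0x800u

/* Fold the reclaim-affecting gfp bits into an allocation-context key. */
static unsigned int alloc_context_key(unsigned int gfp_mask)
{
	return gfp_mask & (__GFP_IO | __GFP_FS | __GFP_NOFAIL);
}

/* Ranks would only be meaningful between allocations in the same context. */
static int ranks_comparable(unsigned int gfp_a, unsigned int gfp_b)
{
	return alloc_context_key(gfp_a) == alloc_context_key(gfp_b);
}
```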
> > cpuset configuration and memory policies in effect.
>
> Yes, I see now that these might become an issue, I will have to think on
> this.
Note that we have not yet investigated what weird effects memory policy
constraints can have on this. There are issues with memory policies only
applying to certain zones...
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-18 17:11 ` Christoph Lameter
@ 2007-05-20 8:39 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-20 8:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson
Ok, full reset.
I care about kernel allocations only. In particular about those that
have PF_MEMALLOC semantics.
The thing I need is that any memory allocated below
ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER
is only ever used by processes that have ALLOC_NO_WATERMARKS rights;
for the duration of the distress.
What this patch does:
- change the page allocator to try ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER
if ALLOC_NO_WATERMARKS, before the actual ALLOC_NO_WATERMARKS alloc
- set page->reserve nonzero for each page allocated with
ALLOC_NO_WATERMARKS; which by the previous point implies that all
available zones are below ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER
- when a page->reserve slab is allocated store it in s->reserve_slab
and do not update the ->cpu_slab[] (this forces subsequent allocs to
retry the allocation).
All ALLOC_NO_WATERMARKS enabled slab allocations are served from
->reserve_slab, up until the point where a !page->reserve slab alloc
succeeds, at which point the ->reserve_slab is pushed into the partial
lists and ->reserve_slab set to NULL.
Since only the allocation of a new slab uses the gfp zone flags and
placement hints, these have to be uniform over all slab
allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
status is kmem_cache wide.
Any holes left?
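The reserve_slab handover described above can be modeled as a tiny state machine (toy code for review purposes, not the SLUB patch itself): while pages come from below the watermarks, only PF_MEMALLOC-style callers may take objects from the reserve slab, and the first successful non-reserve allocation releases it.

```c
#include <assert.h>
#include <stdbool.h>

struct kmem_cache_model {
	bool reserve_slab;	/* do we currently hold a reserve slab? */
};

/* caller_has_memalloc: caller would get ALLOC_NO_WATERMARKS rights.
 * page_reserve: the backing page came from below the watermarks.
 * Returns true if the allocation is served. */
static bool slab_alloc_model(struct kmem_cache_model *s,
			     bool caller_has_memalloc, bool page_reserve)
{
	if (s->reserve_slab && caller_has_memalloc)
		return true;		/* served from the reserve slab */

	if (!page_reserve) {
		s->reserve_slab = false;	/* !reserve alloc succeeded: release */
		return true;
	}

	/* page came from the reserves: arm (or keep) the reserve slab and
	 * refuse callers without ALLOC_NO_WATERMARKS rights */
	s->reserve_slab = true;
	return caller_has_memalloc;
}
```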
---
Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h
+++ linux-2.6-git/mm/internal.h
@@ -12,6 +12,7 @@
#define __MM_INTERNAL_H
#include <linux/mm.h>
+#include <linux/hardirq.h>
static inline void set_page_count(struct page *page, int v)
{
@@ -37,4 +38,50 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
#endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -1175,14 +1175,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1494,6 +1486,7 @@ zonelist_scan:
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
+ page->reserve = (alloc_flags & ALLOC_NO_WATERMARKS);
break;
this_zone_full:
if (NUMA_BUILD)
@@ -1619,48 +1612,36 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ /*
+ * Before going bare metal, try to get a page above the
+ * critical threshold - ignoring CPU sets.
+ */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ ALLOC_WMARK_MIN|ALLOC_HIGH|ALLOC_HARDER);
+ if (page)
+ goto got_pg;
+
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order,
zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ if (page)
+ goto got_pg;
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1669,6 +1650,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
Index: linux-2.6-git/include/linux/mm_types.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_types.h
+++ linux-2.6-git/include/linux/mm_types.h
@@ -60,6 +60,7 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+ int reserve; /* page_alloc: page is a reserve page */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -46,6 +46,8 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
static void sysfs_slab_remove(struct kmem_cache *s) {}
#endif
+static DEFINE_SPINLOCK(reserve_lock);
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1395,6 +1400,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;
if (!page)
goto new_slab;
@@ -1424,10 +1430,25 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;
- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
+ if (page && !reserve) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ unfreeze_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1455,6 +1476,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ goto got_reserve;
+ }
+ slab_lock(page);
+ SetSlabFrozen(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+ goto use_reserve;
}
return NULL;
debug:
@@ -1470,6 +1503,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+got_reserve:
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ unfreeze_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+use_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}
/*
@@ -1807,10 +1865,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
/* new_slab() disables interrupts */
local_irq_enable();
@@ -2018,6 +2077,8 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
^ permalink raw reply [flat|nested] 138+ messages in thread
s->ctor(object, s, 0);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1395,6 +1400,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;
if (!page)
goto new_slab;
@@ -1424,10 +1430,25 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;
- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
+ if (page && !reserve) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ unfreeze_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1455,6 +1476,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ goto got_reserve;
+ }
+ slab_lock(page);
+ SetSlabFrozen(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+ goto use_reserve;
}
return NULL;
debug:
@@ -1470,6 +1503,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+got_reserve:
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ unfreeze_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+use_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}
/*
@@ -1807,10 +1865,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);
/* new_slab() disables interrupts */
local_irq_enable();
@@ -2018,6 +2077,8 @@ static int kmem_cache_open(struct kmem_c
#ifdef CONFIG_NUMA
s->defrag_ratio = 100;
#endif
+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-20 8:39 ` Peter Zijlstra
@ 2007-05-21 16:45 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-21 16:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Sun, 20 May 2007, Peter Zijlstra wrote:
> I care about kernel allocations only. In particular about those that
> have PF_MEMALLOC semantics.
Hmmm... I wish I were more familiar with PF_MEMALLOC. CCing Nick.
> - set page->reserve nonzero for each page allocated with
> ALLOC_NO_WATERMARKS; which by the previous point implies that all
> available zones are below ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
Ok that adds a new field to the page struct. I suggested a page flag in
slub before.
> - when a page->reserve slab is allocated store it in s->reserve_slab
> and do not update the ->cpu_slab[] (this forces subsequent allocs to
> retry the allocation).
Right that should work.
> All ALLOC_NO_WATERMARKS enabled slab allocations are served from
> ->reserve_slab, up until the point where a !page->reserve slab alloc
> succeeds, at which point the ->reserve_slab is pushed into the partial
> lists and ->reserve_slab set to NULL.
So the original issue is still not fixed. A slab alloc may succeed without
watermarks if that particular allocation is restricted to a different set
of nodes. Then the reserve slab is dropped despite the memory scarcity on
another set of nodes?
> Since only the allocation of a new slab uses the gfp zone flags, and
> other allocations placement hints they have to be uniform over all slab
> allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
> status is kmem_cache wide.
No, the gfp zone flags are not uniform, and page allocator allocs made
through SLUB do not always have the same allocation constraints.
SLUB will check the node of the page that was allocated when the page
allocator returns and put the page into that node's slab list. This varies
depending on the allocation context.
Allocations can be particular to uses of a slab in particular situations.
A kmalloc cache can be used to allocate from various sets of nodes in
different circumstances. kmalloc will allow serving a limited number of
objects from the wrong nodes for performance reasons but the next
allocation from the page allocator (or from the partial lists) will occur
using the current set of allowed nodes in order to ensure a rough
obedience to the memory policies and cpusets. kmalloc_node behaves
differently and will enforce using memory from a particular node.
SLAB is very strict in that area and will not allow serving objects from
the wrong node even with only kmalloc. Changing policy will immediately
change the per node queue that SLAB takes its objects from.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 16:45 ` Christoph Lameter
@ 2007-05-21 19:33 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-21 19:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 2007-05-21 at 09:45 -0700, Christoph Lameter wrote:
> On Sun, 20 May 2007, Peter Zijlstra wrote:
>
> > I care about kernel allocations only. In particular about those that
> > have PF_MEMALLOC semantics.
>
> Hmmmm.. I wish I was more familiar with PF_MEMALLOC. ccing Nick.
>
> > - set page->reserve nonzero for each page allocated with
> > ALLOC_NO_WATERMARKS; which by the previous point implies that all
> > available zones are below ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
>
> Ok that adds a new field to the page struct. I suggested a page flag in
> slub before.
No it doesn't; it overloads page->index. It's just used as an extra return
value, and it need not be persistent. Definitely not worth a page flag.
> > - when a page->reserve slab is allocated store it in s->reserve_slab
> > and do not update the ->cpu_slab[] (this forces subsequent allocs to
> > retry the allocation).
>
> Right that should work.
>
> > All ALLOC_NO_WATERMARKS enabled slab allocations are served from
> > ->reserve_slab, up until the point where a !page->reserve slab alloc
> > succeeds, at which point the ->reserve_slab is pushed into the partial
> > lists and ->reserve_slab set to NULL.
>
> So the original issue is still not fixed. A slab alloc may succeed without
> watermarks if that particular allocation is restricted to a different set
> of nodes. Then the reserve slab is dropped despite the memory scarcity on
> another set of nodes?
I can't see how. This extra ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER alloc will
first deplete all other zones. Once that starts failing no node should
still have pages accessible by any allocation context other than
PF_MEMALLOC.
> > Since only the allocation of a new slab uses the gfp zone flags, and
> > other allocations placement hints they have to be uniform over all slab
> > allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
> > status is kmem_cache wide.
>
> No the gfp zone flags are not uniform and placement of page allocator
> allocs through SLUB do not always have the same allocation constraints.
It has to, since it can serve the allocation from a pre-existing slab
allocation; hence any page allocation must be valid for all other users.
> SLUB will check the node of the page that was allocated when the page
> allocator returns and put the page into that nodes slab list. This varies
> depending on the allocation context.
Yes, it keeps slabs on per-node lists. I'm just not seeing how this puts
hard constraints on the allocations.
As far as I can see there cannot be a hard constraint here, because
allocations from interrupt context are at best node local. And node-affine
zone lists still have all zones, just ordered on locality.
> Allocations can be particular to uses of a slab in particular situations.
> A kmalloc cache can be used to allocate from various sets of nodes in
> different circumstances. kmalloc will allow serving a limited number of
> objects from the wrong nodes for performance reasons but the next
> allocation from the page allocator (or from the partial lists) will occur
> using the current set of allowed nodes in order to ensure a rough
> obedience to the memory policies and cpusets. kmalloc_node behaves
> differently and will enforce using memory from a particular node.
From what I can see, it takes pretty much any page it can get once you
hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
the page can come from pretty much anywhere.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 19:33 ` Peter Zijlstra
@ 2007-05-21 19:43 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-21 19:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 21 May 2007, Peter Zijlstra wrote:
> > So the original issue is still not fixed. A slab alloc may succeed without
> > watermarks if that particular allocation is restricted to a different set
> > of nodes. Then the reserve slab is dropped despite the memory scarcity on
> > another set of nodes?
>
> I can't see how. This extra ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER alloc will
> first deplete all other zones. Once that starts failing no node should
> still have pages accessible by any allocation context other than
> PF_MEMALLOC.
This means we will disobey cpuset and memory policy constraints?
> > No the gfp zone flags are not uniform and placement of page allocator
> > allocs through SLUB do not always have the same allocation constraints.
>
> It has to; since it can serve the allocation from a pre-existing slab
> allocation. Hence any page allocation must be valid for all other users.
Why does it have to? This is not true.
> > SLUB will check the node of the page that was allocated when the page
> > allocator returns and put the page into that nodes slab list. This varies
> > depending on the allocation context.
>
> Yes, it keeps slabs on per node lists. I'm just not seeing how this puts
> hard constraints on the allocations.
The constraints come from the context of memory policies and cpusets. See
get_any_partial().
> As far as I can see there cannot be a hard constraint here, because
> allocations from interrupt context are at best node local. And node
> affine zone lists still have all zones, just ordered on locality.
Interrupt context is something different. If we do not have a process
context then no cpuset and memory policy constraints can apply since we
have no way of determining that. If you restrict your use of the reserve
cpuset to only interrupt allocs then we may indeed be fine.
> > Allocations can be particular to uses of a slab in particular situations.
> > A kmalloc cache can be used to allocate from various sets of nodes in
> > different circumstances. kmalloc will allow serving a limited number of
> > objects from the wrong nodes for performance reasons but the next
> > allocation from the page allocator (or from the partial lists) will occur
> > using the current set of allowed nodes in order to ensure a rough
> > obedience to the memory policies and cpusets. kmalloc_node behaves
> > differently and will enforce using memory from a particular node.
>
> From what I can see, it takes pretty much any page it can get once you
> hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
> the page can come from pretty much anywhere.
No it cannot. Once the current cpu slab is exhausted (which can happen at
any time) it will enforce the contextual allocation constraints. See
get_any_partial() in slub.c.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 19:43 ` Christoph Lameter
@ 2007-05-21 20:08 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-21 20:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 2007-05-21 at 12:43 -0700, Christoph Lameter wrote:
> On Mon, 21 May 2007, Peter Zijlstra wrote:
>
> > > So the original issue is still not fixed. A slab alloc may succeed without
> > > watermarks if that particular allocation is restricted to a different set
> > > of nodes. Then the reserve slab is dropped despite the memory scarcity on
> > > another set of nodes?
> >
> > I can't see how. This extra ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER alloc will
> > first deplete all other zones. Once that starts failing no node should
> > still have pages accessible by any allocation context other than
> > PF_MEMALLOC.
>
> This means we will disobey cpuset and memory policy constraints?
From what I can make of it, yes. Although I'm a bit hazy on the
mempolicy code.
Note that disobeying these constraints is not new behaviour. PF_MEMALLOC
needs a page, and will get a page, no matter what the cost.
> > > No the gfp zone flags are not uniform and placement of page allocator
> > > allocs through SLUB do not always have the same allocation constraints.
> >
> > It has to; since it can serve the allocation from a pre-existing slab
> > allocation. Hence any page allocation must be valid for all other users.
>
> Why does it have to? This is not true.
Say the slab gets allocated by an allocation from interrupt context; no
cpuset, no policy. This same slab must be valid for whatever allocation
comes next, right? Regardless of whatever policy or GFP_ flags are in
effect for that allocation.
> > > SLUB will check the node of the page that was allocated when the page
> > > allocator returns and put the page into that nodes slab list. This varies
> > > depending on the allocation context.
> >
> > Yes, it keeps slabs on per node lists. I'm just not seeing how this puts
> > hard constraints on the allocations.
>
> The constraints come from the context of memory policies and cpusets. See
> get_any_partial().
But get_partial() will only be called once the cpu_slab is full; up until
that point you have to make do with whatever is there.
> > As far as I can see there cannot be a hard constraint here, because
> > allocations from interrupt context are at best node local. And node
> > affine zone lists still have all zones, just ordered on locality.
>
> Interrupt context is something different. If we do not have a process
> context then no cpuset and memory policy constraints can apply since we
> have no way of determining that. If you restrict your use of the reserve
> cpuset to only interrupt allocs then we may indeed be fine.
No, what I'm saying is that if the slab gets refilled from interrupt
context the next process context alloc will have to work with whatever
the interrupt left behind. Hence there is no hard constraint.
> > > Allocations can be particular to uses of a slab in particular situations.
> > > A kmalloc cache can be used to allocate from various sets of nodes in
> > > different circumstances. kmalloc will allow serving a limited number of
> > > objects from the wrong nodes for performance reasons but the next
> > > allocation from the page allocator (or from the partial lists) will occur
> > > using the current set of allowed nodes in order to ensure a rough
> > > obedience to the memory policies and cpusets. kmalloc_node behaves
> > > differently and will enforce using memory from a particular node.
> >
> > From what I can see, it takes pretty much any page it can get once you
> > hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
> > the page can come from pretty much anywhere.
>
> No it cannot. Once the current cpu slab is exhausted (which can be anytime)
> it will enforce the contextual allocation constraints. See
> get_any_partial() in slub.c.
If it finds no partial slabs it goes back to the page allocator; and
when you allocate a page under PF_MEMALLOC and the normal allocations
are exhausted it takes a page from pretty much anywhere.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
@ 2007-05-21 20:08 ` Peter Zijlstra
0 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-21 20:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 2007-05-21 at 12:43 -0700, Christoph Lameter wrote:
> On Mon, 21 May 2007, Peter Zijlstra wrote:
>
> > > So the original issue is still not fixed. A slab alloc may succeed without
> > > watermarks if that particular allocation is restricted to a different set
> > > of nodes. Then the reserve slab is dropped despite the memory scarcity on
> > > another set of nodes?
> >
> > I can't see how. This extra ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER alloc will
> > first deplete all other zones. Once that starts failing no node should
> > still have pages accessible by any allocation context other than
> > PF_MEMALLOC.
>
> This means we will disobey cpuset and memory policy constraints?
>From what I can make of it, yes. Although I'm a bit hazy on the
mempolicy code.
Note that disobeying these constraints is not new behaviour. PF_MEMALLOC
needs a page, and will get a page, no matter what the cost.
> > > No the gfp zone flags are not uniform and placement of page allocator
> > > allocs through SLUB do not always have the same allocation constraints.
> >
> > It has to; since it can serve the allocation from a pre-existing slab
> > allocation. Hence any page allocation must be valid for all other users.
>
> Why does it have to? This is not true.
Say the slab gets allocated by an allocation from interrupt context; no
cpuset, no policy. This same slab must be valid for whatever allocation
comes next, right? Regardless of whatever policy or GFP_ flags are in
effect for that allocation.
> > > SLUB will check the node of the page that was allocated when the page
> > > allocator returns and put the page into that nodes slab list. This varies
> > > depending on the allocation context.
> >
> > Yes, it keeps slabs on per node lists. I'm just not seeing how this puts
> > hard constraints on the allocations.
>
> The constraints come from the context of memory policies and cpusets. See
> get_any_partial().
but get_partial() will only be called if the cpu_slab is full, up until
that point you have to make do with whatever is there.
> > As far as I can see there cannot be a hard constraint here, because
> > allocations from interrupt context are at best node local. And node
> > affine zone lists still have all zones, just ordered on locality.
>
> Interrupt context is something different. If we do not have a process
> context then no cpuset and memory policy constraints can apply since we
> have no way of determining that. If you restrict your use of the reserve
> cpuset to only interrupt allocs then we may indeed be fine.
No, what I'm saying is that if the slab gets refilled from interrupt
context the next process context alloc will have to work with whatever
the interrupt left behind. Hence there is no hard constraint.
> > > Allocations can be particular to uses of a slab in particular situations.
> > > A kmalloc cache can be used to allocate from various sets of nodes in
> > > different circumstances. kmalloc will allow serving a limited number of
> > > objects from the wrong nodes for performance reasons but the next
> > > allocation from the page allocator (or from the partial lists) will occur
> > > using the current set of allowed nodes in order to ensure a rough
> > > obedience to the memory policies and cpusets. kmalloc_node behaves
> > > differently and will enforce using memory from a particular node.
> >
> > From what I can see, it takes pretty much any page it can get once you
> > hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
> > the page can come from pretty much anywhere.
>
> No it cannot. Once the current cpuslab is exhausted (which can be anytime)
> it will enforce the contextual allocation constraints. See
> get_any_partial() in slub.c.
If it finds no partial slabs it goes back to the page allocator; and
when you allocate a page under PF_MEMALLOC and the normal allocations
are exhausted it takes a page from pretty much anywhere.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 20:08 ` Peter Zijlstra
@ 2007-05-21 20:32 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-21 20:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 21 May 2007, Peter Zijlstra wrote:
> > This means we will disobey cpuset and memory policy constraints?
>
> From what I can make of it, yes. Although I'm a bit hazy on the
> mempolicy code.
In an interrupt context we do not have a process context. But there is
no exemption from memory policy constraints.
> > > > No the gfp zone flags are not uniform and placement of page allocator
> > > > allocs through SLUB do not always have the same allocation constraints.
> > >
> > > It has to; since it can serve the allocation from a pre-existing slab
> > > allocation. Hence any page allocation must be valid for all other users.
> >
> > Why does it have to? This is not true.
>
> Say the slab gets allocated by an allocation from interrupt context; no
> cpuset, no policy. This same slab must be valid for whatever allocation
> comes next, right? Regardless of whatever policy or GFP_ flags are in
> effect for that allocation.
Yes sure if we do not have a context then no restrictions originating
there can be enforced. So you want to restrict the logic now to
interrupt allocs? I.e. GFP_ATOMIC?
> > The constraints come from the context of memory policies and cpusets. See
> > get_any_partial().
>
> but get_partial() will only be called if the cpu_slab is full, up until
> that point you have to make do with whatever is there.
Correct. That is an optimization but it may be called anytime from the
perspective of an execution thread and that may cause problems with your
approach.
> > > As far as I can see there cannot be a hard constraint here, because
> > > allocations from interrupt context are at best node local. And node
> > > affine zone lists still have all zones, just ordered on locality.
> >
> > Interrupt context is something different. If we do not have a process
> > context then no cpuset and memory policy constraints can apply since we
> > have no way of determining that. If you restrict your use of the reserve
> > cpuset to only interrupt allocs then we may indeed be fine.
>
> No, what I'm saying is that if the slab gets refilled from interrupt
> context the next process context alloc will have to work with whatever
> the interrupt left behind. Hence there is no hard constraint.
It will work with whatever was left behind in the case of SLUB and a
kmalloc alloc (optimization there). It won't if it's SLAB (which is
stricter) or a kmalloc_node alloc. A kmalloc_node alloc will remove the
current cpuslab if it's not on the right node.
> > > >From what I can see, it takes pretty much any page it can get once you
> > > hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
> > > the page can come from pretty much anywhere.
> >
> > No it cannot. Once the current cpuslab is exhausted (which can be anytime)
> > it will enforce the contextual allocation constraints. See
> > get_any_partial() in slub.c.
>
> If it finds no partial slabs it goes back to the page allocator; and
> when you allocate a page under PF_MEMALLOC and the normal allocations
> are exhausted it takes a page from pretty much anywhere.
If it finds no partial slab then it will go to the page allocator which
will allocate given the current contextual alloc constraints. In the case
of a memory policy we may have limited the allocations to a single node
where there is no escape (the zonelist does *not* contain zones of other
nodes). The only chance to bypass this is by only dealing with allocations
during interrupt that have no allocation context.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 20:32 ` Christoph Lameter
@ 2007-05-21 20:54 ` Peter Zijlstra
-1 siblings, 0 replies; 138+ messages in thread
From: Peter Zijlstra @ 2007-05-21 20:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 2007-05-21 at 13:32 -0700, Christoph Lameter wrote:
> On Mon, 21 May 2007, Peter Zijlstra wrote:
>
> > > This means we will disobey cpuset and memory policy constraints?
> >
> > From what I can make of it, yes. Although I'm a bit hazy on the
> > mempolicy code.
>
> In an interrupt context we do not have a process context. But there is
> no exemption from memory policy constraints.
Ah, see below.
> > > > > No the gfp zone flags are not uniform and placement of page allocator
> > > > > allocs through SLUB do not always have the same allocation constraints.
> > > >
> > > > It has to; since it can serve the allocation from a pre-existing slab
> > > > allocation. Hence any page allocation must be valid for all other users.
> > >
> > > Why does it have to? This is not true.
> >
> > Say the slab gets allocated by an allocation from interrupt context; no
> > cpuset, no policy. This same slab must be valid for whatever allocation
> > comes next, right? Regardless of whatever policy or GFP_ flags are in
> > effect for that allocation.
>
> Yes sure if we do not have a context then no restrictions originating
> there can be enforced. So you want to restrict the logic now to
> interrupt allocs? I.e. GFP_ATOMIC?
No, any kernel alloc.
> > > The constraints come from the context of memory policies and cpusets. See
> > > get_any_partial().
> >
> > but get_partial() will only be called if the cpu_slab is full, up until
> > that point you have to make do with whatever is there.
>
> Correct. That is an optimization but it may be called anytime from the
> perspective of an execution thread and that may cause problems with your
> approach.
I'm not seeing how this would interfere; if the alloc can be handled
from a partial slab, that is fine.
> > > > As far as I can see there cannot be a hard constraint here, because
> > > > allocations from interrupt context are at best node local. And node
> > > > affine zone lists still have all zones, just ordered on locality.
> > >
> > > Interrupt context is something different. If we do not have a process
> > > context then no cpuset and memory policy constraints can apply since we
> > > have no way of determining that. If you restrict your use of the reserve
> > > cpuset to only interrupt allocs then we may indeed be fine.
> >
> > No, what I'm saying is that if the slab gets refilled from interrupt
> > context the next process context alloc will have to work with whatever
> > the interrupt left behind. Hence there is no hard constraint.
>
> It will work with whatever was left behind in the case of SLUB and a
> kmalloc alloc (optimization there). It won't if it's SLAB (which is
> stricter) or a kmalloc_node alloc. A kmalloc_node alloc will remove the
> current cpuslab if it's not on the right node.
OK, that is my understanding too. So this should be good too.
> > > > From what I can see, it takes pretty much any page it can get once you
> > > > hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
> > > > the page can come from pretty much anywhere.
> > >
> > > No it cannot. Once the current cpuslab is exhausted (which can be anytime)
> > > it will enforce the contextual allocation constraints. See
> > > get_any_partial() in slub.c.
> >
> > If it finds no partial slabs it goes back to the page allocator; and
> > when you allocate a page under PF_MEMALLOC and the normal allocations
> > are exhausted it takes a page from pretty much anywhere.
>
> If it finds no partial slab then it will go to the page allocator which
> will allocate given the current contextual alloc constraints.
> In the case
> of a memory policy we may have limited the allocations to a single node
> where there is no escape (the zonelist does *not* contain zones of other
> nodes).
Ah, this is the point I was missing; I assumed each zonelist would
always include all zones, but would just continue/break the loop using
things like cpuset_zone_allowed_*().
This might indeed foil the game.
I could 'fix' this by doing the PF_MEMALLOC allocation from the regular
node zonelist instead of from the one handed down....
/me thinks out loud.. since direct reclaim runs in whatever process
context was handed out we're stuck with whatever policy we started from;
but since the allocations are kernel allocs - not userspace allocs, and
we're in dire straits, it makes sense to violate the task's restraints
in order to keep the machine up.
memory policies are the only ones with 'short' zonelists, right? CPU
sets are on top of whatever zonelist is handed out, and the normal
zonelists include all nodes, ordered by distance.
> The only chance to bypass this is by only dealing with allocations
> during interrupt that have no allocation context.
But you just said that interrupts are not exempt from memory policies,
and policies are the only ones that have 'short' zonelists. /me
confused.
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [PATCH 0/5] make slab gfp fair
2007-05-21 20:54 ` Peter Zijlstra
@ 2007-05-21 21:04 ` Christoph Lameter
-1 siblings, 0 replies; 138+ messages in thread
From: Christoph Lameter @ 2007-05-21 21:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Mackall, linux-kernel, linux-mm, Thomas Graf, David Miller,
Andrew Morton, Daniel Phillips, Pekka Enberg, Paul Jackson,
npiggin
On Mon, 21 May 2007, Peter Zijlstra wrote:
> > Yes sure if we do not have a context then no restrictions originating
> > there can be enforced. So you want to restrict the logic now to
> > interrupt allocs? I.e. GFP_ATOMIC?
>
> No, any kernel alloc.
Then we have the problem again.
> > Correct. That is an optimization but it may be called anytime from the
> > perspective of an execution thread and that may cause problems with your
> > approach.
>
> I'm not seeing how this would interfere; if the alloc can be handled
> from a partial slab, that is fine.
There is no guarantee that a partial slab is available.
> > In the case
> > of a memory policy we may have limited the allocations to a single node
> > where there is no escape (the zonelist does *not* contain zones of other
> > nodes).
>
> Ah, this is the point I was missing; I assumed each zonelist would
> always include all zones, but would just continue/break the loop using
> things like cpuset_zone_allwed_*().
>
> This might indeed foil the game.
>
> I could 'fix' this by doing the PF_MEMALLOC allocation from the regular
> node zonelist instead of from the one handed down....
I wonder if this makes any sense at all given that the only point of
what you are doing is to help to decide which alloc should fail...
> /me thinks out loud.. since direct reclaim runs in whatever process
> context was handed out we're stuck with whatever policy we started from;
> but since the allocations are kernel allocs - not userspace allocs, and
> we're in dire straights, it makes sense to violate the tasks restraints
> in order to keep the machine up.
The memory policy constraints may have been set up to cage in an
application. It was set up to *stop* the application from using memory on
other nodes. If you now allow that then the semantics of memory policies
are significantly changed. The cpuset constraints are sometimes not that
hard but I better let Paul speak for them.
> memory policies are the only ones with 'short' zonelists, right? CPU
> sets are on top of whatever zonelist is handed out, and the normal
> zonelists include all nodes - ordered by distance
GFP_THISNODE can have a similar effect.
> > The only chance to bypass this is by only dealing with allocations
> > during interrupt that have no allocation context.
>
> But you just said that interrupts are not exempt from memory policies,
> and policies are the only ones that have 'short' zonelists. /me
> confused.
No, I said that in an interrupt allocation we have no process context and
therefore no cpuset or memory policy context. Thus no policies or cpusets
are applied to an allocation. You can allocate without restrictions.
^ permalink raw reply [flat|nested] 138+ messages in thread
end of thread, other threads:[~2007-05-21 21:04 UTC | newest]
Thread overview: 138+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-14 13:19 [PATCH 0/5] make slab gfp fair Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 1/5] mm: page allocation rank Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 2/5] mm: slab allocation fairness Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:51 ` Christoph Lameter
2007-05-14 15:51 ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 3/5] mm: slub " Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:49 ` Christoph Lameter
2007-05-14 15:49 ` Christoph Lameter
2007-05-14 16:14 ` Peter Zijlstra
2007-05-14 16:14 ` Peter Zijlstra
2007-05-14 16:35 ` Christoph Lameter
2007-05-14 16:35 ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 4/5] mm: slob " Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 13:19 ` [PATCH 5/5] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-05-14 13:19 ` Peter Zijlstra
2007-05-14 15:53 ` [PATCH 0/5] make slab gfp fair Christoph Lameter
2007-05-14 15:53 ` Christoph Lameter
2007-05-14 16:10 ` Peter Zijlstra
2007-05-14 16:10 ` Peter Zijlstra
2007-05-14 16:37 ` Christoph Lameter
2007-05-14 16:37 ` Christoph Lameter
2007-05-14 16:12 ` Matt Mackall
2007-05-14 16:12 ` Matt Mackall
2007-05-14 16:29 ` Christoph Lameter
2007-05-14 16:29 ` Christoph Lameter
2007-05-14 17:40 ` Peter Zijlstra
2007-05-14 17:40 ` Peter Zijlstra
2007-05-14 17:57 ` Christoph Lameter
2007-05-14 17:57 ` Christoph Lameter
2007-05-14 19:28 ` Peter Zijlstra
2007-05-14 19:28 ` Peter Zijlstra
2007-05-14 19:56 ` Christoph Lameter
2007-05-14 19:56 ` Christoph Lameter
2007-05-14 20:03 ` Peter Zijlstra
2007-05-14 20:03 ` Peter Zijlstra
2007-05-14 20:06 ` Christoph Lameter
2007-05-14 20:06 ` Christoph Lameter
2007-05-14 20:12 ` Peter Zijlstra
2007-05-14 20:12 ` Peter Zijlstra
2007-05-14 20:25 ` Christoph Lameter
2007-05-14 20:25 ` Christoph Lameter
2007-05-15 17:27 ` Peter Zijlstra
2007-05-15 17:27 ` Peter Zijlstra
2007-05-15 22:02 ` Christoph Lameter
2007-05-15 22:02 ` Christoph Lameter
2007-05-16 6:59 ` Peter Zijlstra
2007-05-16 6:59 ` Peter Zijlstra
2007-05-16 18:43 ` Christoph Lameter
2007-05-16 18:43 ` Christoph Lameter
2007-05-16 19:25 ` Peter Zijlstra
2007-05-16 19:25 ` Peter Zijlstra
2007-05-16 19:53 ` Christoph Lameter
2007-05-16 19:53 ` Christoph Lameter
2007-05-16 20:18 ` Peter Zijlstra
2007-05-16 20:18 ` Peter Zijlstra
2007-05-16 20:27 ` Christoph Lameter
2007-05-16 20:27 ` Christoph Lameter
2007-05-16 20:40 ` Peter Zijlstra
2007-05-16 20:40 ` Peter Zijlstra
2007-05-16 20:44 ` Christoph Lameter
2007-05-16 20:44 ` Christoph Lameter
2007-05-16 20:54 ` Peter Zijlstra
2007-05-16 20:54 ` Peter Zijlstra
2007-05-16 20:59 ` Christoph Lameter
2007-05-16 20:59 ` Christoph Lameter
2007-05-16 21:04 ` Peter Zijlstra
2007-05-16 21:04 ` Peter Zijlstra
2007-05-16 21:13 ` Christoph Lameter
2007-05-16 21:13 ` Christoph Lameter
2007-05-16 21:20 ` Peter Zijlstra
2007-05-16 21:20 ` Peter Zijlstra
2007-05-16 21:42 ` Christoph Lameter
2007-05-16 21:42 ` Christoph Lameter
2007-05-17 7:28 ` Peter Zijlstra
2007-05-17 7:28 ` Peter Zijlstra
2007-05-17 17:30 ` Christoph Lameter
2007-05-17 17:30 ` Christoph Lameter
2007-05-17 17:53 ` Peter Zijlstra
2007-05-17 17:53 ` Peter Zijlstra
2007-05-17 18:01 ` Christoph Lameter
2007-05-17 18:01 ` Christoph Lameter
2007-05-14 19:44 ` Andrew Morton
2007-05-14 19:44 ` Andrew Morton
2007-05-14 20:01 ` Matt Mackall
2007-05-14 20:01 ` Matt Mackall
2007-05-14 20:05 ` Peter Zijlstra
2007-05-14 20:05 ` Peter Zijlstra
2007-05-17 3:02 ` Christoph Lameter
2007-05-17 3:02 ` Christoph Lameter
2007-05-17 7:08 ` Peter Zijlstra
2007-05-17 7:08 ` Peter Zijlstra
2007-05-17 17:29 ` Christoph Lameter
2007-05-17 17:29 ` Christoph Lameter
2007-05-17 17:52 ` Peter Zijlstra
2007-05-17 17:52 ` Peter Zijlstra
2007-05-17 17:59 ` Christoph Lameter
2007-05-17 17:59 ` Christoph Lameter
2007-05-17 17:53 ` Matt Mackall
2007-05-17 17:53 ` Matt Mackall
2007-05-17 18:02 ` Christoph Lameter
2007-05-17 18:02 ` Christoph Lameter
2007-05-17 19:18 ` Peter Zijlstra
2007-05-17 19:18 ` Peter Zijlstra
2007-05-17 19:24 ` Christoph Lameter
2007-05-17 19:24 ` Christoph Lameter
2007-05-17 21:26 ` Peter Zijlstra
2007-05-17 21:26 ` Peter Zijlstra
2007-05-17 21:44 ` Paul Jackson
2007-05-17 21:44 ` Paul Jackson
2007-05-17 22:27 ` Christoph Lameter
2007-05-17 22:27 ` Christoph Lameter
2007-05-18 9:54 ` Peter Zijlstra
2007-05-18 9:54 ` Peter Zijlstra
2007-05-18 17:11 ` Paul Jackson
2007-05-18 17:11 ` Paul Jackson
2007-05-18 17:11 ` Christoph Lameter
2007-05-18 17:11 ` Christoph Lameter
2007-05-20 8:39 ` Peter Zijlstra
2007-05-20 8:39 ` Peter Zijlstra
2007-05-21 16:45 ` Christoph Lameter
2007-05-21 16:45 ` Christoph Lameter
2007-05-21 19:33 ` Peter Zijlstra
2007-05-21 19:33 ` Peter Zijlstra
2007-05-21 19:43 ` Christoph Lameter
2007-05-21 19:43 ` Christoph Lameter
2007-05-21 20:08 ` Peter Zijlstra
2007-05-21 20:08 ` Peter Zijlstra
2007-05-21 20:32 ` Christoph Lameter
2007-05-21 20:32 ` Christoph Lameter
2007-05-21 20:54 ` Peter Zijlstra
2007-05-21 20:54 ` Peter Zijlstra
2007-05-21 21:04 ` Christoph Lameter
2007-05-21 21:04 ` Christoph Lameter