* [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1
@ 2011-07-15 15:08 Mel Gorman
2011-07-15 15:08 ` [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
2011-07-15 15:09 ` [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
0 siblings, 2 replies; 17+ messages in thread
From: Mel Gorman @ 2011-07-15 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
linux-mm, linux-kernel
Sorry for the resend. I screwed up the patch numbers in the first
sending.
Changelog since v1
o Dropped PF_SWAPWRITE change as discussions related to it stalled and
it's not important for fixing the underlying problem.
There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines, reported on
a distribution bugzilla. In these cases, zone_reclaim was enabled
by default due to large NUMA distances. In general, the complaints
have not been about the workload itself unless it was a file server
(in which case the recommendation was to disable zone_reclaim).
The stalls are mostly due to significant amounts of time spent
scanning the preferred zone for pages to free. After a failure, it
might fall back to another node (as zonelists are often node-ordered
rather than zone-ordered) but stall quickly again when the next
allocation attempt occurs. In bad cases, each page allocated results
in a full scan of the preferred zone.
Patch 1 checks the preferred zone for recent allocation failure which
is particularly important if zone_reclaim has failed recently.
This avoids rescanning the zone in the near future and instead
falling back to another node. This may hurt node locality in
some cases but a zone_reclaim failure is more expensive
than a remote access.
Patch 2 clears the zlc information after direct reclaim. Otherwise,
zone_reclaim can mark zones full, direct reclaim can
reclaim enough pages but the zone is still not considered
for allocation.
This was tested on a 24-thread 2-node x86_64 machine. The tests were
focused on large amounts of IO. All tests were bound to the CPUs
on node-0 to avoid disturbances due to processes being scheduled on
different nodes. The kernels tested are
3.0-rc6-vanilla Vanilla 3.0-rc6
zlcfirst Patch 1 applied
zlcreconsider Patches 1+2 applied
FS-Mark
./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
fsmark-3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirs zlcreconsider
Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 501.49 493.91 499.93
Total Elapsed Time (seconds) 2451.57 2257.48 2215.92
MMTests Statistics: vmstat
Page Ins 46268 63840 66008
Page Outs 90821596 90671128 88043732
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 13091697 8966863 8971790
Kswapd pages scanned 0 1830011 1831116
Kswapd pages reclaimed 0 1829068 1829930
Direct pages reclaimed 13037777 8956828 8648314
Kswapd efficiency 100% 99% 99%
Kswapd velocity 0.000 810.643 826.346
Direct efficiency 99% 99% 96%
Direct velocity 5340.128 3972.068 4048.788
Percentage direct scans 100% 83% 83%
Page writes by reclaim 0 3 0
Slabs scanned 796672 720640 720256
Direct inode steals 7422667 7160012 7088638
Kswapd inode steals 0 1736840 2021238
Test completes far faster with a large increase in the number of files
created per second. Standard deviation is high as a small number
of iterations were much higher than the mean. The number of pages
scanned by zone_reclaim is reduced and kswapd is used for more work.
LARGE DD
3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirst zlcreconsider
download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 125.03 118.98 122.01
Total Elapsed Time (seconds) 624.56 375.02 398.06
MMTests Statistics: vmstat
Page Ins 3594216 439368 407032
Page Outs 23380832 23380488 23377444
Swap Ins 0 0 0
Swap Outs 0 436 287
Direct pages scanned 17482342 69315973 82864918
Kswapd pages scanned 0 519123 575425
Kswapd pages reclaimed 0 466501 522487
Direct pages reclaimed 5858054 2732949 2712547
Kswapd efficiency 100% 89% 90%
Kswapd velocity 0.000 1384.254 1445.574
Direct efficiency 33% 3% 3%
Direct velocity 27991.453 184832.737 208171.929
Percentage direct scans 100% 99% 99%
Page writes by reclaim 0 5082 13917
Slabs scanned 17280 29952 35328
Direct inode steals 115257 1431122 332201
Kswapd inode steals 0 0 979532
This test downloads a large tarfile and copies it with dd a number of
times - similar to the most recent bug report I've dealt with. Time to
completion is reduced. The number of pages scanned directly is still
disturbingly high with a low efficiency but this is likely due to
the number of dirty pages encountered. The figures could probably be
improved with more work around how kswapd is used and how dirty pages
are handled but that is separate work and this result is significant
on its own.
Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 124.47 111.67 112.64
Total Elapsed Time (seconds) 2138.14 1816.30 1867.56
MMTests Statistics: vmstat
Page Ins 90760 89124 89516
Page Outs 121028340 120199524 120736696
Swap Ins 0 86 55
Swap Outs 0 0 0
Direct pages scanned 114989363 96461439 96330619
Kswapd pages scanned 56430948 56965763 57075875
Kswapd pages reclaimed 27743219 27752044 27766606
Direct pages reclaimed 49777 46884 36655
Kswapd efficiency 49% 48% 48%
Kswapd velocity 26392.541 31363.631 30561.736
Direct efficiency 0% 0% 0%
Direct velocity 53780.091 53108.759 51581.004
Percentage direct scans 67% 62% 62%
Page writes by reclaim 385 122 1513
Slabs scanned 43008 39040 42112
Direct inode steals 0 10 8
Kswapd inode steals 733 534 477
This test just creates a large file mapping and writes to it
linearly. Time to completion is again reduced.
The gains are mostly down to two things. In many cases, there
is less scanning as zone_reclaim simply gives up faster due to
recent failures. The second reason is that memory is used more
efficiently. Instead of scanning the preferred zone every time, the
allocator falls back to another zone and uses it instead, improving
overall memory utilisation.
mm/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 41 insertions(+), 13 deletions(-)
--
1.7.3.4
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-15 15:08 [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1 Mel Gorman
@ 2011-07-15 15:08 ` Mel Gorman
2011-07-18 14:56 ` Christoph Lameter
2011-07-15 15:09 ` [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
1 sibling, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-15 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
linux-mm, linux-kernel
The zonelist cache (ZLC) is used among other things to record if
zone_reclaim() failed for a particular zone recently. The intention
is to avoid the high cost of scanning extremely long zonelists or of
uselessly scanning within the zone.
Currently the zonelist cache is setup only after the first zone has
been considered and zone_reclaim() has been called. The objective was
to avoid a costly setup but zone_reclaim is itself quite expensive. If
it is failing regularly, such as when the first eligible zone has mostly
mapped pages, the cost in scanning and allocation stalls is far higher
than the ZLC initialisation step.
This patch initialises ZLC before the first eligible zone calls
zone_reclaim(). Once initialised, it is checked whether the zone
failed zone_reclaim recently. If it has, the zone is skipped. As the
first zone is now being checked, additional care has to be taken about
zones marked full. A zone can be marked "full" because it does not
have enough unmapped pages for zone_reclaim, but this is excessive as
direct reclaim or kswapd may succeed where zone_reclaim fails. Only
mark zones "full" after zone_reclaim fails if it failed to reclaim
enough pages after scanning.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 35 ++++++++++++++++++++++-------------
1 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..6913854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1664,7 +1664,7 @@ zonelist_scan:
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
- goto try_next_zone;
+ continue;
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -1676,17 +1676,36 @@ zonelist_scan:
classzone_idx, alloc_flags))
goto try_this_zone;
+ if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
+ /*
+ * we do zlc_setup if there are multiple nodes
+ * and before considering the first zone allowed
+ * by the cpuset.
+ */
+ allowednodes = zlc_setup(zonelist, alloc_flags);
+ zlc_active = 1;
+ did_zlc_setup = 1;
+ }
+
if (zone_reclaim_mode == 0)
goto this_zone_full;
+ /*
+ * As we may have just activated ZLC, check if the first
+ * eligible zone has failed zone_reclaim recently.
+ */
+ if (NUMA_BUILD && zlc_active &&
+ !zlc_zone_worth_trying(zonelist, z, allowednodes))
+ continue;
+
ret = zone_reclaim(zone, gfp_mask, order);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
- goto try_next_zone;
+ continue;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
- goto this_zone_full;
+ continue;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
@@ -1703,16 +1722,6 @@ try_this_zone:
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
-try_next_zone:
- if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
- /*
- * we do zlc_setup after the first zone is tried but only
- * if there are multiple nodes make it worthwhile
- */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
- did_zlc_setup = 1;
- }
}
if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
--
1.7.3.4
* [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-15 15:08 [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1 Mel Gorman
2011-07-15 15:08 ` [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
@ 2011-07-15 15:09 ` Mel Gorman
2011-07-19 11:46 ` [PATCH] mm: page allocator: Reconsider zones for allocation after direct reclaim fix Mel Gorman
1 sibling, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-15 15:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
linux-mm, linux-kernel
With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future. If the
process enters direct reclaim, the ZLC may still consider zones to be
full even after reclaiming pages. Reconsider all zones for allocation
if direct reclaim returns successfully.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6913854..149409c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1616,6 +1616,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
set_bit(i, zlc->fullzones);
}
+/*
+ * clear all zones full, called after direct reclaim makes progress so that
+ * a zone that was recently full is not skipped over for up to a second
+ */
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+ struct zonelist_cache *zlc; /* cached zonelist speedup info */
+
+ zlc = zonelist->zlcache_ptr;
+ if (!zlc)
+ return;
+
+ bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+}
+
#else /* CONFIG_NUMA */
static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
@@ -1963,6 +1978,10 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (unlikely(!(*did_some_progress)))
return NULL;
+ /* After successful reclaim, reconsider all zones for allocation */
+ if (NUMA_BUILD)
+ zlc_clear_zones_full(zonelist);
+
retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
--
1.7.3.4
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-15 15:08 ` [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
@ 2011-07-18 14:56 ` Christoph Lameter
2011-07-18 16:05 ` Mel Gorman
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-18 14:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Fri, 15 Jul 2011, Mel Gorman wrote:
> Currently the zonelist cache is setup only after the first zone has
> been considered and zone_reclaim() has been called. The objective was
> to avoid a costly setup but zone_reclaim is itself quite expensive. If
> it is failing regularly such as the first eligible zone having mostly
> mapped pages, the cost in scanning and allocation stalls is far higher
> than the ZLC initialisation step.
Would it not be easier to set zlc_active and allowednodes based on the
zone having an active ZLC at the start of get_pages()?
Buffered_rmqueue is handling the situation of a zone with a ZLC in a
weird way right now since it ignores the (potentially existing) ZLC
for the first pass. zlc_setup() does a lot of things. Is that because
there is a performance benefit?
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-18 14:56 ` Christoph Lameter
@ 2011-07-18 16:05 ` Mel Gorman
2011-07-18 17:20 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-18 16:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> On Fri, 15 Jul 2011, Mel Gorman wrote:
>
> > Currently the zonelist cache is setup only after the first zone has
> > been considered and zone_reclaim() has been called. The objective was
> > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > it is failing regularly such as the first eligible zone having mostly
> > mapped pages, the cost in scanning and allocation stalls is far higher
> > than the ZLC initialisation step.
>
> Would it not be easier to set zlc_active and allowednodes based on the
> zone having an active ZLC at the start of get_pages()?
>
What do you mean by a zone's active ZLC? Zonelists are on a per-node,
not a per-zone basis (see node_zonelist) so a zone doesn't have an
active ZLC as such. If the zlc_active is set at the beginning of
get_page_from_freelist(), it implies that we are calling zlc_setup()
even when the watermarks are met which is unnecessary.
> Buffered_rmqueue is handling the situation of a zone with an ZLC in a
> weird way right now since it ignores the (potentially existing) ZLC
> for the first pass.
Where does buffered_rmqueue() refer to a zonelist_cache?
> zlc_setup() does a lot of things. So that is because
> there is a performance benefit?
>
I do not understand this question. Are you asking if zonelist_cache
has a performance benefit? The answer is "yes" because you can see
how performance degrades when zone_reclaim is enabled and the ZLC is
not used for the first zone.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-18 16:05 ` Mel Gorman
@ 2011-07-18 17:20 ` Christoph Lameter
2011-07-18 21:13 ` Mel Gorman
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-18 17:20 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Mon, 18 Jul 2011, Mel Gorman wrote:
> On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> > On Fri, 15 Jul 2011, Mel Gorman wrote:
> >
> > > Currently the zonelist cache is setup only after the first zone has
> > > been considered and zone_reclaim() has been called. The objective was
> > > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > > it is failing regularly such as the first eligible zone having mostly
> > > mapped pages, the cost in scanning and allocation stalls is far higher
> > > than the ZLC initialisation step.
> >
> > Would it not be easier to set zlc_active and allowednodes based on the
> > zone having an active ZLC at the start of get_pages()?
> >
>
> What do you mean by a zones active ZLC? zonelists are on a per-node,
> not a per-zone basis (see node_zonelist) so a zone doesn't have an
> active ZLC as such. If the zlc_active is set at the beginning of
Look at get_page_from_freelist(): It sets
zlc_active = 0 even though the zonelist under consideration may have a
ZLC. zlc_active = 0 can also mean that the function has not bothered to
look for the zlc information of the current zonelist.
> get_page_from_freelist(), it implies that we are calling zlc_setup()
> even when the watermarks are met which is unnecessary.
Ok then that decision to not call zlc_setup() for performance reasons is
what created the problem that you are trying to solve. In case the
first zone's watermarks are okay we can avoid calling zlc_setup().
What we do now have is checking for zlc_active in the loop just so that
the first time around we do not call zlc_setup().
We may be able to simplify the function by:
1. Checking for the special case that the first zone is ok and that we do
not want to call zlc_setup before we get to the loop.
2. Do the zlc_setup() before the loop.
3. Remove the zlc_setup() code as you did from the loop as well as the
checks for zlc_active. zlc_active becomes unnecessary since a zlc
is always available when we go through the loop.
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-18 17:20 ` Christoph Lameter
@ 2011-07-18 21:13 ` Mel Gorman
2011-07-18 21:54 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-18 21:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Mon, Jul 18, 2011 at 12:20:11PM -0500, Christoph Lameter wrote:
>
>
> On Mon, 18 Jul 2011, Mel Gorman wrote:
>
> > On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> > > On Fri, 15 Jul 2011, Mel Gorman wrote:
> > >
> > > > Currently the zonelist cache is setup only after the first zone has
> > > > been considered and zone_reclaim() has been called. The objective was
> > > > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > > > it is failing regularly such as the first eligible zone having mostly
> > > > mapped pages, the cost in scanning and allocation stalls is far higher
> > > > than the ZLC initialisation step.
> > >
> > > Would it not be easier to set zlc_active and allowednodes based on the
> > > zone having an active ZLC at the start of get_pages()?
> > >
> >
> > What do you mean by a zones active ZLC? zonelists are on a per-node,
> > not a per-zone basis (see node_zonelist) so a zone doesn't have an
> > active ZLC as such. If the zlc_active is set at the beginning of
>
> Look at get_page_from_freelist(): It sets
> zlc_active = 0 even through the zonelist under consideration may have a
> ZLC. zlc_active = 0 can also mean that the function has not bothered to
> look for the zlc information of the current zonelist.
>
Yes. So? It's only necessary if the watermarks are not met.
> > get_page_from_freelist(), it implies that we are calling zlc_setup()
> > even when the watermarks are met which is unnecessary.
>
> Ok then that decision to not call zlc_setup() for performance reasons is
> what created the problem that you are trying to solve. In case that the
> first zones watermarks are okay we can avoid calling zlc_setup().
>
The original implementation did not check the ZLC in the first loop
at all. It wasn't just about avoiding the cost of setup. I suspect
this problem has been there a long time and it's taking this long
for bug reports to show up because NUMA machines are being used for
generic numa-unaware workloads.
> What we do now have is checking for zlc_active in the loop just so that
> the first time around we do not call zlc_setup().
>
Yes, why incur the cost for the common case?
> We may be able to simplify the function by:
>
> 1. Checking for the special case that the first zone is ok and that we do
> not want to call zlc_setup before we get to the loop.
>
> 2. Do the zlc_setup() before the loop.
>
> 3. Remove the zlc_setup() code as you did from the loop as well as the
> checks for zlc_active. zlc_active becomes not necessary since a zlc
> is always available when we go through the loop.
>
That initial test will involve duplication of things like the cpuset and
no watermarks check just to place the zlc_setup() in a different place.
I might be missing your point but it seems like the gain would be
marginal. Fancy posting a patch?
--
Mel Gorman
SUSE Labs
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-18 21:13 ` Mel Gorman
@ 2011-07-18 21:54 ` Christoph Lameter
2011-07-19 14:01 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-18 21:54 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Mon, 18 Jul 2011, Mel Gorman wrote:
> > We may be able to simplify the function by:
> >
> > 1. Checking for the special case that the first zone is ok and that we do
> > not want to call zlc_setup before we get to the loop.
> >
> > 2. Do the zlc_setup() before the loop.
> >
> > 3. Remove the zlc_setup() code as you did from the loop as well as the
> > checks for zlc_active. zlc_active becomes not necessary since a zlc
> > is always available when we go through the loop.
> >
>
> That initial test will involve duplication of things like the cpuset and
> no watermarks check just to place the zlc_setup() in a different place.
> I might be missing your point but it seems like the gain would be
> marginal. Fancy posting a patch?
Looked at it for some time. Would have to create a new function for the
watermark checks, the call to buffered_rmqueue and the marking of a zone as
full. After that the goto mess could be unraveled. But I am out of time
for today.
* [PATCH] mm: page allocator: Reconsider zones for allocation after direct reclaim fix
2011-07-15 15:09 ` [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
@ 2011-07-19 11:46 ` Mel Gorman
0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2011-07-19 11:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, linux-mm, linux-kernel
mm/page_alloc.c: In function '__alloc_pages_direct_reclaim':
mm/page_alloc.c:1983:3: error: implicit declaration of function 'zlc_clear_zones_full'
This patch is a build fix for !CONFIG_NUMA that should be merged with
mm-page-allocator-reconsider-zones-for-allocation-after-direct-reclaim.patch .
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 149409c..0f50cdb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1647,6 +1647,10 @@ static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
{
}
+
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+}
#endif /* CONFIG_NUMA */
/*
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-18 21:54 ` Christoph Lameter
@ 2011-07-19 14:01 ` Christoph Lameter
2011-07-20 18:08 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-19 14:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
Well we can unwind that complexity later I guess.
Reviewed-by: Christoph Lameter <cl@linux.com>
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-19 14:01 ` Christoph Lameter
@ 2011-07-20 18:08 ` Christoph Lameter
2011-07-20 19:18 ` Mel Gorman
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-20 18:08 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
Hmmm... Looking at get_page_from_freelist and considering speeding that up
in general: Could we move the whole watermark logic into the slow path?
Only check when we refill the per cpu queues?
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 18:08 ` Christoph Lameter
@ 2011-07-20 19:18 ` Mel Gorman
2011-07-20 19:28 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-20 19:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Wed, Jul 20, 2011 at 01:08:46PM -0500, Christoph Lameter wrote:
> Hmmm... Looking at get_page_from_freelist and considering speeding that up
> in general: Could we move the whole watermark logic into the slow path?
> Only check when we refill the per cpu queues?
Each CPU list can hold 186 pages (on my currently running
kernel at least) which is 744K. As I'm running with THP enabled,
the min watermark is 25852K so with 34 or more CPUs, there is a
risk that a zone would be fully depleted due to lack of watermark
checking. Bit unlikely that 34 CPUs would be on one node but the risk
is there. Without THP, the min watermark would have been something like
32K where it would be much easier to accidentally consume all memory.
Yes, moving the watermark checks to the slow path would be faster
but under some conditions, the system will lock up.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 19:18 ` Mel Gorman
@ 2011-07-20 19:28 ` Christoph Lameter
2011-07-20 19:52 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-20 19:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Wed, 20 Jul 2011, Mel Gorman wrote:
> On Wed, Jul 20, 2011 at 01:08:46PM -0500, Christoph Lameter wrote:
> > Hmmm... Looking at get_page_from_freelist and considering speeding that up
> > in general: Could we move the whole watermark logic into the slow path?
> > Only check when we refill the per cpu queues?
>
> Each CPU list can hold 186 pages (on my currently running
> kernel at least) which is 744K. As I'm running with THP enabled,
> the min watermark is 25852K so with 34 of more CPUs, there is a
> risk that a zone would be fully depleted due to lack of watermark
> checking. Bit unlikely that 34 CPUs would be on one node but the risk
> is there. Without THP, the min watermark would have been something like
> 32K where it would be much easier to accidentally consume all memory.
>
> Yes, moving the watermark checks to the slow path would be faster
> but under some conditions, the system will lock up.
Well the fastpath would simply grab a page if it's on the list. If the list
is empty then we would be checking the watermarks and extract pages from
the buddylists. The pages in the per cpu lists would not be accounted for
for reclaim. Counters would reflect the buddy allocator pages available.
Reclaim flushes the per cpu pages so the buddy allocator pages would be
replenished.
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 19:28 ` Christoph Lameter
@ 2011-07-20 19:52 ` Christoph Lameter
2011-07-20 21:17 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-20 19:52 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
The existing way of deciding if watermarks have been met looks broken to
me.
There are two pools of pages: One is the pages available from the buddy
lists and another the pages in the per cpu lists.
zone_watermark_ok() only checks those in the buddy lists
(NR_FREE_PAGES is not updated when we get a page from the per cpu lists).
And we do check zone_watermark_ok() before even attempting to allocate
pages that may be available from the per cpu lists?
So the allocator may pass on a zone and/or go into reclaim despite the
availability of pages on per cpu lists. The more pages one puts into the
per cpu lists the higher the chance of an OOM. ... Ok that is not true
since we flush the per cpu pages and get them back into the buddy lists
before that happens.
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 19:52 ` Christoph Lameter
@ 2011-07-20 21:17 ` Christoph Lameter
2011-07-20 22:48 ` Mel Gorman
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2011-07-20 21:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
Hmmm... Maybe we can bypass the checks?
Subject: [page allocator] Do not check watermarks if there is a page available on the per cpu freelists
One should be able to grab a page from the per cpu freelists if available.
The pages on the per cpu freelists are not accounted for in VM statistics
so getting a page from there has no impact on reclaim.
Check for this condition in get_page_from_freelist and short circuit
to the call to buffered_rmqueue if so.
Note that there is a race here. We may deplete the reserve pools by
one page if either the process is rescheduled on a different processor
or if another process grabs the last page from the per cpu freelist.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2011-07-20 15:27:20.544825852 -0500
+++ linux-2.6/mm/page_alloc.c 2011-07-20 15:30:05.314824797 -0500
@@ -1666,6 +1666,16 @@ zonelist_scan:
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
+ /*
+ * Short circuit allocation if we have a usable object on
+ * the percpu freelist. Note that this can only be an
+ * optimization since there is no guarantee that we will
+ * be executing on the same cpu. Another process could also
+ * be scheduled and take the available page from us.
+ */
+ if (order == 0 && this_cpu_read(zone->pageset->pcp.count))
+ goto try_this_zone;
+
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 21:17 ` Christoph Lameter
@ 2011-07-20 22:48 ` Mel Gorman
2011-07-21 15:24 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2011-07-20 22:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Wed, Jul 20, 2011 at 04:17:41PM -0500, Christoph Lameter wrote:
> Hmmm... Maybe we can bypass the checks?
>
Maybe we should not.
Watermarks should not just be ignored. They prevent the system
deadlocking due to an inability to allocate a page needed to free more
memory. This patch allows allocations that are not high priority
or atomic to succeed when the buddy lists are at the min watermark
and would normally be throttled. Minimally, this patch increases
the risk of locking up due to memory exhaustion. For example,
a GFP_ATOMIC allocation can refill the per-cpu list with pages
that are then consumed by GFP_KERNEL allocations; the next GFP_ATOMIC
allocation refills again, gets consumed, etc. It's even worse if it's
PF_MEMALLOC allocations that are refilling the lists, as they ignore
watermarks. If this is happening on enough CPUs, it will cause trouble.
At the very least, the performance benefit of such a change should
be illustrated. Even if it's faster (and I'd expect it to be;
watermark checks, particularly at low memory, are expensive), it may
just mean the system occasionally runs very fast into a wall. Hence,
the patch should be accompanied with tests showing that even under
very high stress for a long period of time it does not lock up,
and the changelog should include a *very* convincing description
of why PF_MEMALLOC refilling the per-cpu lists to be consumed by
low-priority users is not a problem.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-20 22:48 ` Mel Gorman
@ 2011-07-21 15:24 ` Christoph Lameter
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2011-07-21 15:24 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Wed, 20 Jul 2011, Mel Gorman wrote:
> On Wed, Jul 20, 2011 at 04:17:41PM -0500, Christoph Lameter wrote:
> > Hmmm... Maybe we can bypass the checks?
> >
>
> Maybe we should not.
>
> Watermarks should not just be ignored. They prevent the system
> deadlocking due to an inability to allocate a page needed to free more
> memory. This patch allows allocations that are not high priority
> or atomic to succeed when the buddy lists are at the min watermark
> and would normally be throttled. Minimally, this patch increases
> the risk of locking up due to memory exhaustion. For example,
> a GFP_ATOMIC allocation can refill the per-cpu list with pages
> that are then consumed by GFP_KERNEL allocations; the next GFP_ATOMIC
> allocation refills again, gets consumed, etc. It's even worse if it's
> PF_MEMALLOC allocations that are refilling the lists, as they ignore
> watermarks. If this is happening on enough CPUs, it will cause trouble.
Hmmm... True. This allocation complexity prevents effective use of caches.
> At the very least, the performance benefit of such a change should
> be illustrated. Even if it's faster (and I'd expect it to be;
> watermark checks, particularly at low memory, are expensive), it may
> just mean the system occasionally runs very fast into a wall. Hence,
> the patch should be accompanied with tests showing that even under
> very high stress for a long period of time it does not lock up,
> and the changelog should include a *very* convincing description
> of why PF_MEMALLOC refilling the per-cpu lists to be consumed by
> low-priority users is not a problem.
The performance of the page allocator is extremely bad at this point,
and it is so because of all these checks in the critical paths.
Subsystems have worked around this in numerous ways in the past, and I
would think there is no question that removing expensive checks from
the fastpath improves performance.
Maybe the only solution is to build a consistent second layer of
caching around the page allocator that is usable by various subsystems?
SLAB has in the past provided such a caching layer. The problem is that
people are now trying to build similar complexity into the fast path of
those allocators as well (e.g. the NFS swap patch, with its ways of
reserving objects to fix the issue you mentioned above of objects being
taken for the wrong reasons). We need some solution that allows the
implementation of fast object allocation, and that means reducing the
complexity of what goes on during page alloc and free.
end of thread, other threads:[~2011-07-21 15:24 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-15 15:08 [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1 Mel Gorman
2011-07-15 15:08 ` [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
2011-07-18 14:56 ` Christoph Lameter
2011-07-18 16:05 ` Mel Gorman
2011-07-18 17:20 ` Christoph Lameter
2011-07-18 21:13 ` Mel Gorman
2011-07-18 21:54 ` Christoph Lameter
2011-07-19 14:01 ` Christoph Lameter
2011-07-20 18:08 ` Christoph Lameter
2011-07-20 19:18 ` Mel Gorman
2011-07-20 19:28 ` Christoph Lameter
2011-07-20 19:52 ` Christoph Lameter
2011-07-20 21:17 ` Christoph Lameter
2011-07-20 22:48 ` Mel Gorman
2011-07-21 15:24 ` Christoph Lameter
2011-07-15 15:09 ` [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
2011-07-19 11:46 ` [PATCH] mm: page allocator: Reconsider zones for allocation after direct reclaim fix Mel Gorman