From: Wanpeng Li <liwanp@linux.vnet.ibm.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@surriel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zlatko Calusic <zcalusic@bitsync.net>,
	Minchan Kim <minchan@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch v2 3/3] mm: page_alloc: fair zone allocator policy
Date: Mon, 5 Aug 2013 18:34:56 +0800
Message-ID: <31100.7594842978$1375699113@news.gmane.org>
In-Reply-To: <1375457846-21521-4-git-send-email-hannes@cmpxchg.org>

On Fri, Aug 02, 2013 at 11:37:26AM -0400, Johannes Weiner wrote:
>Each zone that holds userspace pages of one workload must be aged at a
>speed proportional to the zone size.  Otherwise, the time an
>individual page gets to stay in memory depends on the zone it happened
>to be allocated in.  Asymmetry in the zone aging creates rather
>unpredictable aging behavior and results in the wrong pages being
>reclaimed, activated etc.
>
>But exactly this happens right now because of the way the page
>allocator and kswapd interact.  The page allocator uses per-node lists
>of all zones in the system, ordered by preference, when allocating a
>new page.  When the first iteration does not yield any results, kswapd
>is woken up and the allocator retries.  Due to the way kswapd reclaims
>zones below the high watermark while a zone can be allocated from when
>it is above the low watermark, the allocator may keep kswapd running
>while kswapd reclaim ensures that the page allocator can keep
>allocating from the first zone in the zonelist for extended periods of
>time.  Meanwhile the other zones rarely see new allocations and thus
>get aged much slower in comparison.
>
>The result is that the occasional page placed in lower zones gets
>relatively more time in memory, even gets promoted to the active list
>after its peers have long been evicted.  Meanwhile, the bulk of the
>working set may be thrashing on the preferred zone even though there
>may be significant amounts of memory available in the lower zones.
>
>Even the most basic test -- repeatedly reading a file slightly bigger
>than memory -- shows how broken the zone aging is.  In this scenario,
>no single page should be able to stay in memory long enough to get
>referenced twice and activated, but activation happens in spades:
>
>  $ grep active_file /proc/zoneinfo
>      nr_inactive_file 0
>      nr_active_file 0
>      nr_inactive_file 0
>      nr_active_file 8
>      nr_inactive_file 1582
>      nr_active_file 11994
>  $ cat data data data data >/dev/null
>  $ grep active_file /proc/zoneinfo
>      nr_inactive_file 0
>      nr_active_file 70
>      nr_inactive_file 258753
>      nr_active_file 443214
>      nr_inactive_file 149793
>      nr_active_file 12021
>
>Fix this with a very simple round robin allocator.  Each zone is
>allowed a batch of allocations that is proportional to the zone's
>size, after which it is treated as full.  The batch counters are reset
>when all zones have been tried and the allocator enters the slowpath
>and kicks off kswapd reclaim.  Allocation and reclaim are now fairly
>spread out to all available/allowable zones:
>
>  $ grep active_file /proc/zoneinfo
>      nr_inactive_file 0
>      nr_active_file 0
>      nr_inactive_file 174
>      nr_active_file 4865
>      nr_inactive_file 53
>      nr_active_file 860
>  $ cat data data data data >/dev/null
>  $ grep active_file /proc/zoneinfo
>      nr_inactive_file 0
>      nr_active_file 0
>      nr_inactive_file 666622
>      nr_active_file 4988
>      nr_inactive_file 190969
>      nr_active_file 937
>

Why doesn't the round-robin allocator consume ZONE_DMA?
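
To make the question concrete, here is a minimal userspace sketch of the
batching as I read the changelog. It is only a toy model: struct toy_zone,
toy_reset_batches() and toy_alloc_page() are made-up names, the batch sizes
are arbitrary, and watermark checks, lowmem reserves and zone_reclaim_mode
are not modeled at all.

```c
#include <stddef.h>

#define NR_ZONES 3

/* Hypothetical toy zone: only the fields the batching model needs. */
struct toy_zone {
	const char *name;
	long batch_size;   /* stands in for high_wmark - low_wmark */
	long alloc_batch;  /* allocations left in the current round */
	long allocated;    /* total pages handed out so far */
};

/* Refill every batch; models the reset done when entering the slowpath. */
static void toy_reset_batches(struct toy_zone *zones, int nr)
{
	for (int i = 0; i < nr; i++)
		zones[i].alloc_batch = zones[i].batch_size;
}

/*
 * Walk the zones in preference order and skip any zone whose batch is
 * used up.  If every batch is exhausted, reset them all and retry once.
 */
static struct toy_zone *toy_alloc_page(struct toy_zone *zones, int nr)
{
	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < nr; i++) {
			if (zones[i].alloc_batch <= 0)
				continue;
			zones[i].alloc_batch--;
			zones[i].allocated++;
			return &zones[i];
		}
		toy_reset_batches(zones, nr);
	}
	return NULL; /* only reached if every batch_size is <= 0 */
}
```

In this model every zone, including the last one in the list, receives a
share proportional to its batch size, which is why the all-zero counters
for the first zone in the /proc/zoneinfo output above stand out.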

>When zone_reclaim_mode is enabled, allocations will now spread out to
>all zones on the local node, not just the first preferred zone (which
>on a 4G node might be a tiny Normal zone).
>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
>---
> include/linux/mmzone.h |  1 +
> mm/page_alloc.c        | 69 ++++++++++++++++++++++++++++++++++++++++++--------
> 2 files changed, 60 insertions(+), 10 deletions(-)
>
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index af4a3b7..dcad2ab 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -352,6 +352,7 @@ struct zone {
> 	 * free areas of different sizes
> 	 */
> 	spinlock_t		lock;
>+	int			alloc_batch;
> 	int                     all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> 	/* Set to true when the PG_migrate_skip bits should be cleared */
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 3b27d3e..b2cdfd0 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -1817,6 +1817,11 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> 	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
> }
>
>+static bool zone_local(struct zone *local_zone, struct zone *zone)
>+{
>+	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
>+}
>+
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> {
> 	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
>@@ -1854,6 +1859,11 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> {
> }
>
>+static bool zone_local(struct zone *local_zone, struct zone *zone)
>+{
>+	return true;
>+}
>+
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> {
> 	return true;
>@@ -1901,6 +1911,26 @@ zonelist_scan:
> 		if (alloc_flags & ALLOC_NO_WATERMARKS)
> 			goto try_this_zone;
> 		/*
>+		 * Distribute pages in proportion to the individual
>+		 * zone size to ensure fair page aging.  The zone a
>+		 * page was allocated in should have no effect on the
>+		 * time the page has in memory before being reclaimed.
>+		 *
>+		 * When zone_reclaim_mode is enabled, try to stay in
>+		 * local zones in the fastpath.  If that fails, the
>+		 * slowpath is entered, which will do another pass
>+		 * starting with the local zones, but ultimately fall
>+		 * back to remote zones that do not partake in the
>+		 * fairness round-robin cycle of this zonelist.
>+		 */
>+		if (alloc_flags & ALLOC_WMARK_LOW) {
>+			if (zone->alloc_batch <= 0)
>+				continue;
>+			if (zone_reclaim_mode &&
>+			    !zone_local(preferred_zone, zone))
>+				continue;
>+		}
>+		/*
> 		 * When allocating a page cache page for writing, we
> 		 * want to get it from a zone that is within its dirty
> 		 * limit, such that no single zone holds more than its
>@@ -2006,7 +2036,8 @@ this_zone_full:
> 		goto zonelist_scan;
> 	}
>
>-	if (page)
>+	if (page) {
>+		zone->alloc_batch -= 1U << order;
> 		/*
> 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> 		 * necessary to allocate the page. The expectation is
>@@ -2015,6 +2046,7 @@ this_zone_full:
> 		 * for !PFMEMALLOC purposes.
> 		 */
> 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
>+	}
>
> 	return page;
> }
>@@ -2346,16 +2378,28 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> 	return page;
> }
>
>-static inline
>-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
>-						enum zone_type high_zoneidx,
>-						enum zone_type classzone_idx)
>+static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
>+			     struct zonelist *zonelist,
>+			     enum zone_type high_zoneidx,
>+			     struct zone *preferred_zone)
> {
> 	struct zoneref *z;
> 	struct zone *zone;
>
>-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
>-		wakeup_kswapd(zone, order, classzone_idx);
>+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>+		if (!(gfp_mask & __GFP_NO_KSWAPD))
>+			wakeup_kswapd(zone, order, zone_idx(preferred_zone));
>+		/*
>+		 * Only reset the batches of zones that were actually
>+		 * considered in the fast path, we don't want to
>+		 * thrash fairness information for zones that are not
>+		 * actually part of this zonelist's round-robin cycle.
>+		 */
>+		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
>+			continue;
>+		zone->alloc_batch = high_wmark_pages(zone) -
>+			low_wmark_pages(zone);
>+	}
> }
>
> static inline int
>@@ -2451,9 +2495,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 		goto nopage;
>
> restart:
>-	if (!(gfp_mask & __GFP_NO_KSWAPD))
>-		wake_all_kswapd(order, zonelist, high_zoneidx,
>-						zone_idx(preferred_zone));
>+	prepare_slowpath(gfp_mask, order, zonelist,
>+			 high_zoneidx, preferred_zone);
>
> 	/*
> 	 * OK, we're below the kswapd watermark and have kicked background
>@@ -4754,6 +4797,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> 		zone_seqlock_init(zone);
> 		zone->zone_pgdat = pgdat;
>
>+		/* For bootup, initialized properly in watermark setup */
>+		zone->alloc_batch = zone->managed_pages;
>+
> 		zone_pcp_init(zone);
> 		lruvec_init(&zone->lruvec);
> 		if (!size)
>@@ -5525,6 +5571,9 @@ static void __setup_per_zone_wmarks(void)
> 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
> 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
>
>+		zone->alloc_batch = high_wmark_pages(zone) -
>+			low_wmark_pages(zone);
>+
> 		setup_zone_migrate_reserve(zone);
> 		spin_unlock_irqrestore(&zone->lock, flags);
> 	}
>-- 
>1.8.3.2
>
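
As a side note on the batch size: with the watermark assignments in the
last hunk, the min watermark cancels out of high - low. A small sketch of
that arithmetic follows; toy_alloc_batch() is a made-up helper, and how tmp
is derived is not visible in the hunk, so it is simply treated as an input
here.

```c
/* Hypothetical helper mirroring the watermark arithmetic in the last hunk. */
static unsigned long toy_alloc_batch(unsigned long min_pages, unsigned long tmp)
{
	unsigned long low  = min_pages + (tmp >> 2);
	unsigned long high = min_pages + (tmp >> 1);

	/* min_pages cancels: the batch is (tmp >> 1) - (tmp >> 2). */
	return high - low;
}
```

So the batch is roughly tmp / 4, and it scales with the zone size only to
the extent that tmp does.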
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


