linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>,
	Dave Hansen <dave.hansen@intel.com>, Mel Gorman <mgorman@suse.de>,
	anton@sambar.org, linuxppc-dev@lists.ozlabs.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rik van Riel <riel@redhat.com>, Dan Streetman <ddstreet@ieee.org>
Subject: Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
Date: Fri, 3 Apr 2015 10:45:56 -0700	[thread overview]
Message-ID: <20150403174556.GF32318@linux.vnet.ibm.com> (raw)
In-Reply-To: <551E47EF.5030800@suse.cz>

On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote:
> On 03/31/2015 11:48 AM, Michal Hocko wrote:
> >On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> >>On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> >>>On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> >>>>@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >>>>
> >>>>         for (i = 0; i <= ZONE_NORMAL; i++) {
> >>>>                 zone = &pgdat->node_zones[i];
> >>>>-               if (!populated_zone(zone))
> >>>>+               if (!populated_zone(zone) || !zone_reclaimable(zone))
> >>>>                         continue;
> >>>>
> >>>>                 pfmemalloc_reserve += min_wmark_pages(zone);
> >>>
> >>>Do you really want zone_reclaimable()?  Or do you want something more
> >>>direct like "zone_reclaimable_pages(zone) == 0"?
> >>
> >>Yeah, I guess in my testing this worked out to be the same, since
> >>zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> >>always be false. Thanks!
> >>
> >>Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> >>reserves if node has no ZONE_NORMAL") from Mel.
> >>
> >>We have a system with the following topology:
> >>
> >># numactl -H
> >>available: 3 nodes (0,2-3)
> >>node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>23 24 25 26 27 28 29 30 31
> >>node 0 size: 28273 MB
> >>node 0 free: 27323 MB
> >>node 2 cpus:
> >>node 2 size: 16384 MB
> >>node 2 free: 0 MB
> >>node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> >>node 3 size: 30533 MB
> >>node 3 free: 13273 MB
> >>node distances:
> >>node   0   2   3
> >>   0:  10  20  20
> >>   2:  20  10  20
> >>   3:  20  20  10
> >>
> >>Node 2 has no free memory, because:
> >># cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> >>1
> >>
> >>This leads to the following zoneinfo:
> >>
> >>Node 2, zone      DMA
> >>   pages free     0
> >>         min      1840
> >>         low      2300
> >>         high     2760
> >>         scanned  0
> >>         spanned  262144
> >>         present  262144
> >>         managed  262144
> >>...
> >>   all_unreclaimable: 1
> >
> >Blee, this is a weird configuration.
> >
> >>If one then attempts to allocate some normal 16M hugepages via
> >>
> >>echo 37 > /proc/sys/vm/nr_hugepages
> >>
> >>The echo never returns and kswapd2 consumes CPU cycles.
> >>
> >>This is because throttle_direct_reclaim ends up calling
> >>wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> >>pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> >>are any reserves, and if so, then indicates the watermarks are ok, by
> >>seeing if there are sufficient free pages.
> >>
> >>675becce15 added a condition already for memoryless nodes. In this case,
> >>though, the node has memory, it is just all consumed (and not
> >>reclaimable). Effectively, though, the result is the same on this call
> >>to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> >>condition.
> >>
> >>With this change, the afore-mentioned 16M hugepage allocation attempt
> >>succeeds and correctly round-robins between Nodes 1 and 3.
> >
> >I am just wondering whether this is the right/complete fix. Don't we
> >need a similar treatment at more places?
> >I would expect kswapd would be looping endlessly because the zone
> >wouldn't be balanced obviously. But I would be wrong... because
> >pgdat_balanced is doing this:
> >		/*
> >		 * A special case here:
> >		 *
> >		 * balance_pgdat() skips over all_unreclaimable after
> >		 * DEF_PRIORITY. Effectively, it considers them balanced so
> >		 * they must be considered balanced here as well!
> >		 */
> >		if (!zone_reclaimable(zone)) {
> >			balanced_pages += zone->managed_pages;
> >			continue;
> >		}
> >
> >and zone_reclaimable is false for you as you didn't have any
> >zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> >would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> >zones right?) and so the kswapd would be woken up easily. So it looks
> >like a mess.
> 
> Yeah, looks like a much cleaner/complete solution would be to remove
> such zones from zonelists. But that means covering all situations
> when these hugepages are allocated/removed and the approach then
> looks similar to memory hotplug.
> Also I'm not sure if the ability to actually allocate the reserved
> hugepage would be impossible due to not being reachable by a
> zonelist...
> 
> >There are possibly other places which rely on populated_zone or
> >for_each_populated_zone without checking reclaimability. Are those
> >working as expected?
> 
> Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point
> in waking it up just to let it immediately go to sleep again.
> 
> >That being said. I am not objecting to this patch. I am just trying to
> >wrap my head around possible issues from such a weird configuration and
> >all the consequences.
> >
> >>Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >
> >The patch as is doesn't seem to be harmful.
> >
> >Reviewed-by: Michal Hocko <mhocko@suse.cz>
> >
> >>---
> >>v1 -> v2:
> >>   Check against zone_reclaimable_pages, rather zone_reclaimable, based
> >>   upon feedback from Dave Hansen.
> >
> >Dunno, but shouldn't we use the same thing here and in pgdat_balanced?
> >zone_reclaimable_pages seems to be used only from zone_reclaimable().
> 
> pgdat_balanced() has a different goal than pfmemalloc_watermark_ok()
> and needs to match what balance_pgdat() does, which includes
> considering NR_PAGES_SCANNED through zone_reclaimable(). For the
> situation considered in this patch, result of zone_reclaimable()
> will match the test zone_reclaimable_pages(zone) == 0, so it is fine
> I think.

Right.

> What I find somewhat worrying though is that we could potentially
> break the pfmemalloc_watermark_ok() test in situations where
> zone_reclaimable_pages(zone) == 0 is a transient situation (and not
> a permanently allocated hugepage). In that case, the throttling is
> supposed to help system recover, and we might be breaking that
> ability with this patch, no?

Well, if it's transient, we'll skip it this time through, and once there
are reclaimable pages, we should notice it again.

I'm not familiar enough with this logic, so I'll read through the code
again soon to see if your concern is valid, as best I can.

Thanks,
Nish


  reply	other threads:[~2015-04-03 17:46 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-27 19:28 [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones Nishanth Aravamudan
2015-03-27 19:39 ` Nishanth Aravamudan
2015-03-27 19:58 ` Dan Streetman
2015-03-27 20:17 ` Dave Hansen
2015-03-27 22:23   ` [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages Nishanth Aravamudan
2015-03-31  9:48     ` Michal Hocko
2015-04-03  7:57       ` Vlastimil Babka
2015-04-03 17:45         ` Nishanth Aravamudan [this message]
2015-05-05 22:09           ` Nishanth Aravamudan
2015-05-06  9:28             ` Vlastimil Babka
2015-05-08 22:47               ` Andrew Morton
2015-05-08 23:18                 ` Nishanth Aravamudan
2015-04-03 17:43       ` Nishanth Aravamudan
2015-04-03 18:24         ` Michal Hocko
2015-04-03 18:50           ` Nishanth Aravamudan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150403174556.GF32318@linux.vnet.ibm.com \
    --to=nacc@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton@sambar.org \
    --cc=dave.hansen@intel.com \
    --cc=ddstreet@ieee.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=riel@redhat.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).