From: Ivan Babrou <ivan@cloudflare.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org, linux-kernel <linux-kernel@vger.kernel.org>,
	 kernel-team <kernel-team@cloudflare.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: Reclaim regression after 1c30844d2dfe
Date: Wed, 12 Feb 2020 14:45:39 -0800
Message-ID: <CABWYdi36O_Gd6=CVZkxY6RR8r4EKzEngScngT5VZc9-x4TB=3w@mail.gmail.com>
In-Reply-To: <20200211101627.GJ3466@techsingularity.net>

Here's a typical graph: https://imgur.com/a/n03x5yH

* Green (numa0) and blue (numa1) for 4.19
* Yellow (numa0) and orange (numa1) for 5.4

These downward slopes on numa0 on 5.4 are typical of the
worst-case scenario.

If I try to clean up data a bit from a bunch of machines, this is how
numa0 compares to numa1 with 1h average values of free memory above
5GiB:

* https://imgur.com/a/6T4rRzi

I think it's safe to say that numa0 is much worse, but I cannot
be 100% sure that numa1 is free from adverse effects; they may just be
hiding in the noise caused by rolling reboots.
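
For anyone who wants to compare per-node free memory on their own
machines, here is a minimal sketch in C. It is not how the graphs above
were produced. It assumes the usual sysfs layout, where
/sys/devices/system/node/node<N>/meminfo contains a line of the form
"Node <N> MemFree: <value> kB". The function names are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of a per-node meminfo file.
 * Returns 1 and fills *node and *kb if the line is a MemFree entry,
 * otherwise returns 0 and leaves the outputs in an unspecified state.
 */
int parse_node_memfree(const char *line, int *node, unsigned long *kb)
{
	assert(line && node && kb);
	return sscanf(line, "Node %d MemFree: %lu kB", node, kb) == 2;
}

/*
 * Read MemFree (in kB) for one NUMA node from sysfs.
 * Returns -1 if the file cannot be opened or has no MemFree line.
 */
long read_node_memfree_kb(int node_id)
{
	char path[64], line[256];
	unsigned long kb;
	long result = -1;
	int node;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/meminfo", node_id);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (parse_node_memfree(line, &node, &kb)) {
			result = (long)kb;
			break;
		}
	}
	fclose(f);
	return result;
}
```

Calling read_node_memfree_kb(0) and read_node_memfree_kb(1) periodically
and averaging the samples per hour would give numbers comparable in
spirit to the numa0 and numa1 series above.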


On Tue, Feb 11, 2020 at 2:16 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> > This change from 5.5 times:
> >
> > * https://github.com/torvalds/linux/commit/1c30844d2dfe
> >
> > > mm: reclaim small amounts of memory when an external fragmentation event occurs
> >
> > Introduced undesired effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no spikes. After
> > the upgrade we saw some machines decide that they need a lot more than
> > that, with frequent spikes above 10GB, often only on a single numa
> > node.
> >
> > We can see kswapd quite active in balance_pgdat (it didn't look like
> > it slept at all):
> >
> > $ ps uax | fgrep kswapd
> > root       1850 23.0  0.0      0     0 ?        R    Jan30 1902:24 [kswapd0]
> > root       1851  1.8  0.0      0     0 ?        S    Jan30 152:16 [kswapd1]
> >
> > This in turn massively increased pressure on page cache, which did not
> > go well for services that depend on having a quick response from a
> > local cache backed by solid storage.
> >
> > Here's what it looked like when I zeroed vm.watermark_boost_factor:
> >
> > * https://imgur.com/a/6IZWicU
> >
> > IO subsided from 100% busy in page cache population at 300MB/s on a
> > single SATA drive down to under 100MB/s.
> >
> > This sort of regression doesn't seem like a good thing.
>
> It is not a good thing, so thanks for the report. Obviously I have not
> seen something similar or at least not severe enough to show up on my radar.
> I'd seen some increases with reclaim activity affecting benchmarks that
> rely on use-twice data remaining resident but nothing severe enough to
> warrant action.
>
> Can you tell me if it is *always* node 0 that shows crazy activity? I
> ask because some conditions would have to be met for the boost to always
> apply. It's already a per-zone attribute but it is treated indirectly as a
> pgdat property. What I'm thinking is that on node 0, the DMA32 or DMA zone
> gets boosted but vmscan then reclaims from higher zones until the boost is
> removed. That would excessively reclaim memory but be specific to node 0.
>
> I've cc'd Rik as he says he saw something similar even on single node
> systems. The boost applying to lower zones would still affect single
> node systems but NUMA machines always getting impacted by boost would
> show that the boost really needs to be a per-node flag. Sure, we *could*
> apply the reclaim to just the lower zones but that potentially means a
> *lot* of scan activity -- potentially 124G of pages before a lower zone
> page is found on Ivan's machine. That might be the very situation being
> encountered here.
>
> An alternative is that boosting is only ever applied to the highest
> populated zone in a system. The intent of the patch was primarily about
> THP which can use any zone to reduce their allocation latency. While
> it's possible that there are cases where the latency of other orders
> matter *and* they require lower zones, I think it's unlikely and that
> this would be a safer option overall.
>
> However, overall I think the simplest is to abort the boosting if
> reclaim is reaching higher priorities without being able to clear
> the boost. The boost is best-effort to reduce allocation latency in
> the future. This approach still has some overhead as there is a reclaim
> pass but kswapd will abort and go to sleep if the normal watermarks
> are met.
>
> This is build tested only. Ideally someone on the cc has a test case
> that can reproduce this specific problem of excessive kswapd activity.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 572fb17c6273..71dd47172cef 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>         return false;
>  }
>
> +static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
> +                               unsigned long *zone_boosts)
> +{
> +       struct zone *zone;
> +       unsigned long flags;
> +       int i;
> +
> +       for (i = 0; i <= classzone_idx; i++) {
> +               if (!zone_boosts[i])
> +                       continue;
> +
> +               /* Increments are under the zone lock */
> +               zone = pgdat->node_zones + i;
> +               spin_lock_irqsave(&zone->lock, flags);
> +               zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> +               spin_unlock_irqrestore(&zone->lock, flags);
> +       }
> +}
> +
>  /* Clear pgdat state for congested, dirty or under writeback. */
>  static void clear_pgdat_congested(pg_data_t *pgdat)
>  {
> @@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                 if (!nr_boost_reclaim && balanced)
>                         goto out;
>
> -               /* Limit the priority of boosting to avoid reclaim writeback */
> -               if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> -                       raise_priority = false;
> +               /*
> +                * Abort boosting if reclaiming at higher priority is not
> +                * working to avoid excessive reclaim due to lower zones
> +                * being boosted.
> +                */
> +               if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
> +                       acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
> +                       boosted = false;
> +                       nr_boost_reclaim = 0;
> +                       goto restart;
> +               }
>
>                 /*
>                  * Do not writeback or swap pages for boosted reclaim. The
> @@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  out:
>         /* If reclaim was boosted, account for the reclaim done in this pass */
>         if (boosted) {
> -               unsigned long flags;
> -
> -               for (i = 0; i <= classzone_idx; i++) {
> -                       if (!zone_boosts[i])
> -                               continue;
> -
> -                       /* Increments are under the zone lock */
> -                       zone = pgdat->node_zones + i;
> -                       spin_lock_irqsave(&zone->lock, flags);
> -                       zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> -                       spin_unlock_irqrestore(&zone->lock, flags);
> -               }
> +               acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
>
>                 /*
>                  * As there is now likely space, wakeup kcompact to defragment
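
To make the accounting in the quoted patch easier to follow, here is a
user-space sketch of the clamped subtraction that acct_boosted_reclaim()
performs. It is a model only, not kernel code: struct model_zone and the
function names are hypothetical, the zone lock is omitted, and the loop
takes a zone count rather than a classzone_idx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only the field modelled here. */
struct model_zone {
	unsigned long watermark_boost;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Remove the boost that was accounted for each zone in this pass.
 * The min() clamp keeps the unsigned counter from wrapping around if
 * the boost was already lowered by the time the subtraction happens.
 */
void model_acct_boosted_reclaim(struct model_zone *zones, size_t nr_zones,
				const unsigned long *zone_boosts)
{
	size_t i;

	assert(zones && zone_boosts);
	for (i = 0; i < nr_zones; i++) {
		if (!zone_boosts[i])
			continue;
		zones[i].watermark_boost -=
			min_ul(zones[i].watermark_boost, zone_boosts[i]);
	}
}
```

In this model a zone whose recorded boost is larger than its current
watermark_boost ends up at zero rather than wrapping to a huge value,
and zones with no recorded boost are left untouched.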

