All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yu Zhao <yuzhao@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>,
	Yafang Shao <laoar.shao@gmail.com>,
	Chris Down <chris@chrisdown.name>,
	cgroups@vger.kernel.org, kernel-team@fb.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: MGLRU premature memcg OOM on slow writes
Date: Tue, 12 Mar 2024 23:22:52 -0400	[thread overview]
Message-ID: <20240313032252.GC65481@cmpxchg.org> (raw)
In-Reply-To: <CAOUHufa4zgWtA_XyWZUcQ7pqbarC2VpJd-=J--gy8tCGQin+yQ@mail.gmail.com>

On Tue, Mar 12, 2024 at 10:08:13PM -0400, Yu Zhao wrote:
> On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> > > Yes, these two are among the differences between the active/inactive
> > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> > > positions of dirty pages. The active/inactive LRU moves dirty pages
> > > all the way to the end of the line (reclaim happens at the front)
> > > whereas MGLRU moves them into the middle, during direct reclaim. The
> > > rationale for MGLRU was that this way those dirty pages would still
> > > be counted as "inactive" (or cold).
> >
> > Note that activating the page is not a statement on the page's
> > hotness. It's simply to park it away from the scanner. We could as
> > well have moved it to the unevictable list - this is just easier.
> >
> > folio_end_writeback() will call folio_rotate_reclaimable() and move it
> > back to the inactive tail, to make it the very next reclaim target as
> > soon as it's clean.
> >
> > > This theory can be quickly verified by comparing how much
> > > nr_vmscan_immediate_reclaim grows, i.e.,
> > >
> > >   Before the copy
> > >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >   And then after the copy
> > >     grep nr_vmscan_immediate_reclaim /proc/vmstat
> > >
> > > The growth should be trivial for MGLRU and nontrivial for the
> > > active/inactive LRU.
> > >
> > > If this is indeed the case, I'd appreciate very much if anyone could
> > > try the following (I'll try it myself too later next week).
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4255619a1a31..020f5d98b9a1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > >       }
> > >
> > >       /* waiting for writeback */
> > > -     if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > > -         (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > -             gen = folio_inc_gen(lruvec, folio, true);
> > > -             list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > +     if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> > > +             DEFINE_MAX_SEQ(lruvec);
> > > +             int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > +
> > > +             old_gen = folio_update_gen(folio, new_gen);
> > > +             lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> > > +             list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > >               return true;
> >
> > Right, because MGLRU sorts these pages out before calling the scanner,
> > so they never get marked for immediate reclaim.
> >
> > But that also implies they won't get rotated back to the tail when
> > writeback finishes.
> 
> Those dirty pages are marked by PG_reclaim either by
> 
>   folio_inc_gen()
>   {
>     ...
>     if (reclaiming)
>       new_flags |= BIT(PG_reclaim);
>     ...
>   }
> 
> or [1], which I missed initially. So they should be rotated on writeback
> finishing up.
> 
> [1] https://lore.kernel.org/linux-mm/ZfC2612ZYwwxpOmR@google.com/

Ah, I missed that! Thanks.

> > Doesn't that mean that you now have pages that
> >
> > a) came from the oldest generation and were only deferred due to their
> >    writeback state, and
> >
> > b) are now clean and should be reclaimed. But since they're
> >    permanently advanced to the next gen, you'll instead reclaim pages
> >    that were originally ahead of them, and likely hotter.
> >
> > Isn't that an age inversion?
> >
> > Back to the broader question though: if reclaim demand outstrips clean
> > pages and the only viable candidates are dirty ones (e.g. an
> > allocation spike in the presence of dirty/writeback pages), there only
> > seem to be 3 options:
> >
> > 1) sleep-wait for writeback
> > 2) continue scanning, aka busy-wait for writeback + age inversions
> > 3) find nothing and declare OOM
> >
> > Since you're not doing 1), it must be one of the other two, no? One
> > way or another it has to either pace-match to IO completions, or OOM.
> 
> Yes, and in this case, 2) is possible but 3) is very likely.
> 
> MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't
> find any throttling on dirty pages for cgroup v2 either in the
> active/inactive LRU -- I assume Chris was on v2, and hence my take on
> throttling on dirty pages in the reclaim path not being the key for
> his case.

It's kind of spread out, but it's there:

shrink_folio_list() will bump nr_dirty on dirty pages, and
nr_congested if immediate reclaim folios cycle back around.

shrink_inactive_list() will wake the flushers if all the dirty pages
it encountered are still unqueued.

shrink_node() will set LRUVEC_CGROUP_CONGESTED, and then call
reclaim_throttle() on it. (As Chris points out, though, the throttle
call was not long ago changed from VMSCAN_THROTTLE_WRITEBACK to
VMSCAN_THROTTLE_CONGESTED, and appears a bit more fragile now than it
used to be. Probably worth following up on this.)

  reply	other threads:[~2024-03-13  3:22 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-09  2:31 MGLRU premature memcg OOM on slow writes Chris Down
2024-02-29 17:28 ` Chris Down
2024-02-29 23:51 ` Axel Rasmussen
2024-03-01  0:30   ` Chris Down
2024-03-08 19:18     ` Axel Rasmussen
2024-03-08 21:22       ` Johannes Weiner
2024-03-11  9:11       ` Yafang Shao
2024-03-12 16:44         ` Axel Rasmussen
2024-03-12 20:07           ` Yu Zhao
2024-03-12 20:11             ` Yu Zhao
2024-03-13  3:33               ` Yafang Shao
2024-03-14 22:23                 ` Yu Zhao
2024-03-15  2:38                   ` Yafang Shao
2024-03-15 14:27                     ` Johannes Weiner
2024-03-12 21:08             ` Johannes Weiner
2024-03-13  2:08               ` Yu Zhao
2024-03-13  3:22                 ` Johannes Weiner [this message]
2024-03-13 10:59               ` Hillf Danton
2024-03-01 11:25   ` Hillf Danton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240313032252.GC65481@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=axelrasmussen@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chris@chrisdown.name \
    --cc=kernel-team@fb.com \
    --cc=laoar.shao@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.