From: Yang Shi <shy828301@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Mina Almasry <almasrymina@google.com>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	weixugc@google.com, shakeelb@google.com, gthelen@google.com,
	fvdl@google.com, Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH V1] mm: Disable demotion from proactive reclaim
Date: Thu, 1 Dec 2022 14:45:36 -0800
Message-ID: <CAHbLzkr9k8fvBGVskN1sMJiLX_JkWW7OrrscUrA0xASh+rYN7Q@mail.gmail.com>
In-Reply-To: <87h6yfao37.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Nov 30, 2022 at 5:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yang Shi <shy828301@gmail.com> writes:
> >>
> >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yang Shi <shy828301@gmail.com> writes:
> >> >>
> >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Johannes,
> >> >> >>
> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> >> >> [...]
> >> >> >> >
> >> >> >> > The fallback to reclaim actually strikes me as wrong.
> >> >> >> >
> >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >> >> >
> >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> >> >> > breaking the layering. Rather it should deflect that pressure to the
> >> >> >> > lower layers to make room. This makes sure we maintain an aging
> >> >> >> > pipeline that honors the memory tier hierarchy.
> >> >> >>
> >> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> >> >> possible too.  Now, when we allocate memory for demotion
> >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> >> >> for example, the following patch from Hasan may help via waking up
> >> >> >> kswapd earlier.
> >> >> >
> >> >> > For the ideal case, I do agree with Johannes to demote the page tier
> >> >> > by tier rather than reclaiming them from the higher tiers. But I also
> >> >> > agree with your premature OOM concern.
> >> >> >
> >> >> >>
> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >> >> >>
> >> >> >> Do you know what is the next step plan for this patch?
> >> >> >>
> >> >> >> Should we do even more?
> >> >> >
> >> >> > In my initial implementation I implemented a simple throttle logic
> >> >> > when the demotion is not going to succeed if the demotion target has
> >> >> > not enough free memory (just check the watermark) to make migration
> >> >> > succeed without doing any reclamation. Shall we resurrect that?
> >> >>
> >> >> Can you share the link to your throttle patch?  Or paste it here?
> >> >
> >> > I just found this on the mailing list.
> >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/
> >>
> >> Per my understanding, this patch will avoid demoting if there's no free
> >> space on demotion target?  If so, I think that we should trigger kswapd
> >> reclaiming on demotion target before that.  And we can simply avoid to
> >> fall back to reclaim firstly, then avoid to scan as an improvement as
> >> that in your patch above.
> >
> > Yes, it should. The rough idea looks like:
> >
> > if (the demote target is contended)
> >     wake up kswapd
> >     reclaim_throttle(VMSCAN_THROTTLE_DEMOTION)
> >     retry demotion
> >
> > The kswapd is responsible for clearing the contention flag.
>
> We may do this, at least for demotion in kswapd.  But I think that this
> could be the second step optimization after we make correct choice
> between demotion/reclaim.  What if the pages in demotion target is too
> hot to be reclaimed first?  Should we reclaim in fast memory node to
> avoid OOM?

IMHO we can't avoid reclaiming from the fast nodes entirely if we
prioritize avoiding OOMs. But it should happen very rarely with the
throttling logic or other methods. BTW, have you run any tests to see
how often vmscan reclaims from fast nodes instead of demoting, with
the current implementation, for some typical workloads?
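
To make the idea concrete, here is a minimal userspace sketch of the
throttle-and-retry flow quoted above, with reclaim on the fast node kept
only as the last resort. It is a toy model under stated assumptions, not
kernel code: struct node_model, target_contended(), wake_kswapd_and_wait()
and demote_or_reclaim() are hypothetical names, kswapd is modelled as a
synchronous call that frees a fixed batch, and "contended" is just a
watermark check on the demotion target.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a lower-tier (demotion target) node. */
struct node_model {
	long free_pages;   /* pages currently free on the target */
	long low_wmark;    /* going below this counts as contended */
	long reclaimable;  /* cold pages the modelled kswapd can still free */
	long kswapd_batch; /* pages freed per modelled kswapd wakeup */
};

enum outcome { DEMOTED, RECLAIMED_FROM_FAST_NODE };

/* Would demoting nr pages push the target below its watermark? */
static bool target_contended(const struct node_model *n, long nr)
{
	return n->free_pages - nr < n->low_wmark;
}

/*
 * Stand-in for "wake up kswapd, then throttle until it makes progress".
 * Returns the number of pages freed; 0 means no progress was possible
 * (for example, everything on the target is mlocked or too hot).
 */
static long wake_kswapd_and_wait(struct node_model *n)
{
	long freed = n->reclaimable < n->kswapd_batch ?
		     n->reclaimable : n->kswapd_batch;

	n->reclaimable -= freed;
	n->free_pages += freed;
	return freed;
}

/*
 * Try to demote nr pages. While the target is contended, wake the
 * modelled kswapd and retry, at most max_retries times. Fall back to
 * reclaiming on the fast node only when retries run out or kswapd
 * cannot make progress.
 */
static enum outcome demote_or_reclaim(struct node_model *target, long nr,
				      int max_retries)
{
	assert(nr > 0);

	for (int attempt = 0; ; attempt++) {
		if (!target_contended(target, nr)) {
			target->free_pages -= nr;
			return DEMOTED;
		}
		if (attempt >= max_retries ||
		    wake_kswapd_and_wait(target) == 0)
			return RECLAIMED_FROM_FAST_NODE;
	}
}
```

In this model the fallback fires only when the target has nothing left
to free or the retry budget is exhausted, which is the "very rarely"
case above. A real implementation would need an asynchronous wakeup and
a bounded sleep rather than a synchronous call.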

>
> Best Regards,
> Huang, Ying
>
> >>
> >> > But it didn't have the throttling logic. I may not have submitted
> >> > that version to the mailing list since we decided to drop this and
> >> > merge mine and Dave's.
> >> >
> >> > Anyway it is not hard to add the throttling logic, we already have a
> >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle
> >> > reclaim until some writeback completes if congested".
> >> >>
> >> >> > Waking kswapd sooner is fine with me, but it may not be enough. For
> >> >> > example, kswapd may not keep up, so premature OOM may happen on
> >> >> > higher tiers or reclaim may still happen. I think throttling the
> >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And
> >> >> > since lower-tier memory is typically much larger than the higher
> >> >> > tiers, the throttle should happen very rarely IMHO.
> >> >> >
> >> >> >>
> >> >> >> From another point of view, I still think that we can use falling back
> >> >> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> >> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> >> >> to be reclaimed.
> >> >> >>
> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior.
> >> >>
> >> >> Best Regards,
> >> >> Huang, Ying
