Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup

From: Yang Shi <shy828301@gmail.com>
To: Gavin Shan <gshan@redhat.com>
Cc: Roman Gushchin <guro@fb.com>, Linux MM <linux-mm@kvack.org>,
	drjones@redhat.com,  david@redhat.com, bhe@redhat.com,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
Date: Mon, 10 Feb 2020 19:03:22 -0800	[thread overview]
Message-ID: <CAHbLzkoQHSX-5pBfYSB2Dbw95EEQjSZtGfqKyv9qvSASO1A79g@mail.gmail.com> (raw)
In-Reply-To: <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>

On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> >> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> >>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> >>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> >>>> error") makes the scan size round up to @denominator regardless of the
> >>>> memory cgroup's state, online or offline. This affects the overall
> >>>> reclaiming behavior: The corresponding LRU list is eligible for reclaiming
> >>>> only when its size logically right shifted by @sc->priority is bigger than
> >>>> zero in the former formula (non-roundup one).
> >>>
> >>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> >>>
> >>
> >> It's correct that "(non-roundup one)" is typo and should have been dropped.
> >> Will be corrected in v2 if needed.
> >
> > Thanks!
> >
> >>
> >>>> For example, the inactive
> >>>> anonymous LRU list should have at least 0x4000 pages to be eligible for
> >>>> reclaiming when we have 60/12 for swappiness/priority and without taking
> >>>> scan/rotation ratio into account. After the roundup is applied, the
> >>>> inactive anonymous LRU list becomes eligible for reclaiming when its
> >>>> size is bigger than or equal to 0x1000 in the same condition.
> >>>>
> >>>>       (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> >>>>       ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> >>>>
> >>>> aarch64 has 512MB huge page size when the base page size is 64KB. The
> >>>> memory cgroup that has a huge page is always eligible for reclaiming in
> >>>> that case. The reclaiming is likely to stop after the huge page is
> >>>> reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
> >>>> skipped. It changes the overall reclaiming behavior. This fixes the issue
> >>>> by applying the roundup to offlined memory cgroups only, to give more
> >>>> preference to reclaim memory from offlined memory cgroup. It sounds
> >>>> reasonable as those memory is likely to be useless.
> >>>
> >>> So is the problem that relatively small memory cgroups are getting reclaimed
> >>> on default prio, however before they were skipped?
> >>>
> >>
> >> Yes, you're correct. There are two dimensions for global reclaim: priority
> >> (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
> >> from these two dimensions until the reclaimed pages are enough. If the roundup
> >> is applied to current memory cgroup and occasionally helps to reclaim enough
> >> memory, the subsequent priority and memory cgroup will be skipped.
> >>
> >>>>
> >>>> The issue was found by starting up 8 VMs on a Ampere Mustang machine,
> >>>> which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
> >>>> memory. 784MB swap space is consumed after these 8 VMs are completely up.
> >>>> Note that KSM is disable while THP is enabled in the testing. With this
> >>>> applied, the consumed swap space decreased to 60MB.
> >>>>
> >>>>            total        used        free      shared  buff/cache   available
> >>>> Mem:     16196       10065        2049          16        4081        3749
> >>>> Swap:     8175         784        7391
> >>>>            total        used        free      shared  buff/cache   available
> >>>> Mem:     16196       11324        3656          24        1215        2936
> >>>> Swap:     8175          60        8115
> >>>
> >>> Does it lead to any performance regressions? Or it's only about increased
> >>> swap usage?
> >>>
> >>
> >> Apart from swap usage, it also had performance downgrade for my case. With
> >> your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs.
> >> However, 236 seconds are used to do same thing with my patch applied on top
> >> of yours. There is 10% performance downgrade. It's the reason why I had a
> >> stable tag.
> >
> > I see...
> >
>
> I will put these data into the commit log of v2, which will be posted shortly.
>
> >>
> >>>>
> >>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> >>>> Cc: <stable@vger.kernel.org> # v4.20+
> >>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >>>> ---
> >>>>    mm/vmscan.c | 9 ++++++---
> >>>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index c05eb9efec07..876370565455 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>                            /*
> >>>>                             * Scan types proportional to swappiness and
> >>>>                             * their relative recent reclaim efficiency.
> >>>> -                   * Make sure we don't miss the last page
> >>>> -                   * because of a round-off error.
> >>>> +                   * Make sure we don't miss the last page on
> >>>> +                   * the offlined memory cgroups because of a
> >>>> +                   * round-off error.
> >>>>                             */
> >>>> -                  scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>> +                  scan = mem_cgroup_online(memcg) ?
> >>>> +                         div64_u64(scan * fraction[file], denominator) :
> >>>> +                         DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>>                                                      denominator);
> >>>
> >>> It looks a bit strange to round up for offline and basically down for
> >>> everything else. So maybe it's better to return to something like
> >>> the very first version of the patch:
> >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kernel_msg2883146.html&d=DwIC-g&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=urGWFxpEgETD4pryLqIYaKdVUk1Munj_zLpJthvrreM&s=k2RDZGNcvb_Sia2tZwcMPZ79Mad5dw1oT8JdIy0rkGY&e=  ?
> >>> For memcg reclaim reasons we do care only about an edge case with few pages.
> >>>
> >>> But overall it's not obvious to me, why rounding up is worse than rounding down.
> >>> Maybe we should average down but accumulate the reminder?
> >>> Creating an implicit bias for small memory cgroups sounds groundless.
> >>>
> >>
> >> I don't think v1 path works for me either. The logic in v1 isn't too much
> >> different from commit 68600f623d69. v1 has selective roundup, but current
> >> code is having a forced roundup. With 68600f623d69 reverted and your v1
> >> patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used.
> >> It looks more worse than 68600f623d69.
> >>
> >> Yeah, it's not reasonable to have a bias on all memory cgroups regardless
> >> their states. I do think it's still right to give bias to offlined memory
> >> cgroups.
> >
> > I don't think so, it really depends on the workload. Imagine systemd restarting
> > a service due to some update or with other arguments. Almost entire pagecache
> > is relevant and can be reused by a new cgroup.
> >
>
> Indeed, it depends on the workload. This patch is to revert 68600f623d69 for online
> memory cgroups, but keep the logic for offlined memory cgroup to avoid breaking your
> case.
>
> There is something which might be unrelated to discuss here: the pagecache could be backed
> by a low-speed (HDD) or high-speed (SSD) media. So the cost to fetch them from disk to memory
> isn't equal, meaning we need some kind of bias during reclaiming. It seems something missed
> from current implementation.

Yes, the refault cost was not taken into account. I recalled Johannes
posted a patch series to do swap with refault cost weighted in a
couple of years ago, please see: https://lwn.net/Articles/690079/.

>
> >> So the point is we need take care of the memory cgroup's state
> >> and apply the bias to offlined ones only. The offlined memory cgroup is
> >> going to die and has been dead. It's unlikely for its memory to be used
> >> by someone, but still possible. So it's reasonable to hardly squeeze the
> >> used memory of offlined memory cgroup if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin <guro@fb.com>
> >
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
> >
>
> Yeah, we need something smart in long run. Lets see if I can sort/test
> it out and then come back to you.
>
> Thanks,
> Gavin
>
>