* [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
@ 2020-02-10 12:14 Gavin Shan
  2020-02-10 16:17 ` Roman Gushchin
  0 siblings, 1 reply; 8+ messages in thread
From: Gavin Shan @ 2020-02-10 12:14 UTC (permalink / raw)
  To: linux-mm; +Cc: drjones, david, bhe, hannes, guro

commit 68600f623d69 ("mm: don't miss the last page because of round-off
error") makes the scan size round up to @denominator regardless of the
memory cgroup's state, online or offline. This affects the overall
reclaiming behavior: The corresponding LRU list is eligible for reclaiming
only when its size logically right shifted by @sc->priority is bigger than
zero in the former formula (non-roundup one). For example, the inactive
anonymous LRU list should have at least 0x4000 pages to be eligible for
reclaiming when we have 60/12 for swappiness/priority and without taking
scan/rotation ratio into account. After the roundup is applied, the
inactive anonymous LRU list becomes eligible for reclaiming when its
size is bigger than or equal to 0x1000 in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1

aarch64 has a 512MB huge page size when the base page size is 64KB. A
memory cgroup that holds a huge page is therefore always eligible for
reclaiming in that case. The reclaiming is likely to stop after the huge
page is reclaimed, meaning the subsequent @sc->priority levels and memory
cgroups will be skipped. It changes the overall reclaiming behavior. This
fixes the issue by applying the roundup to offlined memory cgroups only,
to give more preference to reclaiming memory from offlined memory cgroups.
That sounds reasonable, as that memory is likely to be useless.
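
As an illustration, here is a minimal userspace model of the two divisions
(not the kernel code; the helper names and the hard-coded swappiness/priority
values are only for this example):

    #include <stdint.h>
    #include <stdio.h>

    /* plain division vs. DIV64_U64_ROUND_UP-style ((x + d - 1) / d) */
    static uint64_t div_down(uint64_t x, uint64_t d) { return x / d; }
    static uint64_t div_up(uint64_t x, uint64_t d) { return (x + d - 1) / d; }

    int main(void)
    {
            uint64_t lru_pages = 0x1000;         /* one 512MB huge page in 64KB pages */
            uint64_t scan = lru_pages >> 12;     /* sc->priority == 12  ->  1 */
            uint64_t fraction = 60;              /* swappiness-derived anon fraction */
            uint64_t denominator = 60 + 140 + 1; /* 201, as in the formulas above */

            printf("round down: %llu\n",
                   (unsigned long long)div_down(scan * fraction, denominator)); /* 0 */
            printf("round up:   %llu\n",
                   (unsigned long long)div_up(scan * fraction, denominator));   /* 1 */
            return 0;
    }

With the roundup, a memory cgroup holding a single 512MB huge page always gets
a non-zero scan target, even at the default priority.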

The issue was found by starting up 8 VMs on an Ampere Mustang machine,
which has 8 CPUs and 16GB of memory. Each VM is given 2 vCPUs and 2GB of
memory. 784MB of swap space is consumed after these 8 VMs are completely
up. Note that KSM is disabled while THP is enabled in the testing. With
this patch applied, the consumed swap space decreased to 60MB.

         total        used        free      shared  buff/cache   available
Mem:     16196       10065        2049          16        4081        3749
Swap:     8175         784        7391
         total        used        free      shared  buff/cache   available
Mem:     16196       11324        3656          24        1215        2936
Swap:     8175          60        8115

Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
Cc: <stable@vger.kernel.org> # v4.20+
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 mm/vmscan.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..876370565455 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			/*
 			 * Scan types proportional to swappiness and
 			 * their relative recent reclaim efficiency.
-			 * Make sure we don't miss the last page
-			 * because of a round-off error.
+			 * Make sure we don't miss the last page on
+			 * the offlined memory cgroups because of a
+			 * round-off error.
 			 */
-			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
+			scan = mem_cgroup_online(memcg) ?
+			       div64_u64(scan * fraction[file], denominator) :
+			       DIV64_U64_ROUND_UP(scan * fraction[file],
 						  denominator);
 			break;
 		case SCAN_FILE:
-- 
2.23.0




* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-10 12:14 [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup Gavin Shan
@ 2020-02-10 16:17 ` Roman Gushchin
  2020-02-10 23:55   ` Gavin Shan
  0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-02-10 16:17 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-mm, drjones, david, bhe, hannes

Hello, Gavin!

On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> error") makes the scan size round up to @denominator regardless of the
> memory cgroup's state, online or offline. This affects the overall
> reclaiming behavior: The corresponding LRU list is eligible for reclaiming
> only when its size logically right shifted by @sc->priority is bigger than
> zero in the former formula (non-roundup one).

Not sure I fully understand, but wasn't it so before 68600f623d69 too?

> For example, the inactive
> anonymous LRU list should have at least 0x4000 pages to be eligible for
> reclaiming when we have 60/12 for swappiness/priority and without taking
> scan/rotation ratio into account. After the roundup is applied, the
> inactive anonymous LRU list becomes eligible for reclaiming when its
> size is bigger than or equal to 0x1000 in the same condition.
> 
>     (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
>     ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> 
> aarch64 has 512MB huge page size when the base page size is 64KB. The
> memory cgroup that has a huge page is always eligible for reclaiming in
> that case. The reclaiming is likely to stop after the huge page is
> reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
> skipped. It changes the overall reclaiming behavior. This fixes the issue
> by applying the roundup to offlined memory cgroups only, to give more
> preference to reclaim memory from offlined memory cgroup. It sounds
> reasonable as those memory is likely to be useless.

So is the problem that relatively small memory cgroups are now getting reclaimed
at the default priority, whereas before they were skipped?

> 
> The issue was found by starting up 8 VMs on a Ampere Mustang machine,
> which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
> memory. 784MB swap space is consumed after these 8 VMs are completely up.
> Note that KSM is disable while THP is enabled in the testing. With this
> applied, the consumed swap space decreased to 60MB.
> 
>          total        used        free      shared  buff/cache   available
> Mem:     16196       10065        2049          16        4081        3749
> Swap:     8175         784        7391
>          total        used        free      shared  buff/cache   available
> Mem:     16196       11324        3656          24        1215        2936
> Swap:     8175          60        8115

Does it lead to any performance regressions? Or is it only about increased
swap usage?

> 
> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> Cc: <stable@vger.kernel.org> # v4.20+
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  mm/vmscan.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05eb9efec07..876370565455 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			/*
>  			 * Scan types proportional to swappiness and
>  			 * their relative recent reclaim efficiency.
> -			 * Make sure we don't miss the last page
> -			 * because of a round-off error.
> +			 * Make sure we don't miss the last page on
> +			 * the offlined memory cgroups because of a
> +			 * round-off error.
>  			 */
> -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> +			scan = mem_cgroup_online(memcg) ?
> +			       div64_u64(scan * fraction[file], denominator) :
> +			       DIV64_U64_ROUND_UP(scan * fraction[file],
>  						  denominator);

It looks a bit strange to round up for offline and basically down for
everything else. So maybe it's better to return to something like
the very first version of the patch:
https://www.spinics.net/lists/kernel/msg2883146.html ?
For memcg reclaim reasons we do care only about an edge case with few pages.

But overall it's not obvious to me why rounding up is worse than rounding down.
Maybe we should average down but accumulate the remainder?
Creating an implicit bias for small memory cgroups sounds groundless.

Thank you!

Roman



* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-10 16:17 ` Roman Gushchin
@ 2020-02-10 23:55   ` Gavin Shan
  2020-02-11  1:31     ` Roman Gushchin
  0 siblings, 1 reply; 8+ messages in thread
From: Gavin Shan @ 2020-02-10 23:55 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, drjones, david, bhe, hannes

Hi Roman,

On 2/11/20 3:17 AM, Roman Gushchin wrote:
> Hello, Gavin!
> 
> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
>> error") makes the scan size round up to @denominator regardless of the
>> memory cgroup's state, online or offline. This affects the overall
>> reclaiming behavior: The corresponding LRU list is eligible for reclaiming
>> only when its size logically right shifted by @sc->priority is bigger than
>> zero in the former formula (non-roundup one).
> 
> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> 

Yes, that's correct. "(non-roundup one)" is a typo and should have been dropped;
it will be corrected in v2 if needed.

>> For example, the inactive
>> anonymous LRU list should have at least 0x4000 pages to be eligible for
>> reclaiming when we have 60/12 for swappiness/priority and without taking
>> scan/rotation ratio into account. After the roundup is applied, the
>> inactive anonymous LRU list becomes eligible for reclaiming when its
>> size is bigger than or equal to 0x1000 in the same condition.
>>
>>      (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
>>      ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
>>
>> aarch64 has 512MB huge page size when the base page size is 64KB. The
>> memory cgroup that has a huge page is always eligible for reclaiming in
>> that case. The reclaiming is likely to stop after the huge page is
>> reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
>> skipped. It changes the overall reclaiming behavior. This fixes the issue
>> by applying the roundup to offlined memory cgroups only, to give more
>> preference to reclaim memory from offlined memory cgroup. It sounds
>> reasonable as those memory is likely to be useless.
> 
> So is the problem that relatively small memory cgroups are getting reclaimed
> on default prio, however before they were skipped?
> 

Yes, you're correct. There are two dimensions for global reclaim: priority
(sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
over these two dimensions until enough pages have been reclaimed. If the roundup
applied to the current memory cgroup happens to reclaim enough memory, the
subsequent priorities and memory cgroups will be skipped.
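
Roughly, the global reclaim iteration looks like this (heavily simplified
pseudo-code; shrink_memcg_lrus() is just a stand-in for the real per-memcg
LRU shrinking that is sized via get_scan_count()):

    for (priority = DEF_PRIORITY; priority >= 0; priority--) {
            struct mem_cgroup *memcg = mem_cgroup_iter(root, NULL, NULL);

            do {
                    shrink_memcg_lrus(pgdat, memcg, sc);    /* stand-in */
                    if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
                            mem_cgroup_iter_break(root, memcg);
                            break;          /* remaining memcgs skipped */
                    }
            } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));

            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                    break;                  /* lower priorities skipped */
    }

So a roundup that lets the current memory cgroup reclaim just enough pages can
cut off the remaining memory cgroups and priorities entirely.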

>>
>> The issue was found by starting up 8 VMs on a Ampere Mustang machine,
>> which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
>> memory. 784MB swap space is consumed after these 8 VMs are completely up.
>> Note that KSM is disable while THP is enabled in the testing. With this
>> applied, the consumed swap space decreased to 60MB.
>>
>>           total        used        free      shared  buff/cache   available
>> Mem:     16196       10065        2049          16        4081        3749
>> Swap:     8175         784        7391
>>           total        used        free      shared  buff/cache   available
>> Mem:     16196       11324        3656          24        1215        2936
>> Swap:     8175          60        8115
> 
> Does it lead to any performance regressions? Or it's only about increased
> swap usage?
> 

Apart from the swap usage, there is also a performance regression in my case.
With your patch (68600f623d69) included, it took 264 seconds to bring up the
8 VMs. With my patch applied on top of yours, the same thing took 236 seconds,
so roughly a 10% regression. That's the reason why I added a stable tag.

>>
>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
>> Cc: <stable@vger.kernel.org> # v4.20+
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   mm/vmscan.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c05eb9efec07..876370565455 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>>   			/*
>>   			 * Scan types proportional to swappiness and
>>   			 * their relative recent reclaim efficiency.
>> -			 * Make sure we don't miss the last page
>> -			 * because of a round-off error.
>> +			 * Make sure we don't miss the last page on
>> +			 * the offlined memory cgroups because of a
>> +			 * round-off error.
>>   			 */
>> -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
>> +			scan = mem_cgroup_online(memcg) ?
>> +			       div64_u64(scan * fraction[file], denominator) :
>> +			       DIV64_U64_ROUND_UP(scan * fraction[file],
>>   						  denominator);
> 
> It looks a bit strange to round up for offline and basically down for
> everything else. So maybe it's better to return to something like
> the very first version of the patch:
> https://www.spinics.net/lists/kernel/msg2883146.html ?
> For memcg reclaim reasons we do care only about an edge case with few pages.
> 
> But overall it's not obvious to me, why rounding up is worse than rounding down.
> Maybe we should average down but accumulate the reminder?
> Creating an implicit bias for small memory cgroups sounds groundless.
> 

I don't think the v1 patch works for me either. The logic in v1 isn't too
different from commit 68600f623d69: v1 has a selective roundup, while the
current code has a forced roundup. With 68600f623d69 reverted and your v1
patch applied, it took 273 seconds to bring up the 8 VMs and 1752MB of swap
was used. It looks even worse than 68600f623d69.

Yeah, it's not reasonable to have a bias on all memory cgroups regardless
of their state. I do think it's still right to give a bias to offlined memory
cgroups. So the point is that we need to take the memory cgroup's state into
account and apply the bias to offlined ones only. An offlined memory cgroup
is dying or already dead. It's unlikely for its memory to be used by anyone,
though still possible. So it's reasonable to squeeze the memory used by
offlined memory cgroups harder if possible.

Thanks,
Gavin




* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-10 23:55   ` Gavin Shan
@ 2020-02-11  1:31     ` Roman Gushchin
  2020-02-11  2:17       ` Gavin Shan
  0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-02-11  1:31 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-mm, drjones, david, bhe, hannes

On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> Hi Roman,
> 
> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> > Hello, Gavin!
> > 
> > On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> > > commit 68600f623d69 ("mm: don't miss the last page because of round-off
> > > error") makes the scan size round up to @denominator regardless of the
> > > memory cgroup's state, online or offline. This affects the overall
> > > reclaiming behavior: The corresponding LRU list is eligible for reclaiming
> > > only when its size logically right shifted by @sc->priority is bigger than
> > > zero in the former formula (non-roundup one).
> > 
> > Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> > 
> 
> It's correct that "(non-roundup one)" is typo and should have been dropped.
> Will be corrected in v2 if needed.

Thanks!

> 
> > > For example, the inactive
> > > anonymous LRU list should have at least 0x4000 pages to be eligible for
> > > reclaiming when we have 60/12 for swappiness/priority and without taking
> > > scan/rotation ratio into account. After the roundup is applied, the
> > > inactive anonymous LRU list becomes eligible for reclaiming when its
> > > size is bigger than or equal to 0x1000 in the same condition.
> > > 
> > >      (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> > >      ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> > > 
> > > aarch64 has 512MB huge page size when the base page size is 64KB. The
> > > memory cgroup that has a huge page is always eligible for reclaiming in
> > > that case. The reclaiming is likely to stop after the huge page is
> > > reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
> > > skipped. It changes the overall reclaiming behavior. This fixes the issue
> > > by applying the roundup to offlined memory cgroups only, to give more
> > > preference to reclaim memory from offlined memory cgroup. It sounds
> > > reasonable as those memory is likely to be useless.
> > 
> > So is the problem that relatively small memory cgroups are getting reclaimed
> > on default prio, however before they were skipped?
> > 
> 
> Yes, you're correct. There are two dimensions for global reclaim: priority
> (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
> from these two dimensions until the reclaimed pages are enough. If the roundup
> is applied to current memory cgroup and occasionally helps to reclaim enough
> memory, the subsequent priority and memory cgroup will be skipped.
> 
> > > 
> > > The issue was found by starting up 8 VMs on a Ampere Mustang machine,
> > > which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
> > > memory. 784MB swap space is consumed after these 8 VMs are completely up.
> > > Note that KSM is disable while THP is enabled in the testing. With this
> > > applied, the consumed swap space decreased to 60MB.
> > > 
> > >           total        used        free      shared  buff/cache   available
> > > Mem:     16196       10065        2049          16        4081        3749
> > > Swap:     8175         784        7391
> > >           total        used        free      shared  buff/cache   available
> > > Mem:     16196       11324        3656          24        1215        2936
> > > Swap:     8175          60        8115
> > 
> > Does it lead to any performance regressions? Or it's only about increased
> > swap usage?
> > 
> 
> Apart from swap usage, it also had performance downgrade for my case. With
> your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs.
> However, 236 seconds are used to do same thing with my patch applied on top
> of yours. There is 10% performance downgrade. It's the reason why I had a
> stable tag.

I see...

> 
> > > 
> > > Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> > > Cc: <stable@vger.kernel.org> # v4.20+
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   mm/vmscan.c | 9 ++++++---
> > >   1 file changed, 6 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index c05eb9efec07..876370565455 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >   			/*
> > >   			 * Scan types proportional to swappiness and
> > >   			 * their relative recent reclaim efficiency.
> > > -			 * Make sure we don't miss the last page
> > > -			 * because of a round-off error.
> > > +			 * Make sure we don't miss the last page on
> > > +			 * the offlined memory cgroups because of a
> > > +			 * round-off error.
> > >   			 */
> > > -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> > > +			scan = mem_cgroup_online(memcg) ?
> > > +			       div64_u64(scan * fraction[file], denominator) :
> > > +			       DIV64_U64_ROUND_UP(scan * fraction[file],
> > >   						  denominator);
> > 
> > It looks a bit strange to round up for offline and basically down for
> > everything else. So maybe it's better to return to something like
> > the very first version of the patch:
> > https://www.spinics.net/lists/kernel/msg2883146.html ?
> > For memcg reclaim reasons we do care only about an edge case with few pages.
> > 
> > But overall it's not obvious to me, why rounding up is worse than rounding down.
> > Maybe we should average down but accumulate the reminder?
> > Creating an implicit bias for small memory cgroups sounds groundless.
> > 
> 
> I don't think v1 path works for me either. The logic in v1 isn't too much
> different from commit 68600f623d69. v1 has selective roundup, but current
> code is having a forced roundup. With 68600f623d69 reverted and your v1
> patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used.
> It looks more worse than 68600f623d69.
> 
> Yeah, it's not reasonable to have a bias on all memory cgroups regardless
> their states. I do think it's still right to give bias to offlined memory
> cgroups.

I don't think so; it really depends on the workload. Imagine systemd restarting
a service due to some update or with other arguments. Almost the entire
pagecache is still relevant and can be reused by the new cgroup.

> So the point is we need take care of the memory cgroup's state
> and apply the bias to offlined ones only. The offlined memory cgroup is
> going to die and has been dead. It's unlikely for its memory to be used
> by someone, but still possible. So it's reasonable to hardly squeeze the
> used memory of offlined memory cgroup if possible.

Anyway, I think your version is good to mitigate the regression.
So, please feel free to add
Acked-by: Roman Gushchin <guro@fb.com>

But I think we need something more clever long-term: e.g. accumulate
the leftover from the division and add it to the next calculation.

If you can test such an approach on your workload, that would be nice.
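
One possible shape of that idea, as a userspace sketch (the struct and helper
names here are made up for illustration; nothing like this exists in
mm/vmscan.c today):

    #include <stdint.h>

    /* carried leftover, e.g. one instance per lruvec/LRU type */
    struct scan_carry {
            uint64_t rem;
    };

    /*
     * Average down, but feed the leftover numerator back into the next
     * calculation, so small LRUs are neither starved nor biased upward.
     */
    static uint64_t scan_with_carry(struct scan_carry *c, uint64_t scan,
                                    uint64_t fraction, uint64_t denominator)
    {
            uint64_t num = scan * fraction + c->rem;

            c->rem = num % denominator;
            return num / denominator;
    }

Over repeated calls the carried remainder makes the long-run average match
exact division, without a per-call bias in either direction.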


Thanks!



* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-11  1:31     ` Roman Gushchin
@ 2020-02-11  2:17       ` Gavin Shan
  2020-02-11  3:03         ` Yang Shi
  2020-02-11 14:42         ` Roman Gushchin
  0 siblings, 2 replies; 8+ messages in thread
From: Gavin Shan @ 2020-02-11  2:17 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, drjones, david, bhe, hannes

Hi Roman,

On 2/11/20 12:31 PM, Roman Gushchin wrote:
> On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
>> On 2/11/20 3:17 AM, Roman Gushchin wrote:
>>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
>>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
>>>> error") makes the scan size round up to @denominator regardless of the
>>>> memory cgroup's state, online or offline. This affects the overall
>>>> reclaiming behavior: The corresponding LRU list is eligible for reclaiming
>>>> only when its size logically right shifted by @sc->priority is bigger than
>>>> zero in the former formula (non-roundup one).
>>>
>>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
>>>
>>
>> It's correct that "(non-roundup one)" is typo and should have been dropped.
>> Will be corrected in v2 if needed.
> 
> Thanks!
> 
>>
>>>> For example, the inactive
>>>> anonymous LRU list should have at least 0x4000 pages to be eligible for
>>>> reclaiming when we have 60/12 for swappiness/priority and without taking
>>>> scan/rotation ratio into account. After the roundup is applied, the
>>>> inactive anonymous LRU list becomes eligible for reclaiming when its
>>>> size is bigger than or equal to 0x1000 in the same condition.
>>>>
>>>>       (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
>>>>       ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
>>>>
>>>> aarch64 has 512MB huge page size when the base page size is 64KB. The
>>>> memory cgroup that has a huge page is always eligible for reclaiming in
>>>> that case. The reclaiming is likely to stop after the huge page is
>>>> reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
>>>> skipped. It changes the overall reclaiming behavior. This fixes the issue
>>>> by applying the roundup to offlined memory cgroups only, to give more
>>>> preference to reclaim memory from offlined memory cgroup. It sounds
>>>> reasonable as those memory is likely to be useless.
>>>
>>> So is the problem that relatively small memory cgroups are getting reclaimed
>>> on default prio, however before they were skipped?
>>>
>>
>> Yes, you're correct. There are two dimensions for global reclaim: priority
>> (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
>> from these two dimensions until the reclaimed pages are enough. If the roundup
>> is applied to current memory cgroup and occasionally helps to reclaim enough
>> memory, the subsequent priority and memory cgroup will be skipped.
>>
>>>>
>>>> The issue was found by starting up 8 VMs on a Ampere Mustang machine,
>>>> which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
>>>> memory. 784MB swap space is consumed after these 8 VMs are completely up.
>>>> Note that KSM is disable while THP is enabled in the testing. With this
>>>> applied, the consumed swap space decreased to 60MB.
>>>>
>>>>            total        used        free      shared  buff/cache   available
>>>> Mem:     16196       10065        2049          16        4081        3749
>>>> Swap:     8175         784        7391
>>>>            total        used        free      shared  buff/cache   available
>>>> Mem:     16196       11324        3656          24        1215        2936
>>>> Swap:     8175          60        8115
>>>
>>> Does it lead to any performance regressions? Or it's only about increased
>>> swap usage?
>>>
>>
>> Apart from swap usage, it also had performance downgrade for my case. With
>> your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs.
>> However, 236 seconds are used to do same thing with my patch applied on top
>> of yours. There is 10% performance downgrade. It's the reason why I had a
>> stable tag.
> 
> I see...
> 

I will put these data into the commit log of v2, which will be posted shortly.

>>
>>>>
>>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
>>>> Cc: <stable@vger.kernel.org> # v4.20+
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    mm/vmscan.c | 9 ++++++---
>>>>    1 file changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index c05eb9efec07..876370565455 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>>>>    			/*
>>>>    			 * Scan types proportional to swappiness and
>>>>    			 * their relative recent reclaim efficiency.
>>>> -			 * Make sure we don't miss the last page
>>>> -			 * because of a round-off error.
>>>> +			 * Make sure we don't miss the last page on
>>>> +			 * the offlined memory cgroups because of a
>>>> +			 * round-off error.
>>>>    			 */
>>>> -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
>>>> +			scan = mem_cgroup_online(memcg) ?
>>>> +			       div64_u64(scan * fraction[file], denominator) :
>>>> +			       DIV64_U64_ROUND_UP(scan * fraction[file],
>>>>    						  denominator);
>>>
>>> It looks a bit strange to round up for offline and basically down for
>>> everything else. So maybe it's better to return to something like
>>> the very first version of the patch:
>>> https://www.spinics.net/lists/kernel/msg2883146.html ?
>>> For memcg reclaim reasons we do care only about an edge case with few pages.
>>>
>>> But overall it's not obvious to me, why rounding up is worse than rounding down.
>>> Maybe we should average down but accumulate the reminder?
>>> Creating an implicit bias for small memory cgroups sounds groundless.
>>>
>>
>> I don't think v1 path works for me either. The logic in v1 isn't too much
>> different from commit 68600f623d69. v1 has selective roundup, but current
>> code is having a forced roundup. With 68600f623d69 reverted and your v1
>> patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used.
>> It looks more worse than 68600f623d69.
>>
>> Yeah, it's not reasonable to have a bias on all memory cgroups regardless
>> their states. I do think it's still right to give bias to offlined memory
>> cgroups.
> 
> I don't think so, it really depends on the workload. Imagine systemd restarting
> a service due to some update or with other arguments. Almost entire pagecache
> is relevant and can be reused by a new cgroup.
> 

Indeed, it depends on the workload. This patch reverts 68600f623d69 for online
memory cgroups, but keeps the logic for offlined memory cgroups to avoid breaking
your case.

There is something, possibly unrelated, worth discussing here: the pagecache could
be backed by low-speed (HDD) or high-speed (SSD) media, so the cost of fetching
those pages back from disk isn't equal, meaning we may need some kind of bias
during reclaiming. That seems to be missing from the current implementation.

>> So the point is we need take care of the memory cgroup's state
>> and apply the bias to offlined ones only. The offlined memory cgroup is
>> going to die and has been dead. It's unlikely for its memory to be used
>> by someone, but still possible. So it's reasonable to hardly squeeze the
>> used memory of offlined memory cgroup if possible.
> 
> Anyway, I think your version is good to mitigate the regression.
> So, please feel free to add
> Acked-by: Roman Gushchin <guro@fb.com>
> 

Thanks, Roman! It will be included in v2.

> But I think we need something more clever long-term: e.g. accumulate
> the leftover from the division and add it to the next calculation.
> 
> If you can test such an approach on your workload, that would be nice.
> 

Yeah, we need something smarter in the long run. Let's see if I can sort it
out, test it, and then come back to you.

Thanks,
Gavin




* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-11  2:17       ` Gavin Shan
@ 2020-02-11  3:03         ` Yang Shi
  2020-02-12  2:38           ` Gavin Shan
  2020-02-11 14:42         ` Roman Gushchin
  1 sibling, 1 reply; 8+ messages in thread
From: Yang Shi @ 2020-02-11  3:03 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Roman Gushchin, Linux MM, drjones, david, bhe, Johannes Weiner

On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> >> On 2/11/20 3:17 AM, Roman Gushchin wrote:
> >>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> >>>> commit 68600f623d69 ("mm: don't miss the last page because of round-off
> >>>> error") makes the scan size round up to @denominator regardless of the
> >>>> memory cgroup's state, online or offline. This affects the overall
> >>>> reclaiming behavior: The corresponding LRU list is eligible for reclaiming
> >>>> only when its size logically right shifted by @sc->priority is bigger than
> >>>> zero in the former formula (non-roundup one).
> >>>
> >>> Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> >>>
> >>
> >> It's correct that "(non-roundup one)" is typo and should have been dropped.
> >> Will be corrected in v2 if needed.
> >
> > Thanks!
> >
> >>
> >>>> For example, the inactive
> >>>> anonymous LRU list should have at least 0x4000 pages to be eligible for
> >>>> reclaiming when we have 60/12 for swappiness/priority and without taking
> >>>> scan/rotation ratio into account. After the roundup is applied, the
> >>>> inactive anonymous LRU list becomes eligible for reclaiming when its
> >>>> size is bigger than or equal to 0x1000 in the same condition.
> >>>>
> >>>>       (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> >>>>       ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> >>>>
> >>>> aarch64 has 512MB huge page size when the base page size is 64KB. The
> >>>> memory cgroup that has a huge page is always eligible for reclaiming in
> >>>> that case. The reclaiming is likely to stop after the huge page is
> >>>> reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
> >>>> skipped. It changes the overall reclaiming behavior. This fixes the issue
> >>>> by applying the roundup to offlined memory cgroups only, to give more
> >>>> preference to reclaim memory from offlined memory cgroup. It sounds
> >>>> reasonable as those memory is likely to be useless.
> >>>
> >>> So is the problem that relatively small memory cgroups are getting reclaimed
> >>> on default prio, however before they were skipped?
> >>>
> >>
> >> Yes, you're correct. There are two dimensions for global reclaim: priority
> >> (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
> >> from these two dimensions until the reclaimed pages are enough. If the roundup
> >> is applied to current memory cgroup and occasionally helps to reclaim enough
> >> memory, the subsequent priority and memory cgroup will be skipped.
> >>
> >>>>
> >>>> The issue was found by starting up 8 VMs on a Ampere Mustang machine,
> >>>> which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
> >>>> memory. 784MB swap space is consumed after these 8 VMs are completely up.
> >>>> Note that KSM is disable while THP is enabled in the testing. With this
> >>>> applied, the consumed swap space decreased to 60MB.
> >>>>
> >>>>            total        used        free      shared  buff/cache   available
> >>>> Mem:     16196       10065        2049          16        4081        3749
> >>>> Swap:     8175         784        7391
> >>>>            total        used        free      shared  buff/cache   available
> >>>> Mem:     16196       11324        3656          24        1215        2936
> >>>> Swap:     8175          60        8115
> >>>
> >>> Does it lead to any performance regressions? Or it's only about increased
> >>> swap usage?
> >>>
> >>
> >> Apart from swap usage, it also had performance downgrade for my case. With
> >> your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs.
> >> However, 236 seconds are used to do same thing with my patch applied on top
> >> of yours. There is 10% performance downgrade. It's the reason why I had a
> >> stable tag.
> >
> > I see...
> >
>
> I will put these data into the commit log of v2, which will be posted shortly.
>
> >>
> >>>>
> >>>> Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> >>>> Cc: <stable@vger.kernel.org> # v4.20+
> >>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >>>> ---
> >>>>    mm/vmscan.c | 9 ++++++---
> >>>>    1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index c05eb9efec07..876370565455 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>                            /*
> >>>>                             * Scan types proportional to swappiness and
> >>>>                             * their relative recent reclaim efficiency.
> >>>> -                   * Make sure we don't miss the last page
> >>>> -                   * because of a round-off error.
> >>>> +                   * Make sure we don't miss the last page on
> >>>> +                   * the offlined memory cgroups because of a
> >>>> +                   * round-off error.
> >>>>                             */
> >>>> -                  scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>> +                  scan = mem_cgroup_online(memcg) ?
> >>>> +                         div64_u64(scan * fraction[file], denominator) :
> >>>> +                         DIV64_U64_ROUND_UP(scan * fraction[file],
> >>>>                                                      denominator);
> >>>
> >>> It looks a bit strange to round up for offline and basically down for
> >>> everything else. So maybe it's better to return to something like
> >>> the very first version of the patch:
> >>> https://www.spinics.net/lists/kernel/msg2883146.html ?
> >>> For memcg reclaim reasons we do care only about an edge case with few pages.
> >>>
> >>> But overall it's not obvious to me, why rounding up is worse than rounding down.
> >>> Maybe we should average down but accumulate the reminder?
> >>> Creating an implicit bias for small memory cgroups sounds groundless.
> >>>
> >>
> >> I don't think v1 path works for me either. The logic in v1 isn't too much
> >> different from commit 68600f623d69. v1 has selective roundup, but current
> >> code is having a forced roundup. With 68600f623d69 reverted and your v1
> >> patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used.
> >> It looks more worse than 68600f623d69.
> >>
> >> Yeah, it's not reasonable to have a bias on all memory cgroups regardless
> >> their states. I do think it's still right to give bias to offlined memory
> >> cgroups.
> >
> > I don't think so, it really depends on the workload. Imagine systemd restarting
> > a service due to some update or with other arguments. Almost entire pagecache
> > is relevant and can be reused by a new cgroup.
> >
>
> Indeed, it depends on the workload. This patch is to revert 68600f623d69 for online
> memory cgroups, but keep the logic for offlined memory cgroup to avoid breaking your
> case.
>
> There is something which might be unrelated to discuss here: the pagecache could be backed
> by a low-speed (HDD) or high-speed (SSD) media. So the cost to fetch them from disk to memory
> isn't equal, meaning we need some kind of bias during reclaiming. It seems something missed
> from current implementation.

Yes, the refault cost was not taken into account. I recall Johannes posted
a patch series a couple of years ago that weighted swap by the refault cost;
please see: https://lwn.net/Articles/690079/.

>
> >> So the point is we need take care of the memory cgroup's state
> >> and apply the bias to offlined ones only. The offlined memory cgroup is
> >> going to die and has been dead. It's unlikely for its memory to be used
> >> by someone, but still possible. So it's reasonable to hardly squeeze the
> >> used memory of offlined memory cgroup if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin <guro@fb.com>
> >
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
> >
>
> Yeah, we need something smart in long run. Lets see if I can sort/test
> it out and then come back to you.
>
> Thanks,
> Gavin
>
>



* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-11  2:17       ` Gavin Shan
  2020-02-11  3:03         ` Yang Shi
@ 2020-02-11 14:42         ` Roman Gushchin
  1 sibling, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-02-11 14:42 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-mm, drjones, david, bhe, hannes

On Tue, Feb 11, 2020 at 01:17:47PM +1100, Gavin Shan wrote:
> Hi Roman,
> 
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> > > On 2/11/20 3:17 AM, Roman Gushchin wrote:
> > > > On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> > > > > commit 68600f623d69 ("mm: don't miss the last page because of round-off
> > > > > error") makes the scan size round up to @denominator regardless of the
> > > > > memory cgroup's state, online or offline. This affects the overall
> > > > > reclaiming behavior: The corresponding LRU list is eligible for reclaiming
> > > > > only when its size logically right shifted by @sc->priority is bigger than
> > > > > zero in the former formula (non-roundup one).
> > > > 
> > > > Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> > > > 
> > > 
> > > It's correct that "(non-roundup one)" is typo and should have been dropped.
> > > Will be corrected in v2 if needed.
> > 
> > Thanks!
> > 
> > > 
> > > > > For example, the inactive
> > > > > anonymous LRU list should have at least 0x4000 pages to be eligible for
> > > > > reclaiming when we have 60/12 for swappiness/priority and without taking
> > > > > scan/rotation ratio into account. After the roundup is applied, the
> > > > > inactive anonymous LRU list becomes eligible for reclaiming when its
> > > > > size is bigger than or equal to 0x1000 in the same condition.
> > > > > 
> > > > >       (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> > > > >       ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
> > > > > 
> > > > > aarch64 has 512MB huge page size when the base page size is 64KB. The
> > > > > memory cgroup that has a huge page is always eligible for reclaiming in
> > > > > that case. The reclaiming is likely to stop after the huge page is
> > > > > reclaimed, meaing the subsequent @sc->priority and memory cgroups will be
> > > > > skipped. It changes the overall reclaiming behavior. This fixes the issue
> > > > > by applying the roundup to offlined memory cgroups only, to give more
> > > > > preference to reclaim memory from offlined memory cgroup. It sounds
> > > > > reasonable as those memory is likely to be useless.
> > > > 
> > > > So is the problem that relatively small memory cgroups are getting reclaimed
> > > > on default prio, however before they were skipped?
> > > > 
> > > 
> > > Yes, you're correct. There are two dimensions for global reclaim: priority
> > > (sc->priority) and memory cgroup. The scan/reclaim is carried out by iterating
> > > from these two dimensions until the reclaimed pages are enough. If the roundup
> > > is applied to current memory cgroup and occasionally helps to reclaim enough
> > > memory, the subsequent priority and memory cgroup will be skipped.
> > > 
> > > > > 
> > > > > The issue was found by starting up 8 VMs on a Ampere Mustang machine,
> > > > > which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB
> > > > > memory. 784MB swap space is consumed after these 8 VMs are completely up.
> > > > > Note that KSM is disable while THP is enabled in the testing. With this
> > > > > applied, the consumed swap space decreased to 60MB.
> > > > > 
> > > > >            total        used        free      shared  buff/cache   available
> > > > > Mem:     16196       10065        2049          16        4081        3749
> > > > > Swap:     8175         784        7391
> > > > >            total        used        free      shared  buff/cache   available
> > > > > Mem:     16196       11324        3656          24        1215        2936
> > > > > Swap:     8175          60        8115
> > > > 
> > > > Does it lead to any performance regressions? Or it's only about increased
> > > > swap usage?
> > > > 
> > > 
> > > Apart from swap usage, it also had performance downgrade for my case. With
> > > your patch (68600f623d69) included, it took 264 seconds to bring up 8 VMs.
> > > However, 236 seconds are used to do same thing with my patch applied on top
> > > of yours. There is 10% performance downgrade. It's the reason why I had a
> > > stable tag.
> > 
> > I see...
> > 
> 
> I will put these data into the commit log of v2, which will be posted shortly.
> 
> > > 
> > > > > 
> > > > > Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> > > > > Cc: <stable@vger.kernel.org> # v4.20+
> > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > ---
> > > > >    mm/vmscan.c | 9 ++++++---
> > > > >    1 file changed, 6 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index c05eb9efec07..876370565455 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > > > >    			/*
> > > > >    			 * Scan types proportional to swappiness and
> > > > >    			 * their relative recent reclaim efficiency.
> > > > > -			 * Make sure we don't miss the last page
> > > > > -			 * because of a round-off error.
> > > > > +			 * Make sure we don't miss the last page on
> > > > > +			 * the offlined memory cgroups because of a
> > > > > +			 * round-off error.
> > > > >    			 */
> > > > > -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> > > > > +			scan = mem_cgroup_online(memcg) ?
> > > > > +			       div64_u64(scan * fraction[file], denominator) :
> > > > > +			       DIV64_U64_ROUND_UP(scan * fraction[file],
> > > > >    						  denominator);
> > > > 
> > > > It looks a bit strange to round up for offline and basically down for
> > > > everything else. So maybe it's better to return to something like
> > > > the very first version of the patch:
> > > > https://www.spinics.net/lists/kernel/msg2883146.html ?
> > > > For memcg reclaim reasons we do care only about an edge case with few pages.
> > > > 
> > > > But overall it's not obvious to me, why rounding up is worse than rounding down.
> > > > Maybe we should average down but accumulate the reminder?
> > > > Creating an implicit bias for small memory cgroups sounds groundless.
> > > > 
> > > 
> > > I don't think v1 path works for me either. The logic in v1 isn't too much
> > > different from commit 68600f623d69. v1 has selective roundup, but current
> > > code is having a forced roundup. With 68600f623d69 reverted and your v1
> > > patch applied, it took 273 seconds to bring up 8 VMs and 1752MB swap is used.
> > > It looks more worse than 68600f623d69.
> > > 
> > > Yeah, it's not reasonable to have a bias on all memory cgroups regardless
> > > their states. I do think it's still right to give bias to offlined memory
> > > cgroups.
> > 
> > I don't think so, it really depends on the workload. Imagine systemd restarting
> > a service due to some update or with other arguments. Almost entire pagecache
> > is relevant and can be reused by a new cgroup.
> > 
> 
> Indeed, it depends on the workload. This patch is to revert 68600f623d69 for online
> memory cgroups, but keep the logic for offlined memory cgroup to avoid breaking your
> case.
> 
> There is something which might be unrelated to discuss here: the pagecache could be backed
> by a low-speed (HDD) or high-speed (SSD) media. So the cost to fetch them from disk to memory
> isn't equal, meaning we need some kind of bias during reclaiming. It seems something missed
> from current implementation.
> 
> > > So the point is we need take care of the memory cgroup's state
> > > and apply the bias to offlined ones only. The offlined memory cgroup is
> > > going to die and has been dead. It's unlikely for its memory to be used
> > > by someone, but still possible. So it's reasonable to hardly squeeze the
> > > used memory of offlined memory cgroup if possible.
> > 
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin <guro@fb.com>
> > 
> 
> Thanks, Roman! It will be included in v2.
> 
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> > 
> > If you can test such an approach on your workload, that would be nice.
> > 
> 
> Yeah, we need something smart in long run. Lets see if I can sort/test
> it out and then come back to you.
> 

Perfect, thank you!

Roman



* Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
  2020-02-11  3:03         ` Yang Shi
@ 2020-02-12  2:38           ` Gavin Shan
  0 siblings, 0 replies; 8+ messages in thread
From: Gavin Shan @ 2020-02-12  2:38 UTC (permalink / raw)
  To: Yang Shi; +Cc: Roman Gushchin, Linux MM, drjones, david, bhe, Johannes Weiner

On 2/11/20 2:03 PM, Yang Shi wrote:
> On Mon, Feb 10, 2020 at 6:18 PM Gavin Shan <gshan@redhat.com> wrote:
>> On 2/11/20 12:31 PM, Roman Gushchin wrote:
>>> On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
>>>> On 2/11/20 3:17 AM, Roman Gushchin wrote:
>>>>> On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:

.../...

>>
>> There is something which might be unrelated to discuss here: the pagecache could be backed
>> by a low-speed (HDD) or high-speed (SSD) media. So the cost to fetch them from disk to memory
>> isn't equal, meaning we need some kind of bias during reclaiming. It seems something missed
>> from current implementation.
> 
> Yes, the refault cost was not taken into account. I recalled Johannes
> posted a patch series to do swap with refault cost weighted in a
> couple of years ago, please see: https://lwn.net/Articles/690079/.
> 

Thanks for the link. Yes, Johannes's patchset is comprehensive, even though I
didn't look into the details. The concern I had is the fixed bias toward
pagecache backed by different storage devices, which have different I/O speeds.
It seems the patchset stalled after v1, since I couldn't find a v2 on Google.

Thanks,
Gavin



