linux-mm.kvack.org archive mirror
* Memory CG and 5.1 to 5.6 upgrade slows backup
@ 2020-04-09  9:25 Bruno Prémont
  2020-04-09  9:46 ` Michal Hocko
  2020-04-09 10:50 ` Chris Down
  0 siblings, 2 replies; 18+ messages in thread
From: Bruno Prémont @ 2020-04-09  9:25 UTC (permalink / raw)
  To: cgroups, linux-mm; +Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov

Hi,

After upgrading a production system from a 5.1 kernel to a 5.6 kernel, with
cgroups (v2) in use and the backup process running in a memory.high=2G
cgroup, the backup is heavily throttled (there are about 1.5T to be
backed up).

Most memory usage in that cgroup is for file cache.

Here are the memory details for the cgroup:
memory.current:2147225600
memory.events:low 0
memory.events:high 423774
memory.events:max 31131
memory.events:oom 0
memory.events:oom_kill 0
memory.events.local:low 0
memory.events.local:high 423774
memory.events.local:max 31131
memory.events.local:oom 0
memory.events.local:oom_kill 0
memory.high:2147483648
memory.low:33554432
memory.max:2415919104
memory.min:0
memory.oom.group:0
memory.pressure:some avg10=90.42 avg60=72.59 avg300=78.30 total=298252577711
memory.pressure:full avg10=90.32 avg60=72.53 avg300=78.24 total=295658626500
memory.stat:anon 10887168
memory.stat:file 2062102528
memory.stat:kernel_stack 73728
memory.stat:slab 76148736
memory.stat:sock 360448
memory.stat:shmem 0
memory.stat:file_mapped 12029952
memory.stat:file_dirty 946176
memory.stat:file_writeback 405504
memory.stat:anon_thp 0
memory.stat:inactive_anon 0
memory.stat:active_anon 10121216
memory.stat:inactive_file 1954959360
memory.stat:active_file 106418176
memory.stat:unevictable 0
memory.stat:slab_reclaimable 75247616
memory.stat:slab_unreclaimable 901120
memory.stat:pgfault 8651676
memory.stat:pgmajfault 2013
memory.stat:workingset_refault 8670651
memory.stat:workingset_activate 409200
memory.stat:workingset_nodereclaim 62040
memory.stat:pgrefill 1513537
memory.stat:pgscan 47519855
memory.stat:pgsteal 44933838
memory.stat:pgactivate 7986
memory.stat:pgdeactivate 1480623
memory.stat:pglazyfree 0
memory.stat:pglazyfreed 0
memory.stat:thp_fault_alloc 0
memory.stat:thp_collapse_alloc 0
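
(A dump in the name:value format above can be collected with something like
the following rough sketch; the cgroup path is an assumption and may differ
on other setups.)

  #!/usr/bin/env python3
  # Minimal sketch for producing a "memory.*:value" dump like the one above.
  import glob
  import os

  CGROUP = "/sys/fs/cgroup/system/backup"   # assumed path; adjust as needed

  for path in sorted(glob.glob(os.path.join(CGROUP, "memory.*"))):
      name = os.path.basename(path)
      try:
          with open(path) as f:
              for line in f:
                  print(f"{name}:{line.rstrip()}")
      except OSError:
          pass  # skip write-only or otherwise unreadable files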

The numbers that change most are pgscan/pgsteal.
The backup process regularly seems to be blocked for about 2s, but not
within a syscall according to strace.

Is there a way to tell the kernel that this cgroup should not be throttled
and that its inactive file cache should be given up (rather quickly)?

The aim here is to keep the backup from evicting the production tasks'
file cache, while not starving the backup itself.


If some useful info is missing, please let me know (ideally with a note on
how I can obtain it).


On a side note, I liked v1's soft/hard memory limits, where the memory
between the soft and the hard limit could be used if the system had enough
free memory. For v2 the difference between high and max seems to be of
almost no use.

A cgroup parameter for treating read-only file cache differently from
anonymous memory or otherwise dirty memory would be great too.


Thanks,
Bruno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09  9:25 Memory CG and 5.1 to 5.6 upgrade slows backup Bruno Prémont
@ 2020-04-09  9:46 ` Michal Hocko
  2020-04-09 10:17   ` Bruno Prémont
  2020-04-09 10:50 ` Chris Down
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2020-04-09  9:46 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

[Cc Chris]

On Thu 09-04-20 11:25:05, Bruno Prémont wrote:
> Hi,
> 
> Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> cgroups (v2) and having backup process in a memory.high=2G cgroup
> sees backup being highly throttled (there are about 1.5T to be
> backuped).

What does /proc/sys/vm/dirty_* say? Is it possible that the reclaim is
not making progress on too many dirty pages and that triggers the back
off mechanism that has been implemented recently in 5.4? (Have a look at
0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over
memory.high") and e26733e0d0ec ("mm, memcg: throttle allocators based on
ancestral memory.high").)

Keeping the rest of the email for reference.

> Most memory usage in that cgroup is for file cache.
> 
> Here are the memory details for the cgroup:
> memory.current:2147225600
> memory.events:low 0
> memory.events:high 423774
> memory.events:max 31131
> memory.events:oom 0
> memory.events:oom_kill 0
> memory.events.local:low 0
> memory.events.local:high 423774
> memory.events.local:max 31131
> memory.events.local:oom 0
> memory.events.local:oom_kill 0
> memory.high:2147483648
> memory.low:33554432
> memory.max:2415919104
> memory.min:0
> memory.oom.group:0
> memory.pressure:some avg10=90.42 avg60=72.59 avg300=78.30 total=298252577711
> memory.pressure:full avg10=90.32 avg60=72.53 avg300=78.24 total=295658626500
> memory.stat:anon 10887168
> memory.stat:file 2062102528
> memory.stat:kernel_stack 73728
> memory.stat:slab 76148736
> memory.stat:sock 360448
> memory.stat:shmem 0
> memory.stat:file_mapped 12029952
> memory.stat:file_dirty 946176
> memory.stat:file_writeback 405504
> memory.stat:anon_thp 0
> memory.stat:inactive_anon 0
> memory.stat:active_anon 10121216
> memory.stat:inactive_file 1954959360
> memory.stat:active_file 106418176
> memory.stat:unevictable 0
> memory.stat:slab_reclaimable 75247616
> memory.stat:slab_unreclaimable 901120
> memory.stat:pgfault 8651676
> memory.stat:pgmajfault 2013
> memory.stat:workingset_refault 8670651
> memory.stat:workingset_activate 409200
> memory.stat:workingset_nodereclaim 62040
> memory.stat:pgrefill 1513537
> memory.stat:pgscan 47519855
> memory.stat:pgsteal 44933838
> memory.stat:pgactivate 7986
> memory.stat:pgdeactivate 1480623
> memory.stat:pglazyfree 0
> memory.stat:pglazyfreed 0
> memory.stat:thp_fault_alloc 0
> memory.stat:thp_collapse_alloc 0
> 
> Numbers that change most are pgscan/pgsteal
> Regularly the backup process seems to be blocked for about 2s, but not
> within a syscall according to strace.
> 
> Is there a way to tell kernel that this cgroup should not be throttled
> and its inactive file cache given up (rather quickly).
> 
> The aim here is to avoid backup from killing production task file cache
> but not starving it.
> 
> 
> If there is some useful info missing, please tell (eventually adding how
> I can obtain it).
> 
> 
> On a side note, I liked v1's mode of soft/hard memory limit where the
> memory amount between soft and hard could be used if system has enough
> free memory. For v2 the difference between high and max seems almost of
> no use.
> 
> A cgroup parameter for impacting RO file cache differently than
> anonymous memory or otherwise dirty memory would be great too.
> 
> 
> Thanks,
> Bruno

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09  9:46 ` Michal Hocko
@ 2020-04-09 10:17   ` Bruno Prémont
  2020-04-09 10:34     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Bruno Prémont @ 2020-04-09 10:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

On Thu, 9 Apr 2020 11:46:15 Michal Hocko <mhocko@kernel.org> wrote:
> [Cc Chris]
> 
> On Thu 09-04-20 11:25:05, Bruno Prémont wrote:
> > Hi,
> > 
> > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > sees backup being highly throttled (there are about 1.5T to be
> > backuped).  
> 
> What does /proc/sys/vm/dirty_* say?

/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

Captured after having restarted the backup task.
After the backup process restart the cgroup again has more free memory and
things run at normal speed (until the cgroup's memory gets "full" again).
Current cgroup stats while things run smoothly:

anon 176128
file 633012224
kernel_stack 73728
slab 47173632
sock 364544
shmem 0
file_mapped 10678272
file_dirty 811008
file_writeback 405504
anon_thp 0
inactive_anon 0
active_anon 0
inactive_file 552849408
active_file 79360000
unevictable 0
slab_reclaimable 46411776
slab_unreclaimable 761856
pgfault 8656857
pgmajfault 2145
workingset_refault 8672334
workingset_activate 410586
workingset_nodereclaim 92895
pgrefill 1516540
pgscan 48241750
pgsteal 45655752
pgactivate 7986
pgdeactivate 1483626
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0


> Is it possible that the reclaim is not making progress on too many
> dirty pages and that triggers the back off mechanism that has been
> implemented recently in  5.4 (have a look at 0e4b01df8659 ("mm,
> memcg: throttle allocators when failing reclaim over memory.high")
> and e26733e0d0ec ("mm, memcg: throttle allocators based on
> ancestral memory.high").

Could be, though in that case it's throttling the wrong task/cgroup
as far as I can see (at least judging from the cgroup's memory stats), or
being blocked by state external to the cgroup.
I will have a look at those patches to get a better idea of what they
change.

System-wide memory is at least 10G/64G completely free (varies between
10G and 20G free - ~18G file cache, ~10G reclaimable slabs, ~5G
unreclaimable slabs and 7G otherwise in use).

> Keeping the rest of the email for reference.
> 
> > Most memory usage in that cgroup is for file cache.
> > 
> > Here are the memory details for the cgroup:
> > memory.current:2147225600
> > memory.events:low 0
> > memory.events:high 423774
> > memory.events:max 31131
> > memory.events:oom 0
> > memory.events:oom_kill 0
> > memory.events.local:low 0
> > memory.events.local:high 423774
> > memory.events.local:max 31131
> > memory.events.local:oom 0
> > memory.events.local:oom_kill 0
> > memory.high:2147483648
> > memory.low:33554432
> > memory.max:2415919104
> > memory.min:0
> > memory.oom.group:0
> > memory.pressure:some avg10=90.42 avg60=72.59 avg300=78.30 total=298252577711
> > memory.pressure:full avg10=90.32 avg60=72.53 avg300=78.24 total=295658626500
> > memory.stat:anon 10887168
> > memory.stat:file 2062102528
> > memory.stat:kernel_stack 73728
> > memory.stat:slab 76148736
> > memory.stat:sock 360448
> > memory.stat:shmem 0
> > memory.stat:file_mapped 12029952
> > memory.stat:file_dirty 946176
> > memory.stat:file_writeback 405504
> > memory.stat:anon_thp 0
> > memory.stat:inactive_anon 0
> > memory.stat:active_anon 10121216
> > memory.stat:inactive_file 1954959360
> > memory.stat:active_file 106418176
> > memory.stat:unevictable 0
> > memory.stat:slab_reclaimable 75247616
> > memory.stat:slab_unreclaimable 901120
> > memory.stat:pgfault 8651676
> > memory.stat:pgmajfault 2013
> > memory.stat:workingset_refault 8670651
> > memory.stat:workingset_activate 409200
> > memory.stat:workingset_nodereclaim 62040
> > memory.stat:pgrefill 1513537
> > memory.stat:pgscan 47519855
> > memory.stat:pgsteal 44933838
> > memory.stat:pgactivate 7986
> > memory.stat:pgdeactivate 1480623
> > memory.stat:pglazyfree 0
> > memory.stat:pglazyfreed 0
> > memory.stat:thp_fault_alloc 0
> > memory.stat:thp_collapse_alloc 0
> > 
> > Numbers that change most are pgscan/pgsteal
> > Regularly the backup process seems to be blocked for about 2s, but not
> > within a syscall according to strace.
> > 
> > Is there a way to tell kernel that this cgroup should not be throttled
> > and its inactive file cache given up (rather quickly).
> > 
> > The aim here is to avoid backup from killing production task file cache
> > but not starving it.
> > 
> > 
> > If there is some useful info missing, please tell (eventually adding how
> > I can obtain it).
> > 
> > 
> > On a side note, I liked v1's mode of soft/hard memory limit where the
> > memory amount between soft and hard could be used if system has enough
> > free memory. For v2 the difference between high and max seems almost of
> > no use.
> > 
> > A cgroup parameter for impacting RO file cache differently than
> > anonymous memory or otherwise dirty memory would be great too.
> > 
> > 
> > Thanks,
> > Bruno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 10:17   ` Bruno Prémont
@ 2020-04-09 10:34     ` Michal Hocko
  2020-04-09 15:09       ` Bruno Prémont
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2020-04-09 10:34 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

On Thu 09-04-20 12:17:33, Bruno Prémont wrote:
> On Thu, 9 Apr 2020 11:46:15 Michal Hocko <mhocko@kernel.org> wrote:
> > [Cc Chris]
> > 
> > On Thu 09-04-20 11:25:05, Bruno Prémont wrote:
> > > Hi,
> > > 
> > > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > > sees backup being highly throttled (there are about 1.5T to be
> > > backuped).  
> > 
> > What does /proc/sys/vm/dirty_* say?
> 
> /proc/sys/vm/dirty_background_bytes:0
> /proc/sys/vm/dirty_background_ratio:10
> /proc/sys/vm/dirty_bytes:0
> /proc/sys/vm/dirty_expire_centisecs:3000
> /proc/sys/vm/dirty_ratio:20
> /proc/sys/vm/dirty_writeback_centisecs:500

Sorry, but I forgot to ask for the total amount of memory. But it seems
this is 64GB, and a 10% dirty ratio might mean a lot of dirty memory.
Does the same happen if you reduce those knobs to something smaller than
2G? The _bytes alternatives should be useful for that purpose.
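
A minimal sketch of one way to try that, with example values of 256M/512M
(writing a *_bytes knob implicitly zeroes the corresponding *_ratio knob, so
only one of each pair is in effect):

  #!/usr/bin/env python3
  # Example values only; needs root.
  settings = {
      "/proc/sys/vm/dirty_background_bytes": 256 * 1024 * 1024,
      "/proc/sys/vm/dirty_bytes": 512 * 1024 * 1024,
  }
  for path, value in settings.items():
      with open(path, "w") as f:
          f.write(str(value))
      print(path, "=", value)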

[...]

> > Is it possible that the reclaim is not making progress on too many
> > dirty pages and that triggers the back off mechanism that has been
> > implemented recently in  5.4 (have a look at 0e4b01df8659 ("mm,
> > memcg: throttle allocators when failing reclaim over memory.high")
> > and e26733e0d0ec ("mm, memcg: throttle allocators based on
> > ancestral memory.high").
> 
> Could be though in that case it's throttling the wrong task/cgroup
> as far as I can see (at least from cgroup's memory stats) or being
> blocked by state external to the cgroup.
> Will have a look at those patches so get a better idea at what they
> change.

Could you check where the task of interest is throttled?
/proc/<pid>/stack should give you a clue.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09  9:25 Memory CG and 5.1 to 5.6 upgrade slows backup Bruno Prémont
  2020-04-09  9:46 ` Michal Hocko
@ 2020-04-09 10:50 ` Chris Down
  2020-04-09 11:58   ` Bruno Prémont
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Down @ 2020-04-09 10:50 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Vladimir Davydov

Hi Bruno,

Bruno Prémont writes:
>Upgrading from 5.1 kernel to 5.6 kernel on a production system using
>cgroups (v2) and having backup process in a memory.high=2G cgroup
>sees backup being highly throttled (there are about 1.5T to be
>backuped).

Before 5.4, memory usage with memory.high=N is essentially unbounded if the 
system is not able to reclaim pages for some reason. This is because all 
memory.high throttling before that point is just based on forcing direct 
reclaim for a cgroup, but there's no guarantee that we can actually reclaim 
pages, or that it will serve as a time penalty.

In 5.4, my patch 0e4b01df8659 ("mm, memcg: throttle allocators when failing 
reclaim over memory.high") changes kernel behaviour to actively penalise 
cgroups exceeding their memory.high by a large amount. That is, if reclaim 
fails to reclaim pages and bring the cgroup below the high threshold, we 
actively deschedule the process running for some number of jiffies that is 
exponential to the amount of overage incurred. This is so that cgroups using 
memory.high cannot simply have runaway memory usage without any consequences.
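
Very roughly, the idea looks like the sketch below. This is illustrative only,
not the actual kernel code in mem_cgroup_handle_over_high(), which differs in
details such as exact constants and fixed-point scaling; the only concrete
number, the 2 second cap, matches the limit mentioned further down.

  HZ = 100                       # assumed jiffies per second
  MAX_PENALTY_JIFFIES = 2 * HZ   # throttling per event is capped at ~2 seconds

  def high_overage_penalty(usage: int, high: int) -> int:
      """Penalty (in jiffies) growing rapidly with the relative overage
      of usage above memory.high, capped at 2 seconds (illustrative only)."""
      if high == 0 or usage <= high:
          return 0
      overage = (usage - high) / high          # relative overage
      penalty = int(overage * overage * HZ)    # grows with the square of overage
      return min(penalty, MAX_PENALTY_JIFFIES)

The point is only that the penalty ramps up quickly once the cgroup stays
above memory.high and reclaim cannot bring it back down.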

This is the patch that I'd particularly suspect is related to your problem. 
However:

>Most memory usage in that cgroup is for file cache.
>
>Here are the memory details for the cgroup:
>memory.current:2147225600
>[...]
>memory.events:high 423774
>memory.events:max 31131
>memory.high:2147483648
>memory.max:2415919104

Your high limit is being exceeded heavily and you are failing to reclaim. You 
have `max` events here, which means your application is at least at some point 
using over 268 *mega*bytes over its memory.high.

So yes, we will penalise this cgroup heavily since we cannot reclaim from it. 
The real question is why we can't reclaim from it :-)

>memory.low:33554432

You have a memory.low set, which will bias reclaim away from this cgroup based 
on overage. It's not very large, though, so it shouldn't change the semantics 
here, although it's worth noting since it also changed in another one of my 
patches, 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim"), 
which is also in 5.4.

In 5.1, as soon as you exceed memory.low, you immediately lose all protection.  
This is not ideal because it results in extremely binary, back-and-forth 
behaviour for cgroups using it (see the changelog for more information). This 
change means you will still receive some small amount of protection based on 
your overage, but it's fairly insignificant in this case (memory.current is 
about 64x larger than memory.low). What did you intend to do with this in 5.1? 
:-)
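
As a rough illustration of the proportional idea (not the exact kernel
formula): the fraction of the cgroup exposed to reclaim scales with how far
usage exceeds the protection, so with your numbers the remaining protection
is tiny.

  def approx_scan_fraction(usage: int, protection: int) -> float:
      """Fraction of the cgroup exposed to reclaim: roughly 1 - protection/usage
      once usage exceeds the protected amount (illustrative only)."""
      if usage <= protection:
          return 0.0
      return 1.0 - protection / usage

  # Numbers from the cgroup above: memory.current ~2147225600, memory.low 33554432
  print(approx_scan_fraction(2147225600, 33554432))   # ~0.98, i.e. barely protected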

>memory.stat:anon 10887168
>memory.stat:file 2062102528
>memory.stat:kernel_stack 73728
>memory.stat:slab 76148736
>memory.stat:sock 360448
>memory.stat:shmem 0
>memory.stat:file_mapped 12029952
>memory.stat:file_dirty 946176
>memory.stat:file_writeback 405504
>memory.stat:anon_thp 0
>memory.stat:inactive_anon 0
>memory.stat:active_anon 10121216
>memory.stat:inactive_file 1954959360
>memory.stat:active_file 106418176
>memory.stat:unevictable 0
>memory.stat:slab_reclaimable 75247616
>memory.stat:slab_unreclaimable 901120
>memory.stat:pgfault 8651676
>memory.stat:pgmajfault 2013
>memory.stat:workingset_refault 8670651
>memory.stat:workingset_activate 409200
>memory.stat:workingset_nodereclaim 62040
>memory.stat:pgrefill 1513537
>memory.stat:pgscan 47519855
>memory.stat:pgsteal 44933838
>memory.stat:pgactivate 7986
>memory.stat:pgdeactivate 1480623
>memory.stat:pglazyfree 0
>memory.stat:pglazyfreed 0
>memory.stat:thp_fault_alloc 0
>memory.stat:thp_collapse_alloc 0

It's hard to say exactly why we can't reclaim from these statistics; usually,
if anything, the kernel is *over*-eager to drop cache pages.

If the kernel thinks those file pages are too hot, though, it won't drop them. 
However, we only have 106M active file, compared to 2GB memory.current, so it 
doesn't look like this is the issue.

Can you please show io.pressure, io.stat, and cpu.pressure during these periods 
compared to baseline for this cgroup and globally (from /proc/pressure)? My 
suspicion is that we are not able to reclaim fast enough because memory 
management is getting stuck behind a slow disk.

Swap availability and usage information would also be helpful.
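
Something like this minimal sketch could grab all of those in one go; the
cgroup path is an assumption, adjust it to the cgroup hosting the backup.

  #!/usr/bin/env python3
  import time

  CGROUP = "/sys/fs/cgroup/system/backup"          # assumed path
  FILES = [
      f"{CGROUP}/io.pressure",
      f"{CGROUP}/io.stat",
      f"{CGROUP}/cpu.pressure",
      "/proc/pressure/io",
      "/proc/pressure/cpu",
      "/proc/pressure/memory",
  ]

  stamp = time.strftime("%Y-%m-%d %H:%M:%S")
  for path in FILES:
      try:
          with open(path) as f:
              for line in f:
                  print(f"{stamp} {path}: {line.rstrip()}")
      except OSError as err:
          print(f"{stamp} {path}: <unreadable: {err}>")

  # Swap availability/usage from /proc/meminfo
  with open("/proc/meminfo") as f:
      for line in f:
          if line.startswith(("SwapTotal", "SwapFree", "SwapCached")):
              print(f"{stamp} /proc/meminfo: {line.rstrip()}")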

>Regularly the backup process seems to be blocked for about 2s, but not
>within a syscall according to strace.

2 seconds is important, it's the maximum time we allow the allocator throttler 
to throttle for one allocation :-)

If you want to verify, you can look at /proc/pid/stack during these stalls -- 
they should be in mem_cgroup_handle_over_high, in an address related to 
allocator throttling.
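
A small sketch for catching that automatically (the pid is passed as the
first argument; reading /proc/<pid>/stack needs root):

  #!/usr/bin/env python3
  import sys
  import time

  pid = sys.argv[1]
  while True:
      try:
          with open(f"/proc/{pid}/stack") as f:
              stack = f.read()
      except OSError:
          break   # task exited
      if "mem_cgroup_handle_over_high" in stack:
          print(time.strftime("%H:%M:%S"), "throttled in mem_cgroup_handle_over_high:")
          print(stack)
      time.sleep(0.2)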

>Is there a way to tell kernel that this cgroup should not be throttled

Huh? That's what memory.high is for, so why are you using it if you don't want 
that?

>and its inactive file cache given up (rather quickly).

I suspect the kernel is reclaiming as far as it can, but is being stopped from 
doing so for some reason, which is why I'd like to see io.pressure and 
cpu.pressure.

>On a side note, I liked v1's mode of soft/hard memory limit where the
>memory amount between soft and hard could be used if system has enough
>free memory. For v2 the difference between high and max seems almost of
>no use.

For that use case, that's more or less what we've designed memory.low to do. 
The difference is that v1's soft limit almost never worked: the heuristics are 
extremely complicated, so complicated in fact that even we as memcg maintainers 
cannot reason about them. If we cannot reason about them, I'm quite sure it's 
not really doing what you expect :-)

In this case everything looks like it's working as intended, just this is all 
the result of memory.high becoming less broken in 5.4. From your description, 
I'm not sure that memory.high is what you want, either.

>A cgroup parameter for impacting RO file cache differently than
>anonymous memory or otherwise dirty memory would be great too.

We had vm.swappiness in v1 and it manifested extremely poorly. I won't go too 
much into the details of that here though, since we already discussed it fairly 
comprehensively here[0].

Please feel free to send over the io.pressure, io.stat, cpu.pressure, and swap 
metrics at baseline and during this when possible. Thanks!

0: https://lore.kernel.org/patchwork/patch/1172080/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 10:50 ` Chris Down
@ 2020-04-09 11:58   ` Bruno Prémont
  0 siblings, 0 replies; 18+ messages in thread
From: Bruno Prémont @ 2020-04-09 11:58 UTC (permalink / raw)
  To: Chris Down
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko, Vladimir Davydov

[-- Attachment #1: Type: text/plain, Size: 12574 bytes --]

Hi Chris,

Answering here (partially to cover Michal's questions further down the
thread as well).

On Thu, 9 Apr 2020 11:50:48 Chris Down <chris@chrisdown.name> wrote:
> Hi Bruno,
> 
> Bruno Prémont writes:
> >Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> >cgroups (v2) and having backup process in a memory.high=2G cgroup
> >sees backup being highly throttled (there are about 1.5T to be
> >backuped).  
> 
> Before 5.4, memory usage with memory.high=N is essentially unbounded if the 
> system is not able to reclaim pages for some reason. This is because all 
> memory.high throttling before that point is just based on forcing direct 
> reclaim for a cgroup, but there's no guarantee that we can actually reclaim 
> pages, or that it will serve as a time penalty.
> 
> In 5.4, my patch 0e4b01df8659 ("mm, memcg: throttle allocators when failing 
> reclaim over memory.high") changes kernel behaviour to actively penalise 
> cgroups exceeding their memory.high by a large amount. That is, if reclaim 
> fails to reclaim pages and bring the cgroup below the high threshold, we 
> actively deschedule the process running for some number of jiffies that is 
> exponential to the amount of overage incurred. This is so that cgroups using 
> memory.high cannot simply have runaway memory usage without any consequences.

Thanks for the background-information!

> This is the patch that I'd particularly suspect is related to your problem. 
> However:
> 
> >Most memory usage in that cgroup is for file cache.
> >
> >Here are the memory details for the cgroup:
> >memory.current:2147225600
> >[...]
> >memory.events:high 423774
> >memory.events:max 31131
> >memory.high:2147483648
> >memory.max:2415919104  
> 
> Your high limit is being exceeded heavily and you are failing to reclaim. You 
> have `max` events here, which mean your application is at least at some point 
> using over 268 *mega*bytes over its memory.high.
> 
> So yes, we will penalise this cgroup heavily since we cannot reclaim from it. 
> The real question is why we can't reclaim from it :-)

That's the great question!

> >memory.low:33554432  
> 
> You have a memory.low set, which will bias reclaim away from this cgroup based 
> on overage. It's not very large, though, so it shouldn't change the semantics 
> here, although it's worth noting since it also changed in another one of my 
> patches, 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim"), 
> which is also in 5.4.
> 
> In 5.1, as soon as you exceed memory.low, you immediately lose all protection.  
> This is not ideal because it results in extremely binary, back-and-forth 
> behaviour for cgroups using it (see the changelog for more information). This 
> change means you will still receive some small amount of protection based on 
> your overage, but it's fairly insignificant in this case (memory.current is 
> about 64x larger than memory.low). What did you intend to do with this in 5.1? 
> :-)

Well, my intent was that it should have access to this low amount to
perform its work (e.g. for anonymous memory and active file [code and
minimal payload]) while the rest of the system is using its allowed, but
not guaranteed, memory resources up to the global system limits.

So it feels like your patch enforces this promise better.

> >memory.stat:anon 10887168
> >memory.stat:file 2062102528
> >memory.stat:kernel_stack 73728
> >memory.stat:slab 76148736
> >memory.stat:sock 360448
> >memory.stat:shmem 0
> >memory.stat:file_mapped 12029952
> >memory.stat:file_dirty 946176
> >memory.stat:file_writeback 405504
> >memory.stat:anon_thp 0
> >memory.stat:inactive_anon 0
> >memory.stat:active_anon 10121216
> >memory.stat:inactive_file 1954959360
> >memory.stat:active_file 106418176
> >memory.stat:unevictable 0
> >memory.stat:slab_reclaimable 75247616
> >memory.stat:slab_unreclaimable 901120
> >memory.stat:pgfault 8651676
> >memory.stat:pgmajfault 2013
> >memory.stat:workingset_refault 8670651
> >memory.stat:workingset_activate 409200
> >memory.stat:workingset_nodereclaim 62040
> >memory.stat:pgrefill 1513537
> >memory.stat:pgscan 47519855
> >memory.stat:pgsteal 44933838
> >memory.stat:pgactivate 7986
> >memory.stat:pgdeactivate 1480623
> >memory.stat:pglazyfree 0
> >memory.stat:pglazyfreed 0
> >memory.stat:thp_fault_alloc 0
> >memory.stat:thp_collapse_alloc 0  
> 
> Hard to say exactly why we can't reclaim using these statistics, usually if 
> anything the kernel is *over* eager to drop cache pages than anything.
> 
> If the kernel thinks those file pages are too hot, though, it won't drop them. 
> However, we only have 106M active file, compared to 2GB memory.current, so it 
> doesn't look like this is the issue.
> 
> Can you please show io.pressure, io.stat, and cpu.pressure during these periods 
> compared to baseline for this cgroup and globally (from /proc/pressure)? My 
> suspicion is that we are not able to reclaim fast enough because memory 
> management is getting stuck behind a slow disk.

The disk should not be too slow at writing (SAN for most of the data,
a local battery-backed RAID for logs).

The system's IO pressure is low (below 1 except for some random peaks going
up to 20).
The system's CPU pressure is similar (spikes happen at unrelated times).
The system's memory pressure, though, is most often high.

Prior to the kernel update it was mostly in the 5-10 range (short-term value,
with periods spiking to around 20), while the long-term value remained below 5.

Since the kernel upgrade things have changed quite a lot:
sometimes memory pressure is low, but it mostly ranges between 40 and
80.

I guess the attached PNG will give you a better idea than any textual
explanation (the reboot for the kernel upgrade happened during the night from
Friday to Saturday, around midnight).

Digging some deeper in the (highly affected) cg hierarchy:

CGv2:
  + workload
  | | ....
  + system
    | base    (has init, ntp and the like system daemons)
    | shell   (has tty gettys, ssh and the like)
    | backup  (has backup processes only)

system/:
	memory.current:8589053952
	memory.high:8589934592
	memory.low:134217728
	memory.max:9663676416
	memory.events:low 0
	memory.events:high 441886
	memory.events:max 31131
	memory.events:oom 0
	memory.events:oom_kill 0
	memory.stat:file 8346779648
	memory.stat:file_mapped 105971712
	memory.stat:file_dirty 2838528
	memory.stat:file_writeback 1486848
	memory.stat:inactive_file 6600683520
	memory.stat:active_file 1067331584
system/base:
	memory.current:7789477888
	memory.high:max
	memory.low:0
	memory.max:max
	memory.events:low 0
	memory.events:high 0
	memory.events:max 0
	memory.events:oom 0
	memory.events:oom_kill 0
	memory.stat:file 7586832384
	memory.stat:file_mapped 92995584
	memory.stat:file_dirty 1351680
	memory.stat:file_writeback 1081344
	memory.stat:inactive_file 6592962560
	memory.stat:active_file 946053120
system/shell:
	memory.current:638394368
	memory.high:max
	memory.low:0
	memory.max:max
	memory.events:low 0
	memory.events:high 0
	memory.events:max 0
	memory.events:oom 0
	memory.events:oom_kill 0
	memory.stat:file 637349888
	memory.stat:file_mapped 2568192
	memory.stat:file_dirty 405504
	memory.stat:file_writeback 0
	memory.stat:inactive_file 3645440
	memory.stat:active_file 6991872
system/backup:
	memory.current:160874496
	memory.high:2147483648
	memory.low:33554432
	memory.max:2415919104
	memory.events:low 0
	memory.events:high 425240
	memory.events:max 31131
	memory.events:oom 0
	memory.events:oom_kill 0
	memory.stat:file 122687488
	memory.stat:file_mapped 10678272
	memory.stat:file_dirty 675840
	memory.stat:file_writeback 405504
	memory.stat:inactive_file 10416128
	memory.stat:active_file 110329856

For tasks being throttled /proc/$pid/stack shows
	[<0>] mem_cgroup_handle_over_high+0x121/0x170
	[<0>] exit_to_usermode_loop+0x67/0xa0
	[<0>] do_syscall_64+0x149/0x170
	[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

It even hits my shell running over ssh. It turns out that for now (as the
backup processes have been restarted and are running mostly smoothly at the
moment) the throttling is caused by the parent system/ cgroup, which
reached its high limit with tons of inactive file cache.

Note: under workload there is a task walking through the whole SAN disk
      file tree counting file sizes at an hourly interval, thus keeping most
      inodes at least partially active.
      The major workload is a webserver (serving static files and running
      PHP-based CMSs).

> Swap availability and usage information would also be helpful.

There is no swap.

> >Regularly the backup process seems to be blocked for about 2s, but not
> >within a syscall according to strace.  
> 
> 2 seconds is important, it's the maximum time we allow the allocator throttler 
> to throttle for one allocation :-)
> 
> If you want to verify, you can look at /proc/pid/stack during these stalls -- 
> they should be in mem_cgroup_handle_over_high, in an address related to 
> allocator throttling.

Yes, I've seen that (see above).

> >Is there a way to tell kernel that this cgroup should not be throttled  
> 
> Huh? That's what memory.high is for, so why are you using if it you don't want 
> that?

Well, the remainder of the sentence is the important part.
The cgroup is expected to have short-lived cache usage, and thus caches
should not be a reason for throttling.

> >and its inactive file cache given up (rather quickly).  
> 
> I suspect the kernel is reclaiming as far as it can, but is being stopped from 
> doing so for some reason, which is why I'd like to see io.pressure and 
> cpu.pressure.

  io.pressure:some avg10=0.17 avg60=0.35 avg300=0.37 total=6479904094
  io.pressure:full avg10=0.16 avg60=0.32 avg300=0.33 total=6363939615
  backup/io.pressure:some avg10=0.00 avg60=0.00 avg300=0.00 total=3600665286
  backup/io.pressure:full avg10=0.00 avg60=0.00 avg300=0.00 total=3580320436
  base/io.pressure:some avg10=0.26 avg60=0.40 avg300=0.38 total=4584357682
  base/io.pressure:full avg10=0.25 avg60=0.37 avg300=0.35 total=4512115687
  shell/io.pressure:some avg10=0.00 avg60=0.00 avg300=0.00 total=7337275
  shell/io.pressure:full avg10=0.00 avg60=0.00 avg300=0.00 total=7329137

That's low I would say.

> >On a side note, I liked v1's mode of soft/hard memory limit where the
> >memory amount between soft and hard could be used if system has enough
> >free memory. For v2 the difference between high and max seems almost of
> >no use.  
> 
> For that use case, that's more or less what we've designed memory.low to do. 
> The difference is that v1's soft limit almost never worked: the heuristics are 
> extremely complicated, so complicated in fact that even we as memcg maintainers 
> cannot reason about them. If we cannot reason about them, I'm quite sure it's 
> not really doing what you expect :-)

Well, memory.low is great for the workload, but not really for the backup,
which should not "pollute" the system's file cache (about the same issue as
logs, which are almost write-only but still tend to fill the file cache,
throwing out more useful pages).

> In this case everything looks like it's working as intended, just this is all 
> the result of memory.high becoming less broken in 5.4. From your description, 
> I'm not sure that memory.high is what you want, either.
> 
> >A cgroup parameter for impacting RO file cache differently than
> >anonymous memory or otherwise dirty memory would be great too.  
> 
> We had vm.swappiness in v1 and it manifested extremely poorly. I won't go too 
> much into the details of that here though, since we already discussed it fairly 
> comprehensively here[0].
> 
> Please feel free to send over the io.pressure, io.stat, cpu.pressure, and swap 
> metrics at baseline and during this when possible. Thanks!

Current system-wide pressure metrics:
/proc/pressure/cpu:some avg10=0.05 avg60=0.08 avg300=0.07 total=965407160
/proc/pressure/io:some avg10=0.00 avg60=0.02 avg300=0.04 total=5674971954
/proc/pressure/io:full avg10=0.00 avg60=0.02 avg300=0.04 total=5492982327
/proc/pressure/memory:some avg10=33.21 avg60=21.28 avg300=21.06 total=166513106563
/proc/pressure/memory:full avg10=32.09 avg60=20.23 avg300=20.13 total=158792995733



In the end the big question is why the large amounts of inactive file cache
survive reclaim and thus cause cgroups to get starved.

> 0: https://lore.kernel.org/patchwork/patch/1172080/

Thanks,
Bruno

[-- Attachment #2: MemoryPressure_.png --]
[-- Type: image/png, Size: 25356 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 10:34     ` Michal Hocko
@ 2020-04-09 15:09       ` Bruno Prémont
  2020-04-09 15:24         ` Chris Down
  2020-04-09 15:25         ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Bruno Prémont @ 2020-04-09 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

On Thu, 9 Apr 2020 12:34:00 +0200 Michal Hocko wrote:

> On Thu 09-04-20 12:17:33, Bruno Prémont wrote:
> > On Thu, 9 Apr 2020 11:46:15 Michal Hocko wrote:  
> > > [Cc Chris]
> > > 
> > > On Thu 09-04-20 11:25:05, Bruno Prémont wrote:  
> > > > Hi,
> > > > 
> > > > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > > > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > > > sees backup being highly throttled (there are about 1.5T to be
> > > > backuped).    
> > > 
> > > What does /proc/sys/vm/dirty_* say?  
> > 
> > /proc/sys/vm/dirty_background_bytes:0
> > /proc/sys/vm/dirty_background_ratio:10
> > /proc/sys/vm/dirty_bytes:0
> > /proc/sys/vm/dirty_expire_centisecs:3000
> > /proc/sys/vm/dirty_ratio:20
> > /proc/sys/vm/dirty_writeback_centisecs:500  
> 
> Sorry, but I forgot ask for the total amount of memory. But it seems
> this is 64GB and 10% dirty ration might mean a lot of dirty memory.
> Does the same happen if you reduce those knobs to something smaller than
> 2G? _bytes alternatives should be useful for that purpose.

Well, tuning it to
/proc/sys/vm/dirty_background_bytes:268435456
/proc/sys/vm/dirty_background_ratio:0
/proc/sys/vm/dirty_bytes:536870912
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:0
/proc/sys/vm/dirty_writeback_centisecs:500
does not make any difference.


From /proc/meminfo there is no indication of high amounts of dirty
memory either:
MemTotal:       65930032 kB
MemFree:        21237240 kB
MemAvailable:   51646528 kB
Buffers:          202692 kB
Cached:         21493120 kB
SwapCached:            0 kB
Active:         12875888 kB
Inactive:       11361852 kB
Active(anon):    2879072 kB
Inactive(anon):    20600 kB
Active(file):    9996816 kB
Inactive(file): 11341252 kB
Unevictable:        9396 kB
Mlocked:            9396 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              3880 kB
Writeback:             0 kB
AnonPages:       2551776 kB
Mapped:           461816 kB
Shmem:            351144 kB
KReclaimable:   10012140 kB
Slab:           15673816 kB
SReclaimable:   10012140 kB
SUnreclaim:      5661676 kB
KernelStack:        7888 kB
PageTables:        24192 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32965016 kB
Committed_AS:    4792440 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      126408 kB
VmallocChunk:          0 kB
Percpu:           136448 kB
HardwareCorrupted:     0 kB
AnonHugePages:    825344 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
DirectMap4k:       27240 kB
DirectMap2M:     4132864 kB
DirectMap1G:    65011712 kB


> [...]
> 
> > > Is it possible that the reclaim is not making progress on too many
> > > dirty pages and that triggers the back off mechanism that has been
> > > implemented recently in  5.4 (have a look at 0e4b01df8659 ("mm,
> > > memcg: throttle allocators when failing reclaim over memory.high")
> > > and e26733e0d0ec ("mm, memcg: throttle allocators based on
> > > ancestral memory.high").  
> > 
> > Could be though in that case it's throttling the wrong task/cgroup
> > as far as I can see (at least from cgroup's memory stats) or being
> > blocked by state external to the cgroup.
> > Will have a look at those patches so get a better idea at what they
> > change.  
> 
> Could you check where is the task of your interest throttled?
> /proc/<pid>/stack should give you a clue.

As guessed by Chris, it's
[<0>] mem_cgroup_handle_over_high+0x121/0x170
[<0>] exit_to_usermode_loop+0x67/0xa0
[<0>] do_syscall_64+0x149/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9


And I know no way to tell kernel "drop all caches" for a specific cgroup
nor how to list the inactive files assigned to a given cgroup (knowing
which ones they are and their idle state could help understanding why
they aren't being reclaimed).



Could it be that cache is being prevented from being reclaimed by a task
in another cgroup?

e.g.
  cgroup/system/backup
    first reads $files (reads each once)
  cgroup/workload/bla
    second&more reads $files

Would $files remain associated to cgroup/system/backup and not
reclaimed there instead of being reassigned to cgroup/workload/bla?



Bruno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:09       ` Bruno Prémont
@ 2020-04-09 15:24         ` Chris Down
  2020-04-09 15:40           ` Bruno Prémont
  2020-04-09 15:25         ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Down @ 2020-04-09 15:24 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Michal Hocko, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

Bruno Prémont writes:
>Could it be that cache is being prevented from being reclaimed by a task
>in another cgroup?
>
>e.g.
>  cgroup/system/backup
>    first reads $files (reads each once)
>  cgroup/workload/bla
>    second&more reads $files
>
>Would $files remain associated to cgroup/system/backup and not
>reclaimed there instead of being reassigned to cgroup/workload/bla?

Yes, that's entirely possible. The first cgroup to fault in the pages is 
charged for the memory. Other cgroups may use them, but they are not accounted 
for as part of that other cgroup. They may also still be "active" as a result 
of use by another cgroup.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:09       ` Bruno Prémont
  2020-04-09 15:24         ` Chris Down
@ 2020-04-09 15:25         ` Michal Hocko
  2020-04-10  7:15           ` Bruno Prémont
  2020-04-14 15:09           ` Bruno Prémont
  1 sibling, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2020-04-09 15:25 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

On Thu 09-04-20 17:09:26, Bruno Prémont wrote:
> On Thu, 9 Apr 2020 12:34:00 +0200 Michal Hocko wrote:
> 
> > On Thu 09-04-20 12:17:33, Bruno Prémont wrote:
> > > On Thu, 9 Apr 2020 11:46:15 Michal Hocko wrote:  
> > > > [Cc Chris]
> > > > 
> > > > On Thu 09-04-20 11:25:05, Bruno Prémont wrote:  
> > > > > Hi,
> > > > > 
> > > > > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > > > > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > > > > sees backup being highly throttled (there are about 1.5T to be
> > > > > backuped).    
> > > > 
> > > > What does /proc/sys/vm/dirty_* say?  
> > > 
> > > /proc/sys/vm/dirty_background_bytes:0
> > > /proc/sys/vm/dirty_background_ratio:10
> > > /proc/sys/vm/dirty_bytes:0
> > > /proc/sys/vm/dirty_expire_centisecs:3000
> > > /proc/sys/vm/dirty_ratio:20
> > > /proc/sys/vm/dirty_writeback_centisecs:500  
> > 
> > Sorry, but I forgot ask for the total amount of memory. But it seems
> > this is 64GB and 10% dirty ration might mean a lot of dirty memory.
> > Does the same happen if you reduce those knobs to something smaller than
> > 2G? _bytes alternatives should be useful for that purpose.
> 
> Well, tuning it to /proc/sys/vm/dirty_background_bytes:268435456
> /proc/sys/vm/dirty_background_ratio:0
> /proc/sys/vm/dirty_bytes:536870912
> /proc/sys/vm/dirty_expire_centisecs:3000
> /proc/sys/vm/dirty_ratio:0
> /proc/sys/vm/dirty_writeback_centisecs:500
> does not make any difference.

OK, it was a wild guess because cgroup v2 should be able to throttle
heavy writers and be memcg aware AFAIR. But good to have it confirmed.

[...]

> > > > Is it possible that the reclaim is not making progress on too many
> > > > dirty pages and that triggers the back off mechanism that has been
> > > > implemented recently in  5.4 (have a look at 0e4b01df8659 ("mm,
> > > > memcg: throttle allocators when failing reclaim over memory.high")
> > > > and e26733e0d0ec ("mm, memcg: throttle allocators based on
> > > > ancestral memory.high").  
> > > 
> > > Could be though in that case it's throttling the wrong task/cgroup
> > > as far as I can see (at least from cgroup's memory stats) or being
> > > blocked by state external to the cgroup.
> > > Will have a look at those patches so get a better idea at what they
> > > change.  
> > 
> > Could you check where is the task of your interest throttled?
> > /proc/<pid>/stack should give you a clue.
> 
> As guessed by Chris, it's
> [<0>] mem_cgroup_handle_over_high+0x121/0x170
> [<0>] exit_to_usermode_loop+0x67/0xa0
> [<0>] do_syscall_64+0x149/0x170
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> And I know no way to tell kernel "drop all caches" for a specific cgroup
> nor how to list the inactive files assigned to a given cgroup (knowing
> which ones they are and their idle state could help understanding why
> they aren't being reclaimed).
> 
> 
> 
> Could it be that cache is being prevented from being reclaimed by a task
> in another cgroup?
> 
> e.g.
>   cgroup/system/backup
>     first reads $files (reads each once)
>   cgroup/workload/bla
>     second&more reads $files
> 
> Would $files remain associated to cgroup/system/backup and not
> reclaimed there instead of being reassigned to cgroup/workload/bla?

No, page cache is first-touch-gets-charged. But interference is certainly
possible if the memory is somehow pinned - e.g. mlock - by
a task from another cgroup or internally by the FS.

Your earlier stat snapshot doesn't indicate a big problem with the
reclaim though:

memory.stat:pgscan 47519855
memory.stat:pgsteal 44933838

This tells the overall reclaim effectiveness was 94%. Could you try to
gather snapshots with a 1s granularity starting before you run your
backup to see how those numbers evolve? Ideally with timestamps to
compare with the actual stall information.
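
A minimal sketch of such a sampler, assuming the cgroup path; it also prints
the per-interval pgsteal/pgscan ratio, i.e. the reclaim effectiveness:

  #!/usr/bin/env python3
  import time

  STAT = "/sys/fs/cgroup/system/backup/memory.stat"   # assumed path

  def read_counters():
      vals = {}
      with open(STAT) as f:
          for line in f:
              key, value = line.split()
              if key in ("pgscan", "pgsteal"):
                  vals[key] = int(value)
      return vals

  prev = read_counters()
  while True:
      time.sleep(1)
      cur = read_counters()
      d_scan = cur["pgscan"] - prev["pgscan"]
      d_steal = cur["pgsteal"] - prev["pgsteal"]
      eff = (100.0 * d_steal / d_scan) if d_scan else 100.0
      print(f"{time.strftime('%H:%M:%S')} pgscan +{d_scan} pgsteal +{d_steal} "
            f"({eff:.0f}% effective)")
      prev = cur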

Another option would be to enable vmscan tracepoints but let's try with
stats first.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:24         ` Chris Down
@ 2020-04-09 15:40           ` Bruno Prémont
  2020-04-09 17:50             ` Chris Down
  0 siblings, 1 reply; 18+ messages in thread
From: Bruno Prémont @ 2020-04-09 15:40 UTC (permalink / raw)
  To: Chris Down
  Cc: Michal Hocko, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

On Thu, 9 Apr 2020 16:24:17 +0100 Chris Down wrote:

> Bruno Prémont writes:
> >Could it be that cache is being prevented from being reclaimed by a task
> >in another cgroup?
> >
> >e.g.
> >  cgroup/system/backup
> >    first reads $files (reads each once)
> >  cgroup/workload/bla
> >    second&more reads $files
> >
> >Would $files remain associated to cgroup/system/backup and not
> >reclaimed there instead of being reassigned to cgroup/workload/bla?  
> 
> Yes, that's entirely possible. The first cgroup to fault in the pages is 
> charged for the memory. Other cgroups may use them, but they are not accounted 
> for as part of that other cgroup. They may also still be "active" as a result 
> of use by another cgroup.

But the memory would then be 'active' in the original cgroup, which is
not the case here, I feel.
If the pages remain inactive yet unreclaimable in the first cgroup due to use
in another cgroup, that would be at least surprising.

Doubling the high value helped (but for how long?); memory.current is back
around memory.high but there is no throttling yet. And from the
increase until now memory.pressure has been small/zero.

Capturing 
  memory.stat:pgscan 47519855
  memory.stat:pgsteal 44933838
over time for Michal and will report back later this evening.

When seen stuck backup was reading a multi-GiB file with
  open(, O_NOATIME)
  while (read()) {
    transform and write to network
  }
  close()
thus plain sequential file read through file cache (and for this
backup run, only files not in use by anyone else, or some being
just appended to by others).

Bruno


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:40           ` Bruno Prémont
@ 2020-04-09 17:50             ` Chris Down
  2020-04-09 17:56               ` Chris Down
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Down @ 2020-04-09 17:50 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Michal Hocko, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

Bruno Prémont writes:
>On Thu, 9 Apr 2020 16:24:17 +0100 wrote:
>
>> Bruno Prémont writes:
>> >Could it be that cache is being prevented from being reclaimed by a task
>> >in another cgroup?
>> >
>> >e.g.
>> >  cgroup/system/backup
>> >    first reads $files (reads each once)
>> >  cgroup/workload/bla
>> >    second&more reads $files
>> >
>> >Would $files remain associated to cgroup/system/backup and not
>> >reclaimed there instead of being reassigned to cgroup/workload/bla?
>>
>> Yes, that's entirely possible. The first cgroup to fault in the pages is
>> charged for the memory. Other cgroups may use them, but they are not accounted
>> for as part of that other cgroup. They may also still be "active" as a result
>> of use by another cgroup.
>
>But the memory would then be 'active' in the original cgroup? which is
>not the case here I feel.

Yes, that's correct. I don't think it's the case here (since active_file is not 
that large in the affected cgroup), but it's certainly generally a possibility.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 17:50             ` Chris Down
@ 2020-04-09 17:56               ` Chris Down
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Down @ 2020-04-09 17:56 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Michal Hocko, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

From my side, this looks like memory.high is working as intended and there is 
some other generic problem with reclaim happening here.

I think the data which Michal asked for would help a lot to narrow down what's 
going on.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:25         ` Michal Hocko
@ 2020-04-10  7:15           ` Bruno Prémont
  2020-04-10  8:43             ` Bruno Prémont
  2020-04-14 15:09           ` Bruno Prémont
  1 sibling, 1 reply; 18+ messages in thread
From: Bruno Prémont @ 2020-04-10  7:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

[-- Attachment #1: Type: text/plain, Size: 1457 bytes --]

Hi Michal,

On Thu, 9 Apr 2020 17:25:40 Michal Hocko <mhocko@kernel.org> wrote:
> Your earlier stat snapshot doesn't indicate a big problem with the
> reclaim though:
> 
> memory.stat:pgscan 47519855
> memory.stat:pgsteal 44933838
> 
> This tells the overall reclaim effectiveness was 94%. Could you try to
> gather snapshots with a 1s granularity starting before your run your
> backup to see how those numbers evolve? Ideally with timestamps to
> compare with the actual stall information.

Attached is a long collection of
 date  memory.current   memory.stat[pgscan]  memory.stat[pgsteal]

It started while the backup was running more or less smoothly with its
memory.high set to 4294967296 (4G instead of 2G), until the backup finished
around 20:22.

From the system memory pressure RRD graph I see pressure (around 60)
between about 19:50 and 20:10, while it is very small the rest of the time
(below 1).



I started a new backup run this morning, grabbing full info snapshots of
the backup cgroup at a 1s interval in order to get a better/more complete
picture, with the CG's memory.high back at the 2G limit.


I have the impression that reclaim is somehow not triggered often enough or
not strongly enough compared to the IO performed within the CG
(the complete backup covers 130G of data, with data being read in blocks of
128kB at a smooth-running rate of ~7MiB/s).

> Another option would be to enable vmscan tracepoints but let's try with
> stats first.


Bruno

[-- Attachment #2: backup.cg_pg_.log.gz --]
[-- Type: application/gzip, Size: 123934 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-10  7:15           ` Bruno Prémont
@ 2020-04-10  8:43             ` Bruno Prémont
       [not found]               ` <20200410115010.1d9f6a3f@hemera.lan.sysophe.eu>
  0 siblings, 1 reply; 18+ messages in thread
From: Bruno Prémont @ 2020-04-10  8:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

Hi Michal, Chris,

Well, tar made me unhappy; it just collected the list of files but not
their content from /sys/fs/cgroup/...

But if I set memory.max = memory.high reclaim seems to work and memory
pressure remains zero for the cg.
If I set memory.max = $((memory.high + 128M)) memory pressure rises
immediately (when memory.current ~= memory.high).

Returning to memory.max=memory.high gets things running again and
memory pressure starts dropping immediately.


Could it be that the wrong limit of high/max is being used for reclaim?


Bruno

On Fri, 10 Apr 2020 09:15:25 +0200
Bruno Prémont <bonbons@linux-vserver.org> wrote:
> Hi Michal,
> 
> On Thu, 9 Apr 2020 17:25:40 Michal Hocko <mhocko@kernel.org> wrote:
> > Your earlier stat snapshot doesn't indicate a big problem with the
> > reclaim though:
> > 
> > memory.stat:pgscan 47519855
> > memory.stat:pgsteal 44933838
> > 
> > This tells the overall reclaim effectiveness was 94%. Could you try to
> > gather snapshots with a 1s granularity starting before your run your
> > backup to see how those numbers evolve? Ideally with timestamps to
> > compare with the actual stall information.  
> 
> Attached is a long collection of
>  date  memory.current   memory.stat[pgscan]  memory.stat[pgsteal]
> 
> It started while backup was running +/- smoothly with its memory.high
> set to 4294967296 (4G instead of 2G) until backup finished around 20:22.
> 
> From system memory pressure RRD-graph I see pressure (around 60)
> between about 19:50 to 20:10 while very small the rest of the time
> (below 1).
> 
> 
> 
> I started a new backup run this morning grabbing full info snapshots of
> backup cgroup at 1s interval in order to get a better/more complete
> picture and CG's memory.high back to 2G limit.
> 
> 
> I have the impression as if reclaim was somehow triggered not enough or
> not strongly enough compared to the IO performed within the CG
> (complete backup covers 130G of data, data being read in blocks of
> 128kB at a smooth-running rate of ~7MiB/s).
> 
> > Another option would be to enable vmscan tracepoints but let's try with
> > stats first.  
> 
> 
> Bruno



-- 
Bruno Prémont <bruno.premont@restena.lu>
Ingénieur système et développements

Fondation RESTENA
2, avenue de l'Université
L-4365 Esch/Alzette

Tél: (+352) 424409
Fax: (+352) 422473
https://www.restena.lu     https://www.dns.lu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 upgrade slows backup
  2020-04-09 15:25         ` Michal Hocko
  2020-04-10  7:15           ` Bruno Prémont
@ 2020-04-14 15:09           ` Bruno Prémont
  1 sibling, 0 replies; 18+ messages in thread
From: Bruno Prémont @ 2020-04-14 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Johannes Weiner, Vladimir Davydov, Chris Down

Hi Michal, Chris,

I can reproduce very easily with basic commands on an idle system with
just a reasonably filled partition and lots of (free) RAM, running:
  bash -c 'echo $$ > $path/to/cgroup/cgroup.procs; tar -zc -C /export . > /dev/null'
where tar is running all alone in its cgroup with
  memory.high = 1024M
  memory.max  = 1152M   (high + 128M)

At the start
  memory.stat:pgscan 0
  memory.stat:pgsteal 0
once pressure is "high" and tar gets throttled, both values increase
in lockstep by 64 every 2 seconds.

The cgroup's memory.current starts at 0 and grows up to memory.high, and then
the pressure starts.
  memory.stat:inactive_file 910192640
  memory.stat:active_file 61501440
active_file remains low (64M) while inactive_file is high (most of the
1024M allowed).

Somehow reclaim either does not consider the inactive_file pages or reclaims
in pieces too small compared to the memory turnover in the cgroup.


Even having memory.max just a single page (4096 bytes) larger
than memory.high brings the same throttling behavior.
Changing memory.max to match memory.high gets reclaim to work without
throttling.
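
For reference, a sketch of the same reproduction as a single script, under
the assumption that cgroup v2 is mounted at /sys/fs/cgroup with the memory
controller enabled for child cgroups (needs root; /export is just the example
path used above):

  #!/usr/bin/env python3
  # Run tar alone in a fresh cgroup with memory.high = 1024M and
  # memory.max = high + 128M, and watch memory.current, pgscan/pgsteal
  # and memory.pressure while it runs.
  import os
  import subprocess
  import time

  CG = "/sys/fs/cgroup/repro"                    # assumed scratch cgroup
  HIGH = 1024 * 1024 * 1024                      # 1024M
  MAX = HIGH + 128 * 1024 * 1024                 # high + 128M

  os.makedirs(CG, exist_ok=True)
  with open(f"{CG}/memory.high", "w") as f:
      f.write(str(HIGH))
  with open(f"{CG}/memory.max", "w") as f:
      f.write(str(MAX))

  # Move ourselves into the cgroup so the tar child is charged to it
  # (the small python process is charged too, which is negligible here).
  with open(f"{CG}/cgroup.procs", "w") as f:
      f.write(str(os.getpid()))

  tar = subprocess.Popen(["tar", "-zc", "-C", "/export", "."],
                         stdout=subprocess.DEVNULL)

  while tar.poll() is None:
      with open(f"{CG}/memory.current") as f:
          current = f.read().strip()
      with open(f"{CG}/memory.pressure") as f:
          pressure = f.readline().strip()        # the "some avg10=..." line
      stats = {}
      with open(f"{CG}/memory.stat") as f:
          for line in f:
              key, val = line.split()
              stats[key] = int(val)
      print(time.strftime("%H:%M:%S"), "current", current,
            "| pgscan", stats.get("pgscan", 0),
            "pgsteal", stats.get("pgsteal", 0),
            "|", pressure)
      time.sleep(2)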


Bruno


On Thu, 9 Apr 2020 17:25:40 Michal Hocko wrote:
> On Thu 09-04-20 17:09:26, Bruno Prémont wrote:
> > On Thu, 9 Apr 2020 12:34:00 +0200 Michal Hocko wrote:
> >   
> > > On Thu 09-04-20 12:17:33, Bruno Prémont wrote:  
> > > > On Thu, 9 Apr 2020 11:46:15 Michal Hocko wrote:    
> > > > > [Cc Chris]
> > > > > 
> > > > > On Thu 09-04-20 11:25:05, Bruno Prémont wrote:    
> > > > > > Hi,
> > > > > > 
> > > > > > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > > > > > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > > > > > sees backup being highly throttled (there are about 1.5T to be
> > > > > > backuped).      
> > > > > 
> > > > > What does /proc/sys/vm/dirty_* say?    
> > > > 
> > > > /proc/sys/vm/dirty_background_bytes:0
> > > > /proc/sys/vm/dirty_background_ratio:10
> > > > /proc/sys/vm/dirty_bytes:0
> > > > /proc/sys/vm/dirty_expire_centisecs:3000
> > > > /proc/sys/vm/dirty_ratio:20
> > > > /proc/sys/vm/dirty_writeback_centisecs:500    
> > > 
> > > Sorry, but I forgot ask for the total amount of memory. But it seems
> > > this is 64GB and 10% dirty ration might mean a lot of dirty memory.
> > > Does the same happen if you reduce those knobs to something smaller than
> > > 2G? _bytes alternatives should be useful for that purpose.  
> > 
> > Well, tuning it to /proc/sys/vm/dirty_background_bytes:268435456
> > /proc/sys/vm/dirty_background_ratio:0
> > /proc/sys/vm/dirty_bytes:536870912
> > /proc/sys/vm/dirty_expire_centisecs:3000
> > /proc/sys/vm/dirty_ratio:0
> > /proc/sys/vm/dirty_writeback_centisecs:500
> > does not make any difference.  
> 
> OK, it was a wild guess because cgroup v2 should be able to throttle
> heavy writers and be memcg aware AFAIR. But good to have it confirmed.
> 
> [...]
> 
> > > > > Is it possible that the reclaim is not making progress on too many
> > > > > dirty pages and that triggers the back off mechanism that has been
> > > > > implemented recently in  5.4 (have a look at 0e4b01df8659 ("mm,
> > > > > memcg: throttle allocators when failing reclaim over memory.high")
> > > > > and e26733e0d0ec ("mm, memcg: throttle allocators based on
> > > > > ancestral memory.high")).
> > > > 
> > > > Could be, though in that case it's throttling the wrong task/cgroup
> > > > as far as I can see (at least from the cgroup's memory stats), or it
> > > > is being blocked by state external to the cgroup.
> > > > Will have a look at those patches to get a better idea of what they
> > > > change.
> > > 
> > > Could you check where is the task of your interest throttled?
> > > /proc/<pid>/stack should give you a clue.  
> > 
> > As guessed by Chris, it's
> > [<0>] mem_cgroup_handle_over_high+0x121/0x170
> > [<0>] exit_to_usermode_loop+0x67/0xa0
> > [<0>] do_syscall_64+0x149/0x170
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > 
> > And I know of no way to tell the kernel to "drop all caches" for a specific
> > cgroup, nor how to list the inactive files assigned to a given cgroup (knowing
> > which ones they are and their idle state could help in understanding why
> > they aren't being reclaimed).
> > 
> > 
> > 
> > Could it be that cache is being prevented from being reclaimed by a task
> > in another cgroup?
> > 
> > e.g.
> >   cgroup/system/backup
> >     first reads $files (reads each once)
> >   cgroup/workload/bla
> >     second&more reads $files
> > 
> > Would $files remain associated to cgroup/system/backup and not
> > reclaimed there instead of being reassigned to cgroup/workload/bla?  
> 
> No, page cache is first-touch-gets-charged. But interference is certainly
> possible if the memory is somehow pinned - e.g. mlock - by
> a task from another cgroup or internally by FS.
> 
> Your earlier stat snapshot doesn't indicate a big problem with the
> reclaim though:
> 
> memory.stat:pgscan 47519855
> memory.stat:pgsteal 44933838
> 
> This tells the overall reclaim effectiveness was 94%. Could you try to
> gather snapshots with a 1s granularity starting before you run your
> backup to see how those numbers evolve? Ideally with timestamps to
> compare with the actual stall information.
> 
> Another option would be to enable vmscan tracepoints but let's try with
> stats first.
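
(For reference: the 94% above is pgsteal/pgscan, i.e. 44933838 / 47519855 ≈ 0.945.
The requested per-second snapshots can be gathered with a small loop along
these lines; the cgroup path and output directory are assumptions:
  mkdir -p /tmp/memcg-snapshots
  while sleep 1; do
      d=/tmp/memcg-snapshots/$(date +%s); mkdir -p "$d"
      for f in memory.current memory.stat memory.pressure; do
          cat /sys/fs/cgroup/tar-test/$f > "$d/$f"
      done
  done
The per-second directory names double as the timestamps asked for above.)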



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 uprade slows backup
       [not found]                 ` <20200414163134.GQ4629@dhcp22.suse.cz>
@ 2020-04-15 10:17                   ` Bruno Prémont
  2020-04-15 10:24                     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Bruno Prémont @ 2020-04-15 10:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Down, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

Hi Michal,

On Tue, 14 Apr 2020 18:31:34 Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 10-04-20 11:50:10, Bruno Prémont wrote:
> > Hi Michal, Chris,
> > 
> > Sending ephemeral link to (now properly) captured cgroup details off-list.

Re-adding the list and other readers.

> > It contains:
> >   snapshots of the cgroup contents at 1s intervals
> > The backup was running through the full captured period.
> > 
> > You can see memory.max changes at
> >   26-21  (to high + 128M)
> > and
> >   30-24  (back to high)  
> 
> OK, so you have started with high = max and do not see any stalls. As
> soon as you activate the high limit reclaim by increasing the max limit,
> you get your stalls because those tasks are put to sleep. There are
> only 3 tasks in the cgroup and they seem to be a shell, tar and some
> subshell - or is there anything else that could charge any memory?

On the production system it's just the backup software (3-4 processes).

In the basic reproducer it's bash, tar and tar's gzip subprocess.

> Let's just focus on the time prior to switch and after
> * prior
> $ for i in $(seq -w 21); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2371895296.00 max: 2415874048.00 avg: 2404059136.00 std: 13469809.78 nr: 20
> 
> high = hard limit = 2415919104
> 
> * after
> $ for i in $(seq -w 22 59); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2409172992.00 max: 2415828992.00 avg: 2415420793.26 std: 1475181.24 nr: 38
> 
> high limit = 2415919104
> hard limit = 2550136832
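
(calc_min_max.awk itself is not included in the thread; an equivalent summary
of a column of numbers can be produced with a stock awk one-liner, e.g.:
  awk '{ n++; s += $1; ss += $1 * $1;
         if (n == 1 || $1 < min) min = $1;
         if (n == 1 || $1 > max) max = $1 }
       END { printf "min: %.2f max: %.2f avg: %.2f std: %.2f nr: %d\n",
             min, max, s / n, sqrt(ss / n - (s / n) ^ 2), n }'
This computes the population standard deviation, which may differ slightly
from whatever the original script computes.)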
> 
> Nothing interesting here. The charged memory stays below the high limit.
> This might be a matter of timing of course because your snapshot might
> hit the window when the situation was ok. But 90K is a larger margin
> than I would expect in such a case.
> 
> $ cat 27-*/memory.current | calc_min_max.awk 
> min: 2408161280.00 max: 2415910912.00 avg: 2415709583.19 std: 993983.28 nr: 59
> 
> Still under the high limit but closer (within 8K), so this looks much more
> like high limit reclaim in action.
> 
> $ cat 28-*/memory.current | calc_min_max.awk
> min: 2409123840.00 max: 2415910912.00 avg: 2415633019.59 std: 870311.11 nr: 58
> 
> same here.
> 
> $ cat 29-*/memory.current | calc_min_max.awk
> min: 2400464896.00 max: 2415828992.00 avg: 2414819015.59 std: 3133978.89 nr: 59
> 
> quite below high limit.
> 
> So I do not see any large high limit excess here but it could be a
> matter of timing as mentioned above. Let's have a look at the reclaim
> activity.
> 
> 27-00/memory.stat:pgscan 82811883
> 27-00/memory.stat:pgsteal 80223813
> 27-01/memory.stat:pgscan 82811883
> 27-01/memory.stat:pgsteal 80223813
> 
> No scanning 
> 
> 27-02/memory.stat:pgscan 82811947
> 27-02/memory.stat:pgsteal 80223877
> 
> 64 pages scanned and reclaimed
> 
> 27-03/memory.stat:pgscan 82811947
> 27-03/memory.stat:pgsteal 80223877
> 27-04/memory.stat:pgscan 82811947
> 27-04/memory.stat:pgsteal 80223877
> 27-05/memory.stat:pgscan 82811947
> 27-05/memory.stat:pgsteal 80223877
> 
> No scanning
> 
> 27-06/memory.stat:pgscan 82812011
> 27-06/memory.stat:pgsteal 80223941
> 
> 64 pages scanned and reclaimed
> 
> 27-07/memory.stat:pgscan 82812011
> 27-07/memory.stat:pgsteal 80223941
> 27-08/memory.stat:pgscan 82812011
> 27-08/memory.stat:pgsteal 80223941
> 27-09/memory.stat:pgscan 82812011
> 27-09/memory.stat:pgsteal 80223941
> 
> No scanning
> 
> 27-11/memory.stat:pgscan 82812075
> 27-11/memory.stat:pgsteal 80224005
> 
> 64 pages scanned
> 
> 27-12/memory.stat:pgscan 82812075
> 27-12/memory.stat:pgsteal 80224005
> 27-13/memory.stat:pgscan 82812075
> 27-13/memory.stat:pgsteal 80224005
> 27-14/memory.stat:pgscan 82812075
> 27-14/memory.stat:pgsteal 80224005
> 
> No scanning. etc...
> 
> So it seems there were two rounds of scanning (we usually do 32 pages per
> batch) and the reclaim was really effective at reclaiming that memory, but
> then the task is put to sleep for 2-3s. This is quite unexpected
> because the collected stats do not show the high limit excess during the
> sleeping time.
> 
> It would be interesting to see more detailed information on the
> throttling itself. Which kernel version are you testing this on?
> 5.6+ kernels need http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
> but please note that e26733e0d0ec ("mm, memcg: throttle allocators based
> on ancestral memory.high") has been marked for stable, so 5.4+ kernels
> might have it as well and would need the same fix.
> I wouldn't be really surprised if this was the actual problem that you
> are hitting, because the reclaim could simply make usage < high and the
> math in calculate_high_delay doesn't work properly.

I'm on 5.6.2. Seems neither e26733e0d0ec nor the fix hit 5.6.2 (nor
current 5.6.4).

> Anyway the following simple tracing patch should give a better clue.
> The output will appear in the trace buffer (mount tracefs and read the
> trace_pipe file).
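
(Reading that output typically boils down to something like the following,
assuming tracefs is not already mounted:
  mount -t tracefs nodev /sys/kernel/tracing
  cat /sys/kernel/tracing/trace_pipe
On many systems it is already available under /sys/kernel/debug/tracing.)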

This is the output I get on 5.6.4 with simple tar -zc call (max=high+4096):
  tar-16943 [000] ....  1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
  tar-16943 [000] ....  1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
  tar-16943 [000] ....  1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
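
(Worth noting: in every line above current is below high - 262122 vs 262144
pages - yet the maximum penalty of 200 jiffies (2s at HZ=100, matching the
~2s stalls seen earlier) is applied. That is consistent with the overage
being computed as an unsigned subtraction that wraps around when usage is
below high, e.g. 262122 - 262144 becoming 2^64 - 22 instead of 0, which is
what Michal later calls the underflow fix.)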

With 5.7-rc1 it runs just fine, pressure remains zero and no output in trace_pipe or throttling.

So the fixes that went in there do fix it.
Now it's a matter of cherry-picking the right ones... e26733e0d0ec and its
follow-up fix, maybe some others (will start with those tagged for stable).


Thanks,
Bruno

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..dcee3030309d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2417,6 +2417,10 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>  
> +	trace_printk("memcg_nr_pages_over_high:%d penalty_jiffies:%ld current:%lu high:%lu\n",
> +			nr_pages, penalty_jiffies,
> +			page_counter_read(&memcg->memory), READ_ONCE(memcg->high));
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 uprade slows backup
  2020-04-15 10:17                   ` Bruno Prémont
@ 2020-04-15 10:24                     ` Michal Hocko
  2020-04-15 11:37                       ` Bruno Prémont
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2020-04-15 10:24 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Chris Down, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

On Wed 15-04-20 12:17:53, Bruno Prémont wrote:
[...]
> > Anyway the following simple tracing patch should give a better clue.
> > The output will appear in the trace buffer (mount tracefs and read the
> > trace_pipe file).
> 
> This is the output I get on 5.6.4 with simple tar -zc call (max=high+4096):
>   tar-16943 [000] ....  1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
>   tar-16943 [000] ....  1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
>   tar-16943 [000] ....  1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
>   tar-16943 [000] ....  1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
>   tar-16943 [000] ....  1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144

OK, that points to the underflow fix.

> 
> With 5.7-rc1 it runs just fine, pressure remains zero and no output in trace_pipe or throttling.
> 
> So the fixes that went in there do fix it.
> Now it's a matter of cherry-picking the right ones... e26733e0d0ec and its
> follow-up fix, maybe some others (will start with those tagged for stable).

I have seen Greg picking up this for stable trees so it should show up
there soon.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Memory CG and 5.1 to 5.6 uprade slows backup
  2020-04-15 10:24                     ` Michal Hocko
@ 2020-04-15 11:37                       ` Bruno Prémont
  0 siblings, 0 replies; 18+ messages in thread
From: Bruno Prémont @ 2020-04-15 11:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Down, cgroups, linux-mm, Johannes Weiner, Vladimir Davydov

On Wed, 15 Apr 2020 12:24:42 Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 15-04-20 12:17:53, Bruno Prémont wrote:
> [...]
> > > Anyway the following simple tracing patch should give a better clue.
> > > The output will appear in the trace buffer (mount tracefs and read the
> > > trace_pipe file).
> > 
> > This is the output I get on 5.6.4 with simple tar -zc call (max=high+4096):
> >   tar-16943 [000] ....  1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
> >   tar-16943 [000] ....  1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
> >   tar-16943 [000] ....  1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
> >   tar-16943 [000] ....  1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
> >   tar-16943 [000] ....  1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144  
> 
> OK, that points to the underflow fix.
> 
> > 
> > With 5.7-rc1 it runs just fine, pressure remains zero and no output in trace_pipe or throttling.
> > 
> > So the fixes that went in there do fix it.
> > Now it's a matter of cherry-picking the right ones... e26733e0d0ec and its
> > follow-up fix, maybe some others (will start with those tagged for stable).
> 
> I have seen Greg picking up this for stable trees so it should show up
> there soon.

Applying just 9b8b17541f13809d06f6f873325305ddbb760e3e, which went to
stable-rc for 5.6.5, gets things running fine here.
(e26733e0d0ec seems to have gone in shortly prior to the 5.6 release; I need
to improve my git-foo to locate commits between tags!)
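
(For the record, locating a commit relative to release tags can be done with,
for example:
  git describe --contains e26733e0d0ec                # nearest tag containing it
  git log --oneline v5.5..v5.6 -- mm/memcontrol.c     # memcontrol.c commits between two tags
which avoids having to guess from commit dates.)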

So yes it's the fix.

Thanks,
Bruno

> Thanks!


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-04-15 11:37 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-09  9:25 Memory CG and 5.1 to 5.6 uprade slows backup Bruno Prémont
2020-04-09  9:46 ` Michal Hocko
2020-04-09 10:17   ` Bruno Prémont
2020-04-09 10:34     ` Michal Hocko
2020-04-09 15:09       ` Bruno Prémont
2020-04-09 15:24         ` Chris Down
2020-04-09 15:40           ` Bruno Prémont
2020-04-09 17:50             ` Chris Down
2020-04-09 17:56               ` Chris Down
2020-04-09 15:25         ` Michal Hocko
2020-04-10  7:15           ` Bruno Prémont
2020-04-10  8:43             ` Bruno Prémont
     [not found]               ` <20200410115010.1d9f6a3f@hemera.lan.sysophe.eu>
     [not found]                 ` <20200414163134.GQ4629@dhcp22.suse.cz>
2020-04-15 10:17                   ` Bruno Prémont
2020-04-15 10:24                     ` Michal Hocko
2020-04-15 11:37                       ` Bruno Prémont
2020-04-14 15:09           ` Bruno Prémont
2020-04-09 10:50 ` Chris Down
2020-04-09 11:58   ` Bruno Prémont

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).