* Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
@ 2020-11-25 11:39 Bruno Prémont
2020-11-25 13:37 ` Michal Hocko
2020-11-25 18:21 ` Roman Gushchin
0 siblings, 2 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-11-25 11:39 UTC (permalink / raw)
To: Yafang Shao
Cc: Chris Down, Michal Hocko, Johannes Weiner, Chris Down, cgroups,
linux-mm, Vladimir Davydov
Hello,
On a production system I've encountered rather harsh behavior from the
kernel in the context of memory cgroups (v2) after updating the kernel from
the 5.7 series to the 5.9 series.
It seems like the kernel is reclaiming file cache but leaving inode cache
(reclaimable slabs) alone, in a way that the server ends up thrashing and
maxing out on IO to one of its disks instead of doing actual work.
My setup, server has 64G of RAM:
root
+ system { min=0, low=128M, high=8G, max=8G }
+ base { no specific constraints }
+ backup { min=0, low=32M, high=2G, max=2G }
+ shell { no specific constraints }
+ websrv { min=0, low=4G, high=32G, max=32G }
+ website { min=0, low=16G, high=40T, max=40T }
+ website1 { min=0, low=64M, high=2G, max=2G }
+ website2 { min=0, low=64M, high=2G, max=2G }
...
+ remote { min=0, low=1G, high=14G, max=14G }
+ webuser1 { min=0, low=64M, high=2G, max=2G }
+ webuser2 { min=0, low=64M, high=2G, max=2G }
...
When the server was struggling I had mostly IO on the disk hosting
system processes and some cache files of websrv processes.
It seems that running backup does make the issue much more probable.
The processes in websrv are the most impacted by the thrashing, and this
is the cgroup with lots of disk cache and inode cache assigned to it
(note: a helper running in the websrv cgroup scans the whole file system
hierarchy once per hour and this keeps the inode cache pretty full).
Dropping just file cache (about 10G) did not unlock the situation but
dropping reclaimable slabs (inode cache, about 30G) got the system back
running.
Some metrics I have collected during a thrashing period (metrics
collected at about 5min interval) - I don't have full memory.stat
unfortunately:
system/memory.min 0 = 0
system/memory.low 134217728 = 134217728
system/memory.high 8589934592 = 8589934592
system/memory.max 8589934592 = 8589934592
system/memory.pressure
some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
->
some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
system/memory.current 262533120 < 263929856
system/memory.events.local
low 5399469 = 5399469
high 0 = 0
max 112303 = 112303
oom 0 = 0
oom_kill 0 = 0
system/base/memory.min 0 = 0
system/base/memory.low 0 = 0
system/base/memory.high max = max
system/base/memory.max max = max
system/base/memory.pressure
some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
->
some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
system/base/memory.current 31363072 < 32243712
system/base/memory.events.local
low 0 = 0
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
system/backup/memory.min 0 = 0
system/backup/memory.low 33554432 = 33554432
system/backup/memory.high 2147483648 = 2147483648
system/backup/memory.max 2147483648 = 2147483648
system/backup/memory.pressure
some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
->
some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
system/backup/memory.current 222130176 < 222543872
system/backup/memory.events.local
low 5446 = 5446
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
system/shell/memory.min 0 = 0
system/shell/memory.low 0 = 0
system/shell/memory.high max = max
system/shell/memory.max max = max
system/shell/memory.pressure
some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
->
some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
system/shell/memory.current 8814592 < 8888320
system/shell/memory.events.local
low 0 = 0
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
website/memory.min 0 = 0
website/memory.low 17179869184 = 17179869184
website/memory.high 45131717672960 = 45131717672960
website/memory.max 45131717672960 = 45131717672960
website/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
->
some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current 11811520512 > 11456942080
website/memory.events.local
low 11372142 < 11377350
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
remote/memory.min 0
remote/memory.low 1073741824
remote/memory.high 15032385536
remote/memory.max 15032385536
remote/memory.pressure
some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
->
remote/memory.current 84439040 > 81797120
remote/memory.events.local
low 11372142 < 11377350
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
websrv/memory.min 0 = 0
websrv/memory.low 4294967296 = 4294967296
websrv/memory.high 34359738368 = 34359738368
websrv/memory.max 34426847232 = 34426847232
websrv/memory.pressure
some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
->
some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
websrv/memory.current 18421673984 < 18421936128
websrv/memory.events.local
low 0 = 0
high 0 = 0
max 0 = 0
oom 0 = 0
oom_kill 0 = 0
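The pressure samples above can be cross-checked from their `total=` fields, which count cumulative stall time in microseconds. A minimal Python sketch, assuming the two samples were taken roughly 300 s apart as stated in the report (the function names are illustrative and not part of any tool used in this thread):

```python
def parse_psi_line(line):
    """Parse one memory.pressure line into (kind, {field: float})."""
    kind, *pairs = line.split()
    return kind, {k: float(v) for k, v in (p.split("=") for p in pairs)}


def stall_fraction(before, after, interval_s):
    """Share of the interval spent stalled, from two 'total=' readings (microseconds)."""
    _, b = parse_psi_line(before)
    _, a = parse_psi_line(after)
    return (a["total"] - b["total"]) / (interval_s * 1_000_000)


# The two "some" samples reported for the system cgroup.
before = "some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237"
after = "some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740"

# interval_s=300 is an assumption taken from "about 5min interval".
frac = stall_fraction(before, after, interval_s=300)
print(f"system: stalled for about {frac:.0%} of the sampling interval")
```

Because the interval is only approximate, the resulting share is a rough figure; it is mainly useful for comparing cgroups sampled at the same two points in time.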
Is there something important I'm missing in my setup that could prevent
things from starving?
Did memory.low meaning change between 5.7 and 5.9? From behavior it
feels as if inodes are not accounted to cgroup at all and kernel pushes
cgroups down to their memory.low by killing file cache if there is not
enough free memory to hold all promises (and not only when a cgroup
tries to use up to its promised amount of memory).
The system was thrashing as much with 10G of file cache dropped
(completely unused memory) as with it in use.
I will try to create a test-case to reproduce it on a test
machine and be able to verify a fix or eventually bisect to the triggering
patch, though if this all rings a bell, please tell!
Note: until I have a test-case I'm reluctant to just wait [on the
production system] for the next occurrence (usually at impractical times) to
gather some more metrics.
Regards,
Bruno
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
@ 2020-11-25 13:37 ` Michal Hocko
2020-11-25 14:33 ` Bruno Prémont
2020-11-25 18:21 ` Roman Gushchin
1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2020-11-25 13:37 UTC (permalink / raw)
To: Bruno Prémont
Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
Vladimir Davydov
Hi,
thanks for the detailed report.
On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
[...]
> Did memory.low meaning change between 5.7 and 5.9?
The latest change to the low limit protection semantics was
introduced in 5.7 (recursive protection) but it requires explicit
enabling.
> From behavior it
> feels as if inodes are not accounted to cgroup at all and kernel pushes
> cgroups down to their memory.low by killing file cache if there is not
> enough free memory to hold all promises (and not only when a cgroup
> tries to use up to its promised amount of memory).
Your counters indeed show that the low protection has been breached,
most likely because the reclaim couldn't make any progress. Considering
that this is the case for all/most of your cgroups it suggests that the
memory pressure was global rather than limit imposed. In fact even top
level cgroups got reclaimed below the low limit.
This suggests that this is not likely to be memcg specific. It is
more likely that this is a general memory reclaim regression for your
workload. There were larger changes in that area. Be it lru balancing
based on cost model by Johannes or working set tracking for anonymous
pages by Joonsoo. Maybe even more. Both of them can influence page cache
reclaim but you are suggesting that slab accounted memory is not
reclaimed properly. I am not sure there were considerable changes
there. Would it be possible to collect /proc/vmstat as well?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-11-25 13:37 ` Michal Hocko
@ 2020-11-25 14:33 ` Bruno Prémont
0 siblings, 0 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-11-25 14:33 UTC (permalink / raw)
To: Michal Hocko
Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
Vladimir Davydov
Hi Michal,
On Wed, 25 Nov 2020 14:37:40 +0100 Michal Hocko <mhocko@suse.com> wrote:
> Hi,
> thanks for the detailed report.
>
> On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
> [...]
> > Did memory.low meaning change between 5.7 and 5.9?
>
> The latest change to the low limit protection semantics was
> introduced in 5.7 (recursive protection) but it requires explicit
> enabling.
No specific mount options set for v2 cgroup, so not active.
> > From behavior it
> > feels as if inodes are not accounted to cgroup at all and kernel pushes
> > cgroups down to their memory.low by killing file cache if there is not
> > enough free memory to hold all promises (and not only when a cgroup
> > tries to use up to its promised amount of memory).
>
> Your counters indeed show that the low protection has been breached,
> most likely because the reclaim couldn't make any progress. Considering
> that this is the case for all/most of your cgroups it suggests that the
> memory pressure was global rather than limit imposed. In fact even top
> level cgroups got reclaimed below the low limit.
Note that the "original" counters were partially triggered by a first
event where I had one cgroup (websrv) with a rather high
memory.low (16G or even 32G), which caused counters everywhere to
increase.
So before the last thrashing, during which the values were collected, the
event counters and `current` looked as follows:
system/memory.pressure
some avg10=0.04 avg60=0.28 avg300=0.12 total=5844917510
full avg10=0.04 avg60=0.26 avg300=0.11 total=2439353404
system/memory.current
96432128
system/memory.events.local
low 5399469 (unchanged)
high 0
max 112303 (unchanged)
oom 0
oom_kill 0
system/base/memory.pressure
some avg10=0.04 avg60=0.28 avg300=0.12 total=4589562039
full avg10=0.04 avg60=0.28 avg300=0.12 total=1926984197
system/base/memory.current
59305984
system/base/memory.events.local
low 0 (unchanged)
high 0
max 0 (unchanged)
oom 0
oom_kill 0
system/backup/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=2123293649
full avg10=0.00 avg60=0.00 avg300=0.00 total=815450446
system/backup/memory.current
32444416
system/backup/memory.events.local
low 5446 (unchanged)
high 0
max 0
oom 0
oom_kill 0
system/shell/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=1345965660
full avg10=0.00 avg60=0.00 avg300=0.00 total=492812915
system/shell/memory.current
4571136
system/shell/memory.events.local
low 0
high 0
max 0
oom 0
oom_kill 0
website/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=415008878
full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current
12104380416
website/memory.events.local
low 11264569 (during thrashing: 11372142 then 11377350)
high 0
max 0
oom 0
oom_kill 0
remote/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=2005130126
full avg10=0.00 avg60=0.00 avg300=0.00 total=735366752
remote/memory.current
116330496
remote/memory.events.local
low 11264569 (during thrashing: 11372142 then 11377350)
high 0
max 0
oom 0
oom_kill 0
websrv/memory.pressure
some avg10=0.02 avg60=0.11 avg300=0.03 total=6650355162
full avg10=0.02 avg60=0.11 avg300=0.03 total=2034584579
websrv/memory.current
18483359744
websrv/memory.events.local
low 0
high 0
max 0
oom 0
oom_kill 0
> This suggests that this is not likely to be memcg specific. It is
> more likely that this is a general memory reclaim regression for your
> workload. There were larger changes in that area. Be it lru balancing
> based on cost model by Johannes or working set tracking for anonymous
> pages by Joonsoo. Maybe even more. Both of them can influence page cache
> reclaim but you are suggesting that slab accounted memory is not
> reclaimed properly.
That is my impression, yes. No idea though if memcg can influence the
way reclaim tries to perform its work or if slab_reclaimable not
associated to any (child) cg would somehow be excluded from reclaim.
> I am not sure there were considerable changes
> there. Would it be possible to collect /proc/vmstat as well?
I will have a look at gathering memory.stat and /proc/vmstat at next
opportunity.
Will first try with a test system with not too much memory and lots of
files to reproduce about 50% of memory usage by slab_reclaimable and
see how far I get.
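Gathering memory.stat and /proc/vmstat is most useful as a pair of snapshots that can be diffed. A minimal Python sketch, assuming both files use the usual "name value" line format; the helper names are hypothetical and the sample values are made up purely to show the output shape:

```python
def parse_counters(text):
    """Parse 'name value' lines (as in /proc/vmstat or memory.stat) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def diff_counters(old, new):
    """Return only the counters that changed, mapped to their deltas."""
    return {k: new[k] - old[k] for k in new if k in old and new[k] != old[k]}


def read_counters(path="/proc/vmstat"):
    """Read and parse a counter file from the running system."""
    with open(path) as f:
        return parse_counters(f.read())


# Made-up values, NOT measurements from the affected server.
old = parse_counters("pgscan_kswapd 100\npgsteal_kswapd 90\nslabs_scanned 500\n")
new = parse_counters("pgscan_kswapd 250\npgsteal_kswapd 95\nslabs_scanned 500\n")
print(diff_counters(old, new))
```

On a live system, calling `read_counters()` twice a few minutes apart (or with a cgroup's memory.stat path) and diffing the results would show which reclaim counters move during an incident.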
Thanks,
Bruno
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
2020-11-25 13:37 ` Michal Hocko
@ 2020-11-25 18:21 ` Roman Gushchin
2020-12-03 11:09 ` Bruno Prémont
1 sibling, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-11-25 18:21 UTC (permalink / raw)
To: Bruno Prémont
Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
linux-mm, Vladimir Davydov
On Wed, Nov 25, 2020 at 12:39:56PM +0100, Bruno Prémont wrote:
> Hello,
>
> On a production system I've encountered rather harsh behavior from the
> kernel in the context of memory cgroups (v2) after updating the kernel from
> the 5.7 series to the 5.9 series.
>
>
> It seems like the kernel is reclaiming file cache but leaving inode cache
> (reclaimable slabs) alone, in a way that the server ends up thrashing and
> maxing out on IO to one of its disks instead of doing actual work.
>
>
> My setup, server has 64G of RAM:
> root
> + system { min=0, low=128M, high=8G, max=8G }
> + base { no specific constraints }
> + backup { min=0, low=32M, high=2G, max=2G }
> + shell { no specific constraints }
> + websrv { min=0, low=4G, high=32G, max=32G }
> + website { min=0, low=16G, high=40T, max=40T }
> + website1 { min=0, low=64M, high=2G, max=2G }
> + website2 { min=0, low=64M, high=2G, max=2G }
> ...
> + remote { min=0, low=1G, high=14G, max=14G }
> + webuser1 { min=0, low=64M, high=2G, max=2G }
> + webuser2 { min=0, low=64M, high=2G, max=2G }
> ...
>
>
> When the server was struggling I had mostly IO on the disk hosting
> system processes and some cache files of websrv processes.
> It seems that running backup does make the issue much more probable.
>
> The processes in websrv are the most impacted by the thrashing, and this
> is the cgroup with lots of disk cache and inode cache assigned to it
> (note: a helper running in the websrv cgroup scans the whole file system
> hierarchy once per hour and this keeps the inode cache pretty full).
> Dropping just file cache (about 10G) did not unlock the situation but
> dropping reclaimable slabs (inode cache, about 30G) got the system back
> running.
>
>
>
> Some metrics I have collected during a thrashing period (metrics
> collected at about 5min interval) - I don't have full memory.stat
> unfortunately:
>
> system/memory.min 0 = 0
> system/memory.low 134217728 = 134217728
> system/memory.high 8589934592 = 8589934592
> system/memory.max 8589934592 = 8589934592
> system/memory.pressure
> some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
> full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
> ->
> some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
> full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
> system/memory.current 262533120 < 263929856
> system/memory.events.local
> low 5399469 = 5399469
> high 0 = 0
> max 112303 = 112303
> oom 0 = 0
> oom_kill 0 = 0
>
> system/base/memory.min 0 = 0
> system/base/memory.low 0 = 0
> system/base/memory.high max = max
> system/base/memory.max max = max
> system/base/memory.pressure
> some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
> full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
> ->
> some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
> full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
> system/base/memory.current 31363072 < 32243712
> system/base/memory.events.local
> low 0 = 0
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
> system/backup/memory.min 0 = 0
> system/backup/memory.low 33554432 = 33554432
> system/backup/memory.high 2147483648 = 2147483648
> system/backup/memory.max 2147483648 = 2147483648
> system/backup/memory.pressure
> some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
> full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
> ->
> some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
> full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
> system/backup/memory.current 222130176 < 222543872
> system/backup/memory.events.local
> low 5446 = 5446
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
> system/shell/memory.min 0 = 0
> system/shell/memory.low 0 = 0
> system/shell/memory.high max = max
> system/shell/memory.max max = max
> system/shell/memory.pressure
> some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
> full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
> ->
> some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
> full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
> system/shell/memory.current 8814592 < 8888320
> system/shell/memory.events.local
> low 0 = 0
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
> website/memory.min 0 = 0
> website/memory.low 17179869184 = 17179869184
> website/memory.high 45131717672960 = 45131717672960
> website/memory.max 45131717672960 = 45131717672960
> website/memory.pressure
> some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
> full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
> ->
> some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
> full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
> website/memory.current 11811520512 > 11456942080
> website/memory.events.local
> low 11372142 < 11377350
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
> remote/memory.min 0
> remote/memory.low 1073741824
> remote/memory.high 15032385536
> remote/memory.max 15032385536
> remote/memory.pressure
> some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
> full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
> ->
> remote/memory.current 84439040 > 81797120
> remote/memory.events.local
> low 11372142 < 11377350
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
> websrv/memory.min 0 = 0
> websrv/memory.low 4294967296 = 4294967296
> websrv/memory.high 34359738368 = 34359738368
> websrv/memory.max 34426847232 = 34426847232
> websrv/memory.pressure
> some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
> full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
> ->
> some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
> full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
> websrv/memory.current 18421673984 < 18421936128
> websrv/memory.events.local
> low 0 = 0
> high 0 = 0
> max 0 = 0
> oom 0 = 0
> oom_kill 0 = 0
>
>
>
> Is there something important I'm missing in my setup that could prevent
> things from starving?
>
> Did memory.low meaning change between 5.7 and 5.9? From behavior it
> feels as if inodes are not accounted to cgroup at all and kernel pushes
> cgroups down to their memory.low by killing file cache if there is not
> enough free memory to hold all promises (and not only when a cgroup
> tries to use up to its promised amount of memory).
> The system was thrashing as much with 10G of file cache dropped
> (completely unused memory) as with it in use.
>
>
> I will try to create a test-case to reproduce it on a test
> machine and be able to verify a fix or eventually bisect to the triggering
> patch, though if this all rings a bell, please tell!
>
> Note: until I have a test-case I'm reluctant to just wait [on the
> production system] for the next occurrence (usually at impractical times) to
> gather some more metrics.
Hi Bruno!
Thank you for the report.
Can you, please, check if the following patch fixes the issue?
Thanks!
--
diff --git a/mm/slab.h b/mm/slab.h
index 6cc323f1313a..ef02b841bcd8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
obj_cgroup_put(objcg);
- return NULL;
+ return (struct obj_cgroup *)-1UL;
}
return objcg;
@@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
return NULL;
if (memcg_kmem_enabled() &&
- ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
+ ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
+ if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
+ return NULL;
+ }
+
return s;
}
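Reading only the diff above: before the patch, a failed charge made memcg_slab_pre_alloc_hook() return NULL and the caller still returned `s`, so the allocation went ahead; with the patch, a distinct sentinel is returned and the caller fails the allocation. A small Python model of that control flow, as a sketch only (the names and the `patched` switch are hypothetical, and it assumes the accounting branch is taken):

```python
# Hypothetical stand-in for the (struct obj_cgroup *)-1UL sentinel in the diff.
CHARGE_FAILED = object()


def memcg_pre_alloc(charge_ok, patched):
    """Model of memcg_slab_pre_alloc_hook(): what it hands back to the caller."""
    if not charge_ok:
        # Unpatched code returned NULL here; the patch returns the sentinel.
        return CHARGE_FAILED if patched else None
    return "objcg"


def slab_pre_alloc(charge_ok, patched):
    """Model of slab_pre_alloc_hook(): True if the allocation may proceed."""
    objcg = memcg_pre_alloc(charge_ok, patched)
    if patched and objcg is CHARGE_FAILED:
        return False  # patched caller returns NULL: the allocation fails
    return True       # caller returns s: the allocation proceeds


for patched in (False, True):
    print(f"patched={patched}: failed charge still allocates ->",
          slab_pre_alloc(charge_ok=False, patched=patched))
```

The sentinel is needed because NULL alone cannot tell "charge failed" apart from "proceed without an objcg".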
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-11-25 18:21 ` Roman Gushchin
@ 2020-12-03 11:09 ` Bruno Prémont
2020-12-03 20:55 ` Roman Gushchin
0 siblings, 1 reply; 8+ messages in thread
From: Bruno Prémont @ 2020-12-03 11:09 UTC (permalink / raw)
To: Roman Gushchin
Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
linux-mm, Vladimir Davydov
Hello Roman,
Sorry for having taken so much time to reply; I only had the
opportunity to deploy the patch on Tuesday morning for testing, and
now, two days later, the thrashing occurred again.
> diff --git a/mm/slab.h b/mm/slab.h
> index 6cc323f1313a..ef02b841bcd8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
>
> if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
> obj_cgroup_put(objcg);
> - return NULL;
> + return (struct obj_cgroup *)-1UL;
> }
>
> return objcg;
> @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> return NULL;
>
> if (memcg_kmem_enabled() &&
> - ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> + ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
> *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
>
> + if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> + return NULL;
> + }
> +
> return s;
> }
Seems your proposed patch didn't really help.
Compared to initial occurrence I do now have some more details (all but
/proc/slabinfo since boot) and according to /proc/slabinfo a good deal
of reclaimable slabs seem to be dentries (and probably
xfs_inode/xfs_ifork related to them) - not sure if those are assigned
to cgroups or not-accounted and not seen as candidate for reclaim...
xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0
xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0
xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0
xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0
xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0
xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0
xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0
xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0
xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0
xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0
xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0
xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0
xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0
xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0
fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0
filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0
inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0
dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0
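To put sizes on the rows above, the memory held by a cache can be estimated as num_slabs × pagesperslab × page size. A minimal Python sketch, assuming the slabinfo 2.1 column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then tunables and slabdata) and a 4 KiB page size; the function name is illustrative:

```python
def slab_bytes(line, page_size=4096):
    """Return (cache name, bytes held by its slabs) for one /proc/slabinfo row."""
    fields = line.split()
    name = fields[0]
    pages_per_slab = int(fields[5])
    # After the "slabdata" keyword: active_slabs, num_slabs, sharedavail.
    num_slabs = int(fields[fields.index("slabdata") + 2])
    return name, num_slabs * pages_per_slab * page_size


rows = [
    "xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0",
    "xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0",
    "dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0",
]

total = 0
for row in rows:
    name, size = slab_bytes(row)
    total += size
    print(f"{name:10s} {size / 2**30:6.2f} GiB")
print(f"{'total':10s} {total / 2**30:6.2f} GiB")
```

The page size is an assumption about the affected machine; on a system with a different page size the `page_size` argument would need adjusting.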
The full collected details are available at
https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
(please take a copy as that file will not stay there forever)
A visual graph of memory evolution is available at
https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
with reboot on Tuesday morning and steady increase of slabs starting
Wednesday evening correlating with start of backup until thrashing
started at about 3:30 and the large drop in memory being me doing
echo 2 > /proc/sys/vm/drop_caches
which stopped the thrashing as well.
Against what does memcg attempt reclaim when it tries to satisfy a CG's
low limit? Only against siblings or also against root or not-accounted?
How does it take into account slabs where evictable entries will cause
unevictable entries to be freed as well?
> > My setup, server has 64G of RAM:
> > root
> > + system { min=0, low=128M, high=8G, max=8G }
> > + base { no specific constraints }
> > + backup { min=0, low=32M, high=2G, max=2G }
> > + shell { no specific constraints }
> > + websrv { min=0, low=4G, high=32G, max=32G }
> > + website { min=0, low=16G, high=40T, max=40T }
> > + website1 { min=0, low=64M, high=2G, max=2G }
> > + website2 { min=0, low=64M, high=2G, max=2G }
> > ...
> > + remote { min=0, low=1G, high=14G, max=14G }
> > + webuser1 { min=0, low=64M, high=2G, max=2G }
> > + webuser2 { min=0, low=64M, high=2G, max=2G }
> > ...
Also interesting is that backup, which is forced into 2G
(system/backup CG), causes the amount of slabs assigned to websrv CG to
increase until that CG has almost only slab entries assigned to it to
fill 16G, as if file cache were being reclaimed but not slab entries, even if
there is almost no file cache left and tons of slabs.
What also surprises me is that so much memory remains completely unused
(instead of being used for file caches).
According to the documentation, if I didn't get it wrong, any limits of
child CGs (e.g. webuser1...) are applied up to what their parent's
limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
have 1000 webuserN they won't "reserve" 65G for themselves via
memory.low limit when their parent sets memory.low to 1G?
Or does this depend on CG mount options (memory_recursiveprot)?
Regards,
Bruno
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-12-03 11:09 ` Bruno Prémont
@ 2020-12-03 20:55 ` Roman Gushchin
2020-12-06 11:30 ` Bruno Prémont
0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-12-03 20:55 UTC (permalink / raw)
To: Bruno Prémont
Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
linux-mm, Vladimir Davydov
On Thu, Dec 03, 2020 at 12:09:36PM +0100, Bruno Prémont wrote:
> Hello Roman,
>
> Sorry for having taken so much time to reply; I only had the
> opportunity to deploy the patch on Tuesday morning for testing, and
> now, two days later, the thrashing occurred again.
>
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 6cc323f1313a..ef02b841bcd8 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> >
> > if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
> > obj_cgroup_put(objcg);
> > - return NULL;
> > + return (struct obj_cgroup *)-1UL;
> > }
> >
> > return objcg;
> > @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> > return NULL;
> >
> > if (memcg_kmem_enabled() &&
> > - ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> > + ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
> > *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
> >
> > + if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> > + return NULL;
> > + }
> > +
> > return s;
> > }
>
> Seems your proposed patch didn't really help.
Anyway, thank you for testing! Actually your report helped me to reveal and
fix this problem, so thank you!
In the meantime Yang Shi discovered a problem related to slab shrinkers,
which is to some extent similar to what you describe: under certain conditions
large amounts of slab memory can be completely excluded from the reclaim process.
Can you, please, check if his fix will solve your problem?
Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .
>
>
>
> Compared to initial occurrence I do now have some more details (all but
> /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> of reclaimable slabs seem to be dentries (and probably
> xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> to cgroups or not-accounted and not seen as candidate for reclaim...
>
> xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0
> xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0
> xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
> xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0
> xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0
> xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0
> xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0
> xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0
> xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0
> xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0
> xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0
> xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0
> xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0
> xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0
> xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0
> fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0
> fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0
> filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0
> inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0
> dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0
>
>
>
> The full collected details are available at
> https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
> (please take a copy as that file will not stay there forever)
>
> A visual graph of memory evolution is available at
> https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
> with reboot on Tuesday morning and steady increase of slabs starting
> Wednesday evening correlating with start of backup until thrashing
> started at about 3:30 and the large drop in memory being me doing
> echo 2 > /proc/sys/vm/drop_caches
> which stopped the thrashing as well.
>
>
> Against what does memcg attempt reclaim when it tries to satisfy a CG's
> low limit? Only against siblings or also against root or not-accounted?
> How does it take into account slabs where evictable entries will cause
> unevictable entries to be freed as well?
Low limits work by excluding some portion of memory from reclaim,
not by adding memory pressure to something else.
>
> > > My setup, server has 64G of RAM:
> > > root
> > > + system { min=0, low=128M, high=8G, max=8G }
> > > + base { no specific constraints }
> > > + backup { min=0, low=32M, high=2G, max=2G }
> > > + shell { no specific constraints }
> > > + websrv { min=0, low=4G, high=32G, max=32G }
> > > + website { min=0, low=16G, high=40T, max=40T }
> > > + website1 { min=0, low=64M, high=2G, max=2G }
> > > + website2 { min=0, low=64M, high=2G, max=2G }
> > > ...
> > > + remote { min=0, low=1G, high=14G, max=14G }
> > > + webuser1 { min=0, low=64M, high=2G, max=2G }
> > > + webuser2 { min=0, low=64M, high=2G, max=2G }
> > > ...
>
> Also interesting is that backup which is forced into 2G
> (system/backup CG) causes amount of slabs assigned to websrv CG to
> increase until that CG has almost only slab entries assigned to it to
> fill 16G, like file cache being reclaimed but not slab entries even if
> there is almost no file cache left and tons of slabs.
> What also surprises me is that so much memory remains completely unused
> (instead of being used for file caches).
>
> According to the documentation if I didn't get it wrong any limits of
> child CGs (e.g. webuser1...) are applied up to what their parent's
> limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> have 1000 webuserN they won't "reserve" 65G for themselves via
> memory.low limit when their parent sets memory.low to 1G?
> Or does this depend on CG mount options (memory_recursiveprot)?
It does. What you're describing is the old (!memory_recursiveprot) behavior.
Thanks!
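
As a rough illustration of the 1000 x webuserN question above, here is a minimal sketch. It assumes the simplified rule from the cgroup v2 documentation: when children ask for more memory.low than the parent allows, each child gets a share of the parent's protection proportional to its usage below its own memory.low. The function name and the 100M usage figure are hypothetical, and the kernel's real calculation handles more cases (for instance memory_recursiveprot handing out unclaimed protection).

```python
MIB = 1 << 20
GIB = 1 << 30


def effective_low(parent_low, children):
    """Simplified model of memory.low distribution among siblings.

    children maps a cgroup name to (memory.low, current usage), in bytes.
    Returns a dict of name -> effective protection in bytes.
    """
    # Only usage below the child's own memory.low counts as protected.
    protected = {name: min(low, usage)
                 for name, (low, usage) in children.items()}
    total = sum(protected.values())
    if total <= parent_low:
        # No overcommit: every child keeps what it asked for.
        return protected
    # Overcommit: scale each child down proportionally.
    return {name: parent_low * p // total for name, p in protected.items()}


if __name__ == "__main__":
    # Hypothetical: 1000 children with low=64M, each using 100M,
    # under a parent with memory.low=1G (as for "remote" above).
    kids = {f"webuser{i}": (64 * MIB, 100 * MIB) for i in range(1, 1001)}
    eff = effective_low(1 * GIB, kids)
    print("per-child effective low (bytes):", eff["webuser1"])
    print("sum of effective lows (bytes):  ", sum(eff.values()))
```

In this model the children together never hold more protection than the parent's 1G, however many of them exist.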
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-12-03 20:55 ` Roman Gushchin
@ 2020-12-06 11:30 ` Bruno Prémont
2020-12-10 11:08 ` Bruno Prémont
0 siblings, 1 reply; 8+ messages in thread
From: Bruno Prémont @ 2020-12-06 11:30 UTC (permalink / raw)
To: Roman Gushchin
Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
linux-mm, Vladimir Davydov
On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote:
> In the meantime Yang Shi discovered a problem related to slab shrinkers,
> which is to some extent similar to what you describe: under certain conditions
> large amounts of slab memory can be completely excluded from the reclaim process.
>
> Can you, please, check if his fix will solve your problem?
> Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .
I've added that patch on top of yours, but it does not seem to help
completely either.
With this patch it seems that such dentries might get reclaimed as a
last resort instead of not at all.

I've added logs since the current boot:
https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt
https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt
https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt
showing the memory evolution. Things started to degrade over the past
night: memory usage started to increase from 40G towards full use, but
with only slabs growing (not file cache) and the memory assigned to
cgroups staying more or less constant. Even the root cgroup's memory
stats do not seem to list a great deal of used memory.

The only cgroup not "sufficiently" protected by memory.low (websrv) has
seen its memory use somehow clamped to about 16G, while it should be
allowed to go up to 32G according to memory.high. Of those 16G in use
at the time of writing, only 100M was file cache; all the rest was
slabs.

As the system is now using most of its memory, I've bumped the websrv
CG's memory.low to 20G so that it stays protected some more (after a
few minutes its file cache started growing again), with the aim of
moving pressure out of leaf-cgroups to non-cg-assigned-memory.

This move seems to be bringing me some success.

I will report back later today or tomorrow with more details on how
things evolve with "no unused" memory. At least the production service
tends not to suffer (more than from storage response time).

I have the impression that memory reclaim now only looks at the cgroup,
and if it can make some progress there it will not bother looking
anywhere else.

I also have the vague impression that the distribution of my tasks
across the two NUMA nodes somehow affects when or how memory reclaim
happens.

NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

         CPUs   MEMs
system:  0-3    0
websrv:  8-11   0   (allowed mems=0-1 as of 2020-12-06 12:18
                     after increasing memory.low from 4G
                     to 20G which did release quite some
                     pressure and allow)
website: 12-23  1
remote:  4-7    0
(assignment done using the cpuset cgroup)

(It seems the NUMA distribution changed, or I missed the non-linear
distribution of CPUs across nodes, cores versus hyperthreading, as my
intent was to have website on one socket and the rest on the other
socket. Memory is mapped as I planned it, but tasks are not really.)

So calculating:
system:  128M..8G  mems=0 \
remote:  1G..14G   mems=0  }--> 5G..54G
websrv:  4G..32G   mems=0 /
website: 16G..     mems=1
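
The node 0 totals above can be re-derived with a few lines. This is only a sketch of the arithmetic; the helper names parse_size and sum_limits are made up for illustration, and the limits are the ones listed in the table.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as '128M' or '8G' to bytes."""
    return int(text[:-1]) * UNITS[text[-1]]


# (memory.low, memory.high) of the cgroups whose cpuset mems is node 0.
NODE0_CGROUPS = {
    "system": ("128M", "8G"),
    "remote": ("1G", "14G"),
    "websrv": ("4G", "32G"),
}


def sum_limits(cgroups):
    """Return (sum of memory.low, sum of memory.high) in bytes."""
    low = sum(parse_size(lo) for lo, _ in cgroups.values())
    high = sum(parse_size(hi) for _, hi in cgroups.values())
    return low, high


if __name__ == "__main__":
    low, high = sum_limits(NODE0_CGROUPS)
    print(f"sum of memory.low : {low / UNITS['G']:.3f}G")
    print(f"sum of memory.high: {high / UNITS['G']:.0f}G")
```

The exact sum of the low limits is 5G + 128M, rounded to 5G above. If the 64G of RAM are split evenly between the two nodes (an assumption, the thread does not say so), the summed high limits of 54G are well above what node 0 alone can hold.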

But with everything except websrv lying below its low limit, I wonder
why reclaim only hits the cgroup's file cache but still mostly ignores
its slabs.

Bruno
> > Compared to initial occurrence I do now have some more details (all but
> > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> > of reclaimable slabs seem to be dentries (and probably
> > xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> > to cgroups or not-accounted and not seen as candidate for reclaim...
> >
> > xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0
> > xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0
> > xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0
> > xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0
> > xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0
> > xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0
> > xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0
> > xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0
> > xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0
> > xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0
> > xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0
> > xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0
> > xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0
> > fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0
> > fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0
> > filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0
> > inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0
> > dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0
> >
> >
> >
> > The full collected details are available at
> > https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
> > (please take a copy as that file will not stay there forever)
> >
> > A visual graph of memory evolution is available at
> > https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
> > with reboot on Tuesday morning and steady increase of slabs starting
> > Wednesday evening correlating with start of backup until thrashing
> > started at about 3:30 and the large drop in memory being me doing
> > echo 2 > /proc/sys/vm/drop_caches
> > which stopped the thrashing as well.
> >
> >
> > Against what does memcg attempt reclaim when it tries to satisfy a CG's
> > low limit? Only against siblings or also against root or not-accounted?
> > How does it take into account slabs where evictable entries will cause
> > unevictable entries to be freed as well?
>
> Low limits work by excluding some portion of memory from reclaim,
> not by adding memory pressure to something else.
>
> >
> > > > My setup, server has 64G of RAM:
> > > > root
> > > > + system { min=0, low=128M, high=8G, max=8G }
> > > > + base { no specific constraints }
> > > > + backup { min=0, low=32M, high=2G, max=2G }
> > > > + shell { no specific constraints }
> > > > + websrv { min=0, low=4G, high=32G, max=32G }
> > > > + website { min=0, low=16G, high=40T, max=40T }
> > > > + website1 { min=0, low=64M, high=2G, max=2G }
> > > > + website2 { min=0, low=64M, high=2G, max=2G }
> > > > ...
> > > > + remote { min=0, low=1G, high=14G, max=14G }
> > > > + webuser1 { min=0, low=64M, high=2G, max=2G }
> > > > + webuser2 { min=0, low=64M, high=2G, max=2G }
> > > > ...
> >
> > Also interesting is that backup which is forced into 2G
> > (system/backup CG) causes amount of slabs assigned to websrv CG to
> > increase until that CG has almost only slab entries assigned to it to
> > fill 16G, like file cache being reclaimed but not slab entries even if
> > there is almost no file cache left and tons of slabs.
> > What also surprises me is that so much memory remains completely unused
> > (instead of being used for file caches).
> >
> > According to the documentation if I didn't get it wrong any limits of
> > child CGs (e.g. webuser1...) are applied up to what their parent's
> > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> > have 1000 webuserN they won't "reserve" 65G for themselves via
> > memory.low limit when their parent sets memory.low to 1G?
> > Or does this depend on CG mount options (memory_recursiveprot)?
>
> It does. What you're describing is the old (!memory_recursiveprot) behavior.
>
> Thanks!
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
2020-12-06 11:30 ` Bruno Prémont
@ 2020-12-10 11:08 ` Bruno Prémont
0 siblings, 0 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-12-10 11:08 UTC (permalink / raw)
To: Roman Gushchin
Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
linux-mm, Vladimir Davydov
Hello All,

Since my last change (allowing the websrv CG to use memory from both
NUMA nodes) the system as a whole runs reasonably well, without
thrashing, and also makes much better use of memory.

As such it really seems that NUMA node restrictions (memory-wise at
least) do not interact properly with reclaim.
Is there a way to see how memory usage is distributed across the
NUMA nodes?

I have the impression that some reclaim was going on because memory
allocations were requested on one node which may have been "full", and
that only file cache was being reclaimed there (in cgroups where
memory.low didn't protect it).
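
One possible way to look at per-node usage is sketched below. It assumes that sysfs exposes per-node counters in /sys/devices/system/node/nodeN/meminfo with lines of the form "Node 0 MemFree: 123456 kB". The function name and the sample values are hypothetical and are not data from this system. Depending on the kernel version, a per-cgroup memory.numa_stat file may also be available for cgroup v2.

```python
import glob


def parse_node_meminfo(text):
    """Parse the contents of one nodeN/meminfo file into {field: bytes}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0 MemFree:   123456 kB"
        if len(parts) < 4 or parts[0] != "Node":
            continue
        value = int(parts[3])
        if len(parts) > 4 and parts[4] == "kB":
            value *= 1024
        stats[parts[2].rstrip(":")] = value
    return stats


# Hypothetical sample for illustration only.
SAMPLE = """\
Node 0 MemTotal:       32000000 kB
Node 0 MemFree:          500000 kB
Node 0 FilePages:       1000000 kB
Node 0 SReclaimable:   20000000 kB
"""

FIELDS = ("MemTotal", "MemFree", "FilePages", "SReclaimable")

if __name__ == "__main__":
    pattern = "/sys/devices/system/node/node[0-9]*/meminfo"
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            stats = parse_node_meminfo(f.read())
        node = path.split("/")[-2]
        # Print the selected counters in MiB for each node.
        print(node, {k: stats.get(k, 0) >> 20 for k in FIELDS}, "MiB")
```

Comparing MemFree, FilePages and SReclaimable between nodes should show whether one node is full of reclaimable slab while the other still has free memory.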
Thanks,
Bruno
On Sun, 6 Dec 2020 12:30:21 +0100 Bruno Prémont wrote:
> On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote:
> > In the meantime Yang Shi discovered a problem related to slab shrinkers,
> > which is to some extent similar to what you describe: under certain conditions
> > large amounts of slab memory can be completely excluded from the reclaim process.
> >
> > Can you, please, check if his fix will solve your problem?
> > Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .
>
> I've added that patch on top of yours but it seems not to completely
> help either.
> With this patch it seems that such dentries might get reclaimed as a
> last resort instead of not at all.
>
> I've added logs since current boot:
> https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt
> https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt
> https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt
> with the memory evolution. Evolution started to degrade over past night
> where memory usage started to increase from 40G tending to full use but
> with only slabs growing (not file cache) and memory assigned to cgroups
> staying more or less constant - even root cgroup's memory stats seem not
> to list a great deal of used memory.
>
> The only cgroup not "sufficiently" protected by memory.low (websrv) has
> seen its memory use somehow clamped to about 16G while it should be
> allowed to go up to 32G according to memory.high and of those 16G in
> use at time of writing it only had 100M of file cache left, all the
> rest being slabs.
>
>
> As system now is using most of its memory I've bumped websrv CG's
> memory.low to 20G so it should stay protected some more (which after a
> few minutes showed its filecache growing again) with the aim of moving
> pressure out of leaf-cgroups to non-cg-assigned-memory.
>
> Somehow this move seems to be bringing me some success.
>
>
> I will report back later today or tomorrow with more details on the
> evolution with "no unused" memory. At least production service tends not
> to suffer (more than from storage response time).
>
>
>
> I have the impression that memory reclaim now only looks at cgroup and
> if it can make some progress it will not bother looking anywhere else.
>
> I also have the vague impression that distribution of my tasks on the
> two NUMA nodes somehow impacts when or how memory reclaim happens.
>
> NUMA node0 CPU(s): 0-5,12-17
> NUMA node1 CPU(s): 6-11,18-23
>
> CPUs MEMs
> system: 0-3 0
> websrv: 8-11 0 (allowed mems=0-1 as of 2020-12-06 12:18
> after increasing memory.low from 4G
> to 20G which did release quite some
> pressure and allow)
> website: 12-23 1
> remote: 4-7 0
> (assignment done using cpuset cgroup)
>
> (seems NUMA distribution changed or I missed the non-linear node
> distribution of CPUs - cores versus hyperthreading as my intent was to
> have website on 1 socket and the rest on the other socket. Memory is
> mapped as I planned it, but tasks not really)
>
>
> So calculating:
> system: 128M..8G mems=0 \
> remote: 1G..14G mems=0 }--> 5G..54G
> websrv: 4G..32G mems=0 /
> website: 16G.. mems=1
>
> But with everything except websrv lying below its low limit I wonder why
> reclaim only hits the cgroup's file cache but still mostly ignores its
> slabs.
>
>
> Bruno
>
> > > Compared to initial occurrence I do now have some more details (all but
> > > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> > > of reclaimable slabs seem to be dentries (and probably
> > > xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> > > to cgroups or not-accounted and not seen as candidate for reclaim...
> > >
> > > xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0
> > > xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0
> > > xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0
> > > xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0
> > > xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0
> > > xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0
> > > xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0
> > > xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0
> > > xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0
> > > xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0
> > > xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0
> > > xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0
> > > xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0
> > > xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0
> > > fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0
> > > fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > > mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0
> > > filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0
> > > inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0
> > > dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0
> > >
> > >
> > >
> > > The full collected details are available at
> > > https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
> > > (please take a copy as that file will not stay there forever)
> > >
> > > A visual graph of memory evolution is available at
> > > https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
> > > with reboot on Tuesday morning and steady increase of slabs starting
> > > Wednesday evening correlating with start of backup until thrashing
> > > started at about 3:30 and the large drop in memory being me doing
> > > echo 2 > /proc/sys/vm/drop_caches
> > > which stopped the thrashing as well.
> > >
> > >
> > > Against what does memcg attempt reclaim when it tries to satisfy a CG's
> > > low limit? Only against siblings or also against root or not-accounted?
> > > How does it take into account slabs where evictable entries will cause
> > > unevictable entries to be freed as well?
> >
> > Low limits work by excluding some portion of memory from reclaim,
> > not by adding memory pressure to something else.
> >
> > >
> > > > > My setup, server has 64G of RAM:
> > > > > root
> > > > > + system { min=0, low=128M, high=8G, max=8G }
> > > > > + base { no specific constraints }
> > > > > + backup { min=0, low=32M, high=2G, max=2G }
> > > > > + shell { no specific constraints }
> > > > > + websrv { min=0, low=4G, high=32G, max=32G }
> > > > > + website { min=0, low=16G, high=40T, max=40T }
> > > > > + website1 { min=0, low=64M, high=2G, max=2G }
> > > > > + website2 { min=0, low=64M, high=2G, max=2G }
> > > > > ...
> > > > > + remote { min=0, low=1G, high=14G, max=14G }
> > > > > + webuser1 { min=0, low=64M, high=2G, max=2G }
> > > > > + webuser2 { min=0, low=64M, high=2G, max=2G }
> > > > > ...
> > >
> > > Also interesting is that backup which is forced into 2G
> > > (system/backup CG) causes amount of slabs assigned to websrv CG to
> > > increase until that CG has almost only slab entries assigned to it to
> > > fill 16G, like file cache being reclaimed but not slab entries even if
> > > there is almost no file cache left and tons of slabs.
> > > What also surprises me is that so much memory remains completely unused
> > > (instead of being used for file caches).
> > >
> > > According to the documentation if I didn't get it wrong any limits of
> > > child CGs (e.g. webuser1...) are applied up to what their parent's
> > > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> > > have 1000 webuserN they won't "reserve" 65G for themselves via
> > > memory.low limit when their parent sets memory.low to 1G?
> > > Or does this depend on CG mount options (memory_recursiveprot)?
> >
> > It does. What you're describing is the old (!memory_recursiveprot) behavior.
> >
> > Thanks!
>
Thread overview: 8+ messages
2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
2020-11-25 13:37 ` Michal Hocko
2020-11-25 14:33 ` Bruno Prémont
2020-11-25 18:21 ` Roman Gushchin
2020-12-03 11:09 ` Bruno Prémont
2020-12-03 20:55 ` Roman Gushchin
2020-12-06 11:30 ` Bruno Prémont
2020-12-10 11:08 ` Bruno Prémont