* Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
@ 2020-11-25 11:39 Bruno Prémont
  2020-11-25 13:37 ` Michal Hocko
  2020-11-25 18:21 ` Roman Gushchin
  0 siblings, 2 replies; 8+ messages in thread

From: Bruno Prémont @ 2020-11-25 11:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Chris Down, Michal Hocko, Johannes Weiner, Chris Down, cgroups,
      linux-mm, Vladimir Davydov

Hello,

On a production system I've encountered a rather harsh behavior from the
kernel in the context of memory cgroups (v2) after updating the kernel
from the 5.7 series to the 5.9 series.

It seems like the kernel is reclaiming file cache but leaving the inode
cache (reclaimable slabs) alone, in a way that the server ends up
thrashing and maxing out on IO to one of its disks instead of doing
actual work.

My setup, server has 64G of RAM:
root
 + system  { min=0, low=128M, high=8G, max=8G }
   + base   { no specific constraints }
   + backup { min=0, low=32M, high=2G, max=2G }
   + shell  { no specific constraints }
 + websrv  { min=0, low=4G, high=32G, max=32G }
 + website { min=0, low=16G, high=40T, max=40T }
   + website1 { min=0, low=64M, high=2G, max=2G }
   + website2 { min=0, low=64M, high=2G, max=2G }
   ...
 + remote  { min=0, low=1G, high=14G, max=14G }
   + webuser1 { min=0, low=64M, high=2G, max=2G }
   + webuser2 { min=0, low=64M, high=2G, max=2G }
   ...

When the server was struggling I saw mostly IO on the disk hosting system
processes and some cache files of websrv processes. It seems that running
backup makes the issue much more probable.

The processes in websrv are the most impacted by the thrashing, and that
is the cgroup with lots of disk cache and inode cache assigned to it.
(Note: a helper running in the websrv cgroup scans the whole file system
hierarchy once per hour, which keeps the inode cache pretty full.)

Dropping just the file cache (about 10G) did not unlock the situation,
but dropping reclaimable slabs (inode cache, about 30G) got the system
back running.
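The limits above map onto plain cgroup v2 interface files. A minimal
sketch of how such a hierarchy is typically set up, assuming the cgroup2
filesystem is mounted at /sys/fs/cgroup and reusing the group names from
the listing (paths and the init/systemd integration are assumptions, not
the exact production setup):

  # enable the memory controller for child groups of the root
  echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

  # example for the websrv group; the other groups follow the same pattern
  mkdir -p /sys/fs/cgroup/websrv
  echo  "4G" > /sys/fs/cgroup/websrv/memory.low
  echo "32G" > /sys/fs/cgroup/websrv/memory.high
  echo "32G" > /sys/fs/cgroup/websrv/memory.max

  # nested groups additionally need the controller enabled in their parent
  mkdir -p /sys/fs/cgroup/website
  echo "+memory" > /sys/fs/cgroup/website/cgroup.subtree_control
  mkdir -p /sys/fs/cgroup/website/website1
  echo "64M" > /sys/fs/cgroup/website/website1/memory.low
  echo  "2G" > /sys/fs/cgroup/website/website1/memory.high
  echo  "2G" > /sys/fs/cgroup/website/website1/memory.max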
Some metrics I have collected during a thrashing period (metrics
collected at about 5 min intervals) - I don't have full memory.stat
unfortunately:

system/memory.min 0 = 0
system/memory.low 134217728 = 134217728
system/memory.high 8589934592 = 8589934592
system/memory.max 8589934592 = 8589934592
system/memory.pressure
  some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
  full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
 ->
  some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
  full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
system/memory.current 262533120 < 263929856
system/memory.events.local
  low      5399469 = 5399469
  high     0 = 0
  max      112303 = 112303
  oom      0 = 0
  oom_kill 0 = 0

system/base/memory.min 0 = 0
system/base/memory.low 0 = 0
system/base/memory.high max = max
system/base/memory.max max = max
system/base/memory.pressure
  some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
  full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
 ->
  some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
  full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
system/base/memory.current 31363072 < 32243712
system/base/memory.events.local
  low      0 = 0
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

system/backup/memory.min 0 = 0
system/backup/memory.low 33554432 = 33554432
system/backup/memory.high 2147483648 = 2147483648
system/backup/memory.max 2147483648 = 2147483648
system/backup/memory.pressure
  some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
  full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
 ->
  some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
  full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
system/backup/memory.current 222130176 < 222543872
system/backup/memory.events.local
  low      5446 = 5446
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

system/shell/memory.min 0 = 0
system/shell/memory.low 0 = 0
system/shell/memory.high max = max
system/shell/memory.max max = max
system/shell/memory.pressure
  some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
  full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
 ->
  some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
  full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
system/shell/memory.current 8814592 < 8888320
system/shell/memory.events.local
  low      0 = 0
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

website/memory.min 0 = 0
website/memory.low 17179869184 = 17179869184
website/memory.high 45131717672960 = 45131717672960
website/memory.max 45131717672960 = 45131717672960
website/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
  full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
 ->
  some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
  full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current 11811520512 > 11456942080
website/memory.events.local
  low      11372142 < 11377350
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

remote/memory.min 0
remote/memory.low 1073741824
remote/memory.high 15032385536
remote/memory.max 15032385536
remote/memory.pressure
  some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
  full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
 ->
remote/memory.current 84439040 > 81797120
remote/memory.events.local
  low      11372142 < 11377350
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

websrv/memory.min 0 = 0
websrv/memory.low 4294967296 = 4294967296
websrv/memory.high 34359738368 = 34359738368
websrv/memory.max 34426847232 = 34426847232
websrv/memory.pressure
  some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
  full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
 ->
  some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
  full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
websrv/memory.current 18421673984 < 18421936128
websrv/memory.events.local
  low      0 = 0
  high     0 = 0
  max      0 = 0
  oom      0 = 0
  oom_kill 0 = 0

Is there something important I'm missing in my setup that could prevent
things from starving?

Did memory.low meaning change between 5.7 and 5.9? From behavior it
feels as if inodes are not accounted to the cgroup at all and the kernel
pushes cgroups down to their memory.low by killing file cache if there
is not enough free memory to hold all promises (and not only when a
cgroup tries to use up to its promised amount of memory). The system was
thrashing just as much with 10G of file cache dropped (i.e. completely
unused memory) as with it in use.

I will try to create a test-case to reproduce this on a test machine and
be able to verify a fix, or eventually bisect to the triggering patch -
though if this all rings a bell, please tell!

Note that until I have a test-case I'm reluctant to just wait [on the
production system] for the next occurrence (usually at impractical times)
to gather some more metrics.

Regards,
Bruno

^ permalink raw reply	[flat|nested] 8+ messages in thread
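A test-case along the lines mentioned above only needs a memory-constrained
group and a large file tree to scan; a rough sketch, where the group name
"slabtest" and the /data path are placeholders rather than the real
production layout:

  # group constrained roughly like websrv
  mkdir -p /sys/fs/cgroup/slabtest
  echo  "4G" > /sys/fs/cgroup/slabtest/memory.low
  echo "32G" > /sys/fs/cgroup/slabtest/memory.max

  # run the scan from inside the group so dentries/inodes are charged to
  # it, mimicking the hourly file system walk done by the websrv helper
  echo $$ > /sys/fs/cgroup/slabtest/cgroup.procs
  find /data -xdev -print0 | xargs -0 stat > /dev/null

  # check how much of the group's footprint is reclaimable slab vs. file
  # cache (field names as exposed by recent kernels)
  grep -E '^(file|slab_reclaimable|slab_unreclaimable) ' \
      /sys/fs/cgroup/slabtest/memory.stat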
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
@ 2020-11-25 13:37 ` Michal Hocko
  2020-11-25 14:33   ` Bruno Prémont
  2020-11-25 18:21 ` Roman Gushchin
  1 sibling, 1 reply; 8+ messages in thread

From: Michal Hocko @ 2020-11-25 13:37 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
      Vladimir Davydov

Hi,
thanks for the detailed report.

On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
[...]
> Did memory.low meaning change between 5.7 and 5.9?

The latest change in the low limit protection semantics was
introduced in 5.7 (recursive protection) but it requires an explicit
enabling.

> From behavior it
> feels as if inodes are not accounted to cgroup at all and kernel pushes
> cgroups down to their memory.low by killing file cache if there is not
> enough free memory to hold all promises (and not only when a cgroup
> tries to use up to its promised amount of memory).

Your counters indeed show that the low protection has been breached,
most likely because the reclaim couldn't make any progress. Considering
that this is the case for all/most of your cgroups it suggests that the
memory pressure was global rather than limit imposed. In fact even top
level cgroups got reclaimed below the low limit.

This suggests that this is not likely to be memcg specific. It is
more likely that this is a general memory reclaim regression for your
workload. There were larger changes in that area. Be it lru balancing
based on cost model by Johannes or working set tracking for anonymous
pages by Joonsoo. Maybe even more. Both of them can influence page cache
reclaim but you are suggesting that slab accounted memory is not
reclaimed properly. I am not sure there were considerable changes
there. Would it be possible to collect /proc/vmstat as well?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 13:37 ` Michal Hocko
@ 2020-11-25 14:33   ` Bruno Prémont
  0 siblings, 0 replies; 8+ messages in thread

From: Bruno Prémont @ 2020-11-25 14:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
      Vladimir Davydov

Hi Michal,

On Wed, 25 Nov 2020 14:37:40 +0100 Michal Hocko <mhocko@suse.com> wrote:
> Hi,
> thanks for the detailed report.
>
> On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
> [...]
> > Did memory.low meaning change between 5.7 and 5.9?
>
> The latest change in the low limit protection semantics was
> introduced in 5.7 (recursive protection) but it requires an explicit
> enabling.

No specific mount options set for v2 cgroup, so not active.

> > From behavior it
> > feels as if inodes are not accounted to cgroup at all and kernel pushes
> > cgroups down to their memory.low by killing file cache if there is not
> > enough free memory to hold all promises (and not only when a cgroup
> > tries to use up to its promised amount of memory).
>
> Your counters indeed show that the low protection has been breached,
> most likely because the reclaim couldn't make any progress. Considering
> that this is the case for all/most of your cgroups it suggests that the
> memory pressure was global rather than limit imposed. In fact even top
> level cgroups got reclaimed below the low limit.

Note that the "original" counters were partially triggered by an earlier
event where I had one cgroup (websrv) with a rather high memory.low
(16G or even 32G), which caused counters everywhere to increase.

So before the last thrashing period, during which the above values were
collected, the event counters and `current` looked as follows:

system/memory.pressure
  some avg10=0.04 avg60=0.28 avg300=0.12 total=5844917510
  full avg10=0.04 avg60=0.26 avg300=0.11 total=2439353404
system/memory.current 96432128
system/memory.events.local
  low      5399469 (unchanged)
  high     0
  max      112303 (unchanged)
  oom      0
  oom_kill 0

system/base/memory.pressure
  some avg10=0.04 avg60=0.28 avg300=0.12 total=4589562039
  full avg10=0.04 avg60=0.28 avg300=0.12 total=1926984197
system/base/memory.current 59305984
system/base/memory.events.local
  low      0 (unchanged)
  high     0
  max      0 (unchanged)
  oom      0
  oom_kill 0

system/backup/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2123293649
  full avg10=0.00 avg60=0.00 avg300=0.00 total=815450446
system/backup/memory.current 32444416
system/backup/memory.events.local
  low      5446 (unchanged)
  high     0
  max      0
  oom      0
  oom_kill 0

system/shell/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=1345965660
  full avg10=0.00 avg60=0.00 avg300=0.00 total=492812915
system/shell/memory.current 4571136
system/shell/memory.events.local
  low      0
  high     0
  max      0
  oom      0
  oom_kill 0

website/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=415008878
  full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current 12104380416
website/memory.events.local
  low      11264569 (during thrashing: 11372142 then 11377350)
  high     0
  max      0
  oom      0
  oom_kill 0

remote/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2005130126
  full avg10=0.00 avg60=0.00 avg300=0.00 total=735366752
remote/memory.current 116330496
remote/memory.events.local
  low      11264569 (during thrashing: 11372142 then 11377350)
  high     0
  max      0
  oom      0
  oom_kill 0

websrv/memory.pressure
  some avg10=0.02 avg60=0.11 avg300=0.03 total=6650355162
  full avg10=0.02 avg60=0.11 avg300=0.03 total=2034584579
websrv/memory.current 18483359744
websrv/memory.events.local
  low      0
  high     0
  max      0
  oom      0
  oom_kill 0

> This suggests that this is not likely to be memcg specific. It is
> more likely that this is a general memory reclaim regression for your
> workload. There were larger changes in that area. Be it lru balancing
> based on cost model by Johannes or working set tracking for anonymous
> pages by Joonsoo. Maybe even more. Both of them can influence page cache
> reclaim but you are suggesting that slab accounted memory is not
> reclaimed properly.

That is my impression, yes. No idea though if memcg can influence the
way reclaim tries to perform its work, or if slab_reclaimable not
associated with any (child) cg would somehow be excluded from reclaim.

> I am not sure there were considerable changes
> there. Would it be possible to collect /proc/vmstat as well?

I will have a look at gathering memory.stat and /proc/vmstat at the next
opportunity.

Will first try with a test system with not too much memory and lots of
files to reproduce about 50% of memory usage by slab_reclaimable and see
how far I get.

Thanks,
Bruno

^ permalink raw reply	[flat|nested] 8+ messages in thread
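For the metric gathering mentioned here, a small collection loop is
usually enough; a sketch, with the 5-minute interval and the output path
chosen arbitrarily to match the sampling used earlier:

  #!/bin/sh
  # snapshot global and per-cgroup memory state every 5 minutes
  while sleep 300; do
      {
          date
          cat /proc/vmstat /proc/meminfo
          for cg in system system/backup websrv website remote; do
              echo "== $cg =="
              cat /sys/fs/cgroup/$cg/memory.stat \
                  /sys/fs/cgroup/$cg/memory.pressure \
                  /sys/fs/cgroup/$cg/memory.events.local
          done
      } >> /var/log/cgmon.log
  done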
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints 2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont 2020-11-25 13:37 ` Michal Hocko @ 2020-11-25 18:21 ` Roman Gushchin 2020-12-03 11:09 ` Bruno Prémont 1 sibling, 1 reply; 8+ messages in thread From: Roman Gushchin @ 2020-11-25 18:21 UTC (permalink / raw) To: Bruno Prémont Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups, linux-mm, Vladimir Davydov On Wed, Nov 25, 2020 at 12:39:56PM +0100, Bruno Prémont wrote: > Hello, > > On a production system I've encountered a rather harsh behavior from > kernel in the context of memory cgroup (v2) after updating kernel from > 5.7 series to 5.9 series. > > > It seems like kernel is reclaiming file cache but leaving inode cache > (reclaimable slabs) alone in a way that the server ends up trashing and > maxing out on IO to one of its disks instead of doing actual work. > > > My setup, server has 64G of RAM: > root > + system { min=0, low=128M, high=8G, max=8G } > + base { no specific constraints } > + backup { min=0, low=32M, high=2G, max=2G } > + shell { no specific constraints } > + websrv { min=0, low=4G, high=32G, max=32G } > + website { min=0, low=16G, high=40T, max=40T } > + website1 { min=0, low=64M, high=2G, max=2G } > + website2 { min=0, low=64M, high=2G, max=2G } > ... > + remote { min=0, low=1G, high=14G, max=14G } > + webuser1 { min=0, low=64M, high=2G, max=2G } > + webuser2 { min=0, low=64M, high=2G, max=2G } > ... > > > When the server was struggling I've had mostly IO on disk hosting > system processes and some cache files of websrv processes. > It seems that running backup does make the issue much more probable. > > The processes in websrv are the most impacted by the trashing and this > is the one with lots of disk cache and inode cache assigned to it. > (note a helper running in websrv cgroup scan whole file system > hierarchy once per hour and this keeps inode cache pretty filled. > Dropping just file cache (about 10G) did not unlock situation but > dropping reclaimable slabs (inode cache, about 30G) got the system back > running. 
> > > > Some metrics I have collected during a trashing period (metrics > collected at about 5min interval) - I don't have ful memory.stat > unfortunately: > > system/memory.min 0 = 0 > system/memory.low 134217728 = 134217728 > system/memory.high 8589934592 = 8589934592 > system/memory.max 8589934592 = 8589934592 > system/memory.pressure > some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237 > full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481 > -> > some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740 > full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903 > system/memory.current 262533120 < 263929856 > system/memory.events.local > low 5399469 = 5399469 > high 0 = 0 > max 112303 = 112303 > oom 0 = 0 > oom_kill 0 = 0 > > system/base/memory.min 0 = 0 > system/base/memory.low 0 = 0 > system/base/memory.high max = max > system/base/memory.max max = max > system/base/memory.pressure > some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349 > full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169 > -> > some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824 > full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471 > system/base/memory.current 31363072 < 32243712 > system/base/memory.events.local > low 0 = 0 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > system/backup/memory.min 0 = 0 > system/backup/memory.low 33554432 = 33554432 > system/backup/memory.high 2147483648 = 2147483648 > system/backup/memory.max 2147483648 = 2147483648 > system/backup/memory.pressure > some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085 > full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731 > -> > some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643 > full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954 > system/backup/memory.current 222130176 < 222543872 > system/backup/memory.events.local > low 5446 = 5446 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > system/shell/memory.min 0 = 0 > system/shell/memory.low 0 = 0 > system/shell/memory.high max = max > system/shell/memory.max max = max > system/shell/memory.pressure > some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661 > full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108 > -> > some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773 > full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500 > system/shell/memory.current 8814592 < 8888320 > system/shell/memory.events.local > low 0 = 0 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > website/memory.min 0 = 0 > website/memory.low 17179869184 = 17179869184 > website/memory.high 45131717672960 = 45131717672960 > website/memory.max 45131717672960 = 45131717672960 > website/memory.pressure > some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408 > full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483 > -> > some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408 > full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483 > website/memory.current 11811520512 > 11456942080 > website/memory.events.local > low 11372142 < 11377350 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > remote/memory.min 0 > remote/memory.low 1073741824 > remote/memory.high 15032385536 > remote/memory.max 15032385536 > remote/memory.pressure > some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408 > full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296 > -> > remote/memory.current 84439040 > 81797120 > remote/memory.events.local > low 11372142 < 11377350 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > websrv/memory.min 0 = 0 > 
websrv/memory.low 4294967296 = 4294967296 > websrv/memory.high 34359738368 = 34359738368 > websrv/memory.max 34426847232 = 34426847232 > websrv/memory.pressure > some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704 > full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370 > -> > some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640 > full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237 > websrv/memory.current 18421673984 < 18421936128 > websrv/memory.events.local > low 0 = 0 > high 0 = 0 > max 0 = 0 > oom 0 = 0 > oom_kill 0 = 0 > > > > Is there something important I'm missing in my setup that could prevent > things from starving? > > Did memory.low meaning change between 5.7 and 5.9? From behavior it > feels as if inodes are not accounted to cgroup at all and kernel pushes > cgroups down to their memory.low by killing file cache if there is not > enough free memory to hold all promises (and not only when a cgroup > tries to use up to its promised amount of memory). > As system was trashing as much with 10G of file cache dropped > (completely unused memory) as with it in use. > > > I will try to create a test-case for it to reproduce it on a test > machine an be able to verify a fix or eventually bisect to triggering > patch though it this all rings a bell, please tell! > > Note until I have a test-case I'm reluctant to just wait [on > production system] for next occurrence (usually at unpractical times) to > gather some more metrics.

Hi Bruno!

Thank you for the report.

Can you, please, check if the following patch fixes the issue?

Thanks!

--

diff --git a/mm/slab.h b/mm/slab.h
index 6cc323f1313a..ef02b841bcd8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		return NULL;
+		return (struct obj_cgroup *)-1UL;
 	}
 
 	return objcg;
@@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 		return NULL;
 
 	if (memcg_kmem_enabled() &&
-	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
+	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
 		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
 
+		if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
+			return NULL;
+	}
+
 	return s;
 }

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 18:21 ` Roman Gushchin
@ 2020-12-03 11:09   ` Bruno Prémont
  2020-12-03 20:55     ` Roman Gushchin
  0 siblings, 1 reply; 8+ messages in thread

From: Bruno Prémont @ 2020-12-03 11:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
      linux-mm, Vladimir Davydov

Hello Roman,

Sorry for having taken so much time to reply, I've only had the
opportunity to deploy the patch on Tuesday morning for testing and
now two days later the thrashing occurred again.

> diff --git a/mm/slab.h b/mm/slab.h
> index 6cc323f1313a..ef02b841bcd8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> 
>  	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
>  		obj_cgroup_put(objcg);
> -		return NULL;
> +		return (struct obj_cgroup *)-1UL;
>  	}
> 
>  	return objcg;
> @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>  		return NULL;
> 
>  	if (memcg_kmem_enabled() &&
> -	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> +	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
>  		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
> 
> +		if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> +			return NULL;
> +	}
> +
>  	return s;
>  }

Seems your proposed patch didn't really help.

Compared to the initial occurrence I do now have some more details (all
but /proc/slabinfo since boot), and according to /proc/slabinfo a good
deal of reclaimable slabs seem to be dentries (and probably
xfs_inode/xfs_ifork related to them) - not sure if those are assigned
to cgroups or not accounted and thus not seen as candidates for reclaim...

xfs_buf            444908 445068    448   36    4 : tunables 0 0 0 : slabdata  12363  12363 0
xfs_bui_item            0      0    232   35    2 : tunables 0 0 0 : slabdata      0      0 0
xfs_bud_item            0      0    200   40    2 : tunables 0 0 0 : slabdata      0      0 0
xfs_cui_item            0      0    456   35    4 : tunables 0 0 0 : slabdata      0      0 0
xfs_cud_item            0      0    200   40    2 : tunables 0 0 0 : slabdata      0      0 0
xfs_rui_item            0      0    712   46    8 : tunables 0 0 0 : slabdata      0      0 0
xfs_rud_item            0      0    200   40    2 : tunables 0 0 0 : slabdata      0      0 0
xfs_icr                 0    156    208   39    2 : tunables 0 0 0 : slabdata      4      4 0
xfs_ili           1223169 1535904   224   36    2 : tunables 0 0 0 : slabdata  42664  42664 0
xfs_inode        12851565 22081140  1088   30    8 : tunables 0 0 0 : slabdata 736038 736038 0
xfs_efi_item            0    280    456   35    4 : tunables 0 0 0 : slabdata      8      8 0
xfs_efd_item            0    280    464   35    4 : tunables 0 0 0 : slabdata      8      8 0
xfs_buf_item            7    216    296   27    2 : tunables 0 0 0 : slabdata      8      8 0
xf_trans                0    224    288   28    2 : tunables 0 0 0 : slabdata      8      8 0
xfs_ifork        12834992 46309928    72   56    1 : tunables 0 0 0 : slabdata 826963 826963 0
xfs_da_state            0    224    512   32    4 : tunables 0 0 0 : slabdata      7      7 0
xfs_btree_cur           0    224    256   32    2 : tunables 0 0 0 : slabdata      7      7 0
xfs_bmap_free_item      0    230     88   46    1 : tunables 0 0 0 : slabdata      5      5 0
xfs_log_ticket          4    296    216   37    2 : tunables 0 0 0 : slabdata      8      8 0
fat_inode_cache         0      0    744   44    8 : tunables 0 0 0 : slabdata      0      0 0
fat_cache               0      0     64   64    1 : tunables 0 0 0 : slabdata      0      0 0
mnt_cache             114    180    448   36    4 : tunables 0 0 0 : slabdata      5      5 0
filp                 6228  15582    384   42    4 : tunables 0 0 0 : slabdata    371    371 0
inode_cache          6669  16016    608   26    4 : tunables 0 0 0 : slabdata    616    616 0
dentry            8092159 15642504   224   36    2 : tunables 0 0 0 : slabdata 434514 434514 0

The full collected details are available at
https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
(please take a copy as that file will not stay there forever)

A visual graph of the memory evolution is available at
https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
with the reboot on Tuesday morning, a steady increase of slabs starting
Wednesday evening (correlating with the start of backup) until thrashing
started at about 3:30, and the large drop in memory being me doing
  echo 2 > /proc/sys/vm/drop_caches
which stopped the thrashing as well.

Against what does memcg attempt reclaim when it tries to satisfy a CG's
low limit? Only against siblings, or also against root or not-accounted
memory? How does it take into account slabs where evicting evictable
entries will cause unevictable entries to be freed as well?

> > My setup, server has 64G of RAM:
> > root
> >  + system  { min=0, low=128M, high=8G, max=8G }
> >    + base   { no specific constraints }
> >    + backup { min=0, low=32M, high=2G, max=2G }
> >    + shell  { no specific constraints }
> >  + websrv  { min=0, low=4G, high=32G, max=32G }
> >  + website { min=0, low=16G, high=40T, max=40T }
> >    + website1 { min=0, low=64M, high=2G, max=2G }
> >    + website2 { min=0, low=64M, high=2G, max=2G }
> >    ...
> >  + remote  { min=0, low=1G, high=14G, max=14G }
> >    + webuser1 { min=0, low=64M, high=2G, max=2G }
> >    + webuser2 { min=0, low=64M, high=2G, max=2G }
> >    ...

Also interesting is that backup, which is forced into 2G (the
system/backup CG), causes the amount of slabs assigned to the websrv CG
to increase until that CG has almost only slab entries assigned to it to
fill 16G, like file cache being reclaimed but not slab entries, even if
there is almost no file cache left and tons of slabs.
What also surprises me is that so much memory remains completely unused
(instead of being used for file caches).

According to the documentation, if I didn't get it wrong, any limits of
child CGs (e.g. webuser1...) are applied up to what their parent's limits
allow. Thus, if looking at e.g. remote -> webuser1..., even if I have
1000 webuserN they won't "reserve" 65G for themselves via the memory.low
limit when their parent sets memory.low to 1G?
Or does this depend on CG mount options (memory_recursiveprot)?

Regards,
Bruno

^ permalink raw reply	[flat|nested] 8+ messages in thread
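As a side note on reading dumps like the one above: in /proc/slabinfo the
third column is the number of allocated objects and the fourth the object
size in bytes, so the approximate per-cache footprint can be derived
directly; a quick sketch (the 100 MiB cutoff is an arbitrary choice):

  awk 'NR > 2 { mib = $3 * $4 / 1048576;
                if (mib > 100)
                    printf "%-18s %8.0f MiB (%d objects, %d active)\n", $1, mib, $3, $2 }' \
      /proc/slabinfo | sort -k2 -rn

For the dentry line above this gives roughly 15642504 * 224 bytes, about
3.3 GiB, and for xfs_inode about 22081140 * 1088 bytes, around 22 GiB of
slab held by caches whose active-object counts are only roughly half of
the allocated objects.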
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-03 11:09   ` Bruno Prémont
@ 2020-12-03 20:55     ` Roman Gushchin
  2020-12-06 11:30       ` Bruno Prémont
  0 siblings, 1 reply; 8+ messages in thread

From: Roman Gushchin @ 2020-12-03 20:55 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
      linux-mm, Vladimir Davydov

On Thu, Dec 03, 2020 at 12:09:36PM +0100, Bruno Prémont wrote:
> Hello Roman,
>
> Sorry for having taken so much time to reply, I've only had the
> opportunity to deploy the patch on Tuesday morning for testing and
> now two days later the thrashing occurred again.
>
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 6cc323f1313a..ef02b841bcd8 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> >
> >  	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
> >  		obj_cgroup_put(objcg);
> > -		return NULL;
> > +		return (struct obj_cgroup *)-1UL;
> >  	}
> >
> >  	return objcg;
> > @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> >  		return NULL;
> >
> >  	if (memcg_kmem_enabled() &&
> > -	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> > +	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
> >  		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
> >
> > +		if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> > +			return NULL;
> > +	}
> > +
> >  	return s;
> >  }
>
> Seems your proposed patch didn't really help.

Anyway, thank you for testing! Actually your report helped me to reveal
and fix this problem, so thank you!

In the meantime Yang Shi discovered a problem related to slab shrinkers,
which is to some extent similar to what you describe: under certain
conditions large amounts of slab memory can be completely excluded from
the reclaim process.

Can you, please, check if his fix will solve your problem?
Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .

>
>
> Compared to the initial occurrence I do now have some more details (all
> but /proc/slabinfo since boot), and according to /proc/slabinfo a good
> deal of reclaimable slabs seem to be dentries (and probably
> xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> to cgroups or not accounted and thus not seen as candidates for reclaim...
> > xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0 > xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0 > xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0 > xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0 > xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0 > xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0 > xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0 > xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0 > xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0 > xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0 > xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0 > xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0 > xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0 > xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0 > xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0 > xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0 > fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0 > fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 > mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0 > filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0 > inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0 > dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0 > > > > The full collected details are available at > https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt > (please take a copy as that file will not stay there forever) > > A visual graph of memory evolution is available at > https://faramir-fj.hosting-restena.lu/system-memory-20201203.png > with reboot on Tuesday morning and steady increase of slabs starting > Webnesday evening correlating with start of backup until trashing > started at about 3:30 and the large drop in memory being me doing > echo 2 > /proc/sys/vm/drop_caches > which stopped the trashing as well. > > > Against what does memcg attempt reclaim when it tries to satisfy a CG's > low limit? Only against siblings or also against root or not-accounted? > How does it take into account slabs where evictable entries will cause > unevictable entries to be freed as well? Low limits are working by excluding some portions of memory from the reclaim, not by adding a memory pressure to something else. > > > > My setup, server has 64G of RAM: > > > root > > > + system { min=0, low=128M, high=8G, max=8G } > > > + base { no specific constraints } > > > + backup { min=0, low=32M, high=2G, max=2G } > > > + shell { no specific constraints } > > > + websrv { min=0, low=4G, high=32G, max=32G } > > > + website { min=0, low=16G, high=40T, max=40T } > > > + website1 { min=0, low=64M, high=2G, max=2G } > > > + website2 { min=0, low=64M, high=2G, max=2G } > > > ... > > > + remote { min=0, low=1G, high=14G, max=14G } > > > + webuser1 { min=0, low=64M, high=2G, max=2G } > > > + webuser2 { min=0, low=64M, high=2G, max=2G } > > > ... 
> > Also interesting is that backup which is forced into 2G > (system/backup CG) causes amount of slabs assigned to websrv CG to > increase until that CG has almost only slab entries assigned to it to > fill 16G, like file cache being reclaimed but not slab entries even if > there is almost no file cache left and tons of slabs. > What I'm also surprised is the so much memory remains completely unused > (instead of being used for file caches). > > According to the documentation if I didn't get it wrong any limits of > child CGs (e.g. webuser1...) are applied up to what their parent's > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I > have 1000 webuserN they wont "reserve" 65G for themselves via > memory.low limit when their parent sets memory.low to 1G? > Or does this depend on on CG mount options (memory_recursiveprot)? It does. What you're describing is the old (!memory_recursiveprot) behavior. Thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
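For reference, memory_recursiveprot is a cgroup2 mount option; assuming
/sys/fs/cgroup is the cgroup2 mount point, it can be checked and, on
kernels that allow changing it via remount, switched on at runtime -
otherwise it has to be passed when the hierarchy is first mounted (e.g.
from the initramfs or the service manager):

  # see which options the cgroup2 hierarchy is currently mounted with
  grep cgroup2 /proc/mounts

  # enable recursive memory.min/memory.low protection; re-specify any
  # options already in use (e.g. nsdelegate) so they are not dropped
  mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup

  # equivalent option at initial mount time
  mount -t cgroup2 -o nsdelegate,memory_recursiveprot none /sys/fs/cgroup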
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-03 20:55     ` Roman Gushchin
@ 2020-12-06 11:30       ` Bruno Prémont
  2020-12-10 11:08         ` Bruno Prémont
  0 siblings, 1 reply; 8+ messages in thread

From: Bruno Prémont @ 2020-12-06 11:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
      linux-mm, Vladimir Davydov

On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote:
> In the meantime Yang Shi discovered a problem related to slab shrinkers,
> which is to some extent similar to what you describe: under certain
> conditions large amounts of slab memory can be completely excluded from
> the reclaim process.
>
> Can you, please, check if his fix will solve your problem?
> Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .

I've added that patch on top of yours but it seems not to completely
help either.
With this patch it seems that such dentries might get reclaimed as a
last resort instead of not at all.

I've added logs since the current boot:
  https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt
  https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt
  https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt
with the memory evolution. Things started to degrade over the past night:
memory usage started to increase from 40G, tending towards full use, but
with only slabs growing (not file cache) and memory assigned to cgroups
staying more or less constant - even the root cgroup's memory stats seem
not to list a great deal of used memory.

The only cgroup not "sufficiently" protected by memory.low (websrv) has
seen its memory use somehow clamped to about 16G while it should be
allowed to go up to 32G according to memory.high, and of those 16G in use
at the time of writing it only had 100M of file cache left, all the rest
being slabs.

As the system is now using most of its memory I've bumped the websrv CG's
memory.low to 20G so it should stay protected some more (which after a
few minutes showed its file cache growing again), with the aim of moving
pressure out of leaf cgroups to non-cg-assigned memory.

Somehow this move seems to be getting me some success.

I will report back later today or tomorrow with more details on the
evolution with "no unused" memory. At least the production service tends
not to suffer (more than from storage response time).

I have the impression that memory reclaim now only looks at the cgroup
and, if it can make some progress there, will not bother looking
anywhere else.

I also have the vague impression that the distribution of my tasks on
the two NUMA nodes somehow impacts when or how memory reclaim happens.

NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

          CPUs   MEMs
system:   0-3    0
websrv:   8-11   0   (allowed mems=0-1 as of 2020-12-06 12:18 after
                      increasing memory.low from 4G to 20G, which did
                      release quite some pressure)
website:  12-23  1
remote:   4-7    0
(assignment done using the cpuset cgroup)

(Seems the NUMA distribution changed, or I missed the non-linear node
distribution of CPUs - cores versus hyperthreading - as my intent was to
have website on one socket and the rest on the other socket. Memory is
mapped as I planned it, but tasks not really.)

So calculating:
  system:  128M..8G  mems=0 \
  remote:    1G..14G mems=0  }--> 5G..54G
  websrv:    4G..32G mems=0 /
  website:   16G..   mems=1

But with everything except websrv lying below its low limit I wonder why
reclaim only hits the cgroup's file cache but still mostly ignores its
slabs.
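One way to check the per-node split being computed here is via the node
directories in sysfs and, where the numactl package is installed, via
numastat; the cgroup names below are simply the ones from the setup above:

  # global per-NUMA-node breakdown (free, file-backed, slab, ...)
  cat /sys/devices/system/node/node*/meminfo
  numastat -m

  # cpuset placement currently in effect for each group
  for cg in system websrv website remote; do
      echo "== $cg =="
      cat /sys/fs/cgroup/$cg/cpuset.cpus.effective \
          /sys/fs/cgroup/$cg/cpuset.mems.effective
  done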
Bruno > > Compared to initial occurrence I do now have some more details (all but > > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal > > of reclaimable slabs seem to be dentries (and probably > > xfs_inode/xfs_ifork related to them) - not sure if those are assigned > > to cgroups or not-accounted and not seen as candidate for reclaim... > > > > xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0 > > xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0 > > xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0 > > xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0 > > xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0 > > xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0 > > xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0 > > xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0 > > xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0 > > xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0 > > xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0 > > xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0 > > xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0 > > fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0 > > fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 > > mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0 > > filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0 > > inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0 > > dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0 > > > > > > > > The full collected details are available at > > https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt > > (please take a copy as that file will not stay there forever) > > > > A visual graph of memory evolution is available at > > https://faramir-fj.hosting-restena.lu/system-memory-20201203.png > > with reboot on Tuesday morning and steady increase of slabs starting > > Webnesday evening correlating with start of backup until trashing > > started at about 3:30 and the large drop in memory being me doing > > echo 2 > /proc/sys/vm/drop_caches > > which stopped the trashing as well. > > > > > > Against what does memcg attempt reclaim when it tries to satisfy a CG's > > low limit? Only against siblings or also against root or not-accounted? > > How does it take into account slabs where evictable entries will cause > > unevictable entries to be freed as well? > > Low limits are working by excluding some portions of memory from the reclaim, > not by adding a memory pressure to something else. 
> > > > > > > My setup, server has 64G of RAM: > > > > root > > > > + system { min=0, low=128M, high=8G, max=8G } > > > > + base { no specific constraints } > > > > + backup { min=0, low=32M, high=2G, max=2G } > > > > + shell { no specific constraints } > > > > + websrv { min=0, low=4G, high=32G, max=32G } > > > > + website { min=0, low=16G, high=40T, max=40T } > > > > + website1 { min=0, low=64M, high=2G, max=2G } > > > > + website2 { min=0, low=64M, high=2G, max=2G } > > > > ... > > > > + remote { min=0, low=1G, high=14G, max=14G } > > > > + webuser1 { min=0, low=64M, high=2G, max=2G } > > > > + webuser2 { min=0, low=64M, high=2G, max=2G } > > > > ... > > > > Also interesting is that backup which is forced into 2G > > (system/backup CG) causes amount of slabs assigned to websrv CG to > > increase until that CG has almost only slab entries assigned to it to > > fill 16G, like file cache being reclaimed but not slab entries even if > > there is almost no file cache left and tons of slabs. > > What I'm also surprised is the so much memory remains completely unused > > (instead of being used for file caches). > > > > According to the documentation if I didn't get it wrong any limits of > > child CGs (e.g. webuser1...) are applied up to what their parent's > > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I > > have 1000 webuserN they wont "reserve" 65G for themselves via > > memory.low limit when their parent sets memory.low to 1G? > > Or does this depend on on CG mount options (memory_recursiveprot)? > > It does. What you're describing is the old (!memory_recursiveprot) behavior. > > Thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-06 11:30       ` Bruno Prémont
@ 2020-12-10 11:08         ` Bruno Prémont
  0 siblings, 0 replies; 8+ messages in thread

From: Bruno Prémont @ 2020-12-10 11:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
      linux-mm, Vladimir Davydov

Hello All,

Since my last changes (allowing the websrv CG to use both NUMA memory
areas) the system as a whole runs reasonably without thrashing and also
makes way better use of memory.

As such it really seems that NUMA node restrictions (memory-wise at
least) do not properly interact with reclaim.

Is there a way to see how memory usage looks on the different NUMA nodes?
I have the impression some reclaim was ongoing because memory allocation
was requested on one node which may have been "full", and there only file
cache was being reclaimed (in cgroups where memory.low didn't protect it).

Thanks,
Bruno

On Sun, 6 Dec 2020 12:30:21 +0100 Bruno Prémont wrote: > On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote: > > In the meantime Yang Shi discovered a problem related slab shrinkers, > > which is to some extent similar to what you describe: under certain conditions > > large amounts of slab memory can be completely excluded from the reclaim process. > > > > Can you, please, check if his fix will solve your problem? > > Here is the final version: https://www.spinics.net/lists/stable/msg430601.html . > > I've added that patch on top of yours but it seems not to completely > help either. > With this patch is seems that such dentries might get reclaimed as a > last resort instead of not at all. > > I've added logs since current boot: > https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt > https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt > https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt > with the memory evolution. Evolution started to degrade over past night > where memory usage started to increase from 40G tending to full use but > with only slabs growing (not file cache) and memory assigned to cgroups > staying more or less constant - even root cgroup's memory stats seem not > to list a great deal of used memory. > > The only cgroup not "sufficiently" protected by memory.low (websrv) has > seen its memory use somehow clamped to about 16G while it should be > allowed to go up to 32G according to memory.high and of those 16G in > use at time of writing it only had 100M of file cache left, all the > rest being slabs. > > > As system now is using most of its memory I've bumped websrv CG's > memory.low to 20G so it should stay protected some more (which after a > few minutes showed its filecache growing again) with the aim of moving > pressure out of leaf-cgroups to non-cg-assigned-memory. > > Somehow this move seems to prove getting me some success. > > > I will report back later today or tomorrow with more details on the > evolution with "no unused" memory. At least production service tends not > to suffer (more than from storage response time). > > > > I have the impression that memory reclaim now only looks at cgroup and > if it can make some progress it will not bother looking anywhere else. > > I also have the vague impression that distribution of my tasks on the > two NUMA nodes somehow impacts when or how memory reclaim happens.
> > NUMA node0 CPU(s): 0-5,12-17 > NUMA node1 CPU(s): 6-11,18-23 > > CPUs MEMs > system: 0-3 0 > websrv: 8-11 0 (allowed mems=0-1 as of 2020-12-06 12:18 > after increasing memory.low from 4G > to 20G which did release quite some > pression and allow) > website: 12-23 1 > remote: 4-7 0 > (assignment done using cpuset cgroup) > > (seems NUMA distribution changed or I missed the non-linear node > distribution of CPUs - cores versus hyperthreading as my Intent was to > have website on 1 socket and the rest on the other socket. Memory is > mapped as I planned it, but tasks not really) > > > So calculating: > system: 128M..8G mems=0 \ > remote: 1G..14G mems=0 }--> 5G..54G > websrv: 4G..32G mems=0 / > website: 16G.. mems=1 > > But with everything except websrv lying below its low limit I wonder why > reclaim only hits the cgroups's file cache but still mostly ignores its > slabs. > > > Bruno > > > > Compared to initial occurrence I do now have some more details (all but > > > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal > > > of reclaimable slabs seem to be dentries (and probably > > > xfs_inode/xfs_ifork related to them) - not sure if those are assigned > > > to cgroups or not-accounted and not seen as candidate for reclaim... > > > > > > xfs_buf 444908 445068 448 36 4 : tunables 0 0 0 : slabdata 12363 12363 0 > > > xfs_bui_item 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_bud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_cui_item 0 0 456 35 4 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_cud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_rui_item 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_rud_item 0 0 200 40 2 : tunables 0 0 0 : slabdata 0 0 0 > > > xfs_icr 0 156 208 39 2 : tunables 0 0 0 : slabdata 4 4 0 > > > xfs_ili 1223169 1535904 224 36 2 : tunables 0 0 0 : slabdata 42664 42664 0 > > > xfs_inode 12851565 22081140 1088 30 8 : tunables 0 0 0 : slabdata 736038 736038 0 > > > xfs_efi_item 0 280 456 35 4 : tunables 0 0 0 : slabdata 8 8 0 > > > xfs_efd_item 0 280 464 35 4 : tunables 0 0 0 : slabdata 8 8 0 > > > xfs_buf_item 7 216 296 27 2 : tunables 0 0 0 : slabdata 8 8 0 > > > xf_trans 0 224 288 28 2 : tunables 0 0 0 : slabdata 8 8 0 > > > xfs_ifork 12834992 46309928 72 56 1 : tunables 0 0 0 : slabdata 826963 826963 0 > > > xfs_da_state 0 224 512 32 4 : tunables 0 0 0 : slabdata 7 7 0 > > > xfs_btree_cur 0 224 256 32 2 : tunables 0 0 0 : slabdata 7 7 0 > > > xfs_bmap_free_item 0 230 88 46 1 : tunables 0 0 0 : slabdata 5 5 0 > > > xfs_log_ticket 4 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0 > > > fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0 > > > fat_cache 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 > > > mnt_cache 114 180 448 36 4 : tunables 0 0 0 : slabdata 5 5 0 > > > filp 6228 15582 384 42 4 : tunables 0 0 0 : slabdata 371 371 0 > > > inode_cache 6669 16016 608 26 4 : tunables 0 0 0 : slabdata 616 616 0 > > > dentry 8092159 15642504 224 36 2 : tunables 0 0 0 : slabdata 434514 434514 0 > > > > > > > > > > > > The full collected details are available at > > > https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt > > > (please take a copy as that file will not stay there forever) > > > > > > A visual graph of memory evolution is available at > > > https://faramir-fj.hosting-restena.lu/system-memory-20201203.png > > > with reboot on Tuesday morning and steady increase of slabs starting > > > Webnesday evening correlating with start of backup until trashing > > > started at about 3:30 and 
the large drop in memory being me doing > > > echo 2 > /proc/sys/vm/drop_caches > > > which stopped the trashing as well. > > > > > > > > > Against what does memcg attempt reclaim when it tries to satisfy a CG's > > > low limit? Only against siblings or also against root or not-accounted? > > > How does it take into account slabs where evictable entries will cause > > > unevictable entries to be freed as well? > > > > Low limits are working by excluding some portions of memory from the reclaim, > > not by adding a memory pressure to something else. > > > > > > > > > > My setup, server has 64G of RAM: > > > > > root > > > > > + system { min=0, low=128M, high=8G, max=8G } > > > > > + base { no specific constraints } > > > > > + backup { min=0, low=32M, high=2G, max=2G } > > > > > + shell { no specific constraints } > > > > > + websrv { min=0, low=4G, high=32G, max=32G } > > > > > + website { min=0, low=16G, high=40T, max=40T } > > > > > + website1 { min=0, low=64M, high=2G, max=2G } > > > > > + website2 { min=0, low=64M, high=2G, max=2G } > > > > > ... > > > > > + remote { min=0, low=1G, high=14G, max=14G } > > > > > + webuser1 { min=0, low=64M, high=2G, max=2G } > > > > > + webuser2 { min=0, low=64M, high=2G, max=2G } > > > > > ... > > > > > > Also interesting is that backup which is forced into 2G > > > (system/backup CG) causes amount of slabs assigned to websrv CG to > > > increase until that CG has almost only slab entries assigned to it to > > > fill 16G, like file cache being reclaimed but not slab entries even if > > > there is almost no file cache left and tons of slabs. > > > What I'm also surprised is the so much memory remains completely unused > > > (instead of being used for file caches). > > > > > > According to the documentation if I didn't get it wrong any limits of > > > child CGs (e.g. webuser1...) are applied up to what their parent's > > > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I > > > have 1000 webuserN they wont "reserve" 65G for themselves via > > > memory.low limit when their parent sets memory.low to 1G? > > > Or does this depend on on CG mount options (memory_recursiveprot)? > > > > It does. What you're describing is the old (!memory_recursiveprot) behavior. > > > > Thanks! > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2020-12-10 11:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
2020-11-25 13:37 ` Michal Hocko
2020-11-25 14:33   ` Bruno Prémont
2020-11-25 18:21 ` Roman Gushchin
2020-12-03 11:09   ` Bruno Prémont
2020-12-03 20:55     ` Roman Gushchin
2020-12-06 11:30       ` Bruno Prémont
2020-12-10 11:08         ` Bruno Prémont