linux-mm.kvack.org archive mirror
* Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
@ 2020-11-25 11:39 Bruno Prémont
  2020-11-25 13:37 ` Michal Hocko
  2020-11-25 18:21 ` Roman Gushchin
  0 siblings, 2 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-11-25 11:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Chris Down, Michal Hocko, Johannes Weiner, Chris Down, cgroups,
	linux-mm, Vladimir Davydov

Hello,

On a production system I've encountered a rather harsh behavior from
the kernel in the context of the memory cgroup (v2) after updating the
kernel from the 5.7 series to the 5.9 series.


It seems like the kernel is reclaiming file cache but leaving the inode
cache (reclaimable slabs) alone, in a way that the server ends up
thrashing and maxing out on IO to one of its disks instead of doing
actual work.


My setup, server has 64G of RAM:
  root
   + system        { min=0, low=128M, high=8G, max=8G }
     + base        { no specific constraints }
     + backup      { min=0, low=32M, high=2G, max=2G }
     + shell       { no specific constraints }
  + websrv         { min=0, low=4G, high=32G, max=32G }
  + website        { min=0, low=16G, high=40T, max=40T }
    + website1     { min=0, low=64M, high=2G, max=2G }
    + website2     { min=0, low=64M, high=2G, max=2G }
      ...
  + remote         { min=0, low=1G, high=14G, max=14G }
    + webuser1     { min=0, low=64M, high=2G, max=2G }
    + webuser2     { min=0, low=64M, high=2G, max=2G }
      ...
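
For reference, these constraints are applied by writing to the cgroup v2
control files - a minimal sketch, assuming the hierarchy lives under
/sys/fs/cgroup and the memory controller is enabled through
cgroup.subtree_control at each level:

  mkdir -p /sys/fs/cgroup/system/backup /sys/fs/cgroup/websrv
  echo 128M > /sys/fs/cgroup/system/memory.low
  echo 8G   > /sys/fs/cgroup/system/memory.high
  echo 8G   > /sys/fs/cgroup/system/memory.max
  echo 4G   > /sys/fs/cgroup/websrv/memory.low
  echo 32G  > /sys/fs/cgroup/websrv/memory.high
  echo 32G  > /sys/fs/cgroup/websrv/memory.max
  # ...and similarly for the remaining cgroups in the tree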


When the server was struggling I saw mostly IO on the disk hosting
system processes and some cache files of websrv processes.
It seems that running the backup makes the issue much more probable.

The processes in websrv are the most impacted by the thrashing, and
this is the cgroup with lots of disk cache and inode cache assigned to
it. (Note: a helper running in the websrv cgroup scans the whole file
system hierarchy once per hour, which keeps the inode cache pretty
full.)
Dropping just the file cache (about 10G) did not unlock the situation,
but dropping the reclaimable slabs (inode cache, about 30G) got the
system back running.
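
For reference, the corresponding knob is /proc/sys/vm/drop_caches
(1 drops the clean page cache, 2 drops reclaimable slab objects such as
dentries and inodes, 3 drops both):

  echo 1 > /proc/sys/vm/drop_caches   # page cache only
  echo 2 > /proc/sys/vm/drop_caches   # reclaimable slab objects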



Some metrics I collected during a thrashing period (two samples taken
about 5 minutes apart; the =, < and > signs compare the first sample to
the second) - unfortunately I don't have full memory.stat:

system/memory.min              0              = 0
system/memory.low              134217728      = 134217728
system/memory.high             8589934592     = 8589934592
system/memory.max              8589934592     = 8589934592
system/memory.pressure
    some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
    full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
  ->
    some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
    full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
system/memory.current          262533120      < 263929856
system/memory.events.local
    low                        5399469        = 5399469
    high                       0              = 0
    max                        112303         = 112303
    oom                        0              = 0
    oom_kill                   0              = 0

system/base/memory.min         0              = 0
system/base/memory.low         0              = 0
system/base/memory.high        max            = max
system/base/memory.max         max            = max
system/base/memory.pressure
    some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
    full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
  ->
    some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
    full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
system/base/memory.current     31363072       < 32243712
system/base/memory.events.local
    low                        0              = 0
    high                       0              = 0
    max                        0              = 0
    oom                        0              = 0
    oom_kill                   0              = 0

system/backup/memory.min       0              = 0
system/backup/memory.low       33554432       = 33554432
system/backup/memory.high      2147483648     = 2147483648
system/backup/memory.max       2147483648     = 2147483648
system/backup/memory.pressure
    some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
    full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
  ->
    some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
    full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
system/backup/memory.current  222130176       < 222543872
system/backup/memory.events.local
    low                       5446            = 5446
    high                      0               = 0
    max                       0               = 0
    oom                       0               = 0
    oom_kill                  0               = 0

system/shell/memory.min       0               = 0
system/shell/memory.low       0               = 0
system/shell/memory.high      max             = max
system/shell/memory.max       max             = max
system/shell/memory.pressure
    some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
    full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
  ->
    some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
    full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
system/shell/memory.current  8814592          < 8888320
system/shell/memory.events.local
    low                      0                = 0
    high                     0                = 0
    max                      0                = 0
    oom                      0                = 0
    oom_kill                 0                = 0

website/memory.min           0                = 0
website/memory.low           17179869184      = 17179869184
website/memory.high          45131717672960   = 45131717672960
website/memory.max           45131717672960   = 45131717672960
website/memory.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
    full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
  ->
    some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
    full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current       11811520512      > 11456942080
website/memory.events.local
    low                      11372142         < 11377350
    high                     0                = 0
    max                      0                = 0
    oom                      0                = 0
    oom_kill                 0                = 0

remote/memory.min            0
remote/memory.low            1073741824
remote/memory.high           15032385536
remote/memory.max            15032385536
remote/memory.pressure
    some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
    full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
  ->
remote/memory.current        84439040         > 81797120
remote/memory.events.local
    low                      11372142         < 11377350
    high                     0                = 0
    max                      0                = 0
    oom                      0                = 0
    oom_kill                 0                = 0

websrv/memory.min            0                = 0
websrv/memory.low            4294967296       = 4294967296
websrv/memory.high           34359738368      = 34359738368
websrv/memory.max            34426847232      = 34426847232
websrv/memory.pressure
    some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
    full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
  ->
    some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
    full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
websrv/memory.current        18421673984      < 18421936128
websrv/memory.events.local
    low                      0                = 0
    high                     0                = 0
    max                      0                = 0
    oom                      0                = 0
    oom_kill                 0                = 0



Is there something important I'm missing in my setup that could prevent
things from starving?

Did the memory.low meaning change between 5.7 and 5.9? From the
behavior it feels as if inodes are not accounted to the cgroup at all
and the kernel pushes cgroups down to their memory.low by killing file
cache if there is not enough free memory to hold all the promises (and
not only when a cgroup tries to use up to its promised amount of
memory).
The system was thrashing just as much with the 10G of file cache
dropped (completely unused memory) as with it in use.


I will try to create a test case to reproduce this on a test machine
and be able to verify a fix, or eventually bisect to the triggering
patch; but if this all rings a bell, please tell!

Note that until I have a test case I'm reluctant to just wait [on the
production system] for the next occurrence (usually at impractical
times) to gather some more metrics.

Regards,
Bruno



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
@ 2020-11-25 13:37 ` Michal Hocko
  2020-11-25 14:33   ` Bruno Prémont
  2020-11-25 18:21 ` Roman Gushchin
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2020-11-25 13:37 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
	Vladimir Davydov

Hi,
thanks for the detailed report.

On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
[...]
> Did memory.low meaning change between 5.7 and 5.9?

The latest change in the low limit protection semantics was introduced
in 5.7 (recursive protection), but it requires explicit enabling.

> From behavior it
> feels as if inodes are not accounted to cgroup at all and kernel pushes
> cgroups down to their memory.low by killing file cache if there is not
> enough free memory to hold all promises (and not only when a cgroup
> tries to use up to its promised amount of memory).

Your counters indeed show that the low protection has been breached,
most likely because the reclaim couldn't make any progress. Considering
that this is the case for all/most of your cgroups, it suggests that
the memory pressure was global rather than limit-imposed. In fact even
top level cgroups got reclaimed below the low limit.

This suggests that this is not likely to be memcg specific. It is more
likely a general memory reclaim regression for your workload. There
were larger changes in that area, be it the LRU balancing based on a
cost model by Johannes or the workingset tracking for anonymous pages
by Joonsoo. Maybe even more. Both of them can influence page cache
reclaim, but you are suggesting that slab accounted memory is not
reclaimed properly. I am not sure there were considerable changes
there. Would it be possible to collect /proc/vmstat as well?

-- 
Michal Hocko
SUSE Labs



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 13:37 ` Michal Hocko
@ 2020-11-25 14:33   ` Bruno Prémont
  0 siblings, 0 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-11-25 14:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, Chris Down, Johannes Weiner, cgroups, linux-mm,
	Vladimir Davydov

Hi Michal,

On Wed, 25 Nov 2020 14:37:40 +0100 Michal Hocko <mhocko@suse.com> wrote:
> Hi,
> thanks for the detailed report.
> 
> On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
> [...]
> > Did memory.low meaning change between 5.7 and 5.9?  
> 
> The latest semantic change in the low limit protection semantic was
> introduced in 5.7 (recursive protection) but it requires an explicit
> enablinig.

No specific mount options set for v2 cgroup, so not active.

> > From behavior it
> > feels as if inodes are not accounted to cgroup at all and kernel pushes
> > cgroups down to their memory.low by killing file cache if there is not
> > enough free memory to hold all promises (and not only when a cgroup
> > tries to use up to its promised amount of memory).  
> 
> Your counters indeed show that the low protection has been breached,
> most likely because the reclaim couldn't make any progress. Considering
> that this is the case for all/most of your cgroups it suggests that the
> memory pressure was global rather than limit imposed. In fact even top
> level cgroups got reclaimed below the low limit.

Note that the "original" counters were partially triggered by an
earlier event where I had one cgroup (websrv) with a rather high
memory.low (16G or even 32G), which caused counters everywhere to
increase.


So before the last thrashing episode, during which the above values
were collected, the event counters and `current` looked as follows:

system/memory.pressure
  some avg10=0.04 avg60=0.28 avg300=0.12 total=5844917510
  full avg10=0.04 avg60=0.26 avg300=0.11 total=2439353404
system/memory.current
  96432128
system/memory.events.local
  low      5399469   (unchanged)
  high     0
  max      112303    (unchanged)
  oom      0
  oom_kill 0

system/base/memory.pressure
  some avg10=0.04 avg60=0.28 avg300=0.12 total=4589562039
  full avg10=0.04 avg60=0.28 avg300=0.12 total=1926984197
system/base/memory.current
  59305984
system/base/memory.events.local
  low      0   (unchanged)
  high     0
  max      0   (unchanged)
  oom      0
  oom_kill 0

system/backup/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2123293649
  full avg10=0.00 avg60=0.00 avg300=0.00 total=815450446
system/backup/memory.current
  32444416
system/backup/memory.events.local
  low      5446   (unchanged)
  high     0
  max      0
  oom      0
  oom_kill 0

system/shell/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=1345965660
  full avg10=0.00 avg60=0.00 avg300=0.00 total=492812915
system/shell/memory.current
  4571136
system/shell/memory.events.local
  low      0
  high     0
  max      0
  oom      0
  oom_kill 0

website/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=415008878
  full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current
  12104380416
website/memory.events.local
  low      11264569  (during thrashing: 11372142 then 11377350)
  high     0
  max      0
  oom      0
  oom_kill 0

remote/memory.pressure
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2005130126
  full avg10=0.00 avg60=0.00 avg300=0.00 total=735366752
remote/memory.current
  116330496
remote/memory.events.local
  low      11264569  (during thrashing: 11372142 then 11377350)
  high     0
  max      0
  oom      0
  oom_kill 0

websrv/memory.pressure
  some avg10=0.02 avg60=0.11 avg300=0.03 total=6650355162
  full avg10=0.02 avg60=0.11 avg300=0.03 total=2034584579
websrv/memory.current
  18483359744
websrv/memory.events.local
  low      0
  high     0
  max      0
  oom      0
  oom_kill 0


> This suggests that this is not likely to be memcg specific. It is
> more likely that this is a general memory reclaim regression for your
> workload. There were larger changes in that area. Be it lru balancing
> based on cost model by Johannes or working set tracking for anonymous
> pages by Joonsoo. Maybe even more. Both of them can influence page cache
> reclaim but you are suggesting that slab accounted memory is not
> reclaimed properly.

That is my impression, yes. No idea though whether memcg can influence
the way reclaim performs its work, or whether slab_reclaimable memory
not associated with any (child) cgroup would somehow be excluded from
reclaim.

> I am not sure sure there were considerable changes
> there. Would it be possible to collect /prov/vmstat as well?

I will have a look at gathering memory.stat and /proc/vmstat at the
next opportunity.
I will first try with a test system with not too much memory and lots
of files, to reproduce about 50% of memory usage by slab_reclaimable,
and see how far I get.
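
Something along these lines (a sketch - the cgroup names/paths are
illustrative and would need to match the real hierarchy):

  # snapshot global and per-cgroup reclaim counters every 5 minutes
  while sleep 300; do
          date
          cat /proc/vmstat
          for cg in system websrv website remote; do
                  echo "== $cg =="
                  cat /sys/fs/cgroup/$cg/memory.stat \
                      /sys/fs/cgroup/$cg/memory.pressure \
                      /sys/fs/cgroup/$cg/memory.events.local
          done
  done >> cgmon.log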

Thanks,
Bruno



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
  2020-11-25 13:37 ` Michal Hocko
@ 2020-11-25 18:21 ` Roman Gushchin
  2020-12-03 11:09   ` Bruno Prémont
  1 sibling, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-11-25 18:21 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
	linux-mm, Vladimir Davydov

On Wed, Nov 25, 2020 at 12:39:56PM +0100, Bruno Prémont wrote:
> Hello,
> 
> On a production system I've encountered a rather harsh behavior from
> kernel in the context of memory cgroup (v2) after updating kernel from
> 5.7 series to 5.9 series.
> 
> 
> It seems like kernel is reclaiming file cache but leaving inode cache
> (reclaimable slabs) alone in a way that the server ends up trashing and
> maxing out on IO to one of its disks instead of doing actual work.
> 
> 
> My setup, server has 64G of RAM:
>   root
>    + system        { min=0, low=128M, high=8G, max=8G }
>      + base        { no specific constraints }
>      + backup      { min=0, low=32M, high=2G, max=2G }
>      + shell       { no specific constraints }
>   + websrv         { min=0, low=4G, high=32G, max=32G }
>   + website        { min=0, low=16G, high=40T, max=40T }
>     + website1     { min=0, low=64M, high=2G, max=2G }
>     + website2     { min=0, low=64M, high=2G, max=2G }
>       ...
>   + remote         { min=0, low=1G, high=14G, max=14G }
>     + webuser1     { min=0, low=64M, high=2G, max=2G }
>     + webuser2     { min=0, low=64M, high=2G, max=2G }
>       ...
> 
> 
> When the server was struggling I've had mostly IO on disk hosting
> system processes and some cache files of websrv processes.
> It seems that running backup does make the issue much more probable.
> 
> The processes in websrv are the most impacted by the trashing and this
> is the one with lots of disk cache and inode cache assigned to it.
> (note a helper running in websrv cgroup scan whole file system
> hierarchy once per hour and this keeps inode cache pretty filled.
> Dropping just file cache (about 10G) did not unlock situation but
> dropping reclaimable slabs (inode cache, about 30G) got the system back
> running.
> 
> 
> 
> Some metrics I have collected during a trashing period (metrics
> collected at about 5min interval) - I don't have ful memory.stat
> unfortunately:
> 
> system/memory.min              0              = 0
> system/memory.low              134217728      = 134217728
> system/memory.high             8589934592     = 8589934592
> system/memory.max              8589934592     = 8589934592
> system/memory.pressure
>     some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
>     full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
>   ->
>     some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
>     full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
> system/memory.current          262533120      < 263929856
> system/memory.events.local
>     low                        5399469        = 5399469
>     high                       0              = 0
>     max                        112303         = 112303
>     oom                        0              = 0
>     oom_kill                   0              = 0
> 
> system/base/memory.min         0              = 0
> system/base/memory.low         0              = 0
> system/base/memory.high        max            = max
> system/base/memory.max         max            = max
> system/base/memory.pressure
>     some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
>     full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
>   ->
>     some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
>     full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
> system/base/memory.current     31363072       < 32243712
> system/base/memory.events.local
>     low                        0              = 0
>     high                       0              = 0
>     max                        0              = 0
>     oom                        0              = 0
>     oom_kill                   0              = 0
> 
> system/backup/memory.min       0              = 0
> system/backup/memory.low       33554432       = 33554432
> system/backup/memory.high      2147483648     = 2147483648
> system/backup/memory.max       2147483648     = 2147483648
> system/backup/memory.pressure
>     some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
>     full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
>   ->
>     some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
>     full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
> system/backup/memory.current  222130176       < 222543872
> system/backup/memory.events.local
>     low                       5446            = 5446
>     high                      0               = 0
>     max                       0               = 0
>     oom                       0               = 0
>     oom_kill                  0               = 0
> 
> system/shell/memory.min       0               = 0
> system/shell/memory.low       0               = 0
> system/shell/memory.high      max             = max
> system/shell/memory.max       max             = max
> system/shell/memory.pressure
>     some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
>     full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
>   ->
>     some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
>     full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
> system/shell/memory.current  8814592          < 8888320
> system/shell/memory.events.local
>     low                      0                = 0
>     high                     0                = 0
>     max                      0                = 0
>     oom                      0                = 0
>     oom_kill                 0                = 0
> 
> website/memory.min           0                = 0
> website/memory.low           17179869184      = 17179869184
> website/memory.high          45131717672960   = 45131717672960
> website/memory.max           45131717672960   = 45131717672960
> website/memory.pressure
>     some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
>     full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
>   ->
>     some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
>     full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
> website/memory.current       11811520512      > 11456942080
> website/memory.events.local
>     low                      11372142         < 11377350
>     high                     0                = 0
>     max                      0                = 0
>     oom                      0                = 0
>     oom_kill                 0                = 0
> 
> remote/memory.min            0
> remote/memory.low            1073741824
> remote/memory.high           15032385536
> remote/memory.max            15032385536
> remote/memory.pressure
>     some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
>     full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
>   ->
> remote/memory.current        84439040         > 81797120
> remote/memory.events.local
>     low                      11372142         < 11377350
>     high                     0                = 0
>     max                      0                = 0
>     oom                      0                = 0
>     oom_kill                 0                = 0
> 
> websrv/memory.min            0                = 0
> websrv/memory.low            4294967296       = 4294967296
> websrv/memory.high           34359738368      = 34359738368
> websrv/memory.max            34426847232      = 34426847232
> websrv/memory.pressure
>     some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
>     full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
>   ->
>     some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
>     full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
> websrv/memory.current        18421673984      < 18421936128
> websrv/memory.events.local
>     low                      0                = 0
>     high                     0                = 0
>     max                      0                = 0
>     oom                      0                = 0
>     oom_kill                 0                = 0
> 
> 
> 
> Is there something important I'm missing in my setup that could prevent
> things from starving?
> 
> Did memory.low meaning change between 5.7 and 5.9? From behavior it
> feels as if inodes are not accounted to cgroup at all and kernel pushes
> cgroups down to their memory.low by killing file cache if there is not
> enough free memory to hold all promises (and not only when a cgroup
> tries to use up to its promised amount of memory).
> As system was trashing as much with 10G of file cache dropped
> (completely unused memory) as with it in use.
> 
> 
> I will try to create a test-case for it to reproduce it on a test
> machine an be able to verify a fix or eventually bisect to triggering
> patch though it this all rings a bell, please tell!
> 
> Note until I have a test-case I'm reluctant to just wait [on
> production system] for next occurrence (usually at unpractical times) to
> gather some more metrics.

Hi Bruno!

Thank you for the report.

Can you, please, check if the following patch fixes the issue?

Thanks!

--

diff --git a/mm/slab.h b/mm/slab.h
index 6cc323f1313a..ef02b841bcd8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 
        if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
                obj_cgroup_put(objcg);
-               return NULL;
+               return (struct obj_cgroup *)-1UL;
        }
 
        return objcg;
@@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
                return NULL;
 
        if (memcg_kmem_enabled() &&
-           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
+           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
                *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
 
+               if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
+                       return NULL;
+       }
+
        return s;
 }
 



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-11-25 18:21 ` Roman Gushchin
@ 2020-12-03 11:09   ` Bruno Prémont
  2020-12-03 20:55     ` Roman Gushchin
  0 siblings, 1 reply; 8+ messages in thread
From: Bruno Prémont @ 2020-12-03 11:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
	linux-mm, Vladimir Davydov

Hello Roman,

Sorry for having taken so much time to reply; I only had the
opportunity to deploy the patch for testing on Tuesday morning, and
now, two days later, the thrashing has occurred again.

> diff --git a/mm/slab.h b/mm/slab.h
> index 6cc323f1313a..ef02b841bcd8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
>  
>         if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
>                 obj_cgroup_put(objcg);
> -               return NULL;
> +               return (struct obj_cgroup *)-1UL;
>         }
>  
>         return objcg;
> @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>                 return NULL;
>  
>         if (memcg_kmem_enabled() &&
> -           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> +           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
>                 *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
>  
> +               if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> +                       return NULL;
> +       }
> +
>         return s;
>  }

Seems your proposed patch didn't really help.



Compared to the initial occurrence I now have some more details (all
but /proc/slabinfo since boot), and according to /proc/slabinfo a good
deal of the reclaimable slabs seem to be dentries (and probably the
xfs_inode/xfs_ifork objects related to them) - I'm not sure whether
those are assigned to cgroups or are unaccounted and thus not seen as
candidates for reclaim...

xfs_buf           444908 445068    448   36    4 : tunables    0    0    0 : slabdata  12363  12363      0
xfs_bui_item           0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_bud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_cui_item           0      0    456   35    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_cud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_rui_item           0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_rud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_icr                0    156    208   39    2 : tunables    0    0    0 : slabdata      4      4      0
xfs_ili           1223169 1535904    224   36    2 : tunables    0    0    0 : slabdata  42664  42664      0
xfs_inode         12851565 22081140   1088   30    8 : tunables    0    0    0 : slabdata 736038 736038      0
xfs_efi_item           0    280    456   35    4 : tunables    0    0    0 : slabdata      8      8      0
xfs_efd_item           0    280    464   35    4 : tunables    0    0    0 : slabdata      8      8      0
xfs_buf_item           7    216    296   27    2 : tunables    0    0    0 : slabdata      8      8      0
xf_trans               0    224    288   28    2 : tunables    0    0    0 : slabdata      8      8      0
xfs_ifork         12834992 46309928     72   56    1 : tunables    0    0    0 : slabdata 826963 826963      0
xfs_da_state           0    224    512   32    4 : tunables    0    0    0 : slabdata      7      7      0
xfs_btree_cur          0    224    256   32    2 : tunables    0    0    0 : slabdata      7      7      0
xfs_bmap_free_item      0    230     88   46    1 : tunables    0    0    0 : slabdata      5      5      0
xfs_log_ticket         4    296    216   37    2 : tunables    0    0    0 : slabdata      8      8      0
fat_inode_cache        0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
mnt_cache            114    180    448   36    4 : tunables    0    0    0 : slabdata      5      5      0
filp                6228  15582    384   42    4 : tunables    0    0    0 : slabdata    371    371      0
inode_cache         6669  16016    608   26    4 : tunables    0    0    0 : slabdata    616    616      0
dentry            8092159 15642504    224   36    2 : tunables    0    0    0 : slabdata 434514 434514      0
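
A rough sizing of the biggest caches from the dump above (num_objs *
objsize): xfs_inode alone is roughly 22081140 * 1088 bytes, i.e. about
22 GiB, dentry about 3.3 GiB and xfs_ifork about 3.1 GiB, so the bulk
of the reclaimable slab is indeed sitting in these caches. A quick way
to get that overview (a sketch, assuming the standard slabinfo 2.x
column layout):

  awk 'NR > 2 { printf "%10.1f MiB  %s\n", $3 * $4 / 1048576, $1 }' \
      /proc/slabinfo | sort -rn | head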



The full collected details are available at
  https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt
(please take a copy as that file will not stay there forever).

A visual graph of the memory evolution is available at
  https://faramir-fj.hosting-restena.lu/system-memory-20201203.png
It shows the reboot on Tuesday morning and a steady increase of slabs
starting Wednesday evening, correlating with the start of the backup,
until thrashing started at about 3:30; the large drop in memory is me
doing
  echo 2 > /proc/sys/vm/drop_caches
which stopped the thrashing as well.


Against what does memcg attempt reclaim when it tries to satisfy a CG's
low limit? Only against siblings, or also against root or unaccounted
memory?
How does it take into account slabs where evicting the evictable
entries will cause otherwise unevictable entries to be freed as well?

> > My setup, server has 64G of RAM:
> >   root
> >    + system        { min=0, low=128M, high=8G, max=8G }
> >      + base        { no specific constraints }
> >      + backup      { min=0, low=32M, high=2G, max=2G }
> >      + shell       { no specific constraints }
> >   + websrv         { min=0, low=4G, high=32G, max=32G }
> >   + website        { min=0, low=16G, high=40T, max=40T }
> >     + website1     { min=0, low=64M, high=2G, max=2G }
> >     + website2     { min=0, low=64M, high=2G, max=2G }
> >       ...
> >   + remote         { min=0, low=1G, high=14G, max=14G }
> >     + webuser1     { min=0, low=64M, high=2G, max=2G }
> >     + webuser2     { min=0, low=64M, high=2G, max=2G }
> >       ...

Also interesting is that the backup, which is forced into 2G (the
system/backup CG), causes the amount of slabs assigned to the websrv CG
to increase until that CG has almost only slab entries assigned to it,
filling 16G - as if file cache gets reclaimed but not slab entries,
even when there is almost no file cache left and tons of slabs.
What also surprises me is that so much memory remains completely unused
(instead of being used for file caches).

According to the documentation, if I didn't get it wrong, any limits of
child CGs (e.g. webuser1...) are applied only up to what their parent's
limits allow. Thus, looking at e.g. remote -> webuser1..., even if I
have 1000 webuserN they won't "reserve" 65G for themselves via their
memory.low limits when their parent sets memory.low to 1G?
Or does this depend on CG mount options (memory_recursiveprot)?


Regards,
Bruno



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-03 11:09   ` Bruno Prémont
@ 2020-12-03 20:55     ` Roman Gushchin
  2020-12-06 11:30       ` Bruno Prémont
  0 siblings, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2020-12-03 20:55 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
	linux-mm, Vladimir Davydov

On Thu, Dec 03, 2020 at 12:09:36PM +0100, Bruno Prémont wrote:
> Hello Roman,
> 
> Sorry for having taken so much time to reply, I've only had the
> opportunity to deploy the patch on Tuesday morning for testing and
> now two days later the trashing occurred again.
> 
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 6cc323f1313a..ef02b841bcd8 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -290,7 +290,7 @@ static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> >  
> >         if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
> >                 obj_cgroup_put(objcg);
> > -               return NULL;
> > +               return (struct obj_cgroup *)-1UL;
> >         }
> >  
> >         return objcg;
> > @@ -501,9 +501,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> >                 return NULL;
> >  
> >         if (memcg_kmem_enabled() &&
> > -           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
> > +           ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) {
> >                 *objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
> >  
> > +               if (unlikely(*objcgp == (struct obj_cgroup *)-1UL))
> > +                       return NULL;
> > +       }
> > +
> >         return s;
> >  }
> 
> Seems your proposed patch didn't really help.

Anyway, thank you for testing! Actually your report helped me to reveal and
fix this problem, so thank you!

In the meantime Yang Shi discovered a problem related to slab shrinkers,
which is to some extent similar to what you describe: under certain conditions
large amounts of slab memory can be completely excluded from the reclaim process.

Can you, please, check if his fix will solve your problem?
Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .

> 
> 
> 
> Compared to initial occurrence I do now have some more details (all but
> /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> of reclaimable slabs seem to be dentries (and probably
> xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> to cgroups or not-accounted and not seen as candidate for reclaim...
> 
> xfs_buf           444908 445068    448   36    4 : tunables    0    0    0 : slabdata  12363  12363      0
> xfs_bui_item           0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_bud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_cui_item           0      0    456   35    4 : tunables    0    0    0 : slabdata      0      0      0
> xfs_cud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_rui_item           0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
> xfs_rud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_icr                0    156    208   39    2 : tunables    0    0    0 : slabdata      4      4      0
> xfs_ili           1223169 1535904    224   36    2 : tunables    0    0    0 : slabdata  42664  42664      0
> xfs_inode         12851565 22081140   1088   30    8 : tunables    0    0    0 : slabdata 736038 736038      0
> xfs_efi_item           0    280    456   35    4 : tunables    0    0    0 : slabdata      8      8      0
> xfs_efd_item           0    280    464   35    4 : tunables    0    0    0 : slabdata      8      8      0
> xfs_buf_item           7    216    296   27    2 : tunables    0    0    0 : slabdata      8      8      0
> xf_trans               0    224    288   28    2 : tunables    0    0    0 : slabdata      8      8      0
> xfs_ifork         12834992 46309928     72   56    1 : tunables    0    0    0 : slabdata 826963 826963      0
> xfs_da_state           0    224    512   32    4 : tunables    0    0    0 : slabdata      7      7      0
> xfs_btree_cur          0    224    256   32    2 : tunables    0    0    0 : slabdata      7      7      0
> xfs_bmap_free_item      0    230     88   46    1 : tunables    0    0    0 : slabdata      5      5      0
> xfs_log_ticket         4    296    216   37    2 : tunables    0    0    0 : slabdata      8      8      0
> fat_inode_cache        0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
> fat_cache              0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> mnt_cache            114    180    448   36    4 : tunables    0    0    0 : slabdata      5      5      0
> filp                6228  15582    384   42    4 : tunables    0    0    0 : slabdata    371    371      0
> inode_cache         6669  16016    608   26    4 : tunables    0    0    0 : slabdata    616    616      0
> dentry            8092159 15642504    224   36    2 : tunables    0    0    0 : slabdata 434514 434514      0
> 
> 
> 
> The full collected details are available at
>   https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt 
> (please take a copy as that file will not stay there forever)
> 
> A visual graph of memory evolution is available at
>   https://faramir-fj.hosting-restena.lu/system-memory-20201203.png 
> with reboot on Tuesday morning and steady increase of slabs starting
> Webnesday evening correlating with start of backup until trashing
> started at about 3:30 and the large drop in memory being me doing
>   echo 2 > /proc/sys/vm/drop_caches
> which stopped the trashing as well.
> 
> 
> Against what does memcg attempt reclaim when it tries to satisfy a CG's
> low limit? Only against siblings or also against root or not-accounted?
> How does it take into account slabs where evictable entries will cause
> unevictable entries to be freed as well?

Low limits work by excluding portions of memory from reclaim, not by
adding memory pressure to something else.

> 
> > > My setup, server has 64G of RAM:
> > >   root
> > >    + system        { min=0, low=128M, high=8G, max=8G }
> > >      + base        { no specific constraints }
> > >      + backup      { min=0, low=32M, high=2G, max=2G }
> > >      + shell       { no specific constraints }
> > >   + websrv         { min=0, low=4G, high=32G, max=32G }
> > >   + website        { min=0, low=16G, high=40T, max=40T }
> > >     + website1     { min=0, low=64M, high=2G, max=2G }
> > >     + website2     { min=0, low=64M, high=2G, max=2G }
> > >       ...
> > >   + remote         { min=0, low=1G, high=14G, max=14G }
> > >     + webuser1     { min=0, low=64M, high=2G, max=2G }
> > >     + webuser2     { min=0, low=64M, high=2G, max=2G }
> > >       ...
> 
> Also interesting is that backup which is forced into 2G
> (system/backup CG) causes amount of slabs assigned to websrv CG to
> increase until that CG has almost only slab entries assigned to it to
> fill 16G, like file cache being reclaimed but not slab entries even if
> there is almost no file cache left and tons of slabs.
> What I'm also surprised is the so much memory remains completely unused
> (instead of being used for file caches).
> 
> According to the documentation if I didn't get it wrong any limits of
> child CGs (e.g. webuser1...) are applied up to what their parent's
> limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> have 1000 webuserN they wont "reserve" 65G for themselves via
> memory.low limit when their parent sets memory.low to 1G?
> Or does this depend on on CG mount options (memory_recursiveprot)?

It does. What you're describing is the old (!memory_recursiveprot) behavior.

Thanks!



* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-03 20:55     ` Roman Gushchin
@ 2020-12-06 11:30       ` Bruno Prémont
  2020-12-10 11:08         ` Bruno Prémont
  0 siblings, 1 reply; 8+ messages in thread
From: Bruno Prémont @ 2020-12-06 11:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
	linux-mm, Vladimir Davydov

On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote:
> In the meantime Yang Shi discovered a problem related slab shrinkers,
> which is to some extent similar to what you describe: under certain conditions
> large amounts of slab memory can be completely excluded from the reclaim process.
> 
> Can you, please, check if his fix will solve your problem?
> Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .

I've added that patch on top of yours, but it doesn't seem to
completely help either. With this patch it seems that such dentries
might get reclaimed as a last resort instead of not at all.

I've added logs since current boot:
 https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt
 https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt
 https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt
showing the memory evolution. Things started to degrade over the past
night, when memory usage started to increase from 40G towards full use,
but with only slabs growing (not file cache) and the memory assigned to
cgroups staying more or less constant - even the root cgroup's memory
stats do not seem to account for a great deal of the used memory.

The only cgroup not "sufficiently" protected by memory.low (websrv) has
seen its memory use somehow clamped to about 16G, while it should be
allowed to go up to 32G according to memory.high; of those 16G in use,
at the time of writing it had only about 100M of file cache left, all
the rest being slabs.


As the system is now using most of its memory, I've bumped the websrv
CG's memory.low to 20G so it stays protected some more (which after a
few minutes showed its file cache growing again), with the aim of
moving pressure off the leaf cgroups and onto memory not assigned to
any cgroup.

So far this move seems to be getting me some success.


I will report back later today or tomorrow with more details on the
evolution with "no unused" memory. At least the production service
tends not to suffer (apart from storage response time).



I have the impression that memory reclaim now only looks at the cgroup,
and if it can make some progress there it will not bother looking
anywhere else.

I also have the vague impression that the distribution of my tasks over
the two NUMA nodes somehow impacts when or how memory reclaim happens.

NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

            CPUs    MEMs
  system:    0-3     0
  websrv:   8-11     0        (allowed mems extended to 0-1 as of
                               2020-12-06 12:18, after increasing
                               memory.low from 4G to 20G, which did
                               release quite some pressure)
  website: 12-23     1
  remote:    4-7     0
    (assignment done using cpuset cgroup)
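
For reference, with the v2 cpuset controller that assignment amounts to
something like the following (paths illustrative; a v1 cpuset hierarchy
uses the same file names under its own mount point):

  echo 12-23 > /sys/fs/cgroup/website/cpuset.cpus
  echo 1     > /sys/fs/cgroup/website/cpuset.mems
  echo 8-11  > /sys/fs/cgroup/websrv/cpuset.cpus
  echo 0-1   > /sys/fs/cgroup/websrv/cpuset.mems   # was 0 until 2020-12-06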

(It seems the NUMA distribution changed, or I missed the non-linear
node distribution of CPUs - cores versus hyperthreads - as my intent
was to have website on one socket and the rest on the other socket.
Memory is mapped as I planned it, but the tasks are not really.)


So calculating:
  system: 128M..8G     mems=0    \
  remote: 1G..14G      mems=0     }-->   5G..54G
  websrv: 4G..32G      mems=0    /
  website: 16G..       mems=1

But with everything except websrv lying below its low limit, I wonder
why reclaim only hits that cgroup's file cache but still mostly ignores
its slabs.


Bruno

> > Compared to initial occurrence I do now have some more details (all but
> > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> > of reclaimable slabs seem to be dentries (and probably
> > xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> > to cgroups or not-accounted and not seen as candidate for reclaim...
> > 
> > xfs_buf           444908 445068    448   36    4 : tunables    0    0    0 : slabdata  12363  12363      0
> > xfs_bui_item           0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_bud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_cui_item           0      0    456   35    4 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_cud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_rui_item           0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_rud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_icr                0    156    208   39    2 : tunables    0    0    0 : slabdata      4      4      0
> > xfs_ili           1223169 1535904    224   36    2 : tunables    0    0    0 : slabdata  42664  42664      0
> > xfs_inode         12851565 22081140   1088   30    8 : tunables    0    0    0 : slabdata 736038 736038      0
> > xfs_efi_item           0    280    456   35    4 : tunables    0    0    0 : slabdata      8      8      0
> > xfs_efd_item           0    280    464   35    4 : tunables    0    0    0 : slabdata      8      8      0
> > xfs_buf_item           7    216    296   27    2 : tunables    0    0    0 : slabdata      8      8      0
> > xf_trans               0    224    288   28    2 : tunables    0    0    0 : slabdata      8      8      0
> > xfs_ifork         12834992 46309928     72   56    1 : tunables    0    0    0 : slabdata 826963 826963      0
> > xfs_da_state           0    224    512   32    4 : tunables    0    0    0 : slabdata      7      7      0
> > xfs_btree_cur          0    224    256   32    2 : tunables    0    0    0 : slabdata      7      7      0
> > xfs_bmap_free_item      0    230     88   46    1 : tunables    0    0    0 : slabdata      5      5      0
> > xfs_log_ticket         4    296    216   37    2 : tunables    0    0    0 : slabdata      8      8      0
> > fat_inode_cache        0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
> > fat_cache              0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> > mnt_cache            114    180    448   36    4 : tunables    0    0    0 : slabdata      5      5      0
> > filp                6228  15582    384   42    4 : tunables    0    0    0 : slabdata    371    371      0
> > inode_cache         6669  16016    608   26    4 : tunables    0    0    0 : slabdata    616    616      0
> > dentry            8092159 15642504    224   36    2 : tunables    0    0    0 : slabdata 434514 434514      0
> > 
> > 
> > 
> > The full collected details are available at
> >   https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt 
> > (please take a copy as that file will not stay there forever)
> > 
> > A visual graph of memory evolution is available at
> >   https://faramir-fj.hosting-restena.lu/system-memory-20201203.png 
> > with reboot on Tuesday morning and steady increase of slabs starting
> > Webnesday evening correlating with start of backup until trashing
> > started at about 3:30 and the large drop in memory being me doing
> >   echo 2 > /proc/sys/vm/drop_caches
> > which stopped the trashing as well.
> > 
> > 
> > Against what does memcg attempt reclaim when it tries to satisfy a CG's
> > low limit? Only against siblings or also against root or not-accounted?
> > How does it take into account slabs where evictable entries will cause
> > unevictable entries to be freed as well?
> 
> Low limits are working by excluding some portions of memory from the reclaim,
> not by adding a memory pressure to something else.
> 
> > 
> > > > My setup, server has 64G of RAM:
> > > >   root
> > > >    + system        { min=0, low=128M, high=8G, max=8G }
> > > >      + base        { no specific constraints }
> > > >      + backup      { min=0, low=32M, high=2G, max=2G }
> > > >      + shell       { no specific constraints }
> > > >   + websrv         { min=0, low=4G, high=32G, max=32G }
> > > >   + website        { min=0, low=16G, high=40T, max=40T }
> > > >     + website1     { min=0, low=64M, high=2G, max=2G }
> > > >     + website2     { min=0, low=64M, high=2G, max=2G }
> > > >       ...
> > > >   + remote         { min=0, low=1G, high=14G, max=14G }
> > > >     + webuser1     { min=0, low=64M, high=2G, max=2G }
> > > >     + webuser2     { min=0, low=64M, high=2G, max=2G }
> > > >       ...
> > 
> > Also interesting is that backup which is forced into 2G
> > (system/backup CG) causes amount of slabs assigned to websrv CG to
> > increase until that CG has almost only slab entries assigned to it to
> > fill 16G, like file cache being reclaimed but not slab entries even if
> > there is almost no file cache left and tons of slabs.
> > What I'm also surprised is the so much memory remains completely unused
> > (instead of being used for file caches).
> > 
> > According to the documentation if I didn't get it wrong any limits of
> > child CGs (e.g. webuser1...) are applied up to what their parent's
> > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> > have 1000 webuserN they wont "reserve" 65G for themselves via
> > memory.low limit when their parent sets memory.low to 1G?
> > Or does this depend on on CG mount options (memory_recursiveprot)?
> 
> It does. What you're describing is the old (!memory_recursiveprot) behavior.
> 
> Thanks!




* Re: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
  2020-12-06 11:30       ` Bruno Prémont
@ 2020-12-10 11:08         ` Bruno Prémont
  0 siblings, 0 replies; 8+ messages in thread
From: Bruno Prémont @ 2020-12-10 11:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yafang Shao, Chris Down, Michal Hocko, Johannes Weiner, cgroups,
	linux-mm, Vladimir Davydov

Hello All,

Since my last changes (allowing the websrv CG to use both NUMA memory
areas) the system as a whole runs reasonably, without thrashing, and
also makes way better use of memory.

As such it really seems that NUMA node restrictions (memory-wise at
least) do not interact properly with reclaim.

Is there a way to see how memory usage is distributed across the
different NUMA nodes?
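
For the system-wide picture, per-node usage can at least be read from
/sys/devices/system/node/node*/meminfo (or via numastat -m from the
numactl package), though that only gives global per-node numbers, not a
per-cgroup breakdown:

  grep -E 'MemFree|FilePages|SReclaimable' \
      /sys/devices/system/node/node*/meminfo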

I have the impression some reclaim was going on because a memory
allocation was requested on one node which may have been "full", and
there only file cache was being reclaimed (in cgroups where memory.low
didn't protect it).


Thanks,
Bruno

On Sun, 6 Dec 2020 12:30:21 +0100 Bruno Prémont wrote:
> On Thu, 3 Dec 2020 12:55:59 -0800 Roman Gushchin <guro@fb.com> wrote:
> > In the meantime Yang Shi discovered a problem related slab shrinkers,
> > which is to some extent similar to what you describe: under certain conditions
> > large amounts of slab memory can be completely excluded from the reclaim process.
> > 
> > Can you, please, check if his fix will solve your problem?
> > Here is the final version: https://www.spinics.net/lists/stable/msg430601.html .  
> 
> I've added that patch on top of yours but it seems not to completely
> help either.
> With this patch is seems that such dentries might get reclaimed as a
> last resort instead of not at all.
> 
> I've added logs since current boot:
>  https://faramir-fj.hosting-restena.lu/cgmon-20201204.txt
>  https://faramir-fj.hosting-restena.lu/cgmon-20201205.txt
>  https://faramir-fj.hosting-restena.lu/cgmon-20201206.txt
> with the memory evolution. Evolution started to degrade over past night
> where memory usage started to increase from 40G tending to full use but
> with only slabs growing (not file cache) and memory assigned to cgroups
> staying more or less constant - even root cgroup's memory stats seem not
> to list a great deal of used memory.
> 
> The only cgroup not "sufficiently" protected by memory.low (websrv) has
> seen its memory use somehow clamped to about 16G while it should be
> allowed to go up to 32G according to memory.high and of those 16G in
> use at time of writing it only had 100M of file cache left, all the
> rest being slabs.
> 
> 
> As system now is using most of its memory I've bumped websrv CG's
> memory.low to 20G so it should stay protected some more (which after a
> few minutes showed its filecache growing again) with the aim of moving
> pressure out of leaf-cgroups to non-cg-assigned-memory.
> 
> Somehow this move seems to prove getting me some success.
> 
> 
> I will report back later today or tomorrow with more details on the
> evolution with "no unused" memory. At least production service tends not
> to suffer (more than from storage response time).
> 
> 
> 
> I have the impression that memory reclaim now only looks at cgroup and
> if it can make some progress it will not bother looking anywhere else.
> 
> I also have the vague impression that distribution of my tasks on the
> two NUMA nodes somehow impacts when or how memory reclaim happens.
> 
> NUMA node0 CPU(s):     0-5,12-17
> NUMA node1 CPU(s):     6-11,18-23
> 
>             CPUs    MEMs
>   system:    0-3     0
>   websrv:   8-11     0        (allowed mems=0-1 as of 2020-12-06 12:18
>                                after increasing memory.low from 4G
>                                to 20G which did release quite some
>                                pression and allow)
>   website: 12-23     1
>   remote:    4-7     0
>     (assignment done using cpuset cgroup)
> 
> (seems NUMA distribution changed or I missed the non-linear node
> distribution of CPUs - cores versus hyperthreading as my Intent was to
> have website on 1 socket and the rest on the other socket. Memory is
> mapped as I planned it, but tasks not really)
> 
> 
> So calculating:
>   system: 128M..8G     mems=0    \
>   remote: 1G..14G      mems=0     }-->   5G..54G
>   websrv: 4G..32G      mems=0    /
>   website: 16G..       mems=1
> 
> But with everything except websrv lying below its low limit I wonder why
> reclaim only hits the cgroups's file cache but still mostly ignores its
> slabs.
> 
> 
> Bruno
> 
> > > Compared to initial occurrence I do now have some more details (all but
> > > /proc/slabinfo since boot) and according to /proc/slabinfo a good deal
> > > of reclaimable slabs seem to be dentries (and probably
> > > xfs_inode/xfs_ifork related to them) - not sure if those are assigned
> > > to cgroups or not-accounted and not seen as candidate for reclaim...
> > > 
> > > xfs_buf           444908 445068    448   36    4 : tunables    0    0    0 : slabdata  12363  12363      0
> > > xfs_bui_item           0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_bud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_cui_item           0      0    456   35    4 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_cud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_rui_item           0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_rud_item           0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
> > > xfs_icr                0    156    208   39    2 : tunables    0    0    0 : slabdata      4      4      0
> > > xfs_ili           1223169 1535904    224   36    2 : tunables    0    0    0 : slabdata  42664  42664      0
> > > xfs_inode         12851565 22081140   1088   30    8 : tunables    0    0    0 : slabdata 736038 736038      0
> > > xfs_efi_item           0    280    456   35    4 : tunables    0    0    0 : slabdata      8      8      0
> > > xfs_efd_item           0    280    464   35    4 : tunables    0    0    0 : slabdata      8      8      0
> > > xfs_buf_item           7    216    296   27    2 : tunables    0    0    0 : slabdata      8      8      0
> > > xf_trans               0    224    288   28    2 : tunables    0    0    0 : slabdata      8      8      0
> > > xfs_ifork         12834992 46309928     72   56    1 : tunables    0    0    0 : slabdata 826963 826963      0
> > > xfs_da_state           0    224    512   32    4 : tunables    0    0    0 : slabdata      7      7      0
> > > xfs_btree_cur          0    224    256   32    2 : tunables    0    0    0 : slabdata      7      7      0
> > > xfs_bmap_free_item      0    230     88   46    1 : tunables    0    0    0 : slabdata      5      5      0
> > > xfs_log_ticket         4    296    216   37    2 : tunables    0    0    0 : slabdata      8      8      0
> > > fat_inode_cache        0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
> > > fat_cache              0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> > > mnt_cache            114    180    448   36    4 : tunables    0    0    0 : slabdata      5      5      0
> > > filp                6228  15582    384   42    4 : tunables    0    0    0 : slabdata    371    371      0
> > > inode_cache         6669  16016    608   26    4 : tunables    0    0    0 : slabdata    616    616      0
> > > dentry            8092159 15642504    224   36    2 : tunables    0    0    0 : slabdata 434514 434514      0
> > > 
> > > 
> > > 
> > > The full collected details are available at
> > >   https://faramir-fj.hosting-restena.lu/cgmon-20201203.txt 
> > > (please take a copy as that file will not stay there forever)
> > > 
> > > A visual graph of memory evolution is available at
> > >   https://faramir-fj.hosting-restena.lu/system-memory-20201203.png 
> > > with reboot on Tuesday morning and steady increase of slabs starting
> > > Webnesday evening correlating with start of backup until trashing
> > > started at about 3:30 and the large drop in memory being me doing
> > >   echo 2 > /proc/sys/vm/drop_caches
> > > which stopped the trashing as well.
> > > 
> > > 
> > > Against what does memcg attempt reclaim when it tries to satisfy a CG's
> > > low limit? Only against siblings or also against root or not-accounted?
> > > How does it take into account slabs where evictable entries will cause
> > > unevictable entries to be freed as well?  
> > 
> > Low limits are working by excluding some portions of memory from the reclaim,
> > not by adding a memory pressure to something else.
> >   
> > >   
> > > > > My setup, server has 64G of RAM:
> > > > >   root
> > > > >    + system        { min=0, low=128M, high=8G, max=8G }
> > > > >      + base        { no specific constraints }
> > > > >      + backup      { min=0, low=32M, high=2G, max=2G }
> > > > >      + shell       { no specific constraints }
> > > > >   + websrv         { min=0, low=4G, high=32G, max=32G }
> > > > >   + website        { min=0, low=16G, high=40T, max=40T }
> > > > >     + website1     { min=0, low=64M, high=2G, max=2G }
> > > > >     + website2     { min=0, low=64M, high=2G, max=2G }
> > > > >       ...
> > > > >   + remote         { min=0, low=1G, high=14G, max=14G }
> > > > >     + webuser1     { min=0, low=64M, high=2G, max=2G }
> > > > >     + webuser2     { min=0, low=64M, high=2G, max=2G }
> > > > >       ...  
> > > 
> > > Also interesting is that backup which is forced into 2G
> > > (system/backup CG) causes amount of slabs assigned to websrv CG to
> > > increase until that CG has almost only slab entries assigned to it to
> > > fill 16G, like file cache being reclaimed but not slab entries even if
> > > there is almost no file cache left and tons of slabs.
> > > What I'm also surprised is the so much memory remains completely unused
> > > (instead of being used for file caches).
> > > 
> > > According to the documentation if I didn't get it wrong any limits of
> > > child CGs (e.g. webuser1...) are applied up to what their parent's
> > > limits allow. Thus, if looking at e.g. remote -> webuser1... even if I
> > > have 1000 webuserN they wont "reserve" 65G for themselves via
> > > memory.low limit when their parent sets memory.low to 1G?
> > > Or does this depend on on CG mount options (memory_recursiveprot)?  
> > 
> > It does. What you're describing is the old (!memory_recursiveprot) behavior.
> > 
> > Thanks!  
> 




Thread overview: 8+ messages
2020-11-25 11:39 Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints Bruno Prémont
2020-11-25 13:37 ` Michal Hocko
2020-11-25 14:33   ` Bruno Prémont
2020-11-25 18:21 ` Roman Gushchin
2020-12-03 11:09   ` Bruno Prémont
2020-12-03 20:55     ` Roman Gushchin
2020-12-06 11:30       ` Bruno Prémont
2020-12-10 11:08         ` Bruno Prémont
