On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a
> limit on a cgroup with subgroups) we reclaim some memory from all
> cgroups. This is bad because the cgroup that allocates more often
> always wins. E.g. a job that allocates a lot of clean, rarely used
> page cache will push other jobs, with active and relatively small
> all-in-memory working sets, out of memory.
>
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they do not hurt
> others. But memory cgroups are very hard to configure right, because
> doing so requires precise knowledge of the workload, which may vary
> during execution. E.g. setting a memory limit means that the job
> won't be able to use all of the memory in the system for page cache,
> even if the rest of the system is idle. Basically, our current
> scheme requires configuring every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim only inactive pages, and only from cgroups that have big
> (!inactive_is_low()) inactive lists. And go back to shrinking active
> lists only if all inactive lists are low.

Your general idea seems like a good one, but the logic in the code
seems a little convoluted to me.

I wonder if we can simplify things a little, by checking (when we
enter page reclaim) whether the pgdat has enough inactive pages,
based on the node_page_state statistics, and basing our decision
whether or not to scan the active lists on that.

As it stands, your patch seems like the kind of code that makes
perfect sense today, but which will confuse people who look at the
code two years from now.

If the code could be made a little more explicit, great. If there are
good reasons to do things in the fallback way your current patch does
it, the code could use some good comments explaining why :)
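For concreteness, here is a minimal sketch of the kind of node-level
check I have in mind. The helper name pgdat_inactive_is_low() and the
bare inactive < active comparison are just illustration, not existing
code; node_page_state() and the NR_INACTIVE_*/NR_ACTIVE_* counters
are the existing node-level statistics:

static bool pgdat_inactive_is_low(struct pglist_data *pgdat, bool file)
{
	unsigned long inactive, active;

	if (file) {
		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
		active = node_page_state(pgdat, NR_ACTIVE_FILE);
	} else {
		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
		active = node_page_state(pgdat, NR_ACTIVE_ANON);
	}

	/*
	 * Same flavour of heuristic as the per-lruvec
	 * inactive_list_is_low(); a real version would want to scale
	 * the ratio with memory size the way that function does,
	 * rather than use a fixed 1:1 comparison.
	 */
	return inactive < active;
}

Reclaim could then compute this once on entry and skip the active
lists entirely while it returns false, instead of encoding the
fallback inside the per-cgroup scanning loop.

-- 
All Rights Reversed.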