* [patch v3 0/3 -mm] Soft limit rework
From: Michal Hocko @ 2013-05-13 7:46 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh

Hi,

This is the third version of the patchset. The first version was posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 (lkml wasn't CCed at the time so I cannot find it in the lwn.net archives). There were no major objections. The second version was posted here: http://lwn.net/Articles/548191/ as part of a longer and spicier thread which started after LSF: https://lwn.net/Articles/548192/. So this is a back-to-trees repost which tries to keep only the guts of the original post.

Changes between RFC (aka V1) -> V2
As there were no major objections, there were only some minor cleanups since the last version, and I have moved "memcg: Ignore soft limit until it is explicitly specified" to the end of the series.

Changes between V2 -> V3
No changes in the code since the last version; I have just rebased the series on top of the current mmotm tree. The most controversial part has been dropped (the last patch, "memcg: Ignore soft limit until it is explicitly specified"), so there are no semantic changes to the soft limit behavior. This makes the work mostly a code cleanup and reorganization. Nevertheless, it is enough to make soft limit reclaim work more efficiently according to my testing, and groups above the soft limit are reclaimed much less as a result.

The basic idea is quite simple. The first step pulls soft reclaim into shrink_zone and gets rid of the previous soft reclaim infrastructure. shrink_zone now runs in two passes: first it tries to do the soft limit reclaim, and it falls back to reclaim-all mode if no group is over the limit or no pages have been scanned.
The second pass happens at the same priority, so the only time we waste is the memcg tree walk, which shouldn't be a big deal [1]. There is certainly room for improvement in that direction (e.g. do not do the soft reclaim pass if no limit has been specified for any group yet, or if we know that no group is below the limit), but let's keep it simple for now. As a bonus we get rid of a _lot_ of code, and soft reclaim no longer stands out like before. The cleanup is in a separate patch because I felt it would be easier to review that way.

The second step is soft limit reclaim integration into targeted reclaim. It should be rather straightforward. Soft limit has been used only for global reclaim so far, but it makes sense for any kind of pressure coming from up the hierarchy, including targeted reclaim.

My primary test case was a parallel kernel build (each make is run with -j4 with a distribution .config in a separate cgroup without any hard limit) on an 8 CPU machine booted with 1GB memory. I was mostly interested in 2 setups: default - no soft limit set - and a 0 soft limit set for both groups. The first one should tell us whether the rework regresses the default behavior, while the second should show us improvements in an extreme case where both workloads are always over the soft limit. /usr/bin/time -v was used to collect the statistics, and each configuration had 3 runs after a fresh boot without any other load on the system.

* No-limit

Base kernel (before)
System:  min:238.26 max:240.87 avg:239.85 stdev:0.90
User:    min:1166.40 max:1173.79 avg:1169.92 stdev:2.41
Elapsed: min:09:38.06 max:09:57.96 avg:09:48.80 stdev:00:07.14

Rework (after)
System:  min:238.88[100.3%] max:242.43[100.6%] avg:240.70[100.4%] stdev:1.62
User:    min:1168.61[100.2%] max:1171.15[99.8%] avg:1169.69[100%] stdev:0.82
Elapsed: min:09:31.85[98.9%] max:09:57.74[100.0%] avg:09:47.57[99.8%] stdev:00:09.54

The numbers are within stdev, so I think we are doing well here.
* 0-limit

Base kernel (before)
System:  min:244.74 max:248.66 avg:246.39 stdev:1.20
User:    min:1190.51 max:1195.70 avg:1193.77 stdev:1.66
Elapsed: min:12:39.73 max:12:56.79 avg:12:48.33 stdev:00:06.59

Rework (after)
System:  min:238.64[97.5%] max:242.34[97.5%] avg:240.67[97.7%] stdev:1.26
User:    min:1166.20[98%] max:1170.98[98%] avg:1168.17[97.9%] stdev:1.44
Elapsed: min:09:47.53[77.3%] max:09:59.05[77.1%] avg:09:53.23[77.2%] stdev:00:04.72

We can see a 2% decrease in both System and User time, which is not sky high but sounds like a good outcome for a cleanup ;). It is even more interesting to check the Elapsed time numbers, which show that the parallel load is much more effective. I haven't looked deeply into the specific reasons for this boost, but I would guess that the priority-0 reclaim done in the original implementation is a big contributor.

Page fault statistics tell us at least part of the story:

Base
Major min:29686 max:34224 avg:31028.67 stdev:1503.11
Minor min:35919443 max:35980518 avg:35945936.50 stdev:21340.41
Rework
Major min:332[1.1%] max:883[2.6%] avg:478.67[1.5%] stdev:187.46
Minor min:35583644[99%] max:35660073[99%] avg:35621868.67[99%] stdev:29240.64

While the minor faults are within the noise, the major faults are reduced considerably. This looks like aggressive pageout during reclaim, and that pageout presumably affects the working set. If we look at the same numbers for no-limit (aka no soft limit in action), we get numbers very similar to the 0-limit rework:

Base   Major min:208.00 max:454.00 avg:375.17 stdev:86.73
Rework Major min:270.00 max:730.00 avg:431.67 stdev:143.58

So this clearly points at the priority-0 reclaim.
Shortlog says:

Michal Hocko (3):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Get rid of soft-limit tree infrastructure
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim

And the diffstat:

 include/linux/memcontrol.h |   12 +-
 mm/memcontrol.c            |  387 +++-----------------------------------------
 mm/vmscan.c                |   62 ++++---
 3 files changed, 65 insertions(+), 396 deletions(-)
* [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
From: Michal Hocko @ 2013-05-13 7:46 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh

Memcg soft reclaim has traditionally been triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up the group which exceeds its soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up to date (via memcg_check_events), which is not cost free. Although this overhead hasn't turned out to be a bottleneck, the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit, so we could easily end up with a group on a node-zone tree that has only a few pages from that node-zone.

This patch doesn't try to fix the node-zone tree management because integrating soft reclaim into zone shrinking seems much easier and more appropriate, for several reasons. First of all, 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should also be applicable to targeted reclaim, which is awkward right now without additional hacks. Last but not least, the whole infrastructure eats quite some code.

After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now, to stay compatible with the original state) and falls back to ignoring the soft limit if no group is eligible for soft reclaim or nothing has been scanned during the first pass.
Only groups which are over their soft limit, or which have any parent up the hierarchy over its limit, are considered eligible during the first pass.

The soft limit tree, which is no longer necessary, will be removed in a follow-up patch to make this patch smaller and easier to review.

Changes since v1
- __shrink_zone doesn't return the number of shrunk groups as the nr_scanned test covers both no groups scanned and no pages from the required zone, as pointed out by Johannes

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  161 ++++++-------------------------------------
 mm/vmscan.c                |   62 ++++++++++-------
 3 files changed, 59 insertions(+), 174 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);

 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }

 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					gfp_t gfp_mask,
-					unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
 {
-	return 0;
+	return false;
 }

 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..7e5d7a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2083,57 +2083,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif

-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = 0,
-	};
+/*
+ * A group is eligible for the soft limit reclaim if it is
+ *	a) is over its soft limit
+ *	b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;

-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
-	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		if (!mem_cgroup_reclaimable(victim, false))
-			continue;
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
-		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
-			break;
+	if (res_counter_soft_limit_excess(&memcg->res))
+		return true;
+
+	/*
+	 * If any parent up the hierarchy is over its soft limit then we
+	 * have to obey and reclaim from this group as well.
+	 */
+	while((parent = parent_mem_cgroup(parent))) {
+		if (res_counter_soft_limit_excess(&parent->res))
+			return true;
 	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
+
+	return false;
 }

 /*
@@ -4751,98 +4722,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }

-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
-						    gfp_mask, &nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->memcg->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->memcg->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->memcg->css);
-	return nr_reclaimed;
-}
-
 /**
  * mem_cgroup_force_empty_list - clears LRU of a group
  * @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa6a853..fe63a43 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return global_reclaim(sc);
+}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return false;
+}
 #endif

 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1943,7 +1953,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }

-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static void
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 {
 	unsigned long nr_reclaimed, nr_scanned;

@@ -1962,6 +1973,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;

+			if (soft_reclaim &&
+			    !mem_cgroup_soft_reclaim_eligible(memcg)) {
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);

 			shrink_lruvec(lruvec, sc);
@@ -1992,6 +2009,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 			 sc->nr_scanned - nr_scanned, sc));
 }

+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+	unsigned long nr_scanned = sc->nr_scanned;
+
+	__shrink_zone(zone, sc, do_soft_reclaim);
+
+	/*
+	 * No group is over the soft limit or those that are do not have
+	 * pages in the zone we are reclaiming so we have to reclaim everybody
+	 */
+	if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
+		__shrink_zone(zone, sc, false);
+		return;
+	}
+}
+
 /* Returns true if compaction should go ahead for a high-order request */
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
@@ -2053,8 +2088,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	bool aborted_reclaim = false;

 	/*
@@ -2094,18 +2127,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 			}
 		}
-		/*
-		 * This steals pages from memory cgroups over softlimit
-		 * and returns the number of reclaimed pages and
-		 * scanned pages. This works for global memory pressure
-		 * and balancing, not for a memcg's limit.
-		 */
-		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
-		sc->nr_reclaimed += nr_soft_reclaimed;
-		sc->nr_scanned += nr_soft_scanned;
 		/* need some check for avoid more shrink_zone() */

@@ -2628,8 +2649,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
@@ -2728,15 +2747,6 @@ loop_again:

 			sc.nr_scanned = 0;

-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free

--
1.7.10.4
mem_cgroup *root_memcg, - struct zone *zone, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .zone = zone, - .priority = 0, - }; +/* + * A group is eligible for the soft limit reclaim if it is + * a) is over its soft limit + * b) any parent up the hierarchy is over its soft limit + */ +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; - - while (1) { - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. - * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - if (!mem_cgroup_reclaimable(victim, false)) - continue; - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, - zone, &nr_scanned); - *total_scanned += nr_scanned; - if (!res_counter_soft_limit_excess(&root_memcg->res)) - break; + if (res_counter_soft_limit_excess(&memcg->res)) + return true; + + /* + * If any parent up the hierarchy is over its soft limit then we + * have to obey and reclaim from this group as well. 
+ */ + while((parent = parent_mem_cgroup(parent))) { + if (res_counter_soft_limit_excess(&parent->res)) + return true; } - mem_cgroup_iter_break(root_memcg, victim); - return total; + + return false; } /* @@ -4751,98 +4722,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, return ret; } -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_zone *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_zone *mctz; - unsigned long long excess; - unsigned long nr_scanned; - - if (order > 0) - return 0; - - mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone)); - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, - gfp_mask, &nr_scanned); - nr_reclaimed += reclaimed; - *total_scanned += nr_scanned; - spin_lock(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) { - do { - /* - * Loop until we find yet another one. - * - * By the time we get the soft_limit lock - * again, someone might have aded the - * group back on the RB tree. Iterate to - * make sure we get a different mem. 
- * mem_cgroup_largest_soft_limit_node returns - * NULL if no other cgroup is present on - * the tree - */ - next_mz = - __mem_cgroup_largest_soft_limit_node(mctz); - if (next_mz == mz) - css_put(&next_mz->memcg->css); - else /* next_mz == NULL or other memcg */ - break; - } while (1); - } - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->memcg->res); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. - * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - css_put(&mz->memcg->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. - */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->memcg->css); - return nr_reclaimed; -} - /** * mem_cgroup_force_empty_list - clears LRU of a group * @memcg: group to clear diff --git a/mm/vmscan.c b/mm/vmscan.c index fa6a853..fe63a43 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return global_reclaim(sc); +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) +{ + return false; +} #endif static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) @@ -1943,7 +1953,8 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static void shrink_zone(struct zone *zone, 
struct scan_control *sc) +static void +__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) { unsigned long nr_reclaimed, nr_scanned; @@ -1962,6 +1973,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec; + if (soft_reclaim && + !mem_cgroup_soft_reclaim_eligible(memcg)) { + memcg = mem_cgroup_iter(root, memcg, &reclaim); + continue; + } + lruvec = mem_cgroup_zone_lruvec(zone, memcg); shrink_lruvec(lruvec, sc); @@ -1992,6 +2009,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) sc->nr_scanned - nr_scanned, sc)); } + +static void shrink_zone(struct zone *zone, struct scan_control *sc) +{ + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); + unsigned long nr_scanned = sc->nr_scanned; + + __shrink_zone(zone, sc, do_soft_reclaim); + + /* + * No group is over the soft limit or those that are do not have + * pages in the zone we are reclaiming so we have to reclaim everybody + */ + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { + __shrink_zone(zone, sc, false); + return; + } +} + /* Returns true if compaction should go ahead for a high-order request */ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) { @@ -2053,8 +2088,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; bool aborted_reclaim = false; /* @@ -2094,18 +2127,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) continue; } } - /* - * This steals pages from memory cgroups over softlimit - * and returns the number of reclaimed pages and - * scanned pages. This works for global memory pressure - * and balancing, not for a memcg's limit. 
- */ - nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - sc->order, sc->gfp_mask, - &nr_soft_scanned); - sc->nr_reclaimed += nr_soft_reclaimed; - sc->nr_scanned += nr_soft_scanned; /* need some check for avoid more shrink_zone() */ } @@ -2628,8 +2649,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ struct reclaim_state *reclaim_state = current->reclaim_state; - unsigned long nr_soft_reclaimed; - unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .may_unmap = 1, @@ -2728,15 +2747,6 @@ loop_again: sc.nr_scanned = 0; - nr_soft_scanned = 0; - /* - * Call soft limit reclaim before calling shrink_zone. - */ - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, - order, sc.gfp_mask, - &nr_soft_scanned); - sc.nr_reclaimed += nr_soft_reclaimed; - /* * We put equal pressure on every zone, unless * one zone has way too many pages free -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 74+ messages in thread
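The two-pass control flow the patch introduces can be sketched as a standalone userspace C program. Everything below is a simplified stand-in, not the kernel API: a memcg is reduced to a parent pointer plus a soft-limit excess counter, and the hierarchy walk is modeled by a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct mem_cgroup: a node in a tree with a
 * soft-limit excess counter and a count of reclaimable zone pages. */
struct memcg {
    struct memcg *parent;
    long soft_limit_excess;   /* pages over the soft limit, 0 if under */
    long zone_pages;          /* reclaimable pages in this zone */
};

/* Mirrors mem_cgroup_soft_reclaim_eligible(): a group is eligible if
 * it, or any parent up the hierarchy, is over its soft limit. */
static bool soft_reclaim_eligible(struct memcg *memcg)
{
    struct memcg *iter;

    for (iter = memcg; iter; iter = iter->parent)
        if (iter->soft_limit_excess > 0)
            return true;
    return false;
}

/* One __shrink_zone()-like pass over the groups; ineligible groups are
 * skipped when soft_reclaim is set.  Returns pages scanned. */
static long shrink_zone_pass(struct memcg **groups, size_t n, bool soft_reclaim)
{
    long nr_scanned = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (soft_reclaim && !soft_reclaim_eligible(groups[i]))
            continue;
        nr_scanned += groups[i]->zone_pages;
    }
    return nr_scanned;
}

/* Mirrors the reworked shrink_zone(): soft pass first, then fall back
 * to the reclaim-everybody pass if nothing was scanned. */
static long shrink_zone_sim(struct memcg **groups, size_t n)
{
    long nr_scanned = shrink_zone_pass(groups, n, true);

    if (nr_scanned == 0)
        nr_scanned = shrink_zone_pass(groups, n, false);
    return nr_scanned;
}
```

With a hierarchy where only one group is over its limit, the soft pass scans that group and its children (eligible through the parent) while untouched siblings are skipped; with no group over the limit, the first pass scans nothing and the fallback pass reclaims from everybody.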
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-13 7:46 ` Michal Hocko @ 2013-05-15 8:34 ` Glauber Costa -1 siblings, 0 replies; 74+ messages in thread From: Glauber Costa @ 2013-05-15 8:34 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh On 05/13/2013 11:46 AM, Michal Hocko wrote: > Memcg soft reclaim has been traditionally triggered from the global > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim > then picked up a group which exceeds the soft limit the most and > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. > > The infrastructure requires per-node-zone trees which hold over-limit > groups and keep them up-to-date (via memcg_check_events) which is not > cost free. Although this overhead hasn't turned out to be a bottle neck > the implementation is suboptimal because mem_cgroup_update_tree has no > idea which zones consumed memory over the limit so we could easily end > up having a group on a node-zone tree having only few pages from that > node-zone. > > This patch doesn't try to fix node-zone trees management because it > seems that integrating soft reclaim into zone shrinking sounds much > easier and more appropriate for several reasons. > First of all 0 priority reclaim was a crude hack which might lead to > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot > of dirty/writeback pages). > Soft reclaim should be applicable also to the targeted reclaim which is > awkward right now without additional hacks. > Last but not least the whole infrastructure eats quite some code. > > After this patch shrink_zone is done in 2 passes. 
First it tries to do the > soft reclaim if appropriate (only for global reclaim for now to keep > compatible with the original state) and fall back to ignoring soft limit > if no group is eligible to soft reclaim or nothing has been scanned > during the first pass. Only groups which are over their soft limit or > any of their parents up the hierarchy is over the limit are considered > eligible during the first pass. > > Soft limit tree which is not necessary anymore will be removed in the > follow up patch to make this patch smaller and easier to review. > > Changes since v1 > - __shrink_zone doesn't return the number of shrunk groups as nr_scanned > test covers both no groups scanned and no pages from the required zone > as pointed by Johannes > > Signed-off-by: Michal Hocko <mhocko@suse.cz> Patch looks fine to me Reviewed-by: Glauber Costa <glommer@openvz.org> ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-13 7:46 ` Michal Hocko (?) @ 2013-05-16 22:12 ` Tejun Heo -1 siblings, 0 replies; 74+ messages in thread From: Tejun Heo @ 2013-05-16 22:12 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh Sorry about the delay. Just getting back to memcg. On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote: ... > during the first pass. Only groups which are over their soft limit or > any of their parents up the hierarchy is over the limit are considered ancestors? > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > +{ > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > + unsigned long nr_scanned = sc->nr_scanned; > + > + __shrink_zone(zone, sc, do_soft_reclaim); > + > + /* > + * No group is over the soft limit or those that are do not have > + * pages in the zone we are reclaiming so we have to reclaim everybody > + */ > + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { > + __shrink_zone(zone, sc, false); > + return; > + } > +} Maybe the following is easier to follow? if (mem_cgroup_should_soft_reclaim(sc)) { __shrink_zone(zone, sc, true); if (sc->nr_scanned == nr_scanned) __shrink_zone(zone, sc, false); } else { __shrink_zone(zone, sc, false); } But it's a minor point, please feel free to ignore. Reviewed-by: Tejun Heo <tj@kernel.org> Thanks. -- tejun ^ permalink raw reply [flat|nested] 74+ messages in thread
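Tejun's suggested arrangement and the shape in the patch are behaviorally identical. A toy harness (the logging helper and flag encoding below are illustrative, not kernel code) can check that both versions invoke __shrink_zone with the same soft-reclaim flags in the same order for every combination of "should soft reclaim" and "soft pass made progress":

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model: __shrink_zone() is replaced by appending its soft_reclaim
 * flag to a log ('S' for a soft pass, 'F' for a full pass), so the two
 * arrangements can be compared call-for-call. */
static void shrink(char *log, bool soft)
{
    size_t len = strlen(log);

    log[len] = soft ? 'S' : 'F';
    log[len + 1] = '\0';
}

/* The shape in the patch: decide once, fall back when nothing scanned.
 * `progress` models sc->nr_scanned having advanced in the soft pass. */
static void shrink_zone_v1(char *log, bool should_soft, bool progress)
{
    shrink(log, should_soft);
    if (should_soft && !progress)
        shrink(log, false);
}

/* Tejun's suggested shape: branch on should_soft up front. */
static void shrink_zone_v2(char *log, bool should_soft, bool progress)
{
    if (should_soft) {
        shrink(log, true);
        if (!progress)
            shrink(log, false);
    } else {
        shrink(log, false);
    }
}
```

Both shapes produce "F" when soft reclaim is off, "S" when the soft pass makes progress, and "SF" when it does not, so the choice between them is purely stylistic, as Tejun notes.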
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-16 22:12 ` Tejun Heo @ 2013-05-16 22:15 ` Tejun Heo -1 siblings, 0 replies; 74+ messages in thread From: Tejun Heo @ 2013-05-16 22:15 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh One more thing, Given that this is a rather significant behavior change, it probably is a good idea to include the benchmark results from the head message? Thanks. -- tejun ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-16 22:15 ` Tejun Heo (?) @ 2013-05-17 7:16 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-17 7:16 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh On Thu 16-05-13 15:15:01, Tejun Heo wrote: > One more thing, > > Given that this is a rather significant behavior change, it probably > is a good idea to include the benchmark results from the head > message? The testing I have done was on top of the complete series. The last patch should be irrelevant as I have tested the global reclaim but the second patch might still influence figures a tiny bit (we still do the soft limit tree thing). That's why I haven't pushed the numbers here. I can add that information if people prefer or just ask Andrew to squash the leader email into the first patch as he tends to do quite often in other cases as well. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-16 22:12 ` Tejun Heo @ 2013-05-17 7:12 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-17 7:12 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh On Thu 16-05-13 15:12:00, Tejun Heo wrote: > Sorry about the delay. Just getting back to memcg. > > On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote: > ... > > during the first pass. Only groups which are over their soft limit or > > any of their parents up the hierarchy is over the limit are considered > > ancestors? Well, ancestors might or might not be part of the hierarchy, so while it is much shorter I consider the original more precise. > > +static void shrink_zone(struct zone *zone, struct scan_control *sc) > > +{ > > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc); > > + unsigned long nr_scanned = sc->nr_scanned; > > + > > + __shrink_zone(zone, sc, do_soft_reclaim); > > + > > + /* > > + * No group is over the soft limit or those that are do not have > > + * pages in the zone we are reclaiming so we have to reclaim everybody > > + */ > > + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) { > > + __shrink_zone(zone, sc, false); > > + return; > > + } > > +} > > Maybe the following is easier to follow? > > if (mem_cgroup_should_soft_reclaim(sc)) { > __shrink_zone(zone, sc, true); > if (sc->nr_scanned == nr_scanned) > __shrink_zone(zone, sc, false); > } else { > __shrink_zone(zone, sc, false); > } > > But it's a minor point, please feel free to ignore. I will stick with the original code as I have some plans to build on top of that. > Reviewed-by: Tejun Heo <tj@kernel.org> Thank you! > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-13  7:46 ` Michal Hocko
@ 2013-05-17 16:02 ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2013-05-17 16:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, KAMEZAWA Hiroyuki,
      Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse,
      Greg Thelen, Tejun Heo, Balbir Singh

On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> Memcg soft reclaim has traditionally been triggered from the global
> reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> then picked the group which exceeds its soft limit the most and
> reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.
>
> The infrastructure requires per-node-zone trees which hold the
> over-limit groups and keep them up to date (via memcg_check_events),
> which is not cost free. Although this overhead hasn't turned out to be
> a bottleneck, the implementation is suboptimal because
> mem_cgroup_update_tree has no idea which zones consumed memory over the
> limit, so we could easily end up with a group on a node-zone tree that
> has only a few pages from that node-zone.
>
> This patch doesn't try to fix node-zone tree management, because
> integrating soft reclaim into zone shrinking seems much easier and more
> appropriate, for several reasons.
> First of all, priority-0 reclaim was a crude hack which might lead to
> big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> of dirty/writeback pages).
> Soft reclaim should also be applicable to targeted reclaim, which is
> awkward right now without additional hacks.
> Last but not least, the whole infrastructure eats quite some code.
>
> After this patch shrink_zone is done in 2 passes. First it tries to do
> soft reclaim if appropriate (only for global reclaim for now, to stay
> compatible with the original behavior) and falls back to ignoring the
> soft limit if no group is eligible for soft reclaim or nothing has been
> scanned during the first pass. Only groups which are over their soft
> limit, or which have an ancestor over its limit somewhere up the
> hierarchy, are considered eligible during the first pass.

There are setups with thousands of groups that do not even use soft
limits. Having them pointlessly iterate over all of them for every
couple of pages reclaimed is just not acceptable.

This is not the first time this implementation was proposed, either.
I'm afraid we have now truly gone full circle on this stuff.
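The two-pass scheme the changelog describes can be illustrated with a small userspace sketch. This is not the kernel code; all names (`struct group`, `soft_reclaim_eligible`, `shrink_groups`) are made up for illustration, and "reclaim" is reduced to decrementing a usage counter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
	long usage;
	long soft_limit;	/* -1 == no soft limit set */
	struct group *parent;
};

/* Eligible if the group, or any ancestor, exceeds its soft limit. */
static bool soft_reclaim_eligible(const struct group *g)
{
	for (; g; g = g->parent)
		if (g->soft_limit >= 0 && g->usage > g->soft_limit)
			return true;
	return false;
}

/* "Reclaims" one page per visited group; returns pages scanned. */
static long shrink_groups(struct group **groups, size_t n, bool soft_only)
{
	long scanned = 0;

	for (size_t i = 0; i < n; i++) {
		if (soft_only && !soft_reclaim_eligible(groups[i]))
			continue;
		if (groups[i]->usage > 0) {
			groups[i]->usage--;
			scanned++;
		}
	}
	return scanned;
}

static long shrink_zone(struct group **groups, size_t n)
{
	/* First pass: soft limit reclaim only. */
	long scanned = shrink_groups(groups, n, true);

	/* Fall back to reclaim-all if nothing was eligible or scanned. */
	if (!scanned)
		scanned = shrink_groups(groups, n, false);
	return scanned;
}
```

Johannes' objection is visible in the sketch: even when nothing is eligible, the first pass still walks every group before the fallback runs.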
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-17 16:02 ` Johannes Weiner
@ 2013-05-17 16:57 ` Tejun Heo
  0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2013-05-17 16:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, linux-mm, cgroups, linux-kernel,
      KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa,
      Michel Lespinasse, Greg Thelen, Balbir Singh

Hello, Johannes.

On Fri, May 17, 2013 at 12:02:47PM -0400, Johannes Weiner wrote:
> There are setups with thousands of groups that do not even use soft
> limits. Having them pointlessly iterate over all of them for every
> couple of pages reclaimed is just not acceptable.

Hmmm... if the iteration is the problem, it shouldn't be difficult to
build a list of the children which should be iterated. Would that make
it acceptable?

Thanks.

--
tejun
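A minimal userspace sketch of the list idea Tejun floats here: track only the groups that have a soft limit configured on a plain linked list, so the reclaim pass never touches the (possibly thousands of) unconfigured groups. Names are hypothetical; the kernel version would use `list_head` with proper unlinking on unset and locking, both omitted here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg {
	long soft_limit;		/* -1 == unset */
	struct memcg *next_limited;	/* link in the soft-limit list */
	bool on_list;
};

static struct memcg *soft_limited_head;

/* Hook where the limit is configured; adds the group to the list once. */
static void memcg_set_soft_limit(struct memcg *m, long limit)
{
	m->soft_limit = limit;
	if (limit >= 0 && !m->on_list) {
		m->next_limited = soft_limited_head;
		soft_limited_head = m;
		m->on_list = true;
	}
}

/* The soft reclaim pass walks only the configured groups. */
static int count_soft_limited(void)
{
	int n = 0;

	for (struct memcg *m = soft_limited_head; m; m = m->next_limited)
		n++;
	return n;
}
```

With no soft limits configured anywhere, the list is empty and the soft reclaim pass is free, which is the case Johannes was worried about.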
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-17 16:57 ` Tejun Heo
@ 2013-05-17 17:27 ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2013-05-17 17:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Andrew Morton, linux-mm, cgroups, linux-kernel,
      KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa,
      Michel Lespinasse, Greg Thelen, Balbir Singh

Tejun Heo <tj@kernel.org> wrote:
> Hello, Johannes.
>
> On Fri, May 17, 2013 at 12:02:47PM -0400, Johannes Weiner wrote:
>> There are setups with thousands of groups that do not even use soft
>> limits. Having them pointlessly iterate over all of them for every
>> couple of pages reclaimed is just not acceptable.
>
> Hmmm... if the iteration is the problem, it shouldn't be difficult to
> build a list of children which should be iterated. Would that make it
> acceptable?

You mean, a separate structure that tracks which groups are in excess
of the limit? Like the current tree? :)

Kidding aside, yes, that would be better, and an unsorted list would
probably be enough for the global case.

To support soft limits in targeted reclaim later on, we could maybe
propagate tags up the cgroup tree when a group is in excess, so that
reclaim can be smarter about which subtrees to test for soft limits and
which to skip during the soft limit pass. The no-soft-limit-set-anywhere
case is then only a single tag test in the root cgroup.

But starting with the list would be simple enough, delete a bunch of
code, come with the same performance improvements, etc.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
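The tag-propagation idea sketched above can be modeled as a per-group count of over-limit groups in the subtree, updated along the parent chain whenever a group crosses its soft limit. Reclaim can then prune any subtree whose count is zero, and the no-soft-limit-anywhere case is one test on the root. This is an illustrative userspace model with invented names, ignoring the atomics/locking the kernel would need:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cg {
	struct cg *parent;
	bool over_limit;	/* this group exceeds its own soft limit */
	int excess_in_subtree;	/* over-limit groups in this subtree */
};

/* Called when a group crosses (either way) its soft limit. */
static void cg_update_excess(struct cg *c, bool over)
{
	if (c->over_limit == over)
		return;
	c->over_limit = over;
	/* Propagate the change up to the root. */
	for (; c; c = c->parent)
		c->excess_in_subtree += over ? 1 : -1;
}

/* The soft limit pass can skip any subtree where this is false. */
static bool cg_soft_reclaim_needed(const struct cg *c)
{
	return c->excess_in_subtree > 0;
}
```

The update is O(depth of the hierarchy) per limit crossing, while the reclaim-side test is O(1) per subtree visited, which matches the trade-off Johannes describes.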
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-17 17:27 ` Johannes Weiner
@ 2013-05-17 17:45 ` Tejun Heo
  0 siblings, 0 replies; 74+ messages in thread
From: Tejun Heo @ 2013-05-17 17:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, linux-mm, Cgroups, lkml,
      KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa,
      Michel Lespinasse, Greg Thelen, Balbir Singh

Hello,

On Fri, May 17, 2013 at 10:27 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> Hmmm... if the iteration is the problem, it shouldn't be difficult to
>> build a list of children which should be iterated. Would that make it
>> acceptable?
>
> You mean, a separate structure that tracks which groups are in excess
> of the limit? Like the current tree? :)

Heh, yeah, realized that after writing it, but it can be something much
simpler, i.e. just a linked list of the children which have a soft limit
configured.

> Kidding aside, yes, that would be better, and an unsorted list would
> probably be enough for the global case.

Yeap.

> To support target reclaim soft limits later on, we could maybe
> propagate tags upwards the cgroup tree when a group is in excess so
> that reclaim can be smarter about which subtrees to test for soft
> limits and which to skip during the soft limit pass. The
> no-softlimit-set-anywhere case is then only a single tag test in the
> root cgroup.
>
> But starting with the list would be simple enough, delete a bunch of
> code, come with the same performance improvements etc.

Thanks.

--
tejun
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-17 16:02 ` Johannes Weiner
@ 2013-05-20 14:44 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-20 14:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, KAMEZAWA Hiroyuki,
      Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse,
      Greg Thelen, Tejun Heo, Balbir Singh

On Fri 17-05-13 12:02:47, Johannes Weiner wrote:
> On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> > Memcg soft reclaim has traditionally been triggered from the global
> > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> > then picked the group which exceeds its soft limit the most and
> > reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.
> >
> > The infrastructure requires per-node-zone trees which hold the
> > over-limit groups and keep them up to date (via memcg_check_events),
> > which is not cost free. Although this overhead hasn't turned out to be
> > a bottleneck, the implementation is suboptimal because
> > mem_cgroup_update_tree has no idea which zones consumed memory over
> > the limit, so we could easily end up with a group on a node-zone tree
> > that has only a few pages from that node-zone.
> >
> > This patch doesn't try to fix node-zone tree management, because
> > integrating soft reclaim into zone shrinking seems much easier and
> > more appropriate, for several reasons.
> > First of all, priority-0 reclaim was a crude hack which might lead to
> > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> > of dirty/writeback pages).
> > Soft reclaim should also be applicable to targeted reclaim, which is
> > awkward right now without additional hacks.
> > Last but not least, the whole infrastructure eats quite some code.
> >
> > After this patch shrink_zone is done in 2 passes. First it tries to do
> > soft reclaim if appropriate (only for global reclaim for now, to stay
> > compatible with the original behavior) and falls back to ignoring the
> > soft limit if no group is eligible for soft reclaim or nothing has
> > been scanned during the first pass. Only groups which are over their
> > soft limit, or which have an ancestor over its limit somewhere up the
> > hierarchy, are considered eligible during the first pass.
>
> There are setups with thousands of groups that do not even use soft
> limits. Having them pointlessly iterate over all of them for every
> couple of pages reclaimed is just not acceptable.

OK, that is a fair point. The soft reclaim pass should not be done if we
know that every group is below its limit. This can be fixed easily:
mem_cgroup_should_soft_reclaim could check a counter of over-limit
groups.

This still doesn't solve the problem if there are relatively few groups
over the limit compared to those that are under it. You are proposing a
simple list in the follow-up email. I considered this approach as well
but decided not to go that way, because the list iteration doesn't
respect the per-node-zone-priority tree walk, which makes it tricky to
prevent over-reclaim with many parallel reclaimers. I rather wanted to
integrate the soft reclaim into the reclaim tree walk. There is also a
problem when all groups are in excess: the whole tree then collapses
into a linked list, which is not nice either (hmm, this could be
mitigated if only the group in excess which is highest in the hierarchy
were on the list).

I was playing with a huge number of groups a bit, and my conclusion was
that although the System time for the workload was higher with more
groups, the Elapsed time was still better than or comparable with the
original kernel. I do not know how many groups are used in practice; I
have never heard of configurations with more than thousands of memcgs,
but even for 8k groups the results didn't look too crazy.
I had one group (call it A) with the streaming IO load (dd if=/dev/zero
of=file with 4*TotalRam size) and a parallel hierarchy with 2 groups
with up to 12 levels each (512, 1024, 4096, 8192 groups) and no limit
set. I have compared the results with the same configuration on the base
kernel. Two configurations have been tested: the `A' group without any
soft limit, and with its soft limit set to 0. The first configuration
measures the overhead of the additional pass, as no soft reclaim is done
in either the base kernel or the rework. The second configuration
compares the effectiveness of the reworked implementation wrt. the base
kernel.

* No soft limit set

Elapsed
500-no-limit/base: min: 16.32 max: 18.03 avg: 17.37 std: 0.75 runs: 3
500-no-limit/rework: min: 15.76 [96.6%] max: 19.72 [109.4%] avg: 17.49 [100.7%] std: 1.66 runs: 3
User
500-no-limit/base: min: 1.53 max: 1.60 avg: 1.57 std: 0.03 runs: 3
500-no-limit/rework: min: 1.18 [77.1%] max: 1.45 [90.6%] avg: 1.33 [84.7%] std: 0.11 runs: 3
System
500-no-limit/base: min: 38.60 max: 41.54 avg: 39.95 std: 1.21 runs: 3
500-no-limit/rework: min: 39.78 [103.1%] max: 42.93 [103.3%] avg: 41.06 [102.8%] std: 1.35 runs: 3

Elapsed
1k-no-limit/base: min: 37.04 max: 43.36 avg: 40.26 std: 2.58 runs: 3
1k-no-limit/rework: min: 16.38 [44.2%] max: 17.82 [41.1%] avg: 17.22 [42.8%] std: 0.61 runs: 3
User
1k-no-limit/base: min: 1.12 max: 1.38 avg: 1.24 std: 0.11 runs: 3
1k-no-limit/rework: min: 1.11 [99.1%] max: 1.26 [91.3%] avg: 1.20 [96.8%] std: 0.07 runs: 3
System
1k-no-limit/base: min: 33.51 max: 36.29 avg: 34.99 std: 1.14 runs: 3
1k-no-limit/rework: min: 45.09 [134.6%] max: 49.52 [136.5%] avg: 47.99 [137.2%] std: 2.05 runs: 3

Elapsed
4k-no-limit/base: min: 40.04 max: 47.14 avg: 44.46 std: 3.15 runs: 3
4k-no-limit/rework: min: 30.38 [75.9%] max: 37.66 [79.9%] avg: 34.24 [77.0%] std: 2.99 runs: 3
User
4k-no-limit/base: min: 1.16 max: 1.33 avg: 1.25 std: 0.07 runs: 3
4k-no-limit/rework: min: 0.70 [60.3%] max: 0.82 [61.7%] avg: 0.77 [61.6%] std: 0.05 runs: 3
System
4k-no-limit/base: min: 37.91 max: 39.91 avg: 39.19 std: 0.91 runs: 3
4k-no-limit/rework: min: 130.35 [343.8%] max: 133.26 [333.9%] avg: 131.63 [335.9%] std: 1.21 runs: 3

Elapsed
8k-no-limit/base: min: 41.27 max: 50.60 avg: 45.51 std: 3.86 runs: 3
8k-no-limit/rework: min: 39.56 [95.9%] max: 52.12 [103.0%] avg: 44.49 [97.8%] std: 5.47 runs: 3
User
8k-no-limit/base: min: 1.26 max: 1.38 avg: 1.32 std: 0.05 runs: 3
8k-no-limit/rework: min: 0.68 [54.0%] max: 0.82 [59.4%] avg: 0.73 [55.3%] std: 0.06 runs: 3
System
8k-no-limit/base: min: 39.93 max: 40.73 avg: 40.25 std: 0.34 runs: 3
8k-no-limit/rework: min: 228.74 [572.9%] max: 238.43 [585.4%] avg: 232.57 [577.8%] std: 4.21 runs: 3

* Soft limit set to 0 for the group with the dd load

Elapsed
500-0-limit/base: min: 30.29 max: 38.91 avg: 34.83 std: 3.53 runs: 3
500-0-limit/rework: min: 14.34 [47.3%] max: 17.18 [44.2%] avg: 16.01 [46.0%] std: 1.21 runs: 3
User
500-0-limit/base: min: 1.14 max: 1.29 avg: 1.24 std: 0.07 runs: 3
500-0-limit/rework: min: 1.42 [124.6%] max: 1.47 [114.0%] avg: 1.44 [116.1%] std: 0.02 runs: 3
System
500-0-limit/base: min: 31.94 max: 35.66 avg: 33.77 std: 1.52 runs: 3
500-0-limit/rework: min: 45.25 [141.7%] max: 47.43 [133.0%] avg: 46.27 [137.0%] std: 0.89 runs: 3

Elapsed
1k-0-limit/base: min: 37.23 max: 45.11 avg: 40.48 std: 3.36 runs: 3
1k-0-limit/rework: min: 15.18 [40.8%] max: 18.69 [41.4%] avg: 16.99 [42.0%] std: 1.44 runs: 3
User
1k-0-limit/base: min: 1.33 max: 1.56 avg: 1.44 std: 0.09 runs: 3
1k-0-limit/rework: min: 1.31 [98.5%] max: 1.55 [99.4%] avg: 1.44 [100.0%] std: 0.10 runs: 3
System
1k-0-limit/base: min: 33.21 max: 34.44 avg: 33.77 std: 0.51 runs: 3
1k-0-limit/rework: min: 45.52 [137.1%] max: 50.82 [147.6%] avg: 48.76 [144.4%] std: 2.32 runs: 3

Elapsed
4k-0-limit/base: min: 42.71 max: 47.83 avg: 45.45 std: 2.11 runs: 3
4k-0-limit/rework: min: 34.24 [80.2%] max: 34.99 [73.2%] avg: 34.56 [76.0%] std: 0.32 runs: 3
User
4k-0-limit/base: min: 1.11 max: 1.34 avg: 1.21 std: 0.10 runs: 3
4k-0-limit/rework: min: 0.80 [72.1%] max: 0.87 [64.9%] avg: 0.83 [68.6%] std: 0.03 runs: 3
System
4k-0-limit/base: min: 37.08 max: 40.28 avg: 38.91 std: 1.35 runs: 3
4k-0-limit/rework: min: 131.08 [353.5%] max: 132.33 [328.5%] avg: 131.66 [338.4%] std: 0.51 runs: 3

Elapsed
8k-0-limit/base: min: 35.71 max: 47.18 avg: 43.19 std: 5.29 runs: 3
8k-0-limit/rework: min: 43.95 [123.1%] max: 59.77 [126.7%] avg: 50.48 [116.9%] std: 6.75 runs: 3
User
8k-0-limit/base: min: 1.18 max: 1.21 avg: 1.19 std: 0.01 runs: 3
8k-0-limit/rework: min: 0.72 [61.0%] max: 0.85 [70.2%] avg: 0.77 [64.7%] std: 0.06 runs: 3
System
8k-0-limit/base: min: 38.34 max: 39.91 avg: 39.24 std: 0.66 runs: 3
8k-0-limit/rework: min: 196.90 [513.6%] max: 235.32 [589.6%] avg: 222.34 [566.6%] std: 17.99 runs: 3

As we can see, the System time climbs really high, but the Elapsed time
is better than in the base kernel (except for 8k-0-limit). With more
reclaimers the System time should be amortized further, because the
reclaim tree walk is shared.

I think the numbers can be improved even without introducing the list of
groups in excess. One way to go could be adding a conditional (callback)
to the memcg iterator, so that groups under the limit would be excluded
during the walk without playing with css references and other things. My
quick and dirty patch shows that the 4k-0-limit System time was reduced
by 40% wrt. this patchset. With proper tagging we can make the walk
close to free.

Nevertheless, I guess I can live with the excess list as well if the
above sounds like a no-go for you.

[...]
--
Michal Hocko
SUSE Labs
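The cheap gate Michal mentions first ("mem_cgroup_should_soft_reclaim could check a counter of over limit groups") amounts to a single global counter maintained where limit/usage bookkeeping already happens. A userspace sketch under that assumption (in the kernel this would be an `atomic_t` or per-cpu counter; `soft_limit_excess_changed` is an invented hook name):

```c
#include <assert.h>
#include <stdbool.h>

static long nr_over_soft_limit;

/* Called whenever a group's over-limit state flips, either way. */
static void soft_limit_excess_changed(bool was_over, bool is_over)
{
	if (was_over != is_over)
		nr_over_soft_limit += is_over ? 1 : -1;
}

/* Gate for shrink_zone: skip the soft pass when nothing is in excess. */
static bool mem_cgroup_should_soft_reclaim(void)
{
	return nr_over_soft_limit > 0;
}
```

This makes the every-group-below-its-limit case free, but as the thread notes, it does not help when a few groups are over the limit among thousands that are not; that case needs the list or the subtree tagging discussed above.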
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
@ 2013-05-20 14:44 ` Michal Hocko
0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-20 14:44 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh

On Fri 17-05-13 12:02:47, Johannes Weiner wrote:
> On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> > Memcg soft reclaim has traditionally been triggered from the global
> > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> > then picked up the group which exceeds its soft limit the most and
> > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
> >
> > The infrastructure requires per-node-zone trees which hold over-limit
> > groups and keep them up-to-date (via memcg_check_events), which is not
> > cost free. Although this overhead hasn't turned out to be a bottleneck,
> > the implementation is suboptimal because mem_cgroup_update_tree has no
> > idea which zones consumed memory over the limit, so we could easily end
> > up with a group on a node-zone tree that has only a few pages from that
> > node-zone.
> >
> > This patch doesn't try to fix the node-zone tree management because
> > integrating soft reclaim into zone shrinking seems much easier and
> > more appropriate, for several reasons.
> > First of all, 0 priority reclaim was a crude hack which might lead to
> > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> > of dirty/writeback pages).
> > Soft reclaim should also be applicable to targeted reclaim, which is
> > awkward right now without additional hacks.
> > Last but not least, the whole infrastructure eats quite some code.
> >
> > After this patch shrink_zone is done in 2 passes.
> > First it tries to do the soft reclaim if appropriate (only for global
> > reclaim for now, to stay compatible with the original state) and falls
> > back to ignoring the soft limit if no group is eligible for soft
> > reclaim or nothing has been scanned during the first pass. Only groups
> > which are over their soft limit, or which have a parent up the
> > hierarchy over the limit, are considered eligible during the first
> > pass.
>
> There are setups with thousands of groups that do not even use soft
> limits. Having them pointlessly iterate over all of them for every
> couple of pages reclaimed is just not acceptable.

OK, that is a fair point. The soft reclaim pass should not be done if we know that every group is below the limit. This can be fixed easily: mem_cgroup_should_soft_reclaim could check a counter of over-limit groups. This still doesn't solve the problem if there are relatively few groups over the limit wrt. those that are under the limit.

You are proposing a simple list in the follow-up email. I have considered this approach as well, but decided not to go that way because the list iteration doesn't respect the per-node-zone-priority tree walk, which makes it tricky to prevent over-reclaim with many parallel reclaimers. I rather wanted to integrate the soft reclaim into the reclaim tree walk. There is also a problem when all groups are in excess: the whole tree then collapses into a linked list, which is not nice either (hmm, this could be mitigated if only the group in excess which is highest in the hierarchy were on the list).

I was playing with huge numbers of groups a bit and my conclusion was that although the System time for the workload was higher with more groups, the Elapsed time was still better than or comparable with the original kernel. I do not know how many groups are used in practice, but I have never heard of configurations with more than thousands of memcgs, and even for 8k groups the results didn't look too crazy.
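The two-pass scheme described in the quoted changelog can be modeled in user space. The sketch below is only illustrative: the struct layout, shrink_zone_sketch, and scan_group are invented stand-ins, not the kernel code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct mem_cgroup; illustrative only. */
struct memcg {
	unsigned long usage;
	unsigned long soft_limit;
	struct memcg *parent;
};

/* Eligible for the soft limit pass: the group itself or any of its
 * ancestors up the hierarchy is over its soft limit. */
static bool soft_reclaim_eligible(const struct memcg *g)
{
	for (; g; g = g->parent)
		if (g->usage > g->soft_limit)
			return true;
	return false;
}

/* Stub standing in for the per-group LRU scan. */
static unsigned long scan_group(struct memcg *g)
{
	(void)g;
	return 32;	/* pretend a SWAP_CLUSTER_MAX-sized batch was scanned */
}

/* Pass 1 visits only eligible groups; if nothing was scanned, fall
 * back to a second, reclaim-all pass at the same priority. */
static unsigned long shrink_zone_sketch(struct memcg **groups, size_t n)
{
	unsigned long scanned = 0;
	size_t i;

	for (i = 0; i < n; i++)		/* soft limit pass */
		if (soft_reclaim_eligible(groups[i]))
			scanned += scan_group(groups[i]);

	if (!scanned)			/* reclaim-all fallback */
		for (i = 0; i < n; i++)
			scanned += scan_group(groups[i]);

	return scanned;
}
```

With one group over its limit and one under, only the eligible group is scanned in the first pass; with every group under the limit, the fallback pass scans them all.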
I had one group (call it A) with the streaming IO load (dd if=/dev/zero of=file with 4*TotalRam size) and a parallel hierarchy with 2 groups with up to 12 levels each (512, 1024, 4096, 8192 groups) and no limit set. I have compared the results against the same configuration on the base kernel.
Two configurations have been tested: the `A' group without any soft limit, and with its limit set to 0. The first configuration measures the overhead of the additional pass, as no soft reclaim is done in either the base kernel or the rework. The second configuration compares the effectiveness of the reworked implementation wrt. the base kernel.

* No soft limit set
Elapsed
500-no-limit/base: min: 16.32 max: 18.03 avg: 17.37 std: 0.75 runs: 3
500-no-limit/rework: min: 15.76 [96.6%] max: 19.72 [109.4%] avg: 17.49 [100.7%] std: 1.66 runs: 3
User
500-no-limit/base: min: 1.53 max: 1.60 avg: 1.57 std: 0.03 runs: 3
500-no-limit/rework: min: 1.18 [77.1%] max: 1.45 [90.6%] avg: 1.33 [84.7%] std: 0.11 runs: 3
System
500-no-limit/base: min: 38.60 max: 41.54 avg: 39.95 std: 1.21 runs: 3
500-no-limit/rework: min: 39.78 [103.1%] max: 42.93 [103.3%] avg: 41.06 [102.8%] std: 1.35 runs: 3

Elapsed
1k-no-limit/base: min: 37.04 max: 43.36 avg: 40.26 std: 2.58 runs: 3
1k-no-limit/rework: min: 16.38 [44.2%] max: 17.82 [41.1%] avg: 17.22 [42.8%] std: 0.61 runs: 3
User
1k-no-limit/base: min: 1.12 max: 1.38 avg: 1.24 std: 0.11 runs: 3
1k-no-limit/rework: min: 1.11 [99.1%] max: 1.26 [91.3%] avg: 1.20 [96.8%] std: 0.07 runs: 3
System
1k-no-limit/base: min: 33.51 max: 36.29 avg: 34.99 std: 1.14 runs: 3
1k-no-limit/rework: min: 45.09 [134.6%] max: 49.52 [136.5%] avg: 47.99 [137.2%] std: 2.05 runs: 3

Elapsed
4k-no-limit/base: min: 40.04 max: 47.14 avg: 44.46 std: 3.15 runs: 3
4k-no-limit/rework: min: 30.38 [75.9%] max: 37.66 [79.9%] avg: 34.24 [77.0%] std: 2.99 runs: 3
User
4k-no-limit/base: min: 1.16 max: 1.33 avg: 1.25 std: 0.07 runs: 3
4k-no-limit/rework: min: 0.70 [60.3%] max: 0.82 [61.7%] avg: 0.77 [61.6%] std: 0.05 runs: 3
System
4k-no-limit/base: min: 37.91 max: 39.91 avg: 39.19 std: 0.91 runs: 3
4k-no-limit/rework: min: 130.35 [343.8%] max: 133.26 [333.9%] avg: 131.63 [335.9%] std: 1.21 runs: 3

Elapsed
8k-no-limit/base: min: 41.27 max: 50.60 avg: 45.51 std: 3.86 runs: 3
8k-no-limit/rework: min: 39.56 [95.9%] max: 52.12 [103.0%] avg: 44.49 [97.8%] std: 5.47 runs: 3
User
8k-no-limit/base: min: 1.26 max: 1.38 avg: 1.32 std: 0.05 runs: 3
8k-no-limit/rework: min: 0.68 [54.0%] max: 0.82 [59.4%] avg: 0.73 [55.3%] std: 0.06 runs: 3
System
8k-no-limit/base: min: 39.93 max: 40.73 avg: 40.25 std: 0.34 runs: 3
8k-no-limit/rework: min: 228.74 [572.9%] max: 238.43 [585.4%] avg: 232.57 [577.8%] std: 4.21 runs: 3

* Soft limit set to 0 for the group with the dd load
Elapsed
500-0-limit/base: min: 30.29 max: 38.91 avg: 34.83 std: 3.53 runs: 3
500-0-limit/rework: min: 14.34 [47.3%] max: 17.18 [44.2%] avg: 16.01 [46.0%] std: 1.21 runs: 3
User
500-0-limit/base: min: 1.14 max: 1.29 avg: 1.24 std: 0.07 runs: 3
500-0-limit/rework: min: 1.42 [124.6%] max: 1.47 [114.0%] avg: 1.44 [116.1%] std: 0.02 runs: 3
System
500-0-limit/base: min: 31.94 max: 35.66 avg: 33.77 std: 1.52 runs: 3
500-0-limit/rework: min: 45.25 [141.7%] max: 47.43 [133.0%] avg: 46.27 [137.0%] std: 0.89 runs: 3

Elapsed
1k-0-limit/base: min: 37.23 max: 45.11 avg: 40.48 std: 3.36 runs: 3
1k-0-limit/rework: min: 15.18 [40.8%] max: 18.69 [41.4%] avg: 16.99 [42.0%] std: 1.44 runs: 3
User
1k-0-limit/base: min: 1.33 max: 1.56 avg: 1.44 std: 0.09 runs: 3
1k-0-limit/rework: min: 1.31 [98.5%] max: 1.55 [99.4%] avg: 1.44 [100.0%] std: 0.10 runs: 3
System
1k-0-limit/base: min: 33.21 max: 34.44 avg: 33.77 std: 0.51 runs: 3
1k-0-limit/rework: min: 45.52 [137.1%] max: 50.82 [147.6%] avg: 48.76 [144.4%] std: 2.32 runs: 3

Elapsed
4k-0-limit/base: min: 42.71 max: 47.83 avg: 45.45 std: 2.11 runs: 3
4k-0-limit/rework: min: 34.24 [80.2%] max: 34.99 [73.2%] avg: 34.56 [76.0%] std: 0.32 runs: 3
User
4k-0-limit/base: min: 1.11 max: 1.34 avg: 1.21 std: 0.10 runs: 3
4k-0-limit/rework: min: 0.80 [72.1%] max: 0.87 [64.9%] avg: 0.83 [68.6%] std: 0.03 runs: 3
System
4k-0-limit/base: min: 37.08 max: 40.28 avg: 38.91 std: 1.35 runs: 3
4k-0-limit/rework: min: 131.08 [353.5%] max: 132.33 [328.5%] avg: 131.66 [338.4%] std: 0.51 runs: 3

Elapsed
8k-0-limit/base: min: 35.71 max: 47.18 avg: 43.19 std: 5.29 runs: 3
8k-0-limit/rework: min: 43.95 [123.1%] max: 59.77 [126.7%] avg: 50.48 [116.9%] std: 6.75 runs: 3
User
8k-0-limit/base: min: 1.18 max: 1.21 avg: 1.19 std: 0.01 runs: 3
8k-0-limit/rework: min: 0.72 [61.0%] max: 0.85 [70.2%] avg: 0.77 [64.7%] std: 0.06 runs: 3
System
8k-0-limit/base: min: 38.34 max: 39.91 avg: 39.24 std: 0.66 runs: 3
8k-0-limit/rework: min: 196.90 [513.6%] max: 235.32 [589.6%] avg: 222.34 [566.6%] std: 17.99 runs: 3

As we can see, the System time climbs really high, but the Elapsed time is better than in the base kernel (except for 8k-0-limit). If we had more reclaimers then the system time should be amortized more, because the reclaim tree walk is shared.

I think that the numbers can be improved even without introducing the list of groups in excess. One way to go could be introducing a conditional (callback) to the memcg iterator so the groups under the limit would be excluded during the walk, without playing with css references and other things. My quick and dirty patch shows that the 4k-0-limit System time was reduced by 40% wrt. this patchset. With proper tagging we can make the walk close to free.

Nevertheless, I guess I can live with the excess list as well if the above sounds like a no-go for you.

[...]
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
2013-05-20 14:44 ` Michal Hocko
@ 2013-05-21 6:53 ` Michal Hocko
0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-21 6:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh

On Mon 20-05-13 16:44:38, Michal Hocko wrote:
[...]
> I had one group (call it A) with the streaming IO load (dd if=/dev/zero
> of=file with 4*TotalRam size) and a parallel hierarchy with 2 groups
> with up to 12 levels each (512, 1024, 4096, 8192 groups) and no limit
> set. I have compared the results with the same configuration with the
> base kernel.
> Two configurations have been tested. `A' group without any soft limit and
> with limit set to 0. The first configuration measures overhead of an
> additional pass as there is no soft reclaim done in both the base kernel
> and the rework. The second configuration compares the effectiveness of
> the reworked implementation wrt. the base kernel.

And I have forgotten to mention that the machine was booted with mem=1G, so there was only a single node without a Normal zone. This setup, with only a single mem hog, makes the original implementation really effective, because there is a good chance that the aggressive soft limit reclaim prevents us from getting into shrink_zone most of the time and thus from iterating over all groups. The reality on bigger machines is quite different, though. The per-node tree where a group is inserted depends on timing, because mem_cgroup_update_tree uses the current page for the decision. So the group can be hanging off a node where it charged just a few pages. The figures below therefore compare something close to the best case with an average case. As I have already said:
The numbers can be improved but is it essential to do it right now rather than slowly evolve to something smarter?

> * No soft limit set
> Elapsed
> 500-no-limit/base: min: 16.32 max: 18.03 avg: 17.37 std: 0.75 runs: 3
> 500-no-limit/rework: min: 15.76 [96.6%] max: 19.72 [109.4%] avg: 17.49 [100.7%] std: 1.66 runs: 3
> User
> 500-no-limit/base: min: 1.53 max: 1.60 avg: 1.57 std: 0.03 runs: 3
> 500-no-limit/rework: min: 1.18 [77.1%] max: 1.45 [90.6%] avg: 1.33 [84.7%] std: 0.11 runs: 3
> System
> 500-no-limit/base: min: 38.60 max: 41.54 avg: 39.95 std: 1.21 runs: 3
> 500-no-limit/rework: min: 39.78 [103.1%] max: 42.93 [103.3%] avg: 41.06 [102.8%] std: 1.35 runs: 3
>
> Elapsed
> 1k-no-limit/base: min: 37.04 max: 43.36 avg: 40.26 std: 2.58 runs: 3
> 1k-no-limit/rework: min: 16.38 [44.2%] max: 17.82 [41.1%] avg: 17.22 [42.8%] std: 0.61 runs: 3
> User
> 1k-no-limit/base: min: 1.12 max: 1.38 avg: 1.24 std: 0.11 runs: 3
> 1k-no-limit/rework: min: 1.11 [99.1%] max: 1.26 [91.3%] avg: 1.20 [96.8%] std: 0.07 runs: 3
> System
> 1k-no-limit/base: min: 33.51 max: 36.29 avg: 34.99 std: 1.14 runs: 3
> 1k-no-limit/rework: min: 45.09 [134.6%] max: 49.52 [136.5%] avg: 47.99 [137.2%] std: 2.05 runs: 3
>
> Elapsed
> 4k-no-limit/base: min: 40.04 max: 47.14 avg: 44.46 std: 3.15 runs: 3
> 4k-no-limit/rework: min: 30.38 [75.9%] max: 37.66 [79.9%] avg: 34.24 [77.0%] std: 2.99 runs: 3
> User
> 4k-no-limit/base: min: 1.16 max: 1.33 avg: 1.25 std: 0.07 runs: 3
> 4k-no-limit/rework: min: 0.70 [60.3%] max: 0.82 [61.7%] avg: 0.77 [61.6%] std: 0.05 runs: 3
> System
> 4k-no-limit/base: min: 37.91 max: 39.91 avg: 39.19 std: 0.91 runs: 3
> 4k-no-limit/rework: min: 130.35 [343.8%] max: 133.26 [333.9%] avg: 131.63 [335.9%] std: 1.21 runs: 3
>
> Elapsed
> 8k-no-limit/base: min: 41.27 max: 50.60 avg: 45.51 std: 3.86 runs: 3
> 8k-no-limit/rework: min: 39.56 [95.9%] max: 52.12 [103.0%] avg: 44.49 [97.8%] std: 5.47 runs: 3
> User
> 8k-no-limit/base: min: 1.26 max: 1.38 avg: 1.32 std: 0.05 runs: 3
> 8k-no-limit/rework: min: 0.68 [54.0%] max: 0.82 [59.4%] avg: 0.73 [55.3%] std: 0.06 runs: 3
> System
> 8k-no-limit/base: min: 39.93 max: 40.73 avg: 40.25 std: 0.34 runs: 3
> 8k-no-limit/rework: min: 228.74 [572.9%] max: 238.43 [585.4%] avg: 232.57 [577.8%] std: 4.21 runs: 3
>
> * Soft limit set to 0 for the group with the dd load
> Elapsed
> 500-0-limit/base: min: 30.29 max: 38.91 avg: 34.83 std: 3.53 runs: 3
> 500-0-limit/rework: min: 14.34 [47.3%] max: 17.18 [44.2%] avg: 16.01 [46.0%] std: 1.21 runs: 3
> User
> 500-0-limit/base: min: 1.14 max: 1.29 avg: 1.24 std: 0.07 runs: 3
> 500-0-limit/rework: min: 1.42 [124.6%] max: 1.47 [114.0%] avg: 1.44 [116.1%] std: 0.02 runs: 3
> System
> 500-0-limit/base: min: 31.94 max: 35.66 avg: 33.77 std: 1.52 runs: 3
> 500-0-limit/rework: min: 45.25 [141.7%] max: 47.43 [133.0%] avg: 46.27 [137.0%] std: 0.89 runs: 3
>
> Elapsed
> 1k-0-limit/base: min: 37.23 max: 45.11 avg: 40.48 std: 3.36 runs: 3
> 1k-0-limit/rework: min: 15.18 [40.8%] max: 18.69 [41.4%] avg: 16.99 [42.0%] std: 1.44 runs: 3
> User
> 1k-0-limit/base: min: 1.33 max: 1.56 avg: 1.44 std: 0.09 runs: 3
> 1k-0-limit/rework: min: 1.31 [98.5%] max: 1.55 [99.4%] avg: 1.44 [100.0%] std: 0.10 runs: 3
> System
> 1k-0-limit/base: min: 33.21 max: 34.44 avg: 33.77 std: 0.51 runs: 3
> 1k-0-limit/rework: min: 45.52 [137.1%] max: 50.82 [147.6%] avg: 48.76 [144.4%] std: 2.32 runs: 3
>
> Elapsed
> 4k-0-limit/base: min: 42.71 max: 47.83 avg: 45.45 std: 2.11 runs: 3
> 4k-0-limit/rework: min: 34.24 [80.2%] max: 34.99 [73.2%] avg: 34.56 [76.0%] std: 0.32 runs: 3
> User
> 4k-0-limit/base: min: 1.11 max: 1.34 avg: 1.21 std: 0.10 runs: 3
> 4k-0-limit/rework: min: 0.80 [72.1%] max: 0.87 [64.9%] avg: 0.83 [68.6%] std: 0.03 runs: 3
> System
> 4k-0-limit/base: min: 37.08 max: 40.28 avg: 38.91 std: 1.35 runs: 3
> 4k-0-limit/rework: min: 131.08 [353.5%] max: 132.33 [328.5%] avg: 131.66 [338.4%] std: 0.51 runs: 3
>
> Elapsed
> 8k-0-limit/base: min: 35.71 max: 47.18 avg: 43.19 std: 5.29 runs: 3
> 8k-0-limit/rework: min: 43.95 [123.1%] max: 59.77 [126.7%] avg: 50.48 [116.9%] std: 6.75 runs: 3
> User
> 8k-0-limit/base: min: 1.18 max: 1.21 avg: 1.19 std: 0.01 runs: 3
> 8k-0-limit/rework: min: 0.72 [61.0%] max: 0.85 [70.2%] avg: 0.77 [64.7%] std: 0.06 runs: 3
> System
> 8k-0-limit/base: min: 38.34 max: 39.91 avg: 39.24 std: 0.66 runs: 3
> 8k-0-limit/rework: min: 196.90 [513.6%] max: 235.32 [589.6%] avg: 222.34 [566.6%] std: 17.99 runs: 3
>
> As we can see the System time climbs really high but the Elapsed time
> is better than in the base kernel (except for 8k-0-limit). If we had
> more reclaimers then the system time should be amortized more because
> the reclaim tree walk is shared.
>
> I think that the numbers can be improved even without introducing
> the list of groups in excess. One way to go could be introducing a
> conditional (callback) to the memcg iterator so the groups under the
> limit would be excluded during the walk without playing with css
> references and other things. My quick and dirty patch shows that
> 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> proper tagging we can make the walk close to free.
>
> Nevertheless, I guess I can live with the excess list as well if the
> above sounds like a no-go for you.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread
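The placement problem described above (a group hanging off the tree of a node where it charged just a few pages) falls out of keying the tree update off whichever page happened to trigger it. A toy user-space model follows; all names, the two-node setup, and the numbers are invented for illustration, not kernel code.

```c
/* Toy model of the per-node over-limit trees: the tree a group is
 * inserted into is chosen from the node of the page that triggered
 * the update (as mem_cgroup_update_tree does), not from where the
 * bulk of the group's memory lives. */
#define NR_NODES 2

struct group {
	unsigned long pages[NR_NODES];	/* pages charged per node */
	int on_tree[NR_NODES];		/* inserted into that node's tree? */
};

/* Charge one page on @node; if the group is over its soft limit at
 * that moment, hook it onto the tree of the triggering node only. */
static void charge_page(struct group *g, int node, int over_limit)
{
	g->pages[node]++;
	if (over_limit)
		g->on_tree[node] = 1;
}

/* The bulk of the memory goes to node 0 while still under the limit;
 * the group crosses the limit on a single node-1 page and therefore
 * ends up only on node 1's tree. */
static struct group demo_misplaced(void)
{
	struct group g = { { 0, 0 }, { 0, 0 } };
	int i;

	for (i = 0; i < 1000; i++)
		charge_page(&g, 0, 0);
	charge_page(&g, 1, 1);
	return g;
}
```

The demo ends with 1000 pages charged on node 0 but the group queued only on node 1's tree, which is the timing-dependent placement the message describes.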
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
2013-05-17 16:02 ` Johannes Weiner
@ 2013-05-27 17:13 ` Michal Hocko
0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-27 17:13 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm

Hi, it took me a bit longer than I wanted, but I was stuck in a conference room at the end of last week so I didn't have much time.

On Mon 20-05-13 16:44:38, Michal Hocko wrote:
> On Fri 17-05-13 12:02:47, Johannes Weiner wrote:
> > On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
[...]
> > > After this patch shrink_zone is done in 2 passes. First it tries to do the
> > > soft reclaim if appropriate (only for global reclaim for now to keep
> > > compatible with the original state) and fall back to ignoring soft limit
> > > if no group is eligible to soft reclaim or nothing has been scanned
> > > during the first pass. Only groups which are over their soft limit or
> > > any of their parents up the hierarchy is over the limit are considered
> > > eligible during the first pass.
> >
> > There are setups with thousands of groups that do not even use soft
> > limits. Having them pointlessly iterate over all of them for every
> > couple of pages reclaimed is just not acceptable.
>
> OK, that is a fair point. The soft reclaim pass should not be done if
> we know that every group is below the limit. This can be fixed easily.
> mem_cgroup_should_soft_reclaim could check a counter of over limit
> groups. This still doesn't solve the problem if there are relatively few
> groups over the limit wrt. those that are under the limit.
>
> You are proposing a simple list in the follow up email.
> I have
> considered this approach as well but then I decided not to go that
> way because the list iteration doesn't respect per-node-zone-priority
> tree walk which makes it tricky to prevent from over-reclaim with many
> parallel reclaimers. I rather wanted to integrate the soft reclaim into
> the reclaim tree walk. There is also a problem when all groups are in
> excess then the whole tree collapses into a linked list which is not
> nice either (hmm, this could be mitigated if only a group in excess
> which is highest in the hierarchy would be in the list)
> [...]
> I think that the numbers can be improved even without introducing
> the list of groups in excess. One way to go could be introducing a
> conditional (callback) to the memcg iterator so the groups under the
> limit would be excluded during the walk without playing with css
> references and other things. My quick and dirty patch shows that
> 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> proper tagging we can make the walk close to free.

And the following patchset implements that. My first numbers showed an improvement (I will post some numbers later, after I collect them). Nevertheless, I have encountered an issue while testing the huge-number-of-groups scenario, and the issue is not limited to that scenario, unfortunately. As the memcg iterators use a per node-zone-priority cache to prevent over-reclaim, it might quite easily happen that the walk does not visit all groups: it either terminates prematurely or skips some of them. An example could be direct reclaim racing with kswapd. This might cause the loop to miss over-limit groups, so no pages are scanned and we fall back to reclaiming all groups. Not good! But also not easy to fix without risking over-reclaim or potential stalls.
I was thinking about introducing something like:

bool should_soft_limit_reclaim_continue(struct mem_cgroup *root, int groups_reclaimed)
{
	if (!groups_reclaimed)
		return false;
	if (mem_cgroup_soft_reclaim_eligible(root, root) == VISIT)
		return true;
	return false;
}

and looping again a few times in __shrink_zone. I am not entirely thrilled about this, as the effectiveness depends on the number of parallel reclaimers at the same priority, but it should work at least somehow. If anybody has a better idea I am all for it. I will think about it some more.

Anyway, I will post 3 patches which should mitigate the "too many groups" issue as a reply to this email. See the patches for details.

^ permalink raw reply	[flat|nested] 74+ messages in thread
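The proposed retry can be modeled as follows. Everything here (the state struct, the stand-in pass that reclaims one group at a time, the retry bound) is an invented user-space sketch of the idea, not the actual __shrink_zone code.

```c
/* User-space model of the proposed bounded retry of the soft limit
 * pass in __shrink_zone. Types, names, and numbers are stand-ins. */
enum visit_result { SKIP, VISIT };

struct zone_state {
	int eligible_left;	/* groups still over their soft limit */
	unsigned long scanned;
};

static enum visit_result soft_reclaim_eligible(const struct zone_state *st)
{
	return st->eligible_left > 0 ? VISIT : SKIP;
}

/* One soft limit pass; pretend it reclaims one eligible group. */
static int soft_limit_pass(struct zone_state *st)
{
	if (st->eligible_left == 0)
		return 0;
	st->eligible_left--;
	st->scanned += 32;	/* a SWAP_CLUSTER_MAX-sized batch */
	return 1;		/* groups reclaimed in this pass */
}

/* Keep retrying while the last pass reclaimed something and eligible
 * groups remain, up to a fixed bound to avoid stalls. */
static unsigned long shrink_zone_retry(struct zone_state *st, int max_retries)
{
	while (max_retries-- > 0) {
		if (!soft_limit_pass(st))
			break;
		if (soft_reclaim_eligible(st) != VISIT)
			break;
	}
	return st->scanned;
}
```

With two eligible groups and a generous bound the loop stops as soon as nothing is left over the limit; with many eligible groups the bound cuts the retries off, which is the stall-avoidance trade-off the message worries about.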
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code @ 2013-05-27 17:13 ` Michal Hocko 0 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-27 17:13 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg Hi, it took me a bit longer than I wanted but I was closed in a conference room in the end of the last week so I didn't have much time. On Mon 20-05-13 16:44:38, Michal Hocko wrote: > On Fri 17-05-13 12:02:47, Johannes Weiner wrote: > > On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote: [...] > > > After this patch shrink_zone is done in 2 passes. First it tries to do the > > > soft reclaim if appropriate (only for global reclaim for now to keep > > > compatible with the original state) and fall back to ignoring soft limit > > > if no group is eligible to soft reclaim or nothing has been scanned > > > during the first pass. Only groups which are over their soft limit or > > > any of their parents up the hierarchy is over the limit are considered > > > eligible during the first pass. > > > > There are setups with thousands of groups that do not even use soft > > limits. Having them pointlessly iterate over all of them for every > > couple of pages reclaimed is just not acceptable. > > OK, that is a fair point. The soft reclaim pass should not be done if > we know that every group is below the limit. This can be fixed easily. > mem_cgroup_should_soft_reclaim could check a counter of over limit > groups. This still doesn't solve the problem if there are relatively few > groups over the limit wrt. those that are under the limit. > > You are proposing a simple list in the follow up email. 
I have > considered this approach as well but then I decided not to go that > way because the list iteration doesn't respect per-node-zone-priority > tree walk which makes it tricky to prevent from over-reclaim with many > parallel reclaimers. I rather wanted to integrate the soft reclaim into > the reclaim tree walk. There is also a problem when all groups are in > excess then the whole tree collapses into a linked list which is not > nice either (hmm, this could be mitigated if only a group in excess > which is highest in the hierarchy would be in the list) > [...] > I think that the numbers can be improved even without introducing > the list of groups in excess. One way to go could be introducing a > conditional (callback) to the memcg iterator so the groups under the > limit would be excluded during the walk without playing with css > references and other things. My quick and dirty patch shows that > 4k-0-limit System time was reduced by 40% wrt. this patchset. With a > proper tagging we can make the walk close to free. And the following patchset implements that. My first numbers shown an improvement (I will post some numbers later after I collect them). Nevertheless I have encountered an issue while testing the huge number of groups scenario. And the issue is not limitted to only to this scenario unfortunately. As memcg iterators use per node-zone-priority cache to prevent from over reclaim it might quite easily happen that the walk will not visit all groups and will terminate the loop either prematurely or skip some groups. An example could be the direct reclaim racing with kswapd. This might cause that the loop misses over limit groups so no pages are scanned and so we will fall back to all groups reclaim. Not good! But also not easy to fix without risking an over reclaim or potential stalls. 
I was thinking about introducing something like: bool should_soft_limit_reclaim_continue(struct mem_cgroup *root, int groups_reclaimed) { if (!groups_reclaimed) return false; if (mem_cgroup_soft_reclaim_eligible(root, root) == VISIT) return true; return false; } and looping again a few times in __shrink_zone. I am not entirely thrilled about this as the effectiveness depends on the number of parallel reclaimers at the same priority but it should work at least somehow. If anybody has a better idea I am all for it. I will think about it some more. Anyway I will post 3 patches which should mitigate the "too many groups" issue as a reply to this email. See the patches for details. ^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH 1/3] memcg: track children in soft limit excess to improve soft limit 2013-05-27 17:13 ` Michal Hocko @ 2013-05-27 17:13 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-27 17:13 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm Soft limit reclaim has to check the whole reclaim hierarchy while doing the first pass of the reclaim. This leads to a higher system time which can be visible especially when there are many groups in the hierarchy. - TODO put testing results here This patch adds a per-memcg counter of children in excess. It also restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for proper batching. If a group crosses its soft limit for the first time it increments children_in_excess in each parent up the hierarchy. Similarly, if a group drops below the limit, it decrements the counter. The transition is recorded in the soft_contributed flag. mem_cgroup_soft_reclaim_eligible then uses this information to better decide whether to skip the node or the whole subtree. The rule is simple: skip only the node itself if it has children in excess, otherwise skip the whole subtree. 
Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 981ee12..60b48bc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -136,6 +136,7 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, + MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -355,6 +356,10 @@ struct mem_cgroup { atomic_t numainfo_updating; #endif + spinlock_t soft_lock; + bool soft_contributed; + atomic_t children_in_excess; + /* * Per cgroup active and inactive list, similar to the * per zone LRU lists. @@ -890,6 +895,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; + case MEM_CGROUP_TARGET_SOFTLIMIT: + next = val + SOFTLIMIT_EVENTS_TARGET; + break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -902,6 +910,34 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, return false; } +static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) +{ + unsigned long long excess = res_counter_soft_limit_excess(&memcg->res); + struct mem_cgroup *parent = memcg; + int delta = 0; + + spin_lock(&memcg->soft_lock); + if (excess) { + if (!memcg->soft_contributed) { + delta = 1; + memcg->soft_contributed = true; + } + } else { + if (memcg->soft_contributed) { + delta = -1; + memcg->soft_contributed = false; + } + } + + /* + * Necessary to update all ancestors when hierarchy is used + * because their event counter is not touched. + */ + while (delta && (parent = parent_mem_cgroup(parent))) + atomic_add(delta, &parent->children_in_excess); + spin_unlock(&memcg->soft_lock); +} + /* * Check events in order. 
* @@ -912,8 +948,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { + bool do_softlimit; bool do_numainfo __maybe_unused; + do_softlimit = mem_cgroup_event_ratelimit(memcg, + MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -921,6 +960,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); + if (unlikely(do_softlimit)) + mem_cgroup_update_soft_limit(memcg); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -1894,6 +1935,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) * hierarchy if * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit + * + * If the given group doesn't have any children over the limit then it + * doesn't make any sense to iterate its subtree. */ enum mem_cgroup_filter_t mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, @@ -1915,6 +1959,8 @@ mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, break; } + if (!atomic_read(&memcg->children_in_excess)) + return SKIP_TREE; return SKIP; } @@ -6061,6 +6107,7 @@ mem_cgroup_css_alloc(struct cgroup *cont) mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); vmpressure_init(&memcg->vmpressure); + spin_lock_init(&memcg->soft_lock); return &memcg->css; @@ -6150,6 +6197,10 @@ static void mem_cgroup_css_offline(struct cgroup *cont) mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); + if (memcg->soft_contributed) { + while ((memcg = parent_mem_cgroup(memcg))) + atomic_dec(&memcg->children_in_excess); + } mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* [PATCH 2/3] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything 2013-05-27 17:13 ` Michal Hocko @ 2013-05-27 17:13 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-27 17:13 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass is done and it currently always says yes. Memcg iterators are clever enough to skip nodes that are not soft reclaimable quite efficiently, but mem_cgroup_should_soft_reclaim can be cleverer still and not start the soft reclaim pass at all if it knows that nothing would be scanned anyway. In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the target group of the reclaim and allow the pass only if the whole subtree wouldn't be skipped. TODO - should mem_cgroup_soft_reclaim_eligible check for NULL root/memcg so that we do not export root_mem_cgroup? Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 2 ++ mm/memcontrol.c | 2 +- mm/vmscan.c | 5 ++++- 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 811967a..909bb6b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -51,6 +51,8 @@ typedef enum mem_cgroup_filter_t (*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root); #ifdef CONFIG_MEMCG +extern struct mem_cgroup *root_mem_cgroup; + /* * All "charge" functions with gfp_mask should use GFP_KERNEL or * (gfp_mask & GFP_RECLAIM_MASK). 
In current implementatin, memcg doesn't diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 60b48bc..592df10 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -66,7 +66,7 @@ struct cgroup_subsys mem_cgroup_subsys __read_mostly; EXPORT_SYMBOL(mem_cgroup_subsys); #define MEM_CGROUP_RECLAIM_RETRIES 5 -static struct mem_cgroup *root_mem_cgroup __read_mostly; +struct mem_cgroup *root_mem_cgroup __read_mostly; #ifdef CONFIG_MEMCG_SWAP /* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 49878dd..22c1278 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,10 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return true; + struct mem_cgroup *root = sc->target_mem_cgroup; + if (!root) + root = root_mem_cgroup; + return mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE; } #else static bool global_reclaim(struct scan_control *sc) -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* [PATCH 3/3] memcg: Track all children over limit in the root 2013-05-27 17:13 ` Michal Hocko @ 2013-05-27 17:13 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-27 17:13 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm Children in soft limit excess are currently tracked up the hierarchy in memcg->children_in_excess. Nevertheless there might still exist tons of groups that are not in a hierarchy relation to the root cgroup (e.g. all first-level groups if root_mem_cgroup->use_hierarchy == false). As the whole tree walk has to be done when the iteration starts at root_mem_cgroup, the iterator should be able to skip the walk altogether if no child is above the limit, without iterating those groups. This can be done easily if the root tracks all children rather than only hierarchical ones. This patch therefore updates root_mem_cgroup's children_in_excess if root_mem_cgroup->use_hierarchy == false so that the root knows about all children in excess. Note that this is not an issue for inner memcgs with use_hierarchy == false because then only the single group is visited, so no special optimization is necessary. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 592df10..7fb063f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -932,9 +932,15 @@ static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg) /* * Necessary to update all ancestors when hierarchy is used * because their event counter is not touched. + * We track children even outside the hierarchy for the root + * cgroup because tree walk starting at root should visit + * all cgroups and we want to prevent from pointless tree + * walk if no children is below the limit. 
*/ while (delta && (parent = parent_mem_cgroup(parent))) atomic_add(delta, &parent->children_in_excess); + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_add(delta, &root_mem_cgroup->children_in_excess); spin_unlock(&memcg->soft_lock); } @@ -6200,6 +6206,9 @@ static void mem_cgroup_css_offline(struct cgroup *cont) if (memcg->soft_contributed) { while ((memcg = parent_mem_cgroup(memcg))) atomic_dec(&memcg->children_in_excess); + + if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy) + atomic_dec(&root_mem_cgroup->children_in_excess); } mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* [PATCH] memcg: enhance memcg iterator to support predicates 2013-05-27 17:13 ` Michal Hocko (?) @ 2013-05-27 17:20 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-27 17:20 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm Oh, forgot to post the first patch... --- From 54b422ba96e3fa9264ee308c4fd9d5fe36efde63 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 20 May 2013 13:40:11 +0200 Subject: [PATCH] memcg: enhance memcg iterator to support predicates The caller of the iterator might know that some nodes or even subtrees should be skipped, but there is no way to tell the iterator about that, so the only choice left is to let the iterator visit each node and do the selection outside of the iterating code. This, however, doesn't scale well in hierarchies with many groups where only a few groups are interesting. This patch adds a mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. The callback can influence the walk in three ways: the node is visited, the node is skipped but the tree walk continues down the tree, or the whole subtree of the current group is skipped. 
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   40 +++++++++++++++++++++----
 mm/memcontrol.c            |   69 ++++++++++++++++++++++++++++++++++----------
 mm/vmscan.c                |   16 ++++------
 3 files changed, 93 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 80ed1b6..811967a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -41,6 +41,15 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum mem_cgroup_filter_t {
+	VISIT,
+	SKIP,
+	SKIP_TREE,
+};
+
+typedef enum mem_cgroup_filter_t
+(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -107,9 +116,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 	struct page *oldpage, struct page *newpage, bool migration_ok);
 
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
-				   struct mem_cgroup *,
-				   struct mem_cgroup_reclaim_cookie *);
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_reclaim_cookie *reclaim,
+				   mem_cgroup_iter_filter cond);
+
+static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_reclaim_cookie *reclaim)
+{
+	return mem_cgroup_iter_cond(root, prev, reclaim, NULL);
+}
+
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
 /*
@@ -179,7 +197,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
@@ -294,6 +313,14 @@ static inline void
 mem_cgroup_end_migration(struct mem_cgroup *memcg,
 		struct page *oldpage, struct page *newpage, bool migration_ok)
 {
 }
 
+static inline struct mem_cgroup *
+mem_cgroup_iter_cond(struct mem_cgroup *root,
+		struct mem_cgroup *prev,
+		struct mem_cgroup_reclaim_cookie *reclaim,
+		mem_cgroup_iter_filter cond)
+{
+	return NULL;
+}
 
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
@@ -357,10 +384,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
-	return false;
+	return VISIT;
 }
 
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f8dd37..981ee12 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -969,6 +969,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+static enum mem_cgroup_filter_t
+mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root,
+		mem_cgroup_iter_filter cond)
+{
+	if (!cond)
+		return VISIT;
+	return cond(memcg, root);
+}
+
 /*
  * Returns a next (in a pre-order walk) alive memcg (with elevated css
  * ref. count) or NULL if the whole root's subtree has been visited.
@@ -976,7 +985,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 * helper function to be used by mem_cgroup_iter
 */
 static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
-		struct mem_cgroup *last_visited)
+		struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond)
 {
 	struct cgroup *prev_cgroup, *next_cgroup;
 
@@ -984,10 +993,18 @@ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
 	 * Root is not visited by cgroup iterators so it needs an
 	 * explicit visit.
 	 */
-	if (!last_visited)
-		return root;
+	if (!last_visited) {
+		switch(mem_cgroup_filter(root, root, cond)) {
+		case VISIT:
+			return root;
+		case SKIP:
+			break;
+		case SKIP_TREE:
+			return NULL;
+		}
+	}
 
-	prev_cgroup = (last_visited == root) ? NULL
+	prev_cgroup = (last_visited == root || !last_visited) ? NULL
 		: last_visited->css.cgroup;
 skip_node:
 	next_cgroup = cgroup_next_descendant_pre(
@@ -1003,11 +1020,22 @@ skip_node:
 	if (next_cgroup) {
 		struct mem_cgroup *mem = mem_cgroup_from_cont(
 				next_cgroup);
-		if (css_tryget(&mem->css))
-			return mem;
-		else {
+
+		switch (mem_cgroup_filter(mem, root, cond)) {
+		case SKIP:
 			prev_cgroup = next_cgroup;
 			goto skip_node;
+		case SKIP_TREE:
+			prev_cgroup = cgroup_rightmost_descendant(next_cgroup);
+			goto skip_node;
+		case VISIT:
+			if (css_tryget(&mem->css))
+				return mem;
+			else {
+				prev_cgroup = next_cgroup;
+				goto skip_node;
+			}
+			break;
 		}
 	}
 
@@ -1019,6 +1047,7 @@ skip_node:
 * @root: hierarchy root
 * @prev: previously returned memcg, NULL on first invocation
 * @reclaim: cookie for shared reclaim walks, NULL for full walks
+ * @cond: filter for visited nodes, NULL for no filter
 *
 * Returns references to children of the hierarchy below @root, or
 * @root itself, or %NULL after a full round-trip.
@@ -1031,9 +1060,10 @@ skip_node:
 * divide up the memcgs in the hierarchy among all concurrent
 * reclaimers operating on the same zone and priority.
 */
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
 				   struct mem_cgroup *prev,
-				   struct mem_cgroup_reclaim_cookie *reclaim)
+				   struct mem_cgroup_reclaim_cookie *reclaim,
+				   mem_cgroup_iter_filter cond)
 {
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup *last_visited = NULL;
@@ -1051,7 +1081,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
 			goto out_css_put;
-		return root;
+		if (mem_cgroup_filter(root, root, cond) == VISIT)
+			return root;
+		return NULL;
 	}
 
 	rcu_read_lock();
@@ -1094,7 +1126,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		}
 	}
 
-	memcg = __mem_cgroup_iter_next(root, last_visited);
+	memcg = __mem_cgroup_iter_next(root, last_visited, cond);
+	/*
+	 * No group has been visited because filter told us to skip
+	 * the root node
+	 */
+	if (cond && !last_visited && !memcg)
+		goto out_unlock;
 
 	if (reclaim) {
 		if (last_visited)
@@ -1857,13 +1895,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 * a) it is over its soft limit
 * b) any parent up the hierarchy is over its soft limit
 */
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
 	struct mem_cgroup *parent = memcg;
 
 	if (res_counter_soft_limit_excess(&memcg->res))
-		return true;
+		return VISIT;
 
 	/*
 	 * If any parent up to the root in the hierarchy is over its soft limit
@@ -1871,12 +1910,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 	 */
 	while((parent = parent_mem_cgroup(parent))) {
 		if (res_counter_soft_limit_excess(&parent->res))
-			return true;
+			return VISIT;
 		if (parent == root)
 			break;
 	}
 
-	return false;
+	return SKIP;
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151f6b9..49878dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1964,21 +1964,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 		.zone = zone,
 		.priority = sc->priority,
 	};
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = NULL;
+	mem_cgroup_iter_filter filter = (soft_reclaim) ?
+		mem_cgroup_soft_reclaim_eligible : NULL;
 
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
-	memcg = mem_cgroup_iter(root, NULL, &reclaim);
-	do {
+	while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
 		struct lruvec *lruvec;
 
-		if (soft_reclaim &&
-		    !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
-			memcg = mem_cgroup_iter(root, memcg, &reclaim);
-			continue;
-		}
-
 		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 		shrink_lruvec(lruvec, sc);
 
@@ -1998,8 +1993,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 			mem_cgroup_iter_break(root, memcg);
 			break;
 		}
-		memcg = mem_cgroup_iter(root, memcg, &reclaim);
-	} while (memcg);
+	}
 
 	vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
 		   sc->nr_scanned - nr_scanned,
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [PATCH] memcg: enhance memcg iterator to support predicates
@ 2013-05-27 17:20 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-27 17:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Oh, forgot to post the first patch...
---
From 54b422ba96e3fa9264ee308c4fd9d5fe36efde63 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Mon, 20 May 2013 13:40:11 +0200
Subject: [PATCH] memcg: enhance memcg iterator to support predicates

The caller of the iterator might know that some nodes or even whole
subtrees should be skipped, but there is no way to tell the iterator
about that, so the only choice left is to let it visit every node and do
the selection outside of the iterating code. This, however, doesn't
scale well with hierarchies of many groups where only a few groups are
interesting.

This patch adds a mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node. The callback can
influence the walk in three ways: the node is visited, the node is
skipped but the tree walk continues down the tree, or the whole subtree
of the current group is skipped.
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-27 17:13 ` Michal Hocko
@ 2013-05-29 13:05   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-29 13:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh, cgroups, linux-kernel, linux-mm

On Mon 27-05-13 19:13:08, Michal Hocko wrote:
[...]
> Nevertheless I have encountered an issue while testing the huge number
> of groups scenario. And the issue is not limitted to only to this
> scenario unfortunately. As memcg iterators use per node-zone-priority
> cache to prevent from over reclaim it might quite easily happen that
> the walk will not visit all groups and will terminate the loop either
> prematurely or skip some groups. An example could be the direct reclaim
> racing with kswapd. This might cause that the loop misses over limit
> groups so no pages are scanned and so we will fall back to all groups
> reclaim.

And after some more testing and head scratching it turned out that the
fallbacks to pass #2 I was seeing were caused by something else. It is
not a race between iterators but rather reclaim from the DMA zone,
which struggles to scan anything even though there are pages on its
LRU, so we fall back. I have to look into that more, but whatever the
issue is, it shouldn't be related to this patch series.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-29 13:05 ` Michal Hocko
@ 2013-05-29 15:57   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-29 15:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh, cgroups, linux-kernel, linux-mm

On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> [...]
> > Nevertheless I have encountered an issue while testing the huge number
> > of groups scenario. And the issue is not limitted to only to this
> > scenario unfortunately. As memcg iterators use per node-zone-priority
> > cache to prevent from over reclaim it might quite easily happen that
> > the walk will not visit all groups and will terminate the loop either
> > prematurely or skip some groups. An example could be the direct reclaim
> > racing with kswapd. This might cause that the loop misses over limit
> > groups so no pages are scanned and so we will fall back to all groups
> > reclaim.
> 
> And after some more testing and head scratching it turned out that
> fallbacks to pass#2 I was seeing are caused by something else. It is
> not race between iterators but rather reclaiming from zone DMA which
> has troubles to scan anything despite there are pages on LRU and so we
> fall back. I have to look into that more but what-ever the issue is it
> shouldn't be related to the patch series.

I think I know what is going on: get_scan_count sees a relatively small
number of pages on the lists (around 2k). This means that
get_scan_count will tell us to scan nothing at DEF_PRIORITY (as the
DMA32 is usually ~16M), so DEF_PRIORITY is basically a no-op and we
have to wait and fall down to a priority which actually lets us scan
something.

Hmm, maybe ignoring soft reclaim for the DMA zone would help to avoid
one pointless loop over the groups.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-29 15:57 ` Michal Hocko
@ 2013-05-29 20:01   ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2013-05-29 20:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh, cgroups, linux-kernel, linux-mm

On Wed, May 29, 2013 at 05:57:56PM +0200, Michal Hocko wrote:
> On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> > On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> > [...]
> > > Nevertheless I have encountered an issue while testing the huge number
> > > of groups scenario. And the issue is not limitted to only to this
> > > scenario unfortunately. As memcg iterators use per node-zone-priority
> > > cache to prevent from over reclaim it might quite easily happen that
> > > the walk will not visit all groups and will terminate the loop either
> > > prematurely or skip some groups. An example could be the direct reclaim
> > > racing with kswapd. This might cause that the loop misses over limit
> > > groups so no pages are scanned and so we will fall back to all groups
> > > reclaim.
> > 
> > And after some more testing and head scratching it turned out that
> > fallbacks to pass#2 I was seeing are caused by something else. It is
> > not race between iterators but rather reclaiming from zone DMA which
> > has troubles to scan anything despite there are pages on LRU and so we
> > fall back. I have to look into that more but what-ever the issue is it
> > shouldn't be related to the patch series.
> 
> Think I know what is going on. get_scan_count sees relatively small
> amount of pages in the lists (around 2k). This means that get_scan_count
> will tell us to scan nothing for DEF_PRIORITY (as the DMA32 is usually
> ~16M) then the DEF_PRIORITY is basically no-op and we have to wait and
> fall down to a priority which actually let us scan something.
> 
> Hmm, maybe ignoring soft reclaim for DMA zone would help to reduce
> one pointless loop over groups.

If you have a small group in excess of its soft limit and bigger
groups that are not, you may reclaim something in the regular reclaim
cycle before reclaiming anything in the soft limit cycle with the way
the code is structured.

The soft limit cycle probably needs to sit outside of the priority
loop, not inside the loop, so that the soft limit reclaim cycle
descends priority levels until it makes progress BEFORE it exits to
the regular reclaim cycle.

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code
  2013-05-29 20:01 ` Johannes Weiner
@ 2013-05-30  8:45   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-30 8:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh, cgroups, linux-kernel, linux-mm

On Wed 29-05-13 16:01:54, Johannes Weiner wrote:
> On Wed, May 29, 2013 at 05:57:56PM +0200, Michal Hocko wrote:
> > On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> > > On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> > > [...]
> > > > Nevertheless I have encountered an issue while testing the huge number
> > > > of groups scenario. And the issue is not limitted to only to this
> > > > scenario unfortunately. As memcg iterators use per node-zone-priority
> > > > cache to prevent from over reclaim it might quite easily happen that
> > > > the walk will not visit all groups and will terminate the loop either
> > > > prematurely or skip some groups. An example could be the direct reclaim
> > > > racing with kswapd. This might cause that the loop misses over limit
> > > > groups so no pages are scanned and so we will fall back to all groups
> > > > reclaim.
> > > 
> > > And after some more testing and head scratching it turned out that
> > > fallbacks to pass#2 I was seeing are caused by something else. It is
> > > not race between iterators but rather reclaiming from zone DMA which
> > > has troubles to scan anything despite there are pages on LRU and so we
> > > fall back. I have to look into that more but what-ever the issue is it
> > > shouldn't be related to the patch series.
> > 
> > Think I know what is going on. get_scan_count sees relatively small
> > amount of pages in the lists (around 2k). This means that get_scan_count
> > will tell us to scan nothing for DEF_PRIORITY (as the DMA32 is usually
> > ~16M) then the DEF_PRIORITY is basically no-op and we have to wait and
> > fall down to a priority which actually let us scan something.
> > 
> > Hmm, maybe ignoring soft reclaim for DMA zone would help to reduce
> > one pointless loop over groups.
> 
> If you have a small group in excess of its soft limit and bigger
> groups that are not, you may reclaim something in the regular reclaim
> cycle before reclaiming anything in the soft limit cycle with the way
> the code is structured.

Yes, the way get_scan_count works might really cause this. Although
targeted reclaim is protected from it, the global reclaim can really
suffer. I am not sure this is necessarily a problem, though. If we are
under global reclaim then a small group which doesn't have at least
1<<DEF_PRIORITY pages probably doesn't matter that much. The soft limit
is not a guarantee anyway, so we can sacrifice some pages from all
groups in such a case.

I also think that the force_scan logic should be enhanced a bit,
especially for cases like the DMA zone. The zone is clearly under its
watermarks but we have to wait a few priority cycles before we can
reclaim anything. But this is a different issue, independent of the
soft reclaim rework.

> The soft limit cycle probably needs to sit outside of the priority
> loop, not inside the loop, so that the soft limit reclaim cycle
> descends priority levels until it makes progress BEFORE it exits to
> the regular reclaim cycle.

I do not like this, to be honest. shrink_zone is an ideal place as it
is shared among all reclaimers, and we really want to obey priority in
the soft reclaim as well. The corner case mentioned above is probably
fixable at the get_scan_count layer, and even if not, I wouldn't call
it a disaster.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-27 17:13 ` Michal Hocko @ 2013-05-29 14:54 ` Michal Hocko 0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-29 14:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm

On Mon 27-05-13 19:13:08, Michal Hocko wrote:
[...]
> > I think that the numbers can be improved even without introducing
> > the list of groups in excess. One way to go could be introducing a
> > conditional (callback) to the memcg iterator so the groups under the
> > limit would be excluded during the walk without playing with css
> > references and other things. My quick and dirty patch shows that
> > 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> > proper tagging we can make the walk close to free.
>
> And the following patchset implements that. My first numbers showed an
> improvement (I will post some numbers later after I collect them).

With that series applied (+ some minor cleanups - no functional changes) I am getting much better numbers this time. Again the test case is streaming IO (dd from /dev/zero to a file with 4*MemTotal) booted with mem=1G, so there are just the DMA and DMA32 zones. This helps the previous soft limit implementation, as the over-limit group will be in the proper mztree. The load was running in a group (A) and other groups were created under the root/bunch cgroups (with use_hierarchy and two groups at each level). I have run each test 3 times. rework denotes the original series, reworkoptim contains 3 patches on top. The first number in the name gives the number of groups under root/bunch and the second the soft limit set for A.
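For illustration, the group layout described above (groups created under root/bunch, two children at each level) can be enumerated as a small binary hierarchy. The bunch_paths helper and the gN naming are my own assumptions for the sketch, not the actual test script:

```python
from collections import deque

def bunch_paths(total_groups: int) -> list[str]:
    # Enumerate cgroup paths under root/bunch, breadth-first,
    # creating two children per group until total_groups exist.
    paths: list[str] = []
    queue = deque(["bunch"])
    while queue and len(paths) < total_groups:
        parent = queue.popleft()
        for i in (0, 1):
            child = f"{parent}/g{i}"
            paths.append(child)
            queue.append(child)
            if len(paths) == total_groups:
                break
    return paths

# The 0.5k/2k/8k configurations then correspond to 512/2048/8192 paths.
print(len(bunch_paths(2048)))
```

Each path would then be created with mkdir under the memcg mount before the dd load starts in group A.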
System
0-0-limit/base: min: 32.45 max: 36.91 avg: 34.65 std: 1.82 runs: 3
0-0-limit/rework: min: 38.03 [117.2%] max: 45.30 [122.7%] avg: 42.30 [122.1%] std: 3.10 runs: 3
0-0-limit/reworkoptim: min: 43.99 [135.6%] max: 47.80 [129.5%] avg: 45.45 [131.2%] std: 1.68 runs: 3
Elapsed
0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
System
0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
Elapsed
0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
System
2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
Elapsed
2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
System
8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
Elapsed
8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3

The System time is increased by ~30-40%, which can be explained by the fact that the original soft reclaim scanned at priority 0 and so was much more effective for this workload (which is basically touch once and write back). The Elapsed time looks better though (~20%), which sounds like good news to me. The tree walk overhead seems to be reduced considerably if we compare reworkoptim to rework.

Same test with no soft limit set for A:

System
0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
Elapsed
0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
System
0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
Elapsed
0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
System
2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
Elapsed
2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
System
8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
Elapsed
8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3

Both System and Elapsed are within stdev of the base kernel for all configurations except for 8k, where both System and Elapsed are up by 35%. I do not have a good explanation for this, because there is no soft reclaim pass going on as no group is above the limit, which is checked in mem_cgroup_should_soft_reclaim. I am still running kbuild tests with the same configuration to see a more general workload. Does this sound like a reasonable mitigation of the issue you were worried about, Johannes?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 74+ messages in thread
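The bracketed percentages in the tables above are simply each kernel's time relative to the base run; a quick sketch (the relative helper name is my own) reproduces them:

```python
def relative(value: float, base: float) -> float:
    # Percentage of the base kernel's time, one decimal place,
    # as in the bracketed numbers of the report.
    return round(100.0 * value / base, 1)

# 0-0-limit System: rework min 38.03 vs base min 32.45 -> [117.2%]
print(relative(38.03, 32.45))
# 8k-0-limit System: rework min 200.86 vs base min 39.78 -> [504.9%]
print(relative(200.86, 39.78))
```

Values below 100% mean the reworked kernel spent less time than base for that statistic.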
* Re: [patch v3 -mm 1/3] memcg: integrate soft reclaim tighter with zone shrinking code 2013-05-29 14:54 ` Michal Hocko @ 2013-05-30 8:36 ` Michal Hocko 0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2013-05-30 8:36 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh, cgroups, linux-kernel, linux-mm

On Wed 29-05-13 16:54:00, Michal Hocko wrote:
[...]
> I am still running kbuild tests with the same configuration to see a
> more general workload.

And here we go with the kbuild numbers. Same configuration (mem=1G, one group for the kernel build - it actually expands the tree + builds a distro config + a bunch of groups under root/bunch).

System
0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
0-0-limit/reclaim: min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
Elapsed
0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
0-0-limit/reclaim: min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
System
0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
0.5k-0-limit/reclaim: min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
Elapsed
0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
0.5k-0-limit/reclaim: min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
System
2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
2k-0-limit/reclaim: min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
Elapsed
2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
2k-0-limit/reclaim: min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
System
8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
8k-0-limit/reclaim: min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
Elapsed
8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
8k-0-limit/reclaim: min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3

Apart from 8k, the System time is comparable with the base kernel while Elapsed is up to 20% better with all configurations.
And with no soft limit set:

System
0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
0-no-limit/reclaim: min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
Elapsed
0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
0-no-limit/reclaim: min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
System
0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
0.5k-no-limit/reclaim: min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
Elapsed
0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
0.5k-no-limit/reclaim: min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
System
2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
2k-no-limit/reclaim: min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
Elapsed
2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
2k-no-limit/reclaim: min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
System
8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
8k-no-limit/reclaim: min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
Elapsed
8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
8k-no-limit/reclaim: min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3

These numbers look good as well. System time is around 100% (surprisingly better for the 8k case) and Elapsed copies that trend.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 74+ messages in thread
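The "within stdev" and "comparable" claims above can be checked mechanically. This is a rough overlap check of my own (comparing the gap between averages to the sum of the standard deviations), not a proper significance test:

```python
def within_stdev(avg_a: float, std_a: float, avg_b: float, std_b: float) -> bool:
    # Rough check: do the two averages differ by less than
    # the combined standard deviations of the two runs?
    return abs(avg_a - avg_b) <= std_a + std_b

# 2k-no-limit kbuild System: base avg 457.66 (std 10.19)
# vs reworkoptim avg 446.73 (std 9.33) -> comparable
print(within_stdev(457.66, 10.19, 446.73, 9.33))

# 8k-0-limit kbuild System: base avg 325.46 (std 4.88)
# vs reworkoptim avg 406.16 (std 2.28) -> clearly apart
print(within_stdev(325.46, 4.88, 406.16, 2.28))
```

With only 3 runs per configuration the stdevs are noisy, so this is just a sanity check on the prose summary, not evidence by itself.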
* [patch v3 -mm 2/3] memcg: Get rid of soft-limit tree infrastructure 2013-05-13 7:46 ` Michal Hocko @ 2013-05-13 7:46 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-13 7:46 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh Now that the soft limit is integrated to the reclaim directly the whole soft-limit tree infrastructure is not needed anymore. Rip it out. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 224 +------------------------------------------------------ 1 file changed, 1 insertion(+), 223 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7e5d7a9..1223aaa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -39,7 +39,6 @@ #include <linux/limits.h> #include <linux/export.h> #include <linux/mutex.h> -#include <linux/rbtree.h> #include <linux/slab.h> #include <linux/swap.h> #include <linux/swapops.h> @@ -137,7 +136,6 @@ static const char * const mem_cgroup_lru_names[] = { */ enum mem_cgroup_events_target { MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, MEM_CGROUP_TARGET_NUMAINFO, MEM_CGROUP_NTARGETS, }; @@ -173,10 +171,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; - struct rb_node tree_node; /* RB tree node */ - unsigned long long usage_in_excess;/* Set to the value by which */ - /* the soft limit is exceeded*/ - bool on_tree; struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; @@ -189,26 +183,6 @@ struct mem_cgroup_lru_info { struct mem_cgroup_per_node *nodeinfo[0]; }; -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy representation - */ - -struct mem_cgroup_tree_per_zone { - struct rb_root rb_root; - spinlock_t lock; -}; - -struct mem_cgroup_tree_per_node { - struct 
mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES]; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - struct mem_cgroup_threshold { struct eventfd_ctx *eventfd; u64 threshold; @@ -533,7 +507,6 @@ static bool move_file(void) * limit reclaim to prevent infinite loops, if they ever occur. */ #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, @@ -764,164 +737,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page) return mem_cgroup_zoneinfo(memcg, nid, zid); } -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_node_zone(int nid, int zid) -{ - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static struct mem_cgroup_tree_per_zone * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - int zid = page_zonenum(page); - - return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid]; -} - -static void -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz, - unsigned long long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_zone *mz_node; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_zone, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) - p = &(*p)->rb_left; - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= mz_node->usage_in_excess) - p = &(*p)->rb_right; - } - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - 
-static void -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - if (!mz->on_tree) - return; - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg, - struct mem_cgroup_per_zone *mz, - struct mem_cgroup_tree_per_zone *mctz) -{ - spin_lock(&mctz->lock); - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - spin_unlock(&mctz->lock); -} - - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) -{ - unsigned long long excess; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - int nid = page_to_nid(page); - int zid = page_zonenum(page); - mctz = soft_limit_tree_from_page(page); - - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_zoneinfo(memcg, nid, zid); - excess = res_counter_soft_limit_excess(&memcg->res); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - spin_lock(&mctz->lock); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(memcg, mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. 
- */ - __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess); - spin_unlock(&mctz->lock); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - int node, zone; - struct mem_cgroup_per_zone *mz; - struct mem_cgroup_tree_per_zone *mctz; - - for_each_node(node) { - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - mz = mem_cgroup_zoneinfo(memcg, node, zone); - mctz = soft_limit_tree_node_zone(node, zone); - mem_cgroup_remove_exceeded(memcg, mz, mctz); - } - } -} - -static struct mem_cgroup_per_zone * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct rb_node *rightmost = NULL; - struct mem_cgroup_per_zone *mz; - -retry: - mz = NULL; - rightmost = rb_last(&mctz->rb_root); - if (!rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->memcg->res) || - !css_tryget(&mz->memcg->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_zone * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) -{ - struct mem_cgroup_per_zone *mz; - - spin_lock(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock(&mctz->lock); - return mz; -} - /* * Implementation Note: reading percpu statistics for memcg. 
* @@ -1075,9 +890,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, case MEM_CGROUP_TARGET_THRESH: next = val + THRESHOLDS_EVENTS_TARGET; break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; case MEM_CGROUP_TARGET_NUMAINFO: next = val + NUMAINFO_EVENTS_TARGET; break; @@ -1100,11 +912,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; bool do_numainfo __maybe_unused; - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_NUMAINFO); @@ -1112,8 +921,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); #if MAX_NUMNODES > 1 if (unlikely(do_numainfo)) atomic_inc(&memcg->numainfo_events); @@ -2955,9 +2762,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, unlock_page_cgroup(pc); /* - * "charge_statistics" updated event counter. Then, check it. - * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. - * if they exceeds softlimit. + * "charge_statistics" updated event counter. 
*/ memcg_check_events(memcg, page); } @@ -6086,8 +5891,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) for (zone = 0; zone < MAX_NR_ZONES; zone++) { mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec); - mz->usage_in_excess = 0; - mz->on_tree = false; mz->memcg = memcg; } memcg->info.nodeinfo[node] = pn; @@ -6143,7 +5946,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) int node; size_t size = memcg_size(); - mem_cgroup_remove_from_trees(memcg); free_css_id(&mem_cgroup_subsys, &memcg->css); for_each_node(node) @@ -6225,29 +6027,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(parent_mem_cgroup); -static void __init mem_cgroup_soft_limit_tree_init(void) -{ - struct mem_cgroup_tree_per_node *rtpn; - struct mem_cgroup_tree_per_zone *rtpz; - int tmp, node, zone; - - for_each_node(node) { - tmp = node; - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp); - BUG_ON(!rtpn); - - soft_limit_tree.rb_tree_per_node[node] = rtpn; - - for (zone = 0; zone < MAX_NR_ZONES; zone++) { - rtpz = &rtpn->rb_tree_per_zone[zone]; - rtpz->rb_root = RB_ROOT; - spin_lock_init(&rtpz->lock); - } - } -} - static struct cgroup_subsys_state * __ref mem_cgroup_css_alloc(struct cgroup *cont) { @@ -7040,7 +6819,6 @@ static int __init mem_cgroup_init(void) { hotcpu_notifier(memcg_cpu_hotplug_callback, 0); enable_swap_cgroup(); - mem_cgroup_soft_limit_tree_init(); memcg_stock_init(); return 0; } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 2/3] memcg: Get rid of soft-limit tree infrastructure 2013-05-13 7:46 ` Michal Hocko @ 2013-05-15 8:38 ` Glauber Costa -1 siblings, 0 replies; 74+ messages in thread From: Glauber Costa @ 2013-05-15 8:38 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh On 05/13/2013 11:46 AM, Michal Hocko wrote: > Now that the soft limit is integrated to the reclaim directly the whole > soft-limit tree infrastructure is not needed anymore. Rip it out. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 224 +------------------------------------------------------ > 1 file changed, 1 insertion(+), 223 deletions(-) Great =) Reviewed-by: Glauber Costa <glommer@openvz.org> ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 2/3] memcg: Get rid of soft-limit tree infrastructure 2013-05-13 7:46 ` Michal Hocko @ 2013-05-16 22:16 ` Tejun Heo -1 siblings, 0 replies; 74+ messages in thread From: Tejun Heo @ 2013-05-16 22:16 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh On Mon, May 13, 2013 at 09:46:11AM +0200, Michal Hocko wrote: > Now that the soft limit is integrated to the reclaim directly the whole > soft-limit tree infrastructure is not needed anymore. Rip it out. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Tejun Heo <tj@kernel.org> Nice cleanup, thanks. -- tejun ^ permalink raw reply [flat|nested] 74+ messages in thread
* [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-05-13 7:46 ` Michal Hocko @ 2013-05-13 7:46 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-13 7:46 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From a semantic point of view it is even natural to consider the soft limit before touching all groups in the hierarchy tree of the group which hit the hard limit, because the soft limit tells us where to push back when there is memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have

A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under the B subtree and we can still reclaim from C (presumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure).
Changes since v1 - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1833c95..80ed1b6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1223aaa..163567b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1891,11 +1891,13 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1903,12 +1905,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is 
over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index fe63a43..d738802 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return global_reclaim(sc); + return true; } #else static bool global_reclaim(struct scan_control *sc) @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim @ 2013-05-13 7:46 ` Michal Hocko 0 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-13 7:46 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh Soft reclaim has been done only for the global reclaim (both background and direct). Since "memcg: integrate soft reclaim tighter with zone shrinking code" there is no reason for this limitation anymore as the soft limit reclaim doesn't use any special code paths and it is a part of the zone shrinking code which is used by both global and targeted reclaims. From semantic point of view it is even natural to consider soft limit before touching all groups in the hierarchy tree which is touching the hard limit because soft limit tells us where to push back when there is a memory pressure. It is not important whether the pressure comes from the limit or imbalanced zones. This patch simply enables soft reclaim unconditionally in mem_cgroup_should_soft_reclaim so it is enabled for both global and targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn about the root of the reclaim to know where to stop checking soft limit state of parents up the hierarchy. Say we have A (over soft limit) \ B (below s.l., hit the hard limit) / \ C D (below s.l.) B is the source of the outside memory pressure now for D but we shouldn't soft reclaim it because it is behaving well under B subtree and we can still reclaim from C (pressumably it is over the limit). mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the hierarchy at B (root of the memory pressure). 
Changes since v1 - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 6 ++++-- mm/memcontrol.c | 14 +++++++++----- mm/vmscan.c | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1833c95..80ed1b6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, mem_cgroup_update_page_stat(page, idx, -1); } -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg); +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root); void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, @@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page, } static inline -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { return false; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1223aaa..163567b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1891,11 +1891,13 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) #endif /* - * A group is eligible for the soft limit reclaim if it is - * a) is over its soft limit + * A group is eligible for the soft limit reclaim under the given root + * hierarchy if + * a) it is over its soft limit * b) any parent up the hierarchy is over its soft limit */ -bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) +bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg, + struct mem_cgroup *root) { struct mem_cgroup *parent = memcg; @@ -1903,12 +1905,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg) return true; /* - * If any parent up the hierarchy is 
over its soft limit then we - * have to obey and reclaim from this group as well. + * If any parent up to the root in the hierarchy is over its soft limit + * then we have to obey and reclaim from this group as well. */ while((parent = parent_mem_cgroup(parent))) { if (res_counter_soft_limit_excess(&parent->res)) return true; + if (parent == root) + break; } return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index fe63a43..d738802 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc) static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) { - return global_reclaim(sc); + return true; } #else static bool global_reclaim(struct scan_control *sc) @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) struct lruvec *lruvec; if (soft_reclaim && - !mem_cgroup_soft_reclaim_eligible(memcg)) { + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-05-13 7:46 ` Michal Hocko @ 2013-05-15 8:42 ` Glauber Costa -1 siblings, 0 replies; 74+ messages in thread From: Glauber Costa @ 2013-05-15 8:42 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh On 05/13/2013 11:46 AM, Michal Hocko wrote: > Soft reclaim has been done only for the global reclaim (both background > and direct). Since "memcg: integrate soft reclaim tighter with zone > shrinking code" there is no reason for this limitation anymore as the > soft limit reclaim doesn't use any special code paths and it is a > part of the zone shrinking code which is used by both global and > targeted reclaims. > > From a semantic point of view it is even natural to consider soft limit > before touching all groups in the hierarchy tree which is touching the > hard limit because soft limit tells us where to push back when there is > a memory pressure. It is not important whether the pressure comes from > the limit or imbalanced zones. > > This patch simply enables soft reclaim unconditionally in > mem_cgroup_should_soft_reclaim so it is enabled for both global and > targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn > about the root of the reclaim to know where to stop checking soft limit > state of parents up the hierarchy. > Say we have > A (over soft limit) > \ > B (below s.l., hit the hard limit) > / \ > C D (below s.l.) > > B is the source of the outside memory pressure now for D but we > shouldn't soft reclaim it because it is behaving well under B subtree > and we can still reclaim from C (presumably it is over the limit). > mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the > hierarchy at B (root of the memory pressure). 
> > Changes since v1 > - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- Reviewed-by: Glauber Costa <glommer@openvz.org> ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-05-15 8:42 ` Glauber Costa (?) @ 2013-05-17 7:50 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-17 7:50 UTC (permalink / raw) To: Glauber Costa Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh On Wed 15-05-13 12:42:10, Glauber Costa wrote: > On 05/13/2013 11:46 AM, Michal Hocko wrote: > > Soft reclaim has been done only for the global reclaim (both background > > and direct). Since "memcg: integrate soft reclaim tighter with zone > > shrinking code" there is no reason for this limitation anymore as the > > soft limit reclaim doesn't use any special code paths and it is a > > part of the zone shrinking code which is used by both global and > > targeted reclaims. > > > > From a semantic point of view it is even natural to consider soft limit > > before touching all groups in the hierarchy tree which is touching the > > hard limit because soft limit tells us where to push back when there is > > a memory pressure. It is not important whether the pressure comes from > > the limit or imbalanced zones. > > > > This patch simply enables soft reclaim unconditionally in > > mem_cgroup_should_soft_reclaim so it is enabled for both global and > > targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn > > about the root of the reclaim to know where to stop checking soft limit > > state of parents up the hierarchy. > > Say we have > > A (over soft limit) > > \ > > B (below s.l., hit the hard limit) > > / \ > > C D (below s.l.) > > > > B is the source of the outside memory pressure now for D but we > > shouldn't soft reclaim it because it is behaving well under B subtree > > and we can still reclaim from C (presumably it is over the limit). 
> > mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the > > hierarchy at B (root of the memory pressure). > > > > Changes since v1 > > - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible > > > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > --- > > Reviewed-by: Glauber Costa <glommer@openvz.org> Thanks for the review, Glauber! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-05-13 7:46 ` Michal Hocko @ 2013-05-16 23:12 ` Tejun Heo -1 siblings, 0 replies; 74+ messages in thread From: Tejun Heo @ 2013-05-16 23:12 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh On Mon, May 13, 2013 at 09:46:12AM +0200, Michal Hocko wrote: > Soft reclaim has been done only for the global reclaim (both background > and direct). Since "memcg: integrate soft reclaim tighter with zone > shrinking code" there is no reason for this limitation anymore as the > soft limit reclaim doesn't use any special code paths and it is a > part of the zone shrinking code which is used by both global and > targeted reclaims. ... > Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Tejun Heo <tj@kernel.org> Some nitpicks follow. > /* > - * A group is eligible for the soft limit reclaim if it is > - * a) is over its soft limit > + * A group is eligible for the soft limit reclaim under the given root > + * hierarchy if > + * a) it is over its soft limit > * b) any parent up the hierarchy is over its soft limit This was added before but in general I think the use of parent for ancestor is a bit confusing. Not a big deal but no reason to continue it. > /* > - * If any parent up the hierarchy is over its soft limit then we > - * have to obey and reclaim from this group as well. > + * If any parent up to the root in the hierarchy is over its soft limit > + * then we have to obey and reclaim from this group as well. Prolly using terms ancestors and subtree would make the explanation clearer? > static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > { > - return global_reclaim(sc); > + return true; Kinda silly after this change, maybe just modify shrink_zone() like the following? 
if (IS_ENABLED(CONFIG_MEMCG)) { __shrink_zone(zone, sc, true); if (sc->nr_scanned == nr_scanned) __shrink_zone(zone, sc, false); } else { __shrink_zone(zone, sc, false); } > @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > struct lruvec *lruvec; > > if (soft_reclaim && > - !mem_cgroup_soft_reclaim_eligible(memcg)) { > + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { Weird indentation which breaks line and goes over 80 col, why not do the following? if (soft_reclaim && !mem_cgroup_soft_reclaim_eligible(memcg, root)) { memcg = mem_cgroup_iter(root, memcg, &reclaim); continue; } Thanks. -- tejun ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [patch v3 -mm 3/3] vmscan, memcg: Do softlimit reclaim also for targeted reclaim 2013-05-16 23:12 ` Tejun Heo @ 2013-05-17 7:34 ` Michal Hocko -1 siblings, 0 replies; 74+ messages in thread From: Michal Hocko @ 2013-05-17 7:34 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Ying Han, Hugh Dickins, Glauber Costa, Michel Lespinasse, Greg Thelen, Balbir Singh On Thu 16-05-13 16:12:38, Tejun Heo wrote: > On Mon, May 13, 2013 at 09:46:12AM +0200, Michal Hocko wrote: > > Soft reclaim has been done only for the global reclaim (both background > > and direct). Since "memcg: integrate soft reclaim tighter with zone > > shrinking code" there is no reason for this limitation anymore as the > > soft limit reclaim doesn't use any special code paths and it is a > > part of the zone shrinking code which is used by both global and > > targeted reclaims. > ... > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > Reviewed-by: Tejun Heo <tj@kernel.org> Thanks > > Some nitpicks follow. > > > /* > > - * A group is eligible for the soft limit reclaim if it is > > - * a) is over its soft limit > > + * A group is eligible for the soft limit reclaim under the given root > > + * hierarchy if > > + * a) it is over its soft limit > > * b) any parent up the hierarchy is over its soft limit > > This was added before but in general I think the use of parent for > ancestor is a bit confusing. Not a big deal but no reason to continue > it. $ git grep ancestor mm/memcontrol.c | wc -l 4 $ git grep parent mm/memcontrol.c | wc -l 80 Yeah, we are used to using parent much more. Maybe it is worth a clean up on its own but I will stick with the majority in this patch. > > /* > > - * If any parent up the hierarchy is over its soft limit then we > > - * have to obey and reclaim from this group as well. 
> > + * If any parent up to the root in the hierarchy is over its soft limit > > + * then we have to obey and reclaim from this group as well. > > Prolly using terms ancestors and subtree would make the explanation > clearer? As I said earlier, we should be explicit about the hierarchy, as an ancestor/parent (whatever we call it) might or might not be part of the hierarchy. Yeah, we have that use_hierarchy thingy which we love so much. > > static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc) > > { > > - return global_reclaim(sc); > > + return true; > > Kinda silly after this change, maybe just modify shrink_zone() like > the following? > > if (IS_ENABLED(CONFIG_MEMCG)) { > __shrink_zone(zone, sc, true); > if (sc->nr_scanned == nr_scanned) > __shrink_zone(zone, sc, false); > } else { > __shrink_zone(zone, sc, false); > } I plan to build on top of this where mem_cgroup_should_soft_reclaim would do more than just return true. So I will keep it this way if you do not mind. > > @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim) > > struct lruvec *lruvec; > > > > if (soft_reclaim && > > - !mem_cgroup_soft_reclaim_eligible(memcg)) { > > + !mem_cgroup_soft_reclaim_eligible(memcg, root)) { > > Weird indentation which breaks line and goes over 80 col, why not do > the following? > > if (soft_reclaim && > !mem_cgroup_soft_reclaim_eligible(memcg, root)) { > memcg = mem_cgroup_iter(root, memcg, &reclaim); > continue; > } Hmm, I rely on vim doing the_right_thing usually. I definitely do not mind changing the formatting. I have fixed this in the first patch where the code has been introduced and refreshed this patch on top of that. I will repost the whole series with reviewed-bys and other acks later. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 74+ messages in thread