* [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. @ 2019-02-22 17:58 Andrey Ryabinin 2019-02-22 18:56 ` Rik van Riel ` (3 more replies) 0 siblings, 4 replies; 13+ messages in thread From: Andrey Ryabinin @ 2019-02-22 17:58 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, Andrey Ryabinin, Johannes Weiner, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt In a presence of more than 1 memory cgroup in the system our reclaim logic is just suck. When we hit memory limit (global or a limit on cgroup with subgroups) we reclaim some memory from all cgroups. This is sucks because, the cgroup that allocates more often always wins. E.g. job that allocates a lot of clean rarely used page cache will push out of memory other jobs with active relatively small all in memory working set. To prevent such situations we have memcg controls like low/max, etc which are supposed to protect jobs or limit them so they to not hurt others. But memory cgroups are very hard to configure right because it requires precise knowledge of the workload which may vary during the execution. E.g. setting memory limit means that job won't be able to use all memory in the system for page cache even if the rest the system is idle. Basically our current scheme requires to configure every single cgroup in the system. I think we can do better. The idea proposed by this patch is to reclaim only inactive pages and only from cgroups that have big (!inactive_is_low()) inactive list. And go back to shrinking active lists only if all inactive lists are low. Now, the simple test case to demonstrate the effect of the patch. The job in one memcg repeatedly compresses one file: perf stat -n --repeat 20 gzip -ck sample > /dev/null and just 'dd' running in parallel reading the disk in another cgroup. Before: Performance counter stats for 'gzip -ck sample' (20 runs): 17.673572290 seconds time elapsed ( +- 5.60% ) After: Performance counter stats for 'gzip -ck sample' (20 runs): 11.426193980 seconds time elapsed ( +- 0.20% ) The more often dd cgroup allocates memory, the more gzip suffer. With 4 parallel dd instead of one: Before: Performance counter stats for 'gzip -ck sample' (20 runs): 499.976782013 seconds time elapsed ( +- 23.13% ) After: Performance counter stats for 'gzip -ck sample' (20 runs): 11.307450516 seconds time elapsed ( +- 0.27% ) It would be possible to achieve the similar effect by setting the memory.low on gzip cgroup, but the best value for memory.low depends on the size of the 'sample' file. It also possible to limit the 'dd' job, but just imagine something more sophisticated than just 'dd', the job that would benefit from occupying all available memory. The best limit for such job would be something like 'total_memory' - 'sample size' which is again unknown. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> --- mm/vmscan.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index efd10d6b9510..2f562c3358ab 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -104,6 +104,8 @@ struct scan_control { /* One of the zones is ready for compaction */ unsigned int compaction_ready:1; + unsigned int may_shrink_active:1; + /* Allocation order */ s8 order; @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, scan >>= sc->priority; + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, + file, memcg, sc, false)) + scan = 0; + /* * If the cgroup's already been deleted, make sure to * scrape out the remaining cache. @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; bool reclaimable = false; + bool retry; do { struct mem_cgroup *root = sc->target_mem_cgroup; @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) }; struct mem_cgroup *memcg; + retry = false; + memset(&sc->nr, 0, sizeof(sc->nr)); nr_reclaimed = sc->nr_reclaimed; @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); + if ((sc->nr_scanned - nr_scanned) == 0 && + !sc->may_shrink_active) { + sc->may_shrink_active = 1; + retry = true; + continue; + } + if (reclaim_state) { sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; @@ -2887,7 +2903,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) current_may_throttle() && pgdat_memcg_congested(pgdat, root)) wait_iff_congested(BLK_RW_ASYNC, HZ/10); - } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, + } while (retry || should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc->nr_scanned - nr_scanned, sc)); /* -- 2.19.2 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin @ 2019-02-22 18:56 ` Rik van Riel 2019-02-22 19:15 ` Johannes Weiner ` (2 subsequent siblings) 3 siblings, 0 replies; 13+ messages in thread From: Rik van Riel @ 2019-02-22 18:56 UTC (permalink / raw) To: Andrey Ryabinin, Andrew Morton Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko, Vlastimil Babka, Mel Gorman, Roman Gushchin, Shakeel Butt [-- Attachment #1: Type: text/plain, Size: 2086 bytes --] On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote: > In a presence of more than 1 memory cgroup in the system our reclaim > logic is just suck. When we hit memory limit (global or a limit on > cgroup with subgroups) we reclaim some memory from all cgroups. > This is sucks because, the cgroup that allocates more often always > wins. > E.g. job that allocates a lot of clean rarely used page cache will > push > out of memory other jobs with active relatively small all in memory > working set. > > To prevent such situations we have memcg controls like low/max, etc > which > are supposed to protect jobs or limit them so they to not hurt > others. > But memory cgroups are very hard to configure right because it > requires > precise knowledge of the workload which may vary during the > execution. > E.g. setting memory limit means that job won't be able to use all > memory > in the system for page cache even if the rest the system is idle. > Basically our current scheme requires to configure every single > cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to > reclaim > only inactive pages and only from cgroups that have big > (!inactive_is_low()) inactive list. And go back to shrinking active > lists > only if all inactive lists are low. Your general idea seems like a good one, but the logic in the code seems a little convoluted to me. I wonder if we can simplify things a little, by checking (when we enter page reclaim) whether the pgdat has enough inactive pages based on the node_page_state statistics, and basing our decision whether or not to scan the active lists off that. As it stands, your patch seems like the kind of code that makes perfect sense today, but which will confuse people who look at the code two years from now. If the code could be made a little more explicit, great. If there are good reasons to do things in the fallback way your current patch does it, the code could use some good comments explaining why :) -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin 2019-02-22 18:56 ` Rik van Riel @ 2019-02-22 19:15 ` Johannes Weiner 2019-02-26 12:50 ` Andrey Ryabinin 2019-02-25 4:03 ` Roman Gushchin 2019-02-25 13:57 ` Vlastimil Babka 3 siblings, 1 reply; 13+ messages in thread From: Johannes Weiner @ 2019-02-22 19:15 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > In a presence of more than 1 memory cgroup in the system our reclaim > logic is just suck. When we hit memory limit (global or a limit on > cgroup with subgroups) we reclaim some memory from all cgroups. > This is sucks because, the cgroup that allocates more often always wins. > E.g. job that allocates a lot of clean rarely used page cache will push > out of memory other jobs with active relatively small all in memory > working set. > > To prevent such situations we have memcg controls like low/max, etc which > are supposed to protect jobs or limit them so they to not hurt others. > But memory cgroups are very hard to configure right because it requires > precise knowledge of the workload which may vary during the execution. > E.g. setting memory limit means that job won't be able to use all memory > in the system for page cache even if the rest the system is idle. > Basically our current scheme requires to configure every single cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to reclaim > only inactive pages and only from cgroups that have big > (!inactive_is_low()) inactive list. And go back to shrinking active lists > only if all inactive lists are low. Yes, you are absolutely right. We shouldn't go after active pages as long as there are plenty of inactive pages around. That's the global reclaim policy, and we currently fail to translate that well to cgrouped systems. Setting group protections or limits would work around this problem, but they're kind of a red herring. We shouldn't ever allow use-once streams to push out hot workingsets, that's a bug. > @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, > > scan >>= sc->priority; > > + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, > + file, memcg, sc, false)) > + scan = 0; > + > /* > * If the cgroup's already been deleted, make sure to > * scrape out the remaining cache. > @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > struct reclaim_state *reclaim_state = current->reclaim_state; > unsigned long nr_reclaimed, nr_scanned; > bool reclaimable = false; > + bool retry; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > }; > struct mem_cgroup *memcg; > > + retry = false; > + > memset(&sc->nr, 0, sizeof(sc->nr)); > > nr_reclaimed = sc->nr_reclaimed; > @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > + if ((sc->nr_scanned - nr_scanned) == 0 && > + !sc->may_shrink_active) { > + sc->may_shrink_active = 1; > + retry = true; > + continue; > + } Using !scanned as the gate could be a problem. There might be a cgroup that has inactive pages on the local level, but when viewed from the system level the total inactive pages in the system might still be low compared to active ones. In that case we should go after active pages. Basically, during global reclaim, the answer for whether active pages should be scanned or not should be the same regardless of whether the memory is all global or whether it's spread out between cgroups. The reason this isn't the case is because we're checking the ratio at the lruvec level - which is the highest level (and identical to the node counters) when memory is global, but it's at the lowest level when memory is cgrouped. So IMO what we should do is: - At the beginning of global reclaim, use node_page_state() to compare the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim can go after active pages or not. Regardless of what the ratio is in individual lruvecs. - And likewise at the beginning of cgroup limit reclaim, walk the subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE and ACTIVE_FILE counters, and make inactive_is_low() decision on those sums. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-22 19:15 ` Johannes Weiner @ 2019-02-26 12:50 ` Andrey Ryabinin 2019-03-01 10:38 ` Andrey Ryabinin 0 siblings, 1 reply; 13+ messages in thread From: Andrey Ryabinin @ 2019-02-26 12:50 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On 2/22/19 10:15 PM, Johannes Weiner wrote: > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: >> In a presence of more than 1 memory cgroup in the system our reclaim >> logic is just suck. When we hit memory limit (global or a limit on >> cgroup with subgroups) we reclaim some memory from all cgroups. >> This is sucks because, the cgroup that allocates more often always wins. >> E.g. job that allocates a lot of clean rarely used page cache will push >> out of memory other jobs with active relatively small all in memory >> working set. >> >> To prevent such situations we have memcg controls like low/max, etc which >> are supposed to protect jobs or limit them so they to not hurt others. >> But memory cgroups are very hard to configure right because it requires >> precise knowledge of the workload which may vary during the execution. >> E.g. setting memory limit means that job won't be able to use all memory >> in the system for page cache even if the rest the system is idle. >> Basically our current scheme requires to configure every single cgroup >> in the system. >> >> I think we can do better. The idea proposed by this patch is to reclaim >> only inactive pages and only from cgroups that have big >> (!inactive_is_low()) inactive list. And go back to shrinking active lists >> only if all inactive lists are low. > > Yes, you are absolutely right. > > We shouldn't go after active pages as long as there are plenty of > inactive pages around. That's the global reclaim policy, and we > currently fail to translate that well to cgrouped systems. > > Setting group protections or limits would work around this problem, > but they're kind of a red herring. We shouldn't ever allow use-once > streams to push out hot workingsets, that's a bug. > >> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, >> >> scan >>= sc->priority; >> >> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, >> + file, memcg, sc, false)) >> + scan = 0; >> + >> /* >> * If the cgroup's already been deleted, make sure to >> * scrape out the remaining cache. >> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >> struct reclaim_state *reclaim_state = current->reclaim_state; >> unsigned long nr_reclaimed, nr_scanned; >> bool reclaimable = false; >> + bool retry; >> >> do { >> struct mem_cgroup *root = sc->target_mem_cgroup; >> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >> }; >> struct mem_cgroup *memcg; >> >> + retry = false; >> + >> memset(&sc->nr, 0, sizeof(sc->nr)); >> >> nr_reclaimed = sc->nr_reclaimed; >> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >> } >> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); >> >> + if ((sc->nr_scanned - nr_scanned) == 0 && >> + !sc->may_shrink_active) { >> + sc->may_shrink_active = 1; >> + retry = true; >> + continue; >> + } > > Using !scanned as the gate could be a problem. There might be a cgroup > that has inactive pages on the local level, but when viewed from the > system level the total inactive pages in the system might still be low > compared to active ones. In that case we should go after active pages. > > Basically, during global reclaim, the answer for whether active pages > should be scanned or not should be the same regardless of whether the > memory is all global or whether it's spread out between cgroups. > > The reason this isn't the case is because we're checking the ratio at > the lruvec level - which is the highest level (and identical to the > node counters) when memory is global, but it's at the lowest level > when memory is cgrouped. > > So IMO what we should do is: > > - At the beginning of global reclaim, use node_page_state() to compare > the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim > can go after active pages or not. Regardless of what the ratio is in > individual lruvecs. > > - And likewise at the beginning of cgroup limit reclaim, walk the > subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE > and ACTIVE_FILE counters, and make inactive_is_low() decision on > those sums. > Sounds reasonable. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-26 12:50 ` Andrey Ryabinin @ 2019-03-01 10:38 ` Andrey Ryabinin 2019-03-01 17:49 ` Johannes Weiner 0 siblings, 1 reply; 13+ messages in thread From: Andrey Ryabinin @ 2019-03-01 10:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On 2/26/19 3:50 PM, Andrey Ryabinin wrote: > > > On 2/22/19 10:15 PM, Johannes Weiner wrote: >> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: >>> In a presence of more than 1 memory cgroup in the system our reclaim >>> logic is just suck. When we hit memory limit (global or a limit on >>> cgroup with subgroups) we reclaim some memory from all cgroups. >>> This is sucks because, the cgroup that allocates more often always wins. >>> E.g. job that allocates a lot of clean rarely used page cache will push >>> out of memory other jobs with active relatively small all in memory >>> working set. >>> >>> To prevent such situations we have memcg controls like low/max, etc which >>> are supposed to protect jobs or limit them so they to not hurt others. >>> But memory cgroups are very hard to configure right because it requires >>> precise knowledge of the workload which may vary during the execution. >>> E.g. setting memory limit means that job won't be able to use all memory >>> in the system for page cache even if the rest the system is idle. >>> Basically our current scheme requires to configure every single cgroup >>> in the system. >>> >>> I think we can do better. The idea proposed by this patch is to reclaim >>> only inactive pages and only from cgroups that have big >>> (!inactive_is_low()) inactive list. And go back to shrinking active lists >>> only if all inactive lists are low. >> >> Yes, you are absolutely right. >> >> We shouldn't go after active pages as long as there are plenty of >> inactive pages around. That's the global reclaim policy, and we >> currently fail to translate that well to cgrouped systems. >> >> Setting group protections or limits would work around this problem, >> but they're kind of a red herring. We shouldn't ever allow use-once >> streams to push out hot workingsets, that's a bug. >> >>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, >>> >>> scan >>= sc->priority; >>> >>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, >>> + file, memcg, sc, false)) >>> + scan = 0; >>> + >>> /* >>> * If the cgroup's already been deleted, make sure to >>> * scrape out the remaining cache. >>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>> struct reclaim_state *reclaim_state = current->reclaim_state; >>> unsigned long nr_reclaimed, nr_scanned; >>> bool reclaimable = false; >>> + bool retry; >>> >>> do { >>> struct mem_cgroup *root = sc->target_mem_cgroup; >>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>> }; >>> struct mem_cgroup *memcg; >>> >>> + retry = false; >>> + >>> memset(&sc->nr, 0, sizeof(sc->nr)); >>> >>> nr_reclaimed = sc->nr_reclaimed; >>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>> } >>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); >>> >>> + if ((sc->nr_scanned - nr_scanned) == 0 && >>> + !sc->may_shrink_active) { >>> + sc->may_shrink_active = 1; >>> + retry = true; >>> + continue; >>> + } >> >> Using !scanned as the gate could be a problem. There might be a cgroup >> that has inactive pages on the local level, but when viewed from the >> system level the total inactive pages in the system might still be low >> compared to active ones. In that case we should go after active pages. >> >> Basically, during global reclaim, the answer for whether active pages >> should be scanned or not should be the same regardless of whether the >> memory is all global or whether it's spread out between cgroups. >> >> The reason this isn't the case is because we're checking the ratio at >> the lruvec level - which is the highest level (and identical to the >> node counters) when memory is global, but it's at the lowest level >> when memory is cgrouped. >> >> So IMO what we should do is: >> >> - At the beginning of global reclaim, use node_page_state() to compare >> the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim >> can go after active pages or not. Regardless of what the ratio is in >> individual lruvecs. >> >> - And likewise at the beginning of cgroup limit reclaim, walk the >> subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE >> and ACTIVE_FILE counters, and make inactive_is_low() decision on >> those sums. >> > > Sounds reasonable. > On the second thought it seems to be better to keep the decision on lru level. There are couple reasons for this: 1) Using bare node_page_state() (or sc->targe_mem_cgroup's total_[in]active counters) would be wrong. Because some cgroups might have protection set (memory.low) and we must take it into account. Also different cgroups have different available swap space/memory.swappiness and it must be taken into account as well to. So it has to be yet another full memcg-tree iteration. 2) Let's consider simple case. Two cgroups, one with big 'active' set of pages the other allocates one-time used pages. So the total inactive is low, thus checking inactive ratio on higher level will result in reclaiming pages. While with check on lru-level only inactive will be reclaimed. I've tried to come up with a scenario in which checking ratio on higher level would better but failed. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-03-01 10:38 ` Andrey Ryabinin @ 2019-03-01 17:49 ` Johannes Weiner 2019-03-01 19:46 ` Andrey Ryabinin 0 siblings, 1 reply; 13+ messages in thread From: Johannes Weiner @ 2019-03-01 17:49 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt Hello Andrey, On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote: > On 2/26/19 3:50 PM, Andrey Ryabinin wrote: > > On 2/22/19 10:15 PM, Johannes Weiner wrote: > >> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > >>> In a presence of more than 1 memory cgroup in the system our reclaim > >>> logic is just suck. When we hit memory limit (global or a limit on > >>> cgroup with subgroups) we reclaim some memory from all cgroups. > >>> This is sucks because, the cgroup that allocates more often always wins. > >>> E.g. job that allocates a lot of clean rarely used page cache will push > >>> out of memory other jobs with active relatively small all in memory > >>> working set. > >>> > >>> To prevent such situations we have memcg controls like low/max, etc which > >>> are supposed to protect jobs or limit them so they to not hurt others. > >>> But memory cgroups are very hard to configure right because it requires > >>> precise knowledge of the workload which may vary during the execution. > >>> E.g. setting memory limit means that job won't be able to use all memory > >>> in the system for page cache even if the rest the system is idle. > >>> Basically our current scheme requires to configure every single cgroup > >>> in the system. > >>> > >>> I think we can do better. The idea proposed by this patch is to reclaim > >>> only inactive pages and only from cgroups that have big > >>> (!inactive_is_low()) inactive list. And go back to shrinking active lists > >>> only if all inactive lists are low. > >> > >> Yes, you are absolutely right. > >> > >> We shouldn't go after active pages as long as there are plenty of > >> inactive pages around. That's the global reclaim policy, and we > >> currently fail to translate that well to cgrouped systems. > >> > >> Setting group protections or limits would work around this problem, > >> but they're kind of a red herring. We shouldn't ever allow use-once > >> streams to push out hot workingsets, that's a bug. > >> > >>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, > >>> > >>> scan >>= sc->priority; > >>> > >>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, > >>> + file, memcg, sc, false)) > >>> + scan = 0; > >>> + > >>> /* > >>> * If the cgroup's already been deleted, make sure to > >>> * scrape out the remaining cache. > >>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>> struct reclaim_state *reclaim_state = current->reclaim_state; > >>> unsigned long nr_reclaimed, nr_scanned; > >>> bool reclaimable = false; > >>> + bool retry; > >>> > >>> do { > >>> struct mem_cgroup *root = sc->target_mem_cgroup; > >>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>> }; > >>> struct mem_cgroup *memcg; > >>> > >>> + retry = false; > >>> + > >>> memset(&sc->nr, 0, sizeof(sc->nr)); > >>> > >>> nr_reclaimed = sc->nr_reclaimed; > >>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>> } > >>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > >>> > >>> + if ((sc->nr_scanned - nr_scanned) == 0 && > >>> + !sc->may_shrink_active) { > >>> + sc->may_shrink_active = 1; > >>> + retry = true; > >>> + continue; > >>> + } > >> > >> Using !scanned as the gate could be a problem. There might be a cgroup > >> that has inactive pages on the local level, but when viewed from the > >> system level the total inactive pages in the system might still be low > >> compared to active ones. In that case we should go after active pages. > >> > >> Basically, during global reclaim, the answer for whether active pages > >> should be scanned or not should be the same regardless of whether the > >> memory is all global or whether it's spread out between cgroups. > >> > >> The reason this isn't the case is because we're checking the ratio at > >> the lruvec level - which is the highest level (and identical to the > >> node counters) when memory is global, but it's at the lowest level > >> when memory is cgrouped. > >> > >> So IMO what we should do is: > >> > >> - At the beginning of global reclaim, use node_page_state() to compare > >> the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim > >> can go after active pages or not. Regardless of what the ratio is in > >> individual lruvecs. > >> > >> - And likewise at the beginning of cgroup limit reclaim, walk the > >> subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE > >> and ACTIVE_FILE counters, and make inactive_is_low() decision on > >> those sums. > >> > > > > Sounds reasonable. > > > > On the second thought it seems to be better to keep the decision on lru level. > There are couple reasons for this: > > 1) Using bare node_page_state() (or sc->targe_mem_cgroup's total_[in]active counters) would be wrong. > Because some cgroups might have protection set (memory.low) and we must take it into account. Also different > cgroups have different available swap space/memory.swappiness and it must be taken into account as well to. > > So it has to be yet another full memcg-tree iteration. It should be possible to take that into account on the first iteration and adjust the inactive/active counters in proportion to how much of the cgroup's total memory is exempt by memory.low or min, right? > 2) Let's consider simple case. Two cgroups, one with big 'active' set of pages the other allocates one-time used pages. > So the total inactive is low, thus checking inactive ratio on higher level will result in reclaiming pages. > While with check on lru-level only inactive will be reclaimed. It's the other way around. Let's say you have two cgroups, A and B: A: 500M inactive 10G active -> inactive is low B: 10G inactive 500M active -> inactive is NOT low ---------------------------------------------------------- global: 10.5G inactive 10.5G active -> inactive is NOT low Checking locally will scan active pages from A. Checking globally will not, because there is plenty of use-once pages from B. So if you check globally, without any protection, A and B compete evenly during global reclaim. Under the same reclaim pressure, A has managed to activate most of its pages whereas B has not. That means A is hotter and B provides the better reclaim candidates. If you apply this decision locally, on the other hand, you are no longer aging the groups at the same rate. And then the LRU orders between groups will no longer be comparable, and you won't be reclaiming the coldest memory in the system anymore. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-03-01 17:49 ` Johannes Weiner @ 2019-03-01 19:46 ` Andrey Ryabinin 2019-03-01 22:20 ` Johannes Weiner 0 siblings, 1 reply; 13+ messages in thread From: Andrey Ryabinin @ 2019-03-01 19:46 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On 3/1/19 8:49 PM, Johannes Weiner wrote: > Hello Andrey, > > On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote: >> On 2/26/19 3:50 PM, Andrey Ryabinin wrote: >>> On 2/22/19 10:15 PM, Johannes Weiner wrote: >>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: >>>>> In a presence of more than 1 memory cgroup in the system our reclaim >>>>> logic is just suck. When we hit memory limit (global or a limit on >>>>> cgroup with subgroups) we reclaim some memory from all cgroups. >>>>> This is sucks because, the cgroup that allocates more often always wins. >>>>> E.g. job that allocates a lot of clean rarely used page cache will push >>>>> out of memory other jobs with active relatively small all in memory >>>>> working set. >>>>> >>>>> To prevent such situations we have memcg controls like low/max, etc which >>>>> are supposed to protect jobs or limit them so they to not hurt others. >>>>> But memory cgroups are very hard to configure right because it requires >>>>> precise knowledge of the workload which may vary during the execution. >>>>> E.g. setting memory limit means that job won't be able to use all memory >>>>> in the system for page cache even if the rest the system is idle. >>>>> Basically our current scheme requires to configure every single cgroup >>>>> in the system. >>>>> >>>>> I think we can do better. The idea proposed by this patch is to reclaim >>>>> only inactive pages and only from cgroups that have big >>>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists >>>>> only if all inactive lists are low. >>>> >>>> Yes, you are absolutely right. >>>> >>>> We shouldn't go after active pages as long as there are plenty of >>>> inactive pages around. That's the global reclaim policy, and we >>>> currently fail to translate that well to cgrouped systems. >>>> >>>> Setting group protections or limits would work around this problem, >>>> but they're kind of a red herring. We shouldn't ever allow use-once >>>> streams to push out hot workingsets, that's a bug. >>>> >>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, >>>>> >>>>> scan >>= sc->priority; >>>>> >>>>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, >>>>> + file, memcg, sc, false)) >>>>> + scan = 0; >>>>> + >>>>> /* >>>>> * If the cgroup's already been deleted, make sure to >>>>> * scrape out the remaining cache. >>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>> struct reclaim_state *reclaim_state = current->reclaim_state; >>>>> unsigned long nr_reclaimed, nr_scanned; >>>>> bool reclaimable = false; >>>>> + bool retry; >>>>> >>>>> do { >>>>> struct mem_cgroup *root = sc->target_mem_cgroup; >>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>> }; >>>>> struct mem_cgroup *memcg; >>>>> >>>>> + retry = false; >>>>> + >>>>> memset(&sc->nr, 0, sizeof(sc->nr)); >>>>> >>>>> nr_reclaimed = sc->nr_reclaimed; >>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>> } >>>>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); >>>>> >>>>> + if ((sc->nr_scanned - nr_scanned) == 0 && >>>>> + !sc->may_shrink_active) { >>>>> + sc->may_shrink_active = 1; >>>>> + retry = true; >>>>> + continue; >>>>> + } >>>> >>>> Using !scanned as the gate could be a problem. There might be a cgroup >>>> that has inactive pages on the local level, but when viewed from the >>>> system level the total inactive pages in the system might still be low >>>> compared to active ones. In that case we should go after active pages. >>>> >>>> Basically, during global reclaim, the answer for whether active pages >>>> should be scanned or not should be the same regardless of whether the >>>> memory is all global or whether it's spread out between cgroups. >>>> >>>> The reason this isn't the case is because we're checking the ratio at >>>> the lruvec level - which is the highest level (and identical to the >>>> node counters) when memory is global, but it's at the lowest level >>>> when memory is cgrouped. >>>> >>>> So IMO what we should do is: >>>> >>>> - At the beginning of global reclaim, use node_page_state() to compare >>>> the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim >>>> can go after active pages or not. Regardless of what the ratio is in >>>> individual lruvecs. >>>> >>>> - And likewise at the beginning of cgroup limit reclaim, walk the >>>> subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE >>>> and ACTIVE_FILE counters, and make inactive_is_low() decision on >>>> those sums. >>>> >>> >>> Sounds reasonable. >>> >> >> On the second thought it seems to be better to keep the decision on lru level. >> There are couple reasons for this: >> >> 1) Using bare node_page_state() (or sc->targe_mem_cgroup's total_[in]active counters) would be wrong. >> Because some cgroups might have protection set (memory.low) and we must take it into account. Also different >> cgroups have different available swap space/memory.swappiness and it must be taken into account as well to. >> >> So it has to be yet another full memcg-tree iteration. > > It should be possible to take that into account on the first iteration > and adjust the inactive/active counters in proportion to how much of > the cgroup's total memory is exempt by memory.low or min, right? > Should be possible, more complexity though to this subtle code. >> 2) Let's consider simple case. Two cgroups, one with big 'active' set of pages the other allocates one-time used pages. >> So the total inactive is low, thus checking inactive ratio on higher level will result in reclaiming pages. >> While with check on lru-level only inactive will be reclaimed. > > It's the other way around. Let's say you have two cgroups, A and B: > > A: 500M inactive 10G active -> inactive is low > B: 10G inactive 500M active -> inactive is NOT low > ---------------------------------------------------------- > global: 10.5G inactive 10.5G active -> inactive is NOT low > > Checking locally will scan active pages from A. No, checking locally will not scan active from A. Initial state of sc->may_shrink_active = 0, so A group will be skipped completely, and will reclaim from B. Since overall reclaim was successful, sc->may_shrink_active remain 0 and A will be protected as long as B supply enough inactive pages. > Checking globally will > not, because there is plenty of use-once pages from B. > That is correct. So in this example global vs local check will not make a difference. > So if you check globally, without any protection, A and B compete > evenly during global reclaim. Under the same reclaim pressure, A has > managed to activate most of its pages whereas B has not. That means A > is hotter and B provides the better reclaim candidates. > > If you apply this decision locally, on the other hand, you are no > longer aging the groups at the same rate. And then the LRU orders > between groups will no longer be comparable, and you won't be > reclaiming the coldest memory in the system anymore. > I really don't see any how global check will make any difference in this example. In both cases, we reclaim only from B and don't touch A. And this what we actually want. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-03-01 19:46 ` Andrey Ryabinin @ 2019-03-01 22:20 ` Johannes Weiner 2019-03-04 17:02 ` Andrey Ryabinin 0 siblings, 1 reply; 13+ messages in thread From: Johannes Weiner @ 2019-03-01 22:20 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote: > On 3/1/19 8:49 PM, Johannes Weiner wrote: > > On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote: > >> On 2/26/19 3:50 PM, Andrey Ryabinin wrote: > >>> On 2/22/19 10:15 PM, Johannes Weiner wrote: > >>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > >>>>> In a presence of more than 1 memory cgroup in the system our reclaim > >>>>> logic is just suck. When we hit memory limit (global or a limit on > >>>>> cgroup with subgroups) we reclaim some memory from all cgroups. > >>>>> This is sucks because, the cgroup that allocates more often always wins. > >>>>> E.g. job that allocates a lot of clean rarely used page cache will push > >>>>> out of memory other jobs with active relatively small all in memory > >>>>> working set. > >>>>> > >>>>> To prevent such situations we have memcg controls like low/max, etc which > >>>>> are supposed to protect jobs or limit them so they to not hurt others. > >>>>> But memory cgroups are very hard to configure right because it requires > >>>>> precise knowledge of the workload which may vary during the execution. > >>>>> E.g. setting memory limit means that job won't be able to use all memory > >>>>> in the system for page cache even if the rest the system is idle. > >>>>> Basically our current scheme requires to configure every single cgroup > >>>>> in the system. > >>>>> > >>>>> I think we can do better. The idea proposed by this patch is to reclaim > >>>>> only inactive pages and only from cgroups that have big > >>>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists > >>>>> only if all inactive lists are low. > >>>> > >>>> Yes, you are absolutely right. > >>>> > >>>> We shouldn't go after active pages as long as there are plenty of > >>>> inactive pages around. That's the global reclaim policy, and we > >>>> currently fail to translate that well to cgrouped systems. > >>>> > >>>> Setting group protections or limits would work around this problem, > >>>> but they're kind of a red herring. We shouldn't ever allow use-once > >>>> streams to push out hot workingsets, that's a bug. > >>>> > >>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, > >>>>> > >>>>> scan >>= sc->priority; > >>>>> > >>>>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, > >>>>> + file, memcg, sc, false)) > >>>>> + scan = 0; > >>>>> + > >>>>> /* > >>>>> * If the cgroup's already been deleted, make sure to > >>>>> * scrape out the remaining cache. > >>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>>>> struct reclaim_state *reclaim_state = current->reclaim_state; > >>>>> unsigned long nr_reclaimed, nr_scanned; > >>>>> bool reclaimable = false; > >>>>> + bool retry; > >>>>> > >>>>> do { > >>>>> struct mem_cgroup *root = sc->target_mem_cgroup; > >>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>>>> }; > >>>>> struct mem_cgroup *memcg; > >>>>> > >>>>> + retry = false; > >>>>> + > >>>>> memset(&sc->nr, 0, sizeof(sc->nr)); > >>>>> > >>>>> nr_reclaimed = sc->nr_reclaimed; > >>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>>>> } > >>>>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > >>>>> > >>>>> + if ((sc->nr_scanned - nr_scanned) == 0 && > >>>>> + !sc->may_shrink_active) { > >>>>> + sc->may_shrink_active = 1; > >>>>> + retry = true; > >>>>> + continue; > >>>>> + } > >>>> > >>>> Using !scanned as the gate could be a problem. There might be a cgroup > >>>> that has inactive pages on the local level, but when viewed from the > >>>> system level the total inactive pages in the system might still be low > >>>> compared to active ones. In that case we should go after active pages. > >>>> > >>>> Basically, during global reclaim, the answer for whether active pages > >>>> should be scanned or not should be the same regardless of whether the > >>>> memory is all global or whether it's spread out between cgroups. > >>>> > >>>> The reason this isn't the case is because we're checking the ratio at > >>>> the lruvec level - which is the highest level (and identical to the > >>>> node counters) when memory is global, but it's at the lowest level > >>>> when memory is cgrouped. > >>>> > >>>> So IMO what we should do is: > >>>> > >>>> - At the beginning of global reclaim, use node_page_state() to compare > >>>> the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim > >>>> can go after active pages or not. Regardless of what the ratio is in > >>>> individual lruvecs. > >>>> > >>>> - And likewise at the beginning of cgroup limit reclaim, walk the > >>>> subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE > >>>> and ACTIVE_FILE counters, and make inactive_is_low() decision on > >>>> those sums. > >>>> > >>> > >>> Sounds reasonable. > >>> > >> > >> On the second thought it seems to be better to keep the decision on lru level. > >> There are couple reasons for this: > >> > >> 1) Using bare node_page_state() (or sc->targe_mem_cgroup's total_[in]active counters) would be wrong. > >> Because some cgroups might have protection set (memory.low) and we must take it into account. Also different > >> cgroups have different available swap space/memory.swappiness and it must be taken into account as well to. > >> > >> So it has to be yet another full memcg-tree iteration. > > > > It should be possible to take that into account on the first iteration > > and adjust the inactive/active counters in proportion to how much of > > the cgroup's total memory is exempt by memory.low or min, right? > > > > Should be possible, more complexity though to this subtle code. > > > >> 2) Let's consider simple case. Two cgroups, one with big 'active' set of pages the other allocates one-time used pages. > >> So the total inactive is low, thus checking inactive ratio on higher level will result in reclaiming pages. > >> While with check on lru-level only inactive will be reclaimed. > > > > It's the other way around. Let's say you have two cgroups, A and B: > > > > A: 500M inactive 10G active -> inactive is low > > B: 10G inactive 500M active -> inactive is NOT low > > ---------------------------------------------------------- > > global: 10.5G inactive 10.5G active -> inactive is NOT low > > > > Checking locally will scan active pages from A. > > No, checking locally will not scan active from A. Initial state of > sc->may_shrink_active = 0, so A group will be skipped completely, > and will reclaim from B. Since overall reclaim was successful, > sc->may_shrink_active remain 0 and A will be protected as long as B > supply enough inactive pages. Oh, this was a misunderstanding. When you wrote "on second thought it seems to be better to keep the decision at the lru level", I assumed you were arguing for keeping the current code as-is and abandoning your patch. But that leaves my questions from above unanswered. Consider the following situation: A: 50M inactive 0 active B: 0 inactive 20G active If the processes in A and B were not cgrouped, these pages would be on a single LRU and we'd go after B's active pages. But with your patches, we'd reclaim only A's inactive pages. What's the justification for that unfairness? Keep in mind that users also enable the memory controller on cgroups simply to keep track of how much memory different process groups are using. Merely enabling the controller without otherwise configuring any protections shouldn't change the aging behavior for that memory. And that USED to be the case. We USED to have a physical global LRU list that held all the cgroup pages, and then a per-cgroup LRU list that was only for limit reclaim. We got rid of the physical global LRU list for cgrouped memory with the idea that we can emulate the global aging. Hence the round-robin tree iteration in shrink_node(). We just need to fix the inactive_list_is_low() check to also emulate the global LRU behavior. That would fix your problem of scanning active pages when there are already plenty of inactive pages in the system, but without the risk of severely overreclaiming the inactive pages of a small group. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-03-01 22:20 ` Johannes Weiner @ 2019-03-04 17:02 ` Andrey Ryabinin 0 siblings, 0 replies; 13+ messages in thread From: Andrey Ryabinin @ 2019-03-04 17:02 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On 3/2/19 1:20 AM, Johannes Weiner wrote: > On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote: >> On 3/1/19 8:49 PM, Johannes Weiner wrote: >>> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote: >>>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote: >>>>> On 2/22/19 10:15 PM, Johannes Weiner wrote: >>>>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: >>>>>>> In a presence of more than 1 memory cgroup in the system our reclaim >>>>>>> logic is just suck. When we hit memory limit (global or a limit on >>>>>>> cgroup with subgroups) we reclaim some memory from all cgroups. >>>>>>> This is sucks because, the cgroup that allocates more often always wins. >>>>>>> E.g. job that allocates a lot of clean rarely used page cache will push >>>>>>> out of memory other jobs with active relatively small all in memory >>>>>>> working set. >>>>>>> >>>>>>> To prevent such situations we have memcg controls like low/max, etc which >>>>>>> are supposed to protect jobs or limit them so they to not hurt others. >>>>>>> But memory cgroups are very hard to configure right because it requires >>>>>>> precise knowledge of the workload which may vary during the execution. >>>>>>> E.g. setting memory limit means that job won't be able to use all memory >>>>>>> in the system for page cache even if the rest the system is idle. >>>>>>> Basically our current scheme requires to configure every single cgroup >>>>>>> in the system. >>>>>>> >>>>>>> I think we can do better. The idea proposed by this patch is to reclaim >>>>>>> only inactive pages and only from cgroups that have big >>>>>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists >>>>>>> only if all inactive lists are low. >>>>>> >>>>>> Yes, you are absolutely right. >>>>>> >>>>>> We shouldn't go after active pages as long as there are plenty of >>>>>> inactive pages around. That's the global reclaim policy, and we >>>>>> currently fail to translate that well to cgrouped systems. >>>>>> >>>>>> Setting group protections or limits would work around this problem, >>>>>> but they're kind of a red herring. We shouldn't ever allow use-once >>>>>> streams to push out hot workingsets, that's a bug. >>>>>> >>>>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, >>>>>>> >>>>>>> scan >>= sc->priority; >>>>>>> >>>>>>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, >>>>>>> + file, memcg, sc, false)) >>>>>>> + scan = 0; >>>>>>> + >>>>>>> /* >>>>>>> * If the cgroup's already been deleted, make sure to >>>>>>> * scrape out the remaining cache. >>>>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>>>> struct reclaim_state *reclaim_state = current->reclaim_state; >>>>>>> unsigned long nr_reclaimed, nr_scanned; >>>>>>> bool reclaimable = false; >>>>>>> + bool retry; >>>>>>> >>>>>>> do { >>>>>>> struct mem_cgroup *root = sc->target_mem_cgroup; >>>>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>>>> }; >>>>>>> struct mem_cgroup *memcg; >>>>>>> >>>>>>> + retry = false; >>>>>>> + >>>>>>> memset(&sc->nr, 0, sizeof(sc->nr)); >>>>>>> >>>>>>> nr_reclaimed = sc->nr_reclaimed; >>>>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) >>>>>>> } >>>>>>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); >>>>>>> >>>>>>> + if ((sc->nr_scanned - nr_scanned) == 0 && >>>>>>> + !sc->may_shrink_active) { >>>>>>> + sc->may_shrink_active = 1; >>>>>>> + retry = true; >>>>>>> + continue; >>>>>>> + } >>>>>> >>>>>> Using !scanned as the gate could be a problem. There might be a cgroup >>>>>> that has inactive pages on the local level, but when viewed from the >>>>>> system level the total inactive pages in the system might still be low >>>>>> compared to active ones. In that case we should go after active pages. >>>>>> >>>>>> Basically, during global reclaim, the answer for whether active pages >>>>>> should be scanned or not should be the same regardless of whether the >>>>>> memory is all global or whether it's spread out between cgroups. >>>>>> >>>>>> The reason this isn't the case is because we're checking the ratio at >>>>>> the lruvec level - which is the highest level (and identical to the >>>>>> node counters) when memory is global, but it's at the lowest level >>>>>> when memory is cgrouped. >>>>>> >>>>>> So IMO what we should do is: >>>>>> >>>>>> - At the beginning of global reclaim, use node_page_state() to compare >>>>>> the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim >>>>>> can go after active pages or not. Regardless of what the ratio is in >>>>>> individual lruvecs. >>>>>> >>>>>> - And likewise at the beginning of cgroup limit reclaim, walk the >>>>>> subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE >>>>>> and ACTIVE_FILE counters, and make inactive_is_low() decision on >>>>>> those sums. >>>>>> >>>>> >>>>> Sounds reasonable. >>>>> >>>> >>>> On the second thought it seems to be better to keep the decision on lru level. >>>> There are couple reasons for this: >>>> >>>> 1) Using bare node_page_state() (or sc->targe_mem_cgroup's total_[in]active counters) would be wrong. >>>> Because some cgroups might have protection set (memory.low) and we must take it into account. Also different >>>> cgroups have different available swap space/memory.swappiness and it must be taken into account as well to. >>>> >>>> So it has to be yet another full memcg-tree iteration. >>> >>> It should be possible to take that into account on the first iteration >>> and adjust the inactive/active counters in proportion to how much of >>> the cgroup's total memory is exempt by memory.low or min, right? >>> >> >> Should be possible, more complexity though to this subtle code. >> >> >>>> 2) Let's consider simple case. Two cgroups, one with big 'active' set of pages the other allocates one-time used pages. >>>> So the total inactive is low, thus checking inactive ratio on higher level will result in reclaiming pages. >>>> While with check on lru-level only inactive will be reclaimed. >>> >>> It's the other way around. Let's say you have two cgroups, A and B: >>> >>> A: 500M inactive 10G active -> inactive is low >>> B: 10G inactive 500M active -> inactive is NOT low >>> ---------------------------------------------------------- >>> global: 10.5G inactive 10.5G active -> inactive is NOT low >>> >>> Checking locally will scan active pages from A. >> >> No, checking locally will not scan active from A. Initial state of >> sc->may_shrink_active = 0, so A group will be skipped completely, >> and will reclaim from B. Since overall reclaim was successful, >> sc->may_shrink_active remain 0 and A will be protected as long as B >> supply enough inactive pages. > > Oh, this was a misunderstanding. When you wrote "on second thought it > seems to be better to keep the decision at the lru level", I assumed > you were arguing for keeping the current code as-is and abandoning > your patch. > > But that leaves my questions from above unanswered. Consider the > following situation: > > A: 50M inactive 0 active > B: 0 inactive 20G active > > If the processes in A and B were not cgrouped, these pages would be on > a single LRU and we'd go after B's active pages. > > But with your patches, we'd reclaim only A's inactive pages. > I assume that not cgrouped case we would reclaim mostly from A anyway because going after B's active pages only means that we move them to inactive list where we still have A's pages for reclaim. And B has a chance to reactivate deactivated pages. In cgrouped case going after B's active pages implies immediate reclaim of them. > What's the justification for that unfairness? > If it's A creates pressure by allocating a lot of one-time used pages, than global inactive ratio check allows A to grow up to !inactive_is_low() point by pushing out B's active pages. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin 2019-02-22 18:56 ` Rik van Riel 2019-02-22 19:15 ` Johannes Weiner @ 2019-02-25 4:03 ` Roman Gushchin 2019-02-26 15:36 ` Andrey Ryabinin 2019-02-25 13:57 ` Vlastimil Babka 3 siblings, 1 reply; 13+ messages in thread From: Roman Gushchin @ 2019-02-25 4:03 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Shakeel Butt On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > In a presence of more than 1 memory cgroup in the system our reclaim > logic is just suck. When we hit memory limit (global or a limit on > cgroup with subgroups) we reclaim some memory from all cgroups. > This is sucks because, the cgroup that allocates more often always wins. > E.g. job that allocates a lot of clean rarely used page cache will push > out of memory other jobs with active relatively small all in memory > working set. > > To prevent such situations we have memcg controls like low/max, etc which > are supposed to protect jobs or limit them so they to not hurt others. > But memory cgroups are very hard to configure right because it requires > precise knowledge of the workload which may vary during the execution. > E.g. setting memory limit means that job won't be able to use all memory > in the system for page cache even if the rest the system is idle. > Basically our current scheme requires to configure every single cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to reclaim > only inactive pages and only from cgroups that have big > (!inactive_is_low()) inactive list. And go back to shrinking active lists > only if all inactive lists are low. Hi Andrey! It's definitely an interesting idea! However, let me bring some concerns: 1) What's considered active and inactive depends on memory pressure inside a cgroup. Actually active pages in one cgroup (e.g. just deleted) can be colder than inactive pages in an other (e.g. a memory-hungry cgroup with a tight memory.max). Also a workload inside a cgroup can to some extend control what's going to the active LRU. So it opens a way to get more memory unfairly by artificially promoting more pages to the active LRU. So a cgroup can get an unfair advantage over other cgroups. Generally speaking, now we have a way to measure the memory pressure inside a cgroup. So, in theory, it should be possible to balance scanning effort based on memory pressure. Thanks! ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-25 4:03 ` Roman Gushchin @ 2019-02-26 15:36 ` Andrey Ryabinin 2019-02-26 22:08 ` Roman Gushchin 0 siblings, 1 reply; 13+ messages in thread From: Andrey Ryabinin @ 2019-02-26 15:36 UTC (permalink / raw) To: Roman Gushchin Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Shakeel Butt On 2/25/19 7:03 AM, Roman Gushchin wrote: > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: >> In a presence of more than 1 memory cgroup in the system our reclaim >> logic is just suck. When we hit memory limit (global or a limit on >> cgroup with subgroups) we reclaim some memory from all cgroups. >> This is sucks because, the cgroup that allocates more often always wins. >> E.g. job that allocates a lot of clean rarely used page cache will push >> out of memory other jobs with active relatively small all in memory >> working set. >> >> To prevent such situations we have memcg controls like low/max, etc which >> are supposed to protect jobs or limit them so they to not hurt others. >> But memory cgroups are very hard to configure right because it requires >> precise knowledge of the workload which may vary during the execution. >> E.g. setting memory limit means that job won't be able to use all memory >> in the system for page cache even if the rest the system is idle. >> Basically our current scheme requires to configure every single cgroup >> in the system. >> >> I think we can do better. The idea proposed by this patch is to reclaim >> only inactive pages and only from cgroups that have big >> (!inactive_is_low()) inactive list. And go back to shrinking active lists >> only if all inactive lists are low. > > Hi Andrey! > > It's definitely an interesting idea! However, let me bring some concerns: > 1) What's considered active and inactive depends on memory pressure inside > a cgroup. There is no such dependency. High memory pressure may be generated both by active and inactive pages. We also can have a cgroup creating no pressure with almost only active (or only inactive) pages. > Actually active pages in one cgroup (e.g. just deleted) can be colder > than inactive pages in an other (e.g. a memory-hungry cgroup with a tight > memory.max). > Well, yes, this is a drawback of having per-memcg lrus. > Also a workload inside a cgroup can to some extend control what's going > to the active LRU. So it opens a way to get more memory unfairly by > artificially promoting more pages to the active LRU. So a cgroup > can get an unfair advantage over other cgroups. > Unfair is usually a negative term, but in this case it's very much depends on definition of what is "fair". If fair means to put equal reclaim pressure on all cgroups, than yes, the patch increases such unfairness, but such unfairness is a good thing. Obviously it's more valuable to keep in memory actively used page than the page that not used. > Generally speaking, now we have a way to measure the memory pressure > inside a cgroup. So, in theory, it should be possible to balance > scanning effort based on memory pressure. > Simply by design, the inactive pages are the first candidates to reclaim. Any decision that doesn't take into account inactive pages probably would be wrong. E.g. cgroup A with active job loading a big and active working set which creates high memory pressure and cgroup B - idle (no memory pressure) with a huge not used cache. It's definitely preferable to reclaim from B rather than from A. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-26 15:36 ` Andrey Ryabinin @ 2019-02-26 22:08 ` Roman Gushchin 0 siblings, 0 replies; 13+ messages in thread From: Roman Gushchin @ 2019-02-26 22:08 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman, Shakeel Butt On Tue, Feb 26, 2019 at 06:36:38PM +0300, Andrey Ryabinin wrote: > > > On 2/25/19 7:03 AM, Roman Gushchin wrote: > > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > >> In a presence of more than 1 memory cgroup in the system our reclaim > >> logic is just suck. When we hit memory limit (global or a limit on > >> cgroup with subgroups) we reclaim some memory from all cgroups. > >> This is sucks because, the cgroup that allocates more often always wins. > >> E.g. job that allocates a lot of clean rarely used page cache will push > >> out of memory other jobs with active relatively small all in memory > >> working set. > >> > >> To prevent such situations we have memcg controls like low/max, etc which > >> are supposed to protect jobs or limit them so they to not hurt others. > >> But memory cgroups are very hard to configure right because it requires > >> precise knowledge of the workload which may vary during the execution. > >> E.g. setting memory limit means that job won't be able to use all memory > >> in the system for page cache even if the rest the system is idle. > >> Basically our current scheme requires to configure every single cgroup > >> in the system. > >> > >> I think we can do better. The idea proposed by this patch is to reclaim > >> only inactive pages and only from cgroups that have big > >> (!inactive_is_low()) inactive list. And go back to shrinking active lists > >> only if all inactive lists are low. > > > > Hi Andrey! > > > > It's definitely an interesting idea! However, let me bring some concerns: > > 1) What's considered active and inactive depends on memory pressure inside > > a cgroup. > > There is no such dependency. High memory pressure may be generated both > by active and inactive pages. We also can have a cgroup creating no pressure > with almost only active (or only inactive) pages. > > > Actually active pages in one cgroup (e.g. just deleted) can be colder > > than inactive pages in an other (e.g. a memory-hungry cgroup with a tight > > memory.max). > > > > Well, yes, this is a drawback of having per-memcg lrus. > > > Also a workload inside a cgroup can to some extend control what's going > > to the active LRU. So it opens a way to get more memory unfairly by > > artificially promoting more pages to the active LRU. So a cgroup > > can get an unfair advantage over other cgroups. > > > > Unfair is usually a negative term, but in this case it's very much depends on definition of what is "fair". > > If fair means to put equal reclaim pressure on all cgroups, than yes, the patch > increases such unfairness, but such unfairness is a good thing. > Obviously it's more valuable to keep in memory actively used page than the page that not used. I think that fairness is good here. > > > Generally speaking, now we have a way to measure the memory pressure > > inside a cgroup. So, in theory, it should be possible to balance > > scanning effort based on memory pressure. > > > > Simply by design, the inactive pages are the first candidates to reclaim. > Any decision that doesn't take into account inactive pages probably would be wrong. > > E.g. cgroup A with active job loading a big and active working set which creates high memory pressure > and cgroup B - idle (no memory pressure) with a huge not used cache. > It's definitely preferable to reclaim from B rather than from A. > For sure, if we're reclaiming hot pages instead of cold, it's bad for the overall performance. But active and inactive LRUs are just an approximation of what is hot and cold. E.g. I will run "cat some_large_file" twice in a cgroup, and the whole file will reside in the active LRU and considered hot. Even if nobody will ever use it again. So it means that depending on memory usage pattern, some workloads will benefit from your change, and some will suffer. Btw, what will be with protected cgroups (with memory.low set)? Those will still affect global scanning decisions (active/inactive ratio), but will be exempted from scanning? Thanks! ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim. 2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin ` (2 preceding siblings ...) 2019-02-25 4:03 ` Roman Gushchin @ 2019-02-25 13:57 ` Vlastimil Babka 3 siblings, 0 replies; 13+ messages in thread From: Vlastimil Babka @ 2019-02-25 13:57 UTC (permalink / raw) To: Andrey Ryabinin, Andrew Morton Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko, Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt On 2/22/19 6:58 PM, Andrey Ryabinin wrote: > In a presence of more than 1 memory cgroup in the system our reclaim > logic is just suck. When we hit memory limit (global or a limit on > cgroup with subgroups) we reclaim some memory from all cgroups. > This is sucks because, the cgroup that allocates more often always wins. > E.g. job that allocates a lot of clean rarely used page cache will push > out of memory other jobs with active relatively small all in memory > working set. > > To prevent such situations we have memcg controls like low/max, etc which > are supposed to protect jobs or limit them so they to not hurt others. > But memory cgroups are very hard to configure right because it requires > precise knowledge of the workload which may vary during the execution. > E.g. setting memory limit means that job won't be able to use all memory > in the system for page cache even if the rest the system is idle. > Basically our current scheme requires to configure every single cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to reclaim > only inactive pages and only from cgroups that have big > (!inactive_is_low()) inactive list. And go back to shrinking active lists > only if all inactive lists are low. Perhaps going this direction could also make page cache side-channel attacks harder? Quoting [1]: "On Linux, we are only able to evict pages efficiently because we can trick the page re- placement algorithm into believing our target page would be the best choice for eviction. The reason for this lies in the fact that Linux uses a global page replacement algorithm, i.e., an algorithm which does not distinguish between dif- ferent processes. Global page replacement algorithms have been known for decades to allow one process to perform a denial-of-service on other processes" [1] https://arxiv.org/abs/1901.01161 ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2019-03-04 17:02 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin 2019-02-22 18:56 ` Rik van Riel 2019-02-22 19:15 ` Johannes Weiner 2019-02-26 12:50 ` Andrey Ryabinin 2019-03-01 10:38 ` Andrey Ryabinin 2019-03-01 17:49 ` Johannes Weiner 2019-03-01 19:46 ` Andrey Ryabinin 2019-03-01 22:20 ` Johannes Weiner 2019-03-04 17:02 ` Andrey Ryabinin 2019-02-25 4:03 ` Roman Gushchin 2019-02-26 15:36 ` Andrey Ryabinin 2019-02-26 22:08 ` Roman Gushchin 2019-02-25 13:57 ` Vlastimil Babka
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.