* [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
@ 2019-02-22 17:58 Andrey Ryabinin
  2019-02-22 18:56 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrey Ryabinin @ 2019-02-22 17:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Andrey Ryabinin, Johannes Weiner,
	Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman,
	Roman Gushchin, Shakeel Butt

In the presence of more than one memory cgroup in the system, our
reclaim logic is poor. When we hit a memory limit (global or a
limit on a cgroup with subgroups) we reclaim some memory from all
cgroups. This is bad because the cgroup that allocates more often
always wins. E.g. a job that allocates a lot of clean, rarely used
page cache will push other jobs with small but active all-in-memory
working sets out of memory.

To prevent such situations we have memcg controls like low/max,
etc., which are supposed to protect jobs or limit them so they
don't hurt others. But memory cgroups are very hard to configure
right because it requires precise knowledge of the workload, which
may vary during execution. E.g. setting a memory limit means the
job won't be able to use all memory in the system for page cache
even if the rest of the system is idle. Basically, our current
scheme requires configuring every single cgroup in the system.

I think we can do better. The idea proposed by this patch is to
reclaim only inactive pages, and only from cgroups that have a big
(!inactive_list_is_low()) inactive list, and to go back to
shrinking active lists only if all inactive lists are low.

Now, a simple test case to demonstrate the effect of the patch.
A job in one memcg repeatedly compresses one file:

 perf stat -n --repeat 20 gzip -ck sample > /dev/null

while 'dd' runs in parallel in another cgroup, reading from the disk.
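
The exact 'dd' invocation isn't given; something along these lines,
reading a large file or device into the page cache, is assumed:

 dd if=/dev/sdb of=/dev/null bs=1M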

Before:
Performance counter stats for 'gzip -ck sample' (20 runs):
      17.673572290 seconds time elapsed                                          ( +-  5.60% )
After:
Performance counter stats for 'gzip -ck sample' (20 runs):
      11.426193980 seconds time elapsed                                          ( +-  0.20% )

The more often the dd cgroup allocates memory, the more gzip suffers.
With 4 parallel dd instances instead of one:

Before:
Performance counter stats for 'gzip -ck sample' (20 runs):
      499.976782013 seconds time elapsed                                          ( +- 23.13% )
After:
Performance counter stats for 'gzip -ck sample' (20 runs):
      11.307450516 seconds time elapsed                                          ( +-  0.27% )

It would be possible to achieve a similar effect by setting memory.low
on the gzip cgroup, but the best value for memory.low depends on the
size of the 'sample' file. It is also possible to limit the 'dd' job,
but imagine something more sophisticated than plain 'dd': a job that
would benefit from occupying all available memory. The best limit for
such a job would be something like 'total_memory' - 'sample size',
which is again unknown.
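
For reference, such protection would be set via the cgroup2 filesystem;
the cgroup path and the value here are illustrative only:

 echo 200M > /sys/fs/cgroup/gzip-job/memory.low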

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
---
 mm/vmscan.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efd10d6b9510..2f562c3358ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,6 +104,8 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	unsigned int may_shrink_active:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 		scan >>= sc->priority;
 
+		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
+						file, memcg, sc, false))
+			scan = 0;
+
 		/*
 		 * If the cgroup's already been deleted, make sure to
 		 * scrape out the remaining cache.
@@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	bool retry;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		};
 		struct mem_cgroup *memcg;
 
+		retry = false;
+
 		memset(&sc->nr, 0, sizeof(sc->nr));
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
+		if ((sc->nr_scanned - nr_scanned) == 0 &&
+		     !sc->may_shrink_active) {
+			sc->may_shrink_active = 1;
+			retry = true;
+			continue;
+		}
+
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2887,7 +2903,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
 			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
-	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
+	} while (retry || should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
 	/*
-- 
2.19.2



* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin
@ 2019-02-22 18:56 ` Rik van Riel
  2019-02-22 19:15 ` Johannes Weiner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2019-02-22 18:56 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, Mel Gorman, Roman Gushchin, Shakeel Butt


On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic is poor. When we hit a memory limit (global or a
> limit on a cgroup with subgroups) we reclaim some memory from all
> cgroups. This is bad because the cgroup that allocates more often
> always wins. E.g. a job that allocates a lot of clean, rarely used
> page cache will push other jobs with small but active all-in-memory
> working sets out of memory.
>
> To prevent such situations we have memcg controls like low/max,
> etc., which are supposed to protect jobs or limit them so they
> don't hurt others. But memory cgroups are very hard to configure
> right because it requires precise knowledge of the workload, which
> may vary during execution. E.g. setting a memory limit means the
> job won't be able to use all memory in the system for page cache
> even if the rest of the system is idle. Basically, our current
> scheme requires configuring every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim only inactive pages, and only from cgroups that have a big
> (!inactive_list_is_low()) inactive list, and to go back to
> shrinking active lists only if all inactive lists are low.

Your general idea seems like a good one, but
the logic in the code seems a little convoluted
to me.

I wonder if we can simplify things a little, by
checking (when we enter page reclaim) whether
the pgdat has enough inactive pages based on
the node_page_state statistics, and basing our
decision whether or not to scan the active lists
off that.

As it stands, your patch seems like the kind of
code that makes perfect sense today, but which
will confuse people who look at the code two
years from now.

If the code could be made a little more explicit,
great. If there are good reasons to do things in
the fallback way your current patch does it, the
code could use some good comments explaining why :)

-- 
All Rights Reversed.



* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin
  2019-02-22 18:56 ` Rik van Riel
@ 2019-02-22 19:15 ` Johannes Weiner
  2019-02-26 12:50   ` Andrey Ryabinin
  2019-02-25  4:03 ` Roman Gushchin
  2019-02-25 13:57 ` Vlastimil Babka
  3 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2019-02-22 19:15 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt

On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic is poor. When we hit a memory limit (global or a
> limit on a cgroup with subgroups) we reclaim some memory from all
> cgroups. This is bad because the cgroup that allocates more often
> always wins. E.g. a job that allocates a lot of clean, rarely used
> page cache will push other jobs with small but active all-in-memory
> working sets out of memory.
>
> To prevent such situations we have memcg controls like low/max,
> etc., which are supposed to protect jobs or limit them so they
> don't hurt others. But memory cgroups are very hard to configure
> right because it requires precise knowledge of the workload, which
> may vary during execution. E.g. setting a memory limit means the
> job won't be able to use all memory in the system for page cache
> even if the rest of the system is idle. Basically, our current
> scheme requires configuring every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim only inactive pages, and only from cgroups that have a big
> (!inactive_list_is_low()) inactive list, and to go back to
> shrinking active lists only if all inactive lists are low.

Yes, you are absolutely right.

We shouldn't go after active pages as long as there are plenty of
inactive pages around. That's the global reclaim policy, and we
currently fail to translate that well to cgrouped systems.

Setting group protections or limits would work around this problem,
but they're kind of a red herring. We shouldn't ever allow use-once
streams to push out hot workingsets, that's a bug.

> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  
>  		scan >>= sc->priority;
>  
> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> +						file, memcg, sc, false))
> +			scan = 0;
> +
>  		/*
>  		 * If the cgroup's already been deleted, make sure to
>  		 * scrape out the remaining cache.
> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	bool retry;
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		};
>  		struct mem_cgroup *memcg;
>  
> +		retry = false;
> +
>  		memset(&sc->nr, 0, sizeof(sc->nr));
>  
>  		nr_reclaimed = sc->nr_reclaimed;
> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> +		     !sc->may_shrink_active) {
> +			sc->may_shrink_active = 1;
> +			retry = true;
> +			continue;
> +		}

Using !scanned as the gate could be a problem. There might be a cgroup
that has inactive pages on the local level, but when viewed from the
system level the total inactive pages in the system might still be low
compared to active ones. In that case we should go after active pages.

Basically, during global reclaim, the answer for whether active pages
should be scanned or not should be the same regardless of whether the
memory is all global or whether it's spread out between cgroups.

The reason this isn't the case is because we're checking the ratio at
the lruvec level - which is the highest level (and identical to the
node counters) when memory is global, but it's at the lowest level
when memory is cgrouped.

So IMO what we should do is:

- At the beginning of global reclaim, use node_page_state() to compare
  the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
  can go after active pages or not. Regardless of what the ratio is in
  individual lruvecs.

- And likewise at the beginning of cgroup limit reclaim, walk the
  subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
  and ACTIVE_FILE counters, and make inactive_is_low() decision on
  those sums.
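
A rough sketch of both checks (illustrative only: the helper names are
made up, and the ratio math simply mirrors inactive_list_is_low()):

	/* Global reclaim: one node-wide inactive:active decision. */
	static bool node_inactive_file_is_low(struct pglist_data *pgdat)
	{
		unsigned long inactive, active, gb, ratio;

		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
		active = node_page_state(pgdat, NR_ACTIVE_FILE);

		/* Allow a larger inactive:active ratio on bigger nodes,
		 * as inactive_list_is_low() does. */
		gb = (inactive + active) >> (30 - PAGE_SHIFT);
		ratio = gb ? int_sqrt(10 * gb) : 1;
		return inactive * ratio < active;
	}

	/* Cgroup limit reclaim: sum the counters over the subtree. */
	static bool subtree_inactive_file_is_low(struct mem_cgroup *root,
						 struct pglist_data *pgdat)
	{
		unsigned long inactive = 0, active = 0, gb, ratio;
		struct mem_cgroup *memcg;

		for (memcg = mem_cgroup_iter(root, NULL, NULL); memcg;
		     memcg = mem_cgroup_iter(root, memcg, NULL)) {
			struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);

			inactive += lruvec_lru_size(lruvec, LRU_INACTIVE_FILE,
						    MAX_NR_ZONES);
			active += lruvec_lru_size(lruvec, LRU_ACTIVE_FILE,
						  MAX_NR_ZONES);
		}
		gb = (inactive + active) >> (30 - PAGE_SHIFT);
		ratio = gb ? int_sqrt(10 * gb) : 1;
		return inactive * ratio < active;
	}

The result would be computed once at the start of reclaim, cached in
scan_control, and used in place of the per-lruvec check above.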


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin
  2019-02-22 18:56 ` Rik van Riel
  2019-02-22 19:15 ` Johannes Weiner
@ 2019-02-25  4:03 ` Roman Gushchin
  2019-02-26 15:36   ` Andrey Ryabinin
  2019-02-25 13:57 ` Vlastimil Babka
  3 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2019-02-25  4:03 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman,
	Shakeel Butt

On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic is poor. When we hit a memory limit (global or a
> limit on a cgroup with subgroups) we reclaim some memory from all
> cgroups. This is bad because the cgroup that allocates more often
> always wins. E.g. a job that allocates a lot of clean, rarely used
> page cache will push other jobs with small but active all-in-memory
> working sets out of memory.
>
> To prevent such situations we have memcg controls like low/max,
> etc., which are supposed to protect jobs or limit them so they
> don't hurt others. But memory cgroups are very hard to configure
> right because it requires precise knowledge of the workload, which
> may vary during execution. E.g. setting a memory limit means the
> job won't be able to use all memory in the system for page cache
> even if the rest of the system is idle. Basically, our current
> scheme requires configuring every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim only inactive pages, and only from cgroups that have a big
> (!inactive_list_is_low()) inactive list, and to go back to
> shrinking active lists only if all inactive lists are low.

Hi Andrey!

It's definitely an interesting idea! However, let me raise some concerns:
1) What's considered active and inactive depends on memory pressure inside
a cgroup. Actually, active pages in one cgroup (e.g. one just deleted) can
be colder than inactive pages in another (e.g. a memory-hungry cgroup with
a tight memory.max).

Also, a workload inside a cgroup can to some extent control what goes
to the active LRU. So it opens a way to get more memory unfairly by
artificially promoting more pages to the active LRU, giving a cgroup
an unfair advantage over other cgroups.
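
For example, a sketch of such artificial promotion (illustrative
userspace code, not something from this thread):

	#include <fcntl.h>
	#include <unistd.h>

	/* Read a file twice: the second pass hits pages that are
	 * already in the page cache, and mark_page_accessed() then
	 * moves them to the active LRU, even if the data is never
	 * needed again. */
	static void promote_file_pages(const char *path)
	{
		char buf[1 << 16];
		int pass, fd = open(path, O_RDONLY);

		if (fd < 0)
			return;
		for (pass = 0; pass < 2; pass++) {
			lseek(fd, 0, SEEK_SET);
			while (read(fd, buf, sizeof(buf)) > 0)
				;
		}
		close(fd);
	}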

Generally speaking, now we have a way to measure the memory pressure
inside a cgroup. So, in theory, it should be possible to balance
scanning effort based on memory pressure.

Thanks!


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin
                   ` (2 preceding siblings ...)
  2019-02-25  4:03 ` Roman Gushchin
@ 2019-02-25 13:57 ` Vlastimil Babka
  3 siblings, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2019-02-25 13:57 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Rik van Riel, Mel Gorman, Roman Gushchin, Shakeel Butt

On 2/22/19 6:58 PM, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic is poor. When we hit a memory limit (global or a
> limit on a cgroup with subgroups) we reclaim some memory from all
> cgroups. This is bad because the cgroup that allocates more often
> always wins. E.g. a job that allocates a lot of clean, rarely used
> page cache will push other jobs with small but active all-in-memory
> working sets out of memory.
>
> To prevent such situations we have memcg controls like low/max,
> etc., which are supposed to protect jobs or limit them so they
> don't hurt others. But memory cgroups are very hard to configure
> right because it requires precise knowledge of the workload, which
> may vary during execution. E.g. setting a memory limit means the
> job won't be able to use all memory in the system for page cache
> even if the rest of the system is idle. Basically, our current
> scheme requires configuring every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim only inactive pages, and only from cgroups that have a big
> (!inactive_list_is_low()) inactive list, and to go back to
> shrinking active lists only if all inactive lists are low.

Perhaps going this direction could also make page cache side-channel
attacks harder?
Quoting [1]:

"On Linux, we are only able
to evict pages efficiently because we can trick the page re-
placement algorithm into believing our target page would be
the best choice for eviction. The reason for this lies in the
fact that Linux uses a global page replacement algorithm,
i.e., an algorithm which does not distinguish between dif-
ferent processes. Global page replacement algorithms have
been known for decades to allow one process to perform a
denial-of-service on other processes"

[1] https://arxiv.org/abs/1901.01161



* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-22 19:15 ` Johannes Weiner
@ 2019-02-26 12:50   ` Andrey Ryabinin
  2019-03-01 10:38     ` Andrey Ryabinin
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Ryabinin @ 2019-02-26 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt



On 2/22/19 10:15 PM, Johannes Weiner wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system, our
>> reclaim logic is poor. When we hit a memory limit (global or a
>> limit on a cgroup with subgroups) we reclaim some memory from all
>> cgroups. This is bad because the cgroup that allocates more often
>> always wins. E.g. a job that allocates a lot of clean, rarely used
>> page cache will push other jobs with small but active all-in-memory
>> working sets out of memory.
>>
>> To prevent such situations we have memcg controls like low/max,
>> etc., which are supposed to protect jobs or limit them so they
>> don't hurt others. But memory cgroups are very hard to configure
>> right because it requires precise knowledge of the workload, which
>> may vary during execution. E.g. setting a memory limit means the
>> job won't be able to use all memory in the system for page cache
>> even if the rest of the system is idle. Basically, our current
>> scheme requires configuring every single cgroup in the system.
>>
>> I think we can do better. The idea proposed by this patch is to
>> reclaim only inactive pages, and only from cgroups that have a big
>> (!inactive_list_is_low()) inactive list, and to go back to
>> shrinking active lists only if all inactive lists are low.
> 
> Yes, you are absolutely right.
> 
> We shouldn't go after active pages as long as there are plenty of
> inactive pages around. That's the global reclaim policy, and we
> currently fail to translate that well to cgrouped systems.
> 
> Setting group protections or limits would work around this problem,
> but they're kind of a red herring. We shouldn't ever allow use-once
> streams to push out hot workingsets, that's a bug.
> 
>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>  
>>  		scan >>= sc->priority;
>>  
>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>> +						file, memcg, sc, false))
>> +			scan = 0;
>> +
>>  		/*
>>  		 * If the cgroup's already been deleted, make sure to
>>  		 * scrape out the remaining cache.
>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>  	unsigned long nr_reclaimed, nr_scanned;
>>  	bool reclaimable = false;
>> +	bool retry;
>>  
>>  	do {
>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  		};
>>  		struct mem_cgroup *memcg;
>>  
>> +		retry = false;
>> +
>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>  
>>  		nr_reclaimed = sc->nr_reclaimed;
>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  			}
>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>  
>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>> +		     !sc->may_shrink_active) {
>> +			sc->may_shrink_active = 1;
>> +			retry = true;
>> +			continue;
>> +		}
> 
> Using !scanned as the gate could be a problem. There might be a cgroup
> that has inactive pages on the local level, but when viewed from the
> system level the total inactive pages in the system might still be low
> compared to active ones. In that case we should go after active pages.
> 
> Basically, during global reclaim, the answer for whether active pages
> should be scanned or not should be the same regardless of whether the
> memory is all global or whether it's spread out between cgroups.
> 
> The reason this isn't the case is because we're checking the ratio at
> the lruvec level - which is the highest level (and identical to the
> node counters) when memory is global, but it's at the lowest level
> when memory is cgrouped.
> 
> So IMO what we should do is:
> 
> - At the beginning of global reclaim, use node_page_state() to compare
>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>   can go after active pages or not. Regardless of what the ratio is in
>   individual lruvecs.
> 
> - And likewise at the beginning of cgroup limit reclaim, walk the
>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>   those sums.
> 

Sounds reasonable.


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-25  4:03 ` Roman Gushchin
@ 2019-02-26 15:36   ` Andrey Ryabinin
  2019-02-26 22:08     ` Roman Gushchin
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Ryabinin @ 2019-02-26 15:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman,
	Shakeel Butt



On 2/25/19 7:03 AM, Roman Gushchin wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system, our
>> reclaim logic is poor. When we hit a memory limit (global or a
>> limit on a cgroup with subgroups) we reclaim some memory from all
>> cgroups. This is bad because the cgroup that allocates more often
>> always wins. E.g. a job that allocates a lot of clean, rarely used
>> page cache will push other jobs with small but active all-in-memory
>> working sets out of memory.
>>
>> To prevent such situations we have memcg controls like low/max,
>> etc., which are supposed to protect jobs or limit them so they
>> don't hurt others. But memory cgroups are very hard to configure
>> right because it requires precise knowledge of the workload, which
>> may vary during execution. E.g. setting a memory limit means the
>> job won't be able to use all memory in the system for page cache
>> even if the rest of the system is idle. Basically, our current
>> scheme requires configuring every single cgroup in the system.
>>
>> I think we can do better. The idea proposed by this patch is to
>> reclaim only inactive pages, and only from cgroups that have a big
>> (!inactive_list_is_low()) inactive list, and to go back to
>> shrinking active lists only if all inactive lists are low.
> 
> Hi Andrey!
> 
> It's definitely an interesting idea! However, let me raise some concerns:
> 1) What's considered active and inactive depends on memory pressure inside
> a cgroup.

There is no such dependency. High memory pressure may be generated both
by active and inactive pages. We can also have a cgroup creating no
pressure with almost only active (or only inactive) pages.

> Actually, active pages in one cgroup (e.g. one just deleted) can be colder
> than inactive pages in another (e.g. a memory-hungry cgroup with a tight
> memory.max).
> 

Well, yes, this is a drawback of having per-memcg LRUs.

> Also, a workload inside a cgroup can to some extent control what goes
> to the active LRU. So it opens a way to get more memory unfairly by
> artificially promoting more pages to the active LRU, giving a cgroup
> an unfair advantage over other cgroups.
> 

Unfair is usually a negative term, but in this case it very much depends
on the definition of what is "fair".

If fair means putting equal reclaim pressure on all cgroups, then yes,
the patch increases such unfairness, but such unfairness is a good thing.
Obviously it's more valuable to keep an actively used page in memory than
a page that is not used.

> Generally speaking, now we have a way to measure the memory pressure
> inside a cgroup. So, in theory, it should be possible to balance
> scanning effort based on memory pressure.
> 

Simply by design, inactive pages are the first candidates for reclaim.
Any decision that doesn't take inactive pages into account would
probably be wrong.

E.g. cgroup A has an active job loading a big and active working set,
which creates high memory pressure, and cgroup B is idle (no memory
pressure) with a huge unused cache. It's definitely preferable to
reclaim from B rather than from A.


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-26 15:36   ` Andrey Ryabinin
@ 2019-02-26 22:08     ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2019-02-26 22:08 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Michal Hocko, Vlastimil Babka, Rik van Riel, Mel Gorman,
	Shakeel Butt

On Tue, Feb 26, 2019 at 06:36:38PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 2/25/19 7:03 AM, Roman Gushchin wrote:
> > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >> In the presence of more than one memory cgroup in the system, our
> >> reclaim logic is poor. When we hit a memory limit (global or a
> >> limit on a cgroup with subgroups) we reclaim some memory from all
> >> cgroups. This is bad because the cgroup that allocates more often
> >> always wins. E.g. a job that allocates a lot of clean, rarely used
> >> page cache will push other jobs with small but active all-in-memory
> >> working sets out of memory.
> >>
> >> To prevent such situations we have memcg controls like low/max,
> >> etc., which are supposed to protect jobs or limit them so they
> >> don't hurt others. But memory cgroups are very hard to configure
> >> right because it requires precise knowledge of the workload, which
> >> may vary during execution. E.g. setting a memory limit means the
> >> job won't be able to use all memory in the system for page cache
> >> even if the rest of the system is idle. Basically, our current
> >> scheme requires configuring every single cgroup in the system.
> >>
> >> I think we can do better. The idea proposed by this patch is to
> >> reclaim only inactive pages, and only from cgroups that have a big
> >> (!inactive_list_is_low()) inactive list, and to go back to
> >> shrinking active lists only if all inactive lists are low.
> > 
> > Hi Andrey!
> > 
> > It's definitely an interesting idea! However, let me raise some concerns:
> > 1) What's considered active and inactive depends on memory pressure inside
> > a cgroup.
> 
> There is no such dependency. High memory pressure may be generated both
> by active and inactive pages. We can also have a cgroup creating no
> pressure with almost only active (or only inactive) pages.
> 
> > Actually, active pages in one cgroup (e.g. one just deleted) can be colder
> > than inactive pages in another (e.g. a memory-hungry cgroup with a tight
> > memory.max).
> > 
> 
> Well, yes, this is a drawback of having per-memcg LRUs.
> 
> > Also, a workload inside a cgroup can to some extent control what goes
> > to the active LRU. So it opens a way to get more memory unfairly by
> > artificially promoting more pages to the active LRU, giving a cgroup
> > an unfair advantage over other cgroups.
> > 
> 
> Unfair is usually a negative term, but in this case it very much depends
> on the definition of what is "fair".
>
> If fair means putting equal reclaim pressure on all cgroups, then yes,
> the patch increases such unfairness, but such unfairness is a good thing.
> Obviously it's more valuable to keep an actively used page in memory than
> a page that is not used.

I think that fairness is good here.

> 
> > Generally speaking, now we have a way to measure the memory pressure
> > inside a cgroup. So, in theory, it should be possible to balance
> > scanning effort based on memory pressure.
> > 
> 
> Simply by design, inactive pages are the first candidates for reclaim.
> Any decision that doesn't take inactive pages into account would
> probably be wrong.
>
> E.g. cgroup A has an active job loading a big and active working set,
> which creates high memory pressure, and cgroup B is idle (no memory
> pressure) with a huge unused cache. It's definitely preferable to
> reclaim from B rather than from A.
>

For sure, if we're reclaiming hot pages instead of cold ones, it's bad
for overall performance. But the active and inactive LRUs are just an
approximation of what is hot and cold. E.g. if I run "cat some_large_file"
twice in a cgroup, the whole file will reside in the active LRU and be
considered hot, even if nobody ever uses it again.

So it means that depending on the memory usage pattern, some workloads
will benefit from your change, and some will suffer.

Btw, what happens with protected cgroups (with memory.low set)? Will
those still affect global scanning decisions (the active/inactive
ratio) but be exempted from scanning?

Thanks!


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-02-26 12:50   ` Andrey Ryabinin
@ 2019-03-01 10:38     ` Andrey Ryabinin
  2019-03-01 17:49       ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Ryabinin @ 2019-03-01 10:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt



On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> 
> 
> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>> In the presence of more than one memory cgroup in the system, our
>>> reclaim logic is poor. When we hit a memory limit (global or a
>>> limit on a cgroup with subgroups) we reclaim some memory from all
>>> cgroups. This is bad because the cgroup that allocates more often
>>> always wins. E.g. a job that allocates a lot of clean, rarely used
>>> page cache will push other jobs with small but active all-in-memory
>>> working sets out of memory.
>>>
>>> To prevent such situations we have memcg controls like low/max,
>>> etc., which are supposed to protect jobs or limit them so they
>>> don't hurt others. But memory cgroups are very hard to configure
>>> right because it requires precise knowledge of the workload, which
>>> may vary during execution. E.g. setting a memory limit means the
>>> job won't be able to use all memory in the system for page cache
>>> even if the rest of the system is idle. Basically, our current
>>> scheme requires configuring every single cgroup in the system.
>>>
>>> I think we can do better. The idea proposed by this patch is to
>>> reclaim only inactive pages, and only from cgroups that have a big
>>> (!inactive_list_is_low()) inactive list, and to go back to
>>> shrinking active lists only if all inactive lists are low.
>>
>> Yes, you are absolutely right.
>>
>> We shouldn't go after active pages as long as there are plenty of
>> inactive pages around. That's the global reclaim policy, and we
>> currently fail to translate that well to cgrouped systems.
>>
>> Setting group protections or limits would work around this problem,
>> but they're kind of a red herring. We shouldn't ever allow use-once
>> streams to push out hot workingsets, that's a bug.
>>
>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>  
>>>  		scan >>= sc->priority;
>>>  
>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>> +						file, memcg, sc, false))
>>> +			scan = 0;
>>> +
>>>  		/*
>>>  		 * If the cgroup's already been deleted, make sure to
>>>  		 * scrape out the remaining cache.
>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>  	bool reclaimable = false;
>>> +	bool retry;
>>>  
>>>  	do {
>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  		};
>>>  		struct mem_cgroup *memcg;
>>>  
>>> +		retry = false;
>>> +
>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>  
>>>  		nr_reclaimed = sc->nr_reclaimed;
>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  			}
>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>  
>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>> +		     !sc->may_shrink_active) {
>>> +			sc->may_shrink_active = 1;
>>> +			retry = true;
>>> +			continue;
>>> +		}
>>
>> Using !scanned as the gate could be a problem. There might be a cgroup
>> that has inactive pages on the local level, but when viewed from the
>> system level the total inactive pages in the system might still be low
>> compared to active ones. In that case we should go after active pages.
>>
>> Basically, during global reclaim, the answer for whether active pages
>> should be scanned or not should be the same regardless of whether the
>> memory is all global or whether it's spread out between cgroups.
>>
>> The reason this isn't the case is because we're checking the ratio at
>> the lruvec level - which is the highest level (and identical to the
>> node counters) when memory is global, but it's at the lowest level
>> when memory is cgrouped.
>>
>> So IMO what we should do is:
>>
>> - At the beginning of global reclaim, use node_page_state() to compare
>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>>   can go after active pages or not. Regardless of what the ratio is in
>>   individual lruvecs.
>>
>> - And likewise at the beginning of cgroup limit reclaim, walk the
>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>>   those sums.
>>
> 
> Sounds reasonable.
> 

On second thought, it seems better to keep the decision at the LRU level.
There are a couple of reasons for this:

1) Using bare node_page_state() (or sc->target_mem_cgroup's total
[in]active counters) would be wrong, because some cgroups might have
protection set (memory.low) and we must take it into account. Also,
different cgroups have different available swap space/memory.swappiness,
which must be taken into account as well.

So it has to be yet another full memcg-tree iteration.

2) Let's consider a simple case: two cgroups, one with a big 'active'
set of pages, the other allocating one-time-used pages. The total
inactive is low, so checking the inactive ratio at a higher level will
result in reclaiming active pages, while with the check at the LRU
level only inactive pages will be reclaimed.

I've tried to come up with a scenario in which checking the ratio at a
higher level would be better, but failed.



* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-03-01 10:38     ` Andrey Ryabinin
@ 2019-03-01 17:49       ` Johannes Weiner
  2019-03-01 19:46         ` Andrey Ryabinin
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2019-03-01 17:49 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt

Hello Andrey,

On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> > On 2/22/19 10:15 PM, Johannes Weiner wrote:
> >> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >>> In the presence of more than one memory cgroup in the system, our
> >>> reclaim logic is poor. When we hit a memory limit (global or a
> >>> limit on a cgroup with subgroups) we reclaim some memory from all
> >>> cgroups. This is bad because the cgroup that allocates more often
> >>> always wins. E.g. a job that allocates a lot of clean, rarely used
> >>> page cache will push other jobs with small but active all-in-memory
> >>> working sets out of memory.
> >>>
> >>> To prevent such situations we have memcg controls like low/max,
> >>> etc., which are supposed to protect jobs or limit them so they
> >>> don't hurt others. But memory cgroups are very hard to configure
> >>> right because it requires precise knowledge of the workload, which
> >>> may vary during execution. E.g. setting a memory limit means the
> >>> job won't be able to use all memory in the system for page cache
> >>> even if the rest of the system is idle. Basically, our current
> >>> scheme requires configuring every single cgroup in the system.
> >>>
> >>> I think we can do better. The idea proposed by this patch is to
> >>> reclaim only inactive pages, and only from cgroups that have a big
> >>> (!inactive_list_is_low()) inactive list, and to go back to
> >>> shrinking active lists only if all inactive lists are low.
> >>
> >> Yes, you are absolutely right.
> >>
> >> We shouldn't go after active pages as long as there are plenty of
> >> inactive pages around. That's the global reclaim policy, and we
> >> currently fail to translate that well to cgrouped systems.
> >>
> >> Setting group protections or limits would work around this problem,
> >> but they're kind of a red herring. We shouldn't ever allow use-once
> >> streams to push out hot workingsets, that's a bug.
> >>
> >>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >>>  
> >>>  		scan >>= sc->priority;
> >>>  
> >>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> >>> +						file, memcg, sc, false))
> >>> +			scan = 0;
> >>> +
> >>>  		/*
> >>>  		 * If the cgroup's already been deleted, make sure to
> >>>  		 * scrape out the remaining cache.
> >>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >>>  	unsigned long nr_reclaimed, nr_scanned;
> >>>  	bool reclaimable = false;
> >>> +	bool retry;
> >>>  
> >>>  	do {
> >>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> >>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  		};
> >>>  		struct mem_cgroup *memcg;
> >>>  
> >>> +		retry = false;
> >>> +
> >>>  		memset(&sc->nr, 0, sizeof(sc->nr));
> >>>  
> >>>  		nr_reclaimed = sc->nr_reclaimed;
> >>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  			}
> >>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >>>  
> >>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> >>> +		     !sc->may_shrink_active) {
> >>> +			sc->may_shrink_active = 1;
> >>> +			retry = true;
> >>> +			continue;
> >>> +		}
> >>
> >> Using !scanned as the gate could be a problem. There might be a cgroup
> >> that has inactive pages on the local level, but when viewed from the
> >> system level the total inactive pages in the system might still be low
> >> compared to active ones. In that case we should go after active pages.
> >>
> >> Basically, during global reclaim, the answer for whether active pages
> >> should be scanned or not should be the same regardless of whether the
> >> memory is all global or whether it's spread out between cgroups.
> >>
> >> The reason this isn't the case is because we're checking the ratio at
> >> the lruvec level - which is the highest level (and identical to the
> >> node counters) when memory is global, but it's at the lowest level
> >> when memory is cgrouped.
> >>
> >> So IMO what we should do is:
> >>
> >> - At the beginning of global reclaim, use node_page_state() to compare
> >>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
> >>   can go after active pages or not. Regardless of what the ratio is in
> >>   individual lruvecs.
> >>
> >> - And likewise at the beginning of cgroup limit reclaim, walk the
> >>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
> >>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
> >>   those sums.
> >>
> > 
> > Sounds reasonable.
> > 
> 
> On second thought, it seems better to keep the decision at the LRU level.
> There are a couple of reasons for this:
>
> 1) Using bare node_page_state() (or sc->target_mem_cgroup's total
> [in]active counters) would be wrong, because some cgroups might have
> protection set (memory.low) and we must take it into account. Also,
> different cgroups have different available swap space/memory.swappiness,
> which must be taken into account as well.
>
> So it has to be yet another full memcg-tree iteration.

It should be possible to take that into account on the first iteration
and adjust the inactive/active counters in proportion to how much of
the cgroup's total memory is exempt by memory.low or min, right?
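
As a sketch of that adjustment (the names here are assumptions, not
existing kernel APIs; 'elow' stands for the effective memory.low/min
protection and 'usage' for the memcg's total page count):

	/* Count only the unprotected share of this memcg's LRU pages
	 * toward the subtree-wide inactive/active sums. */
	if (usage) {
		unprotected = usage > elow ? usage - elow : 0;
		inactive += inactive_lru * unprotected / usage;
		active += active_lru * unprotected / usage;
	}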

> 2) Let's consider a simple case: two cgroups, one with a big 'active'
> set of pages, the other allocating one-time-used pages. The total
> inactive is low, so checking the inactive ratio at a higher level will
> result in reclaiming active pages, while with the check at the LRU
> level only inactive pages will be reclaimed.

It's the other way around. Let's say you have two cgroups, A and B:

        A:  500M inactive   10G active -> inactive is low
        B:   10G inactive  500M active -> inactive is NOT low
   ----------------------------------------------------------
   global: 10.5G inactive 10.5G active -> inactive is NOT low

Checking locally will scan active pages from A. Checking globally will
not, because there are plenty of use-once pages from B.

So if you check globally, without any protection, A and B compete
evenly during global reclaim. Under the same reclaim pressure, A has
managed to activate most of its pages whereas B has not. That means A
is hotter and B provides the better reclaim candidates.

If you apply this decision locally, on the other hand, you are no
longer aging the groups at the same rate. And then the LRU orders
between groups will no longer be comparable, and you won't be
reclaiming the coldest memory in the system anymore.


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-03-01 17:49       ` Johannes Weiner
@ 2019-03-01 19:46         ` Andrey Ryabinin
  2019-03-01 22:20           ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Ryabinin @ 2019-03-01 19:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt



On 3/1/19 8:49 PM, Johannes Weiner wrote:
> Hello Andrey,
> 
> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>>>> In the presence of more than one memory cgroup in the system, our
>>>>> reclaim logic is poor. When we hit a memory limit (global or a
>>>>> limit on a cgroup with subgroups) we reclaim some memory from all
>>>>> cgroups. This is bad because the cgroup that allocates more often
>>>>> always wins. E.g. a job that allocates a lot of clean, rarely used
>>>>> page cache will push other jobs with small but active all-in-memory
>>>>> working sets out of memory.
>>>>>
>>>>> To prevent such situations we have memcg controls like low/max,
>>>>> etc., which are supposed to protect jobs or limit them so they
>>>>> don't hurt others. But memory cgroups are very hard to configure
>>>>> right because it requires precise knowledge of the workload, which
>>>>> may vary during execution. E.g. setting a memory limit means the
>>>>> job won't be able to use all memory in the system for page cache
>>>>> even if the rest of the system is idle. Basically, our current
>>>>> scheme requires configuring every single cgroup in the system.
>>>>>
>>>>> I think we can do better. The idea proposed by this patch is to
>>>>> reclaim only inactive pages, and only from cgroups that have a big
>>>>> (!inactive_list_is_low()) inactive list, and to go back to
>>>>> shrinking active lists only if all inactive lists are low.
>>>>
>>>> Yes, you are absolutely right.
>>>>
>>>> We shouldn't go after active pages as long as there are plenty of
>>>> inactive pages around. That's the global reclaim policy, and we
>>>> currently fail to translate that well to cgrouped systems.
>>>>
>>>> Setting group protections or limits would work around this problem,
>>>> but they're kind of a red herring. We shouldn't ever allow use-once
>>>> streams to push out hot workingsets, that's a bug.
>>>>
>>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>>>  
>>>>>  		scan >>= sc->priority;
>>>>>  
>>>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>>>> +						file, memcg, sc, false))
>>>>> +			scan = 0;
>>>>> +
>>>>>  		/*
>>>>>  		 * If the cgroup's already been deleted, make sure to
>>>>>  		 * scrape out the remaining cache.
>>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>>>  	bool reclaimable = false;
>>>>> +	bool retry;
>>>>>  
>>>>>  	do {
>>>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>  		};
>>>>>  		struct mem_cgroup *memcg;
>>>>>  
>>>>> +		retry = false;
>>>>> +
>>>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>>>  
>>>>>  		nr_reclaimed = sc->nr_reclaimed;
>>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>  			}
>>>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>>>  
>>>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>>>> +		     !sc->may_shrink_active) {
>>>>> +			sc->may_shrink_active = 1;
>>>>> +			retry = true;
>>>>> +			continue;
>>>>> +		}
>>>>
>>>> Using !scanned as the gate could be a problem. There might be a cgroup
>>>> that has inactive pages on the local level, but when viewed from the
>>>> system level the total inactive pages in the system might still be low
>>>> compared to active ones. In that case we should go after active pages.
>>>>
>>>> Basically, during global reclaim, the answer for whether active pages
>>>> should be scanned or not should be the same regardless of whether the
>>>> memory is all global or whether it's spread out between cgroups.
>>>>
>>>> The reason this isn't the case is because we're checking the ratio at
>>>> the lruvec level - which is the highest level (and identical to the
>>>> node counters) when memory is global, but it's at the lowest level
>>>> when memory is cgrouped.
>>>>
>>>> So IMO what we should do is:
>>>>
>>>> - At the beginning of global reclaim, use node_page_state() to compare
>>>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>>>>   can go after active pages or not. Regardless of what the ratio is in
>>>>   individual lruvecs.
>>>>
>>>> - And likewise at the beginning of cgroup limit reclaim, walk the
>>>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>>>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>>>>   those sums.
>>>>
>>>
>>> Sounds reasonable.
>>>
>>
>> On second thought, it seems better to keep the decision at the LRU level.
>> There are a couple of reasons for this:
>>
>> 1) Using bare node_page_state() (or sc->target_mem_cgroup's total
>> [in]active counters) would be wrong, because some cgroups might have
>> protection set (memory.low) and we must take it into account. Also,
>> different cgroups have different available swap space/memory.swappiness,
>> which must be taken into account as well.
>>
>> So it has to be yet another full memcg-tree iteration.
> 
> It should be possible to take that into account on the first iteration
> and adjust the inactive/active counters in proportion to how much of
> the cgroup's total memory is exempt by memory.low or min, right?
> 

Should be possible, though it adds more complexity to this subtle code.


>> 2) Let's consider a simple case: two cgroups, one with a big 'active'
>> set of pages, the other allocating one-time-used pages. The total
>> inactive is low, so checking the inactive ratio at a higher level will
>> result in reclaiming active pages, while with the check at the LRU
>> level only inactive pages will be reclaimed.
> 
> It's the other way around. Let's say you have two cgroups, A and B:
> 
>         A:  500M inactive   10G active -> inactive is low
>         B:   10G inactive  500M active -> inactive is NOT low
>    ----------------------------------------------------------
>    global: 10.5G inactive 10.5G active -> inactive is NOT low
> 
> Checking locally will scan active pages from A.

No, checking locally will not scan active pages from A. The initial
state is sc->may_shrink_active = 0, so group A will be skipped
completely and we will reclaim from B. Since the overall reclaim was
successful, sc->may_shrink_active remains 0 and A will be protected as
long as B supplies enough inactive pages.

> Checking globally will
> not, because there are plenty of use-once pages from B.
> 

That is correct. So in this example the global vs. local check will not
make a difference.

> So if you check globally, without any protection, A and B compete
> evenly during global reclaim. Under the same reclaim pressure, A has
> managed to activate most of its pages whereas B has not. That means A
> is hotter and B provides the better reclaim candidates.
> 
> If you apply this decision locally, on the other hand, you are no
> longer aging the groups at the same rate. And then the LRU orders
> between groups will no longer be comparable, and you won't be
> reclaiming the coldest memory in the system anymore.
> 

I really don't see how the global check would make any difference in
this example. In both cases we reclaim only from B and don't touch A,
and this is what we actually want.



* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-03-01 19:46         ` Andrey Ryabinin
@ 2019-03-01 22:20           ` Johannes Weiner
  2019-03-04 17:02             ` Andrey Ryabinin
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2019-03-01 22:20 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt

On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote:
> On 3/1/19 8:49 PM, Johannes Weiner wrote:
> > On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
> >> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> >>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
> >>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >>>>> In the presence of more than one memory cgroup in the system, our
> >>>>> reclaim logic is poor. When we hit a memory limit (global or a
> >>>>> limit on a cgroup with subgroups) we reclaim some memory from all
> >>>>> cgroups. This is bad because the cgroup that allocates more often
> >>>>> always wins. E.g. a job that allocates a lot of clean, rarely used
> >>>>> page cache will push other jobs with small but active all-in-memory
> >>>>> working sets out of memory.
> >>>>>
> >>>>> To prevent such situations we have memcg controls like low/max,
> >>>>> etc., which are supposed to protect jobs or limit them so they
> >>>>> don't hurt others. But memory cgroups are very hard to configure
> >>>>> right because it requires precise knowledge of the workload, which
> >>>>> may vary during execution. E.g. setting a memory limit means the
> >>>>> job won't be able to use all memory in the system for page cache
> >>>>> even if the rest of the system is idle. Basically, our current
> >>>>> scheme requires configuring every single cgroup in the system.
> >>>>>
> >>>>> I think we can do better. The idea proposed by this patch is to
> >>>>> reclaim only inactive pages, and only from cgroups that have a big
> >>>>> (!inactive_list_is_low()) inactive list, and to go back to
> >>>>> shrinking active lists only if all inactive lists are low.
> >>>>
> >>>> Yes, you are absolutely right.
> >>>>
> >>>> We shouldn't go after active pages as long as there are plenty of
> >>>> inactive pages around. That's the global reclaim policy, and we
> >>>> currently fail to translate that well to cgrouped systems.
> >>>>
> >>>> Setting group protections or limits would work around this problem,
> >>>> but they're kind of a red herring. We shouldn't ever allow use-once
> >>>> streams to push out hot workingsets, that's a bug.
> >>>>
> >>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >>>>>  
> >>>>>  		scan >>= sc->priority;
> >>>>>  
> >>>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> >>>>> +						file, memcg, sc, false))
> >>>>> +			scan = 0;
> >>>>> +
> >>>>>  		/*
> >>>>>  		 * If the cgroup's already been deleted, make sure to
> >>>>>  		 * scrape out the remaining cache.
> >>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >>>>>  	unsigned long nr_reclaimed, nr_scanned;
> >>>>>  	bool reclaimable = false;
> >>>>> +	bool retry;
> >>>>>  
> >>>>>  	do {
> >>>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> >>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>>>  		};
> >>>>>  		struct mem_cgroup *memcg;
> >>>>>  
> >>>>> +		retry = false;
> >>>>> +
> >>>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
> >>>>>  
> >>>>>  		nr_reclaimed = sc->nr_reclaimed;
> >>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>>>  			}
> >>>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >>>>>  
> >>>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> >>>>> +		     !sc->may_shrink_active) {
> >>>>> +			sc->may_shrink_active = 1;
> >>>>> +			retry = true;
> >>>>> +			continue;
> >>>>> +		}
> >>>>
> >>>> Using !scanned as the gate could be a problem. There might be a cgroup
> >>>> that has inactive pages on the local level, but when viewed from the
> >>>> system level the total inactive pages in the system might still be low
> >>>> compared to active ones. In that case we should go after active pages.
> >>>>
> >>>> Basically, during global reclaim, the answer for whether active pages
> >>>> should be scanned or not should be the same regardless of whether the
> >>>> memory is all global or whether it's spread out between cgroups.
> >>>>
> >>>> The reason this isn't the case is because we're checking the ratio at
> >>>> the lruvec level - which is the highest level (and identical to the
> >>>> node counters) when memory is global, but it's at the lowest level
> >>>> when memory is cgrouped.
> >>>>
> >>>> So IMO what we should do is:
> >>>>
> >>>> - At the beginning of global reclaim, use node_page_state() to compare
> >>>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
> >>>>   can go after active pages or not. Regardless of what the ratio is in
> >>>>   individual lruvecs.
> >>>>
> >>>> - And likewise at the beginning of cgroup limit reclaim, walk the
> >>>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
> >>>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
> >>>>   those sums.
> >>>>
> >>>
> >>> Sounds reasonable.
> >>>
> >>
> >> On second thought, it seems better to keep the decision at the lru level.
> >> There are a couple of reasons for this:
> >>
> >> 1) Using bare node_page_state() (or sc->target_mem_cgroup's total_[in]active counters) would be wrong,
> >> because some cgroups might have protection set (memory.low) and we must take it into account. Also, different
> >> cgroups have different available swap space and memory.swappiness, which must be taken into account as well.
> >>
> >> So it has to be yet another full memcg-tree iteration.
> > 
> > It should be possible to take that into account on the first iteration
> > and adjust the inactive/active counters in proportion to how much of
> > the cgroup's total memory is exempt by memory.low or min, right?
> > 
> 
> Should be possible, though it adds more complexity to this subtle code.
> 
> 
> >> 2) Let's consider a simple case: two cgroups, one with a big 'active' set of pages while the other allocates one-time used pages.
> >> The total inactive is low, so checking the inactive ratio at a higher level will result in reclaiming active pages,
> >> while with the check at the lru level only inactive pages will be reclaimed.
> > 
> > It's the other way around. Let's say you have two cgroups, A and B:
> > 
> >         A:  500M inactive   10G active -> inactive is low
> >         B:   10G inactive  500M active -> inactive is NOT low
> >    ----------------------------------------------------------
> >    global: 10.5G inactive 10.5G active -> inactive is NOT low
> > 
> > Checking locally will scan active pages from A.
> 
> No, checking locally will not scan active pages from A. The initial state is
> sc->may_shrink_active = 0, so the A group will be skipped completely,
> and we will reclaim from B. Since the overall reclaim was successful,
> sc->may_shrink_active remains 0 and A will be protected as long as B
> supplies enough inactive pages.

Oh, this was a misunderstanding. When you wrote "on second thought, it
seems better to keep the decision at the lru level", I assumed you
were arguing for keeping the current code as-is and abandoning your
patch.

But that leaves my questions from above unanswered. Consider the
following situation:

  A: 50M inactive   0 active
  B:   0 inactive 20G active

If the processes in A and B were not cgrouped, these pages would be on
a single LRU and we'd go after B's active pages.

But with your patches, we'd reclaim only A's inactive pages.

What's the justification for that unfairness?

Keep in mind that users also enable the memory controller on cgroups
simply to keep track of how much memory different process groups are
using.  Merely enabling the controller without otherwise configuring
any protections shouldn't change the aging behavior for that memory.

And that USED to be the case. We USED to have a physical global LRU
list that held all the cgroup pages, and then a per-cgroup LRU list
that was only for limit reclaim.

We got rid of the physical global LRU list for cgrouped memory with
the idea that we can emulate the global aging. Hence the round-robin
tree iteration in shrink_node().

We just need to fix the inactive_list_is_low() check to also emulate
the global LRU behavior.

That would fix your problem of scanning active pages when there are
already plenty of inactive pages in the system, but without the risk
of severely overreclaiming the inactive pages of a small group.
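
To make that concrete, here is a rough, untested sketch of the direction
(written against the v5.0-era vmscan code this thread discusses; the name
total_inactive_is_low() is made up, it handles file LRUs only for brevity,
and it deliberately ignores the memory.low/swappiness adjustments raised
earlier in the thread):

static bool total_inactive_is_low(pg_data_t *pgdat, struct scan_control *sc)
{
	unsigned long inactive = 0, active = 0, gb, ratio;

	if (global_reclaim(sc)) {
		/* Global reclaim: the node-wide counters are authoritative. */
		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
		active = node_page_state(pgdat, NR_ACTIVE_FILE);
	} else {
		struct mem_cgroup *memcg = NULL;

		/* Limit reclaim: sum the LRU sizes over the target subtree. */
		while ((memcg = mem_cgroup_iter(sc->target_mem_cgroup,
						memcg, NULL))) {
			struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);

			inactive += lruvec_lru_size(lruvec, LRU_INACTIVE_FILE,
						    sc->reclaim_idx);
			active += lruvec_lru_size(lruvec, LRU_ACTIVE_FILE,
						  sc->reclaim_idx);
		}
	}

	/* Same sqrt heuristic inactive_list_is_low() applies per lruvec. */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	ratio = gb ? int_sqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

shrink_node() would then set sc->may_shrink_active from this once per
round, instead of flipping it only after a whole pass scanned nothing.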


* Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
  2019-03-01 22:20           ` Johannes Weiner
@ 2019-03-04 17:02             ` Andrey Ryabinin
  0 siblings, 0 replies; 13+ messages in thread
From: Andrey Ryabinin @ 2019-03-04 17:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Rik van Riel, Mel Gorman, Roman Gushchin,
	Shakeel Butt



On 3/2/19 1:20 AM, Johannes Weiner wrote:
> On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote:
>> On 3/1/19 8:49 PM, Johannes Weiner wrote:
>>> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
>>>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>>>>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>>>>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>>>>>> In a presence of more than 1 memory cgroup in the system our reclaim
>>>>>>> logic is just suck. When we hit memory limit (global or a limit on
>>>>>>> cgroup with subgroups) we reclaim some memory from all cgroups.
>>>>>>> This is sucks because, the cgroup that allocates more often always wins.
>>>>>>> E.g. job that allocates a lot of clean rarely used page cache will push
>>>>>>> out of memory other jobs with active relatively small all in memory
>>>>>>> working set.
>>>>>>>
>>>>>>> To prevent such situations we have memcg controls like low/max, etc which
>>>>>>> are supposed to protect jobs or limit them so they to not hurt others.
>>>>>>> But memory cgroups are very hard to configure right because it requires
>>>>>>> precise knowledge of the workload which may vary during the execution.
>>>>>>> E.g. setting memory limit means that job won't be able to use all memory
>>>>>>> in the system for page cache even if the rest the system is idle.
>>>>>>> Basically our current scheme requires to configure every single cgroup
>>>>>>> in the system.
>>>>>>>
>>>>>>> I think we can do better. The idea proposed by this patch is to reclaim
>>>>>>> only inactive pages and only from cgroups that have big
>>>>>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
>>>>>>> only if all inactive lists are low.
>>>>>>
>>>>>> Yes, you are absolutely right.
>>>>>>
>>>>>> We shouldn't go after active pages as long as there are plenty of
>>>>>> inactive pages around. That's the global reclaim policy, and we
>>>>>> currently fail to translate that well to cgrouped systems.
>>>>>>
>>>>>> Setting group protections or limits would work around this problem,
>>>>>> but they're kind of a red herring. We shouldn't ever allow use-once
>>>>>> streams to push out hot workingsets, that's a bug.
>>>>>>
>>>>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>>>>>  
>>>>>>>  		scan >>= sc->priority;
>>>>>>>  
>>>>>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>>>>>> +						file, memcg, sc, false))
>>>>>>> +			scan = 0;
>>>>>>> +
>>>>>>>  		/*
>>>>>>>  		 * If the cgroup's already been deleted, make sure to
>>>>>>>  		 * scrape out the remaining cache.
>>>>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>>>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>>>>>  	bool reclaimable = false;
>>>>>>> +	bool retry;
>>>>>>>  
>>>>>>>  	do {
>>>>>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>>>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>>>  		};
>>>>>>>  		struct mem_cgroup *memcg;
>>>>>>>  
>>>>>>> +		retry = false;
>>>>>>> +
>>>>>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>>>>>  
>>>>>>>  		nr_reclaimed = sc->nr_reclaimed;
>>>>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>>>>>  			}
>>>>>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>>>>>  
>>>>>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>>>>>> +		     !sc->may_shrink_active) {
>>>>>>> +			sc->may_shrink_active = 1;
>>>>>>> +			retry = true;
>>>>>>> +			continue;
>>>>>>> +		}
>>>>>>
>>>>>> Using !scanned as the gate could be a problem. There might be a cgroup
>>>>>> that has inactive pages on the local level, but when viewed from the
>>>>>> system level the total inactive pages in the system might still be low
>>>>>> compared to active ones. In that case we should go after active pages.
>>>>>>
>>>>>> Basically, during global reclaim, the answer for whether active pages
>>>>>> should be scanned or not should be the same regardless of whether the
>>>>>> memory is all global or whether it's spread out between cgroups.
>>>>>>
>>>>>> The reason this isn't the case is because we're checking the ratio at
>>>>>> the lruvec level - which is the highest level (and identical to the
>>>>>> node counters) when memory is global, but it's at the lowest level
>>>>>> when memory is cgrouped.
>>>>>>
>>>>>> So IMO what we should do is:
>>>>>>
>>>>>> - At the beginning of global reclaim, use node_page_state() to compare
>>>>>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>>>>>>   can go after active pages or not. Regardless of what the ratio is in
>>>>>>   individual lruvecs.
>>>>>>
>>>>>> - And likewise at the beginning of cgroup limit reclaim, walk the
>>>>>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>>>>>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>>>>>>   those sums.
>>>>>>
>>>>>
>>>>> Sounds reasonable.
>>>>>
>>>>
>>>> On second thought, it seems better to keep the decision at the lru level.
>>>> There are a couple of reasons for this:
>>>>
>>>> 1) Using bare node_page_state() (or sc->target_mem_cgroup's total_[in]active counters) would be wrong,
>>>> because some cgroups might have protection set (memory.low) and we must take it into account. Also, different
>>>> cgroups have different available swap space and memory.swappiness, which must be taken into account as well.
>>>>
>>>> So it has to be yet another full memcg-tree iteration.
>>>
>>> It should be possible to take that into account on the first iteration
>>> and adjust the inactive/active counters in proportion to how much of
>>> the cgroup's total memory is exempt by memory.low or min, right?
>>>
>>
>> Should be possible, though it adds more complexity to this subtle code.
>>
>>
>>>> 2) Let's consider a simple case: two cgroups, one with a big 'active' set of pages while the other allocates one-time used pages.
>>>> The total inactive is low, so checking the inactive ratio at a higher level will result in reclaiming active pages,
>>>> while with the check at the lru level only inactive pages will be reclaimed.
>>>
>>> It's the other way around. Let's say you have two cgroups, A and B:
>>>
>>>         A:  500M inactive   10G active -> inactive is low
>>>         B:   10G inactive  500M active -> inactive is NOT low
>>>    ----------------------------------------------------------
>>>    global: 10.5G inactive 10.5G active -> inactive is NOT low
>>>
>>> Checking locally will scan active pages from A.
>>
>> No, checking locally will not scan active pages from A. The initial state is
>> sc->may_shrink_active = 0, so the A group will be skipped completely,
>> and we will reclaim from B. Since the overall reclaim was successful,
>> sc->may_shrink_active remains 0 and A will be protected as long as B
>> supplies enough inactive pages.
> 
> Oh, this was a misunderstanding. When you wrote "on second thought, it
> seems better to keep the decision at the lru level", I assumed you
> were arguing for keeping the current code as-is and abandoning your
> patch.
> 
> But that leaves my questions from above unanswered. Consider the
> following situation:
> 
>   A: 50M inactive   0 active
>   B:   0 inactive 20G active
> 
> If the processes in A and B were not cgrouped, these pages would be on
> a single LRU and we'd go after B's active pages.
> 
> But with your patches, we'd reclaim only A's inactive pages.
> 

I assume that in the non-cgrouped case we would reclaim mostly from A anyway, because going after B's active
pages only means moving them to the inactive list, where we still have A's pages available for reclaim, and B
gets a chance to reactivate the deactivated pages.
In the cgrouped case, going after B's active pages implies reclaiming them immediately.
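
To spell out the difference (a heavily condensed paraphrase of the v5.0
aging pipeline, not the literal kernel code):

/*
 * shrink_active_list():  demotion only -- referenced executable file
 *                        pages stay on the active list, everything
 *                        else moves to the inactive list; nothing is
 *                        freed at this stage.
 *
 * shrink_inactive_list() -> shrink_page_list():
 *                        a page that was referenced again is rotated
 *                        back to the active list (the "second
 *                        chance"); only cold pages are actually freed.
 */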


> What's the justification for that unfairness?
> 
If it's A that creates pressure by allocating a lot of one-time used pages,
then the global inactive ratio check allows A to grow up to the !inactive_is_low() point
by pushing out B's active pages.
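
To put rough numbers on that, using the sqrt-based ratio from this era's
inactive_list_is_low() (ballpark figures, file pages only):

	gb    = (inactive + active) >> (30 - PAGE_SHIFT) ~= 20   /* ~20G of file pages */
	ratio = int_sqrt(10 * gb) = int_sqrt(200) = 14
	inactive is "low" while inactive * 14 < active

so A's streaming cache could keep displacing B's active pages until it
grows to roughly 20G / 15 ~= 1.3G.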


end of thread

Thread overview: 13+ messages
2019-02-22 17:58 [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim Andrey Ryabinin
2019-02-22 18:56 ` Rik van Riel
2019-02-22 19:15 ` Johannes Weiner
2019-02-26 12:50   ` Andrey Ryabinin
2019-03-01 10:38     ` Andrey Ryabinin
2019-03-01 17:49       ` Johannes Weiner
2019-03-01 19:46         ` Andrey Ryabinin
2019-03-01 22:20           ` Johannes Weiner
2019-03-04 17:02             ` Andrey Ryabinin
2019-02-25  4:03 ` Roman Gushchin
2019-02-26 15:36   ` Andrey Ryabinin
2019-02-26 22:08     ` Roman Gushchin
2019-02-25 13:57 ` Vlastimil Babka
