* [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
@ 2017-02-06 12:24 ` Vinayak Menon
  0 siblings, 0 replies; 32+ messages in thread
From: Vinayak Menon @ 2017-02-06 12:24 UTC (permalink / raw)
  To: akpm, hannes, mgorman, vbabka, mhocko, riel, vdavydov.dev,
	anton.vorontsov, minchan, shashim
  Cc: linux-mm, linux-kernel, Vinayak Menon

During global reclaim, the nr_reclaimed passed to vmpressure includes the
pages reclaimed from slab.  But the corresponding scanned slab pages are
not passed.  This can cause the total reclaimed pages to be greater than
the scanned pages, causing an unsigned underflow in vmpressure and a
critical event being sent to the root cgroup.  Apart from the underflow,
there is also an impact on the vmpressure values themselves.  When moving
from kernel version 3.18 to 4.4, a difference is seen in the vmpressure
values for the same workload, resulting in different behaviour of the
vmpressure consumers.  One such case is a vmpressure-based
lowmemorykiller: the vmpressure events are received late and are fewer in
number, so tasks are not killed at the right time.  The following numbers
show the impact on reclaim activity due to the changed lowmemorykiller
behaviour on a 4GB device.  The test launches a number of apps in
sequence and repeats this multiple times.
                      v4.4           v3.18
pgpgin                163016456      145617236
pgpgout               4366220        4188004
workingset_refault    29857868       26781854
workingset_activate   6293946        5634625
pswpin                1327601        1133912
pswpout               3593842        3229602
pgalloc_dma           99520618       94402970
pgalloc_normal        104046854      98124798
pgfree                203772640      192600737
pgmajfault            2126962        1851836
pgsteal_kswapd_dma    19732899       18039462
pgsteal_kswapd_normal 19945336       17977706
pgsteal_direct_dma    206757         131376
pgsteal_direct_normal 236783         138247
pageoutrun            116622         108370
allocstall            7220           4684
compact_stall         931            856

This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
invoke slab shrinkers from shrink_zone()").
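
To make the underflow described above concrete, here is a minimal
user-space sketch of the arithmetic that vmpressure_calc_level()
performs (the formula is visible in the patch 2/2 diff further down;
the scanned/reclaimed values are made up for illustration):

#include <stdio.h>

int main(void)
{
	/* LRU pages scanned in the window */
	unsigned long scanned = 4;
	/* LRU pages reclaimed plus slab pages freed by shrinkers */
	unsigned long reclaimed = 6;
	unsigned long scale = scanned + reclaimed;	/* 10 */
	unsigned long pressure;

	/*
	 * reclaimed * scale / scanned == 15 > scale, so the unsigned
	 * subtraction wraps around instead of going negative.
	 */
	pressure = scale - (reclaimed * scale / scanned);
	pressure = pressure * 100 / scale;

	/* prints ~1.8e18 on 64-bit, far above the critical level */
	printf("pressure = %lu\n", pressure);
	return 0;
}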

So do not consider reclaimed slab pages for the vmpressure calculation.
The pages reclaimed from slab can be excluded because the freeing of a
page by slab shrinking depends on each slab's object population, making
the cost model (i.e. scan:free) different from that of the LRU.  Also,
not every shrinker accounts for the pages it reclaims.

Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/vmscan.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 947ab6f..8969f8e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2594,16 +2594,23 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				    sc->nr_scanned - nr_scanned,
 				    node_lru_pages);
 
+		/*
+		 * Record the subtree's reclaim efficiency. The reclaimed
+		 * pages from slab are excluded here because the
+		 * corresponding scanned pages are not accounted. Moreover,
+		 * freeing a page by slab shrinking depends on each slab's
+		 * object population, making the cost model (i.e. scan:free)
+		 * different from that of the LRU.
+		 */
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+			   sc->nr_scanned - nr_scanned,
+			   sc->nr_reclaimed - nr_reclaimed);
+
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
 		}
 
-		/* Record the subtree's reclaim efficiency */
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
-			   sc->nr_scanned - nr_scanned,
-			   sc->nr_reclaimed - nr_reclaimed);
-
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 12:24 ` Vinayak Menon
@ 2017-02-06 12:24   ` Vinayak Menon
  -1 siblings, 0 replies; 32+ messages in thread
From: Vinayak Menon @ 2017-02-06 12:24 UTC (permalink / raw)
  To: akpm, hannes, mgorman, vbabka, mhocko, riel, vdavydov.dev,
	anton.vorontsov, minchan, shashim
  Cc: linux-mm, linux-kernel, Vinayak Menon

At the end of a window period, if the reclaimed pages are greater than
the scanned pages, an unsigned underflow can result in a huge pressure
value and thus a critical event.  Reclaimed pages are found to go higher
than scanned because reclaimed slab pages are added to the reclaimed
count in shrink_node without a corresponding increment to the scanned
pages.  Minchan Kim mentioned that this can also happen for a THP page,
where scanned is 1 and reclaimed could be 512.
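
As a sketch of the THP case above, plugging scanned=1 and reclaimed=512
into the existing calculation (the formula as it appears in the diff
below; the annotations are illustrative):

	scale    = scanned + reclaimed;	/* 1 + 512 = 513 */
	pressure = scale - (reclaimed * scale / scanned);
			/* 513 - 262656: wraps on unsigned long */
	pressure = pressure * 100 / scale;	/* still huge => critical */

With the check added by this patch, reclaimed >= scanned takes the early
exit instead and pressure stays 0.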

Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/vmpressure.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 149fdf6..3281b34 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -112,8 +112,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 						    unsigned long reclaimed)
 {
 	unsigned long scale = scanned + reclaimed;
-	unsigned long pressure;
+	unsigned long pressure = 0;
 
+	if (reclaimed >= scanned)
+		goto out;
 	/*
 	 * We calculate the ratio (in percents) of how many pages were
 	 * scanned vs. reclaimed in a given time frame (window). Note that
@@ -124,6 +126,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	pressure = scale - (reclaimed * scale / scanned);
 	pressure = pressure * 100 / scale;
 
+out:
 	pr_debug("%s: %3lu  (s: %lu  r: %lu)\n", __func__, pressure,
 		 scanned, reclaimed);
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 12:24   ` Vinayak Menon
@ 2017-02-06 12:40     ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-06 12:40 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: akpm, hannes, mgorman, vbabka, riel, vdavydov.dev,
	anton.vorontsov, minchan, shashim, linux-mm, linux-kernel

On Mon 06-02-17 17:54:10, Vinayak Menon wrote:
[...]
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 149fdf6..3281b34 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -112,8 +112,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  						    unsigned long reclaimed)
>  {
>  	unsigned long scale = scanned + reclaimed;
> -	unsigned long pressure;
> +	unsigned long pressure = 0;
>  
> +	if (reclaimed >= scanned)
> +		goto out;

This deserves a comment IMHO. Besides that, why shouldn't we normalize
the result already in vmpressure()? Please note that the tree == true
path will aggregate both scanned and reclaimed and that already skews
numbers.
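
A minimal sketch of that normalization at the vmpressure() entry point
(the placement is hypothetical; where the clamp should live is exactly
what the rest of the thread debates):

	/* hypothetical clamp before the counters are aggregated */
	if (reclaimed > scanned)
		reclaimed = scanned;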
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-06 12:24 ` Vinayak Menon
@ 2017-02-06 12:52   ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-06 12:52 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: akpm, hannes, mgorman, vbabka, riel, vdavydov.dev,
	anton.vorontsov, minchan, shashim, linux-mm, linux-kernel

On Mon 06-02-17 17:54:09, Vinayak Menon wrote:
> During global reclaim, the nr_reclaimed passed to vmpressure includes the
> pages reclaimed from slab.  But the corresponding scanned slab pages is
> not passed.  This can cause total reclaimed pages to be greater than
> scanned, causing an unsigned underflow in vmpressure resulting in a
> critical event being sent to root cgroup.

If you switched the ordering then this wouldn't be a problem, right?

> It was also noticed that, apart
> from the underflow, there is an impact to the vmpressure values because of
> this. While moving from kernel version 3.18 to 4.4, a difference is seen
> in the vmpressure values for the same workload resulting in a different
> behaviour of the vmpressure consumer. One such case is of a vmpressure
> based lowmemorykiller. It is observed that the vmpressure events are
> received late and less in number resulting in tasks not being killed at
> the right time. The following numbers show the impact on reclaim activity
> due to the change in behaviour of lowmemorykiller on a 4GB device. The test
> launches a number of apps in sequence and repeats it multiple times.
>                       v4.4           v3.18
> pgpgin                163016456      145617236
> pgpgout               4366220        4188004
> workingset_refault    29857868       26781854
> workingset_activate   6293946        5634625
> pswpin                1327601        1133912
> pswpout               3593842        3229602
> pgalloc_dma           99520618       94402970
> pgalloc_normal        104046854      98124798
> pgfree                203772640      192600737
> pgmajfault            2126962        1851836
> pgsteal_kswapd_dma    19732899       18039462
> pgsteal_kswapd_normal 19945336       17977706
> pgsteal_direct_dma    206757         131376
> pgsteal_direct_normal 236783         138247
> pageoutrun            116622         108370
> allocstall            7220           4684
> compact_stall         931            856

From these numbers it seems that the memory pressure was higher in 4.4.
There are ~5% more allocations in 4.4, while we hit direct reclaim 50%
more often.

But the above doesn't say anything about the number and levels of 
vmpressure events. Without that it is hard to draw any conclusion here.

It would also be more than useful to say how much the slab reclaim
really contributed.

> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
> invoke slab shrinkers from shrink_zone()").

I am not really sure this is a regression, though. Maybe your heuristic
which consumes events is just too fragile?

> So do not consider reclaimed slab pages for vmpressure calculation. The
> reclaimed pages from slab can be excluded because the freeing of a page
> by slab shrinking depends on each slab's object population, making the
> cost model (i.e. scan:free) different from that of LRU.  Also, not every
> shrinker accounts the pages it reclaims.

Yeah, this is really messy and not 100% correct. The reclaim cost model
for slab is completely different from that of the LRU, but the concern
here is that we can trigger higher vmpressure levels even though there
_is_ reclaim progress. This should be at least mentioned in the
changelog so that people know that this aspect has been considered.

> Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
> Cc: Shiraz Hashim <shashim@codeaurora.org>
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> ---
>  mm/vmscan.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 947ab6f..8969f8e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2594,16 +2594,23 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				    sc->nr_scanned - nr_scanned,
>  				    node_lru_pages);
>  
> +		/*
> +		 * Record the subtree's reclaim efficiency. The reclaimed
> +		 * pages from slab is excluded here because the corresponding
> +		 * scanned pages is not accounted. Moreover, freeing a page
> +		 * by slab shrinking depends on each slab's object population,
> +		 * making the cost model (i.e. scan:free) different from that
> +		 * of LRU.
> +		 */
> +		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> +			   sc->nr_scanned - nr_scanned,
> +			   sc->nr_reclaimed - nr_reclaimed);
> +
>  		if (reclaim_state) {
>  			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>  			reclaim_state->reclaimed_slab = 0;
>  		}
>  
> -		/* Record the subtree's reclaim efficiency */
> -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> -			   sc->nr_scanned - nr_scanned,
> -			   sc->nr_reclaimed - nr_reclaimed);
> -
>  		if (sc->nr_reclaimed - nr_reclaimed)
>  			reclaimable = true;
>  
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 12:40     ` Michal Hocko
@ 2017-02-06 13:09       ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-06 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon, Feb 6, 2017 at 6:10 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 17:54:10, Vinayak Menon wrote:
> [...]
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 149fdf6..3281b34 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -112,8 +112,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>                                                   unsigned long reclaimed)
>>  {
>>       unsigned long scale = scanned + reclaimed;
>> -     unsigned long pressure;
>> +     unsigned long pressure = 0;
>>
>> +     if (reclaimed >= scanned)
>> +             goto out;
>
> This deserves a comment IMHO. Besides that, why shouldn't we normalize
> the result already in vmpressure()? Please note that the tree == true
> path will aggregate both scanned and reclaimed and that already skews
> numbers.
Sure, will add a comment.
IIUC, normalizing in vmpressure() means something like this, which you
mentioned in one of your previous emails, right?

+ if (reclaimed > scanned)
+          reclaimed = scanned;

Considering a scan window of 512 pages, and without the above piece of
code, if the first scan is of a THP page:
Scan=1, Reclaimed=512
If the next 511 scans result in 0 reclaimed pages:
total_scan=512, Reclaimed=512 => vmpressure 0

Now with the above piece of code in place:
Scan=1, Reclaimed=1, then
Scan=511, Reclaimed=0
total_scan=512, Reclaimed=1 => critical vmpressure

With the slab issue fixed separately, we need to fix only the
underflow, right?  And if we do it in vmpressure_calc_level, the check
needs to be done only once, at the end of a scan window.

Thanks,
Vinayak

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 13:09       ` vinayak menon
@ 2017-02-06 13:24         ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-06 13:24 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon 06-02-17 18:39:03, vinayak menon wrote:
> On Mon, Feb 6, 2017 at 6:10 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 06-02-17 17:54:10, Vinayak Menon wrote:
> > [...]
> >> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >> index 149fdf6..3281b34 100644
> >> --- a/mm/vmpressure.c
> >> +++ b/mm/vmpressure.c
> >> @@ -112,8 +112,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >>                                                   unsigned long reclaimed)
> >>  {
> >>       unsigned long scale = scanned + reclaimed;
> >> -     unsigned long pressure;
> >> +     unsigned long pressure = 0;
> >>
> >> +     if (reclaimed >= scanned)
> >> +             goto out;
> >
> > This deserves a comment IMHO. Besides that, why shouldn't we normalize
> > the result already in vmpressure()? Please note that the tree == true
> > path will aggregate both scanned and reclaimed and that already skews
> > numbers.
> Sure. Will add a comment.
> IIUC, normalizing in vmpressure() means something like this which you
> mentioned in one
> of your previous emails right ?
> 
> + if (reclaimed > scanned)
> +          reclaimed = scanned;

yes or scanned = reclaimed.

> Considering a scan window of 512 pages and without above piece of
> code, if the first scanning is of a THP page
> Scan=1,Reclaimed=512
> If the next 511 scans results in 0 reclaimed pages
> total_scan=512,Reclaimed=512 => vmpressure 0

I am not sure I understand. What do you mean by next scans? We do not
modify the counters outside of vmpressure. If you mean the next
iteration of shrink_node's loop, then this change shouldn't make a
difference, no?

> 
> Now with the above piece of code in place
> Scan=1,Reclaimed=1, then
> Scan=511, Reclaimed=0
> total_scan=512,Reclaimed=1 => critical vmpressure

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 13:24         ` Michal Hocko
@ 2017-02-06 14:35           ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-06 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon, Feb 6, 2017 at 6:54 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 18:39:03, vinayak menon wrote:
>> On Mon, Feb 6, 2017 at 6:10 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Mon 06-02-17 17:54:10, Vinayak Menon wrote:
>> > [...]
>> >> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> >> index 149fdf6..3281b34 100644
>> >> --- a/mm/vmpressure.c
>> >> +++ b/mm/vmpressure.c
>> >> @@ -112,8 +112,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>> >>                                                   unsigned long reclaimed)
>> >>  {
>> >>       unsigned long scale = scanned + reclaimed;
>> >> -     unsigned long pressure;
>> >> +     unsigned long pressure = 0;
>> >>
>> >> +     if (reclaimed >= scanned)
>> >> +             goto out;
>> >
>> > This deserves a comment IMHO. Besides that, why shouldn't we normalize
>> > the result already in vmpressure()? Please note that the tree == true
>> > path will aggregate both scanned and reclaimed and that already skews
>> > numbers.
>> Sure. Will add a comment.
>> IIUC, normalizing in vmpressure() means something like this which you
>> mentioned in one
>> of your previous emails right ?
>>
>> + if (reclaimed > scanned)
>> +          reclaimed = scanned;
>
> yes or scanned = reclaimed.
>
>> Considering a scan window of 512 pages and without above piece of
>> code, if the first scanning is of a THP page
>> Scan=1,Reclaimed=512
>> If the next 511 scans results in 0 reclaimed pages
>> total_scan=512,Reclaimed=512 => vmpressure 0
>
> I am not sure I understand. What do you mean by next scans? We do not
> modify counters outside of vmpressure? If you mean next iteration of
> shrink_node's loop then this changeshouldn't make a difference, no?
>
By scan I meant the pages scanned by shrink_node_memcg/shrink_list,
which is passed as nr_scanned to vmpressure.  The calculation of
pressure for the tree is done at the end of vmpressure_win, and it is
that calculation which underflows.  With this patch we want only the
underflow to be avoided.  But if we make (reclaimed = scanned) in
vmpressure(), we change the vmpressure value even when there is no
underflow, right?

Rewriting the above e.g. again:
First call to vmpressure with nr_scanned=1 and nr_reclaimed=512 (THP).
Second call to vmpressure with nr_scanned=511 and nr_reclaimed=0.
In the second call vmpr->tree_scanned becomes equal to vmpressure_win,
the work is scheduled, and it will calculate the vmpressure as 0
because tree_reclaimed = 512.

Similarly, if scanned is made equal to reclaimed in vmpressure()
itself, as you had suggested:
First call to vmpressure with nr_scanned=1 and nr_reclaimed=512 (THP).
In vmpressure, we make nr_scanned=1 and nr_reclaimed=1.
Second call to vmpressure with nr_scanned=511 and nr_reclaimed=0.
In the second call vmpr->tree_scanned becomes equal to vmpressure_win,
the work is scheduled, and it will calculate the vmpressure as
critical, because tree_reclaimed = 1.

So it makes a difference, no?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-06 12:52   ` Michal Hocko
@ 2017-02-06 15:10     ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-06 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 17:54:09, Vinayak Menon wrote:
>> During global reclaim, the nr_reclaimed passed to vmpressure includes the
>> pages reclaimed from slab.  But the corresponding scanned slab pages is
>> not passed.  This can cause total reclaimed pages to be greater than
>> scanned, causing an unsigned underflow in vmpressure resulting in a
>> critical event being sent to root cgroup.
>
> If you switched the ordering then this wouldn't be a problem, right?
You mean calling vmpressure first and then shrink_slab?  That can also
be done.  Would completing shrink_slab before vmpressure be of any use
to a userspace task that takes into account both vmpressure and the
reclaimable slab size?
>
>> It was also noticed that, apart
>> from the underflow, there is an impact to the vmpressure values because of
>> this. While moving from kernel version 3.18 to 4.4, a difference is seen
>> in the vmpressure values for the same workload resulting in a different
>> behaviour of the vmpressure consumer. One such case is of a vmpressure
>> based lowmemorykiller. It is observed that the vmpressure events are
>> received late and less in number resulting in tasks not being killed at
>> the right time. The following numbers show the impact on reclaim activity
>> due to the change in behaviour of lowmemorykiller on a 4GB device. The test
>> launches a number of apps in sequence and repeats it multiple times.
>>                       v4.4           v3.18
>> pgpgin                163016456      145617236
>> pgpgout               4366220        4188004
>> workingset_refault    29857868       26781854
>> workingset_activate   6293946        5634625
>> pswpin                1327601        1133912
>> pswpout               3593842        3229602
>> pgalloc_dma           99520618       94402970
>> pgalloc_normal        104046854      98124798
>> pgfree                203772640      192600737
>> pgmajfault            2126962        1851836
>> pgsteal_kswapd_dma    19732899       18039462
>> pgsteal_kswapd_normal 19945336       17977706
>> pgsteal_direct_dma    206757         131376
>> pgsteal_direct_normal 236783         138247
>> pageoutrun            116622         108370
>> allocstall            7220           4684
>> compact_stall         931            856
>
> From this numbers it seems that the memory pressure was higher in 4.4.
> There is ~5% more allocations in 4.4 while we hit the direct reclaim 50%
> more times.
When the fix is applied to 4.4, the above reclaim stats look almost the
same for 4.4 and 3.18.  That means the memory pressure seen is a side
effect of the problem that the patch addresses.

>
> But the above doesn't say anything about the number and levels of
> vmpressure events. Without that it is hard to draw any conclusion here.
>
For this use case, the number of critical vmpressure events received is
around 50% less on the 4.4 base than on 3.18.

> It would be also more than useful to say how much the slab reclaim
> really contributed.
The 70% fewer events are caused by slab reclaim being added to
vmpressure, which is confirmed by running the test with and without the
fix.  But it is hard to say that the effect on the reclaim stats is
caused by this problem, because the lowmemorykiller can be written with
different heuristics to make the reclaim look better.  The issue we see
in the above reclaim stats is entirely due to task kills being delayed.
That is why I did not include the vmstat data in the changelog in the
earlier versions.

>
>> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
>> invoke slab shrinkers from shrink_zone()").
>
> I am not really sure this is a regression, though. Maybe your heuristic
> which consumes events is just too fragile?
>
Yes, it could be.  A different kind of lowmemorykiller may not show
this issue at all.  In my opinion the regression here is the difference
in the vmpressure values, and thus in the vmpressure events, caused by
passing slab reclaimed pages to vmpressure without considering the
scanned pages and the cost model.  So would it be better to drop the
vmstat data from the changelog?

>> So do not consider reclaimed slab pages for vmpressure calculation. The
>> reclaimed pages from slab can be excluded because the freeing of a page
>> by slab shrinking depends on each slab's object population, making the
>> cost model (i.e. scan:free) different from that of LRU.  Also, not every
>> shrinker accounts the pages it reclaims.
>
> Yeah, this is really messy and not 100% correct. The reclaim cost model
> for slab is completely different to the reclaim but the concern here is
> that we can trigger higher vmpressure levels even though there _is_ a
> reclaim progress. This should be at least mentioned in the changelog so
> that people know that this aspect has been considered.
>
Yes I can add that.

>> Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
>> Acked-by: Minchan Kim <minchan@kernel.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
>> Cc: Shiraz Hashim <shashim@codeaurora.org>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>> ---
>>  mm/vmscan.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 947ab6f..8969f8e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2594,16 +2594,23 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>                                   sc->nr_scanned - nr_scanned,
>>                                   node_lru_pages);
>>
>> +             /*
>> +              * Record the subtree's reclaim efficiency. The reclaimed
>> +              * pages from slab is excluded here because the corresponding
>> +              * scanned pages is not accounted. Moreover, freeing a page
>> +              * by slab shrinking depends on each slab's object population,
>> +              * making the cost model (i.e. scan:free) different from that
>> +              * of LRU.
>> +              */
>> +             vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
>> +                        sc->nr_scanned - nr_scanned,
>> +                        sc->nr_reclaimed - nr_reclaimed);
>> +
>>               if (reclaim_state) {
>>                       sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>                       reclaim_state->reclaimed_slab = 0;
>>               }
>>
>> -             /* Record the subtree's reclaim efficiency */
>> -             vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
>> -                        sc->nr_scanned - nr_scanned,
>> -                        sc->nr_reclaimed - nr_reclaimed);
>> -
>>               if (sc->nr_reclaimed - nr_reclaimed)
>>                       reclaimable = true;
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
@ 2017-02-06 15:10     ` vinayak menon
  0 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-06 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 17:54:09, Vinayak Menon wrote:
>> During global reclaim, the nr_reclaimed passed to vmpressure includes the
>> pages reclaimed from slab.  But the corresponding scanned slab pages is
>> not passed.  This can cause total reclaimed pages to be greater than
>> scanned, causing an unsigned underflow in vmpressure resulting in a
>> critical event being sent to root cgroup.
>
> If you switched the ordering then this wouldn't be a problem, right?
You mean calling vmpressure first and then shrink_slab ? That also can be
done. Would completing shrink_slab before vmpressure be of any use to a
userspace task that takes into account both vmpressure and reclaimable
slab size ?
>
>> It was also noticed that, apart
>> from the underflow, there is an impact to the vmpressure values because of
>> this. While moving from kernel version 3.18 to 4.4, a difference is seen
>> in the vmpressure values for the same workload resulting in a different
>> behaviour of the vmpressure consumer. One such case is of a vmpressure
>> based lowmemorykiller. It is observed that the vmpressure events are
>> received late and less in number resulting in tasks not being killed at
>> the right time. The following numbers show the impact on reclaim activity
>> due to the change in behaviour of lowmemorykiller on a 4GB device. The test
>> launches a number of apps in sequence and repeats it multiple times.
>>                       v4.4           v3.18
>> pgpgin                163016456      145617236
>> pgpgout               4366220        4188004
>> workingset_refault    29857868       26781854
>> workingset_activate   6293946        5634625
>> pswpin                1327601        1133912
>> pswpout               3593842        3229602
>> pgalloc_dma           99520618       94402970
>> pgalloc_normal        104046854      98124798
>> pgfree                203772640      192600737
>> pgmajfault            2126962        1851836
>> pgsteal_kswapd_dma    19732899       18039462
>> pgsteal_kswapd_normal 19945336       17977706
>> pgsteal_direct_dma    206757         131376
>> pgsteal_direct_normal 236783         138247
>> pageoutrun            116622         108370
>> allocstall            7220           4684
>> compact_stall         931            856
>
> From this numbers it seems that the memory pressure was higher in 4.4.
> There is ~5% more allocations in 4.4 while we hit the direct reclaim 50%
> more times.
When the fix is applied to 4.4 the above reclaim stat looks almost similar
for 4.4 and 3.18. That means the memory pressure seen is a side effect of the
problem that the patch addresses.

>
> But the above doesn't say anything about the number and levels of
> vmpressure events. Without that it is hard to draw any conclusion here.
>
For this usecase, the number of critical vmpressure events received is around
50% less on 4.4 base than 3.18.

> It would be also more than useful to say how much the slab reclaim
> really contributed.
The 70% less events is caused by slab reclaim being added to vmpressure,
which is confirmed by running the test with and without the fix.
But it is hard to say the effect on reclaim stats is caused by this
problem because,
the lowmemorykiller can be written with different heuristics to make the reclaim
look better. The issue we see in the above reclaim stats is entirely
because of task kills
being delayed. That is the reason why I did not include the vmstat stats in the
changelog in the earlier versions.

>
>> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
>> invoke slab shrinkers from shrink_zone()").
>
> I am not really sure this is a regression, though. Maybe your heuristic
> which consumes events is just too fragile?
>
Yes, it could be. A different kind of lowmemorykiller may not show this
issue at all. In my opinion the regression here is the difference in
vmpressure values, and thus in the vmpressure events, caused by passing
slab reclaimed pages to vmpressure without considering the scanned pages
and the cost model. So would it be better to drop the vmstat data from the
changelog?

>> So do not consider reclaimed slab pages for vmpressure calculation. The
>> reclaimed pages from slab can be excluded because the freeing of a page
>> by slab shrinking depends on each slab's object population, making the
>> cost model (i.e. scan:free) different from that of LRU.  Also, not every
>> shrinker accounts the pages it reclaims.
>
> Yeah, this is really messy and not 100% correct. The reclaim cost model
> for slab is completely different to the reclaim but the concern here is
> that we can trigger higher vmpressure levels even though there _is_ a
> reclaim progress. This should be at least mentioned in the changelog so
> that people know that this aspect has been considered.
>
Yes I can add that.

>> Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
>> Acked-by: Minchan Kim <minchan@kernel.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
>> Cc: Shiraz Hashim <shashim@codeaurora.org>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>> ---
>>  mm/vmscan.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 947ab6f..8969f8e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2594,16 +2594,23 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>                                   sc->nr_scanned - nr_scanned,
>>                                   node_lru_pages);
>>
>> +             /*
>> +              * Record the subtree's reclaim efficiency. The reclaimed
>> +              * pages from slab is excluded here because the corresponding
>> +              * scanned pages is not accounted. Moreover, freeing a page
>> +              * by slab shrinking depends on each slab's object population,
>> +              * making the cost model (i.e. scan:free) different from that
>> +              * of LRU.
>> +              */
>> +             vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
>> +                        sc->nr_scanned - nr_scanned,
>> +                        sc->nr_reclaimed - nr_reclaimed);
>> +
>>               if (reclaim_state) {
>>                       sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>                       reclaim_state->reclaimed_slab = 0;
>>               }
>>
>> -             /* Record the subtree's reclaim efficiency */
>> -             vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
>> -                        sc->nr_scanned - nr_scanned,
>> -                        sc->nr_reclaimed - nr_reclaimed);
>> -
>>               if (sc->nr_reclaimed - nr_reclaimed)
>>                       reclaimable = true;
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>
>
> --
> Michal Hocko
> SUSE Labs
>

* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 14:35           ` vinayak menon
@ 2017-02-06 15:12             ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-06 15:12 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon 06-02-17 20:05:21, vinayak menon wrote:
[...]
> By scan I meant pages scanned by shrink_node_memcg/shrink_list
> which is passed as nr_scanned to vmpressure.  The calculation of
> pressure for tree is done at the end of vmpressure_win and it is
> that calculation which underflows. With this patch we want only the
> underflow to be avoided. But if we make (reclaimed = scanned) in
> vmpressure(), we change the vmpressure value even when there is no
> underflow right ?
>
> Rewriting the above e.g again.  First call to vmpressure with
> nr_scanned=1 and nr_reclaimed=512 (THP) Second call to vmpressure
> with nr_scanned=511 and nr_reclaimed=0 In the second call
> vmpr->tree_scanned becomes equal to vmpressure_win and the work
> is scheduled and it will calculate the vmpressure as 0 because
> tree_reclaimed = 512
>
> Similarly, if scanned is made equal to reclaimed in vmpressure()
> itself as you had suggested, First call to vmpressure with
> nr_scanned=1 and nr_reclaimed=512 (THP) And in vmpressure, we
> make nr_scanned=1 and nr_reclaimed=1 Second call to vmpressure
> with nr_scanned=511 and nr_reclaimed=0 In the second call
> vmpr->tree_scanned becomes equal to vmpressure_win and the work is
> scheduled and it will calculate the vmpressure as critical, because
> tree_reclaimed = 1
> 
> So it makes a difference, no?

OK, I see what you meant. Thanks for the clarification. And you are
right that normalizing nr_reclaimed to nr_scanned is a wrong thing to
do because that just doesn't aggregate the real work done. Normalizing
nr_scanned to nr_reclaimed should be better - or it would be even better
to count the scanned pages properly...
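
To put numbers on this, here is a rough userspace sketch (explicitly not
the kernel code; the simplified pressure formula and the 512-page window
are assumptions for illustration) comparing the raw aggregation and the
two normalizations on your THP example:

#include <stdio.h>

/* simplified form of the vmpressure level computation */
static unsigned long pressure(unsigned long scanned, unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;

	if (reclaimed > scanned)	/* would underflow below */
		return 0;
	return (scale - (reclaimed * scale / scanned)) * 100 / scale;
}

int main(void)
{
	/*
	 * call 1: nr_scanned=1,   nr_reclaimed=512 (one THP)
	 * call 2: nr_scanned=511, nr_reclaimed=0
	 */
	/* raw aggregation: (1+511, 512+0) -> 0%, no event */
	printf("raw:                %lu%%\n", pressure(512, 512));
	/* reclaimed clamped to scanned per call: (1,1)+(511,0) -> 99% */
	printf("reclaimed->scanned: %lu%%\n", pressure(512, 1));
	/* scanned raised to reclaimed per call: (512,512)+(511,0) -> 49% */
	printf("scanned->reclaimed: %lu%%\n", pressure(1023, 512));
	return 0;
}

With these assumed numbers the first normalization pushes the window to
~99% (critical) while the second reports ~49%, which matches the two
outcomes you describe.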

My main concern is that doing this normalization late, on aggregated
numbers, is just weird. We are mixing numbers from parallel reclaimers,
and that might just add more confusion. It is better to do the fixup as
soon as possible, while we still know that it was a THP page that was
scanned and reclaimed.

If we get back to your example, it works as you expect only due to good
luck. Just make your nr_scanned=511 and nr_reclaimed=0 a separate
event and you have your critical event. You have no real control over
when a new event is fired because parallel reclaimers are basically
unpredictable.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-06 15:10     ` vinayak menon
@ 2017-02-07  8:10       ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-07  8:10 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon 06-02-17 20:40:10, vinayak menon wrote:
> On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 06-02-17 17:54:09, Vinayak Menon wrote:
> >> During global reclaim, the nr_reclaimed passed to vmpressure includes the
> >> pages reclaimed from slab.  But the corresponding scanned slab pages is
> >> not passed.  This can cause total reclaimed pages to be greater than
> >> scanned, causing an unsigned underflow in vmpressure resulting in a
> >> critical event being sent to root cgroup.
> >
> > If you switched the ordering then this wouldn't be a problem, right?
>
> You mean calling vmpressure first and then shrink_slab ?

No, I meant doing the scanned vs. reclaimed normalization patch first and
then whatever slab related thing later on. This would have the advantage
that we can rule the underflow issue out and focus only on why the slab
numbers actually matter.

> That also can be done. Would completing shrink_slab before vmpressure
> be of any use to a userspace task that takes into account both
> vmpressure and reclaimable slab size ?

Is this the case in your lowmemory killer implementation? If yes, how
does it actually work?

[...]
> > It would be also more than useful to say how much the slab reclaim
> > really contributed.
>
> The 70% less events is caused by slab reclaim being added to
> vmpressure, which is confirmed by running the test with and without
> the fix.  But it is hard to say the effect on reclaim stats is caused
> by this problem because, the lowmemorykiller can be written with
> different heuristics to make the reclaim look better.

Exactly! And this is why I am still not happy with the current
justification of this patch. It seems to be tuning for a particular
consumer of vmpressure events. Others might depend on less pessimistic
events because we are making some progress after all. Being more
pessimistic can lead to premature OOM or other performance related
decisions, and that is why I am not happy about it.

Btw. could you be more specific about your particular test? What is the
desired/acceptable result?

> The issue we see
> in the above reclaim stats is entirely because of task kills being
> delayed. That is the reason why I did not include the vmstat stats in
> the changelog in the earlier versions.
> 
> >
> >> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
> >> invoke slab shrinkers from shrink_zone()").
> >
> > I am not really sure this is a regression, though. Maybe your heuristic
> > which consumes events is just too fragile?
> >
> Yes it could be. A different kind of lowmemorykiller may not show up
> this issue at all. In my opinion the regression here is the difference
> in vmpressure values and thus the vmpressure events because of passing
> slab reclaimed pages to vmpressure without considering the scanned
> pages and cost model.
> So would it be better to drop the vmstat data from changelog ?

No! The main question is whether being more pessimistic and reporting
higher reclaim levels really makes sense even when there is slab
reclaim progress. This hasn't been explained, and I _really_ do not like
a patch which optimizes for a particular consumer of events.

I understand that the change of behavior is unexpected and that might be
a reason to revert to the original one. But if this is the only
reasonable way to go I would, at least, like to understand what is going
on here. Why can't your lowmemorykiller cope with the workload? Why does
starting to kill sooner (at the time when the slab still reclaims enough
pages to report lower critical events) help to pass your test? Maybe it
is the implementation of the lmk which needs to be changed because it
has some false expectations? Or does the memory reclaim just behave in an
unpredictable manner?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-07  8:10       ` Michal Hocko
@ 2017-02-07 11:09         ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-07 11:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue, Feb 7, 2017 at 1:40 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 20:40:10, vinayak menon wrote:
>> On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Mon 06-02-17 17:54:09, Vinayak Menon wrote:
>> >> During global reclaim, the nr_reclaimed passed to vmpressure includes the
>> >> pages reclaimed from slab.  But the corresponding scanned slab pages is
>> >> not passed.  This can cause total reclaimed pages to be greater than
>> >> scanned, causing an unsigned underflow in vmpressure resulting in a
>> >> critical event being sent to root cgroup.
>> >
>> > If you switched the ordering then this wouldn't be a problem, right?
>>
>> You mean calling vmpressure first and then shrink_slab ?
>
> No, I meant the scanned vs. reclaim normalization patch first and than do
> whatever slab related thing later on. This would have an advantage that
> we can rule the underflow issue out and only focus on why the slab
> numbers actually matter.
>
Yes, makes sense. Will reorder the patches and remove the underflow comments
from this changelog.

>> That also can be done. Would completing shrink_slab before vmpressure
>> be of any use to a userspace task that takes into account both
>> vmpressure and reclaimable slab size ?
>
> Is this the case in your lowmemmory killer implementation? If yes how
> does it actually work?
>
No. When you mentioned reordering, I thought you meant reordering the
shrink_slab and vmpressure calls, so I was just wondering whether that
would have a visible side effect for clients that consider both
vmpressure and the reclaimable slab size.

> [...]
>> > It would be also more than useful to say how much the slab reclaim
>> > really contributed.
>>
>> The 70% less events is caused by slab reclaim being added to
>> vmpressure, which is confirmed by running the test with and without
>> the fix.  But it is hard to say the effect on reclaim stats is caused
>> by this problem because, the lowmemorykiller can be written with
>> different heuristics to make the reclaim look better.
>
> Exactly! And this is why I am not still happy with the current
> justification of this patch. It seems to be tuning for a particular
> consumer of vmpressure events. Others might depend on a less pessimistic
> events because we are making some progress afterall. Being more
> pessimistic can lead to premature oom or other performance related
> decisions and that is why I am not happy about that.
>
> Btw. could you be more specific about your particular test? What is
> desired/acceptable result?
The test opens multiple applications on Android in a sequence and then
repeats this N times. The time taken to launch each application is
measured. A deviation in launch latencies is seen between runs with and
without the patch; the difference is caused by the smaller number of
kills (because of the vmpressure difference).

>
>> The issue we see
>> in the above reclaim stats is entirely because of task kills being
>> delayed. That is the reason why I did not include the vmstat stats in
>> the changelog in the earlier versions.
>>
>> >
>> >> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
>> >> invoke slab shrinkers from shrink_zone()").
>> >
>> > I am not really sure this is a regression, though. Maybe your heuristic
>> > which consumes events is just too fragile?
>> >
>> Yes it could be. A different kind of lowmemorykiller may not show up
>> this issue at all. In my opinion the regression here is the difference
>> in vmpressure values and thus the vmpressure events because of passing
>> slab reclaimed pages to vmpressure without considering the scanned
>> pages and cost model.
>> So would it be better to drop the vmstat data from changelog ?
>
> No! The main question is whether being more pessimistic and report
> higher reclaim levels really does make sense even when there is a slab
> reclaim progress. This hasn't been explained and I _really_ do not like
> a patch which optimizes for a particular consumer of events.
>
> I understand that the change of the behavior is unexpeted and that
> might be reason to revert to the original one. But if this is the only
> reasonable way to go I would, at least, like to understand what is going
> on here. Why cannot your lowmemorykiller cope with the workload? Why
> starting to kill sooner (at the time when the slab still reclaims enough
> pages to report lower critical events) helps to pass your test. Maybe it
> is the implementation of the lmk which needs to be changed because it
> has some false expectations? Or the memory reclaim just behaves in an
> unpredictable manner?
> --
Say if 4.4 had actually implemented a page-based shrinking model for
slab and passed the correct scanned and reclaimed counts to vmpressure,
taking the cost model into account, then it would all be fine, and any
behavior difference shown by a vmpressure client would need to be fixed
on the client side. But as I understand it, the case here is different.
vmpressure was implemented to work with scanned and reclaimed pages
from the LRUs, and it works well for at least some use cases. As you
pointed out earlier, there could be problems with the way vmpressure
works since it does not consider many other costs, but it does show an
estimate of the pressure on the LRUs. I think adding just the slab
reclaimed pages to nr_reclaimed without considering the cost is
arbitrary, and it disturbs the LRU pressure which vmpressure reports
properly. So shouldn't we account slab reclaim in vmpressure only when
we have a proper way to do it? By adding slab reclaimed pages, we are
telling vmpressure that X pages were reclaimed with zero effort. With
this patch, vmpressure will show an estimate of the pressure on the
LRUs, restoring its original behavior. If we later add the slab cost,
vmpressure can become more accurate. But just adding slab reclaimed
pages is arbitrary, right? Consider a case where we start to account
reclaimed pages from other shrinkers which do not report their
reclaimed values right now, like zsmalloc or the android
lowmemorykiller. Then the nr_reclaimed sent to vmpressure will just be
bloated and will make vmpressure useless, right? Most of the time
vmpressure will receive reclaimed greater than scanned and won't report
any critical events (see the sketch below). The problem we are
encountering now with slab reclaim is a subset of that case, right?
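
A rough illustration with invented numbers (this is not any existing
code, just arithmetic on a hypothetical vmpressure window):

#include <stdio.h>

int main(void)
{
	/* hypothetical window: the LRUs work hard for little gain */
	unsigned long lru_scanned = 500, lru_reclaimed = 50;
	/* pages reported freed by shrinkers, with no scan cost attached */
	unsigned long shrinker_reclaimed = 600;

	printf("LRU efficiency: %lu%%\n",
	       lru_reclaimed * 100 / lru_scanned);	/* 10% */
	printf("vmpressure sees: reclaimed=%lu scanned=%lu\n",
	       lru_reclaimed + shrinker_reclaimed, lru_scanned);
	/*
	 * reclaimed (650) > scanned (500): no pressure is reported,
	 * even though the LRUs freed only 10% of what they scanned.
	 */
	return 0;
}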

Starting to kill at the right time helps in recovering memory at a
faster rate than waiting for the reclaim to complete. Yes, we may be
able to modify the lowmemorykiller to cope with this problem, but the
actual problem this patch tried to fix was the vmpressure event
regression.


* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-06 15:12             ` Michal Hocko
@ 2017-02-07 11:17               ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-07 11:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Mon, Feb 6, 2017 at 8:42 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 06-02-17 20:05:21, vinayak menon wrote:
> [...]
>> By scan I meant pages scanned by shrink_node_memcg/shrink_list
>> which is passed as nr_scanned to vmpressure.  The calculation of
>> pressure for tree is done at the end of vmpressure_win and it is
>> that calculation which underflows. With this patch we want only the
>> underflow to be avoided. But if we make (reclaimed = scanned) in
>> vmpressure(), we change the vmpressure value even when there is no
>> underflow right ?
>>
>> Rewriting the above e.g again.  First call to vmpressure with
>> nr_scanned=1 and nr_reclaimed=512 (THP) Second call to vmpressure
>> with nr_scanned=511 and nr_reclaimed=0 In the second call
>> vmpr->tree_scanned becomes equal to vmpressure_win and the work
>> is scheduled and it will calculate the vmpressure as 0 because
>> tree_reclaimed = 512
>>
>> Similarly, if scanned is made equal to reclaimed in vmpressure()
>> itself as you had suggested, First call to vmpressure with
>> nr_scanned=1 and nr_reclaimed=512 (THP) And in vmpressure, we
>> make nr_scanned=1 and nr_reclaimed=1 Second call to vmpressure
>> with nr_scanned=511 and nr_reclaimed=0 In the second call
>> vmpr->tree_scanned becomes equal to vmpressure_win and the work is
>> scheduled and it will calculate the vmpressure as critical, because
>> tree_reclaimed = 1
>>
>> So it makes a difference, no?
>
> OK, I see what you meant. Thanks for the clarification. And you are
> right that normalizing nr_reclaimed to nr_scanned is a wrong thing to
> do because that just doesn't aggregate the real work done. Normalizing
> nr_scanned to nr_reclaimed should be better - or it would be even better
> to count the scanned pages properly...
>
With the slab reclaim issue fixed separately, only the THP case exists
AFAIK. In the case of THP, as I understand from one of Minchan's
replies, the scan count is actually 1, i.e. only a single huge page is
scanned to get 512 reclaimed pages, so the cost involved was scanning a
single page. In that case, there is no need to normalize the
nr_scanned, no?
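
For reference, a minimal sketch (not the kernel code; vmpressure_win is
assumed to be SWAP_CLUSTER_MAX * 16 = 512 pages) of the window
accounting that makes the two calls above behave as described:

#include <stdio.h>

#define VMPRESSURE_WIN	512UL	/* assumed window size */

static unsigned long tree_scanned, tree_reclaimed;

static void vmpressure_acc(unsigned long scanned, unsigned long reclaimed)
{
	tree_scanned += scanned;
	tree_reclaimed += reclaimed;

	if (tree_scanned < VMPRESSURE_WIN)
		return;

	/*
	 * The deferred work computes the level from these aggregates;
	 * an unnormalized THP call can leave reclaimed > scanned here.
	 */
	printf("window: scanned=%lu reclaimed=%lu\n",
	       tree_scanned, tree_reclaimed);
	tree_scanned = tree_reclaimed = 0;
}

int main(void)
{
	vmpressure_acc(1, 512);		/* one THP */
	vmpressure_acc(511, 0);		/* fills the window */
	return 0;
}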

> My main concern of doing this normalization late on aggregated numbers
> is just weird. We are mixing numbers from parallel reclaimers and that
> might just add more confusion. It is better to do the fixup as soon as
> possible when we still have at least an idea that this was a THP page
> scanned and reclaimed.
>
> If we get back to your example it works as you expect just due to good
> luck. Just make your nr_scanned=511 and nr_reclaimed=0 be a separate
> event and you have your critical event. You have no real control over
> when a new event is fired because parallel reclaimers are basically
> unpredictable.
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow
  2017-02-07 11:17               ` vinayak menon
@ 2017-02-07 12:09                 ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-07 12:09 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue 07-02-17 16:47:18, vinayak menon wrote:
> On Mon, Feb 6, 2017 at 8:42 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 06-02-17 20:05:21, vinayak menon wrote:
> > [...]
> >> By scan I meant pages scanned by shrink_node_memcg/shrink_list
> >> which is passed as nr_scanned to vmpressure.  The calculation of
> >> pressure for tree is done at the end of vmpressure_win and it is
> >> that calculation which underflows. With this patch we want only the
> >> underflow to be avoided. But if we make (reclaimed = scanned) in
> >> vmpressure(), we change the vmpressure value even when there is no
> >> underflow right ?
> >>
> >> Rewriting the above e.g again.  First call to vmpressure with
> >> nr_scanned=1 and nr_reclaimed=512 (THP) Second call to vmpressure
> >> with nr_scanned=511 and nr_reclaimed=0 In the second call
> >> vmpr->tree_scanned becomes equal to vmpressure_win and the work
> >> is scheduled and it will calculate the vmpressure as 0 because
> >> tree_reclaimed = 512
> >>
> >> Similarly, if scanned is made equal to reclaimed in vmpressure()
> >> itself as you had suggested, First call to vmpressure with
> >> nr_scanned=1 and nr_reclaimed=512 (THP) And in vmpressure, we
> >> make nr_scanned=1 and nr_reclaimed=1 Second call to vmpressure
> >> with nr_scanned=511 and nr_reclaimed=0 In the second call
> >> vmpr->tree_scanned becomes equal to vmpressure_win and the work is
> >> scheduled and it will calculate the vmpressure as critical, because
> >> tree_reclaimed = 1
> >>
> >> So it makes a difference, no?
> >
> > OK, I see what you meant. Thanks for the clarification. And you are
> > right that normalizing nr_reclaimed to nr_scanned is a wrong thing to
> > do because that just doesn't aggregate the real work done. Normalizing
> > nr_scanned to nr_reclaimed should be better - or it would be even better
> > to count the scanned pages properly...
> >
> With the slab reclaimed issue fixed separately, only the THP case exists AFAIK.
> In the case of THP, as I understand from one of Minchan's reply, the scan is
> actually 1. i.e. Only a single huge page is scanned to get 512 reclaimed pages.
> So the cost involved was scanning a single page.
> In that case, there is no need to normalize the nr_scanned, no?

Strictly speaking it is not, but it has weird side effects when we
basically lie about vmpressure_win.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-07 11:09         ` vinayak menon
@ 2017-02-07 12:17           ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-07 12:17 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue 07-02-17 16:39:15, vinayak menon wrote:
> On Tue, Feb 7, 2017 at 1:40 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 06-02-17 20:40:10, vinayak menon wrote:
> >> On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
[...]
> >> > It would be also more than useful to say how much the slab reclaim
> >> > really contributed.
> >>
> >> The 70% less events is caused by slab reclaim being added to
> >> vmpressure, which is confirmed by running the test with and without
> >> the fix.  But it is hard to say the effect on reclaim stats is caused
> >> by this problem because, the lowmemorykiller can be written with
> >> different heuristics to make the reclaim look better.
> >
> > Exactly! And this is why I am not still happy with the current
> > justification of this patch. It seems to be tuning for a particular
> > consumer of vmpressure events. Others might depend on a less pessimistic
> > events because we are making some progress afterall. Being more
> > pessimistic can lead to premature oom or other performance related
> > decisions and that is why I am not happy about that.
> >
> > Btw. could you be more specific about your particular test? What is
> > desired/acceptable result?
>
> The test opens multiple applications on android in a sequence and
> then repeats this for N times. Time taken to launch the application
> is measured. With and without the patch the deviation is seen in the
> launch latencies. The launch latency diff is caused by the lesser
> number of kills (because of vmpressure difference).

So this is basically an lmk throughput test. Is it representative
enough to make any decisions?

> >> The issue we see
> >> in the above reclaim stats is entirely because of task kills being
> >> delayed. That is the reason why I did not include the vmstat stats in
> >> the changelog in the earlier versions.
> >>
> >> >
> >> >> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
> >> >> invoke slab shrinkers from shrink_zone()").
> >> >
> >> > I am not really sure this is a regression, though. Maybe your heuristic
> >> > which consumes events is just too fragile?
> >> >
> >> Yes it could be. A different kind of lowmemorykiller may not show up
> >> this issue at all. In my opinion the regression here is the difference
> >> in vmpressure values and thus the vmpressure events because of passing
> >> slab reclaimed pages to vmpressure without considering the scanned
> >> pages and cost model.
> >> So would it be better to drop the vmstat data from changelog ?
> >
> > No! The main question is whether being more pessimistic and report
> > higher reclaim levels really does make sense even when there is a slab
> > reclaim progress. This hasn't been explained and I _really_ do not like
> > a patch which optimizes for a particular consumer of events.
> >
> > I understand that the change of the behavior is unexpeted and that
> > might be reason to revert to the original one. But if this is the only
> > reasonable way to go I would, at least, like to understand what is going
> > on here. Why cannot your lowmemorykiller cope with the workload? Why
> > starting to kill sooner (at the time when the slab still reclaims enough
> > pages to report lower critical events) helps to pass your test. Maybe it
> > is the implementation of the lmk which needs to be changed because it
> > has some false expectations? Or the memory reclaim just behaves in an
> > unpredictable manner?
>
> Say if 4.4 had actually implemented page based shrinking model for
> slab and included the correct scanned and reclaimed to vmpressure
> considering the cost model, then it is all fine and behavior
> difference if any shown by a vmpressure client need to be fixed. But
> as I understand, the case here is different.

> vmpressure was implemented to work with scanned and reclaimed pages
> from LRU and it works
> well for at least some use cases.

Userspace shouldn't care about the specific implementation at all. We
should be able to change the implementation without anybody actually
noticing.

> As you had pointed out earlier there could be problems with the way
> vmpressure works since it is not considering many other costs. But
> it shows an estimate of the pressure on LRUs. I think adding just
> the slab reclaimed to nr_reclaimed without considering the cost is
> arbitrary and it disturbs the LRU pressure which vmpressure reports
> properly.

Well, it is not completely arbitrary: slabs are scanned proportionally
to the LRU scanning.
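
Roughly, the 4.4-era do_shrink_slab() computes the slab scan target
proportionally to the LRU scan, along these lines (a condensed sketch
under my reading of mm/vmscan.c, not the verbatim kernel code; names
simplified):

static unsigned long long slab_scan_target(unsigned long long freeable_objects,
					   unsigned long lru_scanned,
					   unsigned long lru_eligible,
					   int seeks)
{
	unsigned long long delta;

	/* scale by the cost of recreating an object (shrinker->seeks) */
	delta = 4ULL * lru_scanned / seeks;
	/* take a proportional share of the freeable slab objects */
	delta *= freeable_objects;
	delta /= lru_eligible + 1;

	return delta;
}

So scanning a given fraction of the eligible LRU pages asks a shrinker
to scan a comparable fraction of its freeable objects.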

> So shouldn't we account slab reclaimed in vmpressure only when we
> have a proper way to do it ? By adding slab reclaimed pages, we are
> saying vmpressure that X pages were reclaimed with 0 effort. With
> this patch the vmpressure will show an estimate of pressure on LRU
> and restores the original behavior of vmpressure. If we add in
> future the slab cost, vmpressure can become more accurate. But just
> adding slab reclaimed is arbitrary right ? Consider a case where we
> start to account reclaimed pages from other shrinkers which are not
> reporting their reclaimed values right now.  Like zsmalloc, android
> lowmemorykiller etc. Then nr_reclaimed sent to vmpressure will just
> be bloated and will make vmpressure useless right ? And most of the
> time vmpressure will receive reclaimed greater than scanned and won't
> be reporting any critical events. The problem we are encountering now
> with slab reclaimed is a subset of the case above right ?

The main point here is whether we really _should_ emit critical events
when we actually _reclaim_ pages. This is something I haven't heard an
answer to.

> Starting to kill at the right time helps in recovering memory at a
> faster rate than waiting for the reclaim to complete. Yes, we may
> be able to modify lowmemorykiller to cope with this problem. But
> the actual problem this patch tried to fix was the vmpressure event
> regression.

I am not happy about the regression but you should try to understand
that we might end up with another report a month later for a different
consumer of events.

I believe that vmpressure needs some serious rethinking, to come up
with a more realistic and stable metric.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
@ 2017-02-07 12:17           ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-07 12:17 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue 07-02-17 16:39:15, vinayak menon wrote:
> On Tue, Feb 7, 2017 at 1:40 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 06-02-17 20:40:10, vinayak menon wrote:
> >> On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
[...]
> >> > It would be also more than useful to say how much the slab reclaim
> >> > really contributed.
> >>
> >> The 70% less events is caused by slab reclaim being added to
> >> vmpressure, which is confirmed by running the test with and without
> >> the fix.  But it is hard to say the effect on reclaim stats is caused
> >> by this problem because, the lowmemorykiller can be written with
> >> different heuristics to make the reclaim look better.
> >
> > Exactly! And this is why I am not still happy with the current
> > justification of this patch. It seems to be tuning for a particular
> > consumer of vmpressure events. Others might depend on a less pessimistic
> > events because we are making some progress afterall. Being more
> > pessimistic can lead to premature oom or other performance related
> > decisions and that is why I am not happy about that.
> >
> > Btw. could you be more specific about your particular test? What is
> > desired/acceptable result?
>
> The test opens multiple applications on android in a sequence and
> then repeats this for N times. Time taken to launch the application
> is measured. With and without the patch the deviation is seen in the
> launch latencies. The launch latency diff is caused by the lesser
> number of kills (because of vmpressure difference).

So this is basically lmk throughput test. Is this representative enough
to make any decisions?

> >> The issue we see
> >> in the above reclaim stats is entirely because of task kills being
> >> delayed. That is the reason why I did not include the vmstat stats in
> >> the changelog in the earlier versions.
> >>
> >> >
> >> >> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
> >> >> invoke slab shrinkers from shrink_zone()").
> >> >
> >> > I am not really sure this is a regression, though. Maybe your heuristic
> >> > which consumes events is just too fragile?
> >> >
> >> Yes it could be. A different kind of lowmemorykiller may not show up
> >> this issue at all. In my opinion the regression here is the difference
> >> in vmpressure values and thus the vmpressure events because of passing
> >> slab reclaimed pages to vmpressure without considering the scanned
> >> pages and cost model.
> >> So would it be better to drop the vmstat data from changelog ?
> >
> > No! The main question is whether being more pessimistic and report
> > higher reclaim levels really does make sense even when there is a slab
> > reclaim progress. This hasn't been explained and I _really_ do not like
> > a patch which optimizes for a particular consumer of events.
> >
> > I understand that the change of the behavior is unexpeted and that
> > might be reason to revert to the original one. But if this is the only
> > reasonable way to go I would, at least, like to understand what is going
> > on here. Why cannot your lowmemorykiller cope with the workload? Why
> > starting to kill sooner (at the time when the slab still reclaims enough
> > pages to report lower critical events) helps to pass your test. Maybe it
> > is the implementation of the lmk which needs to be changed because it
> > has some false expectations? Or the memory reclaim just behaves in an
> > unpredictable manner?
>
> Say if 4.4 had actually implemented page based shrinking model for
> slab and included the correct scanned and reclaimed to vmpressure
> considering the cost model, then it is all fine and behavior
> difference if any shown by a vmpressure client need to be fixed. But
> as I understand, the case here is different.

> vmpressure was implemented to work with scanned and reclaimed pages
> from LRU and it works
> well for at least some use cases.

Userspace shouldn't care about the specific implementation at all. We
should be able to change the implementation without anybody noticing
actually.

> As you pointed out earlier, there could be problems with the way
> vmpressure works since it does not consider many other costs. But
> it shows an estimate of the pressure on the LRUs. I think adding just
> the slab reclaimed count to nr_reclaimed without considering the cost
> is arbitrary, and it disturbs the LRU pressure which vmpressure
> reports properly.

Well, it is not completely arbitrary. Slabs are scanned in proportion to
the LRU scanning.
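
For reference, the proportionality comes from the slab scan sizing in
do_shrink_slab() (mm/vmscan.c, after commit 6b4f7799c6a5). A simplified
standalone sketch of that sizing; the helper and its name are
illustrative, not the kernel function itself:

static unsigned long slab_scan_target(unsigned long freeable,    /* objects in the shrinker */
				      unsigned long nr_scanned,  /* LRU pages scanned */
				      unsigned long nr_eligible, /* LRU pages eligible */
				      int seeks)                 /* shrinker->seeks */
{
	unsigned long delta;

	/* Size the slab scan window in proportion to the LRU scan ratio. */
	delta = (4 * nr_scanned) / seeks;
	delta *= freeable;
	delta /= (nr_eligible + 1);

	return delta;	/* slab objects the shrinker is asked to scan */
}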

> So shouldn't we account slab reclaimed pages in vmpressure only when
> we have a proper way to do it? By adding slab reclaimed pages, we are
> telling vmpressure that X pages were reclaimed with zero effort. With
> this patch, vmpressure will show an estimate of the pressure on the
> LRUs, restoring its original behavior. If we add the slab cost in the
> future, vmpressure can become more accurate. But just adding slab
> reclaimed is arbitrary, right? Consider a case where we start to
> account reclaimed pages from other shrinkers which are not reporting
> their reclaimed values right now, like zsmalloc, the android
> lowmemorykiller, etc. Then the nr_reclaimed sent to vmpressure will
> just be bloated and will make vmpressure useless, right? And most of
> the time vmpressure will receive reclaimed greater than scanned and
> won't report any critical events. The problem we are encountering now
> with slab reclaimed is a subset of the case above, right?

The main point here is whether we really _should_ emit critical events
when we actually _reclaim_ pages. This is something I haven't heard an
answer to.

> Starting to kill at the right time helps recover memory at a faster
> rate than waiting for the reclaim to complete. Yes, we may be able
> to modify the lowmemorykiller to cope with this problem. But the
> actual problem this patch tried to fix was the vmpressure event
> regression.

I am not happy about the regression, but you should try to understand
that we might end up with another report a month later from a different
consumer of events.

I believe that vmpressure needs some serious rethinking to come up with
a more realistic and stable metric.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-07 12:17           ` Michal Hocko
@ 2017-02-07 13:16             ` vinayak menon
  -1 siblings, 0 replies; 32+ messages in thread
From: vinayak menon @ 2017-02-07 13:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue, Feb 7, 2017 at 5:47 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 07-02-17 16:39:15, vinayak menon wrote:
>> On Tue, Feb 7, 2017 at 1:40 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Mon 06-02-17 20:40:10, vinayak menon wrote:
>> >> On Mon, Feb 6, 2017 at 6:22 PM, Michal Hocko <mhocko@kernel.org> wrote:
> [...]
>> >> > It would also be more than useful to say how much the slab reclaim
>> >> > really contributed.
>> >>
>> >> The 70% fewer events are caused by slab reclaim being added to
>> >> vmpressure, which is confirmed by running the test with and without
>> >> the fix.  But it is hard to say the effect on reclaim stats is caused
>> >> by this problem, because the lowmemorykiller can be written with
>> >> different heuristics to make the reclaim look better.
>> >
>> > Exactly! And this is why I am still not happy with the current
>> > justification of this patch. It seems to be tuning for a particular
>> > consumer of vmpressure events. Others might depend on less pessimistic
>> > events because we are making some progress after all. Being more
>> > pessimistic can lead to premature OOM or other performance-related
>> > decisions, and that is why I am not happy about that.
>> >
>> > Btw, could you be more specific about your particular test? What is
>> > the desired/acceptable result?
>>
>> The test opens multiple applications on Android in a sequence and
>> then repeats this N times. The time taken to launch each application
>> is measured. A deviation in the launch latencies is seen with and
>> without the patch. The launch latency difference is caused by the
>> smaller number of kills (because of the vmpressure difference).
>
> So this is basically an lmk throughput test. Is this representative enough
> to make any decisions?
>
>> >> The issue we see
>> >> in the above reclaim stats is entirely because of task kills being
>> >> delayed. That is the reason why I did not include the vmstat stats in
>> >> the changelog in the earlier versions.
>> >>
>> >> >
>> >> >> This is a regression introduced by commit 6b4f7799c6a5 ("mm: vmscan:
>> >> >> invoke slab shrinkers from shrink_zone()").
>> >> >
>> >> > I am not really sure this is a regression, though. Maybe your heuristic
>> >> > which consumes events is just too fragile?
>> >> >
>> >> Yes, it could be. A different kind of lowmemorykiller may not show
>> >> this issue at all. In my opinion the regression here is the difference
>> >> in vmpressure values, and thus in the vmpressure events, caused by
>> >> passing slab-reclaimed pages to vmpressure without considering the
>> >> scanned pages and the cost model.
>> >> So would it be better to drop the vmstat data from the changelog?
>> >
>> > No! The main question is whether being more pessimistic and reporting
>> > higher reclaim levels really makes sense even when there is slab
>> > reclaim progress. This hasn't been explained, and I _really_ do not like
>> > a patch which optimizes for a particular consumer of events.
>> >
>> > I understand that the change of behavior is unexpected and that
>> > might be a reason to revert to the original one. But if this is the only
>> > reasonable way to go I would, at least, like to understand what is going
>> > on here. Why can't your lowmemorykiller cope with the workload? Why does
>> > starting to kill sooner (at a time when the slab still reclaims enough
>> > pages to report lower critical events) help to pass your test? Maybe it
>> > is the implementation of the lmk which needs to be changed because it
>> > has some false expectations? Or does the memory reclaim just behave in an
>> > unpredictable manner?
>>
>> Say 4.4 had actually implemented a page-based shrinking model for
>> slab and passed the correct scanned and reclaimed counts to vmpressure,
>> considering the cost model; then it would all be fine, and any behavior
>> difference shown by a vmpressure client would need to be fixed. But
>> as I understand it, the case here is different.
>
>> vmpressure was implemented to work with scanned and reclaimed pages
>> from the LRUs, and it works well for at least some use cases.
>
> Userspace shouldn't care about the specific implementation at all. We
> should be able to change the implementation without anybody actually
> noticing.
>
>> As you pointed out earlier, there could be problems with the way
>> vmpressure works since it does not consider many other costs. But
>> it shows an estimate of the pressure on the LRUs. I think adding just
>> the slab reclaimed count to nr_reclaimed without considering the cost
>> is arbitrary, and it disturbs the LRU pressure which vmpressure
>> reports properly.
>
> Well, it is not completely arbitrary. Slabs are scanned in proportion to
> the LRU scanning.
By arbitrary I meant adding the reclaimed count alone without
considering the scanned count.

>
>> So shouldn't we account slab reclaimed pages in vmpressure only when
>> we have a proper way to do it? By adding slab reclaimed pages, we are
>> telling vmpressure that X pages were reclaimed with zero effort. With
>> this patch, vmpressure will show an estimate of the pressure on the
>> LRUs, restoring its original behavior. If we add the slab cost in the
>> future, vmpressure can become more accurate. But just adding slab
>> reclaimed is arbitrary, right? Consider a case where we start to
>> account reclaimed pages from other shrinkers which are not reporting
>> their reclaimed values right now, like zsmalloc, the android
>> lowmemorykiller, etc. Then the nr_reclaimed sent to vmpressure will
>> just be bloated and will make vmpressure useless, right? And most of
>> the time vmpressure will receive reclaimed greater than scanned and
>> won't report any critical events. The problem we are encountering now
>> with slab reclaimed is a subset of the case above, right?
>
> The main point here is whether we really _should_ emit critical events
> when we actually _reclaim_ pages. This is something I haven't heard an
> answer to.
>
I agree that we should not send critical events when slab reclaims enough.
But the problem is that we really don't know the cost of reclaiming slab.
Take just one case to show the difference.

Say we implement actual page-based slab reclaim, and in one instance:
nr_scanned_lru is 1024 and nr_reclaimed_lru is 256,
nr_scanned_slab is 1024 and nr_reclaimed_slab is 512.
Thus total_scanned=2048 and total_reclaimed=768, and vmpressure is around 62.

With the regression we have now, it would look like this:
nr_scanned_lru is 1024 and nr_reclaimed_lru is 256,
nr_scanned_slab is 0 and nr_reclaimed_slab is 512.
Thus total_scanned=1024 and total_reclaimed=768, and vmpressure is around 25.

With the fix:
nr_scanned_lru is 1024 and nr_reclaimed_lru is 256.
Thus total_scanned=1024 and total_reclaimed=256, and vmpressure is around 75.
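
These percentages follow from the window calculation in
vmpressure_calc_level() (mm/vmpressure.c); a simplified standalone
sketch of it, with an illustrative helper name:

static unsigned long calc_pressure(unsigned long scanned,
				   unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure;

	/*
	 * If reclaimed ever exceeds scanned (as happens when slab
	 * reclaimed pages are added with no matching scanned count),
	 * this subtraction underflows and the bogus huge value reads
	 * as critical pressure; that is the underflow patch 2/2 fixes.
	 */
	pressure = scale - (reclaimed * scale / scanned);
	pressure = pressure * 100 / scale;

	return pressure;	/* 0..100, higher means more pressure */
}

calc_pressure(2048, 768) == 62	(hypothetical page-based slab reclaim)
calc_pressure(1024, 768) == 25	(current behavior)
calc_pressure(1024, 256) == 75	(with this patch)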


>> Starting to kill at the right time helps recover memory at a faster
>> rate than waiting for the reclaim to complete. Yes, we may be able
>> to modify the lowmemorykiller to cope with this problem. But the
>> actual problem this patch tried to fix was the vmpressure event
>> regression.
>
> I am not happy about the regression, but you should try to understand
> that we might end up with another report a month later from a different
> consumer of events.
I understand that. But this was the way vmpressure had worked until the
regression, and IMHO adding reclaimed slab just increases the noise in
vmpressure.

>
> I believe that vmpressure needs some serious rethinking to come up with
> a more realistic and stable metric.
Okay, I agree. So you are suggesting we drop the patch?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure
  2017-02-07 13:16             ` vinayak menon
@ 2017-02-07 14:52               ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-02-07 14:52 UTC (permalink / raw)
  To: vinayak menon
  Cc: Vinayak Menon, Andrew Morton, Johannes Weiner, mgorman, vbabka,
	Rik van Riel, vdavydov.dev, anton.vorontsov, Minchan Kim,
	shashim, linux-mm, linux-kernel

On Tue 07-02-17 18:46:55, vinayak menon wrote:
> On Tue, Feb 7, 2017 at 5:47 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 07-02-17 16:39:15, vinayak menon wrote:
[...]
> >> Starting to kill at the right time helps recover memory at a faster
> >> rate than waiting for the reclaim to complete. Yes, we may be able
> >> to modify the lowmemorykiller to cope with this problem. But the
> >> actual problem this patch tried to fix was the vmpressure event
> >> regression.
> >
> > I am not happy about the regression, but you should try to understand
> > that we might end up with another report a month later from a different
> > consumer of events.
>
> I understand that. But this was the way vmpressure had worked until the
> regression, and IMHO adding reclaimed slab just increases the noise in
> vmpressure.

I would argue the previous behavior was wrong as well.

> > I believe that vmpressure needs some serious rethinking to come up with
> > a more realistic and stable metric.
>
> Okay, I agree. So you are suggesting we drop the patch?

Unless there is a strong reason to keep it. Your test case seems to be
rather artificial, and the behavior is not much better after your patch.
So rather than tuning the broken behavior for a particular test case,
I would welcome rethinking the whole thing.

That being said, I am not NAKing the patch, so if others think that this
is a reasonable thing to do for now, I will not stand in the way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-02-07 14:53 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-06 12:24 [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure Vinayak Menon
2017-02-06 12:24 ` Vinayak Menon
2017-02-06 12:24 ` [PATCH 2/2 RESEND] mm: vmpressure: fix sending wrong events on underflow Vinayak Menon
2017-02-06 12:24   ` Vinayak Menon
2017-02-06 12:40   ` Michal Hocko
2017-02-06 12:40     ` Michal Hocko
2017-02-06 13:09     ` vinayak menon
2017-02-06 13:09       ` vinayak menon
2017-02-06 13:24       ` Michal Hocko
2017-02-06 13:24         ` Michal Hocko
2017-02-06 14:35         ` vinayak menon
2017-02-06 14:35           ` vinayak menon
2017-02-06 15:12           ` Michal Hocko
2017-02-06 15:12             ` Michal Hocko
2017-02-07 11:17             ` vinayak menon
2017-02-07 11:17               ` vinayak menon
2017-02-07 12:09               ` Michal Hocko
2017-02-07 12:09                 ` Michal Hocko
2017-02-06 12:52 ` [PATCH 1/2 v4] mm: vmscan: do not pass reclaimed slab to vmpressure Michal Hocko
2017-02-06 12:52   ` Michal Hocko
2017-02-06 15:10   ` vinayak menon
2017-02-06 15:10     ` vinayak menon
2017-02-07  8:10     ` Michal Hocko
2017-02-07  8:10       ` Michal Hocko
2017-02-07 11:09       ` vinayak menon
2017-02-07 11:09         ` vinayak menon
2017-02-07 12:17         ` Michal Hocko
2017-02-07 12:17           ` Michal Hocko
2017-02-07 13:16           ` vinayak menon
2017-02-07 13:16             ` vinayak menon
2017-02-07 14:52             ` Michal Hocko
2017-02-07 14:52               ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.