linux-mm.kvack.org archive mirror
* [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
@ 2020-04-13 21:57 svc_lmoiseichuk
  2020-04-13 21:57 ` [PATCH 1/2] memcg: expose vmpressure knobs svc_lmoiseichuk
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: svc_lmoiseichuk @ 2020-04-13 21:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, tj, lizefan, cgroups
  Cc: akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm, Leonid Moiseichuk

From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

A small tweak to expose the vmpressure parameters to userspace, without
any change to the built-in logic.

vmpressure is actively used (e.g. on Android) to track memory stress.
Its parameters were selected empirically quite a long time ago and are
not always suitable for modern memory configurations.

Leonid Moiseichuk (2):
  memcg: expose vmpressure knobs
  memcg, vmpressure: expose vmpressure controls

 .../admin-guide/cgroup-v1/memory.rst          |  12 +-
 include/linux/vmpressure.h                    |  35 ++++++
 mm/memcontrol.c                               | 113 ++++++++++++++++++
 mm/vmpressure.c                               | 101 +++++++---------
 4 files changed, 200 insertions(+), 61 deletions(-)

-- 
2.17.1




* [PATCH 1/2] memcg: expose vmpressure knobs
  2020-04-13 21:57 [PATCH 0/2] memcg, vmpressure: expose vmpressure controls svc_lmoiseichuk
@ 2020-04-13 21:57 ` svc_lmoiseichuk
  2020-04-14 22:55   ` Chris Down
  2020-04-13 21:57 ` [PATCH 2/2] memcg, vmpressure: expose vmpressure controls svc_lmoiseichuk
  2020-04-14 11:37 ` [PATCH 0/2] " Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: svc_lmoiseichuk @ 2020-04-13 21:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, tj, lizefan, cgroups
  Cc: akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm, Leonid Moiseichuk

From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

Populate the memcg vmpressure controls, keeping the legacy defaults:
- memory.pressure_window (512 or SWAP_CLUSTER_MAX * 16)
- memory.pressure_level_critical_prio (3)
- memory.pressure_level_medium (60)
- memory.pressure_level_critical (95)

Signed-off-by: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..42508123a8e1 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -79,6 +79,12 @@ Brief summary of control files.
  memory.use_hierarchy		     set/show hierarchical account enabled
  memory.force_empty		     trigger forced page reclaim
  memory.pressure_level		     set memory pressure notifications
+ memory.pressure_window 	     set the window size in scanned pages; a
+				     multiple of SWAP_CLUSTER_MAX works best,
+				     as the vmscan reclaimer scans in chunks
+ memory.pressure_level_critical_prio vmscan priority for critical level
+ memory.pressure_level_medium	     medium pressure level, in percent
+ memory.pressure_level_critical      critical pressure level, in percent
  memory.swappiness		     set/show swappiness parameter of vmscan
 				     (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate     set/show controls of moving charges
@@ -893,12 +899,16 @@ pressure, the system might be making swap, paging out active file caches,
 etc. Upon this event applications may decide to further analyze
 vmstat/zoneinfo/memcg or internal memory usage statistics and free any
 resources that can be easily reconstructed or re-read from a disk.
+This level's threshold can be tuned via memory.pressure_level_medium.
 
 The "critical" level means that the system is actively thrashing, it is
 about to out of memory (OOM) or even the in-kernel OOM killer is on its
 way to trigger. Applications should do whatever they can to help the
 system. It might be too late to consult with vmstat or any other
-statistics, so it's advisable to take an immediate action.
+statistics, so it's advisable to take an immediate action. This level's
+threshold can be tuned via memory.pressure_level_critical; the window
+size and vmscan priority are controlled by memory.pressure_window and
+memory.pressure_level_critical_prio.
 
 By default, events are propagated upward until the event is handled, i.e. the
 events are not pass-through. For example, you have three cgroups: A->B->C. Now
-- 
2.17.1




* [PATCH 2/2] memcg, vmpressure: expose vmpressure controls
  2020-04-13 21:57 [PATCH 0/2] memcg, vmpressure: expose vmpressure controls svc_lmoiseichuk
  2020-04-13 21:57 ` [PATCH 1/2] memcg: expose vmpressure knobs svc_lmoiseichuk
@ 2020-04-13 21:57 ` svc_lmoiseichuk
  2020-04-14 11:37 ` [PATCH 0/2] " Michal Hocko
  2 siblings, 0 replies; 16+ messages in thread
From: svc_lmoiseichuk @ 2020-04-13 21:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, tj, lizefan, cgroups
  Cc: akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm, Leonid Moiseichuk

From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

The vmpressure code used hardcoded, empirically selected values to
control the levels and parameters for reclaiming pages, which may not
be acceptable for all memory profiles.

Expose those values as memcg controls, keeping the legacy defaults:
- memory.pressure_window (512 or SWAP_CLUSTER_MAX * 16)
- memory.pressure_level_critical_prio (3)
- memory.pressure_level_medium (60)
- memory.pressure_level_critical (95)
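
For illustration, a userspace tuner could then adjust these files like
any other memcg knob. A minimal sketch; the cgroup v1 mount point and
the written values are assumptions drawn from the discussion, not
recommendations:

    #include <stdio.h>

    /* Hypothetical helper: write one value to a root-memcg control file,
     * assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory. */
    static int write_knob(const char *name, unsigned long val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s", name);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%lu\n", val);
        return fclose(f);
    }

    int main(void)
    {
        write_knob("memory.pressure_window", 1024);       /* double the 512 default */
        write_knob("memory.pressure_level_medium", 75);   /* later than the 60% default */
        write_knob("memory.pressure_level_critical", 90); /* below the 95% default */
        return 0;
    }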

Signed-off-by: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
---
 include/linux/vmpressure.h |  35 ++++++++++++
 mm/memcontrol.c            | 113 +++++++++++++++++++++++++++++++++++++
 mm/vmpressure.c            | 101 ++++++++++++++-------------------
 3 files changed, 189 insertions(+), 60 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6d28bc433c1c..9ad0282f9ad9 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -25,6 +25,41 @@ struct vmpressure {
 	struct mutex events_lock;
 
 	struct work_struct work;
+
+	/*
+	 * The window size is the number of scanned pages before
+	 * we try to analyze scanned/reclaimed ratio. So the window is used as a
+	 * rate-limit tunable for the "low" level notification, and also for
+	 * averaging the ratio for medium/critical levels. Using small window
+	 * sizes can cause lot of false positives, but too big window size will
+	 * delay the notifications.
+	 */
+	unsigned long window;
+
+	/*
+	 * When there are too little pages left to scan, vmpressure() may miss
+	 * the critical pressure as number of pages will be less than
+	 * "window size".
+	 * However, in that case the vmscan priority will raise fast as the
+	 * reclaimer will try to scan LRUs more deeply.
+	 *
+	 * The vmscan logic considers these special priorities:
+	 *
+	 * prio == DEF_PRIORITY (12): reclaimer starts with that value
+	 * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
+	 * prio == 0                : close to OOM, kernel scans every page in
+	 *                          : an lru
+	 */
+	unsigned long level_critical_prio;
+
+	/*
+	 * These thresholds are used when we account memory pressure through
+	 * scanned/reclaimed ratio. The current values were chosen empirically.
+	 * In essence, they are percents: the higher the value, the more number
+	 * unsuccessful reclaims there were.
+	 */
+	unsigned long level_medium;
+	unsigned long level_critical;
 };
 
 struct mem_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..f8a956bf6e81 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -251,6 +251,13 @@ struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 	return &memcg->vmpressure;
 }
 
+struct vmpressure *vmpressure_from_css(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return memcg_to_vmpressure(memcg);
+}
+
 struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 {
 	return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
@@ -3905,6 +3912,92 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+
+static u64 mem_cgroup_pressure_window_read(struct cgroup_subsys_state *css,
+					struct cftype *cft)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	return vmpr->window;
+}
+
+static int mem_cgroup_pressure_window_write(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	if (val < SWAP_CLUSTER_MAX)
+		return -EINVAL;
+
+	vmpr->window = val;
+
+	return 0;
+}
+
+static u64 mem_cgroup_pressure_level_critical_prio_read(
+			struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	return vmpr->level_critical_prio;
+}
+
+static int mem_cgroup_pressure_level_critical_prio_write(
+		struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	if (val > DEF_PRIORITY)
+		return -EINVAL;
+
+	vmpr->level_critical_prio = val;
+
+	return 0;
+}
+
+
+static u64 mem_cgroup_pressure_level_medium_read(
+		struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	return vmpr->level_medium;
+}
+
+static int mem_cgroup_pressure_level_medium_write(
+		struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	if (val > 100)
+		return -EINVAL;
+
+	vmpr->level_medium = val;
+
+	return 0;
+}
+
+static u64 mem_cgroup_pressure_level_critical_read(
+			struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	return vmpr->level_critical;
+}
+
+static int mem_cgroup_pressure_level_critical_write(
+		struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct vmpressure *vmpr = vmpressure_from_css(css);
+
+	if (val > 100)
+		return -EINVAL;
+
+	vmpr->level_critical = val;
+
+	return 0;
+}
+
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
@@ -4777,6 +4870,26 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "pressure_level",
 	},
+	{
+		.name = "pressure_window",
+		.read_u64 = mem_cgroup_pressure_window_read,
+		.write_u64 = mem_cgroup_pressure_window_write,
+	},
+	{
+		.name = "pressure_level_critical_prio",
+		.read_u64 = mem_cgroup_pressure_level_critical_prio_read,
+		.write_u64 = mem_cgroup_pressure_level_critical_prio_write,
+	},
+	{
+		.name = "pressure_level_medium",
+		.read_u64 = mem_cgroup_pressure_level_medium_read,
+		.write_u64 = mem_cgroup_pressure_level_medium_write,
+	},
+	{
+		.name = "pressure_level_critical",
+		.read_u64 = mem_cgroup_pressure_level_critical_read,
+		.write_u64 = mem_cgroup_pressure_level_critical_write,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index d69019fc3789..6fc680dec971 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -21,52 +21,6 @@
 #include <linux/printk.h>
 #include <linux/vmpressure.h>
 
-/*
- * The window size (vmpressure_win) is the number of scanned pages before
- * we try to analyze scanned/reclaimed ratio. So the window is used as a
- * rate-limit tunable for the "low" level notification, and also for
- * averaging the ratio for medium/critical levels. Using small window
- * sizes can cause lot of false positives, but too big window size will
- * delay the notifications.
- *
- * As the vmscan reclaimer logic works with chunks which are multiple of
- * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
- *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
- */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
-
-/*
- * These thresholds are used when we account memory pressure through
- * scanned/reclaimed ratio. The current values were chosen empirically. In
- * essence, they are percents: the higher the value, the more number
- * unsuccessful reclaims there were.
- */
-static const unsigned int vmpressure_level_med = 60;
-static const unsigned int vmpressure_level_critical = 95;
-
-/*
- * When there are too little pages left to scan, vmpressure() may miss the
- * critical pressure as number of pages will be less than "window size".
- * However, in that case the vmscan priority will raise fast as the
- * reclaimer will try to scan LRUs more deeply.
- *
- * The vmscan logic considers these special priorities:
- *
- * prio == DEF_PRIORITY (12): reclaimer starts with that value
- * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
- * prio == 0                : close to OOM, kernel scans every page in an lru
- *
- * Any value in this range is acceptable for this tunable (i.e. from 12 to
- * 0). Current value for the vmpressure_level_critical_prio is chosen
- * empirically, but the number, in essence, means that we consider
- * critical level when scanning depth is ~10% of the lru size (vmscan
- * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
- * eights).
- */
-static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
-
 static struct vmpressure *work_to_vmpressure(struct work_struct *work)
 {
 	return container_of(work, struct vmpressure, work);
@@ -109,17 +63,18 @@ static const char * const vmpressure_str_modes[] = {
 	[VMPRESSURE_LOCAL] = "local",
 };
 
-static enum vmpressure_levels vmpressure_level(unsigned long pressure)
+static enum vmpressure_levels vmpressure_level(struct vmpressure *vmpr,
+						unsigned long pressure)
 {
-	if (pressure >= vmpressure_level_critical)
+	if (pressure >= vmpr->level_critical)
 		return VMPRESSURE_CRITICAL;
-	else if (pressure >= vmpressure_level_med)
+	else if (pressure >= vmpr->level_medium)
 		return VMPRESSURE_MEDIUM;
 	return VMPRESSURE_LOW;
 }
 
-static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
-						    unsigned long reclaimed)
+static enum vmpressure_levels vmpressure_calc_level(struct vmpressure *vmpr,
+			unsigned long scanned, unsigned long reclaimed)
 {
 	unsigned long scale = scanned + reclaimed;
 	unsigned long pressure = 0;
@@ -145,7 +100,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	pr_debug("%s: %3lu  (s: %lu  r: %lu)\n", __func__, pressure,
 		 scanned, reclaimed);
 
-	return vmpressure_level(pressure);
+	return vmpressure_level(vmpr, pressure);
 }
 
 struct vmpressure_event {
@@ -207,7 +162,7 @@ static void vmpressure_work_fn(struct work_struct *work)
 	vmpr->tree_reclaimed = 0;
 	spin_unlock(&vmpr->sr_lock);
 
-	level = vmpressure_calc_level(scanned, reclaimed);
+	level = vmpressure_calc_level(vmpr, scanned, reclaimed);
 
 	do {
 		if (vmpressure_event(vmpr, level, ancestor, signalled))
@@ -273,7 +228,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 		vmpr->tree_reclaimed += reclaimed;
 		spin_unlock(&vmpr->sr_lock);
 
-		if (scanned < vmpressure_win)
+		if (scanned < vmpr->window)
 			return;
 		schedule_work(&vmpr->work);
 	} else {
@@ -286,14 +241,14 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 		spin_lock(&vmpr->sr_lock);
 		scanned = vmpr->scanned += scanned;
 		reclaimed = vmpr->reclaimed += reclaimed;
-		if (scanned < vmpressure_win) {
+		if (scanned < vmpr->window) {
 			spin_unlock(&vmpr->sr_lock);
 			return;
 		}
 		vmpr->scanned = vmpr->reclaimed = 0;
 		spin_unlock(&vmpr->sr_lock);
 
-		level = vmpressure_calc_level(scanned, reclaimed);
+		level = vmpressure_calc_level(vmpr, scanned, reclaimed);
 
 		if (level > VMPRESSURE_LOW) {
 			/*
@@ -322,21 +277,23 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
  */
 void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
 {
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+
 	/*
 	 * We only use prio for accounting critical level. For more info
-	 * see comment for vmpressure_level_critical_prio variable above.
+	 * see the comment for vmpressure's level_critical_prio field.
 	 */
-	if (prio > vmpressure_level_critical_prio)
+	if (prio > vmpr->level_critical_prio)
 		return;
 
 	/*
 	 * OK, the prio is below the threshold, updating vmpressure
 	 * information before shrinker dives into long shrinking of long
-	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
+	 * range vmscan. Passing scanned = vmpr->window, reclaimed = 0
 	 * to the vmpressure() basically means that we signal 'critical'
 	 * level.
 	 */
-	vmpressure(gfp, memcg, true, vmpressure_win, 0);
+	vmpressure(gfp, memcg, true, vmpr->window, 0);
 }
 
 #define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
@@ -450,6 +407,30 @@ void vmpressure_init(struct vmpressure *vmpr)
 	mutex_init(&vmpr->events_lock);
 	INIT_LIST_HEAD(&vmpr->events);
 	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+
+	/*
+	 * As the vmscan reclaimer logic works with chunks which are multiple
+	 * of SWAP_CLUSTER_MAX, it makes sense to use it for the window size
+	 * as well.
+	 *
+	 * TODO: Make the window size depend on machine size, as we do for
+	 * vmstat thresholds. Now we set it to 512 pages (2MB for 4KB pages).
+	 */
+	vmpr->window = SWAP_CLUSTER_MAX * 16;
+
+	/*
+	 * Any value in this range is acceptable for this tunable (i.e. from
+	 * 12 to 0). Current value for the vmpressure level_critical_prio is
+	 * chosen empirically, but the number, in essence, means that we
+	 * consider critical level when scanning depth is ~10% of the lru size
+	 * (vmscan scans 'lru_size >> prio' pages, so it is actually 12.5%,
+	 * or one eights).
+	 */
+	vmpr->level_critical_prio = ilog2(100 / 10);
+
+	/* These legacy default values were chosen empirically. */
+	vmpr->level_medium = 60;
+	vmpr->level_critical = 95;
 }
 
 /**
-- 
2.17.1




* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-13 21:57 [PATCH 0/2] memcg, vmpressure: expose vmpressure controls svc_lmoiseichuk
  2020-04-13 21:57 ` [PATCH 1/2] memcg: expose vmpressure knobs svc_lmoiseichuk
  2020-04-13 21:57 ` [PATCH 2/2] memcg, vmpressure: expose vmpressure controls svc_lmoiseichuk
@ 2020-04-14 11:37 ` Michal Hocko
  2020-04-14 16:42   ` Leonid Moiseichuk
  2 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-04-14 11:37 UTC (permalink / raw)
  To: svc_lmoiseichuk
  Cc: hannes, vdavydov.dev, tj, lizefan, cgroups, akpm, rientjes,
	minchan, vinmenon, andriy.shevchenko, anton.vorontsov, penberg,
	linux-mm, Leonid Moiseichuk

On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> 
> Small tweak to populate vmpressure parameters to userspace without
> any built-in logic change.
> 
> The vmpressure is used actively (e.g. on Android) to track mm stress.
> vmpressure parameters selected empiricaly quite long time ago and not
> always suitable for modern memory configurations.

This needs many more details. Why is it not suitable? What are the
usual numbers you need to set for it to work properly? Why wouldn't
those be generally applicable?

Anyway, I have to confess I am not a big fan of this. vmpressure turned
out to be a very weak interface for measuring memory pressure. Not only
is it not NUMA aware, which makes it unusable on many systems, it also
delivers data way too late in practice.

Btw. why don't you use /proc/pressure/memory, resp. its memcg
counterpart, to measure the memory pressure in the first place?

> Leonid Moiseichuk (2):
>   memcg: expose vmpressure knobs
>   memcg, vmpressure: expose vmpressure controls
> 
>  .../admin-guide/cgroup-v1/memory.rst          |  12 +-
>  include/linux/vmpressure.h                    |  35 ++++++
>  mm/memcontrol.c                               | 113 ++++++++++++++++++
>  mm/vmpressure.c                               | 101 +++++++---------
>  4 files changed, 200 insertions(+), 61 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 11:37 ` [PATCH 0/2] " Michal Hocko
@ 2020-04-14 16:42   ` Leonid Moiseichuk
  2020-04-14 18:49     ` Michal Hocko
  2020-04-14 19:23     ` Johannes Weiner
  0 siblings, 2 replies; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-14 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: svc_lmoiseichuk, hannes, vdavydov.dev, tj, lizefan, cgroups,
	akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm


Thanks, Michal, for the quick response; see my answers below.
I will update the commit message with numbers for 8 GB swapless
devices.

On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> >
> > Small tweak to populate vmpressure parameters to userspace without
> > any built-in logic change.
> >
> > The vmpressure is used actively (e.g. on Android) to track mm stress.
> > vmpressure parameters selected empiricaly quite long time ago and not
> > always suitable for modern memory configurations.
>
> This needs much more details. Why it is not suitable? What are usual
> numbers you need to set up to work properly? Why those wouldn't be
> generally applicable?
>
As far as I can see, the numbers vmpressure uses come out close to the
RSS of userspace processes as a measure of memory utilization. The
default calibration of memory.pressure_level_medium at 60% makes an
8 GB device hit the medium threshold when RSS utilization reaches
~5 GB, which is a bit too early; I observe it happening immediately
after boot. A reasonable level would be in the 70-80% range, depending
on the SW preloaded on your device.

From another point of view, with memory.pressure_level_critical set to
95% the critical event may never fire, because by then the OOM killer
has already started killing processes, and in some cases that is even
worse than the now-removed Android low memory killer. For such cases it
makes sense to shift the threshold down to 85-90%, so the device
handles low-memory situations reliably instead of relying only on
oom_score_adj hints.

The next important parameter to tweak is memory.pressure_window, which
is worth doubling to reduce the number of userspace activations, saving
some power by reducing sensitivity.
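
(For context, the userspace activations in question arrive through the
cgroup-v1 memory.pressure_level eventfd interface documented in
memory.rst. A minimal listener sketch, assuming the v1 memory
controller is mounted at /sys/fs/cgroup/memory:

    #include <sys/eventfd.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];
        uint64_t n;
        int efd = eventfd(0, 0);
        int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
        int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);

        if (efd < 0 || lfd < 0 || cfd < 0)
            return 1;
        /* register as "<event_fd> <pressure_level_fd> <level>" */
        snprintf(buf, sizeof(buf), "%d %d medium", efd, lfd);
        if (write(cfd, buf, strlen(buf)) < 0)
            return 1;
        /* each successful read is one userspace wakeup */
        while (read(efd, &n, sizeof(n)) == sizeof(n))
            printf("medium vmpressure event\n");
        return 0;
    }

Each wakeup here is one of the "activations" being counted, so a
larger window translates directly into fewer wakeups.)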

For 12 and 16 GB devices the situation will be similar but worse: with
the current settings they will hit the medium level while ~5 or 6.5 GB
of memory is still free.


>
> Anyway, I have to confess I am not a big fan of this. vmpressure turned
> out to be a very weak interface to measure the memory pressure. Not only
> it is not numa aware which makes it unusable on many systems it also
> gives data way too late from the practice.
>
> Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> to measure the memory pressure in the first place?
>

According to our checks, PSI produces numbers only when swap is
enabled. E.g. a swapless device at 75% RAM utilization:
==> /proc/pressure/io <==
some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

It is probably possible to get PSI numbers by introducing heavy IO with
swap enabled, but that is not a typical case for mobile devices.

In the swap-enabled case, memory pressure follows IO pressure at some
fraction, i.e. memory is io/2 ... io/10 depending on the pattern.
A light sysbench case with swap enabled:
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282

Since not all devices have zram or swap enabled, it makes sense to keep
vmpressure tuning possible: it is widely used on Android and the
related issues are well understood.


> > Leonid Moiseichuk (2):
> >   memcg: expose vmpressure knobs
> >   memcg, vmpressure: expose vmpressure controls
> >
> >  .../admin-guide/cgroup-v1/memory.rst          |  12 +-
> >  include/linux/vmpressure.h                    |  35 ++++++
> >  mm/memcontrol.c                               | 113 ++++++++++++++++++
> >  mm/vmpressure.c                               | 101 +++++++---------
> >  4 files changed, 200 insertions(+), 61 deletions(-)
> >
> > --
> > 2.17.1
> >
>
> --
> Michal Hocko
> SUSE Labs
>


-- 
With Best Wishes,
Leonid



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 16:42   ` Leonid Moiseichuk
@ 2020-04-14 18:49     ` Michal Hocko
  2020-04-14 20:53       ` Leonid Moiseichuk
  2020-04-14 19:23     ` Johannes Weiner
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-04-14 18:49 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: svc_lmoiseichuk, hannes, vdavydov.dev, tj, lizefan, cgroups,
	akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm

On Tue 14-04-20 12:42:44, Leonid Moiseichuk wrote:
> Thanks Michal for quick response, see my answer below.
> I will update the commit message with numbers for 8 GB memory swapless
> devices.
> 
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> > >
> > > Small tweak to populate vmpressure parameters to userspace without
> > > any built-in logic change.
> > >
> > > The vmpressure is used actively (e.g. on Android) to track mm stress.
> > > vmpressure parameters selected empiricaly quite long time ago and not
> > > always suitable for modern memory configurations.
> >
> > This needs much more details. Why it is not suitable? What are usual
> > numbers you need to set up to work properly? Why those wouldn't be
> > generally applicable?
> >
> As far I see numbers which vmpressure uses - they are closer to RSS of
> userspace processes for memory utilization.
> Default calibration in memory.pressure_level_medium as 60% makes 8GB device
> hit memory threshold when RSS utilization
> reaches ~5 GB and that is a bit too early, I observe it happened
> immediately after boot. Reasonable level should be
> in the 70-80% range depending on SW preloaded on your device.

I am not sure I follow. Levels are based on reclaim ineffectivity, not
the overall memory utilization. So it takes only 40% reclaim
effectivity to trigger the medium level. While you are right that the
threshold for the event is pretty arbitrary, I would like to hear why
that doesn't work in your environment. It shouldn't really depend on
the amount of memory, as this is a percentage, right?

> From another point of view having a memory.pressure_level_critical set to
> 95% may never happen as it comes to a level where an OOM killer already
> starts to kill processes,
> and in some cases it is even worse than the now removed Android low memory
> killer. For such cases has sense to shift the threshold down to 85-90% to
> have device reliably
> handling low memory situations and not rely only on oom_score_adj hints.
> 
> Next important parameter for tweaking is memory.pressure_window which has
> the sense to increase twice to reduce the number of activations of userspace
> to save some power by reducing sensitivity.

Could you be more specific, please?

> For 12 and 16 GB devices the situation will be similar but worse, based on
> fact in current settings they will hit medium memory usage when ~5 or 6.5
> GB memory will be still free.
> 
> 
> >
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface to measure the memory pressure. Not only
> > it is not numa aware which makes it unusable on many systems it also
> > gives data way too late from the practice.
> >
> > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > to measure the memory pressure in the first place?
> >
> 
> According to our checks PSI produced numbers only when swap enabled e.g.
> swapless device 75% RAM utilization:

I believe you should discuss that with the people familiar with PSI
internals (Johannes is already on the CC list).
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 16:42   ` Leonid Moiseichuk
  2020-04-14 18:49     ` Michal Hocko
@ 2020-04-14 19:23     ` Johannes Weiner
  2020-04-14 22:12       ` Leonid Moiseichuk
  1 sibling, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2020-04-14 19:23 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: Michal Hocko, svc_lmoiseichuk, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm

On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface to measure the memory pressure. Not only
> > it is not numa aware which makes it unusable on many systems it also
> > gives data way too late from the practice.

Yes, it's late in the game for vmpressure, and also a bit too late for
extensive changes in cgroup1.

> > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > to measure the memory pressure in the first place?
> >
> 
> According to our checks PSI produced numbers only when swap enabled e.g.
> swapless device 75% RAM utilization:
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
> 
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> full avg10=0.00 avg60=0.00 avg300=0.00 total=0

That doesn't look right. With total=0, there couldn't have been any
reclaim activity, which means that vmpressure couldn't have reported
anything either.

By the time vmpressure reports a drop in reclaim efficiency, psi
should have already been reporting time spent doing reclaim. It
reports a superset of the information conveyed by vmpressure.

> Probably it is possible to activate PSI by introducing high IO and swap
> enabled but that is not a typical case for mobile devices.
> 
> With swap-enabled case memory pressure follows IO pressure with some
> fraction i.e. memory is io/2 ... io/10 depending on pattern.
> Light sysbench case with swap enabled
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
> 
> Since not all devices have zram or swap enabled it makes sense to have
> vmpressure tuning option possible since
> it is well used in Android and related issues are understandable.

Android (since 10 afaik) uses psi to make low memory / OOM
decisions. See the introduction of the psi poll() support:
https://lwn.net/Articles/782662/

It's true that with swap you may see a more gradual increase in
pressure, whereas without swap you may go from idle to OOM much
faster, depending on what type of memory is being allocated. But psi
will still report it. You may just have to use poll() to get in-time
notification like you do with vmpressure.
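
For reference, a minimal trigger sketch along the lines of
Documentation/accounting/psi.rst; the 150ms-stall-per-1s-window
threshold is purely illustrative, not a recommendation:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct pollfd pfd;
        /* wake up when memory stall time exceeds 150ms within any 1s window */
        const char *trig = "some 150000 1000000";
        int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);

        if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0)
            return 1;
        pfd.fd = fd;
        pfd.events = POLLPRI;
        /* the kernel raises POLLPRI each time the threshold is crossed */
        while (poll(&pfd, 1, -1) > 0 && !(pfd.revents & POLLERR))
            if (pfd.revents & POLLPRI)
                printf("memory pressure threshold crossed\n");
        return 0;
    }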



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 18:49     ` Michal Hocko
@ 2020-04-14 20:53       ` Leonid Moiseichuk
  2020-04-15  7:51         ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-14 20:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: svc_lmoiseichuk, hannes, vdavydov.dev, tj, lizefan, cgroups,
	akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm


It would be nice if you could specify the exact numbers you would like to see.

On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:

> ....

> As far I see numbers which vmpressure uses - they are closer to RSS of
> > userspace processes for memory utilization.
> > Default calibration in memory.pressure_level_medium as 60% makes 8GB
> device
> > hit memory threshold when RSS utilization
> > reaches ~5 GB and that is a bit too early, I observe it happened
> > immediately after boot. Reasonable level should be
> > in the 70-80% range depending on SW preloaded on your device.
>
> I am not sure I follow. Levels are based on the reclaim ineffectivity not
> the overall memory utilization. So it takes to have only 40% reclaim
> effectivity to trigger the medium level. While you are right that the
> threshold for the event is pretty arbitrary I would like to hear why
> that doesn't work in your environment. It shouldn't really depend on the
> amount of memory as this is a percentage, right?
>
It depends not only on the amount of memory or reclaims but also on
what software is running.

As I see from vmscan.c, vmpressure is activated from the various
shrink_node() calls or, basically, do_try_to_free_pages(). To hit this
state you need to run short of memory for whatever reason, so the
amount of memory plays a role here. In particular my case is heavily
impacted by GPU (CMA) consumption, which can easily take gigs. Apps can
take a gigabyte as well. So reclaiming will be called quite often when
memory is short (4K calls are possible).

A level change is only handled once the number of scanned pages exceeds
the window size; 512 is too little, as that is only 2 MB. So small
slices are a source of false triggers.

Next, pressure is counted as
        unsigned long scale = scanned + reclaimed;
        pressure = scale - (reclaimed * scale / scanned);
        pressure = pressure * 100 / scale;
For a window of 512 scanned pages (let's use the minimum), reclaim has
to fall to 204 pages to hit the 60% threshold, and to 25 pages for 95%
(critical).
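
A quick standalone check of that integer math; this is a self-contained
test program mirroring the formula above, not kernel code:

    #include <stdio.h>

    static unsigned long pressure(unsigned long scanned, unsigned long reclaimed)
    {
        unsigned long scale = scanned + reclaimed;
        unsigned long p = scale - (reclaimed * scale / scanned);

        return p * 100 / scale;
    }

    int main(void)
    {
        /* reclaiming 204 of 512 scanned pages sits right at the 60% medium level */
        printf("204/512 -> %lu%%\n", pressure(512, 204));
        /* reclaiming only 25 of 512 pages reaches the 95% critical level */
        printf("25/512  -> %lu%%\n", pressure(512, 25));
        return 0;
    }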

When pressure does happen (usually at 85% of memory used, hitting the
critical level), I rarely see numbers close to the real ones, e.g.:
vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
Most of the time it is looping between kswapd and lmkd reclaim
failures, consuming quite a high amount of CPU.

At the level of individual vmscan calls everything looks as expected:
[  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
[  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
[  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
[  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
[  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0


>
> > From another point of view having a memory.pressure_level_critical set to
> > 95% may never happen as it comes to a level where an OOM killer already
> > starts to kill processes,
> > and in some cases it is even worse than the now removed Android low
> memory
> > killer. For such cases has sense to shift the threshold down to 85-90% to
> > have device reliably
> > handling low memory situations and not rely only on oom_score_adj hints.
> >
> > Next important parameter for tweaking is memory.pressure_window which has
> > the sense to increase twice to reduce the number of activations of
> userspace
> > to save some power by reducing sensitivity.
>
> Could you be more specific, please?
>
Those are the parameters most sensitive to tweaking for me.
At least someone who uses vmpressure will be able to tune them up or
down depending on the combination of apps.



>
> > For 12 and 16 GB devices the situation will be similar but worse, based
> on
> > fact in current settings they will hit medium memory usage when ~5 or 6.5
> > GB memory will be still free.
> >
> >
> > >
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface to measure the memory pressure. Not
> only
> > > it is not numa aware which makes it unusable on many systems it also
> > > gives data way too late from the practice.
> > >
> > > Btw. why don't you use /proc/pressure/memory resp. its memcg
> counterpart
> > > to measure the memory pressure in the first place?
> > >
> >
> > According to our checks PSI produced numbers only when swap enabled e.g.
> > swapless device 75% RAM utilization:
>
> I believe you should discuss that with the people familiar with PSI
> internals (Johannes already in the CC list).
>

Thanks for the pointer, I will answer his letters.

> --
> Michal Hocko
> SUSE Labs
>


-- 
With Best Wishes,
Leonid



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 19:23     ` Johannes Weiner
@ 2020-04-14 22:12       ` Leonid Moiseichuk
  2020-04-15  7:55         ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-14 22:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, svc lmoiseichuk, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	penberg, linux-mm


I do not agree with all the comments, see below.

On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface to measure the memory pressure. Not
> only
> > > it is not numa aware which makes it unusable on many systems it also
> > > gives data way too late from the practice.
>
> Yes, it's late in the game for vmpressure, and also a bit too late for
> extensive changes in cgroup1.
>
200 lines just to move functionality from one place to another, without
any logic change?
Those do not seem like extensive changes.


>
> > > Btw. why don't you use /proc/pressure/memory resp. its memcg
> counterpart
> > > to measure the memory pressure in the first place?
> > >
> >
> > According to our checks PSI produced numbers only when swap enabled e.g.
> > swapless device 75% RAM utilization:
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> > full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
> >
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=0
>
> That doesn't look right. With total=0, there couldn't have been any
> reclaim activity, which means that vmpressure couldn't have reported
> anything either.
>
Unfortunately not: vmpressure does observe reclaim, and I shared the
numbers/calls in the parallel letter.
And I see kswapd+lmkd consuming quite a lot of CPU cycles.
That is the same device, swap disabled.
If I enable swap (zram based, as Android usually does) it starts to
produce some numbers below 0.1,
which does not look like huge pressure.


> By the time vmpressure reports a drop in reclaim efficiency, psi
> should have already been reporting time spent doing reclaim. It
> reports a superset of the information conveyed by vmpressure.
>


> > Probably it is possible to activate PSI by introducing high IO and swap
> > enabled but that is not a typical case for mobile devices.
> >
> > With swap-enabled case memory pressure follows IO pressure with some
> > fraction i.e. memory is io/2 ... io/10 depending on pattern.
> > Light sysbench case with swap enabled
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> > full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
> >
> > Since not all devices have zram or swap enabled it makes sense to have
> > vmpressure tuning option possible since
> > it is well used in Android and related issues are understandable.
>
> Android (since 10 afaik) uses psi to make low memory / OOM
> decisions. See the introduction of the psi poll() support:
>
> https://lwn.net/Articles/782662/
>

Android selects between PSI (primary) and vmpressure (backup), see line
2872+:
https://android.googlesource.com/platform/system/memory/lmkd/+/refs/heads/master/lmkd.cpp#2872


>
> It's true that with swap you may see a more gradual increase in
> pressure, whereas without swap you may go from idle to OOM much
> faster, depending on what type of memory is being allocated. But psi
> will still report it. You may just have to use poll() to get in-time
> notification like you do with vmpressure.
>
I expected that any spikes would be visible at the shortest avg level,
e.g. 10s. I cannot confirm that now, but I could play around. If you
have preferences about use-cases please let me know.


-- 
With Best Wishes,
Leonid



* Re: [PATCH 1/2] memcg: expose vmpressure knobs
  2020-04-13 21:57 ` [PATCH 1/2] memcg: expose vmpressure knobs svc_lmoiseichuk
@ 2020-04-14 22:55   ` Chris Down
  2020-04-14 23:00     ` Leonid Moiseichuk
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Down @ 2020-04-14 22:55 UTC (permalink / raw)
  To: svc_lmoiseichuk
  Cc: hannes, mhocko, vdavydov.dev, tj, lizefan, cgroups, akpm,
	rientjes, minchan, vinmenon, andriy.shevchenko, anton.vorontsov,
	penberg, linux-mm, Leonid Moiseichuk

svc_lmoiseichuk@magicleap.com writes:
>From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
>
>Populating memcg vmpressure controls with legacy defaults:
>- memory.pressure_window (512 or SWAP_CLUSTER_MAX * 16)
>- memory.pressure_level_critical_prio (3)
>- memory.pressure_level_medium (60)
>- memory.pressure_level_critical (95)
>
>Signed-off-by: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

I'm against this even in the abstract: cgroup v1 is deprecated and its 
interface frozen, and vmpressure is pretty much already supplanted by PSI, 
which actually works (whereas vmpressure often doesn't since it mostly ends up 
just measuring reclaim efficiency, rather than actual memory pressure).

Without an extremely compelling reason to expose these, this just muddles the 
situation.



* Re: [PATCH 1/2] memcg: expose vmpressure knobs
  2020-04-14 22:55   ` Chris Down
@ 2020-04-14 23:00     ` Leonid Moiseichuk
  0 siblings, 0 replies; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-14 23:00 UTC (permalink / raw)
  To: Chris Down
  Cc: hannes, Michal Hocko, vdavydov.dev, tj, lizefan, cgroups, akpm,
	rientjes, minchan, vinmenon, andriy.shevchenko, penberg,
	linux-mm


No problem; since cgroups v1 is frozen, that is a valid stopper.
Let me check what can be done about PSI for swapless cases.

On Tue, Apr 14, 2020 at 6:55 PM Chris Down <chris@chrisdown.name> wrote:

> svc_lmoiseichuk@magicleap.com writes:
> >From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> >
> >Populating memcg vmpressure controls with legacy defaults:
> >- memory.pressure_window (512 or SWAP_CLUSTER_MAX * 16)
> >- memory.pressure_level_critical_prio (3)
> >- memory.pressure_level_medium (60)
> >- memory.pressure_level_critical (95)
> >
> >Signed-off-by: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
>
> I'm against this even in the abstract, cgroup v1 is deprecated and its
> interface frozen, and vmpressure is pretty much already supplanted by PSI,
> which actually works (whereas vmpressure often doesn't since it mostly
> ends up
> just measuring reclaim efficiency, rather than actual memory pressure).
>
> Without an extremely compelling reason to expose these, this just muddles
> the
> situation.
>


-- 
With Best Wishes,
Leonid



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 20:53       ` Leonid Moiseichuk
@ 2020-04-15  7:51         ` Michal Hocko
  2020-04-15 12:17           ` Leonid Moiseichuk
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-04-15  7:51 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: svc_lmoiseichuk, hannes, vdavydov.dev, tj, lizefan, cgroups,
	akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm

On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> It would be nice if you can specify exact numbers you like to see.

You are proposing an interface which allows tuning thresholds from
userspace, which suggests that you want to tune them. I am asking what
kind of tuning you are using and why we cannot use those values as
defaults in the kernel.

> On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> 
> > ....
> 
> > As far I see numbers which vmpressure uses - they are closer to RSS of
> > > userspace processes for memory utilization.
> > > Default calibration in memory.pressure_level_medium as 60% makes 8GB
> > device
> > > hit memory threshold when RSS utilization
> > > reaches ~5 GB and that is a bit too early, I observe it happened
> > > immediately after boot. Reasonable level should be
> > > in the 70-80% range depending on SW preloaded on your device.
> >
> > I am not sure I follow. Levels are based on the reclaim ineffectivity not
> > the overall memory utilization. So it takes to have only 40% reclaim
> > effectivity to trigger the medium level. While you are right that the
> > threshold for the event is pretty arbitrary I would like to hear why
> > that doesn't work in your environment. It shouldn't really depend on the
> > amount of memory as this is a percentage, right?
> >
> It is not only depends from amount of memory or reclams but also what is
> software running.
> 
> As I see from vmscan.c vmpressure activated from various shrink_node()  or,
> basically do_try_to_free_pages().
> To hit this state you need to somehow lack memory due to various reasons,
> so the amount of memory plays a role here.
> In particular my case is very impacted by GPU (using CMA) consumption which
> can easily take gigs.
> Apps can take gigabyte as well.
> So reclaiming will be quite often called in case of lack of memory (4K
> calls are possible).
> 
> Handling level change will happen if the amount of scanned pages is more
> than window size, 512 is too little as now it is only 2 MB.
> So small slices are a source of false triggers.
> 
> Next, pressure counted as
>         unsigned long scale = scanned + reclaimed;
>         pressure = scale - (reclaimed * scale / scanned);
>         pressure = pressure * 100 / scale;

Just to make this more obvious this is essentially 
	100 * (1 - reclaimed/scanned)
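(the common factor scale = scanned + reclaimed cancels, so up to
integer truncation the two forms agree)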

> Or for 512 pages (lets use minimal) it leads to reclaimed should be 204
> pages for 60% threshold and 25 pages for 95% (as critical)
>
> In case of pressure happened (usually at 85% of memory used, and hittin
> critical level)

I still find this very confusing because the amount of used memory is
not really important. It really only depends on the reclaim activity and
that is either the memcg or the global reclaim. And you are getting
critical levels only if the reclaim is failing to reclaim way too many
pages. 

> I rarely see something like closer to real numbers
> vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
> vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
> Most of the time it is looping between kswapd and lmkd reclaiming failures,
> consuming quite a high amount of cpu.
> 
> On vmscan calls everything looks as expected
> [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0

To me, this looks more like a problem of the vmpressure implementation
than something you want to work around by tuning.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-14 22:12       ` Leonid Moiseichuk
@ 2020-04-15  7:55         ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2020-04-15  7:55 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: Johannes Weiner, svc lmoiseichuk, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	penberg, linux-mm

On Tue 14-04-20 18:12:47, Leonid Moiseichuk wrote:
> I do not agree with all comments, see below.
> 
> On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > > out to be a very weak interface to measure the memory pressure. Not
> > only
> > > > it is not numa aware which makes it unusable on many systems it also
> > > > gives data way too late from the practice.
> >
> > Yes, it's late in the game for vmpressure, and also a bit too late for
> > extensive changes in cgroup1.
> >
> 200 lines just to move functionality from one place to another without
> logic change?
> There does not seem to be extensive changes.

Any user visible API is a big change. We have to maintain any API
forever, so there has to be a really strong reason/use case for
inclusion. I haven't heard any strong justification so far. It all
seems to me that you are trying to work around real vmpressure issues
by fine-tuning parameters, and that is almost always a bad reason for
adding a new tunable.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-15  7:51         ` Michal Hocko
@ 2020-04-15 12:17           ` Leonid Moiseichuk
  2020-04-15 12:28             ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-15 12:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: svc lmoiseichuk, Johannes Weiner, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm


As Chris Down stated, cgroups v1 is frozen, so no API changes in the
mainline kernel.
If opinions change in the future I can continue polishing this change.
In the meantime I will focus on PSI bugs for swapless/zram-swapped
devices :)

The rest is below.

On Wed, Apr 15, 2020 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> > It would be nice if you can specify exact numbers you like to see.
>
> You are proposing an interface which allows to tune thresholds from
> userspace. Which suggests that you want to tune them. I am asking what
> kind of tuning you are using and why cannot we use them as defaults in
> the kernel.
>

Yes, this type of hack is obvious. But parameters selected at one
moment in time might not be good later.
Plus, these patches can be applied by a vendor to e.g. Android 8 or 9,
which has no PSI, and tweaked their own way.
Some products stick to old kernel versions. I made the documentation a
separate change to cover a wider set of older kernels.
The patches are transparent, tested and working fine.


> > On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > > ....
> >
> > > As far I see numbers which vmpressure uses - they are closer to RSS of
> > > > userspace processes for memory utilization.
> > > > Default calibration in memory.pressure_level_medium as 60% makes 8GB
> > > device
> > > > hit memory threshold when RSS utilization
> > > > reaches ~5 GB and that is a bit too early, I observe it happened
> > > > immediately after boot. Reasonable level should be
> > > > in the 70-80% range depending on SW preloaded on your device.
> > >
> > > I am not sure I follow. Levels are based on the reclaim ineffectivity
> not
> > > the overall memory utilization. So it takes to have only 40% reclaim
> > > effectivity to trigger the medium level. While you are right that the
> > > threshold for the event is pretty arbitrary I would like to hear why
> > > that doesn't work in your environment. It shouldn't really depend on
> the
> > > amount of memory as this is a percentage, right?
> > >
> > It is not only depends from amount of memory or reclams but also what is
> > software running.
> >
> > As I see from vmscan.c vmpressure activated from various shrink_node()
> or,
> > basically do_try_to_free_pages().
> > To hit this state you need to somehow lack memory due to various reasons,
> > so the amount of memory plays a role here.
> > In particular my case is very impacted by GPU (using CMA) consumption
> which
> > can easily take gigs.
> > Apps can take gigabyte as well.
> > So reclaiming will be quite often called in case of lack of memory (4K
> > calls are possible).
> >
> > Handling level change will happen if the amount of scanned pages is more
> > than window size, 512 is too little as now it is only 2 MB.
> > So small slices are a source of false triggers.
> >
> > Next, pressure counted as
> >         unsigned long scale = scanned + reclaimed;
> >         pressure = scale - (reclaimed * scale / scanned);
> >         pressure = pressure * 100 / scale;
>
> Just to make this more obvious this is essentially
>         100 * (1 - reclaimed/scanned)
>
> > Or for 512 pages (lets use minimal) it leads to reclaimed should be 204
> > pages for 60% threshold and 25 pages for 95% (as critical)
> >
> > In case of pressure happened (usually at 85% of memory used, and hittin
> > critical level)
>
> I still find this very confusing because the amount of used memory is
> not really important. It really only depends on the reclaim activity and
> that is either the memcg or the global reclaim. And you are getting
> critical levels only if the reclaim is failing to reclaim way too many
> pages.
>

OK, agreed from that point of view.
But on larger systems reclaiming happens less often, and we could use
larger window sizes to get a better approximation of memory
utilization.


>
> > I rarely see something like closer to real numbers
> > vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
> > vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but
> 83%
> > Most of the time it is looping between kswapd and lmkd reclaiming
> failures,
> > consuming quite a high amount of cpu.
> >
> > On vmscan calls everything looks as expected
> > [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> > [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> > [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> > [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> > [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0
>
> This looks more like a problem of vmpressure implementation than
> something you want to workaround by tuning to me.
>
Basically that is how it works: collect the scanned pages and kick off
the worker to update the current level.


>
> --
> Michal Hocko
> SUSE Labs
>


-- 
With Best Wishes,
Leonid



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-15 12:17           ` Leonid Moiseichuk
@ 2020-04-15 12:28             ` Michal Hocko
  2020-04-15 12:33               ` Leonid Moiseichuk
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-04-15 12:28 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: svc lmoiseichuk, Johannes Weiner, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	anton.vorontsov, penberg, linux-mm

On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> As Chris Down stated cgroups v1 frozen, so no API changes in the mainline
> kernel.

Yes, this is true, _but_ if there are clear shortcomings in the existing
vmpressure implementation which could be addressed reasonably then there
is no reason to ignore them.

[...]

> > I still find this very confusing because the amount of used memory is
> > not really important. It really only depends on the reclaim activity and
> > that is either the memcg or the global reclaim. And you are getting
> > critical levels only if the reclaim is failing to reclaim way too many
> > pages.
> >
> 
> OK, agree from that point of view.
> But for larger systems reclaiming happens not so often and we can
> use larger window sizes to have better memory utilization approximation.

Nobody is saying that the window size has to be fixed. This all can be
auto-tuned in the kernel. It would, however, require defining what
"better utilization approximation" means much more specifically.

[...]
> > This looks more like a problem of vmpressure implementation than
> > something you want to workaround by tuning to me.
> >
> Basically it is how it works - collect the scanned page and activate worker
> activity to update the current level.

That is the case only for some vmpressure invocations. And your data
suggest that those might lead to misleading results. So this is likely
a good thing to focus on, to find out whether it can be addressed.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/2] memcg, vmpressure: expose vmpressure controls
  2020-04-15 12:28             ` Michal Hocko
@ 2020-04-15 12:33               ` Leonid Moiseichuk
  0 siblings, 0 replies; 16+ messages in thread
From: Leonid Moiseichuk @ 2020-04-15 12:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: svc lmoiseichuk, Johannes Weiner, vdavydov.dev, tj, lizefan,
	cgroups, akpm, rientjes, minchan, vinmenon, andriy.shevchenko,
	penberg, linux-mm


Good point, but at the moment I am not ready to implement auto-tuning
of the window size, because my device only has 8 GB of RAM and a very
custom version of Android-based SW.
So basically I cannot test all possible combinations.

On Wed, Apr 15, 2020 at 8:29 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> > As Chris Down stated cgroups v1 frozen, so no API changes in the mainline
> > kernel.
>
> Yes, this is true, _but_ if there are clear shortcomings in the existing
> vmpressure implementation which could be addressed reasonably then there
> is no reason to ignore them.
>
> [...]
>
> > > I still find this very confusing because the amount of used memory is
> > > not really important. It really only depends on the reclaim activity
> and
> > > that is either the memcg or the global reclaim. And you are getting
> > > critical levels only if the reclaim is failing to reclaim way too many
> > > pages.
> > >
> >
> > OK, agree from that point of view.
> > But for larger systems reclaiming happens not so often and we can
> > use larger window sizes to have better memory utilization approximation.
>
> Nobody is saying the the window size has to be fixed. This all can be
> auto tuned in the kernel.  It would, however, require to define what
> "better utilization approximation" means much more specifically.
>
> [...]
> > > This looks more like a problem of vmpressure implementation than
> > > something you want to workaround by tuning to me.
> > >
> > Basically it is how it works - collect the scanned page and activate
> worker
> > activity to update the current level.
>
> That is the case only for some vmpressure invocations. And your data
> suggest that those might lead to misleading results. So this is likely
> good to focus on and find out whether this can be addressed.
> --
> Michal Hocko
> SUSE Labs
>


-- 
With Best Wishes,
Leonid


