Linux-mm Archive on lore.kernel.org
* [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection
@ 2019-11-07  6:08 Yafang Shao
  2019-11-07  6:08 ` [PATCH 2/2] mm, memcg: introduce passive reclaim counters for non-root memcg Yafang Shao
  2019-11-08 13:26 ` [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection Michal Hocko
  0 siblings, 2 replies; 4+ messages in thread
From: Yafang Shao @ 2019-11-07  6:08 UTC (permalink / raw)
  To: mhocko, hannes, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

This patch introduces a new memory controller file, memory.low.level,
which is used to set multiple levels of memory.low protection.
The valid value of memory.low.level is [0..3], meaning four levels of
protection are supported for now. This new controller file takes effect
only when memory.low is set. With this new file, we can apply page
reclaim QoS to different memcgs: when the system is under memory
pressure, it will reclaim pages from the memcgs with lower priority
first, and only then from those with higher priority.

  - What is the problem with the current memory.low protection ?
  Currently we can set a bigger memory.low protection on a memcg with
  higher priority, and a smaller memory.low protection on a memcg with
  lower priority. But once there's no available unprotected memory to
  reclaim, the reclaimers will reclaim the protected memory from all the
  memcgs. What we really want is for the reclaimers to reclaim the
  protected memory from the lower-priority memcgs first, and only if that
  still can't meet the page allocation to then reclaim the protected
  memory from the higher-priority memcgs. The logic can be displayed as
  below,
	under_memory_pressure
		reclaim_unprotected_memory
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_lowest_priority_memcgs
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_higher_priority_memcgs
		if (meet_the_request)
			exit
		reclaim_protected_memory_from_highest_priority_memcgs
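The escalation above can be simulated with a small, hypothetical user-space sketch (not the kernel code): reclaim walks from unprotected memory up through the protection levels and stops as soon as the request is met, so higher-priority protection is only broken when strictly necessary.

```c
#include <assert.h>

#define MEMCG_LOW_LEVEL_MAX 4

/* Hypothetical simulation of the escalation above: freeable[0] is the
 * unprotected memory, freeable[1..4] the protected memory of memcgs at
 * low levels 0..3.  Returns the highest level that had to be touched
 * to satisfy the request (0 means unprotected memory sufficed), or -1
 * if even reclaiming every level cannot meet the request. */
static int levels_touched(const unsigned long freeable[MEMCG_LOW_LEVEL_MAX + 1],
			  unsigned long request)
{
	unsigned long freed = 0;
	int i;

	for (i = 0; i <= MEMCG_LOW_LEVEL_MAX; i++) {
		freed += freeable[i];
		if (freed >= request)
			return i;
	}
	return -1;
}
```

With 10 pages unprotected and 5 pages protected at each level, a request for 8 pages never touches protected memory, while a request for 12 pages breaks only the lowest-priority protection.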

  - Why does it support four levels of memory low protection ?
  Low priority, medium priority and high priority cover the most common
  use cases in real life, so four levels of protection should be enough.
  The more levels there are, the higher the overhead page reclaim will
  take. So four-level protection is really a trade-off.

  - How does it work ?
  One example of how this multiple-level controller works,
	target memcg (root or non-root)
	   /	\
	  B	 C
	 / \
	B1 B2
  B/memory.low.level=2	effective low level is 2
  B1/memory.low.level=3 effective low level is 2
  B2/memory.low.level=0 effective low level is 0
  C/memory.low.level=1	effective low level is 1

  The effective low level is min(low_level, parent_low_level).
  memory.low is set in all memcgs.
  The reclaimer will then reclaim these memcgs in this order:
  B2 -> C -> B/B1
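The clamping rule used in the example above can be sketched in plain C (a hypothetical user-space helper, not the kernel implementation): a child's effective low level is its own setting clamped by its parent's effective level, so a child can never claim a higher protection priority than its parent.

```c
#include <assert.h>

/* Sketch of the rule "effective low level is
 * min(low_level, parent_low_level)" from the example above.
 * A child can never be more protected than its parent allows. */
static unsigned long effective_low_level(unsigned long own_level,
					 unsigned long parent_elevel)
{
	return own_level < parent_elevel ? own_level : parent_elevel;
}
```

For instance, B1 sets level 3 but sits under B whose effective level is 2, so B1's effective level is clamped to 2, matching the table above.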

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 11 ++++++++
 include/linux/page_counter.h            |  3 +++
 mm/memcontrol.c                         | 45 +++++++++++++++++++++++++++++++++
 mm/vmscan.c                             | 31 ++++++++++++++++++-----
 4 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index ed912315..cdacc9c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1137,6 +1137,17 @@ PAGE_SIZE multiple when read back.
 	Putting more memory than generally available under this
 	protection is discouraged.
 
+  memory.low.level
+        A read-write single value file which exists on non-root
+        cgroups.  The default is "0". The valid value is [0..3].
+
+        The controller file takes effect only after memory.low is set.
+        If both memory.low and memory.low.level are set on many memcgs,
+        then under memory pressure the reclaimer will reclaim the
+        unprotected memory first, then the protected memory with a lower
+        memory.low.level, and at last the protected memory with the
+        highest memory.low.level.
+
   memory.high
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57..19bc589 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -6,6 +6,7 @@
 #include <linux/kernel.h>
 #include <asm/page.h>
 
+#define MEMCG_LOW_LEVEL_MAX 4
 struct page_counter {
 	atomic_long_t usage;
 	unsigned long min;
@@ -22,6 +23,8 @@ struct page_counter {
 	unsigned long elow;
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
+	unsigned long low_level;
+	unsigned long elow_level;
 
 	/* legacy */
 	unsigned long watermark;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50f5bc5..9da4ef9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5962,6 +5962,37 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_low_level_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%lu\n", memcg->memory.low_level);
+
+	return 0;
+}
+
+static ssize_t memory_low_level_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret, low_level;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &low_level);
+	if (ret)
+		return ret;
+
+	if (low_level < 0 || low_level >= MEMCG_LOW_LEVEL_MAX)
+		return -EINVAL;
+
+	memcg->memory.low_level = low_level;
+
+	return nbytes;
+}
+
 static int memory_high_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m, READ_ONCE(mem_cgroup_from_seq(m)->high));
@@ -6151,6 +6182,12 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 		.write = memory_low_write,
 	},
 	{
+		.name = "low.level",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_low_level_show,
+		.write = memory_low_level_write,
+	},
+	{
 		.name = "high",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_high_show,
@@ -6280,6 +6317,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	struct mem_cgroup *parent;
 	unsigned long emin, parent_emin;
 	unsigned long elow, parent_elow;
+	unsigned long elow_level, parent_elow_level;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6296,6 +6334,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	emin = memcg->memory.min;
 	elow = memcg->memory.low;
+	elow_level = memcg->memory.low_level;
 
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
@@ -6331,11 +6370,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		if (low_usage && siblings_low_usage)
 			elow = min(elow, parent_elow * low_usage /
 				   siblings_low_usage);
+
+		parent_elow_level = READ_ONCE(parent->memory.elow_level);
+		elow_level = min(elow_level, parent_elow_level);
+	} else {
+		elow_level = 0;
 	}
 
 exit:
 	memcg->memory.emin = emin;
 	memcg->memory.elow = elow;
+	memcg->memory.elow_level = elow_level;
 
 	if (usage <= emin)
 		return MEMCG_PROT_MIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d979852..3b08e85 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,15 +88,16 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	unsigned int hibernation_mode:1;
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
 	 * memory.low and nothing was reclaimed, go back for memory.low.
 	 */
-	unsigned int memcg_low_reclaim:1;
+	unsigned int memcg_low_level:3;
 	unsigned int memcg_low_skipped:1;
+	unsigned int memcg_low_step:2;
 
-	unsigned int hibernation_mode:1;
 
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
@@ -2403,10 +2404,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long lruvec_size;
 		unsigned long scan;
 		unsigned long protection;
+		bool memcg_low_reclaim = (sc->memcg_low_level >
+					  memcg->memory.elow_level);
 
 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
 		protection = mem_cgroup_protection(memcg,
-						   sc->memcg_low_reclaim);
+						   memcg_low_reclaim);
 
 		if (protection) {
 			/*
@@ -2691,6 +2694,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
 		unsigned long scanned;
+		unsigned long step;
 
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
@@ -2706,7 +2710,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
-			if (!sc->memcg_low_reclaim) {
+			if (sc->memcg_low_level <= memcg->memory.elow_level) {
+				step = (memcg->memory.elow_level -
+					sc->memcg_low_level);
+				if (step < sc->memcg_low_step)
+					sc->memcg_low_step = step;
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
@@ -3007,6 +3015,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	pg_data_t *last_pgdat;
 	struct zoneref *z;
 	struct zone *zone;
+
+	sc->memcg_low_step = MEMCG_LOW_LEVEL_MAX - 1;
+
 retry:
 	delayacct_freepages_start();
 
@@ -3061,9 +3072,15 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		return 1;
 
 	/* Untapped cgroup reserves?  Don't OOM, retry. */
-	if (sc->memcg_low_skipped) {
-		sc->priority = initial_priority;
-		sc->memcg_low_reclaim = 1;
+	if (sc->memcg_low_skipped &&
+	    sc->memcg_low_level < MEMCG_LOW_LEVEL_MAX) {
+		/*
+		 * If it is hard to reclaim page caches, we'd better use a
+		 * lower priority to avoid taking too much time.
+		 */
+		sc->priority = initial_priority > sc->memcg_low_level ?
+			       (initial_priority - sc->memcg_low_level) : 0;
+		sc->memcg_low_level += sc->memcg_low_step + 1;
 		sc->memcg_low_skipped = 0;
 		goto retry;
 	}
-- 
1.8.3.1




* [PATCH 2/2] mm, memcg: introduce passive reclaim counters for non-root memcg
  2019-11-07  6:08 [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection Yafang Shao
@ 2019-11-07  6:08 ` Yafang Shao
  2019-11-08 13:26 ` [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection Michal Hocko
  1 sibling, 0 replies; 4+ messages in thread
From: Yafang Shao @ 2019-11-07  6:08 UTC (permalink / raw)
  To: mhocko, hannes, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

The memory.{min, low} protection can prevent the reclaimers from
reclaiming the pages of a memcg when there's memory pressure outside
of that memcg. We'd better introduce some counters to show this
behavior. This patch introduces two counters, pgscan_passive and
pgsteal_passive.
	pgscan_passive:
		pages scanned due to memory pressure outside this memcg
	pgsteal_passive:
		pages reclaimed due to memory pressure outside this memcg

A memcg with a higher memory.{min, low} setting will get lower
pgscan_passive and pgsteal_passive counts. For example, if memory.min
is equal to memory.max, then these passive reclaim counters should
always be zero.

These counters are only for non-root memory cgroups. It's not easy to
introduce container-only vmstat counters, because that would require
lots of changes. So I introduce global vmstat counters instead, but
they are always zero for the root memory cgroup.
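The accounting rule the patch applies (visible in the mm/vmscan.c hunks below) reduces to a simple predicate; this sketch uses a hypothetical minimal memcg type, not the kernel's struct mem_cgroup:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct memcg { int id; };

/* A scan/steal event is "passive" for a memcg when the reclaim was
 * triggered by pressure outside of it, i.e. the memcg being reclaimed
 * is not the reclaim target itself. */
static bool is_passive_reclaim(const struct memcg *memcg,
			       const struct memcg *target)
{
	return memcg != target;
}
```

In the patch this is the `memcg != sc->target_mem_cgroup` check that gates the PGSCAN_PASSIVE and PGSTEAL_PASSIVE increments.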

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
 include/linux/vm_event_item.h           |  2 ++
 mm/memcontrol.c                         |  4 ++++
 mm/vmscan.c                             | 11 +++++++++--
 mm/vmstat.c                             |  2 ++
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cdacc9c..11e0129 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1336,10 +1336,20 @@ PAGE_SIZE multiple when read back.
 
 		Amount of scanned pages (in an inactive LRU list)
 
+          pgscan_passive
+
+                Amount of scanned pages due to memory pressure outside this
+                memcg
+
 	  pgsteal
 
 		Amount of reclaimed pages
 
+          pgsteal_passive
+
+                Amount of reclaimed pages due to memory pressure outside
+                this memcg
+
 	  pgactivate
 
 		Amount of pages moved to the active LRU list
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441..0fcdaa3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -32,8 +32,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGREFILL,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGSTEAL_PASSIVE,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
+		PGSCAN_PASSIVE,
 		PGSCAN_DIRECT_THROTTLE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9da4ef9..0a2bf9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1443,9 +1443,13 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "pgscan %lu\n",
 		       memcg_events(memcg, PGSCAN_KSWAPD) +
 		       memcg_events(memcg, PGSCAN_DIRECT));
+	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGSCAN_PASSIVE),
+		       memcg_events(memcg, PGSCAN_PASSIVE));
 	seq_buf_printf(&s, "pgsteal %lu\n",
 		       memcg_events(memcg, PGSTEAL_KSWAPD) +
 		       memcg_events(memcg, PGSTEAL_DIRECT));
+	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGSTEAL_PASSIVE),
+		       memcg_events(memcg, PGSTEAL_PASSIVE));
 	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
 		       memcg_events(memcg, PGACTIVATE));
 	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b08e85..0ea6c4a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1910,6 +1910,7 @@ static int current_may_throttle(void)
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	bool stalled = false;
+	struct mem_cgroup *memcg;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		if (stalled)
@@ -1934,10 +1935,13 @@ static int current_may_throttle(void)
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
+	memcg = lruvec_memcg(lruvec);
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
-	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
+	if (memcg != sc->target_mem_cgroup)
+		__count_memcg_events(memcg, PGSCAN_PASSIVE, nr_scanned);
+	__count_memcg_events(memcg, item, nr_scanned);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	if (nr_taken == 0)
@@ -1948,10 +1952,13 @@ static int current_may_throttle(void)
 
 	spin_lock_irq(&pgdat->lru_lock);
 
+	memcg = lruvec_memcg(lruvec);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
-	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
+	if (memcg != sc->target_mem_cgroup)
+		__count_memcg_events(memcg, PGSTEAL_PASSIVE, nr_reclaimed);
+	__count_memcg_events(memcg, item, nr_reclaimed);
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
 	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d5337..5d2a053 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1196,8 +1196,10 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"pgrefill",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgsteal_passive",
 	"pgscan_kswapd",
 	"pgscan_direct",
+	"pgscan_passive",
 	"pgscan_direct_throttle",
 
 #ifdef CONFIG_NUMA
-- 
1.8.3.1




* Re: [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection
  2019-11-07  6:08 [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection Yafang Shao
  2019-11-07  6:08 ` [PATCH 2/2] mm, memcg: introduce passive reclaim counters for non-root memcg Yafang Shao
@ 2019-11-08 13:26 ` Michal Hocko
  2019-11-09  3:32   ` Yafang Shao
  1 sibling, 1 reply; 4+ messages in thread
From: Michal Hocko @ 2019-11-08 13:26 UTC (permalink / raw)
  To: Yafang Shao; +Cc: hannes, vdavydov.dev, akpm, linux-mm

On Thu 07-11-19 01:08:08, Yafang Shao wrote:
> This patch introduces a new memory controller file memory.low.level,
> which is used to set multiple levels of memory.low protection.
> The valid value of memory.low.level is [0..3], meaning we support four
> levels protection now. This new controller file takes effect only when
> memory.low is set. With this new file, we can do page reclaim QoS on
> different memcgs. For example, when the system is under memory pressure, it
> will reclaim pages from the memcg with lower priority first and then higher
> priority.
> 
>   - What is the problem with the current memory.low protection ?
>   Currently we can set bigger memory.low protection on memcg with higher
>   priority, and smaller memory.low protection on memcg with lower priority.
>   But once there's no available unprotected memory to reclaim, the
>   reclaimers will reclaim the protected memory from all the memcgs.
>   While we really want the reclaimers to reclaim the protected memory from
>   the lower-priority memcgs first, and if it still can't meet the page
>   allocation it will then reclaim the protected memory from higher-priority
>   memcgs. The logic can be displayed as below,
> 	under_memory_pressure
> 		reclaim_unprotected_memory
> 		if (meet_the_request)
> 			exit
> 		reclaim_protected_memory_from_lowest_priority_memcgs
> 		if (meet_the_request)
> 			exit
> 		reclaim_protected_memory_from_higher_priority_memcgs
> 		if (meet_the_request)
> 			exit
> 		reclaim_protected_memory_from_highest_priority_memcgs

Could you expand a bit more on the usecase please? Do you overcommit on
the memory protection?

Also why is this needed only for the reclaim protection? In other words
let's say that you have more memcgs that are above their protection
thresholds why should reclaim behave differently for them from the
situation when all of them reach the protection? Also what about min
reclaim protection?

>   - Why does it support four-level memory low protection ?
>   Low priority, medium priority and high priority, that is the most common
>   usecases in the real life. So four-level memory low protection should be
>   enough. The more levels it is, the higher overhead page reclaim will
>   take. So four-level protection is really a trade-off.

Well, is this really the case? Isn't that just a matter of a proper
implementation? Starting an API with a very restricted input values
usually tends to outdate very quickly and it is found unsuitable. It
would help to describe how do you envision using those priorities. E.g.
do you have examples of what kind of services workloads fall into which
priority.

That being said, there are quite some gaps in the usecase description
as well as in the interface design.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection
  2019-11-08 13:26 ` [PATCH 1/2] mm, memcg: introduce multiple levels memory low protection Michal Hocko
@ 2019-11-09  3:32   ` Yafang Shao
  0 siblings, 0 replies; 4+ messages in thread
From: Yafang Shao @ 2019-11-09  3:32 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Fri, Nov 8, 2019 at 9:26 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 07-11-19 01:08:08, Yafang Shao wrote:
> > This patch introduces a new memory controller file memory.low.level,
> > which is used to set multiple levels of memory.low protection.
> > The valid value of memory.low.level is [0..3], meaning we support four
> > levels protection now. This new controller file takes effect only when
> > memory.low is set. With this new file, we can do page reclaim QoS on
> > different memcgs. For example, when the system is under memory pressure, it
> > will reclaim pages from the memcg with lower priority first and then higher
> > priority.
> >
> >   - What is the problem with the current memory.low protection ?
> >   Currently we can set bigger memory.low protection on memcg with higher
> >   priority, and smaller memory.low protection on memcg with lower priority.
> >   But once there's no available unprotected memory to reclaim, the
> >   reclaimers will reclaim the protected memory from all the memcgs.
> >   While we really want the reclaimers to reclaim the protected memory from
> >   the lower-priority memcgs first, and if it still can't meet the page
> >   allocation it will then reclaim the protected memory from higher-priority
> >   memcgs. The logic can be displayed as below,
> >       under_memory_pressure
> >               reclaim_unprotected_memory
> >               if (meet_the_request)
> >                       exit
> >               reclaim_protected_memory_from_lowest_priority_memcgs
> >               if (meet_the_request)
> >                       exit
> >               reclaim_protected_memory_from_higher_priority_memcgs
> >               if (meet_the_request)
> >                       exit
> >               reclaim_protected_memory_from_highest_priority_memcgs
>
> Could you expand a bit more on the usecase please? Do you overcommit on
> the memory protection?
>

Hi Michal,

It doesn't matter whether it is overcommitted or not, because there's
an effective low protection when it is overcommitted.
Also, the real low level is the effective low level, which makes it
work in the cgroup hierarchy.
Let's expand the example in the comment above mem_cgroup_protected()
with memory.low.level.

 *
 *        A         A/memory.low = 2G   A/memory.current = 6G  A/memory.low.level = 1
 *     / / \ \
 *    B C  D E      B/memory.low = 3G   B/memory.current = 2G  B/memory.low.level = 2
 *                  C/memory.low = 1G   C/memory.current = 2G  C/memory.low.level = 1
 *                  D/memory.low = 0    D/memory.current = 2G  D/memory.low.level = 3
 *                  E/memory.low = 10G  E/memory.current = 0   E/memory.low.level = 3
Suppose A is the targeted memory cgroup,  the following memory
distribution is expected,
A/memory.current = 2G
B/memory.current = 2G (because it has a higher low.level than C)
C/memory.current = 0
D/memory.current = 0
E/memory.current = 0

Whereas if C/memory.low.level = 2, the result will be
A/memory.current = 2G
B/memory.current = 1.3G
C/memory.current = 0.6G
D/memory.current = 0
E/memory.current = 0
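The ordering the example relies on, where protected memory is reclaimed only from memcgs whose effective low level is below the current scan level, can be sketched as a hypothetical user-space predicate mirroring the `sc->memcg_low_level > memcg->memory.elow_level` check in the patch:

```c
#include <assert.h>

/* Sketch of the reclaim-ordering rule from the example above: a
 * memcg's low protection is broken only once the reclaimer's scan
 * level has been escalated past the memcg's effective low level. */
static int low_protection_broken(unsigned long elow_level,
				 unsigned long scan_level)
{
	return scan_level > elow_level;
}
```

So at scan level 2, C (effective level 1) loses its protection while B (effective level 2) keeps it, which is why B retains its 2G in the first distribution.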


> Also why is this needed only for the reclaim protection? In other words
> let's say that you have more memcgs that are above their protection
> thresholds why should reclaim behave differently for them from the
> situation when all of them reach the protection? Also what about min
> reclaim protection?
>

If you don't set memory.low (and memory.min), then you don't care
about the page reclaim.
If you really care about the page reclaim, then you should set
memory.low (and memory.min) first, and if you want to distinguish
between the protected memories, you can then use memory.low.level.

The reason we don't want to rely on min reclaim protection is that it
would cause more frequent OOMs, and you would have to do more work to
prevent random OOMs.

> >   - Why does it support four-level memory low protection ?
> >   Low priority, medium priority and high priority, that is the most common
> >   usecases in the real life. So four-level memory low protection should be
> >   enough. The more levels it is, the higher overhead page reclaim will
> >   take. So four-level protection is really a trade-off.
>
> Well, is this really the case? Isn't that just a matter of a proper
> implementation? Starting an API with a very restricted input values
> usually tends to outdate very quickly and it is found unsuitable. It
> would help to describe how do you envision using those priorities. E.g.
> do you have examples of what kind of services workloads fall into which
> priority.
>
> That being said there are quite some gaps in the usecase description as
> well the interface design.

We have different workloads running on one single server.
Some workloads are latency sensitive, while the others are not.
When there's resource pressure on this server, we expect this pressure
to impact the latency-sensitive workloads as little as possible,
and to throttle the latency-non-sensitive workloads instead.
memory.{low, min} can separate the page caches into 'unevictable'
memory, protected memory and non-protected memory, but it can't
separate different containers (workloads).
In the latency-sensitive containers, there are also some non-important
page caches, e.g. the page caches for log files,
so it would be better to reclaim the log file pages first (unprotected
memory), then reclaim the protected pages from the
latency-non-sensitive containers, and then reclaim the protected pages
from the latency-sensitive containers.

Sometimes we have to restrict the user behavior.
The current design of memory.{min, low} separates the page caches into
min-protected, low-protected and non-protected, which may seem
restrictive now, but we can improve it.
So I don't think the four-level low protection will be a big problem.

Thanks
Yafang


