* [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
@ 2021-04-05 17:08 Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 01/11] mm: Define top tier memory node mask Tim Chen
                   ` (12 more replies)
  0 siblings, 13 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
others NUMA-wise, but a byte of media has about the same cost whether it
is close or far.  But with new memory tiers such as Persistent Memory
(PMEM), there is a choice between fast/expensive DRAM and slow/cheap
PMEM.

The fast/expensive memory lives in the top tier of the memory hierarchy.

Previously, the patchset
[PATCH 00/10] [v7] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
provides a mechanism to demote cold pages from a DRAM node into PMEM.

And the patchset
[PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
provides a mechanism to promote hot pages in PMEM to a DRAM node by
leveraging AutoNUMA.

The two patchsets together keep the hot pages in DRAM and colder pages
in PMEM.

To make fine-grained cgroup based management of the precious top tier
DRAM memory possible, this patchset adds a few new features:
1. Provides monitoring of the amount of top tier memory used per cgroup
   and by the system as a whole.
2. Applies soft limits on the top tier memory each cgroup uses.
3. Enables kswapd to demote top tier pages from cgroups with excess top
   tier memory usage.

This allows us to provision different amounts of top tier memory to each
cgroup according to the cgroup's latency needs.
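
For illustration, here is a minimal userspace sketch of how the new
cgroup v1 files could be exercised.  It assumes the legacy memory
controller is mounted at /sys/fs/cgroup/memory and that a cgroup named
"db_cgroup" already exists; both are assumptions for this example, not
part of the patchset.

/* Sketch: set a 4 GB top tier soft limit and read back the usage.
 * The file names come from patches 02 and 04; the mount point and
 * cgroup name are made-up examples.
 */
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/fs/cgroup/memory/db_cgroup";
	char path[256], buf[64];
	FILE *f;

	/* memory.toptier_soft_limit_in_bytes is added by patch 02 */
	snprintf(path, sizeof(path), "%s/memory.toptier_soft_limit_in_bytes", base);
	f = fopen(path, "w");
	if (!f)
		return 1;
	fprintf(f, "%llu\n", 4ULL << 30);	/* 4 GB of top tier memory */
	fclose(f);

	/* memory.toptier_usage_in_bytes is added by patch 04 */
	snprintf(path, sizeof(path), "%s/memory.toptier_usage_in_bytes", base);
	f = fopen(path, "r");
	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("top tier usage: %s", buf);
	fclose(f);
	return 0;
}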

The patchset is based on the cgroup v1 interface. One shortcoming of the
v1 interface is that the limit on the cgroup is a soft limit, so a cgroup
can exceed the limit quite a bit before reclaim via page demotion reins
it in.

We are also working on a cgroup v2 control interface that will
have a max limit on the top tier memory per cgroup, but it requires much
additional logic to fall back and allocate from non top tier memory when a
cgroup reaches the maximum limit.  This simpler cgroup v1 implementation,
with all its warts, is used to illustrate the concept of cgroup based
top tier memory management and serves as a starting point for discussion.

The soft limit and soft reclaim logic in this patchset will be similar to what
we would do for a cgroup v2 interface when we reach the high watermark
for top tier usage in a cgroup.

This patchset is applied on top of 
[PATCH 00/10] [v7] Migrate Pages in lieu of discard
and
[PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system

It is part of a larger patchset.  You can play with the complete set of patches
using the tree:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=tiering-0.71

Tim Chen (11):
  mm: Define top tier memory node mask
  mm: Add soft memory limit for mem cgroup
  mm: Account the top tier memory usage per cgroup
  mm: Report top tier memory usage in sysfs
  mm: Add soft_limit_top_tier tree for mem cgroup
  mm: Handle top tier memory in cgroup soft limit memory tree utilities
  mm: Account the total top tier memory in use
  mm: Add toptier option for mem_cgroup_soft_limit_reclaim()
  mm: Use kswapd to demote pages when toptier memory is tight
  mm: Set toptier_scale_factor via sysctl
  mm: Wakeup kswapd if toptier memory need soft reclaim

 Documentation/admin-guide/sysctl/vm.rst |  12 +
 drivers/base/node.c                     |   2 +
 include/linux/memcontrol.h              |  20 +-
 include/linux/mm.h                      |   4 +
 include/linux/mmzone.h                  |   7 +
 include/linux/nodemask.h                |   1 +
 include/linux/vmstat.h                  |  18 ++
 kernel/sysctl.c                         |  10 +
 mm/memcontrol.c                         | 303 +++++++++++++++++++-----
 mm/memory_hotplug.c                     |   3 +
 mm/migrate.c                            |   1 +
 mm/page_alloc.c                         |  36 ++-
 mm/vmscan.c                             |  73 +++++-
 mm/vmstat.c                             |  22 +-
 14 files changed, 444 insertions(+), 68 deletions(-)

-- 
2.20.1



* [RFC PATCH v1 01/11] mm: Define top tier memory node mask
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 02/11] mm: Add soft memory limit for mem cgroup Tim Chen
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
than others, but a byte of media has about the same cost whether it
is close or far.  But, with new memory tiers such as High-Bandwidth
Memory or Persistent Memory, there is a choice between fast/expensive
and slow/cheap.

The fast/expensive memory lives in the top tier of the memory
hierarchy and it is a precious resource that needs to be accounted and
managed on a memory cgroup basis.

Define the top tier memory as the memory nodes that don't have demotion
paths into them from higher tier memory.
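
As an illustration of that rule, here is a small standalone C model
(this is not kernel code; the 4-node demotion table is a made-up
example with two DRAM nodes demoting into two PMEM nodes): a node is
top tier exactly when no other node names it as a demotion target.

/* Standalone model of the top tier rule from this patch:
 * a node is top tier iff no demotion path leads into it.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4
#define NO_NODE (-1)

/* made-up demotion targets: nodes 0,1 (DRAM) demote into 2,3 (PMEM) */
static const int node_demotion[NR_NODES] = { 2, 3, NO_NODE, NO_NODE };

static bool node_is_toptier(int nid)
{
	for (int n = 0; n < NR_NODES; n++)
		if (node_demotion[n] == nid)
			return false;	/* some node demotes into nid */
	return true;
}

int main(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d: %s\n", nid,
		       node_is_toptier(nid) ? "top tier" : "lower tier");
	return 0;
}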

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 drivers/base/node.c      | 2 ++
 include/linux/nodemask.h | 1 +
 mm/memory_hotplug.c      | 3 +++
 mm/migrate.c             | 1 +
 mm/page_alloc.c          | 5 ++++-
 5 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 04f71c7bc3f8..9eb214ac331f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -1016,6 +1016,7 @@ static struct node_attr node_state_attr[] = {
 #endif
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+	[N_TOPTIER] = _NODE_ATTR(is_toptier, N_TOPTIER),
 	[N_GENERIC_INITIATOR] = _NODE_ATTR(has_generic_initiator,
 					   N_GENERIC_INITIATOR),
 };
@@ -1029,6 +1030,7 @@ static struct attribute *node_state_attrs[] = {
 #endif
 	&node_state_attr[N_MEMORY].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
+	&node_state_attr[N_TOPTIER].attr.attr,
 	&node_state_attr[N_GENERIC_INITIATOR].attr.attr,
 	NULL
 };
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index ac398e143c9a..3003401ed7f0 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -399,6 +399,7 @@ enum node_states {
 #endif
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
 	N_CPU,		/* The node has one or more cpus */
+	N_TOPTIER,		/* Top tier node, no demotion path into node */
 	N_GENERIC_INITIATOR,	/* The node has one or more Generic Initiators */
 	NR_NODE_STATES
 };
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7550b88e2432..7b21560d4c4d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -36,6 +36,7 @@
 #include <linux/memblock.h>
 #include <linux/compaction.h>
 #include <linux/rmap.h>
+#include <linux/node.h>
 
 #include <asm/tlbflush.h>
 
@@ -654,6 +655,8 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 
 	if (arg->status_change_nid >= 0)
 		node_set_state(node, N_MEMORY);
+
+	node_set_state(node, N_TOPTIER);
 }
 
 static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
diff --git a/mm/migrate.c b/mm/migrate.c
index 72223fd7e623..e84aedf611da 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -3439,6 +3439,7 @@ static int establish_migrate_target(int node, nodemask_t *used)
 		return NUMA_NO_NODE;
 
 	node_demotion[node] = migration_target;
+	node_clear_state(migration_target, N_TOPTIER);
 
 	return migration_target;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff058941ccfa..471a2c342c4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -157,6 +157,7 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_MEMORY] = { { [0] = 1UL } },
 	[N_CPU] = { { [0] = 1UL } },
 #endif	/* NUMA */
+	[N_TOPTIER] = NODE_MASK_ALL,
 };
 EXPORT_SYMBOL(node_states);
 
@@ -7590,8 +7591,10 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			node_set_state(nid, N_TOPTIER);
+		}
 		check_for_memory(pgdat, nid);
 	}
 }
-- 
2.20.1



* [RFC PATCH v1 02/11] mm: Add soft memory limit for mem cgroup
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 01/11] mm: Define top tier memory node mask Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 03/11] mm: Account the top tier memory usage per cgroup Tim Chen
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

For each memory cgroup, define a soft memory limit on
its top tier memory consumption.  Memory cgroups exceeding
their top tier limit will be selected for demotion of
their top tier memory to a lower tier under memory pressure.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eeb0b52203e9..25d8b9acec7c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,7 @@ struct mem_cgroup {
 	struct work_struct high_work;
 
 	unsigned long soft_limit;
+	unsigned long toptier_soft_limit;
 
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 41a3f22b6639..9a9d677a6654 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3603,6 +3603,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_TOPTIER_SOFT_LIMIT,
 };
 
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
@@ -3643,6 +3644,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 		return counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return (u64)memcg->soft_limit * PAGE_SIZE;
+	case RES_TOPTIER_SOFT_LIMIT:
+		return (u64)memcg->toptier_soft_limit * PAGE_SIZE;
 	default:
 		BUG();
 	}
@@ -3881,6 +3884,14 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		memcg->soft_limit = nr_pages;
 		ret = 0;
 		break;
+	case RES_TOPTIER_SOFT_LIMIT:
+		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
+			ret = -EINVAL;
+			break;
+		}
+		memcg->toptier_soft_limit = nr_pages;
+		ret = 0;
+		break;
 	}
 	return ret ?: nbytes;
 }
@@ -5029,6 +5040,12 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.write = mem_cgroup_write,
 		.read_u64 = mem_cgroup_read_u64,
 	},
+	{
+		.name = "toptier_soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_TOPTIER_SOFT_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
 	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
@@ -5365,6 +5382,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+	memcg->toptier_soft_limit = PAGE_COUNTER_MAX;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
-- 
2.20.1



* [RFC PATCH v1 03/11] mm: Account the top tier memory usage per cgroup
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 01/11] mm: Define top tier memory node mask Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 02/11] mm: Add soft memory limit for mem cgroup Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 04/11] mm: Report top tier memory usage in sysfs Tim Chen
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

For each memory cgroup, account its usage of
top tier memory at the time a top tier page is charged to and
uncharged from the cgroup.
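
A minimal sketch of the accounting rule, as a standalone C model (the
struct and helper below are simplified stand-ins, not the kernel types):
the per-cgroup toptier counter only moves when the (un)charged page
lives on a top tier node.

#include <stdio.h>
#include <stdbool.h>

struct memcg_model { long toptier_pages; };	/* stand-in for the page_counter */

static void charge_toptier(struct memcg_model *cg, bool page_on_toptier, long nr_pages)
{
	if (!page_on_toptier)
		return;			/* lower tier pages are not tracked here */
	cg->toptier_pages += nr_pages;	/* negative nr_pages means uncharge */
}

int main(void)
{
	struct memcg_model cg = { 0 };

	charge_toptier(&cg, true, 512);		/* charge a 2MB THP on a DRAM node */
	charge_toptier(&cg, false, 512);	/* PMEM page: not counted */
	charge_toptier(&cg, true, -512);	/* uncharge the DRAM THP */
	printf("toptier pages: %ld\n", cg.toptier_pages);
	return 0;
}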

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 39 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 25d8b9acec7c..609d8590950c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -225,6 +225,7 @@ struct mem_cgroup {
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
 	struct page_counter tcpmem;		/* v1 only */
+	struct page_counter toptier;
 
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a9d677a6654..fe7bb8613f5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -253,6 +253,13 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 	return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
 }
 
+static inline bool top_tier(struct page *page)
+{
+	int nid = page_to_nid(page);
+
+	return node_state(nid, N_TOPTIER);
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 extern spinlock_t css_set_lock;
 
@@ -951,6 +958,23 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
 }
 
+static inline void mem_cgroup_charge_toptier(struct mem_cgroup *memcg,
+					 struct page *page,
+					 int nr_pages)
+{
+	if (!top_tier(page))
+		return;
+
+	if (nr_pages >= 0)
+		page_counter_charge(&memcg->toptier,
+				   (unsigned long) nr_pages);
+	else {
+		nr_pages = -nr_pages;
+		page_counter_uncharge(&memcg->toptier,
+				   (unsigned long) nr_pages);
+	}
+}
+
 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				       enum mem_cgroup_events_target target)
 {
@@ -2932,6 +2956,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
 	 * - exclusive reference
 	 */
 	page->memcg_data = (unsigned long)memcg;
+	mem_cgroup_charge_toptier(memcg, page, thp_nr_pages(page));
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -3138,6 +3163,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->memcg_data = (unsigned long)memcg |
 				MEMCG_DATA_KMEM;
+			mem_cgroup_charge_toptier(memcg, page, 1 << order);
 			return 0;
 		}
 		css_put(&memcg->css);
@@ -3161,6 +3187,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->memcg_data = 0;
+	mem_cgroup_charge_toptier(memcg, page, -nr_pages);
 	css_put(&memcg->css);
 }
 
@@ -5389,11 +5416,13 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
+		page_counter_init(&memcg->toptier, &parent->toptier);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
 		page_counter_init(&memcg->memory, NULL);
 		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->toptier, NULL);
 		page_counter_init(&memcg->kmem, NULL);
 		page_counter_init(&memcg->tcpmem, NULL);
 
@@ -5745,6 +5774,8 @@ static int mem_cgroup_move_account(struct page *page,
 	css_put(&from->css);
 
 	page->memcg_data = (unsigned long)to;
+	mem_cgroup_charge_toptier(to, page, nr_pages);
+	mem_cgroup_charge_toptier(from, page, -nr_pages);
 
 	__unlock_page_memcg(from);
 
@@ -6832,6 +6863,7 @@ struct uncharge_gather {
 	unsigned long nr_pages;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
+	unsigned long nr_toptier;
 	struct page *dummy_page;
 };
 
@@ -6846,6 +6878,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 
 	if (!mem_cgroup_is_root(ug->memcg)) {
 		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
+		page_counter_uncharge(&ug->memcg->toptier, ug->nr_toptier);
 		if (do_memsw_account())
 			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
@@ -6891,6 +6924,8 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	nr_pages = compound_nr(page);
 	ug->nr_pages += nr_pages;
+	if (top_tier(page))
+		ug->nr_toptier += nr_pages;
 
 	if (PageMemcgKmem(page))
 		ug->nr_kmem += nr_pages;
@@ -7216,8 +7251,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	page->memcg_data = 0;
 
-	if (!mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_is_root(memcg)) {
 		page_counter_uncharge(&memcg->memory, nr_entries);
+		mem_cgroup_charge_toptier(memcg, page, -nr_entries);
+	}
 
 	if (!cgroup_memory_noswap && memcg != swap_memcg) {
 		if (!mem_cgroup_is_root(swap_memcg))
-- 
2.20.1



* [RFC PATCH v1 04/11] mm: Report top tier memory usage in sysfs
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (2 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 03/11] mm: Account the top tier memory usage per cgroup Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup Tim Chen
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

In the memory cgroup's sysfs interface, report the cgroup's usage
of top tier memory in a new field: "toptier_usage_in_bytes".

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/memcontrol.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe7bb8613f5a..68590f46fa76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3631,6 +3631,7 @@ enum {
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
 	RES_TOPTIER_SOFT_LIMIT,
+	RES_TOPTIER_USAGE,
 };
 
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
@@ -3673,6 +3674,8 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 		return (u64)memcg->soft_limit * PAGE_SIZE;
 	case RES_TOPTIER_SOFT_LIMIT:
 		return (u64)memcg->toptier_soft_limit * PAGE_SIZE;
+	case RES_TOPTIER_USAGE:
+		return (u64)page_counter_read(&memcg->toptier) * PAGE_SIZE;
 	default:
 		BUG();
 	}
@@ -5073,6 +5076,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.write = mem_cgroup_write,
 		.read_u64 = mem_cgroup_read_u64,
 	},
+	{
+		.name = "toptier_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_TOPTIER_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
 	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
-- 
2.20.1



* [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (3 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 04/11] mm: Report top tier memory usage in sysfs Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities Tim Chen
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Define a per node soft_limit_top_tier red-black tree that sorts and tracks
the cgroups by each group's excess over its toptier soft limit.  A cgroup
is added to the tree if it has exceeded its top tier soft limit and it
has used pages on the node.
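
A standalone C sketch of the sort key and its use (cgroup names and the
usage/limit numbers below are made-up examples, in pages): the "excess"
is how far toptier usage is above the toptier soft limit, and reclaim
picks the cgroup with the largest excess first.  The kernel keeps this
ordering in the per-node red-black tree rather than scanning an array.

#include <stdio.h>

struct cg { const char *name; unsigned long usage, limit; };

static unsigned long excess(const struct cg *c)
{
	return c->usage > c->limit ? c->usage - c->limit : 0;
}

int main(void)
{
	struct cg cgs[] = {
		{ "batch", 3000, 1000 },	/* 2000 pages over its limit */
		{ "web",   1500, 1000 },	/* 500 pages over */
		{ "db",     800, 1000 },	/* under the limit, not on the tree */
	};
	const struct cg *victim = NULL;

	for (unsigned int i = 0; i < sizeof(cgs) / sizeof(cgs[0]); i++)
		if (excess(&cgs[i]) && (!victim || excess(&cgs[i]) > excess(victim)))
			victim = &cgs[i];

	if (victim)
		printf("demote from %s (excess %lu pages)\n",
		       victim->name, excess(victim));
	return 0;
}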

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/memcontrol.c | 68 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 68590f46fa76..90a78ff3fca8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -122,6 +122,7 @@ struct mem_cgroup_tree {
 };
 
 static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+static struct mem_cgroup_tree soft_limit_toptier_tree __read_mostly;
 
 /* for OOM */
 struct mem_cgroup_eventfd_list {
@@ -590,17 +591,27 @@ mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
 }
 
 static struct mem_cgroup_tree_per_node *
-soft_limit_tree_node(int nid)
-{
-	return soft_limit_tree.rb_tree_per_node[nid];
+soft_limit_tree_node(int nid, enum node_states type)
+{
+	switch (type) {
+	case N_MEMORY:
+		return soft_limit_tree.rb_tree_per_node[nid];
+	case N_TOPTIER:
+		if (node_state(nid, N_TOPTIER))
+			return soft_limit_toptier_tree.rb_tree_per_node[nid];
+		else
+			return NULL;
+	default:
+		return NULL;
+	}
 }
 
 static struct mem_cgroup_tree_per_node *
-soft_limit_tree_from_page(struct page *page)
+soft_limit_tree_from_page(struct page *page, enum node_states type)
 {
 	int nid = page_to_nid(page);
 
-	return soft_limit_tree.rb_tree_per_node[nid];
+	return soft_limit_tree_node(nid, type);
 }
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
@@ -661,12 +672,24 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
 	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
-static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg, enum node_states type)
 {
-	unsigned long nr_pages = page_counter_read(&memcg->memory);
-	unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
+	unsigned long nr_pages;
+	unsigned long soft_limit;
 	unsigned long excess = 0;
 
+	switch (type) {
+	case N_MEMORY:
+		nr_pages = page_counter_read(&memcg->memory);
+		soft_limit = READ_ONCE(memcg->soft_limit);
+		break;
+	case N_TOPTIER:
+		nr_pages = page_counter_read(&memcg->toptier);
+		soft_limit = READ_ONCE(memcg->toptier_soft_limit);
+		break;
+	default:
+		return 0;
+	}
 	if (nr_pages > soft_limit)
 		excess = nr_pages - soft_limit;
 
@@ -679,7 +702,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	struct mem_cgroup_per_node *mz;
 	struct mem_cgroup_tree_per_node *mctz;
 
-	mctz = soft_limit_tree_from_page(page);
+	mctz = soft_limit_tree_from_page(page, N_MEMORY);
 	if (!mctz)
 		return;
 	/*
@@ -688,7 +711,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		mz = mem_cgroup_page_nodeinfo(memcg, page);
-		excess = soft_limit_excess(memcg);
+		excess = soft_limit_excess(memcg, N_MEMORY);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
@@ -718,7 +741,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 
 	for_each_node(nid) {
 		mz = mem_cgroup_nodeinfo(memcg, nid);
-		mctz = soft_limit_tree_node(nid);
+		mctz = soft_limit_tree_node(nid, N_MEMORY);
 		if (mctz)
 			mem_cgroup_remove_exceeded(mz, mctz);
 	}
@@ -742,7 +765,7 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 	 * position in the tree.
 	 */
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!soft_limit_excess(mz->memcg) ||
+	if (!soft_limit_excess(mz->memcg, N_MEMORY) ||
 	    !css_tryget(&mz->memcg->css))
 		goto retry;
 done:
@@ -1805,7 +1828,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.pgdat = pgdat,
 	};
 
-	excess = soft_limit_excess(root_memcg);
+	excess = soft_limit_excess(root_memcg, N_MEMORY);
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1834,7 +1857,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
 					pgdat, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!soft_limit_excess(root_memcg))
+		if (!soft_limit_excess(root_memcg, N_MEMORY))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -3457,7 +3480,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	if (order > 0)
 		return 0;
 
-	mctz = soft_limit_tree_node(pgdat->node_id);
+	mctz = soft_limit_tree_node(pgdat->node_id, N_MEMORY);
 
 	/*
 	 * Do not even bother to check the largest node if the root
@@ -3513,7 +3536,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		if (!reclaimed)
 			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
 
-		excess = soft_limit_excess(mz->memcg);
+		excess = soft_limit_excess(mz->memcg, N_MEMORY);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -7189,6 +7212,19 @@ static int __init mem_cgroup_init(void)
 		rtpn->rb_rightmost = NULL;
 		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
+		if (!node_state(node, N_TOPTIER)) {
+			soft_limit_toptier_tree.rb_tree_per_node[node] = NULL;
+			continue;
+		}
+
+		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
+				    node_online(node) ? node : NUMA_NO_NODE);
+
+		rtpn->rb_root = RB_ROOT;
+		rtpn->rb_rightmost = NULL;
+		spin_lock_init(&rtpn->lock);
+		soft_limit_toptier_tree.rb_tree_per_node[node] = rtpn;
 	}
 
 	return 0;
-- 
2.20.1



* [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (4 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 07/11] mm: Account the total top tier memory in use Tim Chen
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Update the utility functions __mem_cgroup_insert_exceeded() and
__mem_cgroup_remove_exceeded() to allow addition and removal of cgroups
from the new red-black tree that tracks the cgroups that exceed their
toptier memory limits.

Also update the function mem_cgroup_largest_soft_limit_node()
to allow returning the cgroup that has the largest excess usage
of toptier memory.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c            | 152 +++++++++++++++++++++++++++----------
 2 files changed, 122 insertions(+), 39 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 609d8590950c..0ed8ddfd5436 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,6 +124,15 @@ struct mem_cgroup_per_node {
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
+
+	struct rb_node		toptier_tree_node;	 /* RB tree node */
+	unsigned long		toptier_usage_in_excess; /* Set to the value by which */
+						         /* the soft limit is exceeded*/
+	bool			on_toptier_tree;
+
+	bool			congested;	/* memcg has many dirty pages */
+						/* backed by a congested BDI */
+
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90a78ff3fca8..8a7648b79635 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -616,24 +616,44 @@ soft_limit_tree_from_page(struct page *page, enum node_states type)
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
 					 struct mem_cgroup_tree_per_node *mctz,
-					 unsigned long new_usage_in_excess)
+					 unsigned long new_usage_in_excess,
+					 enum node_states type)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
+	struct rb_node *parent = NULL, *mz_tree_node;
 	struct mem_cgroup_per_node *mz_node;
-	bool rightmost = true;
+	bool rightmost = true, *mz_on_tree;
+	unsigned long usage_in_excess, *mz_usage_in_excess;
 
-	if (mz->on_tree)
+	if (type == N_TOPTIER) {
+		mz_usage_in_excess = &mz->toptier_usage_in_excess;
+		mz_tree_node = &mz->toptier_tree_node;
+		mz_on_tree = &mz->on_toptier_tree;
+	} else {
+		mz_usage_in_excess = &mz->usage_in_excess;
+		mz_tree_node = &mz->tree_node;
+		mz_on_tree = &mz->on_tree;
+	}
+
+	if (*mz_on_tree)
 		return;
 
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
+	if (!new_usage_in_excess)
 		return;
+
 	while (*p) {
 		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+		if (type == N_TOPTIER) {
+			mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+					toptier_tree_node);
+			usage_in_excess = mz_node->toptier_usage_in_excess;
+		} else {
+			mz_node = rb_entry(parent, struct mem_cgroup_per_node,
 					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess) {
+			usage_in_excess = mz_node->usage_in_excess;
+		}
+
+		if (new_usage_in_excess < usage_in_excess) {
 			p = &(*p)->rb_left;
 			rightmost = false;
 		} else {
@@ -642,33 +662,47 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
 	}
 
 	if (rightmost)
-		mctz->rb_rightmost = &mz->tree_node;
+		mctz->rb_rightmost = mz_tree_node;
 
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
+	rb_link_node(mz_tree_node, parent, p);
+	rb_insert_color(mz_tree_node, &mctz->rb_root);
+	*mz_usage_in_excess = new_usage_in_excess;
+	*mz_on_tree = true;
 }
 
 static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
-					 struct mem_cgroup_tree_per_node *mctz)
+					 struct mem_cgroup_tree_per_node *mctz,
+					 enum node_states type)
 {
-	if (!mz->on_tree)
+	bool *mz_on_tree;
+	struct rb_node *mz_tree_node;
+
+	if (type == N_TOPTIER) {
+		mz_tree_node = &mz->toptier_tree_node;
+		mz_on_tree = &mz->on_toptier_tree;
+	} else {
+		mz_tree_node = &mz->tree_node;
+		mz_on_tree = &mz->on_tree;
+	}
+
+	if (!(*mz_on_tree))
 		return;
 
-	if (&mz->tree_node == mctz->rb_rightmost)
-		mctz->rb_rightmost = rb_prev(&mz->tree_node);
+	if (mz_tree_node == mctz->rb_rightmost)
+		mctz->rb_rightmost = rb_prev(mz_tree_node);
 
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
+	rb_erase(mz_tree_node, &mctz->rb_root);
+	*mz_on_tree = false;
 }
 
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
-				       struct mem_cgroup_tree_per_node *mctz)
+				       struct mem_cgroup_tree_per_node *mctz,
+				       enum node_states type)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&mctz->lock, flags);
-	__mem_cgroup_remove_exceeded(mz, mctz);
+	__mem_cgroup_remove_exceeded(mz, mctz, type);
 	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
@@ -696,13 +730,18 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg, enum node_state
 	return excess;
 }
 
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
+static void mem_cgroup_update_tree(struct mem_cgroup *bottom_memcg, struct page *page)
 {
 	unsigned long excess;
 	struct mem_cgroup_per_node *mz;
 	struct mem_cgroup_tree_per_node *mctz;
+	enum node_states type = N_MEMORY;
+	struct mem_cgroup *memcg;
+
+repeat_toptier:
+	memcg = bottom_memcg;
+	mctz = soft_limit_tree_from_page(page, type);
 
-	mctz = soft_limit_tree_from_page(page, N_MEMORY);
 	if (!mctz)
 		return;
 	/*
@@ -710,27 +749,37 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 * because their event counter is not touched.
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		bool on_tree;
+
 		mz = mem_cgroup_page_nodeinfo(memcg, page);
-		excess = soft_limit_excess(memcg, N_MEMORY);
+		excess = soft_limit_excess(memcg, type);
+
+		on_tree = (type == N_MEMORY) ? mz->on_tree: mz->on_toptier_tree;
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
 		 */
-		if (excess || mz->on_tree) {
+		if (excess || on_tree) {
 			unsigned long flags;
 
 			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mz, mctz);
+			if (on_tree)
+				__mem_cgroup_remove_exceeded(mz, mctz, type);
+
 			/*
 			 * Insert again. mz->usage_in_excess will be updated.
 			 * If excess is 0, no tree ops.
 			 */
-			__mem_cgroup_insert_exceeded(mz, mctz, excess);
+			__mem_cgroup_insert_exceeded(mz, mctz, excess, type);
+
 			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
+	if (type == N_MEMORY) {
+		type = N_TOPTIER;
+		goto repeat_toptier;
+	}
 }
 
 static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
@@ -743,12 +792,16 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 		mz = mem_cgroup_nodeinfo(memcg, nid);
 		mctz = soft_limit_tree_node(nid, N_MEMORY);
 		if (mctz)
-			mem_cgroup_remove_exceeded(mz, mctz);
+			mem_cgroup_remove_exceeded(mz, mctz, N_MEMORY);
+		mctz = soft_limit_tree_node(nid, N_TOPTIER);
+		if (mctz)
+			mem_cgroup_remove_exceeded(mz, mctz, N_TOPTIER);
 	}
 }
 
 static struct mem_cgroup_per_node *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz,
+				     enum node_states type)
 {
 	struct mem_cgroup_per_node *mz;
 
@@ -757,15 +810,19 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 	if (!mctz->rb_rightmost)
 		goto done;		/* Nothing to reclaim from */
 
-	mz = rb_entry(mctz->rb_rightmost,
+	if (type == N_TOPTIER)
+		mz = rb_entry(mctz->rb_rightmost,
+		      struct mem_cgroup_per_node, toptier_tree_node);
+	else
+		mz = rb_entry(mctz->rb_rightmost,
 		      struct mem_cgroup_per_node, tree_node);
 	/*
 	 * Remove the node now but someone else can add it back,
 	 * we will to add it back at the end of reclaim to its correct
 	 * position in the tree.
 	 */
-	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!soft_limit_excess(mz->memcg, N_MEMORY) ||
+	__mem_cgroup_remove_exceeded(mz, mctz, type);
+	if (!soft_limit_excess(mz->memcg, type) ||
 	    !css_tryget(&mz->memcg->css))
 		goto retry;
 done:
@@ -773,12 +830,13 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 }
 
 static struct mem_cgroup_per_node *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz,
+				   enum node_states type)
 {
 	struct mem_cgroup_per_node *mz;
 
 	spin_lock_irq(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
+	mz = __mem_cgroup_largest_soft_limit_node(mctz, type);
 	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
@@ -3472,7 +3530,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	struct mem_cgroup_per_node *mz, *next_mz = NULL;
 	unsigned long reclaimed;
 	int loop = 0;
-	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_tree_per_node *mctz, *mctz_sibling;
 	unsigned long excess;
 	unsigned long nr_scanned;
 	int migration_nid;
@@ -3481,6 +3539,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		return 0;
 
 	mctz = soft_limit_tree_node(pgdat->node_id, N_MEMORY);
+	mctz_sibling = soft_limit_tree_node(pgdat->node_id, N_TOPTIER);
 
 	/*
 	 * Do not even bother to check the largest node if the root
@@ -3516,7 +3575,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		if (next_mz)
 			mz = next_mz;
 		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
+			mz = mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);
 		if (!mz)
 			break;
 
@@ -3526,7 +3585,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock_irq(&mctz->lock);
-		__mem_cgroup_remove_exceeded(mz, mctz);
+		__mem_cgroup_remove_exceeded(mz, mctz, N_MEMORY);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -3534,7 +3593,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		 */
 		next_mz = NULL;
 		if (!reclaimed)
-			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
+			next_mz =
+			   __mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);
 
 		excess = soft_limit_excess(mz->memcg, N_MEMORY);
 		/*
@@ -3546,8 +3606,20 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		 * term TODO.
 		 */
 		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz, mctz, excess);
+		__mem_cgroup_insert_exceeded(mz, mctz, excess, N_MEMORY);
 		spin_unlock_irq(&mctz->lock);
+
+		/* update both affected N_MEMORY and N_TOPTIER trees */
+		if (mctz_sibling) {
+			spin_lock_irq(&mctz_sibling->lock);
+			__mem_cgroup_remove_exceeded(mz, mctz_sibling,
+						     N_TOPTIER);
+			excess = soft_limit_excess(mz->memcg, N_TOPTIER);
+			__mem_cgroup_insert_exceeded(mz, mctz, excess,
+						     N_TOPTIER);
+			spin_unlock_irq(&mctz_sibling->lock);
+		}
+
 		css_put(&mz->memcg->css);
 		loop++;
 		/*
@@ -5312,6 +5384,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	lruvec_init(&pn->lruvec);
 	pn->usage_in_excess = 0;
 	pn->on_tree = false;
+	pn->toptier_usage_in_excess = 0;
+	pn->on_toptier_tree = false;
 	pn->memcg = memcg;
 
 	memcg->nodeinfo[node] = pn;
-- 
2.20.1



* [RFC PATCH v1 07/11] mm: Account the total top tier memory in use
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (5 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 08/11] mm: Add toptier option for mem_cgroup_soft_limit_reclaim() Tim Chen
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Track the global top tier memory usage stats. They are used as the basis
for deciding when to start demoting pages from memory cgroups that have
exceeded their soft limit.  We start reclaiming top tier memory when the
free top tier memory runs low.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/vmstat.h | 18 ++++++++++++++++++
 mm/vmstat.c            | 20 +++++++++++++++++---
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e1a4fa9abb3a..a3ad5a937fd8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -139,6 +139,7 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_toptier_zone_stat[NR_VM_ZONE_STAT_ITEMS];
 extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
@@ -175,6 +176,8 @@ static inline void zone_page_state_add(long x, struct zone *zone,
 {
 	atomic_long_add(x, &zone->vm_stat[item]);
 	atomic_long_add(x, &vm_zone_stat[item]);
+	if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+		atomic_long_add(x, &vm_toptier_zone_stat[item]);
 }
 
 static inline void node_page_state_add(long x, struct pglist_data *pgdat,
@@ -212,6 +215,17 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return global_node_page_state_pages(item);
 }
 
+static inline unsigned long global_toptier_zone_page_state(enum zone_stat_item item)
+{
+	long x = atomic_long_read(&vm_toptier_zone_stat[item]);
+
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -325,6 +339,8 @@ static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_inc(&zone->vm_stat[item]);
 	atomic_long_inc(&vm_zone_stat[item]);
+	if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+		atomic_long_inc(&vm_toptier_zone_stat[item]);
 }
 
 static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -337,6 +353,8 @@ static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_dec(&zone->vm_stat[item]);
 	atomic_long_dec(&vm_zone_stat[item]);
+	if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+		atomic_long_dec(&vm_toptier_zone_stat[item]);
 }
 
 static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f299d2e89acb..b59efbcaef4e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -161,9 +161,11 @@ void vm_events_fold_cpu(int cpu)
  * vm_stat contains the global counters
  */
 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_toptier_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_toptier_zone_stat);
 EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
@@ -695,7 +697,7 @@ EXPORT_SYMBOL(dec_node_page_state);
  * Returns the number of counters updated.
  */
 #ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff, int *toptier_diff)
 {
 	int i;
 	int changes = 0;
@@ -717,6 +719,11 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
 			atomic_long_add(node_diff[i], &vm_node_stat[i]);
 			changes++;
 	}
+
+	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+		if (toptier_diff[i]) {
+			atomic_long_add(toptier_diff[i], &vm_toptier_zone_stat[i]);
+	}
 	return changes;
 }
 #else
@@ -762,6 +769,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_toptier_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 #ifdef CONFIG_NUMA
 	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
 #endif
@@ -779,6 +787,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
+				if (node_state(zone->zone_pgdat->node_id, N_TOPTIER)) {
+					global_toptier_diff[i] +=v;
+				}
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				__this_cpu_write(p->expire, 3);
@@ -846,7 +857,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
 #ifdef CONFIG_NUMA
 	changes += fold_diff(global_zone_diff, global_numa_diff,
-			     global_node_diff);
+			     global_node_diff, global_toptier_diff);
 #else
 	changes += fold_diff(global_zone_diff, global_node_diff);
 #endif
@@ -868,6 +879,7 @@ void cpu_vm_stats_fold(int cpu)
 	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
 #endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
+	int global_toptier_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -910,11 +922,13 @@ void cpu_vm_stats_fold(int cpu)
 				p->vm_node_stat_diff[i] = 0;
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
+				if (node_state(pgdat->node_id, N_TOPTIER))
+					global_toptier_diff[i] +=v;
 			}
 	}
 
 #ifdef CONFIG_NUMA
-	fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+	fold_diff(global_zone_diff, global_numa_diff, global_node_diff, global_toptier_diff);
 #else
 	fold_diff(global_zone_diff, global_node_diff);
 #endif
-- 
2.20.1



* [RFC PATCH v1 08/11] mm: Add toptier option for mem_cgroup_soft_limit_reclaim()
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (6 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 07/11] mm: Account the total top tier memory in use Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 09/11] mm: Use kswapd to demote pages when toptier memory is tight Tim Chen
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Add a toptier reclaim type to mem_cgroup_soft_limit_reclaim().
This option reclaims top tier memory from cgroups in the order of their
excess usage of top tier memory.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/memcontrol.h |  9 ++++---
 mm/memcontrol.c            | 48 ++++++++++++++++++++++++--------------
 mm/vmscan.c                |  4 ++--
 3 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0ed8ddfd5436..c494c4b11ba2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/nodemask.h>
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -1003,7 +1004,8 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
-						unsigned long *total_scanned);
+						unsigned long *total_scanned,
+						enum node_states type);
 
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count);
@@ -1421,8 +1423,9 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+						gfp_t gfp_mask,
+						unsigned long *total_scanned,
+						enum node_states type)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a7648b79635..9f75475ae833 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1875,7 +1875,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 				   pg_data_t *pgdat,
 				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
+				   unsigned long *total_scanned,
+				   enum node_states type)
 {
 	struct mem_cgroup *victim = NULL;
 	int total = 0;
@@ -1886,7 +1887,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.pgdat = pgdat,
 	};
 
-	excess = soft_limit_excess(root_memcg, N_MEMORY);
+	excess = soft_limit_excess(root_memcg, type);
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1915,7 +1916,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
 					pgdat, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!soft_limit_excess(root_memcg, N_MEMORY))
+		if (!soft_limit_excess(root_memcg, type))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -3524,7 +3525,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+					    unsigned long *total_scanned,
+					    enum node_states type)
 {
 	unsigned long nr_reclaimed = 0;
 	struct mem_cgroup_per_node *mz, *next_mz = NULL;
@@ -3534,12 +3536,24 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	unsigned long excess;
 	unsigned long nr_scanned;
 	int migration_nid;
+	enum node_states sibling_type;
 
 	if (order > 0)
 		return 0;
 
-	mctz = soft_limit_tree_node(pgdat->node_id, N_MEMORY);
-	mctz_sibling = soft_limit_tree_node(pgdat->node_id, N_TOPTIER);
+	if (type != N_MEMORY && type != N_TOPTIER)
+		return 0;
+
+	if (type == N_TOPTIER && !node_state(pgdat->node_id, N_TOPTIER))
+		return 0;
+
+	if (type == N_TOPTIER)
+		sibling_type = N_MEMORY;
+	else
+		sibling_type = N_TOPTIER;
+
+	mctz = soft_limit_tree_node(pgdat->node_id, type);
+	mctz_sibling = soft_limit_tree_node(pgdat->node_id, sibling_type);
 
 	/*
 	 * Do not even bother to check the largest node if the root
@@ -3558,11 +3572,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	if (migration_nid != -1) {
 		struct mem_cgroup_tree_per_node *mmctz;
 
-		mmctz = soft_limit_tree_node(migration_nid);
+		mmctz = soft_limit_tree_node(migration_nid, type);
 		if (mmctz && !RB_EMPTY_ROOT(&mmctz->rb_root)) {
 			pgdat = NODE_DATA(migration_nid);
 			return mem_cgroup_soft_limit_reclaim(pgdat, order,
-				gfp_mask, total_scanned);
+				gfp_mask, total_scanned, type);
 		}
 	}
 
@@ -3575,17 +3589,17 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		if (next_mz)
 			mz = next_mz;
 		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);
+			mz = mem_cgroup_largest_soft_limit_node(mctz, type);
 		if (!mz)
 			break;
 
 		nr_scanned = 0;
 		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
-						    gfp_mask, &nr_scanned);
+						    gfp_mask, &nr_scanned, type);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock_irq(&mctz->lock);
-		__mem_cgroup_remove_exceeded(mz, mctz, N_MEMORY);
+		__mem_cgroup_remove_exceeded(mz, mctz, type);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -3594,9 +3608,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		next_mz = NULL;
 		if (!reclaimed)
 			next_mz =
-			   __mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);
+			   __mem_cgroup_largest_soft_limit_node(mctz, type);
 
-		excess = soft_limit_excess(mz->memcg, N_MEMORY);
+		excess = soft_limit_excess(mz->memcg, type);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -3606,17 +3620,17 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		 * term TODO.
 		 */
 		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz, mctz, excess, N_MEMORY);
+		__mem_cgroup_insert_exceeded(mz, mctz, excess, type);
 		spin_unlock_irq(&mctz->lock);
 
 		/* update both affected N_MEMORY and N_TOPTIER trees */
 		if (mctz_sibling) {
 			spin_lock_irq(&mctz_sibling->lock);
 			__mem_cgroup_remove_exceeded(mz, mctz_sibling,
-						     N_TOPTIER);
-			excess = soft_limit_excess(mz->memcg, N_TOPTIER);
+						     sibling_type);
+			excess = soft_limit_excess(mz->memcg, sibling_type);
 			__mem_cgroup_insert_exceeded(mz, mctz, excess,
-						     N_TOPTIER);
+						     sibling_type);
 			spin_unlock_irq(&mctz_sibling->lock);
 		}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b200b7170a9..11bb0c6fa524 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3134,7 +3134,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			nr_soft_scanned = 0;
 			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
 						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
+						&nr_soft_scanned, N_MEMORY);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			/* need some check for avoid more shrink_zone() */
@@ -3849,7 +3849,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
 		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
-						sc.gfp_mask, &nr_soft_scanned);
+						sc.gfp_mask, &nr_soft_scanned, N_MEMORY);
 		sc.nr_reclaimed += nr_soft_reclaimed;
 
 		/*
-- 
2.20.1



* [RFC PATCH v1 09/11] mm: Use kswapd to demote pages when toptier memory is tight
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (7 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 08/11] mm: Add toptier option for mem_cgroup_soft_limit_reclaim() Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl Tim Chen
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Demote pages from memory cgroups that have excess
toptier memory usage when top tier memory is tight.

When free top tier memory falls below the fraction
"toptier_scale_factor/10000" of overall toptier memory in a node, kswapd
reclaims top tier memory from those mem cgroups that exceeded their
toptier memory soft limit by demoting the top tier pages to a
lower memory tier.
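
A worked example of the watermark arithmetic, as a standalone C
calculation (the zone size, high watermark and free-page count are
made-up numbers; the kernel uses mult_frac() for the scaling and also
adds watermark_boost, both omitted here for brevity):

#include <stdio.h>

int main(void)
{
	unsigned long managed = 4UL << 20;	/* 16 GB zone: 4M pages of 4KB */
	unsigned long high_wmark = 32768;	/* example high watermark */
	unsigned long scale = 2000;		/* default toptier_scale_factor */
	unsigned long free_pages = 700000;	/* current free pages, example */

	/* mark = managed * toptier_scale_factor / 10000, i.e. ~20% of the zone */
	unsigned long mark = managed * scale / 10000;

	/* clamp between twice the high watermark and the zone's managed pages */
	if (mark < 2 * high_wmark)
		mark = 2 * high_wmark;
	if (mark > managed)
		mark = managed;

	printf("toptier watermark: %lu pages\n", mark);
	printf("soft demotion now? %s\n", free_pages < mark ? "yes" : "no");
	return 0;
}

With these numbers the watermark comes out at about 838860 pages, so a
node with only 700000 free top tier pages would trigger soft demotion.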

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 12 +++++
 include/linux/mmzone.h                  |  2 +
 mm/page_alloc.c                         | 14 +++++
 mm/vmscan.c                             | 69 ++++++++++++++++++++++++-
 4 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 9de3847c3469..6b49e2e90953 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -74,6 +74,7 @@ Currently, these files are in /proc/sys/vm:
 - vfs_cache_pressure
 - watermark_boost_factor
 - watermark_scale_factor
+- toptier_scale_factor
 - zone_reclaim_mode
 
 
@@ -962,6 +963,17 @@ too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
 
 
+toptier_scale_factor
+====================
+
+This factor controls when kswapd wakes up to demote pages of those
+cgroups that have exceeded their memory soft limit.
+
+The unit is in fractions of 10,000. The default value of 2000 means
+that if less than 20% of the top tier memory in the node/system is
+free, we will start to demote pages of those memory cgroups that have
+exceeded their soft memory limit.
+
 zone_reclaim_mode
 =================
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bbe649c4fdee..4ee0073d255f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -332,12 +332,14 @@ enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
 	WMARK_HIGH,
+	WMARK_TOPTIER,
 	NR_WMARK
 };
 
 #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
 #define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define toptier_wmark_pages(z) (z->_watermark[WMARK_TOPTIER] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 471a2c342c4f..20f3caee60f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7964,6 +7964,20 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
+		tmp = mult_frac(zone_managed_pages(zone),
+				toptier_scale_factor, 10000);
+		/*
+		 * Clamp toptier watermark between twice high watermark
+		 * and max managed pages.
+		 */
+		if (tmp < 2 * zone->_watermark[WMARK_HIGH])
+			tmp = 2 * zone->_watermark[WMARK_HIGH];
+		if (tmp > zone_managed_pages(zone))
+			tmp = zone_managed_pages(zone);
+		zone->_watermark[WMARK_TOPTIER] = tmp;
+
+		zone->watermark_boost = 0;
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 11bb0c6fa524..270880c8baef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,6 +185,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
+int toptier_scale_factor = 2000;
 
 #ifdef CONFIG_MEMCG
 /*
@@ -3624,6 +3625,34 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	return false;
 }
 
+static bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx)
+{
+	int i;
+	unsigned long mark;
+	struct zone *zone;
+
+	zone = pgdat->node_zones + ZONE_NORMAL;
+
+	if (!node_state(pgdat->node_id, N_TOPTIER) ||
+	    next_demotion_node(pgdat->node_id) == -1 ||
+	    order > 0 || classzone_idx < ZONE_NORMAL) {
+		return true;
+	}
+
+	zone = pgdat->node_zones + ZONE_NORMAL;
+
+	if (!managed_zone(zone))
+		return true;
+
+	mark = min(toptier_wmark_pages(zone),
+		   zone_managed_pages(zone));
+
+	if (zone_page_state(zone, NR_FREE_PAGES) < mark)
+		return false;
+
+	return true;
+}
+
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
@@ -4049,6 +4078,39 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	finish_wait(&pgdat->kswapd_wait, &wait);
 }
 
+static bool toptier_soft_reclaim(pg_data_t *pgdat,
+			      unsigned int reclaim_order,
+			      unsigned int classzone_idx)
+{
+	unsigned long nr_soft_scanned, nr_soft_reclaimed;
+	int ret;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.order = reclaim_order,
+		.may_unmap = 1,
+	};
+
+	if (!node_state(pgdat->node_id, N_TOPTIER) || kthread_should_stop())
+		return false;
+
+	set_task_reclaim_state(current, &sc.reclaim_state);
+
+	if (!pgdat_toptier_balanced(pgdat, 0, classzone_idx)) {
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat,
+					0, GFP_KERNEL,
+					&nr_soft_scanned, N_TOPTIER);
+	}
+
+	set_task_reclaim_state(current, NULL);
+
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx) &&
+	   !kthread_should_stop())
+		return true;
+	else
+		return false;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process.
@@ -4108,6 +4170,10 @@ static int kswapd(void *p)
 		WRITE_ONCE(pgdat->kswapd_order, 0);
 		WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
 
+		if (toptier_soft_reclaim(pgdat, 0,
+					highest_zoneidx))
+			goto kswapd_try_sleep;
+
 		ret = try_to_freeze();
 		if (kthread_should_stop())
 			break;
@@ -4173,7 +4239,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
+	    (pgdat_toptier_balanced(pgdat, 0, highest_zoneidx) &&
+	     pgdat_balanced(pgdat, order, highest_zoneidx) &&
 	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (8 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 09/11] mm: Use kswapd to demote pages when toptier memory is tight Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-05 17:08 ` [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim Tim Chen
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Update the toptier_scale_factor via sysctl. This variable determines
when kswapd wakes up to reclaim toptier memory from those mem cgroups
exceeding their toptier memory soft limit.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/mm.h     |  4 ++++
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        | 10 ++++++++++
 mm/page_alloc.c        | 15 +++++++++++++++
 mm/vmstat.c            |  2 ++
 5 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a43429d51fc0..af39e221d0f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3179,6 +3179,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#ifdef CONFIG_MIGRATION
+extern int toptier_scale_factor;
+#endif
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4ee0073d255f..789319dffe1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1003,6 +1003,8 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
 		size_t *, loff_t *);
+int toptier_scale_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
 		size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 57f89fe1b0f2..e97c974f37b7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,7 @@ static int sixty = 60;
 #endif
 
 static int __maybe_unused neg_one = -1;
+static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
@@ -2956,6 +2957,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= &one_thousand,
 	},
+	{
+		.procname       = "toptier_scale_factor",
+		.data           = &toptier_scale_factor,
+		.maxlen         = sizeof(toptier_scale_factor),
+		.mode           = 0644,
+		.proc_handler   = toptier_scale_factor_sysctl_handler,
+		.extra1         = &one,
+		.extra2         = &ten_thousand,
+	},
 	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20f3caee60f3..91212a837d8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8094,6 +8094,21 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int toptier_scale_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+	       return rc;
+
+	if (write)
+		setup_per_zone_wmarks();
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static void setup_min_unmapped_ratio(void)
 {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b59efbcaef4e..c581753cf076 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1658,6 +1658,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
+		   "\n        toptier  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1665,6 +1666,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
+		   toptier_wmark_pages(zone),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone_managed_pages(zone));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (9 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl Tim Chen
@ 2021-04-05 17:08 ` Tim Chen
  2021-04-06  9:08 ` [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Michal Hocko
  2021-04-08 17:18 ` Shakeel Butt
  12 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Shakeel Butt, linux-mm,
	cgroups, linux-kernel

Detect during page allocation whether free toptier memory is low.
If so, wake up kswapd to reclaim memory from those mem cgroups
that have exceeded their limit.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/mmzone.h | 3 +++
 mm/page_alloc.c        | 2 ++
 mm/vmscan.c            | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 789319dffe1c..3603948e95cc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -886,6 +886,8 @@ bool zone_watermark_ok(struct zone *z, unsigned int order,
 		unsigned int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 		unsigned long mark, int highest_zoneidx);
+bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx);
+
 /*
  * Memory initialization context, use to differentiate memory added by
  * the platform statically or via memory hotplug interface.
@@ -1466,5 +1468,6 @@ void sparse_init(void);
 #endif
 
 #endif /* !__GENERATING_BOUNDS.H */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _LINUX_MMZONE_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91212a837d8e..ca8aa789a967 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3519,6 +3519,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+	} else if (!pgdat_toptier_balanced(zone->zone_pgdat, order, zone_idx(zone))) {
+		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}
 
 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 270880c8baef..8fe709e3f5e4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3625,7 +3625,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	return false;
 }
 
-static bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx)
+bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	unsigned long mark;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (10 preceding siblings ...)
  2021-04-05 17:08 ` [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim Tim Chen
@ 2021-04-06  9:08 ` Michal Hocko
  2021-04-07 22:33   ` Tim Chen
  2021-04-08 17:18 ` Shakeel Butt
  12 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-04-06  9:08 UTC (permalink / raw)
  To: Tim Chen
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel

On Mon 05-04-21 10:08:24, Tim Chen wrote:
[...]
> To make fine grain cgroup based management of the precious top tier
> DRAM memory possible, this patchset adds a few new features:
> 1. Provides memory monitors on the amount of top tier memory used per cgroup 
>    and by the system as a whole.
> 2. Applies soft limits on the top tier memory each cgroup uses 
> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>    tier memory usages.

Could you be more specific on how this interface is supposed to be used?

> This allows us to provision different amount of top tier memory to each
> cgroup according to the cgroup's latency need.
> 
> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> interface is the limit on the cgroup is a soft limit, so a cgroup can
> exceed the limit quite a bit before reclaim before page demotion reins
> it in. 

I have to say that I dislike abusing soft limit reclaim for this. In the
past we have learned that the existing implementation is unfixable and
changing the existing semantic impossible due to backward compatibility.
So I would really prefer the soft limit just find its rest rather than
see new potential usecases.

I haven't really looked into details of this patchset but from a cursory
look it seems like you are actually introducing NUMA aware limits into
memcg that would control consumption from some nodes differently than
other nodes. This would be a rather alien concept to the existing memcg
infrastructure IMO. It looks like it is fusing borders between the memcg
and cpuset controllers.

You also seem to be basing the interface on the very specific usecase.
Can we expect that there will be many different tiers requiring their
own balancing?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-06  9:08 ` [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Michal Hocko
@ 2021-04-07 22:33   ` Tim Chen
  2021-04-08 11:52     ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Tim Chen @ 2021-04-07 22:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel



On 4/6/21 2:08 AM, Michal Hocko wrote:
> On Mon 05-04-21 10:08:24, Tim Chen wrote:
> [...]
>> To make fine grain cgroup based management of the precious top tier
>> DRAM memory possible, this patchset adds a few new features:
>> 1. Provides memory monitors on the amount of top tier memory used per cgroup 
>>    and by the system as a whole.
>> 2. Applies soft limits on the top tier memory each cgroup uses 
>> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>>    tier memory usages.
> 

Michal,

Thanks for giving your feedback.  Much appreciated.

> Could you be more specific on how this interface is supposed to be used?

We created a README section on the cgroup control part of this patchset:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
to illustrate how this interface should be used.

The top tier memory used is reported in

memory.toptier_usage_in_bytes

The amount of top tier memory usable by each cgroup without
triggering page reclaim is controlled by the

memory.toptier_soft_limit_in_bytes 

knob for each cgroup.  

We anticipate that for cgroup v2, we will have

memory_toptier.max  (max allowed top tier memory)
memory_toptier.high (aggressive page demotion from top tier memory)
memory_toptier.min  (no page demotion from top tier memory below this threshold)

This is analogous to the existing memory.max, memory.high and
memory.min controls.
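
For illustration only (assuming the cgroup v1 memory controller is
mounted at /sys/fs/cgroup/memory and a cgroup named "lat_sensitive"
already exists -- both are just example names), usage of the v1 knobs
in this patchset might look like:

  # check current top tier usage of the cgroup
  cat /sys/fs/cgroup/memory/lat_sensitive/memory.toptier_usage_in_bytes

  # allow up to 16 GB of top tier memory before demotion kicks in
  echo $((16 * 1024 * 1024 * 1024)) > \
      /sys/fs/cgroup/memory/lat_sensitive/memory.toptier_soft_limit_in_bytes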

> 
>> This allows us to provision different amount of top tier memory to each
>> cgroup according to the cgroup's latency need.
>>
>> The patchset is based on cgroup v1 interface. One shortcoming of the v1
>> interface is the limit on the cgroup is a soft limit, so a cgroup can
>> exceed the limit quite a bit before reclaim before page demotion reins
>> it in. 
> 
> I have to say that I dislike abusing soft limit reclaim for this. In the
> past we have learned that the existing implementation is unfixable and
> changing the existing semantic impossible due to backward compatibility.
> So I would really prefer the soft limit just find its rest rather than
> see new potential usecases.

Do you think we can reuse some of the existing soft reclaim machinery
for the v2 interface?

More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?
We sort how much each mem cgroup exceeds memory_toptier.high and
go after the cgroups with the largest excess first for page demotion.
I would appreciate it if you can shed some insight on what could go
wrong with such an approach.

> 
> I haven't really looked into details of this patchset but from a cursory
> look it seems like you are actually introducing a NUMA aware limits into
> memcg that would control consumption from some nodes differently than
> other nodes. This would be rather alien concept to the existing memcg
> infrastructure IMO. It looks like it is fusing borders between memcg and
> cputset controllers.

Want to make sure I understand what you mean by NUMA aware limits.
Yes, in the patch set, it does treat the NUMA nodes differently.
We are putting constraints on the "top tier" RAM nodes vs the lower
tier PMEM nodes.  Is this what you mean?  I can see it does have
some flavor of a cpuset controller.  In this case, it doesn't explicitly
set a node as allowed or forbidden as in cpuset, but puts some
constraints on the usage of a group of nodes.

Do you have suggestions on an alternative controller for allocating
tiered memory resources?


> 
> You also seem to be basing the interface on the very specific usecase.
> Can we expect that there will be many different tiers requiring their
> own balancing?
> 

You mean more than two tiers of memory? We did think a bit about systems
that have stuff like high bandwidth memory that's faster than DRAM.
Our thought is that allocation and freeing of that memory will require
explicit assignment (not used by default), so it will be outside the
realm of auto balancing.  So at this point, we think two tiers will be good.

Tim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-07 22:33   ` Tim Chen
@ 2021-04-08 11:52     ` Michal Hocko
  2021-04-09 23:26       ` Tim Chen
  2021-04-12 14:03       ` Shakeel Butt
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2021-04-08 11:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel

On Wed 07-04-21 15:33:26, Tim Chen wrote:
> 
> 
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
> >> To make fine grain cgroup based management of the precious top tier
> >> DRAM memory possible, this patchset adds a few new features:
> >> 1. Provides memory monitors on the amount of top tier memory used per cgroup 
> >>    and by the system as a whole.
> >> 2. Applies soft limits on the top tier memory each cgroup uses 
> >> 3. Enables kswapd to demote top tier pages from cgroup with excess top
> >>    tier memory usages.
> > 
> 
> Michal,
> 
> Thanks for giving your feedback.  Much appreciated.
> 
> > Could you be more specific on how this interface is supposed to be used?
> 
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
> 
> memory.toptier_usage_in_bytes
> 
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
> 
> memory.toptier_soft_limit_in_bytes 

Are you trying to say that soft limit acts as some sort of guarantee?
Does that mean that if the memcg is under memory pressure top tier
memory is opted out from any reclaim if the usage is not in excess?

From your previous email it sounds more like the limit is evaluated on
the global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote a toptier numa node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. go after the biggest
excessed memcg (with priority 0 - aka potential full LRU scan) and then
continue with a normal reclaim. This can be really disruptive to the top
user.

So you can likely define a more sane semantic. E.g. push back memcgs
proportional to their excess but then we have two different soft limit
behaviors which is bad as well. I am not really sure there is a sensible
way out by (ab)using soft limit here.

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectively
discriminate those memcgs. There has been a similar problem with the
soft limit we have in general. Is this really what you are looking for?
What would be a typical usecase?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim before page demotion reins
> >> it in. 
> > 
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantic impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
> 
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
> 
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?

No, you should follow existing limits semantics. High limit acts as an
allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroup that have largest excess first for page demotion.
> Will appreciate if you can shed some insights on what could go wrong
> with such an approach. 

This cannot work as a throttling interface.
 
> > I haven't really looked into details of this patchset but from a cursory
> > look it seems like you are actually introducing a NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing borders between memcg and
> > cputset controllers.
> 
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes.  Is this what you mean?

What I am trying to say (and I have brought that up when demotion has been
discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
The specific technology shouldn't be imprinted into the interface.
Fundamentally you are trying to balance memory among NUMA nodes as we do
not have other abstraction to use. So rather than talking about top,
secondary, nth tier we have different NUMA nodes with different
characteristics and you want to express your "priorities" for them.

> I can see it does has
> some flavor of cpuset controller.  In this case, it doesn't explicitly
> set a node as allowed or forbidden as in cpuset, but put some constraints
> on the usage of a group of nodes.  
> 
> Do you have suggestions on alternative controller for allocating tiered memory resource?
 
I am not really sure what would be the best interface to be honest.
Maybe we want to carve this into memcg in some form of node priorities
for the reclaim. None of the existing limits is numa aware so far. Maybe
we want to say hammer this node more than others if there is memory
pressure. Not sure that would help your particular usecase though.

> > You also seem to be basing the interface on the very specific usecase.
> > Can we expect that there will be many different tiers requiring their
> > own balancing?
> > 
> 
> You mean more than two tiers of memory? We did think a bit about system
> that has stuff like high bandwidth memory that's faster than DRAM.
> Our thought is usage and freeing of those memory will require 
> explicit assignment (not used by default), so will be outside the
> realm of auto balancing.  So at this point, we think two tiers will be good.

Please keep in mind that once there is an interface it will be
impossible to change in the future. So do not bind yourself to the 2
tier setups that you have at hand right now.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
                   ` (11 preceding siblings ...)
  2021-04-06  9:08 ` [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Michal Hocko
@ 2021-04-08 17:18 ` Shakeel Butt
  2021-04-08 18:00   ` Yang Shi
  2021-04-15 22:25   ` Tim Chen
  12 siblings, 2 replies; 34+ messages in thread
From: Shakeel Butt @ 2021-04-08 17:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML

Hi Tim,

On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> others NUMA wise, but a byte of media has about the same cost whether it
> is close or far.  But, with new memory tiers such as Persistent Memory
> (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> PMEM.
>
> The fast/expensive memory lives in the top tier of the memory hierachy.
>
> Previously, the patchset
> [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> provides a mechanism to demote cold pages from DRAM node into PMEM.
>
> And the patchset
> [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> provides a mechanism to promote hot pages in PMEM to the DRAM node
> leveraging autonuma.
>
> The two patchsets together keep the hot pages in DRAM and colder pages
> in PMEM.

Thanks for working on this as this is becoming more and more important
particularly in the data centers where memory is a big portion of the
cost.

I see you have responded to Michal and I will add my more specific
response there. Here I wanted to give my high level concern regarding
using v1's soft limit like semantics for top tier memory.

This patch series aims to distribute/partition top tier memory between
jobs of different priorities. We want high priority jobs to have
preferential access to the top tier memory and we don't want low
priority jobs to hog the top tier memory.

Using v1's soft limit like behavior can potentially cause high
priority jobs to stall to make enough space on top tier memory on
their allocation path and I think this patchset is aiming to reduce
that impact by making kswapd do that work. However I think the more
concerning issue is the low priority job hogging the top tier memory.

The possible ways the low priority job can hog the top tier memory are
by allocating non-movable memory or by mlocking the memory. (Oh there
is also pinning the memory but I don't know if there is a user api to
pin memory?) For the mlocked memory, you need to either modify the
reclaim code or use a different mechanism for demoting cold memory.

Basically I am saying we should put the upfront control (limit) on the
usage of top tier memory by the jobs.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 17:18 ` Shakeel Butt
@ 2021-04-08 18:00   ` Yang Shi
  2021-04-08 20:29     ` Shakeel Butt
  2021-04-09  2:58     ` Huang, Ying
  2021-04-15 22:25   ` Tim Chen
  1 sibling, 2 replies; 34+ messages in thread
From: Yang Shi @ 2021-04-08 18:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML

On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Tim,
>
> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > others NUMA wise, but a byte of media has about the same cost whether it
> > is close or far.  But, with new memory tiers such as Persistent Memory
> > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> > PMEM.
> >
> > The fast/expensive memory lives in the top tier of the memory hierachy.
> >
> > Previously, the patchset
> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >
> > And the patchset
> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > leveraging autonuma.
> >
> > The two patchsets together keep the hot pages in DRAM and colder pages
> > in PMEM.
>
> Thanks for working on this as this is becoming more and more important
> particularly in the data centers where memory is a big portion of the
> cost.
>
> I see you have responded to Michal and I will add my more specific
> response there. Here I wanted to give my high level concern regarding
> using v1's soft limit like semantics for top tier memory.
>
> This patch series aims to distribute/partition top tier memory between
> jobs of different priorities. We want high priority jobs to have
> preferential access to the top tier memory and we don't want low
> priority jobs to hog the top tier memory.
>
> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is the low priority job hogging the top tier memory.
>
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.

Do you mean long term pin? RDMA should be able to simply pin the
memory for weeks. A lot of transient pins come from Direct I/O. Those
should be of less concern.

The low priority jobs should be able to be restricted by cpuset, for
example, just keep them on second tier memory nodes. Then all the
above problems are gone.
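
As an illustration only (assuming cpuset v1 is mounted at
/sys/fs/cgroup/cpuset and nodes 2-3 happen to be the PMEM nodes; the
node numbers and cgroup name are made up), a low priority job could be
confined to the second tier with something like:

  echo 2-3 > /sys/fs/cgroup/cpuset/low_prio/cpuset.mems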

>
> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.

This sounds similar to what I talked about in LSFMM 2019
(https://lwn.net/Articles/787418/). We used to have some potential
usecase which divides DRAM:PMEM ratio for different jobs or memcgs
when I was with Alibaba.

In the first place I thought about per NUMA node limit, but it was
very hard to configure it correctly for users unless you know exactly
about your memory usage and hot/cold memory distribution.

I'm wondering, just off the top of my head, if we could extend the
semantic of low and min limit. For example, just redefine low and min
to "the limit on top tier memory". Then we could have low priority
jobs have 0 low/min limit.
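
A purely hypothetical sketch of that redefinition (none of this is the
current semantics, and the cgroup name is made up): a low priority job
would get no protected top tier memory, e.g.

  echo 0 > /sys/fs/cgroup/low_prio/memory.min
  echo 0 > /sys/fs/cgroup/low_prio/memory.low

while a latency sensitive job would get a non-zero protection.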

>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 18:00   ` Yang Shi
@ 2021-04-08 20:29     ` Shakeel Butt
  2021-04-08 20:50       ` Yang Shi
                         ` (2 more replies)
  2021-04-09  2:58     ` Huang, Ying
  1 sibling, 3 replies; 34+ messages in thread
From: Shakeel Butt @ 2021-04-08 20:29 UTC (permalink / raw)
  To: Yang Shi
  Cc: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML

On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Hi Tim,
> >
> > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > >
> > > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > > others NUMA wise, but a byte of media has about the same cost whether it
> > > is close or far.  But, with new memory tiers such as Persistent Memory
> > > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> > > PMEM.
> > >
> > > The fast/expensive memory lives in the top tier of the memory hierachy.
> > >
> > > Previously, the patchset
> > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > >
> > > And the patchset
> > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > leveraging autonuma.
> > >
> > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > in PMEM.
> >
> > Thanks for working on this as this is becoming more and more important
> > particularly in the data centers where memory is a big portion of the
> > cost.
> >
> > I see you have responded to Michal and I will add my more specific
> > response there. Here I wanted to give my high level concern regarding
> > using v1's soft limit like semantics for top tier memory.
> >
> > This patch series aims to distribute/partition top tier memory between
> > jobs of different priorities. We want high priority jobs to have
> > preferential access to the top tier memory and we don't want low
> > priority jobs to hog the top tier memory.
> >
> > Using v1's soft limit like behavior can potentially cause high
> > priority jobs to stall to make enough space on top tier memory on
> > their allocation path and I think this patchset is aiming to reduce
> > that impact by making kswapd do that work. However I think the more
> > concerning issue is the low priority job hogging the top tier memory.
> >
> > The possible ways the low priority job can hog the top tier memory are
> > by allocating non-movable memory or by mlocking the memory. (Oh there
> > is also pinning the memory but I don't know if there is a user api to
> > pin memory?) For the mlocked memory, you need to either modify the
> > reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.
>

Yes that's an extreme way to overcome the issue but we can do something
less extreme by just (hard) limiting the top tier usage of low priority
jobs.

> >
> > Basically I am saying we should put the upfront control (limit) on the
> > usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.
>

The low and min limits have semantics similar to the v1's soft limit
for this situation i.e. letting the low priority job occupy top tier
memory and depending on reclaim to take back the excess top tier
memory use of such jobs.

I have some thoughts on NUMA node limits which I will share in the other thread.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 20:29     ` Shakeel Butt
@ 2021-04-08 20:50       ` Yang Shi
  2021-04-12 14:03         ` Shakeel Butt
  2021-04-09  7:24       ` Michal Hocko
  2021-04-14 23:22       ` Tim Chen
  2 siblings, 1 reply; 34+ messages in thread
From: Yang Shi @ 2021-04-08 20:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML

On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > Hi Tim,
> > >
> > > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > > >
> > > > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > > > others NUMA wise, but a byte of media has about the same cost whether it
> > > > is close or far.  But, with new memory tiers such as Persistent Memory
> > > > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> > > > PMEM.
> > > >
> > > > The fast/expensive memory lives in the top tier of the memory hierachy.
> > > >
> > > > Previously, the patchset
> > > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> > > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > > >
> > > > And the patchset
> > > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > > > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> > > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > > leveraging autonuma.
> > > >
> > > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > > in PMEM.
> > >
> > > Thanks for working on this as this is becoming more and more important
> > > particularly in the data centers where memory is a big portion of the
> > > cost.
> > >
> > > I see you have responded to Michal and I will add my more specific
> > > response there. Here I wanted to give my high level concern regarding
> > > using v1's soft limit like semantics for top tier memory.
> > >
> > > This patch series aims to distribute/partition top tier memory between
> > > jobs of different priorities. We want high priority jobs to have
> > > preferential access to the top tier memory and we don't want low
> > > priority jobs to hog the top tier memory.
> > >
> > > Using v1's soft limit like behavior can potentially cause high
> > > priority jobs to stall to make enough space on top tier memory on
> > > their allocation path and I think this patchset is aiming to reduce
> > > that impact by making kswapd do that work. However I think the more
> > > concerning issue is the low priority job hogging the top tier memory.
> > >
> > > The possible ways the low priority job can hog the top tier memory are
> > > by allocating non-movable memory or by mlocking the memory. (Oh there
> > > is also pinning the memory but I don't know if there is a user api to
> > > pin memory?) For the mlocked memory, you need to either modify the
> > > reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
> >
>
> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > >
> > > Basically I am saying we should put the upfront control (limit) on the
> > > usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
> >
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.

I don't get why low priority jobs can *not* use top tier memory? I can
see it may incur latency overhead for high priority jobs. If that
is not allowed, they could be restricted by cpuset without introducing
any new interfaces.

I suppose the memory utilization could be maximized by allowing all
jobs to allocate memory from all applicable nodes, then letting the
reclaimer (or something new if needed) migrate the memory to the proper
nodes over time. We could achieve some kind of balance between memory
utilization and resource isolation.

>
> I have some thoughts on NUMA node limits which I will share in the other thread.

Look forward to reading it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 18:00   ` Yang Shi
  2021-04-08 20:29     ` Shakeel Butt
@ 2021-04-09  2:58     ` Huang, Ying
  2021-04-09 20:50       ` Yang Shi
  1 sibling, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2021-04-09  2:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Shakeel Butt, Tim Chen, Michal Hocko, Johannes Weiner,
	Andrew Morton, Dave Hansen, Dan Williams, David Rientjes,
	Linux MM, Cgroups, LKML, Feng Tang

Yang Shi <shy828301@gmail.com> writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
>>
>> Hi Tim,
>>
>> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> >
>> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
>> > others NUMA wise, but a byte of media has about the same cost whether it
>> > is close or far.  But, with new memory tiers such as Persistent Memory
>> > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
>> > PMEM.
>> >
>> > The fast/expensive memory lives in the top tier of the memory hierachy.
>> >
>> > Previously, the patchset
>> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>> > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
>> > provides a mechanism to demote cold pages from DRAM node into PMEM.
>> >
>> > And the patchset
>> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
>> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
>> > provides a mechanism to promote hot pages in PMEM to the DRAM node
>> > leveraging autonuma.
>> >
>> > The two patchsets together keep the hot pages in DRAM and colder pages
>> > in PMEM.
>>
>> Thanks for working on this as this is becoming more and more important
>> particularly in the data centers where memory is a big portion of the
>> cost.
>>
>> I see you have responded to Michal and I will add my more specific
>> response there. Here I wanted to give my high level concern regarding
>> using v1's soft limit like semantics for top tier memory.
>>
>> This patch series aims to distribute/partition top tier memory between
>> jobs of different priorities. We want high priority jobs to have
>> preferential access to the top tier memory and we don't want low
>> priority jobs to hog the top tier memory.
>>
>> Using v1's soft limit like behavior can potentially cause high
>> priority jobs to stall to make enough space on top tier memory on
>> their allocation path and I think this patchset is aiming to reduce
>> that impact by making kswapd do that work. However I think the more
>> concerning issue is the low priority job hogging the top tier memory.
>>
>> The possible ways the low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh there
>> is also pinning the memory but I don't know if there is a user api to
>> pin memory?) For the mlocked memory, you need to either modify the
>> reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM.  But the
memory access pattern changes over time, so we need to migrate pages
between DRAM and PMEM to adapt to the changes.

To avoid the hot pages being pinned in PMEM forever, one way is to online
the PMEM as movable zones.  If so, and if the low priority jobs are
restricted by cpuset to allocate from PMEM only, we may fail to run
quite a few workloads, as discussed in the following thread,

https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/
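
For example, with the PMEM exposed as hotpluggable memory blocks, a
block can be onlined into ZONE_MOVABLE via the memory hotplug sysfs
interface (the block number below is just an example):

  echo online_movable > /sys/devices/system/memory/memory100/state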

>>
>> Basically I am saying we should put the upfront control (limit) on the
>> usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

Per my understanding, memory.low/min are for the memory protection
instead of the memory limiting.  memory.high is for the memory limiting.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 20:29     ` Shakeel Butt
  2021-04-08 20:50       ` Yang Shi
@ 2021-04-09  7:24       ` Michal Hocko
  2021-04-15 22:31         ` Tim Chen
  2021-04-14 23:22       ` Tim Chen
  2 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2021-04-09  7:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yang Shi, Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML

On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
[...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.

Yes, if the aim is to isolate some users from certain numa node then
cpuset is a good fit but as Shakeel says this is very likely not what
this work is aiming for.

> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.

Per numa node high/hard limit would help with a more fine grained control.
The configuration would be tricky though. All low priority memcgs would
have to be carefully configured to leave enough for your important
processes. That also includes memory which is not accounted to any
memcg.
The behavior of those limits would be quite tricky for OOM situations
as well due to a lack of NUMA aware oom killer.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-09  2:58     ` Huang, Ying
@ 2021-04-09 20:50       ` Yang Shi
  0 siblings, 0 replies; 34+ messages in thread
From: Yang Shi @ 2021-04-09 20:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Shakeel Butt, Tim Chen, Michal Hocko, Johannes Weiner,
	Andrew Morton, Dave Hansen, Dan Williams, David Rientjes,
	Linux MM, Cgroups, LKML, Feng Tang

On Thu, Apr 8, 2021 at 7:58 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> >>
> >> Hi Tim,
> >>
> >> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >
> >> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> >> > others NUMA wise, but a byte of media has about the same cost whether it
> >> > is close or far.  But, with new memory tiers such as Persistent Memory
> >> > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> >> > PMEM.
> >> >
> >> > The fast/expensive memory lives in the top tier of the memory hierachy.
> >> >
> >> > Previously, the patchset
> >> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> >> > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> >> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >> >
> >> > And the patchset
> >> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> >> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> >> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> >> > leveraging autonuma.
> >> >
> >> > The two patchsets together keep the hot pages in DRAM and colder pages
> >> > in PMEM.
> >>
> >> Thanks for working on this as this is becoming more and more important
> >> particularly in the data centers where memory is a big portion of the
> >> cost.
> >>
> >> I see you have responded to Michal and I will add my more specific
> >> response there. Here I wanted to give my high level concern regarding
> >> using v1's soft limit like semantics for top tier memory.
> >>
> >> This patch series aims to distribute/partition top tier memory between
> >> jobs of different priorities. We want high priority jobs to have
> >> preferential access to the top tier memory and we don't want low
> >> priority jobs to hog the top tier memory.
> >>
> >> Using v1's soft limit like behavior can potentially cause high
> >> priority jobs to stall to make enough space on top tier memory on
> >> their allocation path and I think this patchset is aiming to reduce
> >> that impact by making kswapd do that work. However I think the more
> >> concerning issue is the low priority job hogging the top tier memory.
> >>
> >> The possible ways the low priority job can hog the top tier memory are
> >> by allocating non-movable memory or by mlocking the memory. (Oh there
> >> is also pinning the memory but I don't know if there is a user api to
> >> pin memory?) For the mlocked memory, you need to either modify the
> >> reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
>
> To optimize the page placement of a process between DRAM and PMEM, we
> want to place the hot pages in DRAM and the cold pages in PMEM.  But the
> memory accessing pattern changes overtime, so we need to migrate pages
> between DRAM and PMEM to adapt to the changing.
>
> To avoid the hot pages be pinned in PMEM always, one way is to online
> the PMEM as movable zones.  If so, and if the low priority jobs are
> restricted by cpuset to allocate from PMEM only, we may fail to run
> quite some workloads as being discussed in the following threads,
>
> https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/

Thanks for sharing the thread. It seems the configuration of movable
zone + node bind is not supported very well, or needs to evolve to
support new use cases.

>
> >>
> >> Basically I am saying we should put the upfront control (limit) on the
> >> usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
>
> Per my understanding, memory.low/min are for the memory protection
> instead of the memory limiting.  memory.high is for the memory limiting.

Yes, it is not a limit. I just misused the term; I actually do mean
protection but typed "limit". Sorry for the confusion.

>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 11:52     ` Michal Hocko
@ 2021-04-09 23:26       ` Tim Chen
  2021-04-12 19:20         ` Shakeel Butt
                           ` (2 more replies)
  2021-04-12 14:03       ` Shakeel Butt
  1 sibling, 3 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-09 23:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel


On 4/8/21 4:52 AM, Michal Hocko wrote:

>> The top tier memory used is reported in
>>
>> memory.toptier_usage_in_bytes
>>
>> The amount of top tier memory usable by each cgroup without
>> triggering page reclaim is controlled by the
>>
>> memory.toptier_soft_limit_in_bytes 
> 

Michal,

Thanks for your comments.  I would like to take a step back and
look at the eventual goal we envision: a mechanism to partition the
tiered memory between the cgroups.

A typical use case may be a system with two sets of tasks.
One set of tasks is very latency sensitive and we desire instantaneous
response from them. Another set of tasks will be running batch jobs
where latency and performance are not critical.   In this case,
we want to carve out enough top tier memory such that the working set
of the latency sensitive tasks can fit entirely in the top tier memory.
The rest of the top tier memory can be assigned to the background tasks.

To achieve such cgroup based tiered memory management, we probably want
something like the following.

For generalization, let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
where tier t_0 sits at the top and demotes to the tier below.
For this top tier memory t_0, we envision the following knobs and counters
in the cgroup memory controller:

memory_t0.current 	Current usage of tier 0 memory by the cgroup.

memory_t0.min		If tier 0 memory used by the cgroup falls below this low
			boundary, the memory will not be subjected to demotion
			to lower tiers to free up memory at tier 0.  

memory_t0.low		Above this boundary, the tier 0 memory will be subjected
			to demotion.  The demotion pressure will be proportional
			to the overage.

memory_t0.high		If tier 0 memory used by the cgroup exceeds this high
			boundary, allocation of tier 0 memory by the cgroup will
			be throttled. The tier 0 memory used by this cgroup
			will also be subjected to heavy demotion.

memory_t0.max		This will be a hard usage limit of tier 0 memory on the cgroup.

If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
This follows closely the design of the general memory controller interface.
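
As a purely hypothetical illustration of how an admin might use such
knobs (the memory_t0.* file names below just mirror the proposal above
and are not an existing kernel interface; the cgroup paths and values
are made up):

  # Latency sensitive cgroup: protect its tier 0 working set from demotion.
  echo 16G > /sys/fs/cgroup/latency/memory_t0.min
  echo 24G > /sys/fs/cgroup/latency/memory_t0.low

  # Batch cgroup: keep its tier 0 footprint small and hard capped.
  echo 2G > /sys/fs/cgroup/batch/memory_t0.high
  echo 4G > /sys/fs/cgroup/batch/memory_t0.max

  # Observe current tier 0 usage of the batch cgroup.
  cat /sys/fs/cgroup/batch/memory_t0.current

The intent is that the latency sensitive cgroup keeps its working set
in tier 0, while the batch cgroup is demoted first when tier 0 memory
becomes tight.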

Does such an interface look sane and acceptable to everyone?

The patch set I posted is meant to be a straw man cgroup v1 implementation
and I readily admit that it falls short of the eventual functionality
we want to achieve.  It is meant to solicit feedback from everyone on how the
tiered memory management should work.

> Are you trying to say that soft limit acts as some sort of guarantee?

No, the soft limit does not offer a guarantee.  It only serves to keep the usage
of the top tier memory in the vicinity of the soft limit.

> Does that mean that if the memcg is under memory pressure top tiear
> memory is opted out from any reclaim if the usage is not in excess?

In the prototype implementation, regular memory reclaim is still in effect
if we are under heavy memory pressure. 

> 
> From you previous email it sounds more like the limit is evaluated on
> the global memory pressure to balance specific memcgs which are in
> excess when trying to reclaim/demote a toptier numa node.

On a top tier node, if the free memory on the node falls below a percentage, then
we will start to reclaim/demote from the node.
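
As a rough illustrative sketch of tuning that percentage (the
toptier_scale_factor name is from patch 10 of this series; the vm.*
sysctl path and the value below are only assumptions for illustration):

  # Assumed knob, illustration only: raise the top tier watermark scale
  # factor so kswapd starts demoting from a top tier node earlier.
  sysctl -w vm.toptier_scale_factor=2000

  # Watch free memory on a top tier node (node 0 here) to see when
  # reclaim/demotion is expected to kick in.
  numactl --hardware | grep 'node 0 free'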

> 
> Soft limit reclaim has several problems. Those are historical and
> therefore the behavior cannot be changed. E.g. go after the biggest
> excessed memcg (with priority 0 - aka potential full LRU scan) and then
> continue with a normal reclaim. This can be really disruptive to the top
> user.

Thanks for pointing out these problems with soft limit explicitly.

> 
> So you can likely define a more sane semantic. E.g. push back memcgs
> proporitional to their excess but then we have two different soft limits
> behavior which is bad as well. I am not really sure there is a sensible
> way out by (ab)using soft limit here.
> 
> Also I am not really sure how this is going to be used in practice.
> There is no soft limit by default. So opting in would effectivelly
> discriminate those memcgs. There has been a similar problem with the
> soft limit we have in general. Is this really what you are looing for?
> What would be a typical usecase?

>> Want to make sure I understand what you mean by NUMA aware limits.
>> Yes, in the patch set, it does treat the NUMA nodes differently.
>> We are putting constraint on the "top tier" RAM nodes vs the lower
>> tier PMEM nodes.  Is this what you mean?
> 
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.

With node priorities, how would the system reserve enough
high performance memory for those performance critical task cgroups?

By priority, do you mean the order of allocation of nodes for a cgroup?
Or do you mean that all similarly performing memory nodes will be grouped
into the same priority?

Tim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 20:50       ` Yang Shi
@ 2021-04-12 14:03         ` Shakeel Butt
  0 siblings, 0 replies; 34+ messages in thread
From: Shakeel Butt @ 2021-04-12 14:03 UTC (permalink / raw)
  To: Yang Shi
  Cc: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML

On Thu, Apr 8, 2021 at 1:50 PM Yang Shi <shy828301@gmail.com> wrote:
>
[...]

> >
> > The low and min limits have semantics similar to the v1's soft limit
> > for this situation i.e. letting the low priority job occupy top tier
> > memory and depending on reclaim to take back the excess top tier
> > memory use of such jobs.
>
> I don't get why low priority jobs can *not* use top tier memory?

I am saying low priority jobs can use top tier memory. The only
difference is whether to limit them upfront (using limits) or to reclaim
from them later (using min/low/soft-limit).

> I can
> think of it may incur latency overhead for high priority jobs. If it
> is not allowed, it could be restricted by cpuset without introducing
> in any new interfaces.
>
> I'm supposed the memory utilization could be maximized by allowing all
> jobs allocate memory from all applicable nodes, then let reclaimer (or
> something new if needed)

Most probably something new as we do want to consider unevictable
memory as well.

> do the job to migrate the memory to proper
> nodes by time. We could achieve some kind of balance between memory
> utilization and resource isolation.
>

The tradeoff between utilization and isolation should be decided by the user/admin.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 11:52     ` Michal Hocko
  2021-04-09 23:26       ` Tim Chen
@ 2021-04-12 14:03       ` Shakeel Butt
  1 sibling, 0 replies; 34+ messages in thread
From: Shakeel Butt @ 2021-04-12 14:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML

On Thu, Apr 8, 2021 at 4:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.
>

I am also inclined towards a NUMA based approach. It makes the solution
more general, and even existing systems with multiple NUMA nodes and
DRAM can take advantage of this approach (if it makes sense).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-09 23:26       ` Tim Chen
@ 2021-04-12 19:20         ` Shakeel Butt
  2021-04-14  8:59           ` Jonathan Cameron
  2021-04-15  0:42           ` Tim Chen
  2021-04-13  2:15         ` Huang, Ying
  2021-04-13  8:33         ` Michal Hocko
  2 siblings, 2 replies; 34+ messages in thread
From: Shakeel Butt @ 2021-04-12 19:20 UTC (permalink / raw)
  To: Tim Chen
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML, Greg Thelen, Wei Xu

On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments.  I will like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.
>
> A typical use case may be a system with two set of tasks.
> One set of task is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> were latency and performance is not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
> We envision for this top tier memory t0 the following knobs and counters
> in the cgroup memory controller
>
> memory_t0.current       Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min           If tier 0 memory used by the cgroup falls below this low
>                         boundary, the memory will not be subjected to demotion
>                         to lower tiers to free up memory at tier 0.
>
> memory_t0.low           Above this boundary, the tier 0 memory will be subjected
>                         to demotion.  The demotion pressure will be proportional
>                         to the overage.
>
> memory_t0.high          If tier 0 memory used by the cgroup exceeds this high
>                         boundary, allocation of tier 0 memory by the cgroup will
>                         be throttled. The tier 0 memory used by this cgroup
>                         will also be subjected to heavy demotion.
>
> memory_t0.max           This will be a hard usage limit of tier 0 memory on the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.
>
> Will such an interface looks sane and acceptable with everyone?
>

I have a couple of questions. Let's suppose we have a two socket
system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)
2) If yes to (1), is that assumption future proof? Will future
systems with DRAM over CXL support have the same characteristics?
3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
might be third tier and similarly for jobs running on node_1, node_2
might be third tier.

The reason I am asking these questions is that statically
partitioning memory nodes into tiers will inherently add platform
specific assumptions to the user API.

Assumptions like:
1) Access within tier is always cheaper than across tier.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards NUMA centric control is
that we don't have to make these assumptions, though the usability
will be more difficult. Greg (CCed) has some ideas on making it better
and we will share our proposal after polishing it a bit more.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-09 23:26       ` Tim Chen
  2021-04-12 19:20         ` Shakeel Butt
@ 2021-04-13  2:15         ` Huang, Ying
  2021-04-13  8:33         ` Michal Hocko
  2 siblings, 0 replies; 34+ messages in thread
From: Huang, Ying @ 2021-04-13  2:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel

Tim Chen <tim.c.chen@linux.intel.com> writes:

> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
>>> The top tier memory used is reported in
>>>
>>> memory.toptier_usage_in_bytes
>>>
>>> The amount of top tier memory usable by each cgroup without
>>> triggering page reclaim is controlled by the
>>>
>>> memory.toptier_soft_limit_in_bytes 
>> 
>
> Michal,
>
> Thanks for your comments.  I will like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the 
> tiered memory between the cgroups. 
>
> A typical use case may be a system with two set of tasks.
> One set of task is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> were latency and performance is not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.  
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier. 
> We envision for this top tier memory t0 the following knobs and counters 
> in the cgroup memory controller
>
> memory_t0.current 	Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min		If tier 0 memory used by the cgroup falls below this low
> 			boundary, the memory will not be subjected to demotion
> 			to lower tiers to free up memory at tier 0.  
>
> memory_t0.low		Above this boundary, the tier 0 memory will be subjected
> 			to demotion.  The demotion pressure will be proportional
> 			to the overage.
>
> memory_t0.high		If tier 0 memory used by the cgroup exceeds this high
> 			boundary, allocation of tier 0 memory by the cgroup will
> 			be throttled. The tier 0 memory used by this cgroup
> 			will also be subjected to heavy demotion.

I think we don't really need throttling here, because we can fall back
to allocating memory from t1.  That will not cause something like I/O
device bandwidth saturation.

Best Regards,
Huang, Ying

> memory_t0.max		This will be a hard usage limit of tier 0 memory on the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.  
>
> Will such an interface looks sane and acceptable with everyone?
>
> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admits that it falls short of the eventual functionality 
> we want to achieve.  It is meant to solicit feedback from everyone on how the tiered
> memory management should work.
>
>> Are you trying to say that soft limit acts as some sort of guarantee?
>
> No, the soft limit does not offers guarantee.  It will only serves to keep the usage
> of the top tier memory in the vicinity of the soft limits.
>
>> Does that mean that if the memcg is under memory pressure top tiear
>> memory is opted out from any reclaim if the usage is not in excess?
>
> In the prototype implementation, regular memory reclaim is still in effect
> if we are under heavy memory pressure. 
>
>> 
>> From you previous email it sounds more like the limit is evaluated on
>> the global memory pressure to balance specific memcgs which are in
>> excess when trying to reclaim/demote a toptier numa node.
>
> On a top tier node, if the free memory on the node falls below a percentage, then
> we will start to reclaim/demote from the node.
>
>> 
>> Soft limit reclaim has several problems. Those are historical and
>> therefore the behavior cannot be changed. E.g. go after the biggest
>> excessed memcg (with priority 0 - aka potential full LRU scan) and then
>> continue with a normal reclaim. This can be really disruptive to the top
>> user.
>
> Thanks for pointing out these problems with soft limit explicitly.
>
>> 
>> So you can likely define a more sane semantic. E.g. push back memcgs
>> proporitional to their excess but then we have two different soft limits
>> behavior which is bad as well. I am not really sure there is a sensible
>> way out by (ab)using soft limit here.
>> 
>> Also I am not really sure how this is going to be used in practice.
>> There is no soft limit by default. So opting in would effectivelly
>> discriminate those memcgs. There has been a similar problem with the
>> soft limit we have in general. Is this really what you are looing for?
>> What would be a typical usecase?
>
>>> Want to make sure I understand what you mean by NUMA aware limits.
>>> Yes, in the patch set, it does treat the NUMA nodes differently.
>>> We are putting constraint on the "top tier" RAM nodes vs the lower
>>> tier PMEM nodes.  Is this what you mean?
>> 
>> What I am trying to say (and I have brought that up when demotion has been
>> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
>> The specific technology shouldn't be imprinted into the interface.
>> Fundamentally you are trying to balance memory among NUMA nodes as we do
>> not have other abstraction to use. So rather than talking about top,
>> secondary, nth tier we have different NUMA nodes with different
>> characteristics and you want to express your "priorities" for them.
>
> With node priorities, how would the system reserve enough
> high performance memory for those performance critical task cgroup? 
>
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or you mean that all the similar performing memory node will be grouped in
> the same priority?
>
> Tim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-09 23:26       ` Tim Chen
  2021-04-12 19:20         ` Shakeel Butt
  2021-04-13  2:15         ` Huang, Ying
@ 2021-04-13  8:33         ` Michal Hocko
  2 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-04-13  8:33 UTC (permalink / raw)
  To: Tim Chen
  Cc: Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
	Dan Williams, David Rientjes, Shakeel Butt, linux-mm, cgroups,
	linux-kernel

On Fri 09-04-21 16:26:53, Tim Chen wrote:
> 
> On 4/8/21 4:52 AM, Michal Hocko wrote:
> 
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes 
> > 
> 
> Michal,
> 
> Thanks for your comments.  I will like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the 
> tiered memory between the cgroups. 

OK, this is a good mission statement to start with. I would expect a follow
up to say what kind of granularity of control you want to achieve here.
Presumably you want more than all or nothing, because that is what
cpusets can be used for.

> A typical use case may be a system with two set of tasks.
> One set of task is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> were latency and performance is not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.  

While from a very high level this makes sense, I would be interested in
more details. Your highly latency sensitive applications very likely
want to be bound to a high performance node, right? Can they tolerate
memory reclaim? Can they consume more memory than the node size? What do
you expect to happen then?
 
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
> 
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier. 

How is each tier defined? Is this an admin defined set of NUMA nodes or
is this platform specific?

[...]

> Will such an interface looks sane and acceptable with everyone?

Let's talk more about use cases first before we even start talking about
the interface or which controller is the best fit for implementing it.
 
> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admits that it falls short of the eventual functionality 
> we want to achieve.  It is meant to solicit feedback from everyone on how the tiered
> memory management should work.

OK, fair enough. Let me then just state that I strongly believe that
a soft limit based approach is a dead end and it would be better to focus
on the actual use cases and try to understand what you want to achieve
first.

[...]

> > What I am trying to say (and I have brought that up when demotion has been
> > discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> > The specific technology shouldn't be imprinted into the interface.
> > Fundamentally you are trying to balance memory among NUMA nodes as we do
> > not have other abstraction to use. So rather than talking about top,
> > secondary, nth tier we have different NUMA nodes with different
> > characteristics and you want to express your "priorities" for them.
> 
> With node priorities, how would the system reserve enough
> high performance memory for those performance critical task cgroup? 
> 
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or you mean that all the similar performing memory node will be grouped in
> the same priority?

I have to say I do not yet have a clear idea of what those priorities
would look like. I just wanted to outline that the use cases you are
interested in likely want to implement some form of (application
transparent) control of memory distribution over several nodes. There
is a long way to go before landing on something more specific, I am afraid.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-12 19:20         ` Shakeel Butt
@ 2021-04-14  8:59           ` Jonathan Cameron
  2021-04-15  0:42           ` Tim Chen
  1 sibling, 0 replies; 34+ messages in thread
From: Jonathan Cameron @ 2021-04-14  8:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML, Greg Thelen, Wei Xu

On Mon, 12 Apr 2021 12:20:22 -0700
Shakeel Butt <shakeelb@google.com> wrote:

> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> >
> > On 4/8/21 4:52 AM, Michal Hocko wrote:
> >  
> > >> The top tier memory used is reported in
> > >>
> > >> memory.toptier_usage_in_bytes
> > >>
> > >> The amount of top tier memory usable by each cgroup without
> > >> triggering page reclaim is controlled by the
> > >>
> > >> memory.toptier_soft_limit_in_bytes  
> > >  
> >
> > Michal,
> >
> > Thanks for your comments.  I will like to take a step back and
> > look at the eventual goal we envision: a mechanism to partition the
> > tiered memory between the cgroups.
> >
> > A typical use case may be a system with two set of tasks.
> > One set of task is very latency sensitive and we desire instantaneous
> > response from them. Another set of tasks will be running batch jobs
> > were latency and performance is not critical.   In this case,
> > we want to carve out enough top tier memory such that the working set
> > of the latency sensitive tasks can fit entirely in the top tier memory.
> > The rest of the top tier memory can be assigned to the background tasks.
> >
> > To achieve such cgroup based tiered memory management, we probably want
> > something like the following.
> >
> > For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> > where tier t_0 sits at the top and demotes to the lower tier.
> > We envision for this top tier memory t0 the following knobs and counters
> > in the cgroup memory controller
> >
> > memory_t0.current       Current usage of tier 0 memory by the cgroup.
> >
> > memory_t0.min           If tier 0 memory used by the cgroup falls below this low
> >                         boundary, the memory will not be subjected to demotion
> >                         to lower tiers to free up memory at tier 0.
> >
> > memory_t0.low           Above this boundary, the tier 0 memory will be subjected
> >                         to demotion.  The demotion pressure will be proportional
> >                         to the overage.
> >
> > memory_t0.high          If tier 0 memory used by the cgroup exceeds this high
> >                         boundary, allocation of tier 0 memory by the cgroup will
> >                         be throttled. The tier 0 memory used by this cgroup
> >                         will also be subjected to heavy demotion.
> >
> > memory_t0.max           This will be a hard usage limit of tier 0 memory on the cgroup.
> >
> > If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> > This follows closely with the design of the general memory controller interface.
> >
> > Will such an interface looks sane and acceptable with everyone?
> >  
> 
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
> 
> My questions are:
> 
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access from the tier? (node_0 <-> node_1 vs
> node_0 <-> node_2)

Not in large systems, even if we can make this assumption in 2 socket ones.

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.
> 
> The reason I am asking these questions is that the statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions in the user API.

Absolutely agree.

> 
> Assumptions like:
> 1) Access within tier is always cheaper than across tier.
> 2) Access from tier_i to tier_i+1 has uniform cost.
> 
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
> 

Sounds good, will look out for that.

Jonathan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 20:29     ` Shakeel Butt
  2021-04-08 20:50       ` Yang Shi
  2021-04-09  7:24       ` Michal Hocko
@ 2021-04-14 23:22       ` Tim Chen
  2 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-14 23:22 UTC (permalink / raw)
  To: Shakeel Butt, Yang Shi
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML



On 4/8/21 1:29 PM, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:

> 
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.
> 
> I have some thoughts on NUMA node limits which I will share in the other thread.
> 

Shakeel,

I look forward to the proposal on NUMA node limits.  Which thread are
you going to post it in?  I want to make sure I don't miss it.

Tim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-12 19:20         ` Shakeel Butt
  2021-04-14  8:59           ` Jonathan Cameron
@ 2021-04-15  0:42           ` Tim Chen
  1 sibling, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-15  0:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML, Greg Thelen, Wei Xu



On 4/12/21 12:20 PM, Shakeel Butt wrote:

>>
>> memory_t0.current       Current usage of tier 0 memory by the cgroup.
>>
>> memory_t0.min           If tier 0 memory used by the cgroup falls below this low
>>                         boundary, the memory will not be subjected to demotion
>>                         to lower tiers to free up memory at tier 0.
>>
>> memory_t0.low           Above this boundary, the tier 0 memory will be subjected
>>                         to demotion.  The demotion pressure will be proportional
>>                         to the overage.
>>
>> memory_t0.high          If tier 0 memory used by the cgroup exceeds this high
>>                         boundary, allocation of tier 0 memory by the cgroup will
>>                         be throttled. The tier 0 memory used by this cgroup
>>                         will also be subjected to heavy demotion.
>>
>> memory_t0.max           This will be a hard usage limit of tier 0 memory on the cgroup.
>>
>> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
>> This follows closely with the design of the general memory controller interface.
>>
>> Will such an interface looks sane and acceptable with everyone?
>>
> 
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
> 
> My questions are:
> 
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access from the tier? (node_0 <-> node_1 vs
> node_0 <-> node_2)

I do assume that higher tier memory offers better performance (or lower
access latency) than lower tier memory.  Otherwise, this defeats the
whole purpose of promoting hot memory from a lower tier to a higher tier,
and demoting cold memory to a lower tier.

The tier assumption is embedded once we define this promotion/demotion
relationship between the NUMA nodes.

So if 

  node_m ----demotes----> node_n
         <---promotes---- 

then node_m is one tier higher than node_n. This promotion/demotion
relationship between the nodes is the underpinning of Dave and Ying's
demotion and promotion patch sets.

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?

I think if you configure a promotion/demotion relationship between
DRAM over CXL and local-socket connected DRAM, you could divide them
up into separate tiers.  Or, if you don't care about the difference,
you can configure them not to have a promotion/demotion relationship
and they will be in the same tier.  Balancing within the same tier
will be handled by the autonuma mechanism.
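
As a rough sketch of that last point (the numa_balancing sysctl below is
the existing knob; the tiering mode value is proposed by Ying's promotion
patch set mentioned above and is an assumption here):

  # Conventional automatic NUMA balancing handles placement between
  # nodes of the same tier:
  echo 1 > /proc/sys/kernel/numa_balancing

  # Ying's NUMA balancing patch set proposes an additional mode (assumed
  # here to be value 2) that also promotes hot pages from slower tiers:
  # echo 2 > /proc/sys/kernel/numa_balancing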

> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.

The tier definition is an admin's choice of where the admin thinks the
hot memory should reside, after looking at the memory performance.
It falls out of how the admin constructs the promotion/demotion relationship
between the nodes; the OS does not derive the tier relationship from
memory performance directly.

> 
> The reason I am asking these questions is that the statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions in the user API.
> 
> Assumptions like:
> 1) Access within tier is always cheaper than across tier.
> 2) Access from tier_i to tier_i+1 has uniform cost.
> 
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
> 

I am still trying to understand how a NUMA centric control would actually
work. Putting limits on every NUMA node for each cgroup
seems to make the system configuration quite complicated.  I am looking
forward to your proposal so I can better understand that perspective.

Tim 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-08 17:18 ` Shakeel Butt
  2021-04-08 18:00   ` Yang Shi
@ 2021-04-15 22:25   ` Tim Chen
  1 sibling, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-04-15 22:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML



On 4/8/21 10:18 AM, Shakeel Butt wrote:

> 
> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is the low priority job hogging the top tier memory.
> 
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.
> 
> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.
> 

Circling back to your comment here.  

I agree that the soft limit is deficient in the scenario that you
have pointed out.  Eventually I am aiming for a hard limit on a
memory tier for a cgroup, similar to the v2 memory controller
interface (see my mail in the other thread).  That interface should
satisfy the hard constraint you want to place on the low priority
jobs.


Tim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-09  7:24       ` Michal Hocko
@ 2021-04-15 22:31         ` Tim Chen
  2021-04-16  6:38           ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Tim Chen @ 2021-04-15 22:31 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: Yang Shi, Johannes Weiner, Andrew Morton, Dave Hansen,
	Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups,
	LKML



On 4/9/21 12:24 AM, Michal Hocko wrote:
> On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
>> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> [...]
>>> The low priority jobs should be able to be restricted by cpuset, for
>>> example, just keep them on second tier memory nodes. Then all the
>>> above problems are gone.
> 
> Yes, if the aim is to isolate some users from certain numa node then
> cpuset is a good fit but as Shakeel says this is very likely not what
> this work is aiming for.
> 
>> Yes that's an extreme way to overcome the issue but we can do less
>> extreme by just (hard) limiting the top tier usage of low priority
>> jobs.
> 
> Per numa node high/hard limit would help with a more fine grained control.
> The configuration would be tricky though. All low priority memcgs would
> have to be carefully configured to leave enough for your important
> processes. That includes also memory which is not accounted to any
> memcg. 
> The behavior of those limits would be quite tricky for OOM situations
> as well due to a lack of NUMA aware oom killer.
> 

Another downside of putting limits on individual NUMA nodes
is that it would limit flexibility.  For example, two memory nodes may be
similar enough in performance that you really only care about a cgroup
not using more than a threshold of the combined capacity of the two
memory nodes.  But when you put a hard limit on each NUMA node, you are
tied down to a fixed allocation partition for each node.  Perhaps there are
some kernel resources that are pre-allocated primarily from one node. A
cgroup may bump into the limit on that node and fail the allocation,
even when it has a lot of slack in the other node.  This makes getting
the configuration right trickier.

There are currently some differences in opinion
on whether grouping memory nodes into tiers, and putting per-cgroup
limits on their use, is desirable.  Many people want the
management constraint placed on individual NUMA nodes for each cgroup, instead
of at the tier level.  I will appreciate feedback from folks who have
insights on how such a NUMA based control interface would work, so we
at least agree here in order to move forward.

Tim


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
  2021-04-15 22:31         ` Tim Chen
@ 2021-04-16  6:38           ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2021-04-16  6:38 UTC (permalink / raw)
  To: Tim Chen
  Cc: Shakeel Butt, Yang Shi, Johannes Weiner, Andrew Morton,
	Dave Hansen, Ying Huang, Dan Williams, David Rientjes, Linux MM,
	Cgroups, LKML

On Thu 15-04-21 15:31:46, Tim Chen wrote:
> 
> 
> On 4/9/21 12:24 AM, Michal Hocko wrote:
> > On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> >> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> > [...]
> >>> The low priority jobs should be able to be restricted by cpuset, for
> >>> example, just keep them on second tier memory nodes. Then all the
> >>> above problems are gone.
> > 
> > Yes, if the aim is to isolate some users from certain numa node then
> > cpuset is a good fit but as Shakeel says this is very likely not what
> > this work is aiming for.
> > 
> >> Yes that's an extreme way to overcome the issue but we can do less
> >> extreme by just (hard) limiting the top tier usage of low priority
> >> jobs.
> > 
> > Per numa node high/hard limit would help with a more fine grained control.
> > The configuration would be tricky though. All low priority memcgs would
> > have to be carefully configured to leave enough for your important
> > processes. That includes also memory which is not accounted to any
> > memcg. 
> > The behavior of those limits would be quite tricky for OOM situations
> > as well due to a lack of NUMA aware oom killer.
> > 
> 
> Another downside of putting limits on individual NUMA
> node is it would limit flexibility.

Let me just clarify one thing. I haven't been proposing per NUMA limits.
As I've said above, it would be quite tricky to use and the behavior
would be tricky as well. All I am saying is that we do not want to have
an interface that is tightly bound to a specific HW setup (fast RAM as
a top tier and PMEM as a fallback) like the one you have proposed here.
We want to have a generic NUMA based abstraction. What that abstraction
is going to look like is an open question and it really depends on the
use cases that we expect to see.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-04-16  6:38 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 01/11] mm: Define top tier memory node mask Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 02/11] mm: Add soft memory limit for mem cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 03/11] mm: Account the top tier memory usage per cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 04/11] mm: Report top tier memory usage in sysfs Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 07/11] mm: Account the total top tier memory in use Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 08/11] mm: Add toptier option for mem_cgroup_soft_limit_reclaim() Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 09/11] mm: Use kswapd to demote pages when toptier memory is tight Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim Tim Chen
2021-04-06  9:08 ` [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Michal Hocko
2021-04-07 22:33   ` Tim Chen
2021-04-08 11:52     ` Michal Hocko
2021-04-09 23:26       ` Tim Chen
2021-04-12 19:20         ` Shakeel Butt
2021-04-14  8:59           ` Jonathan Cameron
2021-04-15  0:42           ` Tim Chen
2021-04-13  2:15         ` Huang, Ying
2021-04-13  8:33         ` Michal Hocko
2021-04-12 14:03       ` Shakeel Butt
2021-04-08 17:18 ` Shakeel Butt
2021-04-08 18:00   ` Yang Shi
2021-04-08 20:29     ` Shakeel Butt
2021-04-08 20:50       ` Yang Shi
2021-04-12 14:03         ` Shakeel Butt
2021-04-09  7:24       ` Michal Hocko
2021-04-15 22:31         ` Tim Chen
2021-04-16  6:38           ` Michal Hocko
2021-04-14 23:22       ` Tim Chen
2021-04-09  2:58     ` Huang, Ying
2021-04-09 20:50       ` Yang Shi
2021-04-15 22:25   ` Tim Chen
