linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/8] memory recharging for offline memcgs
@ 2023-07-20  7:08 Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account Yosry Ahmed
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

This patch series implements the proposal in LSF/MM/BPF 2023 conference
for reducing offline/zombie memcgs by memory recharging [1]. The main
difference is that this series focuses on recharging and does not
include eviction of any memory charged to offline memcgs.

Two methods of recharging are proposed:

(a) Recharging of mapped folios.

When a memcg is offlined, queue an asynchronous worker that will walk
the lruvec of the offline memcg and try to recharge any mapped folios to
the memcg of one of the processes mapping the folio. The main assumption
is that a process mapping the folio is the "rightful" owner of the
memory.

Currently, this is only supported for evictable folios, as the
unevictable lru is imaginary and we cannot iterate the folios on it. A
separate proposal [2] was made to revive the unevictable lru, which
would allow recharging of unevictable folios.

(b) Deferred recharging of folios.

For folios that are unmapped, or mapped but we fail to recharge them
with (a), we rely on deferred recharging. Simply put, any time a folio
is accessed or dirtied by a userspace process, and that folio is charged
to an offline memcg, we will try to recharge it to the memcg of the
process accessing the folio. Again, we assume this process should be the
"rightful" owner of the memory. This is also done asynchronously to avoid
slowing down the data access path.

In both cases, we never OOM kill the recharging target if it goes above
limit. This is to avoid OOM killing a process an arbitrary amount of
time after it started using memory. This is a conservative policy that
can be revisited later.

The patches in this series are divided as follows:
- Patches 1 & 2 are preliminary refactoring and helpers introducion.
- Patches 3 to 5 implement (a) and (b) above.
- Patches 6 & 7 add stats, a sysctl, and a config option.
- Patch 8 is a selftest.

[1]https://lore.kernel.org/linux-mm/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/
[2]https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/

Yosry Ahmed (8):
  memcg: refactor updating memcg->moving_account
  mm: vmscan: add lruvec_for_each_list() helper
  memcg: recharge mapped folios when a memcg is offlined
  memcg: support deferred memcg recharging
  memcg: recharge folios when accessed or dirtied
  memcg: add stats for offline memcgs recharging
  memcg: add sysctl and config option to control memory recharging
  selftests: cgroup: test_memcontrol: add a selftest for memcg
    recharging

 include/linux/memcontrol.h                    |  14 +
 include/linux/swap.h                          |   8 +
 include/linux/vm_event_item.h                 |   5 +
 kernel/sysctl.c                               |  11 +
 mm/Kconfig                                    |  12 +
 mm/memcontrol.c                               | 376 +++++++++++++++++-
 mm/page-writeback.c                           |   2 +
 mm/swap.c                                     |   2 +
 mm/vmscan.c                                   |  28 ++
 mm/vmstat.c                                   |   6 +-
 tools/testing/selftests/cgroup/cgroup_util.c  |  14 +
 tools/testing/selftests/cgroup/cgroup_util.h  |   1 +
 .../selftests/cgroup/test_memcontrol.c        | 310 +++++++++++++++
 13 files changed, 784 insertions(+), 5 deletions(-)

-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper Yosry Ahmed
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

memcg->moving_account is used to signal that a memcg move is taking
place, so that folio_memcg_lock() would start acquiring the per-memcg
move lock instead of just initiating an rcu read section.

Refactor incrementing and decrementing memcg->moving_account, together
with rcu synchornization and the elaborate comment into helpers, to
allow for reuse by incoming patches.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..ffdb848f4003 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6305,16 +6305,26 @@ static const struct mm_walk_ops charge_walk_ops = {
 	.pmd_entry	= mem_cgroup_move_charge_pte_range,
 };
 
-static void mem_cgroup_move_charge(void)
+static void mem_cgroup_start_move_charge(struct mem_cgroup *memcg)
 {
-	lru_add_drain_all();
 	/*
 	 * Signal folio_memcg_lock() to take the memcg's move_lock
 	 * while we're moving its pages to another memcg. Then wait
 	 * for already started RCU-only updates to finish.
 	 */
-	atomic_inc(&mc.from->moving_account);
+	atomic_inc(&memcg->moving_account);
 	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move_charge(struct mem_cgroup *memcg)
+{
+	atomic_dec(&memcg->moving_account);
+}
+
+static void mem_cgroup_move_charge(void)
+{
+	lru_add_drain_all();
+	mem_cgroup_start_move_charge(mc.from);
 retry:
 	if (unlikely(!mmap_read_trylock(mc.mm))) {
 		/*
@@ -6334,7 +6344,7 @@ static void mem_cgroup_move_charge(void)
 	 */
 	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
 	mmap_read_unlock(mc.mm);
-	atomic_dec(&mc.from->moving_account);
+	mem_cgroup_end_move_charge(mc.from);
 }
 
 static void mem_cgroup_move_task(void)
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined Yosry Ahmed
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

This helper is used to provide a callback to be called for each lruvec
list. This abstracts different lruvec implementations (MGLRU vs. classic
LRUs). The helper is used by a following commit to iterate all folios in
all LRUs lists for memcg recharging.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/swap.h |  8 ++++++++
 mm/vmscan.c          | 28 ++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..c0621deceb03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,6 +406,14 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+typedef bool (*lruvec_list_fn_t)(struct lruvec *lruvec,
+				 struct list_head *list,
+				 enum lru_list lru,
+				 void *arg);
+extern void lruvec_for_each_list(struct lruvec *lruvec,
+				 lruvec_list_fn_t fn,
+				 void *arg);
+
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..e7956000a3b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6254,6 +6254,34 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 
 #endif /* CONFIG_LRU_GEN */
 
+/*
+ * lruvec_for_each_list - make a callback for every folio list in the lruvec
+ * @lruvec: the lruvec to iterate lists in
+ * @fn: the callback to make for each list, iteration stops if it returns true
+ * @arg: argument to pass to @fn
+ */
+void lruvec_for_each_list(struct lruvec *lruvec, lruvec_list_fn_t fn, void *arg)
+{
+	enum lru_list lru;
+
+#ifdef CONFIG_LRU_GEN
+	if (lru_gen_enabled()) {
+		int gen, type, zone;
+
+		for_each_gen_type_zone(gen, type, zone) {
+			lru = type * LRU_INACTIVE_FILE;
+			if (fn(lruvec, &lruvec->lrugen.folios[gen][type][zone],
+			       lru, arg))
+				break;
+		}
+	} else
+#endif
+		for_each_evictable_lru(lru) {
+			if (fn(lruvec, &lruvec->lists[lru], lru, arg))
+				break;
+		}
+}
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 4/8] memcg: support deferred memcg recharging Yosry Ahmed
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

When a cgroup is removed by userspace and its memcg goes offline,
attempt to recharge the pages charged to the memcg to other memcgs that
are actually using the folios. Recharging is done in an asynchronous
worker as it is an expensive operation and needs to acquire multiple
locks.

Recharge targets are identified by walking the rmap and checking the
memcgs of the processes mapping the folio, if any. We avoid an OOM kill
if we fail to charge the folio, to avoid inducing an OOM kill at a
seemingly random point in time in the target memcg.

If we fail for any reason (e.g. could not isolate all folios, could not
lock folio, target memcg reached its limit, etc), we reschedule the
worker to rety.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |   6 +
 mm/memcontrol.c            | 230 +++++++++++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..b41d69685ead 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,6 +326,12 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+	/* async recharge of mapped folios for offline memcgs */
+	struct {
+		struct delayed_work dwork;
+		int retries;
+	} recharge_mapped_work;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffdb848f4003..a46bc8f000c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,8 @@ static bool cgroup_memory_nobpf __ro_after_init;
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
 
+static struct workqueue_struct *memcg_recharge_wq;
+
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
@@ -5427,6 +5429,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	return -ENOMEM;
 }
 
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg);
+
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5455,6 +5459,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	drain_all_stock(memcg);
 
 	mem_cgroup_id_put(memcg);
+
+	memcg_recharge_mapped_folios(memcg);
 }
 
 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5487,6 +5493,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 
 	vmpressure_cleanup(&memcg->vmpressure);
 	cancel_work_sync(&memcg->high_work);
+	cancel_delayed_work_sync(&memcg->recharge_mapped_work.dwork);
 	mem_cgroup_remove_from_trees(memcg);
 	free_shrinker_info(memcg);
 	mem_cgroup_free(memcg);
@@ -6367,6 +6374,219 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+/* Returns true if recharging is successful */
+static bool mem_cgroup_recharge_folio(struct folio *folio,
+				      struct mem_cgroup *new_memcg)
+{
+	struct mem_cgroup *old_memcg = folio_memcg(folio);
+	gfp_t gfp = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+	long nr_pages = folio_nr_pages(folio);
+	int err = -1;
+
+	if (!new_memcg)
+		goto out;
+
+	err = try_charge(new_memcg, gfp, nr_pages);
+	if (err)
+		goto out;
+
+	err = mem_cgroup_move_account(&folio->page, folio_test_large(folio),
+				      old_memcg, new_memcg);
+	cancel_charge(err ? new_memcg : old_memcg, nr_pages);
+out:
+	return err == 0;
+}
+
+struct folio_memcg_rmap_recharge_arg {
+	bool recharged;
+};
+
+static bool folio_memcg_rmap_recharge_one(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long address, void *arg)
+{
+	struct folio_memcg_rmap_recharge_arg *recharge_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	struct mem_cgroup *memcg;
+
+	/*
+	 * page_vma_mapped_walk() is only needed to grab any pte lock to
+	 * serialize with page_remove_rmap(), as folio_mapped() must remain
+	 * stable during the move.
+	 */
+	recharge_arg->recharged = false;
+	while (page_vma_mapped_walk(&pvmw)) {
+		memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+		if (mem_cgroup_recharge_folio(folio, memcg))
+			recharge_arg->recharged = true;
+		mem_cgroup_put(memcg);
+		page_vma_mapped_walk_done(&pvmw);
+		break;
+	}
+
+	/* stop the rmap walk if we were successful */
+	return !recharge_arg->recharged;
+}
+
+/* Returns true if recharging is successful */
+static bool folio_memcg_rmap_recharge(struct folio *folio)
+{
+	struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+	struct rmap_walk_control rwc = {
+		.rmap_one = folio_memcg_rmap_recharge_one,
+		.arg = (void *)&arg,
+		.anon_lock = folio_lock_anon_vma_read,
+		.try_lock = true,
+	};
+
+	rmap_walk(folio, &rwc);
+	return arg.recharged;
+}
+
+static unsigned long lruvec_nr_local_mapped_pages(struct lruvec *lruvec)
+{
+	return lruvec_page_state_local(lruvec, NR_ANON_MAPPED) +
+		lruvec_page_state_local(lruvec, NR_FILE_MAPPED);
+}
+
+static unsigned long memcg_nr_local_mapped_pages(struct mem_cgroup *memcg)
+{
+	return memcg_page_state_local(memcg, NR_ANON_MAPPED) +
+		memcg_page_state_local(memcg, NR_FILE_MAPPED);
+}
+
+static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
+				       struct list_head *list,
+				       enum lru_list lru,
+				       void *arg)
+{
+	int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	unsigned long *nr_recharged = arg;
+	unsigned long nr_staged = 0;
+	LIST_HEAD(folios_skipped);
+	LIST_HEAD(folios_staged);
+	struct folio *folio;
+
+	/* Are we done with mapped pages on this node? */
+	if (!lruvec_nr_local_mapped_pages(lruvec))
+		return true;
+
+	/*
+	 * Iterating the LRUs here is tricky, because we
+	 * usually cannot iterate the entire list with the
+	 * lock held, and the LRU can change once we release it.
+	 *
+	 * What we try to do is isolate as many folios as we can
+	 * without hogging the lock or the cpu. We need to move
+	 * all the folios we iterate off the list to try to
+	 * avoid re-visiting them on retries.
+	 *
+	 * The folios we are interested in are moved to
+	 * @folios_staged, and other folios are moved to
+	 * @folios_skipped. Before releasing the lock, we splice
+	 * @folios_skipped back into the beginning of the LRU.
+	 * This essentially rotates the LRU, similar to reclaim,
+	 * as lru_to_folio() fetches folios from the end of the
+	 * LRU.
+	 */
+	spin_lock_irq(&lruvec->lru_lock);
+	while (!list_empty(list) && !need_resched() &&
+	       !spin_is_contended(&lruvec->lru_lock)) {
+		folio = lru_to_folio(list);
+		if (!folio_mapped(folio)) {
+			list_move(&folio->lru, &folios_skipped);
+			continue;
+		}
+
+		if (unlikely(!folio_try_get(folio))) {
+			list_move(&folio->lru, &folios_skipped);
+			continue;
+		}
+
+		if (!folio_test_clear_lru(folio)) {
+			list_move(&folio->lru, &folios_skipped);
+			folio_put(folio);
+			continue;
+		}
+
+		lruvec_del_folio(lruvec, folio);
+		list_add(&folio->lru, &folios_staged);
+		nr_staged += folio_nr_pages(folio);
+	}
+	list_splice(&folios_skipped, list);
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mod_lruvec_state(lruvec, isolated_idx, nr_staged);
+	mem_cgroup_start_move_charge(memcg);
+	while (!list_empty(&folios_staged)) {
+		folio = lru_to_folio(&folios_staged);
+		list_del(&folio->lru);
+
+		if (!folio_trylock(folio)) {
+			folio_putback_lru(folio);
+			continue;
+		}
+
+		if (folio_memcg_rmap_recharge(folio))
+			*nr_recharged += folio_nr_pages(folio);
+
+		folio_unlock(folio);
+		folio_putback_lru(folio);
+		cond_resched();
+	}
+	mem_cgroup_end_move_charge(memcg);
+	mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+	return false;
+}
+
+static void memcg_do_recharge_mapped_folios(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct mem_cgroup *memcg = container_of(dwork, struct mem_cgroup,
+						recharge_mapped_work.dwork);
+	unsigned long nr_recharged = 0;
+	struct lruvec *lruvec;
+	int nid;
+
+	lru_add_drain_all();
+	for_each_node_state(nid, N_MEMORY) {
+		lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		lruvec_for_each_list(lruvec, memcg_recharge_lruvec_list,
+				     &nr_recharged);
+	}
+
+	/* Are we done with all mapped folios? */
+	if (!memcg_nr_local_mapped_pages(memcg))
+		return;
+
+	/* Did we make progress? reset retries */
+	if (nr_recharged > 0)
+		memcg->recharge_mapped_work.retries = 0;
+
+	/* Exponentially increase delay before each retry (from 1ms to ~32s) */
+	if (memcg->recharge_mapped_work.retries < MAX_RECLAIM_RETRIES)
+		queue_delayed_work(memcg_recharge_wq,
+				   &memcg->recharge_mapped_work.dwork,
+				   1 << memcg->recharge_mapped_work.retries++);
+}
+
+static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
+{
+	/*
+	 * We need to initialize the work here, even if we are not going to
+	 * queue it, such that cancel_delayed_work_sync() in
+	 * mem_cgroup_css_free() does not complain.
+	 */
+	INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
+			  memcg_do_recharge_mapped_folios);
+
+	if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+		memcg->recharge_mapped_work.retries = 0;
+		queue_delayed_work(memcg_recharge_wq,
+				   &memcg->recharge_mapped_work.dwork, 0);
+	}
+}
+
 #ifdef CONFIG_LRU_GEN
 static void mem_cgroup_attach(struct cgroup_taskset *tset)
 {
@@ -7904,3 +8124,13 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
+
+static int __init memcg_recharge_wq_init(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+	memcg_recharge_wq = alloc_workqueue("memcg_recharge", WQ_UNBOUND, 0);
+	WARN_ON(!memcg_recharge_wq);
+	return 0;
+}
+subsys_initcall(memcg_recharge_wq_init);
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 4/8] memcg: support deferred memcg recharging
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied Yosry Ahmed
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

The previous patch added support for memcg recharging for mapped folios
when a memcg goes offline. For unmapped folios, it is not straighforward
to find a rightful memcg to recharge the folio to. Hence, add support
for deferred recharging.

Deferred recharging provides a hook that can be added in data access
paths: folio_memcg_deferred_recharge().

folio_memcg_deferred_recharge() will check if the memcg that the folio
is charged to is offline. If so, it will queue an asynchronous worker to
attempt to recharge the folio to the memcg of the process accessing the
folio. An asynchronous worker is used for 2 reasons:
(a) Avoid expensive operations on the data access path.
(b) Acquring some locks (e.g. folio lock, lruvec lock) is not safe to do
    from all contexts.

Deferring recharging will not cause an OOM kill in the target memcg. If
recharging fails for any reason, the worker reschedules itself to retry,
unless the folio is freed or the target memcg goes offline.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |   6 ++
 mm/memcontrol.c            | 125 +++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b41d69685ead..59b653d4a76e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -956,6 +956,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 void folio_memcg_lock(struct folio *folio);
 void folio_memcg_unlock(struct folio *folio);
 
+void folio_memcg_deferred_recharge(struct folio *folio);
+
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
 /* try to stablize folio_memcg() for all the pages in a memcg */
@@ -1461,6 +1463,10 @@ static inline void mem_cgroup_unlock_pages(void)
 	rcu_read_unlock();
 }
 
+static inline void folio_memcg_deferred_recharge(struct folio *folio)
+{
+}
+
 static inline void mem_cgroup_handle_over_high(void)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a46bc8f000c8..cf9fb51ecfcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6398,6 +6398,7 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
 }
 
 struct folio_memcg_rmap_recharge_arg {
+	struct mem_cgroup *memcg;
 	bool recharged;
 };
 
@@ -6415,10 +6416,12 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
 	 */
 	recharge_arg->recharged = false;
 	while (page_vma_mapped_walk(&pvmw)) {
-		memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+		memcg = recharge_arg->memcg ?:
+			get_mem_cgroup_from_mm(vma->vm_mm);
 		if (mem_cgroup_recharge_folio(folio, memcg))
 			recharge_arg->recharged = true;
-		mem_cgroup_put(memcg);
+		if (!recharge_arg->memcg)
+			mem_cgroup_put(memcg);
 		page_vma_mapped_walk_done(&pvmw);
 		break;
 	}
@@ -6428,9 +6431,13 @@ static bool folio_memcg_rmap_recharge_one(struct folio *folio,
 }
 
 /* Returns true if recharging is successful */
-static bool folio_memcg_rmap_recharge(struct folio *folio)
+static bool folio_memcg_rmap_recharge(struct folio *folio,
+				      struct mem_cgroup *memcg)
 {
-	struct folio_memcg_rmap_recharge_arg arg = { .recharged = false };
+	struct folio_memcg_rmap_recharge_arg arg = {
+		.recharged = false,
+		.memcg = memcg,
+	};
 	struct rmap_walk_control rwc = {
 		.rmap_one = folio_memcg_rmap_recharge_one,
 		.arg = (void *)&arg,
@@ -6527,7 +6534,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
 			continue;
 		}
 
-		if (folio_memcg_rmap_recharge(folio))
+		if (folio_memcg_rmap_recharge(folio, NULL))
 			*nr_recharged += folio_nr_pages(folio);
 
 		folio_unlock(folio);
@@ -6587,6 +6594,114 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
 	}
 }
 
+/* Result is only stable if @folio is locked */
+static bool should_do_deferred_recharge(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	bool ret;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	ret = memcg && !!(memcg->css.flags & CSS_DYING);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+struct deferred_recharge_work {
+	struct folio *folio;
+	struct mem_cgroup *memcg;
+	struct work_struct work;
+};
+
+static void folio_memcg_do_deferred_recharge(struct work_struct *work)
+{
+	struct deferred_recharge_work *recharge_work = container_of(work,
+					struct deferred_recharge_work, work);
+	struct folio *folio = recharge_work->folio;
+	struct mem_cgroup *new = recharge_work->memcg;
+	struct mem_cgroup *old;
+
+	/* We are holding the last ref to the folio, let it be freed */
+	if (unlikely(folio_ref_count(folio) == 1))
+		goto out;
+
+	if (!folio_isolate_lru(folio))
+		goto out;
+
+	if (unlikely(!folio_trylock(folio)))
+		goto out_putback;
+
+	/* @folio was already recharged since the worker was queued? */
+	if (unlikely(!should_do_deferred_recharge(folio)))
+		goto out_unlock;
+
+	/* @folio was already recharged to @new and it already went offline? */
+	old = folio_memcg(folio);
+	if (unlikely(old == new))
+		goto out_unlock;
+
+	/*
+	 * folio_mapped() must remain stable during the move. If the folio is
+	 * mapped, we must use rmap recharge to serialize against unmapping.
+	 * Otherwise, if the folio is unmapped, the folio lock is held so this
+	 * should prevent faults against the pagecache or swapcache to map it.
+	 */
+	mem_cgroup_start_move_charge(old);
+	if (folio_mapped(folio))
+		folio_memcg_rmap_recharge(folio, new);
+	else
+		mem_cgroup_recharge_folio(folio, new);
+	mem_cgroup_end_move_charge(old);
+
+out_unlock:
+	folio_unlock(folio);
+out_putback:
+	folio_putback_lru(folio);
+out:
+	folio_put(folio);
+	mem_cgroup_put(new);
+	kfree(recharge_work);
+}
+
+/*
+ * Queue deferred work to recharge @folio to current's memcg if needed.
+ */
+void folio_memcg_deferred_recharge(struct folio *folio)
+{
+	struct deferred_recharge_work *recharge_work = NULL;
+	struct mem_cgroup *memcg = NULL;
+
+	/* racy check, the async worker will check again with @folio locked */
+	if (likely(!should_do_deferred_recharge(folio)))
+		return;
+
+	if (unlikely(!memcg_recharge_wq))
+		return;
+
+	if (unlikely(!folio_try_get(folio)))
+		return;
+
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!memcg)
+		goto fail;
+
+	recharge_work = kmalloc(sizeof(*recharge_work), GFP_ATOMIC);
+	if (!recharge_work)
+		goto fail;
+
+	/* we hold refs to both the folio and the memcg we are charging to */
+	recharge_work->folio = folio;
+	recharge_work->memcg = memcg;
+	INIT_WORK(&recharge_work->work, folio_memcg_do_deferred_recharge);
+	queue_work(memcg_recharge_wq, &recharge_work->work);
+	return;
+fail:
+	kfree(recharge_work);
+	mem_cgroup_put(memcg);
+	folio_put(folio);
+}
+
 #ifdef CONFIG_LRU_GEN
 static void mem_cgroup_attach(struct cgroup_taskset *tset)
 {
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 4/8] memcg: support deferred memcg recharging Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging Yosry Ahmed
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

The previous patch provided support for deferred recharging of folios
when their memcgs go offline. This patch adds recharging hooks to
folio_mark_accessed() and folio_mark_dirty().
This should cover a variety of code paths where folios are accessed by
userspace.

The hook, folio_memcg_deferred_recharge() only checks if the folio is
charged to an offline memcg in the common fast path (i.e checks
folio->memcg_data). If yes, an asynchronous worker is queued to do the
actual work.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/page-writeback.c | 2 ++
 mm/swap.c           | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d3f42009bb70..a644530d98c7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2785,6 +2785,8 @@ bool folio_mark_dirty(struct folio *folio)
 {
 	struct address_space *mapping = folio_mapping(folio);
 
+	folio_memcg_deferred_recharge(folio);
+
 	if (likely(mapping)) {
 		/*
 		 * readahead/folio_deactivate could remain
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..296c0b87c967 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -457,6 +457,8 @@ static void folio_inc_refs(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	folio_memcg_deferred_recharge(folio);
+
 	if (lru_gen_enabled()) {
 		folio_inc_refs(folio);
 		return;
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (4 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20  7:08 ` [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging Yosry Ahmed
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

Add vm events for scanning pages for recharge, successfully recharging
pages, and cancelling a recharge due to failure to charge the target
memcg.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/vm_event_item.h | 5 +++++
 mm/memcontrol.c               | 6 ++++++
 mm/vmstat.c                   | 6 +++++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..cd80c00c50c2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -60,6 +60,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
 		OOM_KILL,
+#ifdef CONFIG_MEMCG
+		RECHARGE_PGSCANNED,
+		RECHARGE_PGMOVED,
+		RECHARGE_PGCANCELLED,
+#endif
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
 		NUMA_HUGE_PTE_UPDATES,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf9fb51ecfcc..2fe9c6f1be80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6394,6 +6394,8 @@ static bool mem_cgroup_recharge_folio(struct folio *folio,
 				      old_memcg, new_memcg);
 	cancel_charge(err ? new_memcg : old_memcg, nr_pages);
 out:
+	count_vm_events(err ? RECHARGE_PGCANCELLED : RECHARGE_PGMOVED,
+			nr_pages);
 	return err == 0;
 }
 
@@ -6469,6 +6471,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
 	int isolated_idx = NR_ISOLATED_ANON + is_file_lru(lru);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long *nr_recharged = arg;
+	unsigned long nr_scanned = 0;
 	unsigned long nr_staged = 0;
 	LIST_HEAD(folios_skipped);
 	LIST_HEAD(folios_staged);
@@ -6505,6 +6508,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
 			continue;
 		}
 
+		nr_scanned += folio_nr_pages(folio);
 		if (unlikely(!folio_try_get(folio))) {
 			list_move(&folio->lru, &folios_skipped);
 			continue;
@@ -6543,6 +6547,7 @@ static bool memcg_recharge_lruvec_list(struct lruvec *lruvec,
 	}
 	mem_cgroup_end_move_charge(memcg);
 	mod_lruvec_state(lruvec, isolated_idx, -nr_staged);
+	count_vm_events(RECHARGE_PGSCANNED, nr_scanned);
 	return false;
 }
 
@@ -6679,6 +6684,7 @@ void folio_memcg_deferred_recharge(struct folio *folio)
 	if (unlikely(!memcg_recharge_wq))
 		return;
 
+	count_vm_events(RECHARGE_PGSCANNED, folio_nr_pages(folio));
 	if (unlikely(!folio_try_get(folio)))
 		return;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..e425a1aa7890 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1303,7 +1303,11 @@ const char * const vmstat_text[] = {
 	"drop_pagecache",
 	"drop_slab",
 	"oom_kill",
-
+#ifdef CONFIG_MEMCG
+	"recharge_pgs_scanned",
+	"recharge_pgs_moved",
+	"recharge_pgs_cancelled",
+#endif
 #ifdef CONFIG_NUMA_BALANCING
 	"numa_pte_updates",
 	"numa_huge_pte_updates",
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (5 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20 18:13   ` Luis Chamberlain
  2023-07-20  7:08 ` [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging Yosry Ahmed
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

Add a sysctl to enable/disable memory recharging for offline memcgs. Add
a config option to control whether or not it is enabled by default.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  2 ++
 kernel/sysctl.c            | 11 +++++++++++
 mm/Kconfig                 | 12 ++++++++++++
 mm/memcontrol.c            |  9 ++++++++-
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 59b653d4a76e..ae9f09ee90cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
 
 #ifdef CONFIG_MEMCG
 
+extern int sysctl_recharge_offline_memcgs;
+
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 354a2d294f52..1735d1d95652 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
 		.extra2		= (void *)&mmap_rnd_compat_bits_max,
 	},
 #endif
+#ifdef CONFIG_MEMCG
+	{
+		.procname	= "recharge_offline_memcgs",
+		.data		= &sysctl_recharge_offline_memcgs,
+		.maxlen		= sizeof(sysctl_recharge_offline_memcgs),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_MEMCG */
 	{ }
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..9462c4b598d9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1236,6 +1236,18 @@ config LOCK_MM_AND_FIND_VMA
 	bool
 	depends on !STACK_GROWSUP
 
+config MEMCG_RECHARGE_OFFLINE_ENABLED
+	bool "Recharge memory charged to offline memcgs"
+	depends on MEMCG
+	help
+	  When a memory cgroup is removed by userspace, try to recharge any
+	  memory still charged to it to avoid having it live on as an offline
+	  memcg. Offline memcgs potentially consume memory and limit scalability
+	  of some operations.
+
+	  This option enables the above behavior by default. It can be override
+	  at runtime through /proc/sys/vm/recharge_offline_memcgs.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fe9c6f1be80..25cdb17eaaa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,9 @@ static bool cgroup_memory_nobpf __ro_after_init;
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
 
+int sysctl_recharge_offline_memcgs __read_mostly = IS_ENABLED(
+		CONFIG_MEMCG_RECHARGE_OFFLINE_ENABLED);
+
 static struct workqueue_struct *memcg_recharge_wq;
 
 /* Whether legacy memory+swap accounting is active */
@@ -6592,7 +6595,8 @@ static void memcg_recharge_mapped_folios(struct mem_cgroup *memcg)
 	INIT_DELAYED_WORK(&memcg->recharge_mapped_work.dwork,
 			  memcg_do_recharge_mapped_folios);
 
-	if (memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
+	if (sysctl_recharge_offline_memcgs &&
+	    memcg_recharge_wq && memcg_nr_local_mapped_pages(memcg)) {
 		memcg->recharge_mapped_work.retries = 0;
 		queue_delayed_work(memcg_recharge_wq,
 				   &memcg->recharge_mapped_work.dwork, 0);
@@ -6605,6 +6609,9 @@ static bool should_do_deferred_recharge(struct folio *folio)
 	struct mem_cgroup *memcg;
 	bool ret;
 
+	if (!sysctl_recharge_offline_memcgs)
+		return false;
+
 	rcu_read_lock();
 	memcg = folio_memcg_rcu(folio);
 	ret = memcg && !!(memcg->css.flags & CSS_DYING);
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (6 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging Yosry Ahmed
@ 2023-07-20  7:08 ` Yosry Ahmed
  2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
  2023-07-21  0:02 ` Roman Gushchin
  9 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20  7:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt
  Cc: Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups, Yosry Ahmed

When a memcg is removed, any mapped pages charged to it are recharged to
the memcg of the process(es) mapping them. Any remaining pages are
recharged using deferred recharge on the next time they are accessed or
ditied. Add a selftest that exercises these paths for shmem and normal
files:
- A page is recharged on offlining if it is already mapped into the
  address space of a process in a different memcg.
- A page is recharged after offlining when written to by a process in a
  different memcg (if the write results in dirtying the page).
- A page is recharged after offlining when read by a process in a
  different memcg.
- A page is recharged after offlining when mapped by a process in a
  different memcg.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/testing/selftests/cgroup/cgroup_util.c  |  14 +
 tools/testing/selftests/cgroup/cgroup_util.h  |   1 +
 .../selftests/cgroup/test_memcontrol.c        | 310 ++++++++++++++++++
 3 files changed, 325 insertions(+)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index e8bbbdb77e0d..e853b2a4db77 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -519,6 +519,20 @@ int is_swap_enabled(void)
 	return cnt > 1;
 }
 
+
+int is_memcg_recharging_enabled(void)
+{
+	char buf[10];
+	bool enabled;
+
+	if (read_text("/proc/sys/vm/recharge_offline_memcgs",
+		      buf, sizeof(buf)) <= 0)
+		return -1;
+
+	enabled = strtol(buf, NULL, 10);
+	return enabled;
+}
+
 int set_oom_adj_score(int pid, int score)
 {
 	char path[PATH_MAX];
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index c92df4e5d395..10c0fa36bfd7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -49,6 +49,7 @@ extern int get_temp_fd(void);
 extern int alloc_pagecache(int fd, size_t size);
 extern int alloc_anon(const char *cgroup, void *arg);
 extern int is_swap_enabled(void);
+extern int is_memcg_recharging_enabled(void);
 extern int set_oom_adj_score(int pid, int score);
 extern int cg_wait_for_proc_count(const char *cgroup, int count);
 extern int cg_killall(const char *cgroup);
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c7c9572003a8..4e1ea93e0a54 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -17,6 +17,8 @@
 #include <netdb.h>
 #include <errno.h>
 #include <sys/mman.h>
+#include <sys/mount.h>
+#include <sched.h>
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -1287,6 +1289,313 @@ static int test_memcg_oom_group_score_events(const char *root)
 	return ret;
 }
 
+/* Map 50M from the beginning of a file */
+static int map_fd_50M_noexit(const char *cgroup, void *arg)
+{
+	size_t size = MB(50);
+	int ppid = getppid();
+	int fd = (long)arg;
+	char *memory;
+
+	memory = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+	if (memory == MAP_FAILED) {
+		fprintf(stderr, "error: mmap, errno %d\n", errno);
+		return -1;
+	}
+
+	while (getppid() == ppid)
+		sleep(1);
+
+	munmap(memory, size);
+	return 0;
+}
+
+/*
+ * Write 50M to the beginning of a file.
+ * The file is sync'ed first to make sure any dirty pages are laundered before
+ * we dirty them again.
+ */
+static int write_fd_50M(const char *cgroup, void *arg)
+{
+	size_t size = MB(50);
+	int fd = (long)arg;
+	char buf[PAGE_SIZE];
+	int i;
+
+	fsync(fd);
+	lseek(fd, 0, SEEK_SET);
+	for (i = 0; i < size; i += sizeof(buf))
+		write(fd, buf, sizeof(buf));
+
+	return 0;
+}
+
+/* See write_fd_50M() */
+static int write_fd_50M_noexit(const char *cgroup, void *arg)
+{
+	int ppid = getppid();
+
+	write_fd_50M(cgroup, arg);
+
+	while (getppid() == ppid)
+		sleep(1);
+
+	return 0;
+}
+
+/* Read 50M from the beginning of a file */
+static int read_fd_50M_noexit(const char *cgroup, void *arg)
+{
+	size_t size = MB(50);
+	int ppid = getppid();
+	int fd = (long)arg;
+	char buf[PAGE_SIZE];
+	int i;
+
+	lseek(fd, 0, SEEK_SET);
+	for (i = 0; i < size; i += sizeof(buf))
+		read(fd, buf, sizeof(buf));
+
+	while (getppid() == ppid)
+		sleep(1);
+
+	return 0;
+}
+
+#define TEST_RECHARGE_DIR "/test-recharge"
+
+static int __test_memcg_recharge(const char *root, char *stat_name)
+{
+	char *parent = NULL, *child1 = NULL, *child2 = NULL;
+	long stat, prev, pstat, current;
+	int ret = KSFT_FAIL;
+	char file_path[256];
+	int i, pid;
+	struct {
+		int fd;
+		int (*before_fn)(const char *cgroup, void *arg);
+		int (*after_fn)(const char *cgroup, void *arg);
+	} test_files[] = {
+		/* test recharge for already mapped file */
+		{
+			.before_fn = map_fd_50M_noexit,
+		},
+		/* test recharge on new mapping after offline */
+		{
+			.after_fn = map_fd_50M_noexit,
+		},
+		/* test recharge on write after offline */
+		{
+			.after_fn = write_fd_50M_noexit,
+		},
+		/* test recharge on read after offline */
+		{
+			.after_fn = read_fd_50M_noexit,
+		}
+	};
+
+	parent = cg_name(root, "parent");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	child1 = cg_name(parent, "child1");
+	if (!child1)
+		goto cleanup;
+
+	if (cg_create(child1))
+		goto cleanup;
+
+	child2 = cg_name(parent, "child2");
+	if (!child2)
+		goto cleanup;
+
+	if (cg_create(child2))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+		long target = MB(50) * (i+1); /* 50MB per file */
+		int fd;
+
+		snprintf(file_path, sizeof(file_path), "%s/file%d",
+			 TEST_RECHARGE_DIR, i);
+
+		fd = open(file_path, O_CREAT | O_RDWR);
+		if (fd < 0)
+			goto cleanup;
+
+		test_files[i].fd = fd;
+		if (cg_run(child1, write_fd_50M, (void *)(long) fd))
+			goto cleanup;
+
+		stat = 0;
+		do {
+			sleep(1);
+			prev = stat;
+			stat = cg_read_key_long(child1, "memory.stat",
+						stat_name);
+		} while (stat < target && stat > prev);
+
+		if (stat < target) {
+			fprintf(stderr, "error: child1 %s %ld < %ld",
+				stat_name, stat, target);
+			goto cleanup;
+		}
+
+		current = cg_read_long(child1, "memory.current");
+		if (current < target) {
+			fprintf(stderr, "error: child1 current %ld < %ld",
+				current, target);
+			goto cleanup;
+		}
+
+		if (test_files[i].before_fn) {
+			pid = cg_run_nowait(child2, test_files[i].before_fn,
+					    (void *)(long)fd);
+			if (pid < 0)
+				goto cleanup;
+			/* make sure before_fn() finishes executing before offlining */
+			sleep(1);
+		}
+	}
+
+	current = cg_read_long(child2, "memory.current");
+	if (current > MB(1)) {
+		fprintf(stderr, "error: child2 current %ld > 1M\n", current);
+		goto cleanup;
+	}
+
+	stat = cg_read_key_long(child2, "memory.stat", stat_name);
+	if (stat > 0) {
+		fprintf(stderr, "error: child2 %s %ld > 0\n",
+			stat_name, stat);
+		goto cleanup;
+	}
+
+	if (cg_destroy(child1) < 0)
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+		long target = MB(50) * (i+1);
+		int fd = test_files[i].fd;
+
+		if (test_files[i].after_fn) {
+			pid = cg_run_nowait(child2, test_files[i].after_fn,
+					    (void *)(long)fd);
+			if (pid < 0)
+				goto cleanup;
+		}
+
+		stat = 0;
+		do {
+			sleep(1);
+			prev = stat;
+			stat = cg_read_key_long(child2, "memory.stat",
+						stat_name);
+		} while (stat < target && stat > prev);
+
+		if (stat < target) {
+			fprintf(stderr, "error: child2 %s %ld < %ld\n",
+				stat_name, stat, target);
+			goto cleanup;
+		}
+
+		current = cg_read_long(child2, "memory.current");
+		if (current < target) {
+			fprintf(stderr, "error: child2 current %ld < %ld\n",
+				current, target);
+			goto cleanup;
+		}
+	}
+
+	pstat = cg_read_key_long(parent, "memory.stat", stat_name);
+	if (stat < pstat) {
+		fprintf(stderr, "error: recharged %s (%ld) < total (%ld)\n",
+			stat_name, stat, pstat);
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+cleanup:
+	if (child2) {
+		cg_destroy(child2);
+		free(child2);
+	}
+	if (child1) {
+		cg_destroy(child1);
+		free(child1);
+	}
+	if (parent) {
+		cg_destroy(parent);
+		free(parent);
+	}
+	for (i = 0; i < ARRAY_SIZE(test_files); i++) {
+		close(test_files[i].fd);
+		snprintf(file_path, sizeof(file_path), "%s/file%d",
+			 TEST_RECHARGE_DIR, i);
+		remove(file_path);
+	}
+	return ret;
+}
+
+static int test_memcg_recharge(const char *root)
+{
+	int i, ret = KSFT_PASS;
+	struct {
+		char *mount_type, *stat_name;
+	} test_setups[] = {
+		/* test both shmem & normal files */
+		{
+			.mount_type = "tmpfs",
+			.stat_name = "shmem",
+		},
+		{
+			.stat_name = "file",
+		}
+	};
+
+	if (!is_memcg_recharging_enabled())
+		return KSFT_SKIP;
+
+	if (unshare(CLONE_NEWNS) < 0)
+		return KSFT_FAIL;
+
+	if (mount(NULL, "/", "", MS_REC | MS_PRIVATE, NULL) < 0)
+		return KSFT_FAIL;
+
+	for (i = 0; i < ARRAY_SIZE(test_setups); i++) {
+		int setup_ret = KSFT_FAIL;
+		char *mount_type = test_setups[i].mount_type;
+		char *stat_name = test_setups[i].stat_name;
+
+		if (mkdir(TEST_RECHARGE_DIR, 0777) < 0)
+			goto next;
+
+		if (mount_type &&
+		    mount(NULL, TEST_RECHARGE_DIR, mount_type, 0, NULL) < 0)
+			goto next;
+
+		setup_ret = __test_memcg_recharge(root, stat_name);
+
+next:
+		if (mount_type)
+			umount(TEST_RECHARGE_DIR);
+		remove(TEST_RECHARGE_DIR);
+
+		if (setup_ret == KSFT_FAIL) {
+			ret = KSFT_FAIL;
+			break;
+		}
+	}
+	umount("/");
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct memcg_test {
 	int (*fn)(const char *root);
@@ -1306,6 +1615,7 @@ struct memcg_test {
 	T(test_memcg_oom_group_leaf_events),
 	T(test_memcg_oom_group_parent_events),
 	T(test_memcg_oom_group_score_events),
+	T(test_memcg_recharge),
 };
 #undef T
 
-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (7 preceding siblings ...)
  2023-07-20  7:08 ` [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging Yosry Ahmed
@ 2023-07-20 15:35 ` Johannes Weiner
  2023-07-20 19:57   ` Tejun Heo
                     ` (2 more replies)
  2023-07-21  0:02 ` Roman Gushchin
  9 siblings, 3 replies; 31+ messages in thread
From: Johannes Weiner @ 2023-07-20 15:35 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups

On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
> 
> Two methods of recharging are proposed:
> 
> (a) Recharging of mapped folios.
> 
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
> 
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
> 
> (b) Deferred recharging of folios.
> 
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.

I'm super skeptical of this proposal.

Recharging *might* be the most desirable semantics from a user pov,
but only if it applies consistently to the whole memory footprint.
There is no mention of slab allocations such as inodes, dentries,
network buffers etc. which can be a significant part of a cgroup's
footprint. These are currently reparented. I don't think doing one
thing with half of the memory, and a totally different thing with the
other half upon cgroup deletion is going to be acceptable semantics.

It appears this also brings back the reliability issue that caused us
to deprecate charge moving. The recharge path has trylocks, LRU
isolation attempts, GFP_ATOMIC allocations. These introduce a variable
error rate into the relocation process, which causes pages that should
belong to the same domain to be scattered around all over the place.
It also means that zombie pinning still exists, but it's now even more
influenced by timing and race conditions, and so less predictable.

There are two issues being conflated here:

a) the problem of zombie cgroups, and

b) who controls resources that outlive the control domain.

For a), reparenting is still the most reasonable proposal. It's
reliable for one, but it also fixes the problem fully within the
established, user-facing semantics: resources that belong to a cgroup
also hierarchically belong to all ancestral groups; if those resources
outlive the last-level control domain, they continue to belong to the
parents. This is how it works today, and this is how it continues to
work with reparenting. The only difference is that those resources no
longer pin a dead cgroup anymore, but instead are physically linked to
the next online ancestor. Since dead cgroups have no effective control
parameters anymore, this is semantically equivalent - it's just a more
memory efficient implementation of the same exact thing.

b) is a discussion totally separate from this. We can argue what we
want this behavior to be, but I'd argue strongly that whatever we do
here should apply to all resources managed by the controller equally.

It could also be argued that if you don't want to lose control over a
set of resources, then maybe don't delete their control domain while
they are still alive and in use. For example, when restarting a
workload, and the new instance is expected to have largely the same
workingset, consider reusing the cgroup instead of making a new one.

For the zombie problem, I think we should merge Muchun's patches
ASAP. They've been proposed several times, they have Roman's reviews
and acks, and they do not change user-facing semantics. There is no
good reason not to merge them.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
  2023-07-20  7:08 ` [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging Yosry Ahmed
@ 2023-07-20 18:13   ` Luis Chamberlain
  2023-07-20 18:24     ` Yosry Ahmed
  0 siblings, 1 reply; 31+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> a config option to control whether or not it is enabled by default.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/memcontrol.h |  2 ++
>  kernel/sysctl.c            | 11 +++++++++++
>  mm/Kconfig                 | 12 ++++++++++++
>  mm/memcontrol.c            |  9 ++++++++-
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 59b653d4a76e..ae9f09ee90cb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
>  
>  #ifdef CONFIG_MEMCG
>  
> +extern int sysctl_recharge_offline_memcgs;
> +
>  #define MEM_CGROUP_ID_SHIFT	16
>  #define MEM_CGROUP_ID_MAX	USHRT_MAX
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 354a2d294f52..1735d1d95652 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
>  		.extra2		= (void *)&mmap_rnd_compat_bits_max,
>  	},
>  #endif
> +#ifdef CONFIG_MEMCG
> +	{
> +		.procname	= "recharge_offline_memcgs",
> +		.data		= &sysctl_recharge_offline_memcgs,
> +		.maxlen		= sizeof(sysctl_recharge_offline_memcgs),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
> +#endif /* CONFIG_MEMCG */
>  	{ }
>  };

Please don't add any more sysctls to kernel/sysctl.c, git log that file
for a series of cleanups which show how to use your own and why we have
been doing that cleanup.

  Luis


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
  2023-07-20 18:13   ` Luis Chamberlain
@ 2023-07-20 18:24     ` Yosry Ahmed
  2023-07-20 18:30       ` Luis Chamberlain
  0 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20 18:24 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > a config option to control whether or not it is enabled by default.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  include/linux/memcontrol.h |  2 ++
> >  kernel/sysctl.c            | 11 +++++++++++
> >  mm/Kconfig                 | 12 ++++++++++++
> >  mm/memcontrol.c            |  9 ++++++++-
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 59b653d4a76e..ae9f09ee90cb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> >
> >  #ifdef CONFIG_MEMCG
> >
> > +extern int sysctl_recharge_offline_memcgs;
> > +
> >  #define MEM_CGROUP_ID_SHIFT  16
> >  #define MEM_CGROUP_ID_MAX    USHRT_MAX
> >
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 354a2d294f52..1735d1d95652 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> >               .extra2         = (void *)&mmap_rnd_compat_bits_max,
> >       },
> >  #endif
> > +#ifdef CONFIG_MEMCG
> > +     {
> > +             .procname       = "recharge_offline_memcgs",
> > +             .data           = &sysctl_recharge_offline_memcgs,
> > +             .maxlen         = sizeof(sysctl_recharge_offline_memcgs),
> > +             .mode           = 0644,
> > +             .proc_handler   = proc_dointvec_minmax,
> > +             .extra1         = SYSCTL_ZERO,
> > +             .extra2         = SYSCTL_ONE,
> > +     },
> > +#endif /* CONFIG_MEMCG */
> >       { }
> >  };
>
> Please don't add any more sysctls to kernel/sysctl.c, git log that file
> for a series of cleanups which show how to use your own and why we have
> been doing that cleanup.

Thanks for pointing this out, I definitely missed it. Will do that in
the next version. I guess this will also reduce the reviewer churn if
I won't be touching kernel/sysctl.c?

>
>   Luis


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging
  2023-07-20 18:24     ` Yosry Ahmed
@ 2023-07-20 18:30       ` Luis Chamberlain
  0 siblings, 0 replies; 31+ messages in thread
From: Luis Chamberlain @ 2023-07-20 18:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 11:24:20AM -0700, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 11:13 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:08:24AM +0000, Yosry Ahmed wrote:
> > > Add a sysctl to enable/disable memory recharging for offline memcgs. Add
> > > a config option to control whether or not it is enabled by default.
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > ---
> > >  include/linux/memcontrol.h |  2 ++
> > >  kernel/sysctl.c            | 11 +++++++++++
> > >  mm/Kconfig                 | 12 ++++++++++++
> > >  mm/memcontrol.c            |  9 ++++++++-
> > >  4 files changed, 33 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 59b653d4a76e..ae9f09ee90cb 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -60,6 +60,8 @@ struct mem_cgroup_reclaim_cookie {
> > >
> > >  #ifdef CONFIG_MEMCG
> > >
> > > +extern int sysctl_recharge_offline_memcgs;
> > > +
> > >  #define MEM_CGROUP_ID_SHIFT  16
> > >  #define MEM_CGROUP_ID_MAX    USHRT_MAX
> > >
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 354a2d294f52..1735d1d95652 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -2249,6 +2249,17 @@ static struct ctl_table vm_table[] = {
> > >               .extra2         = (void *)&mmap_rnd_compat_bits_max,
> > >       },
> > >  #endif
> > > +#ifdef CONFIG_MEMCG
> > > +     {
> > > +             .procname       = "recharge_offline_memcgs",
> > > +             .data           = &sysctl_recharge_offline_memcgs,
> > > +             .maxlen         = sizeof(sysctl_recharge_offline_memcgs),
> > > +             .mode           = 0644,
> > > +             .proc_handler   = proc_dointvec_minmax,
> > > +             .extra1         = SYSCTL_ZERO,
> > > +             .extra2         = SYSCTL_ONE,
> > > +     },
> > > +#endif /* CONFIG_MEMCG */
> > >       { }
> > >  };
> >
> > Please don't add any more sysctls to kernel/sysctl.c, git log that file
> > for a series of cleanups which show how to use your own and why we have
> > been doing that cleanup.
> 
> Thanks for pointing this out, I definitely missed it. Will do that in
> the next version. I guess this will also reduce the reviewer churn if
> I won't be touching kernel/sysctl.c?

Right, it means I don't have to care anymore about random sysctl knobs.
Let people knob it all up.

  Luis


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
@ 2023-07-20 19:57   ` Tejun Heo
  2023-07-20 21:34     ` Yosry Ahmed
  2023-07-20 21:33   ` Yosry Ahmed
  2023-08-01  9:54   ` Michal Hocko
  2 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2023-07-20 19:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.

Or just create a nesting layer so that there's a cgroup which represents the
persistent resources and a nested cgroup instance inside representing the
current instance.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
  2023-07-20 19:57   ` Tejun Heo
@ 2023-07-20 21:33   ` Yosry Ahmed
  2023-08-01  9:54   ` Michal Hocko
  2 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups

On Thu, Jul 20, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> I'm super skeptical of this proposal.

I expected this :)

>
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.

I think, as you say, recharging has the most desirable semantics
because the charge is maintained where it *should* be (with who is
actually using it). We simply cannot do that for kernel memory,
because we have no way of attributing it to a user. On the other hand,
we *can* attribute user memory to a user. Consistency is great, but
our inability to do (arguably) the right thing for one type of memory,
doesn't mean we shouldn't do it when we can. I would also argue that
user memory (anon/file pages) would commonly be the larger portion of
memory on a machine compared to kernel memory (e.g. slab).

>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process,

Recharging is naturally best effort, because it's non-disruptive.
After a memcg dies, the kernel continuously tries to move the charges
away from it on every chance it gets. If it fails one time that's
fine, there will be other chances. Compared to the status quo, it is
definitely better than just leaving all the memory behind with the
zombie memcg. I would argue that over time (and accesses), most/all
memory should eventually get recharged. If not, something is not
working correctly, or a wrong assumption is being made.

> which causes pages that should
> belong to the same domain to be scattered around all over the place.

I strongly disagree with this point. Ideally, yes, memory charged to a
memcg would belong to the same domain. In practice, due to the first
touch charging semantics, this is far from the truth. For anonymous
memory, sure, they all belong to the same domain (mostly), the process
they belong to. But most of anonymous memory will go away when the
process dies anyway, the problem is mostly with shared resources (e.g.
file, tmpfs, ..). With file/tmpfs memory, the charging behavior is
random. The first memcg that touches a page gets charged for it.
Consequently, the file/tmpfs memory charged to a memcg would be a
mixture of pages from different files in different mounts, definitely
not a single domain. Perhaps with some workloads, where each memcg is
accessing different files, most memory charged to a memcg will belong
to the same domain, but in this case, recharging wouldn't move it away
anyway.

> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.

It still exists, but it is improved. The kernel tries to move charges
away from zombies on every chance it gets instead of doing nothing
about it. It is less predictable, can't argue about this, but it can't
get worse, only better.

>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.

I agree that reparenting is more deterministic and reliable, but there
are two major flaws off the top of my head:

(1) If a memcg touches a page one time and gets charged for it, the
charge is stuck in its hierarchy forever. It can get reparented, but
it will never be charged to whoever is actually using it again, unless
it is reclaimed and refaulted (in some cases).

Consider this hierarchy:
    root
  /       \
A        B
            \
            C

Consider a case where memcg C touches a library file once, and gets
charged for some memory, and then dies. The memory gets reparente to
memcg B. Meanwhile, memcg A is continuously using the memory that
memcg B is charged for. memcg B would be indefinitely taxed by memcg
A. The only way out is if memcg B hit its limit, and the pages get
reclaimed, and then refaulted and recharged to memcg A. In some cases
(e.g. tmpfs), even then the memory would still get charged to memcg B.
There is no way to get rid of the charge until the resource itself is
freed.

This problem exists today, even without reparenting, with the
difference being that the charge will remain with C instead of B.
Recharging offers a better alternative where the charge will be
correctly moved to A, the "rightful" owner.

(2) In the above scenario, when memcg B dies, the memory will be
reparented to the root. That's even worse. Now memcg A is using memory
that is not accounted for anywhere, essentially an accounting leak.
From an admin perspective, the memory charged to root is system
overhead, it is lost capacity. For long-living systems, as memcgs are
created and destroyed for different workloads, memory will keep
accumulating at the root. The machine will keep leaking capacity over
time, and accounting becomes less and less accurate as more memory
becomes charged to the root.

>
> b) is a discussion totally separate from this.

I would argue that the zombie problem is (at least partially) an
artifact of the shared/sticky resources problem. If all resources are
used by one memcg and do not outlive it, we wouldn't have zombies.

> We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.

User memory and kernel memory are very different in nature. Ideally
yeah, we want to treat all resources equally. But user memory is
naturally more attributable to users and easier to account correctly
than kernel memory.

>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use.

This is easier said than done :) As I mentioned earlier, the charging
semantics are inherently indeterministic for shared resources (e.g.
file/tmpfs). The user cannot control or monitor which resources belong
to which control domain. Each memcg in the system could be charged for
one page from each file in a shared library for all that matters :)

> For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.

In a large fleet with many different jobs getting rescheduled and
restarted on different machines, it's really hard in practice to do
so. We can keep the same cgroup if the same workload is being
restarted on the same machine, sure, but most of the time there's a
new workload arriving or so. We can't reuse containers in this case.

>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.

There are some, which I pointed out above.

All in all, I understand where you are coming from. Your concerns are
valid. Recharging is not a perfect approach, but it is arguably the
best we can do at this point. Being indeterministic sucks, but our
charging semantics are inherently indeterministic anyway.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 19:57   ` Tejun Heo
@ 2023-07-20 21:34     ` Yosry Ahmed
  2023-07-20 22:11       ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20 21:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 12:57 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> > It could also be argued that if you don't want to lose control over a
> > set of resources, then maybe don't delete their control domain while
> > they are still alive and in use. For example, when restarting a
> > workload, and the new instance is expected to have largely the same
> > workingset, consider reusing the cgroup instead of making a new one.
>
> Or just create a nesting layer so that there's a cgroup which represents the
> persistent resources and a nested cgroup instance inside representing the
> current instance.

In practice it is not easy to know exactly which resources are shared
and used by which cgroups, especially in a large dynamic environment.

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 21:34     ` Yosry Ahmed
@ 2023-07-20 22:11       ` Tejun Heo
  2023-07-20 22:23         ` Yosry Ahmed
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2023-07-20 22:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

Hello,

On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > Or just create a nesting layer so that there's a cgroup which represents the
> > persistent resources and a nested cgroup instance inside representing the
> > current instance.
> 
> In practice it is not easy to know exactly which resources are shared
> and used by which cgroups, especially in a large dynamic environment.

Yeah, that only covers when resource persistence is confined in a known
scope. That said, I have a hard time seeing how recharding once after cgroup
destruction can be a solution for the situations you describe. What if A
touches it once first, B constantly uses it but C only very occasionally and
after A dies C ends up owning it due to timing. This is very much possible
in a large dynamic environment but neither the initial or final situation is
satisfactory.

To solve the problems you're describing, you actually would have to
guarantee that memory pages are charged to the current majority user (or
maybe even spread across current active users). Maybe it can be argued that
this is a step towards that but it's a very partial step and at least would
need a technically viable direction that this development can follow.

On its own, AFAICS, I'm not sure the scope of problems it can actually solve
is justifiably greater than what can be achieved with simple nesting.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 22:11       ` Tejun Heo
@ 2023-07-20 22:23         ` Yosry Ahmed
  2023-07-20 22:31           ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-20 22:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 3:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > > Or just create a nesting layer so that there's a cgroup which represents the
> > > persistent resources and a nested cgroup instance inside representing the
> > > current instance.
> >
> > In practice it is not easy to know exactly which resources are shared
> > and used by which cgroups, especially in a large dynamic environment.
>
> Yeah, that only covers when resource persistence is confined in a known
> scope. That said, I have a hard time seeing how recharding once after cgroup
> destruction can be a solution for the situations you describe. What if A
> touches it once first, B constantly uses it but C only very occasionally and
> after A dies C ends up owning it due to timing. This is very much possible
> in a large dynamic environment but neither the initial or final situation is
> satisfactory.

That is indeed possible, but it would be more likely that the charge
is moved to B. As I said, it's not perfect, but it is an improvement
over what we have today. Even if C ends up owning it, it's better than
staying with the dead A.

>
> To solve the problems you're describing, you actually would have to
> guarantee that memory pages are charged to the current majority user (or
> maybe even spread across current active users). Maybe it can be argued that
> this is a step towards that but it's a very partial step and at least would
> need a technically viable direction that this development can follow.

Right, that would be a much larger effort (arguably memcg v3 ;) ).
This proposal is focused on the painful artifact of the sharing/sticky
resources problem: zombie memcgs. We can extend the automatic charge
movement semantics later to cover more cases or be smarter, or ditch
the existing charging semantics completely and start over with
sharing/stickiness in mind. Either way, that would be a long-term
effort. There is a problem that exists today though that ideally can
be fixed/improved by this proposal.

>
> On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> is justifiably greater than what can be achieved with simple nesting.

In our use case nesting is not a viable option. As I said, in a large
fleet where a lot of different workloads are dynamically being
scheduled on different machines, and where there is no way of knowing
what resources are being shared among what workloads, and even if we
do, it wouldn't be constant, it's very difficult to construct the
hierarchy with nesting to keep the resources confined.

Keep in mind that the environment is dynamic, workloads are constantly
coming and going. Even if find the perfect nesting to appropriately
scope resources, some rescheduling may render the hierarchy obsolete
and require us to start over.

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 22:23         ` Yosry Ahmed
@ 2023-07-20 22:31           ` Tejun Heo
  2023-07-20 23:24             ` T.J. Mercier
  2023-07-21 18:15             ` Yosry Ahmed
  0 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2023-07-20 22:31 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

Hello,

On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > is justifiably greater than what can be achieved with simple nesting.
> 
> In our use case nesting is not a viable option. As I said, in a large
> fleet where a lot of different workloads are dynamically being
> scheduled on different machines, and where there is no way of knowing
> what resources are being shared among what workloads, and even if we
> do, it wouldn't be constant, it's very difficult to construct the
> hierarchy with nesting to keep the resources confined.

Hmm... so, usually, the problems we see are resources that are persistent
across different instances of the same application as they may want to share
large chunks of memory like on-memory cache. I get that machines get
different dynamic jobs but unrelated jobs usually don't share huge amount of
memory at least in our case. The sharing across them comes down to things
like some common library pages which don't really account for much these
days.

> Keep in mind that the environment is dynamic, workloads are constantly
> coming and going. Even if find the perfect nesting to appropriately
> scope resources, some rescheduling may render the hierarchy obsolete
> and require us to start over.

Can you please go into more details on how much memory is shared for what
across unrelated dynamic workloads? That sounds different from other use
cases.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 22:31           ` Tejun Heo
@ 2023-07-20 23:24             ` T.J. Mercier
  2023-07-20 23:33               ` Tejun Heo
  2023-07-21 18:15             ` Yosry Ahmed
  1 sibling, 1 reply; 31+ messages in thread
From: T.J. Mercier @ 2023-07-20 23:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Johannes Weiner, Andrew Morton, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song,
	Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > > is justifiably greater than what can be achieved with simple nesting.
> >
> > In our use case nesting is not a viable option. As I said, in a large
> > fleet where a lot of different workloads are dynamically being
> > scheduled on different machines, and where there is no way of knowing
> > what resources are being shared among what workloads, and even if we
> > do, it wouldn't be constant, it's very difficult to construct the
> > hierarchy with nesting to keep the resources confined.
>
> Hmm... so, usually, the problems we see are resources that are persistent
> across different instances of the same application as they may want to share
> large chunks of memory like on-memory cache. I get that machines get
> different dynamic jobs but unrelated jobs usually don't share huge amount of
> memory at least in our case. The sharing across them comes down to things
> like some common library pages which don't really account for much these
> days.
>
This has also been my experience in terms of bytes of memory that are
incorrectly charged (because they're charged to a zombie), but that is
because memcg doesn't currently track the large shared allocations in
my case (primarily dma-buf). The greater issue I've seen so far is the
number of zombie cgroups that can accumulate over time. But my
understanding is that both of these two problems are currently
significant for Yosry's case.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 23:24             ` T.J. Mercier
@ 2023-07-20 23:33               ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2023-07-20 23:33 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Yosry Ahmed, Johannes Weiner, Andrew Morton, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song,
	Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Greg Thelen, linux-kernel, linux-mm, cgroups

Hello,

On Thu, Jul 20, 2023 at 04:24:02PM -0700, T.J. Mercier wrote:
> > Hmm... so, usually, the problems we see are resources that are persistent
> > across different instances of the same application as they may want to share
> > large chunks of memory like on-memory cache. I get that machines get
> > different dynamic jobs but unrelated jobs usually don't share huge amount of
> > memory at least in our case. The sharing across them comes down to things
> > like some common library pages which don't really account for much these
> > days.
> >
> This has also been my experience in terms of bytes of memory that are
> incorrectly charged (because they're charged to a zombie), but that is
> because memcg doesn't currently track the large shared allocations in
> my case (primarily dma-buf). The greater issue I've seen so far is the
> number of zombie cgroups that can accumulate over time. But my
> understanding is that both of these two problems are currently
> significant for Yosry's case.

memcg already does reparenting of slab pages to lower the number of dying
cgroups and maybe it makes sense to expand that to user memory too. One
related thing is that if those reparented pages are written to, that's gonna
break IO isolation w/ blk-iocost because iocost currently bypasses IOs from
intermediate cgroups to root but we can fix that. Anyways, that's something
pretty different from what's proposed here. Reparenting, I think, is a lot
less conroversial.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
                   ` (8 preceding siblings ...)
  2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
@ 2023-07-21  0:02 ` Roman Gushchin
  2023-07-21  0:07   ` Yosry Ahmed
  9 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2023-07-21  0:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups

On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
> 
> Two methods of recharging are proposed:
> 
> (a) Recharging of mapped folios.
> 
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
> 
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
> 
> (b) Deferred recharging of folios.
> 
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.

Unfortunately I have to agree with Johannes, Tejun and others who are not big
fans of this approach.

Lazy recharging leads to an interesting phenomena: a memory usage of a running
workload may suddenly go up only because some other workload is terminated and
now it's memory is being recharged. I find it confusing. It also makes hard
to set up limits and/or guarantees.

In general, I don't think we can handle shared memory well without getting rid
of "whoever allocates a page, pays the full price" policy and making a shared
ownership a fully supported concept. Of course, it's a huge work and I believe
the only way we can achieve it is to compromise on the granularity of the
accounting. Will the resulting system be better in the real life, it's hard to
say in advance.

Thanks!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21  0:02 ` Roman Gushchin
@ 2023-07-21  0:07   ` Yosry Ahmed
  0 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-21  0:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups

On Thu, Jul 20, 2023 at 5:02 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> Unfortunately I have to agree with Johannes, Tejun and others who are not big
> fans of this approach.
>
> Lazy recharging leads to an interesting phenomena: a memory usage of a running
> workload may suddenly go up only because some other workload is terminated and
> now it's memory is being recharged. I find it confusing. It also makes hard
> to set up limits and/or guarantees.

This can happen today.

If memcg A starts accessing some memory and gets charged for it, and
then memcg B also accesses it, it will not be charged for it. If at a
later point memcg A runs into reclaim, and the memory is freed, then
memcg B tries to access it, its usage will suddenly go up as well,
because some other workload experienced reclaim. This is a very
similar scenario, only instead of reclaim, the memcg was offlined. As
a matter of fact, it's common to try to free up a memcg before
removing it (by lowering the limit or using memory.reclaim). In that
case, the net result would be exactly the same -- with the difference
being that recharging will avoid freeing the memory and faulting it
back in.

>
> In general, I don't think we can handle shared memory well without getting rid
> of "whoever allocates a page, pays the full price" policy and making a shared
> ownership a fully supported concept. Of course, it's a huge work and I believe
> the only way we can achieve it is to compromise on the granularity of the
> accounting. Will the resulting system be better in the real life, it's hard to
> say in advance.
>
> Thanks!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 22:31           ` Tejun Heo
  2023-07-20 23:24             ` T.J. Mercier
@ 2023-07-21 18:15             ` Yosry Ahmed
  2023-07-21 18:26               ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-21 18:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > > is justifiably greater than what can be achieved with simple nesting.
> >
> > In our use case nesting is not a viable option. As I said, in a large
> > fleet where a lot of different workloads are dynamically being
> > scheduled on different machines, and where there is no way of knowing
> > what resources are being shared among what workloads, and even if we
> > do, it wouldn't be constant, it's very difficult to construct the
> > hierarchy with nesting to keep the resources confined.
>
> Hmm... so, usually, the problems we see are resources that are persistent
> across different instances of the same application as they may want to share
> large chunks of memory like on-memory cache. I get that machines get
> different dynamic jobs but unrelated jobs usually don't share huge amount of

I am digging deeper to get more information for you. One thing I know
now is that different instances of the same job are contained within a
common parent, and we even use our previously proposed memcg= mount
option for tmpfs to charge their shared resources to a common parent.
So restarting tasks is not the problem we are seeing.

> memory at least in our case. The sharing across them comes down to things
> like some common library pages which don't really account for much these
> days.

Keep in mind that even a single page charged to a memcg and used by
another memcg is sufficient to result in a zombie memcg.

>
> > Keep in mind that the environment is dynamic, workloads are constantly
> > coming and going. Even if find the perfect nesting to appropriately
> > scope resources, some rescheduling may render the hierarchy obsolete
> > and require us to start over.
>
> Can you please go into more details on how much memory is shared for what
> across unrelated dynamic workloads? That sounds different from other use
> cases.

I am trying to collect more information from our fleet, but the
application restarting in a different cgroup is not what is happening
in our case. It is not easy to find out exactly what is going on on
machines and where the memory is coming from due to the
indeterministic nature of charging. The goal of this proposal is to
let the kernel handle leftover memory in zombie memcgs because it is
not always obvious to userspace what's going on (like it's not obvious
to me now where exactly is the sharing happening :) ).

One thing to note is that in some cases, maybe a userspace bug or
failed cleanup is a reason for the zombie memcgs. Ideally, this
wouldn't happen, but it would be nice to have a fallback mechanism in
the kernel if it does.

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 18:15             ` Yosry Ahmed
@ 2023-07-21 18:26               ` Tejun Heo
  2023-07-21 18:47                 ` Yosry Ahmed
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2023-07-21 18:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

Hello,

On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > memory at least in our case. The sharing across them comes down to things
> > like some common library pages which don't really account for much these
> > days.
> 
> Keep in mind that even a single page charged to a memcg and used by
> another memcg is sufficient to result in a zombie memcg.

I mean, yeah, that's a separate issue or rather a subset which isn't all
that controversial. That can be deterministically solved by reparenting to
the parent like how slab is handled. I think the "deterministic" part is
important here. As you said, even a single page can pin a dying cgroup.

> > > Keep in mind that the environment is dynamic, workloads are constantly
> > > coming and going. Even if find the perfect nesting to appropriately
> > > scope resources, some rescheduling may render the hierarchy obsolete
> > > and require us to start over.
> >
> > Can you please go into more details on how much memory is shared for what
> > across unrelated dynamic workloads? That sounds different from other use
> > cases.
> 
> I am trying to collect more information from our fleet, but the
> application restarting in a different cgroup is not what is happening
> in our case. It is not easy to find out exactly what is going on on

This is the point that Johannes raised but I don't think the current
proposal would make things more deterministic. From what I can see, it
actually pushes it towards even less predictability. Currently, yeah, some
pages may end up in cgroups which aren't the majority user but it at least
is clear how that would happen. The proposed change adds layers of
indeterministic behaviors on top. I don't think that's the direction we want
to go.

> machines and where the memory is coming from due to the
> indeterministic nature of charging. The goal of this proposal is to
> let the kernel handle leftover memory in zombie memcgs because it is
> not always obvious to userspace what's going on (like it's not obvious
> to me now where exactly is the sharing happening :) ).
>
> One thing to note is that in some cases, maybe a userspace bug or
> failed cleanup is a reason for the zombie memcgs. Ideally, this
> wouldn't happen, but it would be nice to have a fallback mechanism in
> the kernel if it does.

I'm not disagreeing on that. Our handling of pages owned by dying cgroups
isn't great but I don't think the proposed change is an acceptable solution.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 18:26               ` Tejun Heo
@ 2023-07-21 18:47                 ` Yosry Ahmed
  2023-07-21 19:18                   ` Tejun Heo
  2023-07-21 20:44                   ` Johannes Weiner
  0 siblings, 2 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-21 18:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > memory at least in our case. The sharing across them comes down to things
> > > like some common library pages which don't really account for much these
> > > days.
> >
> > Keep in mind that even a single page charged to a memcg and used by
> > another memcg is sufficient to result in a zombie memcg.
>
> I mean, yeah, that's a separate issue or rather a subset which isn't all
> that controversial. That can be deterministically solved by reparenting to
> the parent like how slab is handled. I think the "deterministic" part is
> important here. As you said, even a single page can pin a dying cgroup.

There are serious flaws with reparenting that I mentioned above. We do
it for kernel memory, but that's because we really have no other
choice. Oftentimes the memory is not reclaimable and we cannot find an
owner for it. This doesn't mean it's the right answer for user memory.

The semantics are new compared to normal charging (as opposed to
recharging, as I explain below). There is an extra layer of
indirection that we did not (as far as I know) measure the impact of.
Parents end up with pages that they never used and we have no
observability into where it came from. Most importantly, over time
user memory will keep accumulating at the root, reducing the accuracy
and usefulness of accounting, effectively an accounting leak and
reduction of capacity. Memory that is not attributed to any user, aka
system overhead.

>
> > > > Keep in mind that the environment is dynamic, workloads are constantly
> > > > coming and going. Even if find the perfect nesting to appropriately
> > > > scope resources, some rescheduling may render the hierarchy obsolete
> > > > and require us to start over.
> > >
> > > Can you please go into more details on how much memory is shared for what
> > > across unrelated dynamic workloads? That sounds different from other use
> > > cases.
> >
> > I am trying to collect more information from our fleet, but the
> > application restarting in a different cgroup is not what is happening
> > in our case. It is not easy to find out exactly what is going on on
>
> This is the point that Johannes raised but I don't think the current
> proposal would make things more deterministic. From what I can see, it
> actually pushes it towards even less predictability. Currently, yeah, some
> pages may end up in cgroups which aren't the majority user but it at least
> is clear how that would happen. The proposed change adds layers of
> indeterministic behaviors on top. I don't think that's the direction we want
> to go.

I believe recharging is being mis-framed here :)

Recharging semantics are not new, it is a shortcut to a process that
is already happening that is focused on offline memcgs. Let's take a
step back.

It is common practice (at least in my knowledge) to try to reclaim
memory from a cgroup before deleting it (by lowering the limit or
using memory.reclaim). Reclaim heuristics are biased towards
reclaiming memory from offline cgroups. After the memory is reclaimed,
if it is used again by a different process, it will be refaulted and
charged again (aka recharged) to the new

What recharging is doing is *not* anything new. It is effectively
doing what reclaim + refault would do above, with an efficient
shortcut. It avoids the unnecessary fault, avoids disrupting the
workload that will access the memory after it is reclaimed, and cleans
up zombie memcgs memory faster than reclaim would. Moreover, it works
for memory that may not be reclaimable (e.g. because of lack of swap).

All the indeterministic behaviors in recharging are exactly the
indeterministic behaviors in reclaim. It is very similar. We iterate
the lrus, try to isolate and lock folios, etc. This is what reclaim
does. Recharging is basically lightweight reclaim + charging again (as
opposed to fully reclaiming the memory then refaulting it).

We are not introducing new indeterminism or charging semantics.
Recharging does exactly what would happen when we reclaim zombie
memory. It is just more efficient and accelerated.

>
> > machines and where the memory is coming from due to the
> > indeterministic nature of charging. The goal of this proposal is to
> > let the kernel handle leftover memory in zombie memcgs because it is
> > not always obvious to userspace what's going on (like it's not obvious
> > to me now where exactly is the sharing happening :) ).
> >
> > One thing to note is that in some cases, maybe a userspace bug or
> > failed cleanup is a reason for the zombie memcgs. Ideally, this
> > wouldn't happen, but it would be nice to have a fallback mechanism in
> > the kernel if it does.
>
> I'm not disagreeing on that. Our handling of pages owned by dying cgroups
> isn't great but I don't think the proposed change is an acceptable solution.

 I hope the above arguments change your mind :)

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 18:47                 ` Yosry Ahmed
@ 2023-07-21 19:18                   ` Tejun Heo
  2023-07-21 20:37                     ` Yosry Ahmed
  2023-07-21 20:44                   ` Johannes Weiner
  1 sibling, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2023-07-21 19:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

Hello,

On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > memory at least in our case. The sharing across them comes down to things
> > > > like some common library pages which don't really account for much these
> > > > days.
> > >
> > > Keep in mind that even a single page charged to a memcg and used by
> > > another memcg is sufficient to result in a zombie memcg.
> >
> > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > that controversial. That can be deterministically solved by reparenting to
> > the parent like how slab is handled. I think the "deterministic" part is
> > important here. As you said, even a single page can pin a dying cgroup.
> 
> There are serious flaws with reparenting that I mentioned above. We do
> it for kernel memory, but that's because we really have no other
> choice. Oftentimes the memory is not reclaimable and we cannot find an
> owner for it. This doesn't mean it's the right answer for user memory.
> 
> The semantics are new compared to normal charging (as opposed to
> recharging, as I explain below). There is an extra layer of
> indirection that we did not (as far as I know) measure the impact of.
> Parents end up with pages that they never used and we have no
> observability into where it came from. Most importantly, over time
> user memory will keep accumulating at the root, reducing the accuracy
> and usefulness of accounting, effectively an accounting leak and
> reduction of capacity. Memory that is not attributed to any user, aka
> system overhead.

That really sounds like the setup is missing cgroup layers tracking
persistent resources. Most of the problems you describe can be solved by
adding cgroup layers at the right spots which would usually align with the
logical structure of the system, right?

...
> I believe recharging is being mis-framed here :)
> 
> Recharging semantics are not new, it is a shortcut to a process that
> is already happening that is focused on offline memcgs. Let's take a
> step back.

Yeah, it does sound better when viewed that way. I'm still not sure what
extra problems it solves tho. We experienced similar problems but AFAIK all
of them came down to needing the appropriate hierarchical structure to
capture how resources are being used on systems.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 19:18                   ` Tejun Heo
@ 2023-07-21 20:37                     ` Yosry Ahmed
  0 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-21 20:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Fri, Jul 21, 2023 at 12:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > > memory at least in our case. The sharing across them comes down to things
> > > > > like some common library pages which don't really account for much these
> > > > > days.
> > > >
> > > > Keep in mind that even a single page charged to a memcg and used by
> > > > another memcg is sufficient to result in a zombie memcg.
> > >
> > > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > > that controversial. That can be deterministically solved by reparenting to
> > > the parent like how slab is handled. I think the "deterministic" part is
> > > important here. As you said, even a single page can pin a dying cgroup.
> >
> > There are serious flaws with reparenting that I mentioned above. We do
> > it for kernel memory, but that's because we really have no other
> > choice. Oftentimes the memory is not reclaimable and we cannot find an
> > owner for it. This doesn't mean it's the right answer for user memory.
> >
> > The semantics are new compared to normal charging (as opposed to
> > recharging, as I explain below). There is an extra layer of
> > indirection that we did not (as far as I know) measure the impact of.
> > Parents end up with pages that they never used and we have no
> > observability into where it came from. Most importantly, over time
> > user memory will keep accumulating at the root, reducing the accuracy
> > and usefulness of accounting, effectively an accounting leak and
> > reduction of capacity. Memory that is not attributed to any user, aka
> > system overhead.
>
> That really sounds like the setup is missing cgroup layers tracking
> persistent resources. Most of the problems you describe can be solved by
> adding cgroup layers at the right spots which would usually align with the
> logical structure of the system, right?

It is difficult to track down all persistent/shareable resources and
find the users, especially when both the resources and the users are
dynamically changed. A simple example is text files for a shared
library or sidecar processes that run with different workloads and
need to have their usage charged to the workload, but they may have
memory. For those cases there is no layering that would work. More
practically, sometimes userspace just doesn't even know what exactly
is being shared by whom.

>
> ...
> > I believe recharging is being mis-framed here :)
> >
> > Recharging semantics are not new, it is a shortcut to a process that
> > is already happening that is focused on offline memcgs. Let's take a
> > step back.
>
> Yeah, it does sound better when viewed that way. I'm still not sure what
> extra problems it solves tho. We experienced similar problems but AFAIK all
> of them came down to needing the appropriate hierarchical structure to
> capture how resources are being used on systems.

It solves the problem of zombie memcgs and unaccounted memory. It is
great that in some cases an appropriate hierarchy structure fixes the
problem by accurately capturing how resources are being shared, but in
some cases it's not as straightforward. Recharging attempts to fix the
problem in a way that is more consistent with current semantics and
more appealing that reparenting in terms of rightful ownership.

Some systems are not rebooted for months. Can you imagine how much
memory can be accumulated at the root (escaping all accounting) over
months of reparenting?

>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 18:47                 ` Yosry Ahmed
  2023-07-21 19:18                   ` Tejun Heo
@ 2023-07-21 20:44                   ` Johannes Weiner
  2023-07-21 20:59                     ` Yosry Ahmed
  1 sibling, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2023-07-21 20:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > memory at least in our case. The sharing across them comes down to things
> > > > like some common library pages which don't really account for much these
> > > > days.
> > >
> > > Keep in mind that even a single page charged to a memcg and used by
> > > another memcg is sufficient to result in a zombie memcg.
> >
> > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > that controversial. That can be deterministically solved by reparenting to
> > the parent like how slab is handled. I think the "deterministic" part is
> > important here. As you said, even a single page can pin a dying cgroup.
> 
> There are serious flaws with reparenting that I mentioned above. We do
> it for kernel memory, but that's because we really have no other
> choice. Oftentimes the memory is not reclaimable and we cannot find an
> owner for it. This doesn't mean it's the right answer for user memory.
> 
> The semantics are new compared to normal charging (as opposed to
> recharging, as I explain below). There is an extra layer of
> indirection that we did not (as far as I know) measure the impact of.
> Parents end up with pages that they never used and we have no
> observability into where it came from. Most importantly, over time
> user memory will keep accumulating at the root, reducing the accuracy
> and usefulness of accounting, effectively an accounting leak and
> reduction of capacity. Memory that is not attributed to any user, aka
> system overhead.

Reparenting has been the behavior since the first iteration of cgroups
in the kernel. The initial implementation would loop over the LRUs and
reparent pages synchronously during rmdir. This had some locking
issues, so we switched to the current implementation of just leaving
the zombie memcg behind but neutralizing its controls.

Thanks to Roman's objcg abstraction, we can now go back to the old
implementation of directly moving pages up to avoid the zombies.

However, these were pure implementation changes. The user-visible
semantics never varied: when you delete a cgroup, any leftover
resources are subject to control by the remaining parent cgroups.
Don't remove control domains if you still need to control resources.
But none of this is new or would change in any way! Neutralizing
controls of a zombie cgroup results in the same behavior and
accounting as linking the pages to the parent cgroup's LRU!

The only thing that's new is the zombie cgroups. We can fix that by
effectively going back to the earlier implementation, but thanks to
objcg without the locking problems.

I just wanted to address this, because your description/framing of
reparenting strikes me as quite wrong.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-21 20:44                   ` Johannes Weiner
@ 2023-07-21 20:59                     ` Yosry Ahmed
  0 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2023-07-21 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle),
	Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	T.J. Mercier, Greg Thelen, linux-kernel, linux-mm, cgroups

On Fri, Jul 21, 2023 at 1:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > > memory at least in our case. The sharing across them comes down to things
> > > > > like some common library pages which don't really account for much these
> > > > > days.
> > > >
> > > > Keep in mind that even a single page charged to a memcg and used by
> > > > another memcg is sufficient to result in a zombie memcg.
> > >
> > > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > > that controversial. That can be deterministically solved by reparenting to
> > > the parent like how slab is handled. I think the "deterministic" part is
> > > important here. As you said, even a single page can pin a dying cgroup.
> >
> > There are serious flaws with reparenting that I mentioned above. We do
> > it for kernel memory, but that's because we really have no other
> > choice. Oftentimes the memory is not reclaimable and we cannot find an
> > owner for it. This doesn't mean it's the right answer for user memory.
> >
> > The semantics are new compared to normal charging (as opposed to
> > recharging, as I explain below). There is an extra layer of
> > indirection that we did not (as far as I know) measure the impact of.
> > Parents end up with pages that they never used and we have no
> > observability into where it came from. Most importantly, over time
> > user memory will keep accumulating at the root, reducing the accuracy
> > and usefulness of accounting, effectively an accounting leak and
> > reduction of capacity. Memory that is not attributed to any user, aka
> > system overhead.
>
> Reparenting has been the behavior since the first iteration of cgroups
> in the kernel. The initial implementation would loop over the LRUs and
> reparent pages synchronously during rmdir. This had some locking
> issues, so we switched to the current implementation of just leaving
> the zombie memcg behind but neutralizing its controls.

Thanks for the context.

>
> Thanks to Roman's objcg abstraction, we can now go back to the old
> implementation of directly moving pages up to avoid the zombies.
>
> However, these were pure implementation changes. The user-visible
> semantics never varied: when you delete a cgroup, any leftover
> resources are subject to control by the remaining parent cgroups.
> Don't remove control domains if you still need to control resources.
> But none of this is new or would change in any way!

The problem is that you cannot fully monitor or control all the
resources charged to a control domain. The example of common shared
libraries stands, the pages are charged on first touch basis. You
can't easily control it or monitor who is charged for what exactly.
Even if you can find out, is the answer to leave the cgroup alive
forever because it is charged for a shared resource?

> Neutralizing
> controls of a zombie cgroup results in the same behavior and
> accounting as linking the pages to the parent cgroup's LRU!
>
> The only thing that's new is the zombie cgroups. We can fix that by
> effectively going back to the earlier implementation, but thanks to
> objcg without the locking problems.
>
> I just wanted to address this, because your description/framing of
> reparenting strikes me as quite wrong.

Thanks for the context, and sorry if my framing was inaccurate. I was
more focused on the in-kernel semantics rather than user-visible
semantics. Nonetheless, with today's status or with reparenting, once
the memory is at the root level (whether reparented to the root level,
or in a zombie memcg whose parent is root), the memory has effectively
escaped accounting. This is not a new problem that reparenting would
introduce, but it's a problem that recharging is trying to fix that
reparenting won't.

As I outlined above, the semantics of recharging are not new, they are
equivalent to reclaiming and refaulting the memory in a more
accelerated/efficient manner. The indeterminism in recharging is very
similar to reclaiming and refaulting.

What do you think?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/8] memory recharging for offline memcgs
  2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
  2023-07-20 19:57   ` Tejun Heo
  2023-07-20 21:33   ` Yosry Ahmed
@ 2023-08-01  9:54   ` Michal Hocko
  2 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2023-08-01  9:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, Andrew Morton, Roman Gushchin, Shakeel Butt,
	Muchun Song, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Yu Zhao, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, T.J. Mercier, Greg Thelen, linux-kernel, linux-mm,
	cgroups

[Sorry for being late to this discussion]

On Thu 20-07-23 11:35:15, Johannes Weiner wrote:
[...]
> I'm super skeptical of this proposal.

Agreed.
 
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
> 
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process, which causes pages that should
> belong to the same domain to be scattered around all over the place.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
> 
> There are two issues being conflated here:
> 
> a) the problem of zombie cgroups, and
> 
> b) who controls resources that outlive the control domain.
> 
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
> 
> b) is a discussion totally separate from this. We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
> 
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
> 
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.

Yes, fully agreed on both points. The problem with zombies is real but
reparenting should address it for a large part. Ownership is a different
problem. We have discussed that at LSFMM this year and in the past as
well I believe. What we probably need is a concept of taking an
ownership of the memory (something like madvise(MADV_OWN, range) or
fadvise for fd based resources). This would allow the caller to take
ownership of the said resource (like memcg charge of it).

I understand that would require some changes to existing workloads.
Whatever the interface will be, it has to be explicit otherwise we
are hitting problems with unaccounted resources that are sitting without
any actual ownership and an undeterministic and time dependeing hopping
over owners. In other words, nobody should be able to drop
responsibility of any object while it is still consuming resources.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-08-01  9:54 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20  7:08 [RFC PATCH 0/8] memory recharging for offline memcgs Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 1/8] memcg: refactor updating memcg->moving_account Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 2/8] mm: vmscan: add lruvec_for_each_list() helper Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 3/8] memcg: recharge mapped folios when a memcg is offlined Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 4/8] memcg: support deferred memcg recharging Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 5/8] memcg: recharge folios when accessed or dirtied Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 6/8] memcg: add stats for offline memcgs recharging Yosry Ahmed
2023-07-20  7:08 ` [RFC PATCH 7/8] memcg: add sysctl and config option to control memory recharging Yosry Ahmed
2023-07-20 18:13   ` Luis Chamberlain
2023-07-20 18:24     ` Yosry Ahmed
2023-07-20 18:30       ` Luis Chamberlain
2023-07-20  7:08 ` [RFC PATCH 8/8] selftests: cgroup: test_memcontrol: add a selftest for memcg recharging Yosry Ahmed
2023-07-20 15:35 ` [RFC PATCH 0/8] memory recharging for offline memcgs Johannes Weiner
2023-07-20 19:57   ` Tejun Heo
2023-07-20 21:34     ` Yosry Ahmed
2023-07-20 22:11       ` Tejun Heo
2023-07-20 22:23         ` Yosry Ahmed
2023-07-20 22:31           ` Tejun Heo
2023-07-20 23:24             ` T.J. Mercier
2023-07-20 23:33               ` Tejun Heo
2023-07-21 18:15             ` Yosry Ahmed
2023-07-21 18:26               ` Tejun Heo
2023-07-21 18:47                 ` Yosry Ahmed
2023-07-21 19:18                   ` Tejun Heo
2023-07-21 20:37                     ` Yosry Ahmed
2023-07-21 20:44                   ` Johannes Weiner
2023-07-21 20:59                     ` Yosry Ahmed
2023-07-20 21:33   ` Yosry Ahmed
2023-08-01  9:54   ` Michal Hocko
2023-07-21  0:02 ` Roman Gushchin
2023-07-21  0:07   ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).