linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] mm: memcontrol: switch to rstat v2
@ 2021-02-05 18:27 Johannes Weiner
  2021-02-05 18:27 ` [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:27 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

This is version 2 of the memcg rstat patches. Updates since v1:

- added cgroup selftest output (see test section below) (thanks Roman)
- updated cgroup selftest to match new kernel implementation
- added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel)
- collected review & ack tags
- added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal)
- simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal)

---

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller. This change is motivated
by recent accuracy problems in memcg's custom stats code, as well as
the benefits of sharing common infra with other controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
  cgroups, where deleted groups would hang around until their
  remaining page cache has been reclaimed. This can result in large
  subtrees that are expensive to walk, while most of the groups are
  idle and their statistics don't change much anymore.

- Automated monitoring increased. With the proliferation of userspace
  oom killing, proactive reclaim, and higher-resolution logging of
  workload trends in general, top-level stat files are polled at least
  once a second in many deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
  past would have a few large policy-oriented cgroups for everything
  running on the system, newer cgroup deployments tend to create one
  group per application - which gets deleted again as the processes
  exit. An aggregation scheme that doesn't retain child data inside
  the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

The test_kmem cgroup selftest is currently failing due to excessive
cumulative vmstat drift from 100 subgroups:

    ok 1 test_kmem_basic
    memory.current = 8810496
    slab + anon + file + kernel_stack = 17074568
    slab = 6101384
    anon = 946176
    file = 0
    kernel_stack = 10027008
    not ok 2 test_kmem_memcg_deletion
    ok 3 test_kmem_proc_kpagecgroup
    ok 4 test_kmem_kernel_stacks
    ok 5 test_kmem_dead_cgroups
    ok 6 test_percpu_basic

As you can see, memory.stat items far exceed memory.current. The
kernel stack alone is bigger than all of charged memory. That's
because the memory of the test has been uncharged from memory.current,
but the negative vmstat deltas are still sitting in the percpu caches.

The test at this time isn't even counting percpu, pagetables etc. yet,
which would further contribute to the error. The last patch in the
series updates the test to include them - as well as match error
expectations on the new implementation to catch future regressions.

With all patches applied, the (updated, more rigorous) test succeeds:

    ok 1 test_kmem_basic
    ok 2 test_kmem_memcg_deletion
    ok 3 test_kmem_proc_kpagecgroup
    ok 4 test_kmem_kernel_stacks
    ok 5 test_kmem_dead_cgroups
    ok 6 test_percpu_basic

---

A kernel build test confirms that overhead is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                             `- build-b (defconfig, make -j16)
                                             `- idle-1
                                             `- ...
                                             `- idle-9

During the builds, kernelbuild/memory.stat is read once a second.

A perf diff shows that the changes in cycle distribution is
minimal. Top 10 kernel symbols:

     0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
     0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
     0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
     0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
     0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
     0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
     0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
     0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
     0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault


---

The on-demand aggregated stats are now fully accurate:

$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat

vanilla:                              patched:
nr_inactive_file 1574105088           nr_inactive_file 1027801088
   inactive_file 1577410560              inactive_file 1027801088

---

 block/blk-cgroup.c                         |  14 +-
 include/linux/memcontrol.h                 | 119 ++++-------
 kernel/cgroup/cgroup.c                     |  34 +--
 kernel/cgroup/rstat.c                      |  62 +++---
 mm/memcontrol.c                            | 306 +++++++++++++--------------
 tools/testing/selftests/cgroup/test_kmem.c |  22 +-
 6 files changed, 266 insertions(+), 291 deletions(-)

Based on v5.11-rc5-mm1.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
@ 2021-02-05 18:27 ` Johannes Weiner
  2021-02-05 18:28 ` [PATCH 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:27 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

The memcg hotunplug callback erroneously flushes counts on the local
CPU, not the counts of the CPU going away; those counts will be lost.

Flush the CPU that is actually going away.

Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.

Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..8120d565dd79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg, *mi;
+	struct mem_cgroup *memcg;
 
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
+		struct memcg_vmstats_percpu *statc;
 		int i;
 
+		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
 		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			int nid;
-			long x;
 
-			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmstats[i]);
+			if (statc->stat[i]) {
+				mod_memcg_state(memcg, i, statc->stat[i]);
+				statc->stat[i] = 0;
+			}
 
 			if (i >= NR_VM_NODE_STAT_ITEMS)
 				continue;
 
 			for_each_node(nid) {
+				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
+				long x;
 
 				pn = mem_cgroup_nodeinfo(memcg, nid);
-				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
-				if (x)
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+
+				x = lstatc->count[i];
+				lstatc->count[i] = 0;
+
+				if (x) {
 					do {
 						atomic_long_add(x, &pn->lruvec_stat[i]);
 					} while ((pn = parent_nodeinfo(pn, nid)));
+				}
 			}
 		}
 
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			long x;
-
-			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmevents[i]);
+			if (statc->events[i]) {
+				count_memcg_events(memcg, i, statc->events[i]);
+				statc->events[i] = 0;
+			}
 		}
 	}
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo()
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
  2021-02-05 18:27 ` [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-05 18:28 ` [PATCH 3/8] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

No need to encapsulate a simple struct member access.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  8 +-------
 mm/memcontrol.c            | 21 +++++++++++----------
 2 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a38a1517a05..c7f387a6233e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -602,12 +602,6 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-static struct mem_cgroup_per_node *
-mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
-{
-	return memcg->nodeinfo[nid];
-}
-
 /**
  * mem_cgroup_lruvec - get the lru list vector for a memcg & node
  * @memcg: memcg of the wanted lruvec
@@ -631,7 +625,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+	mz = memcg->nodeinfo[pgdat->node_id];
 	lruvec = &mz->lruvec;
 out:
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8120d565dd79..7e05a4ebf80f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -414,13 +414,14 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
 	struct memcg_shrinker_map *new, *old;
+	struct mem_cgroup_per_node *pn;
 	int nid;
 
 	lockdep_assert_held(&memcg_shrinker_map_mutex);
 
 	for_each_node(nid) {
-		old = rcu_dereference_protected(
-			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+		pn = memcg->nodeinfo[nid];
+		old = rcu_dereference_protected(pn->shrinker_map, true);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
@@ -433,7 +434,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 		memset(new->map, (int)0xff, old_size);
 		memset((void *)new->map + old_size, 0, size - old_size);
 
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+		rcu_assign_pointer(pn->shrinker_map, new);
 		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
 	}
 
@@ -450,7 +451,7 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 		return;
 
 	for_each_node(nid) {
-		pn = mem_cgroup_nodeinfo(memcg, nid);
+		pn = memcg->nodeinfo[nid];
 		map = rcu_dereference_protected(pn->shrinker_map, true);
 		kvfree(map);
 		rcu_assign_pointer(pn->shrinker_map, NULL);
@@ -713,7 +714,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 	int nid;
 
 	for_each_node(nid) {
-		mz = mem_cgroup_nodeinfo(memcg, nid);
+		mz = memcg->nodeinfo[nid];
 		mctz = soft_limit_tree_node(nid);
 		if (mctz)
 			mem_cgroup_remove_exceeded(mz, mctz);
@@ -796,7 +797,7 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	parent = parent_mem_cgroup(pn->memcg);
 	if (!parent)
 		return NULL;
-	return mem_cgroup_nodeinfo(parent, nid);
+	return parent->nodeinfo[nid];
 }
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1163,7 +1164,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (reclaim) {
 		struct mem_cgroup_per_node *mz;
 
-		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
+		mz = root->nodeinfo[reclaim->pgdat->node_id];
 		iter = &mz->iter;
 
 		if (prev && reclaim->generation != iter->generation)
@@ -1265,7 +1266,7 @@ static void __invalidate_reclaim_iterators(struct mem_cgroup *from,
 	int nid;
 
 	for_each_node(nid) {
-		mz = mem_cgroup_nodeinfo(from, nid);
+		mz = from->nodeinfo[nid];
 		iter = &mz->iter;
 		cmpxchg(&iter->position, dead_memcg, NULL);
 	}
@@ -2438,7 +2439,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 				struct mem_cgroup_per_node *pn;
 				long x;
 
-				pn = mem_cgroup_nodeinfo(memcg, nid);
+				pn = memcg->nodeinfo[nid];
 				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
 
 				x = lstatc->count[i];
@@ -4145,7 +4146,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		unsigned long file_cost = 0;
 
 		for_each_online_pgdat(pgdat) {
-			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+			mz = memcg->nodeinfo[pgdat->node_id];
 
 			anon_cost += mz->lruvec.anon_cost;
 			file_cost += mz->lruvec.file_cost;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/8] mm: memcontrol: privatize memcg_page_state query functions
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
  2021-02-05 18:27 ` [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
  2021-02-05 18:28 ` [PATCH 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-05 18:28 ` [PATCH 4/8] cgroup: rstat: support cgroup1 Johannes Weiner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

There are no users outside of the memory controller itself. The rest
of the kernel cares either about node or lruvec stats.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h | 44 --------------------------------------
 mm/memcontrol.c            | 32 +++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7f387a6233e..20ecdfae3289 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -867,39 +867,6 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
 void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
-{
-	long x = atomic_long_read(&memcg->vmstats[idx]);
-#ifdef CONFIG_SMP
-	if (x < 0)
-		x = 0;
-#endif
-	return x;
-}
-
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
-static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
-						   int idx)
-{
-	long x = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
-#ifdef CONFIG_SMP
-	if (x < 0)
-		x = 0;
-#endif
-	return x;
-}
-
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1337,17 +1304,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 {
 }
 
-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
-{
-	return 0;
-}
-
-static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
-						   int idx)
-{
-	return 0;
-}
-
 static inline void __mod_memcg_state(struct mem_cgroup *memcg,
 				     int idx,
 				     int nr)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e05a4ebf80f..2f97cb4cef6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -789,6 +789,38 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
 }
 
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
+static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
+{
+	long x = atomic_long_read(&memcg->vmstats[idx]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
+static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
+{
+	long x = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 static struct mem_cgroup_per_node *
 parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/8] cgroup: rstat: support cgroup1
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (2 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 3/8] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-05 22:11   ` Shakeel Butt
  2021-02-06  3:00   ` Tejun Heo
  2021-02-05 18:28 ` [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

Rstat currently only supports the default hierarchy in cgroup2. In
order to replace memcg's private stats infrastructure - used in both
cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.

The initialization and destruction callbacks for regular cgroups are
already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.

The initialization of the root cgroup is currently hardcoded to only
handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
and cgroup_destroy_root() to handle the default root as well as the
various cgroup1 roots we may set up during mounting.

The linking of css to cgroups happens in code shared between cgroup1
and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.

Linkage of the root css to the root cgroup is a bit trickier: per
default, the root css of a subsystem controller belongs to the default
hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
cgroup1 version, the root css is stolen and moved to the cgroup1 root;
on unmount, the css moves back to the default hierarchy. Annotate
rebind_subsystems() to move the root css linkage along between roots.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
---
 kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++-------------
 kernel/cgroup/rstat.c  |  2 --
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..e049edd66776 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1339,6 +1339,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	mutex_unlock(&cgroup_mutex);
 
+	cgroup_rstat_exit(cgrp);
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
 }
@@ -1751,6 +1752,12 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 				       &dcgrp->e_csets[ss->id]);
 		spin_unlock_irq(&css_set_lock);
 
+		if (ss->css_rstat_flush) {
+			list_del_rcu(&css->rstat_css_node);
+			list_add_rcu(&css->rstat_css_node,
+				     &dcgrp->rstat_css_list);
+		}
+
 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
 		if (dst_root == &cgrp_dfl_root) {
@@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = rebind_subsystems(root, ss_mask);
+	ret = cgroup_rstat_init(root_cgrp);
 	if (ret)
 		goto destroy_root;
 
+	ret = rebind_subsystems(root, ss_mask);
+	if (ret)
+		goto exit_stats;
+
 	ret = cgroup_bpf_inherit(root_cgrp);
 	WARN_ON_ONCE(ret);
 
@@ -2006,6 +2017,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	ret = 0;
 	goto out;
 
+exit_stats:
+	cgroup_rstat_exit(root_cgrp);
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -4934,8 +4947,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			if (cgroup_on_dfl(cgrp))
-				cgroup_rstat_exit(cgrp);
+			cgroup_rstat_exit(cgrp);
 			kfree(cgrp);
 		} else {
 			/*
@@ -4976,8 +4988,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		if (cgroup_on_dfl(cgrp))
-			cgroup_rstat_flush(cgrp);
+		cgroup_rstat_flush(cgrp);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5034,7 +5045,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 		css_get(css->parent);
 	}
 
-	if (cgroup_on_dfl(cgrp) && ss->css_rstat_flush)
+	if (ss->css_rstat_flush)
 		list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
 
 	BUG_ON(cgroup_css(cgrp, ss));
@@ -5159,11 +5170,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	if (cgroup_on_dfl(parent)) {
-		ret = cgroup_rstat_init(cgrp);
-		if (ret)
-			goto out_cancel_ref;
-	}
+	ret = cgroup_rstat_init(cgrp);
+	if (ret)
+		goto out_cancel_ref;
 
 	/* create the directory */
 	kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
@@ -5250,8 +5259,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_kernfs_remove:
 	kernfs_remove(cgrp->kn);
 out_stat_exit:
-	if (cgroup_on_dfl(parent))
-		cgroup_rstat_exit(cgrp);
+	cgroup_rstat_exit(cgrp);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d51175cedfca..faa767a870ba 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
 
 	for_each_possible_cpu(cpu)
 		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
-
-	BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
 }
 
 /*
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (3 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 4/8] cgroup: rstat: support cgroup1 Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-06  3:34   ` Tejun Heo
  2021-02-05 18:28 ` [PATCH 6/8] mm: memcontrol: switch to rstat Johannes Weiner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level. This optimization is currently
implemented inside the generic rstat code, which doesn't track the
root cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including
at the root level. In preparation for the memory controller switching
to rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root. The
queueing for a flush touches only per-cpu data, and only the first
stat change since a flush requires a (per-cpu) lock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/blk-cgroup.c    | 14 +++++++---
 kernel/cgroup/rstat.c | 60 +++++++++++++++++++++++++------------------
 2 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02ce2058c14b..76725e1cad7f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct blkcg *blkcg = css_to_blkcg(css);
 	struct blkcg_gq *blkg;
 
+	/* Root-level stats are sourced from system-wide IO stats */
+	if (!cgroup_parent(css->cgroup))
+		return;
+
 	rcu_read_lock();
 
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		u64_stats_update_end(&blkg->iostat.sync);
 
 		/* propagate global delta to parent */
+		/* XXX: could skip this if parent is root */
 		if (parent) {
 			u64_stats_update_begin(&parent->iostat.sync);
 			blkg_iostat_set(&delta, &blkg->iostat.cur);
@@ -803,10 +808,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 }
 
 /*
- * The rstat algorithms intentionally don't handle the root cgroup to avoid
- * incurring overhead when no cgroups are defined. For that reason,
- * cgroup_rstat_flush in blkcg_print_stat does not actually fill out the
- * iostat in the root cgroup's blkcg_gq.
+ * We source root cgroup stats from the system-wide stats to avoid
+ * tracking the same information twice and incurring overhead when no
+ * cgroups are defined. For that reason, cgroup_rstat_flush in
+ * blkcg_print_stat does not actually fill out the iostat in the root
+ * cgroup's blkcg_gq.
  *
  * However, we would like to re-use the printing code between the root and
  * non-root cgroups to the extent possible. For that reason, we simulate
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index faa767a870ba..6f50c199bf2a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -25,13 +25,8 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 {
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup *parent;
 	unsigned long flags;
 
-	/* nothing to do for root */
-	if (!cgroup_parent(cgrp))
-		return;
-
 	/*
 	 * Speculative already-on-list test. This may race leading to
 	 * temporary inaccuracies, which is fine.
@@ -46,10 +41,10 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	raw_spin_lock_irqsave(cpu_lock, flags);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
-	for (parent = cgroup_parent(cgrp); parent;
-	     cgrp = parent, parent = cgroup_parent(cgrp)) {
+	while (true) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu);
+		struct cgroup *parent = cgroup_parent(cgrp);
+		struct cgroup_rstat_cpu *prstatc;
 
 		/*
 		 * Both additions and removals are bottom-up.  If a cgroup
@@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 		if (rstatc->updated_next)
 			break;
 
+		if (!parent) {
+			rstatc->updated_next = cgrp;
+			break;
+		}
+
+		prstatc = cgroup_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
 		prstatc->updated_children = cgrp;
+
+		cgrp = parent;
 	}
 
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
@@ -113,23 +116,26 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	 */
 	if (rstatc->updated_next) {
 		struct cgroup *parent = cgroup_parent(pos);
-		struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu);
-		struct cgroup_rstat_cpu *nrstatc;
-		struct cgroup **nextp;
-
-		nextp = &prstatc->updated_children;
-		while (true) {
-			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
-			if (*nextp == pos)
-				break;
-
-			WARN_ON_ONCE(*nextp == parent);
-			nextp = &nrstatc->updated_next;
+
+		if (parent) {
+			struct cgroup_rstat_cpu *prstatc;
+			struct cgroup **nextp;
+
+			prstatc = cgroup_rstat_cpu(parent, cpu);
+			nextp = &prstatc->updated_children;
+			while (true) {
+				struct cgroup_rstat_cpu *nrstatc;
+
+				nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+				if (*nextp == pos)
+					break;
+				WARN_ON_ONCE(*nextp == parent);
+				nextp = &nrstatc->updated_next;
+			}
+			*nextp = rstatc->updated_next;
 		}
 
-		*nextp = rstatc->updated_next;
 		rstatc->updated_next = NULL;
-
 		return pos;
 	}
 
@@ -309,11 +315,15 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_base_stat cur, delta;
 	unsigned seq;
 
+	/* Root-level stats are sourced from system-wide CPU stats */
+	if (!parent)
+		return;
+
 	/* fetch the current per-cpu values */
 	do {
 		seq = __u64_stats_fetch_begin(&rstatc->bsync);
@@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
 	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
 
-	/* propagate global delta to parent */
-	if (parent) {
+	/* propagate global delta to parent (unless that's root) */
+	if (cgroup_parent(parent)) {
 		delta = cgrp->bstat;
 		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
 		cgroup_base_stat_add(&parent->bstat, &delta);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/8] mm: memcontrol: switch to rstat
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (4 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-08  2:19   ` Shakeel Butt
  2021-02-05 18:28 ` [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

Replace the memory controller's custom hierarchical stats code with
the generic rstat infrastructure provided by the cgroup core.

The current implementation does batched upward propagation from the
write side (i.e. as stats change). The per-cpu batches introduce an
error, which is multiplied by the number of subgroups in a tree. In
systems with many CPUs and sizable cgroup trees, the error can be
large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
subgroups results in an error of up to 128M per stat item). This can
entirely swallow allocation bursts inside a workload that the user is
expecting to see reflected in the statistics.

In the past, we've done read-side aggregation, where a memory.stat
read would have to walk the entire subtree and add up per-cpu
counts. This became problematic with lazily-freed cgroups: we could
have large subtrees where most cgroups were entirely idle. Hence the
switch to change-driven upward propagation. Unfortunately, it needed
to trade accuracy for speed due to the write side being so hot.

Rstat combines the best of both worlds: from the write side, it
cheaply maintains a queue of cgroups that have pending changes, so
that the read side can do selective tree aggregation. This way the
reported stats will always be precise and recent as can be, while the
aggregation can skip over potentially large numbers of idle cgroups.

The way rstat works is that it implements a tree for tracking cgroups
with pending local changes, as well as a flush function that walks the
tree upwards. The controller then drives this by 1) telling rstat when
a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush
is required to get uptodate hierarchy stats for a given subtree
(e.g. when memory.stat is read). The controller also provides a flush
callback that is called during the rstat flush walk for each cgroup
and aggregates its local per-cpu counters and propagates them upwards.

This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
aggregation. It removes 3 words from the per-cpu data. It eliminates
memcg_exact_page_state(), since memcg_page_state() is now exact.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  67 ++++++-----
 mm/memcontrol.c            | 224 +++++++++++++++----------------------
 2 files changed, 133 insertions(+), 158 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ecdfae3289..a8c7a0ccc759 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -76,10 +76,27 @@ enum mem_cgroup_events_target {
 };
 
 struct memcg_vmstats_percpu {
-	long stat[MEMCG_NR_STAT];
-	unsigned long events[NR_VM_EVENT_ITEMS];
-	unsigned long nr_page_events;
-	unsigned long targets[MEM_CGROUP_NTARGETS];
+	/* Local (CPU and cgroup) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_VM_EVENT_ITEMS];
+
+	/* Delta calculation for lockless upward propagation */
+	long			state_prev[MEMCG_NR_STAT];
+	unsigned long		events_prev[NR_VM_EVENT_ITEMS];
+
+	/* Cgroup1: threshold notifications & softlimit tree updates */
+	unsigned long		nr_page_events;
+	unsigned long		targets[MEM_CGROUP_NTARGETS];
+};
+
+struct memcg_vmstats {
+	/* Aggregated (CPU and subtree) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_VM_EVENT_ITEMS];
+
+	/* Pending child counts during tree propagation */
+	long			state_pending[MEMCG_NR_STAT];
+	unsigned long		events_pending[NR_VM_EVENT_ITEMS];
 };
 
 struct mem_cgroup_reclaim_iter {
@@ -287,8 +304,8 @@ struct mem_cgroup {
 
 	MEMCG_PADDING(_pad1_);
 
-	atomic_long_t		vmstats[MEMCG_NR_STAT];
-	atomic_long_t		vmevents[NR_VM_EVENT_ITEMS];
+	/* memory.stat */
+	struct memcg_vmstats	vmstats;
 
 	/* memory.events */
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
@@ -315,10 +332,6 @@ struct mem_cgroup {
 	atomic_t		moving_account;
 	struct task_struct	*move_lock_task;
 
-	/* Legacy local VM stats and events */
-	struct memcg_vmstats_percpu __percpu *vmstats_local;
-
-	/* Subtree VM stats and events (batched updates) */
 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -942,10 +955,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
 	local_irq_restore(flags);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
-
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count);
 
@@ -1028,6 +1037,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+						gfp_t gfp_mask,
+						unsigned long *total_scanned);
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1136,6 +1149,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page,
 	return lruvec == &pgdat->__lruvec;
 }
 
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1349,18 +1366,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	mod_node_page_state(page_pgdat(page), idx, val);
 }
 
-static inline
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	return 0;
-}
-
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
-{
-}
-
 static inline void count_memcg_events(struct mem_cgroup *memcg,
 				      enum vm_event_item idx,
 				      unsigned long count)
@@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
 
-static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+static inline void mem_cgroup_split_huge_fixup(struct page *head)
+{
+}
+
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					    gfp_t gfp_mask,
+					    unsigned long *total_scanned)
 {
+	return 0;
 }
 #endif /* CONFIG_MEMCG */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f97cb4cef6d..5dc0bd53b64a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 	return mz;
 }
 
+static void memcg_flush_vmstats(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -765,37 +770,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x, threshold = MEMCG_CHARGE_BATCH;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	if (memcg_stat_item_in_bytes(idx))
-		threshold <<= PAGE_SHIFT;
-
-	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > threshold)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->stat[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmstats[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 {
-	long x = atomic_long_read(&memcg->vmstats[idx]);
+	long x = READ_ONCE(memcg->vmstats.state[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -803,17 +788,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 	return x;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
 	long x = 0;
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
+		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -936,30 +918,16 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count)
 {
-	unsigned long x;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
-	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->events[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmevents[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 {
-	return atomic_long_read(&memcg->vmevents[event]);
+	return READ_ONCE(memcg->vmstats.events[event]);
 }
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
@@ -968,7 +936,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->events[event], cpu);
+		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
 	return x;
 }
 
@@ -1631,6 +1599,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	 *
 	 * Current memory state:
 	 */
+	memcg_flush_vmstats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -2450,22 +2419,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
-		struct memcg_vmstats_percpu *statc;
 		int i;
 
-		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
-
-		for (i = 0; i < MEMCG_NR_STAT; i++) {
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			int nid;
 
-			if (statc->stat[i]) {
-				mod_memcg_state(memcg, i, statc->stat[i]);
-				statc->stat[i] = 0;
-			}
-
-			if (i >= NR_VM_NODE_STAT_ITEMS)
-				continue;
-
 			for_each_node(nid) {
 				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
@@ -2484,13 +2442,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 				}
 			}
 		}
-
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			if (statc->events[i]) {
-				count_memcg_events(memcg, i, statc->events[i]);
-				statc->events[i] = 0;
-			}
-		}
 	}
 
 	return 0;
@@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	unsigned long val;
 
+	memcg_flush_vmstats(memcg);
+
 	if (mem_cgroup_is_root(memcg)) {
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
@@ -3683,26 +3636,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 	}
 }
 
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
+static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
 {
-	unsigned long stat[MEMCG_NR_STAT] = {0};
-	struct mem_cgroup *mi;
-	int node, cpu, i;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			atomic_long_add(stat[i], &mi->vmstats[i]);
+	int node;
 
 	for_each_node(node) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 };
 		struct mem_cgroup_per_node *pi;
-
-		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-			stat[i] = 0;
+		int cpu, i;
 
 		for_each_online_cpu(cpu)
 			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
@@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
 	}
 }
 
-static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
-{
-	unsigned long events[NR_VM_EVENT_ITEMS];
-	struct mem_cgroup *mi;
-	int cpu, i;
-
-	for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-		events[i] = 0;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			events[i] += per_cpu(memcg->vmstats_percpu->events[i],
-					     cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			atomic_long_add(events[i], &mi->vmevents[i]);
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	memcg_flush_vmstats(memcg);
+
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
 			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
@@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
+	memcg_flush_vmstats(memcg);
+
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
@@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 	return &memcg->cgwb_domain;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page().
- */
-static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
-{
-	long x = atomic_long_read(&memcg->vmstats[idx]);
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx];
-	if (x < 0)
-		x = 0;
-	return x;
-}
-
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
+	memcg_flush_vmstats(memcg);
 
-	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
-	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
-			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
-	*pheadroom = PAGE_COUNTER_MAX;
+	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+	*pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) +
+			memcg_page_state(memcg, NR_ACTIVE_FILE);
 
+	*pheadroom = PAGE_COUNTER_MAX;
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
 					    READ_ONCE(memcg->memory.high));
@@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
-	free_percpu(memcg->vmstats_local);
 	kfree(memcg);
 }
 
@@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	memcg_wb_domain_exit(memcg);
 	/*
-	 * Flush percpu vmstats and vmevents to guarantee the value correctness
-	 * on parent's and all ancestor levels.
+	 * Flush percpu lruvec stats to guarantee the value
+	 * correctness on parent's and all ancestor levels.
 	 */
-	memcg_flush_percpu_vmstats(memcg);
-	memcg_flush_percpu_vmevents(memcg);
+	memcg_flush_lruvec_page_state(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
-						GFP_KERNEL_ACCOUNT);
-	if (!memcg->vmstats_local)
-		goto fail;
-
 	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
 						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
@@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	memcg_wb_domain_size_changed(memcg);
 }
 
+static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct memcg_vmstats_percpu *statc;
+	long delta, v;
+	int i;
+
+	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
+	for (i = 0; i < MEMCG_NR_STAT; i++) {
+		/*
+		 * Collect the aggregated propagation counts of groups
+		 * below us. We're in a per-cpu loop here and this is
+		 * a global counter, so the first cycle will get them.
+		 */
+		delta = memcg->vmstats.state_pending[i];
+		if (delta)
+			memcg->vmstats.state_pending[i] = 0;
+
+		/* Add CPU changes on this level since the last flush */
+		v = READ_ONCE(statc->state[i]);
+		if (v != statc->state_prev[i]) {
+			delta += v - statc->state_prev[i];
+			statc->state_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		/* Aggregate counts on this level and propagate upwards */
+		memcg->vmstats.state[i] += delta;
+		if (parent)
+			parent->vmstats.state_pending[i] += delta;
+	}
+
+	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
+		delta = memcg->vmstats.events_pending[i];
+		if (delta)
+			memcg->vmstats.events_pending[i] = 0;
+
+		v = READ_ONCE(statc->events[i]);
+		if (v != statc->events_prev[i]) {
+			delta += v - statc->events_prev[i];
+			statc->events_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		memcg->vmstats.events[i] += delta;
+		if (parent)
+			parent->vmstats.events_pending[i] += delta;
+	}
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
@@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
+	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (5 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 6/8] mm: memcontrol: switch to rstat Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-08  2:28   ` Shakeel Butt
  2021-02-08 13:54   ` Michal Hocko
  2021-02-05 18:28 ` [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation Johannes Weiner
  2021-02-06  3:58 ` [PATCH 0/8] mm: memcontrol: switch to rstat v2 Tejun Heo
  8 siblings, 2 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

There are two functions to flush the per-cpu data of an lruvec into
the rest of the cgroup tree: when the cgroup is being freed, and when
a CPU disappears during hotplug. The difference is whether all CPUs or
just one is being collected, but the rest of the flushing code is the
same. Merge them into one function and share the common code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 74 +++++++++++++++++++------------------------------
 1 file changed, 28 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5dc0bd53b64a..490357945f2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2410,39 +2410,39 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	mutex_unlock(&percpu_charge_mutex);
 }
 
-static int memcg_hotplug_cpu_dead(unsigned int cpu)
+static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu)
 {
-	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg;
-
-	stock = &per_cpu(memcg_stock, cpu);
-	drain_stock(stock);
+	int nid;
 
-	for_each_mem_cgroup(memcg) {
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+		unsigned long stat[NR_VM_NODE_STAT_ITEMS];
+		struct batched_lruvec_stat *lstatc;
 		int i;
 
+		lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
-			int nid;
+			stat[i] = lstatc->count[i];
+			lstatc->count[i] = 0;
+		}
 
-			for_each_node(nid) {
-				struct batched_lruvec_stat *lstatc;
-				struct mem_cgroup_per_node *pn;
-				long x;
+		do {
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+				atomic_long_add(stat[i], &pn->lruvec_stat[i]);
+		} while ((pn = parent_nodeinfo(pn, nid)));
+	}
+}
 
-				pn = memcg->nodeinfo[nid];
-				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+static int memcg_hotplug_cpu_dead(unsigned int cpu)
+{
+	struct memcg_stock_pcp *stock;
+	struct mem_cgroup *memcg;
 
-				x = lstatc->count[i];
-				lstatc->count[i] = 0;
+	stock = &per_cpu(memcg_stock, cpu);
+	drain_stock(stock);
 
-				if (x) {
-					do {
-						atomic_long_add(x, &pn->lruvec_stat[i]);
-					} while ((pn = parent_nodeinfo(pn, nid)));
-				}
-			}
-		}
-	}
+	for_each_mem_cgroup(memcg)
+		memcg_flush_lruvec_page_state(memcg, cpu);
 
 	return 0;
 }
@@ -3636,27 +3636,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 	}
 }
 
-static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
-{
-	int node;
-
-	for_each_node(node) {
-		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
-		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 };
-		struct mem_cgroup_per_node *pi;
-		int cpu, i;
-
-		for_each_online_cpu(cpu)
-			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-				stat[i] += per_cpu(
-					pn->lruvec_stat_cpu->count[i], cpu);
-
-		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
-			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
-	}
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -5192,12 +5171,15 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
+	int cpu;
+
 	memcg_wb_domain_exit(memcg);
 	/*
 	 * Flush percpu lruvec stats to guarantee the value
 	 * correctness on parent's and all ancestor levels.
 	 */
-	memcg_flush_lruvec_page_state(memcg);
+	for_each_online_cpu(cpu)
+		memcg_flush_lruvec_page_state(memcg, cpu);
 	__mem_cgroup_free(memcg);
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (6 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
@ 2021-02-05 18:28 ` Johannes Weiner
  2021-02-09 13:48   ` Shakeel Butt
  2021-02-06  3:58 ` [PATCH 0/8] mm: memcontrol: switch to rstat v2 Tejun Heo
  8 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-05 18:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, linux-mm, cgroups,
	linux-kernel, kernel-team

With memcg having switched to rstat, memory.stat output is precise.
Update the cgroup selftest to reflect the expectations and error
tolerances of the new implementation.

Also add newly tracked types of memory to the memory.stat side of the
equation, since they're included in memory.current and could throw
false positives.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 tools/testing/selftests/cgroup/test_kmem.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
index 0941aa16157e..22b31ebb3513 100644
--- a/tools/testing/selftests/cgroup/test_kmem.c
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -19,12 +19,12 @@
 
 
 /*
- * Memory cgroup charging and vmstat data aggregation is performed using
- * percpu batches 32 pages big (look at MEMCG_CHARGE_BATCH). So the maximum
- * discrepancy between charge and vmstat entries is number of cpus multiplied
- * by 32 pages multiplied by 2.
+ * Memory cgroup charging is performed using percpu batches 32 pages
+ * big (look at MEMCG_CHARGE_BATCH), whereas memory.stat is exact. So
+ * the maximum discrepancy between charge and vmstat entries is number
+ * of cpus multiplied by 32 pages.
  */
-#define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs())
+#define MAX_VMSTAT_ERROR (4096 * 32 * get_nprocs())
 
 
 static int alloc_dcache(const char *cgroup, void *arg)
@@ -162,7 +162,7 @@ static int cg_run_in_subcgroups(const char *parent,
  */
 static int test_kmem_memcg_deletion(const char *root)
 {
-	long current, slab, anon, file, kernel_stack, sum;
+	long current, slab, anon, file, kernel_stack, pagetables, percpu, sock, sum;
 	int ret = KSFT_FAIL;
 	char *parent;
 
@@ -184,11 +184,14 @@ static int test_kmem_memcg_deletion(const char *root)
 	anon = cg_read_key_long(parent, "memory.stat", "anon ");
 	file = cg_read_key_long(parent, "memory.stat", "file ");
 	kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack ");
+	pagetables = cg_read_key_long(parent, "memory.stat", "pagetables ");
+	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
+	sock = cg_read_key_long(parent, "memory.stat", "sock ");
 	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
-	    kernel_stack < 0)
+	    kernel_stack < 0 || pagetables < 0 || percpu < 0 || sock < 0)
 		goto cleanup;
 
-	sum = slab + anon + file + kernel_stack;
+	sum = slab + anon + file + kernel_stack + pagetables + percpu + sock;
 	if (abs(sum - current) < MAX_VMSTAT_ERROR) {
 		ret = KSFT_PASS;
 	} else {
@@ -198,6 +201,9 @@ static int test_kmem_memcg_deletion(const char *root)
 		printf("anon = %ld\n", anon);
 		printf("file = %ld\n", file);
 		printf("kernel_stack = %ld\n", kernel_stack);
+		printf("pagetables = %ld\n", pagetables);
+		printf("percpu = %ld\n", percpu);
+		printf("sock = %ld\n", sock);
 	}
 
 cleanup:
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] cgroup: rstat: support cgroup1
  2021-02-05 18:28 ` [PATCH 4/8] cgroup: rstat: support cgroup1 Johannes Weiner
@ 2021-02-05 22:11   ` Shakeel Butt
  2021-02-06  3:00   ` Tejun Heo
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-02-05 22:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Rstat currently only supports the default hierarchy in cgroup2. In
> order to replace memcg's private stats infrastructure - used in both
> cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
>
> The initialization and destruction callbacks for regular cgroups are
> already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
>
> The initialization of the root cgroup is currently hardcoded to only
> handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
> and cgroup_destroy_root() to handle the default root as well as the
> various cgroup1 roots we may set up during mounting.
>
> The linking of css to cgroups happens in code shared between cgroup1
> and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
>
> Linkage of the root css to the root cgroup is a bit trickier: per
> default, the root css of a subsystem controller belongs to the default
> hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
> cgroup1 version, the root css is stolen and moved to the cgroup1 root;
> on unmount, the css moves back to the default hierarchy. Annotate
> rebind_subsystems() to move the root css linkage along between roots.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] cgroup: rstat: support cgroup1
  2021-02-05 18:28 ` [PATCH 4/8] cgroup: rstat: support cgroup1 Johannes Weiner
  2021-02-05 22:11   ` Shakeel Butt
@ 2021-02-06  3:00   ` Tejun Heo
  1 sibling, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2021-02-06  3:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Feb 05, 2021 at 01:28:02PM -0500, Johannes Weiner wrote:
> Rstat currently only supports the default hierarchy in cgroup2. In
> order to replace memcg's private stats infrastructure - used in both
> cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
> 
> The initialization and destruction callbacks for regular cgroups are
> already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
> 
> The initialization of the root cgroup is currently hardcoded to only
> handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
> and cgroup_destroy_root() to handle the default root as well as the
> various cgroup1 roots we may set up during mounting.
> 
> The linking of css to cgroups happens in code shared between cgroup1
> and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
> 
> Linkage of the root css to the root cgroup is a bit trickier: per
> default, the root css of a subsystem controller belongs to the default
> hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
> cgroup1 version, the root css is stolen and moved to the cgroup1 root;
> on unmount, the css moves back to the default hierarchy. Annotate
> rebind_subsystems() to move the root css linkage along between roots.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Roman Gushchin <guro@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers
  2021-02-05 18:28 ` [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
@ 2021-02-06  3:34   ` Tejun Heo
  2021-02-08 20:29     ` Johannes Weiner
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2021-02-06  3:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, cgroups, linux-kernel, kernel-team

Hello,

On Fri, Feb 05, 2021 at 01:28:03PM -0500, Johannes Weiner wrote:
> Current users of the rstat code can source root-level statistics from
> the native counters of their respective subsystem, allowing them to
> forego aggregation at the root level. This optimization is currently
> implemented inside the generic rstat code, which doesn't track the
> root cgroup and doesn't invoke the subsystem flush callbacks on it.
> 
> However, the memory controller cannot do this optimization, because
> cgroup1 breaks out memory specifically for the local level, including
> at the root level. In preparation for the memory controller switching
> to rstat, move the optimization from rstat core to the controllers.
> 
> Afterwards, rstat will always track the root cgroup for changes and
> invoke the subsystem callbacks on it; and it's up to the subsystem to
> special-case and skip aggregation of the root cgroup if it can source
> this information through other, cheaper means.
> 
> The extra cost of tracking the root cgroup is negligible: on stat
> changes, we actually remove a branch that checks for the root. The
> queueing for a flush touches only per-cpu data, and only the first
> stat change since a flush requires a (per-cpu) lock.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Generally looks good to me.

Acked-by: Tejun Heo <tj@kernel.org>

A couple nits below.

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 02ce2058c14b..76725e1cad7f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  	struct blkcg *blkcg = css_to_blkcg(css);
>  	struct blkcg_gq *blkg;
>  
> +	/* Root-level stats are sourced from system-wide IO stats */
> +	if (!cgroup_parent(css->cgroup))
> +		return;
> +
>  	rcu_read_lock();
>  
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  		u64_stats_update_end(&blkg->iostat.sync);
>  
>  		/* propagate global delta to parent */
> +		/* XXX: could skip this if parent is root */
>  		if (parent) {
>  			u64_stats_update_begin(&parent->iostat.sync);
>  			blkg_iostat_set(&delta, &blkg->iostat.cur);

Might as well update this similar to cgroup_base_stat_flush()?

> @@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  		if (rstatc->updated_next)
>  			break;
>  
> +		if (!parent) {

Maybe useful to note that the node is being marked busy but not added to the
non-existent parent.

> +			rstatc->updated_next = cgrp;
> +			break;
> +		}
> +
> +		prstatc = cgroup_rstat_cpu(parent, cpu);
>  		rstatc->updated_next = prstatc->updated_children;
>  		prstatc->updated_children = cgrp;
> +
> +		cgrp = parent;
>  	}
>  
>  	raw_spin_unlock_irqrestore(cpu_lock, flags);
...
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>  {
> -	struct cgroup *parent = cgroup_parent(cgrp);
>  	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> +	struct cgroup *parent = cgroup_parent(cgrp);

Is this chunk intentional?

>  	struct cgroup_base_stat cur, delta;
>  	unsigned seq;
>  
> +	/* Root-level stats are sourced from system-wide CPU stats */
> +	if (!parent)
> +		return;
> +
>  	/* fetch the current per-cpu values */
>  	do {
>  		seq = __u64_stats_fetch_begin(&rstatc->bsync);
> @@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>  	cgroup_base_stat_add(&cgrp->bstat, &delta);
>  	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
>  
> -	/* propagate global delta to parent */
> -	if (parent) {
> +	/* propagate global delta to parent (unless that's root) */
> +	if (cgroup_parent(parent)) {

Yeah, this makes sense. Can you add a short while-at-it note in the patch
description?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: switch to rstat v2
  2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
                   ` (7 preceding siblings ...)
  2021-02-05 18:28 ` [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation Johannes Weiner
@ 2021-02-06  3:58 ` Tejun Heo
  8 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2021-02-06  3:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Feb 05, 2021 at 01:27:58PM -0500, Johannes Weiner wrote:
> This is version 2 of the memcg rstat patches. Updates since v1:
> 
> - added cgroup selftest output (see test section below) (thanks Roman)
> - updated cgroup selftest to match new kernel implementation
> - added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel)
> - collected review & ack tags
> - added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal)
> - simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal)

The whole series looks good to me. Please feel free to route the rstat
patches with the rest of the series.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/8] mm: memcontrol: switch to rstat
  2021-02-05 18:28 ` [PATCH 6/8] mm: memcontrol: switch to rstat Johannes Weiner
@ 2021-02-08  2:19   ` Shakeel Butt
  2021-02-08 20:40     ` Johannes Weiner
  0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2021-02-08  2:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Replace the memory controller's custom hierarchical stats code with
> the generic rstat infrastructure provided by the cgroup core.
>
> The current implementation does batched upward propagation from the
> write side (i.e. as stats change). The per-cpu batches introduce an
> error, which is multiplied by the number of subgroups in a tree. In
> systems with many CPUs and sizable cgroup trees, the error can be
> large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> subgroups results in an error of up to 128M per stat item). This can
> entirely swallow allocation bursts inside a workload that the user is
> expecting to see reflected in the statistics.
>
> In the past, we've done read-side aggregation, where a memory.stat
> read would have to walk the entire subtree and add up per-cpu
> counts. This became problematic with lazily-freed cgroups: we could
> have large subtrees where most cgroups were entirely idle. Hence the
> switch to change-driven upward propagation. Unfortunately, it needed
> to trade accuracy for speed due to the write side being so hot.
>
> Rstat combines the best of both worlds: from the write side, it
> cheaply maintains a queue of cgroups that have pending changes, so
> that the read side can do selective tree aggregation. This way the
> reported stats will always be precise and recent as can be, while the
> aggregation can skip over potentially large numbers of idle cgroups.
>
> The way rstat works is that it implements a tree for tracking cgroups
> with pending local changes, as well as a flush function that walks the
> tree upwards. The controller then drives this by 1) telling rstat when
> a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush
> is required to get uptodate hierarchy stats for a given subtree
> (e.g. when memory.stat is read). The controller also provides a flush
> callback that is called during the rstat flush walk for each cgroup
> and aggregates its local per-cpu counters and propagates them upwards.
>
> This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> aggregation. It removes 3 words from the per-cpu data. It eliminates
> memcg_exact_page_state(), since memcg_page_state() is now exact.

Only if cgroup_rstat_flush() has been called before memcg_page_state(), right?

>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Overall the patch looks good to me with a couple of nits/queries below.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  include/linux/memcontrol.h |  67 ++++++-----
>  mm/memcontrol.c            | 224 +++++++++++++++----------------------
>  2 files changed, 133 insertions(+), 158 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20ecdfae3289..a8c7a0ccc759 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,10 +76,27 @@ enum mem_cgroup_events_target {
>  };
>
>  struct memcg_vmstats_percpu {
> -       long stat[MEMCG_NR_STAT];
> -       unsigned long events[NR_VM_EVENT_ITEMS];
> -       unsigned long nr_page_events;
> -       unsigned long targets[MEM_CGROUP_NTARGETS];
> +       /* Local (CPU and cgroup) page state & events */
> +       long                    state[MEMCG_NR_STAT];
> +       unsigned long           events[NR_VM_EVENT_ITEMS];
> +
> +       /* Delta calculation for lockless upward propagation */
> +       long                    state_prev[MEMCG_NR_STAT];
> +       unsigned long           events_prev[NR_VM_EVENT_ITEMS];
> +
> +       /* Cgroup1: threshold notifications & softlimit tree updates */
> +       unsigned long           nr_page_events;
> +       unsigned long           targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct memcg_vmstats {
> +       /* Aggregated (CPU and subtree) page state & events */
> +       long                    state[MEMCG_NR_STAT];
> +       unsigned long           events[NR_VM_EVENT_ITEMS];
> +
> +       /* Pending child counts during tree propagation */
> +       long                    state_pending[MEMCG_NR_STAT];
> +       unsigned long           events_pending[NR_VM_EVENT_ITEMS];
>  };
>
>  struct mem_cgroup_reclaim_iter {
> @@ -287,8 +304,8 @@ struct mem_cgroup {
>
>         MEMCG_PADDING(_pad1_);
>
> -       atomic_long_t           vmstats[MEMCG_NR_STAT];
> -       atomic_long_t           vmevents[NR_VM_EVENT_ITEMS];
> +       /* memory.stat */
> +       struct memcg_vmstats    vmstats;
>
>         /* memory.events */
>         atomic_long_t           memory_events[MEMCG_NR_MEMORY_EVENTS];
> @@ -315,10 +332,6 @@ struct mem_cgroup {
>         atomic_t                moving_account;
>         struct task_struct      *move_lock_task;
>
> -       /* Legacy local VM stats and events */
> -       struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> -       /* Subtree VM stats and events (batched updates) */
>         struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
>  #ifdef CONFIG_CGROUP_WRITEBACK
> @@ -942,10 +955,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
>         local_irq_restore(flags);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -                                               gfp_t gfp_mask,
> -                                               unsigned long *total_scanned);
> -
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>                           unsigned long count);
>
> @@ -1028,6 +1037,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  void mem_cgroup_split_huge_fixup(struct page *head);
>  #endif
>
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +                                               gfp_t gfp_mask,
> +                                               unsigned long *total_scanned);
> +
>  #else /* CONFIG_MEMCG */
>
>  #define MEM_CGROUP_ID_SHIFT    0
> @@ -1136,6 +1149,10 @@ static inline bool lruvec_holds_page_lru_lock(struct page *page,
>         return lruvec == &pgdat->__lruvec;
>  }
>
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +
>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
>         return NULL;
> @@ -1349,18 +1366,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>         mod_node_page_state(page_pgdat(page), idx, val);
>  }
>
> -static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -                                           gfp_t gfp_mask,
> -                                           unsigned long *total_scanned)
> -{
> -       return 0;
> -}
> -
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> -{
> -}
> -
>  static inline void count_memcg_events(struct mem_cgroup *memcg,
>                                       enum vm_event_item idx,
>                                       unsigned long count)
> @@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
>
> -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +{
> +}
> +
> +static inline
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +                                           gfp_t gfp_mask,
> +                                           unsigned long *total_scanned)
>  {
> +       return 0;

Any technical reason to move around mem_cgroup_soft_limit_reclaim(),
mem_cgroup_split_huge_fixup() and lruvec_memcg_debug() or just
aesthetics?

>  }
>  #endif /* CONFIG_MEMCG */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f97cb4cef6d..5dc0bd53b64a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>         return mz;
>  }
>
> +static void memcg_flush_vmstats(struct mem_cgroup *memcg)
> +{
> +       cgroup_rstat_flush(memcg->css.cgroup);
> +}

cgroup_rstat_flush() has one line wrapper but cgroup_rstat_updated()
does not, any reason?

> +
>  /**
>   * __mod_memcg_state - update cgroup memory statistics
>   * @memcg: the memory cgroup
> @@ -765,37 +770,17 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -       long x, threshold = MEMCG_CHARGE_BATCH;
> -
>         if (mem_cgroup_disabled())
>                 return;
>
> -       if (memcg_stat_item_in_bytes(idx))
> -               threshold <<= PAGE_SHIFT;
> -
> -       x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> -       if (unlikely(abs(x) > threshold)) {
> -               struct mem_cgroup *mi;
> -
> -               /*
> -                * Batch local counters to keep them in sync with
> -                * the hierarchical ones.
> -                */
> -               __this_cpu_add(memcg->vmstats_local->stat[idx], x);
> -               for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -                       atomic_long_add(x, &mi->vmstats[idx]);
> -               x = 0;
> -       }
> -       __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
> +       __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> +       cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  {
> -       long x = atomic_long_read(&memcg->vmstats[idx]);
> +       long x = READ_ONCE(memcg->vmstats.state[idx]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -803,17 +788,14 @@ static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>         return x;
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page_state().
> - */
> +/* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
>         long x = 0;
>         int cpu;
>
>         for_each_possible_cpu(cpu)
> -               x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
> +               x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -936,30 +918,16 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>                           unsigned long count)
>  {
> -       unsigned long x;
> -
>         if (mem_cgroup_disabled())
>                 return;
>
> -       x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> -       if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> -               struct mem_cgroup *mi;
> -
> -               /*
> -                * Batch local counters to keep them in sync with
> -                * the hierarchical ones.
> -                */
> -               __this_cpu_add(memcg->vmstats_local->events[idx], x);
> -               for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -                       atomic_long_add(x, &mi->vmevents[idx]);
> -               x = 0;
> -       }
> -       __this_cpu_write(memcg->vmstats_percpu->events[idx], x);
> +       __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> +       cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  }
>
>  static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>  {
> -       return atomic_long_read(&memcg->vmevents[event]);
> +       return READ_ONCE(memcg->vmstats.events[event]);
>  }
>
>  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> @@ -968,7 +936,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>         int cpu;
>
>         for_each_possible_cpu(cpu)
> -               x += per_cpu(memcg->vmstats_local->events[event], cpu);
> +               x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
>         return x;
>  }
>
> @@ -1631,6 +1599,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>          *
>          * Current memory state:
>          */
> +       memcg_flush_vmstats(memcg);
>
>         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>                 u64 size;
> @@ -2450,22 +2419,11 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>         drain_stock(stock);
>
>         for_each_mem_cgroup(memcg) {
> -               struct memcg_vmstats_percpu *statc;
>                 int i;
>
> -               statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> -
> -               for (i = 0; i < MEMCG_NR_STAT; i++) {
> +               for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>                         int nid;
>
> -                       if (statc->stat[i]) {
> -                               mod_memcg_state(memcg, i, statc->stat[i]);
> -                               statc->stat[i] = 0;
> -                       }
> -
> -                       if (i >= NR_VM_NODE_STAT_ITEMS)
> -                               continue;
> -
>                         for_each_node(nid) {
>                                 struct batched_lruvec_stat *lstatc;
>                                 struct mem_cgroup_per_node *pn;
> @@ -2484,13 +2442,6 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>                                 }
>                         }
>                 }
> -
> -               for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -                       if (statc->events[i]) {
> -                               count_memcg_events(memcg, i, statc->events[i]);
> -                               statc->events[i] = 0;
> -                       }
> -               }
>         }
>
>         return 0;
> @@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>         unsigned long val;
>
> +       memcg_flush_vmstats(memcg);
> +
>         if (mem_cgroup_is_root(memcg)) {

I think memcg_flush_vmstats(memcg) should be here.

>                 val = memcg_page_state(memcg, NR_FILE_PAGES) +
>                         memcg_page_state(memcg, NR_ANON_MAPPED);
> @@ -3683,26 +3636,15 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>         }
>  }
>
> -static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
> +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
>  {
> -       unsigned long stat[MEMCG_NR_STAT] = {0};
> -       struct mem_cgroup *mi;
> -       int node, cpu, i;
> -
> -       for_each_online_cpu(cpu)
> -               for (i = 0; i < MEMCG_NR_STAT; i++)
> -                       stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
> -
> -       for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -               for (i = 0; i < MEMCG_NR_STAT; i++)
> -                       atomic_long_add(stat[i], &mi->vmstats[i]);
> +       int node;
>
>         for_each_node(node) {
>                 struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
> +               unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 };
>                 struct mem_cgroup_per_node *pi;
> -
> -               for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -                       stat[i] = 0;
> +               int cpu, i;
>
>                 for_each_online_cpu(cpu)
>                         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> @@ -3715,25 +3657,6 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
>         }
>  }
>
> -static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
> -{
> -       unsigned long events[NR_VM_EVENT_ITEMS];
> -       struct mem_cgroup *mi;
> -       int cpu, i;
> -
> -       for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -               events[i] = 0;
> -
> -       for_each_online_cpu(cpu)
> -               for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -                       events[i] += per_cpu(memcg->vmstats_percpu->events[i],
> -                                            cpu);
> -
> -       for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -               for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
> -                       atomic_long_add(events[i], &mi->vmevents[i]);
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> @@ -4050,6 +3973,8 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>         int nid;
>         struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>
> +       memcg_flush_vmstats(memcg);
> +
>         for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
>                 seq_printf(m, "%s=%lu", stat->name,
>                            mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> @@ -4120,6 +4045,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>
>         BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
>
> +       memcg_flush_vmstats(memcg);
> +
>         for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
>                 unsigned long nr;
>
> @@ -4596,22 +4523,6 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
>         return &memcg->cgwb_domain;
>  }
>
> -/*
> - * idx can be of type enum memcg_stat_item or node_stat_item.
> - * Keep in sync with memcg_exact_page().
> - */
> -static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> -{
> -       long x = atomic_long_read(&memcg->vmstats[idx]);
> -       int cpu;
> -
> -       for_each_online_cpu(cpu)
> -               x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx];
> -       if (x < 0)
> -               x = 0;
> -       return x;
> -}
> -
>  /**
>   * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
>   * @wb: bdi_writeback in question
> @@ -4637,13 +4548,14 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>         struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>         struct mem_cgroup *parent;
>
> -       *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
> +       memcg_flush_vmstats(memcg);
>
> -       *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
> -       *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
> -                       memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
> -       *pheadroom = PAGE_COUNTER_MAX;
> +       *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> +       *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> +       *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) +
> +                       memcg_page_state(memcg, NR_ACTIVE_FILE);
>
> +       *pheadroom = PAGE_COUNTER_MAX;
>         while ((parent = parent_mem_cgroup(memcg))) {
>                 unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
>                                             READ_ONCE(memcg->memory.high));
> @@ -5275,7 +5187,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>         for_each_node(node)
>                 free_mem_cgroup_per_node_info(memcg, node);
>         free_percpu(memcg->vmstats_percpu);
> -       free_percpu(memcg->vmstats_local);
>         kfree(memcg);
>  }
>
> @@ -5283,11 +5194,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
>         memcg_wb_domain_exit(memcg);
>         /*
> -        * Flush percpu vmstats and vmevents to guarantee the value correctness
> -        * on parent's and all ancestor levels.
> +        * Flush percpu lruvec stats to guarantee the value
> +        * correctness on parent's and all ancestor levels.
>          */
> -       memcg_flush_percpu_vmstats(memcg);
> -       memcg_flush_percpu_vmevents(memcg);
> +       memcg_flush_lruvec_page_state(memcg);
>         __mem_cgroup_free(memcg);
>  }
>
> @@ -5314,11 +5224,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                 goto fail;
>         }
>
> -       memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> -                                               GFP_KERNEL_ACCOUNT);
> -       if (!memcg->vmstats_local)
> -               goto fail;
> -
>         memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
>                                                  GFP_KERNEL_ACCOUNT);
>         if (!memcg->vmstats_percpu)
> @@ -5518,6 +5423,62 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>         memcg_wb_domain_size_changed(memcg);
>  }
>
> +static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +       struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +       struct memcg_vmstats_percpu *statc;
> +       long delta, v;
> +       int i;
> +
> +       statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> +
> +       for (i = 0; i < MEMCG_NR_STAT; i++) {
> +               /*
> +                * Collect the aggregated propagation counts of groups
> +                * below us. We're in a per-cpu loop here and this is
> +                * a global counter, so the first cycle will get them.
> +                */
> +               delta = memcg->vmstats.state_pending[i];
> +               if (delta)
> +                       memcg->vmstats.state_pending[i] = 0;
> +
> +               /* Add CPU changes on this level since the last flush */
> +               v = READ_ONCE(statc->state[i]);
> +               if (v != statc->state_prev[i]) {
> +                       delta += v - statc->state_prev[i];
> +                       statc->state_prev[i] = v;
> +               }
> +
> +               if (!delta)
> +                       continue;
> +
> +               /* Aggregate counts on this level and propagate upwards */
> +               memcg->vmstats.state[i] += delta;
> +               if (parent)
> +                       parent->vmstats.state_pending[i] += delta;
> +       }
> +
> +       for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> +               delta = memcg->vmstats.events_pending[i];
> +               if (delta)
> +                       memcg->vmstats.events_pending[i] = 0;
> +
> +               v = READ_ONCE(statc->events[i]);
> +               if (v != statc->events_prev[i]) {
> +                       delta += v - statc->events_prev[i];
> +                       statc->events_prev[i] = v;
> +               }
> +
> +               if (!delta)
> +                       continue;
> +
> +               memcg->vmstats.events[i] += delta;
> +               if (parent)
> +                       parent->vmstats.events_pending[i] += delta;
> +       }
> +}
> +
>  #ifdef CONFIG_MMU
>  /* Handlers for move charge at task migration. */
>  static int mem_cgroup_do_precharge(unsigned long count)
> @@ -6571,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>         .css_released = mem_cgroup_css_released,
>         .css_free = mem_cgroup_css_free,
>         .css_reset = mem_cgroup_css_reset,
> +       .css_rstat_flush = mem_cgroup_css_rstat_flush,
>         .can_attach = mem_cgroup_can_attach,
>         .cancel_attach = mem_cgroup_cancel_attach,
>         .post_attach = mem_cgroup_move_task,
> --
> 2.30.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-05 18:28 ` [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
@ 2021-02-08  2:28   ` Shakeel Butt
  2021-02-08 20:54     ` Johannes Weiner
  2021-02-08 13:54   ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2021-02-08  2:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> There are two functions to flush the per-cpu data of an lruvec into
> the rest of the cgroup tree: when the cgroup is being freed, and when
> a CPU disappears during hotplug. The difference is whether all CPUs or
> just one is being collected, but the rest of the flushing code is the
> same. Merge them into one function and share the common code.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

BTW what about the lruvec stats? Why not convert them to rstat as well?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-05 18:28 ` [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
  2021-02-08  2:28   ` Shakeel Butt
@ 2021-02-08 13:54   ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-02-08 13:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Roman Gushchin, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On Fri 05-02-21 13:28:05, Johannes Weiner wrote:
> There are two functions to flush the per-cpu data of an lruvec into
> the rest of the cgroup tree: when the cgroup is being freed, and when
> a CPU disappears during hotplug. The difference is whether all CPUs or
> just one is being collected, but the rest of the flushing code is the
> same. Merge them into one function and share the common code.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Yes, this looks much better/cleaner.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 74 +++++++++++++++++++------------------------------
>  1 file changed, 28 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5dc0bd53b64a..490357945f2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2410,39 +2410,39 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  	mutex_unlock(&percpu_charge_mutex);
>  }
>  
> -static int memcg_hotplug_cpu_dead(unsigned int cpu)
> +static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu)
>  {
> -	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *memcg;
> -
> -	stock = &per_cpu(memcg_stock, cpu);
> -	drain_stock(stock);
> +	int nid;
>  
> -	for_each_mem_cgroup(memcg) {
> +	for_each_node(nid) {
> +		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
> +		unsigned long stat[NR_VM_NODE_STAT_ITEMS];
> +		struct batched_lruvec_stat *lstatc;
>  		int i;
>  
> +		lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
>  		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> -			int nid;
> +			stat[i] = lstatc->count[i];
> +			lstatc->count[i] = 0;
> +		}
>  
> -			for_each_node(nid) {
> -				struct batched_lruvec_stat *lstatc;
> -				struct mem_cgroup_per_node *pn;
> -				long x;
> +		do {
> +			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +				atomic_long_add(stat[i], &pn->lruvec_stat[i]);
> +		} while ((pn = parent_nodeinfo(pn, nid)));
> +	}
> +}
>  
> -				pn = memcg->nodeinfo[nid];
> -				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
> +static int memcg_hotplug_cpu_dead(unsigned int cpu)
> +{
> +	struct memcg_stock_pcp *stock;
> +	struct mem_cgroup *memcg;
>  
> -				x = lstatc->count[i];
> -				lstatc->count[i] = 0;
> +	stock = &per_cpu(memcg_stock, cpu);
> +	drain_stock(stock);
>  
> -				if (x) {
> -					do {
> -						atomic_long_add(x, &pn->lruvec_stat[i]);
> -					} while ((pn = parent_nodeinfo(pn, nid)));
> -				}
> -			}
> -		}
> -	}
> +	for_each_mem_cgroup(memcg)
> +		memcg_flush_lruvec_page_state(memcg, cpu);
>  
>  	return 0;
>  }
> @@ -3636,27 +3636,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  	}
>  }
>  
> -static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
> -{
> -	int node;
> -
> -	for_each_node(node) {
> -		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
> -		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 };
> -		struct mem_cgroup_per_node *pi;
> -		int cpu, i;
> -
> -		for_each_online_cpu(cpu)
> -			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -				stat[i] += per_cpu(
> -					pn->lruvec_stat_cpu->count[i], cpu);
> -
> -		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
> -			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
> -	}
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> @@ -5192,12 +5171,15 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +	int cpu;
> +
>  	memcg_wb_domain_exit(memcg);
>  	/*
>  	 * Flush percpu lruvec stats to guarantee the value
>  	 * correctness on parent's and all ancestor levels.
>  	 */
> -	memcg_flush_lruvec_page_state(memcg);
> +	for_each_online_cpu(cpu)
> +		memcg_flush_lruvec_page_state(memcg, cpu);
>  	__mem_cgroup_free(memcg);
>  }
>  
> -- 
> 2.30.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers
  2021-02-08 20:29     ` Johannes Weiner
@ 2021-02-08 15:58       ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2021-02-08 15:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, cgroups, linux-kernel, kernel-team

Hello,

On Mon, Feb 08, 2021 at 03:29:21PM -0500, Johannes Weiner wrote:
> > > @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > >  		u64_stats_update_end(&blkg->iostat.sync);
> > >  
> > >  		/* propagate global delta to parent */
> > > +		/* XXX: could skip this if parent is root */
> > >  		if (parent) {
> > >  			u64_stats_update_begin(&parent->iostat.sync);
> > >  			blkg_iostat_set(&delta, &blkg->iostat.cur);
> > 
> > Might as well update this similar to cgroup_base_stat_flush()?
> 
> I meant to revisit that, but I'm never 100% confident when it comes to
> the interaction and lifetime of css, blkcg and blkg_gq.

Yeah, it does get hairy.

> IIUC, the blkg_gq->parent linkage always matches the css parent
> linkage; it just exists as an optimization for ancestor walks, which
> would otherwise have to do radix lookups when going through the css.

But yes, at least this part is straight-forward.

> So with the cgroup_parent() check at the beginning of the function
> making sure we're looking at a non-root group, blkg_gq->parent should
> also never be NULL and I can do if (paren->parent) directly, right?

I think so.

> > >  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
> > >  {
> > > -	struct cgroup *parent = cgroup_parent(cgrp);
> > >  	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> > > +	struct cgroup *parent = cgroup_parent(cgrp);
> > 
> > Is this chunk intentional?
> 
> Yeah, it puts the local variable declarations into reverse christmas
> tree ordering to make them a bit easier to read. It's a while-at-it
> cleanup, mostly a force of habit. I can drop it if it bothers you.

I don't mind either way. Was just wondering whether it was accidental.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-08 20:54     ` Johannes Weiner
@ 2021-02-08 16:02       ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2021-02-08 16:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Andrew Morton, Michal Hocko, Roman Gushchin,
	Linux MM, Cgroups, LKML, Kernel Team

Hello,

On Mon, Feb 08, 2021 at 03:54:14PM -0500, Johannes Weiner wrote:
> We probably do need a better solution for the lruvecs as well, but in
> this case it just started holding up fixing the memory.stat issue for
> no reason and so I tabled it for another patch series.

rstat doesn't currently have a flush throttling mechanism cuz it doens't
expect readers to be super hot but adding one should be pretty easy - e.g.
it can just keep track of the number of updates on this cpu since the last
flush and then flush iff the it's above a certain threshold. Shouldn't be
too difficult to match or exceed the performance and error characteristics
of the existing code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers
  2021-02-06  3:34   ` Tejun Heo
@ 2021-02-08 20:29     ` Johannes Weiner
  2021-02-08 15:58       ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-08 20:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Feb 05, 2021 at 10:34:39PM -0500, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 05, 2021 at 01:28:03PM -0500, Johannes Weiner wrote:
> > Current users of the rstat code can source root-level statistics from
> > the native counters of their respective subsystem, allowing them to
> > forego aggregation at the root level. This optimization is currently
> > implemented inside the generic rstat code, which doesn't track the
> > root cgroup and doesn't invoke the subsystem flush callbacks on it.
> > 
> > However, the memory controller cannot do this optimization, because
> > cgroup1 breaks out memory specifically for the local level, including
> > at the root level. In preparation for the memory controller switching
> > to rstat, move the optimization from rstat core to the controllers.
> > 
> > Afterwards, rstat will always track the root cgroup for changes and
> > invoke the subsystem callbacks on it; and it's up to the subsystem to
> > special-case and skip aggregation of the root cgroup if it can source
> > this information through other, cheaper means.
> > 
> > The extra cost of tracking the root cgroup is negligible: on stat
> > changes, we actually remove a branch that checks for the root. The
> > queueing for a flush touches only per-cpu data, and only the first
> > stat change since a flush requires a (per-cpu) lock.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Generally looks good to me.
> 
> Acked-by: Tejun Heo <tj@kernel.org>

Thanks!

> A couple nits below.
> 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 02ce2058c14b..76725e1cad7f 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >  	struct blkcg *blkcg = css_to_blkcg(css);
> >  	struct blkcg_gq *blkg;
> >  
> > +	/* Root-level stats are sourced from system-wide IO stats */
> > +	if (!cgroup_parent(css->cgroup))
> > +		return;
> > +
> >  	rcu_read_lock();
> >  
> >  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >  		u64_stats_update_end(&blkg->iostat.sync);
> >  
> >  		/* propagate global delta to parent */
> > +		/* XXX: could skip this if parent is root */
> >  		if (parent) {
> >  			u64_stats_update_begin(&parent->iostat.sync);
> >  			blkg_iostat_set(&delta, &blkg->iostat.cur);
> 
> Might as well update this similar to cgroup_base_stat_flush()?

I meant to revisit that, but I'm never 100% confident when it comes to
the interaction and lifetime of css, blkcg and blkg_gq.

IIUC, the blkg_gq->parent linkage always matches the css parent
linkage; it just exists as an optimization for ancestor walks, which
would otherwise have to do radix lookups when going through the css.

So with the cgroup_parent() check at the beginning of the function
making sure we're looking at a non-root group, blkg_gq->parent should
also never be NULL and I can do if (paren->parent) directly, right?

> > @@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> >  		if (rstatc->updated_next)
> >  			break;
> >  
> > +		if (!parent) {
> 
> Maybe useful to note that the node is being marked busy but not added to the
> non-existent parent.

Makes sense, I'll add a comment.

> > +			rstatc->updated_next = cgrp;
> > +			break;
> > +		}
> > +
> > +		prstatc = cgroup_rstat_cpu(parent, cpu);
> >  		rstatc->updated_next = prstatc->updated_children;
> >  		prstatc->updated_children = cgrp;
> > +
> > +		cgrp = parent;
> >  	}
> >  
> >  	raw_spin_unlock_irqrestore(cpu_lock, flags);
> ...
> >  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
> >  {
> > -	struct cgroup *parent = cgroup_parent(cgrp);
> >  	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> > +	struct cgroup *parent = cgroup_parent(cgrp);
> 
> Is this chunk intentional?

Yeah, it puts the local variable declarations into reverse christmas
tree ordering to make them a bit easier to read. It's a while-at-it
cleanup, mostly a force of habit. I can drop it if it bothers you.

> >  	struct cgroup_base_stat cur, delta;
> >  	unsigned seq;
> >  
> > +	/* Root-level stats are sourced from system-wide CPU stats */
> > +	if (!parent)
> > +		return;
> > +
> >  	/* fetch the current per-cpu values */
> >  	do {
> >  		seq = __u64_stats_fetch_begin(&rstatc->bsync);
> > @@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
> >  	cgroup_base_stat_add(&cgrp->bstat, &delta);
> >  	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
> >  
> > -	/* propagate global delta to parent */
> > -	if (parent) {
> > +	/* propagate global delta to parent (unless that's root) */
> > +	if (cgroup_parent(parent)) {
> 
> Yeah, this makes sense. Can you add a short while-at-it note in the patch
> description?

Will do.

Thanks for the review!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/8] mm: memcontrol: switch to rstat
  2021-02-08  2:19   ` Shakeel Butt
@ 2021-02-08 20:40     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-08 20:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Sun, Feb 07, 2021 at 06:19:04PM -0800, Shakeel Butt wrote:
> On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Replace the memory controller's custom hierarchical stats code with
> > the generic rstat infrastructure provided by the cgroup core.
> >
> > The current implementation does batched upward propagation from the
> > write side (i.e. as stats change). The per-cpu batches introduce an
> > error, which is multiplied by the number of subgroups in a tree. In
> > systems with many CPUs and sizable cgroup trees, the error can be
> > large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32
> > subgroups results in an error of up to 128M per stat item). This can
> > entirely swallow allocation bursts inside a workload that the user is
> > expecting to see reflected in the statistics.
> >
> > In the past, we've done read-side aggregation, where a memory.stat
> > read would have to walk the entire subtree and add up per-cpu
> > counts. This became problematic with lazily-freed cgroups: we could
> > have large subtrees where most cgroups were entirely idle. Hence the
> > switch to change-driven upward propagation. Unfortunately, it needed
> > to trade accuracy for speed due to the write side being so hot.
> >
> > Rstat combines the best of both worlds: from the write side, it
> > cheaply maintains a queue of cgroups that have pending changes, so
> > that the read side can do selective tree aggregation. This way the
> > reported stats will always be precise and recent as can be, while the
> > aggregation can skip over potentially large numbers of idle cgroups.
> >
> > The way rstat works is that it implements a tree for tracking cgroups
> > with pending local changes, as well as a flush function that walks the
> > tree upwards. The controller then drives this by 1) telling rstat when
> > a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush
> > is required to get uptodate hierarchy stats for a given subtree
> > (e.g. when memory.stat is read). The controller also provides a flush
> > callback that is called during the rstat flush walk for each cgroup
> > and aggregates its local per-cpu counters and propagates them upwards.
> >
> > This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
> > NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
> > aggregation. It removes 3 words from the per-cpu data. It eliminates
> > memcg_exact_page_state(), since memcg_page_state() is now exact.
> 
> Only if cgroup_rstat_flush() has been called before memcg_page_state(), right?

Yes, correct.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Overall the patch looks good to me with a couple of nits/queries below.
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks!

> > @@ -1383,8 +1388,16 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
> >  {
> >  }
> >
> > -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> > +static inline void mem_cgroup_split_huge_fixup(struct page *head)
> > +{
> > +}
> > +
> > +static inline
> > +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> > +                                           gfp_t gfp_mask,
> > +                                           unsigned long *total_scanned)
> >  {
> > +       return 0;
> 
> Any technical reason to move around mem_cgroup_soft_limit_reclaim(),
> mem_cgroup_split_huge_fixup() and lruvec_memcg_debug() or just
> aesthetics?

Yeah, just a while-at-it cleanup. It seemed too minor to justify a
separate patch.

> >  #endif /* CONFIG_MEMCG */
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2f97cb4cef6d..5dc0bd53b64a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -757,6 +757,11 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >         return mz;
> >  }
> >
> > +static void memcg_flush_vmstats(struct mem_cgroup *memcg)
> > +{
> > +       cgroup_rstat_flush(memcg->css.cgroup);
> > +}
> 
> cgroup_rstat_flush() has one line wrapper but cgroup_rstat_updated()
> does not, any reason?

cgroup_rstat_flush() seemed a bit low-level to sprinkle around the
code base. Especially with cgroup_rstat_updated() encapsulated by the
mod_memcg_state() layer, a reader of such a callsite might not easily
understand what rstat even is and when and why it needs to be called.

> > @@ -3618,6 +3569,8 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> >  {
> >         unsigned long val;
> >
> > +       memcg_flush_vmstats(memcg);
> > +
> >         if (mem_cgroup_is_root(memcg)) {
> 
> I think memcg_flush_vmstats(memcg) should be here.

Good catch! I'll fix that in the next revision.

Thanks Shakeel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing
  2021-02-08  2:28   ` Shakeel Butt
@ 2021-02-08 20:54     ` Johannes Weiner
  2021-02-08 16:02       ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-08 20:54 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Sun, Feb 07, 2021 at 06:28:37PM -0800, Shakeel Butt wrote:
> On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > There are two functions to flush the per-cpu data of an lruvec into
> > the rest of the cgroup tree: when the cgroup is being freed, and when
> > a CPU disappears during hotplug. The difference is whether all CPUs or
> > just one is being collected, but the rest of the flushing code is the
> > same. Merge them into one function and share the common code.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks!

> BTW what about the lruvec stats? Why not convert them to rstat as well?

Great question.

I actually started this series with the lruvec stats included, but I'm
worried about the readers being too hot to use rstat (in its current
shape, at least). For example, the refault code accesses the lruvec
stats for every page that is refaulting - at the root level, in case
of global reclaim. With an active workload, that would result in a
very high rate of whole-tree flushes.

We probably do need a better solution for the lruvecs as well, but in
this case it just started holding up fixing the memory.stat issue for
no reason and so I tabled it for another patch series.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation
  2021-02-05 18:28 ` [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation Johannes Weiner
@ 2021-02-09 13:48   ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-02-09 13:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Roman Gushchin, Linux MM,
	Cgroups, LKML, Kernel Team

On Fri, Feb 5, 2021 at 10:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> With memcg having switched to rstat, memory.stat output is precise.
> Update the cgroup selftest to reflect the expectations and error
> tolerances of the new implementation.
>
> Also add newly tracked types of memory to the memory.stat side of the
> equation, since they're included in memory.current and could throw
> false positives.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-02-09 13:51 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-05 18:27 [PATCH 0/8] mm: memcontrol: switch to rstat v2 Johannes Weiner
2021-02-05 18:27 ` [PATCH 1/8] mm: memcontrol: fix cpuhotplug statistics flushing Johannes Weiner
2021-02-05 18:28 ` [PATCH 2/8] mm: memcontrol: kill mem_cgroup_nodeinfo() Johannes Weiner
2021-02-05 18:28 ` [PATCH 3/8] mm: memcontrol: privatize memcg_page_state query functions Johannes Weiner
2021-02-05 18:28 ` [PATCH 4/8] cgroup: rstat: support cgroup1 Johannes Weiner
2021-02-05 22:11   ` Shakeel Butt
2021-02-06  3:00   ` Tejun Heo
2021-02-05 18:28 ` [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers Johannes Weiner
2021-02-06  3:34   ` Tejun Heo
2021-02-08 20:29     ` Johannes Weiner
2021-02-08 15:58       ` Tejun Heo
2021-02-05 18:28 ` [PATCH 6/8] mm: memcontrol: switch to rstat Johannes Weiner
2021-02-08  2:19   ` Shakeel Butt
2021-02-08 20:40     ` Johannes Weiner
2021-02-05 18:28 ` [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing Johannes Weiner
2021-02-08  2:28   ` Shakeel Butt
2021-02-08 20:54     ` Johannes Weiner
2021-02-08 16:02       ` Tejun Heo
2021-02-08 13:54   ` Michal Hocko
2021-02-05 18:28 ` [PATCH 8/8] kselftests: cgroup: update kmem test for new vmstat implementation Johannes Weiner
2021-02-09 13:48   ` Shakeel Butt
2021-02-06  3:58 ` [PATCH 0/8] mm: memcontrol: switch to rstat v2 Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).