linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] mm: memcg accounting of percpu memory
@ 2020-06-08 23:08 Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

This patchset adds percpu memory accounting to memory cgroups.
It's based on the rework of the slab controller and reuses concepts
and features introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by a user. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and big number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be
accounted by default.

Percpu allocations by their nature are scattered over multiple pages,
so they can't be tracked on the per-page basis. So the per-object
tracking introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds
memcg-level statistics, enables accounting for percpu allocations made
by memory cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant
memory and performance overhead the following approach is used:
all accounted allocations are placed into a separate percpu chunk
(or chunks). These chunks are similar to default chunks, except
that they do have an attached vector of pointers to obj_cgroup objects,
which is big enough to save a pointer for each allocated object.
On the allocation, if the allocation has to be accounted
(__GFP_ACCOUNT is passed, the allocating process belongs to a non-root
memory cgroup, etc), the memory cgroup is getting charged and if the maximum
limit is not exceeded the allocation is performed using a memcg-aware
chunk. Otherwise -ENOMEM is returned or the allocation is forced over
the limit, depending on gfp (as any other kernel memory allocation).
The memory cgroup information is saved in the obj_cgroup vector
at the corresponding offset. On the release time the memcg
information is restored from the vector and the cgroup is getting
uncharged.
Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no
additional overhead is expected.

To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped
to point at the parent cgroup, so that the original memory cgroup
can be released prior to all objects, which has been charged to it.
Because all charges and statistics are fully recursive, it's perfectly
correct to uncharge the parent cgroup instead. This scheme is used
in the slab memory accounting, and percpu memory can just follow
the scheme.

This version is based on top of v6 of the new slab controller
patchset. The following patches are actually required by this series:
  mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcontrol: decouple reference counting from page accounting
  mm: memcg/slab: obj_cgroup API

The whole series can be found here:
https://github.com/rgushchin/linux/pull/new/percpu_acc.1

v2:
  1) minor cosmetic fixes (Dennis)
  2) rebased on top of v6 of the slab controller patchset

v1:
  1) fixed a bug with gfp flags handling (Dennis)
  2) added some comments (Tejun and Dennis)
  3) rebased on top of v5 of the slab controller patchset

RFC:
  https://lore.kernel.org/linux-mm/20200519201806.2308480-1-guro@fb.com/T/#t


Roman Gushchin (5):
  percpu: return number of released bytes from pcpu_free_area()
  mm: memcg/percpu: account percpu memory to memory cgroups
  mm: memcg/percpu: per-memcg percpu memory statistics
  mm: memcg: charge memcg percpu memory to the parent cgroup
  kselftests: cgroup: add perpcu memory accounting test

 Documentation/admin-guide/cgroup-v2.rst    |   4 +
 include/linux/memcontrol.h                 |   8 +
 mm/memcontrol.c                            |  18 +-
 mm/percpu-internal.h                       |  55 +++++-
 mm/percpu-km.c                             |   5 +-
 mm/percpu-stats.c                          |  36 ++--
 mm/percpu-vm.c                             |   5 +-
 mm/percpu.c                                | 206 ++++++++++++++++++---
 tools/testing/selftests/cgroup/test_kmem.c |  70 ++++++-
 9 files changed, 358 insertions(+), 49 deletions(-)

-- 
2.25.4



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
@ 2020-06-08 23:08 ` Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

To implement accounting of percpu memory we need the information
about the size of freed object. Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 696367b18222..aa36b78d45a6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
  *
  * This function determines the size of an allocation to free using
  * the boundary bitmap and clears the allocation map.
+ *
+ * RETURNS:
+ * Number of freed bytes.
  */
-static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
+static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
 {
 	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
-	int bit_off, bits, end, oslot;
+	int bit_off, bits, end, oslot, freed;
 
 	lockdep_assert_held(&pcpu_lock);
 	pcpu_stats_area_dealloc(chunk);
@@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	bits = end - bit_off;
 	bitmap_clear(chunk->alloc_map, bit_off, bits);
 
+	freed = bits * PCPU_MIN_ALLOC_SIZE;
+
 	/* update metadata */
-	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
+	chunk->free_bytes += freed;
 
 	/* update first free bit */
 	chunk_md->first_free = min(chunk_md->first_free, bit_off);
@@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	pcpu_block_update_hint_free(chunk, bit_off, bits);
 
 	pcpu_chunk_relocate(chunk, oslot);
+
+	return freed;
 }
 
 static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-06-08 23:08 ` Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by a user. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and big number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be
accounted by default.

To implement the perpcu accounting it's possible to take the slab
memory accounting as a model to follow. Let's introduce two types of
percpu chunks: root and memcg. What makes memcg chunks different is
an additional space allocated to store memcg membership information.
If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be
used. If it's possible to charge the corresponding size to the target
memory cgroup, allocation is performed, and the memcg ownership data
is recorded. System-wide allocations are performed using root chunks,
so there is no additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal,
we don't store mem_cgroup pointers directly: instead we use obj_cgroup
API, introduced for slab accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu-internal.h |  55 ++++++++++++-
 mm/percpu-km.c       |   5 +-
 mm/percpu-stats.c    |  36 +++++----
 mm/percpu-vm.c       |   5 +-
 mm/percpu.c          | 183 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 244 insertions(+), 40 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 0468ba500bd4..7983455842ff 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,25 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 
+/*
+ * There are two chunk types: root and memcg-aware.
+ * Chunks of each type have separate slots list.
+ *
+ * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which is
+ * used to store memcg membership data of a percpu object.  Obj_cgroups are
+ * ref-counted pointers to a memory cgroup with an ability to switch dynamically
+ * to the parent memory cgroup.  This allows to reclaim a deleted memory cgroup
+ * without reclaiming of all outstanding objects, which hold a reference at it.
+ */
+enum pcpu_chunk_type {
+	PCPU_CHUNK_ROOT,
+#ifdef CONFIG_MEMCG_KMEM
+	PCPU_CHUNK_MEMCG,
+#endif
+	PCPU_NR_CHUNK_TYPES,
+	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
+};
+
 /*
  * pcpu_block_md is the metadata block struct.
  * Each chunk's bitmap is split into a number of full blocks.
@@ -54,6 +73,9 @@ struct pcpu_chunk {
 	int			end_offset;	/* additional area required to
 						   have the region end page
 						   aligned */
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
+#endif
 
 	int			nr_pages;	/* # of pages served by this chunk */
 	int			nr_populated;	/* # of populated pages */
@@ -63,7 +85,7 @@ struct pcpu_chunk {
 
 extern spinlock_t pcpu_lock;
 
-extern struct list_head *pcpu_slot;
+extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages;
 
@@ -106,6 +128,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
 	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	if (chunk->obj_cgroups)
+		return PCPU_CHUNK_MEMCG;
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return chunk_type == PCPU_CHUNK_MEMCG;
+}
+
+#else
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return false;
+}
+#endif
+
+static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
+{
+	return &pcpu_chunk_lists[pcpu_nr_slots *
+				 pcpu_is_memcg_chunk(chunk_type)];
+}
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 20d2b69a13b0..35c9941077ee 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	/* nada */
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
 	struct pcpu_chunk *chunk;
@@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	unsigned long flags;
 	int i;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 32558063c3f9..c8400a2adbc2 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
+	enum pcpu_chunk_type type;
 
 	max_nr_alloc = 0;
-	for (slot = 0; slot < pcpu_nr_slots; slot++)
-		list_for_each_entry(chunk, &pcpu_slot[slot], list)
-			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (slot = 0; slot < pcpu_nr_slots; slot++)
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list)
+				max_nr_alloc = max(max_nr_alloc,
+						   chunk->nr_alloc);
 
 	return max_nr_alloc;
 }
@@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 	P("cur_min_alloc", cur_min_alloc);
 	P("cur_med_alloc", cur_med_alloc);
 	P("cur_max_alloc", cur_max_alloc);
+#ifdef CONFIG_MEMCG_KMEM
+	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
+#endif
 	seq_putc(m, '\n');
 }
 
@@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
 	int *buffer;
+	enum pcpu_chunk_type type;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
 	}
 
-	for (slot = 0; slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (chunk == pcpu_first_chunk) {
-				seq_puts(m, "Chunk: <- First Chunk\n");
-				chunk_map_stats(m, chunk, buffer);
-
-
-			} else {
-				seq_puts(m, "Chunk:\n");
-				chunk_map_stats(m, chunk, buffer);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		for (slot = 0; slot < pcpu_nr_slots; slot++) {
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list) {
+				if (chunk == pcpu_first_chunk) {
+					seq_puts(m, "Chunk: <- First Chunk\n");
+					chunk_map_stats(m, chunk, buffer);
+				} else {
+					seq_puts(m, "Chunk:\n");
+					chunk_map_stats(m, chunk, buffer);
+				}
 			}
-
 		}
 	}
 
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index a2b395acef89..e46f7a6917f9 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	struct vm_struct **vms;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu.c b/mm/percpu.c
index aa36b78d45a6..8ebd9fe30430 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -37,9 +37,14 @@
  * takes care of normal allocations.
  *
  * The allocator organizes chunks into lists according to free size and
- * tries to allocate from the fullest chunk first.  Each chunk is managed
- * by a bitmap with metadata blocks.  The allocation map is updated on
- * every allocation and free to reflect the current state while the boundary
+ * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
+ * flag should be passed.  All memcg-aware allocations are sharing one set
+ * of chunks and all unaccounted allocations and allocations performed
+ * by processes belonging to the root memory cgroup are using the second set.
+ *
+ * The allocator tries to allocate from the fullest chunk first. Each chunk
+ * is managed by a bitmap with metadata blocks.  The allocation map is updated
+ * on every allocation and free to reflect the current state while the boundary
  * map is only updated on allocation.  Each metadata block contains
  * information to help mitigate the need to iterate over large portions
  * of the bitmap.  The reverse mapping from page to chunk is stored in
@@ -81,6 +86,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
+#include <linux/memcontrol.h>
 
 #include <asm/cacheflush.h>
 #include <asm/sections.h>
@@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
 
-struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
+struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
 
 /* chunks which need their map areas extended, protected by pcpu_lock */
 static LIST_HEAD(pcpu_map_extend_chunks);
@@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
 			      bool move_front)
 {
 	if (chunk != pcpu_reserved_chunk) {
+		struct list_head *pcpu_slot;
+
+		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
 		if (move_front)
 			list_move(&chunk->list, &pcpu_slot[slot]);
 		else
@@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
 		      alloc_size);
 
+#ifdef CONFIG_MEMCG_KMEM
+	/* first chunk isn't memcg-aware */
+	chunk->obj_cgroups = NULL;
+#endif
 	pcpu_init_md_blocks(chunk);
 
 	/* manage populated page bitmap */
@@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 	return chunk;
 }
 
-static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	int region_bits;
@@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 	if (!chunk->md_blocks)
 		goto md_blocks_fail;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (pcpu_is_memcg_chunk(type)) {
+		chunk->obj_cgroups =
+			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
+					sizeof(struct obj_cgroup *), gfp);
+		if (!chunk->obj_cgroups)
+			goto objcg_fail;
+	}
+#endif
+
 	pcpu_init_md_blocks(chunk);
 
 	/* init metadata */
@@ -1415,6 +1438,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
 	return chunk;
 
+objcg_fail:
+	pcpu_mem_free(chunk->md_blocks);
 md_blocks_fail:
 	pcpu_mem_free(chunk->bound_map);
 bound_map_fail:
@@ -1429,6 +1454,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+#ifdef CONFIG_MEMCG_KMEM
+	pcpu_mem_free(chunk->obj_cgroups);
+#endif
 	pcpu_mem_free(chunk->md_blocks);
 	pcpu_mem_free(chunk->bound_map);
 	pcpu_mem_free(chunk->alloc_map);
@@ -1505,7 +1533,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
 static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
@@ -1547,6 +1576,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct obj_cgroup **objcgp)
+{
+	struct obj_cgroup *objcg;
+
+	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
+	    memcg_kmem_bypass())
+		return PCPU_CHUNK_ROOT;
+
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return PCPU_CHUNK_ROOT;
+
+	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+		obj_cgroup_put(objcg);
+		return PCPU_FAIL_ALLOC;
+	}
+
+	*objcgp = objcg;
+	return PCPU_CHUNK_MEMCG;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+	if (!objcg)
+		return;
+
+	if (chunk) {
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+	} else {
+		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_put(objcg);
+	}
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+	struct obj_cgroup *objcg;
+
+	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
+		return;
+
+	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+
+	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+
+	obj_cgroup_put(objcg);
+}
+
+#else /* CONFIG_MEMCG_KMEM */
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct mem_cgroup **memcgp)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1568,6 +1668,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	gfp_t pcpu_gfp;
 	bool is_atomic;
 	bool do_warn;
+	enum pcpu_chunk_type type;
+	struct list_head *pcpu_slot;
+	struct obj_cgroup *objcg = NULL;
 	static int warn_limit = 10;
 	struct pcpu_chunk *chunk, *next;
 	const char *err;
@@ -1602,16 +1705,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
+	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
+	if (unlikely(type == PCPU_FAIL_ALLOC))
+		return NULL;
+	pcpu_slot = pcpu_chunk_list(type);
+
 	if (!is_atomic) {
 		/*
 		 * pcpu_balance_workfn() allocates memory under this mutex,
 		 * and it may wait for memory reclaim. Allow current task
 		 * to become OOM victim, in case of memory pressure.
 		 */
-		if (gfp & __GFP_NOFAIL)
+		if (gfp & __GFP_NOFAIL) {
 			mutex_lock(&pcpu_alloc_mutex);
-		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
+			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
 			return NULL;
+		}
 	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
@@ -1666,7 +1776,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
-		chunk = pcpu_create_chunk(pcpu_gfp);
+		chunk = pcpu_create_chunk(type, pcpu_gfp);
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1723,6 +1833,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
 			chunk->base_addr, off, ptr);
 
+	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+
 	return ptr;
 
 fail_unlock:
@@ -1744,6 +1856,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	} else {
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
+
+	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+
 	return NULL;
 }
 
@@ -1803,8 +1918,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 }
 
 /**
- * pcpu_balance_workfn - manage the amount of free chunks and populated pages
- * @work: unused
+ * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @type: chunk type
  *
  * Reclaim all fully free chunks except for the first one.  This is also
  * responsible for maintaining the pool of empty populated pages.  However,
@@ -1813,11 +1928,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_workfn(struct work_struct *work)
+static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 {
 	/* gfp flags passed to underlying allocators */
 	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	LIST_HEAD(to_free);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
 	struct pcpu_chunk *chunk, *next;
 	int slot, nr_to_pop, ret;
@@ -1915,7 +2031,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 
 	if (nr_to_pop) {
 		/* ran out of chunks to populate, create a new one and retry */
-		chunk = pcpu_create_chunk(gfp);
+		chunk = pcpu_create_chunk(type, gfp);
 		if (chunk) {
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_relocate(chunk, -1);
@@ -1927,6 +2043,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	mutex_unlock(&pcpu_alloc_mutex);
 }
 
+/**
+ * pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @work: unused
+ *
+ * Call __pcpu_balance_workfn() for each chunk type.
+ */
+static void pcpu_balance_workfn(struct work_struct *work)
+{
+	enum pcpu_chunk_type type;
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_balance_workfn(type);
+}
+
 /**
  * free_percpu - free percpu area
  * @ptr: pointer to area to free
@@ -1941,8 +2071,9 @@ void free_percpu(void __percpu *ptr)
 	void *addr;
 	struct pcpu_chunk *chunk;
 	unsigned long flags;
-	int off;
+	int size, off;
 	bool need_balance = false;
+	struct list_head *pcpu_slot;
 
 	if (!ptr)
 		return;
@@ -1956,7 +2087,11 @@ void free_percpu(void __percpu *ptr)
 	chunk = pcpu_chunk_addr_search(addr);
 	off = addr - chunk->base_addr;
 
-	pcpu_free_area(chunk, off);
+	size = pcpu_free_area(chunk, off);
+
+	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
+
+	pcpu_memcg_free_hook(chunk, off, size);
 
 	/* if there are more than one fully free chunks, wake up grim reaper */
 	if (chunk->free_bytes == pcpu_unit_size) {
@@ -2267,6 +2402,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	int map_size;
 	unsigned long tmp_addr;
 	size_t alloc_size;
+	enum pcpu_chunk_type type;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -2384,13 +2520,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * empty chunks.
 	 */
 	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
-	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
-				   SMP_CACHE_BYTES);
-	if (!pcpu_slot)
+	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
+					  sizeof(pcpu_chunk_lists[0]) *
+					  PCPU_NR_CHUNK_TYPES,
+					  SMP_CACHE_BYTES);
+	if (!pcpu_chunk_lists)
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
-		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
-	for (i = 0; i < pcpu_nr_slots; i++)
-		INIT_LIST_HEAD(&pcpu_slot[i]);
+		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
+		      PCPU_NR_CHUNK_TYPES);
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (i = 0; i < pcpu_nr_slots; i++)
+			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
 
 	/*
 	 * The end of the static region needs to be aligned with the
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-06-08 23:08 ` Roman Gushchin
  2020-06-22  1:48   ` Nathan Chancellor
  2020-06-08 23:08 ` [PATCH v2 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory can represent a noticeable chunk of the total
memory consumption, especially on big machines with many CPUs.
Let's track percpu memory usage for each memcg and display
it in memory.stat.

A percpu allocation is usually scattered over multiple pages
(and nodes), and can be significantly smaller than a page.
So let's add a byte-sized counter on the memcg level:
MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
can be perfectly reused for percpu case.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 include/linux/memcontrol.h              |  8 ++++++++
 mm/memcontrol.c                         |  4 +++-
 mm/percpu.c                             | 10 ++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index ce3e05e41724..7c1e784239bf 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  percpu
+		Amount of memory used for storing per-cpu kernel
+		data structures.
+
 	  sock
 		Amount of memory used in network transmission buffers
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eede46c43573..7ed3af71a6fb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,11 +32,19 @@ struct kmem_cache;
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
+	MEMCG_PERCPU_B,
 	/* XXX: why are these zone and not node counters? */
 	MEMCG_KERNEL_STACK_KB,
 	MEMCG_NR_STAT,
 };
 
+static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
+{
+	if (item == MEMCG_PERCPU_B)
+		return true;
+	return vmstat_item_in_bytes(item);
+}
+
 enum memcg_memory_event {
 	MEMCG_LOW,
 	MEMCG_HIGH,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 93b2e73ef2f7..839b4d890a90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
-	if (vmstat_item_in_bytes(idx))
+	if (memcg_stat_item_in_bytes(idx))
 		threshold <<= PAGE_SHIFT;
 
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
@@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab %llu\n",
 		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
 			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
+	seq_buf_printf(&s, "percpu %llu\n",
+		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
diff --git a/mm/percpu.c b/mm/percpu.c
index 8ebd9fe30430..18d3d049bf91 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1608,6 +1608,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 	if (chunk) {
 		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+				size * num_possible_cpus());
+		rcu_read_unlock();
 	} else {
 		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 		obj_cgroup_put(objcg);
@@ -1626,6 +1631,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+	rcu_read_unlock();
+
 	obj_cgroup_put(objcg);
 }
 
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-06-08 23:08 ` [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-06-08 23:08 ` Roman Gushchin
  2020-06-08 23:08 ` [PATCH v2 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
  2020-06-16 21:19 ` [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  5 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Memory cgroups are using large chunks of percpu memory to store
vmstat data. Yet this memory is not accounted at all, so in the
case when there are many (dying) cgroups, it's not exactly clear
where all the memory is.

Because the size of  memory cgroup internal structures can
dramatically exceed the size of object or page which is pinning
it in the memory, it's not a good idea to simple ignore it.
It actually breaks the isolation between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 839b4d890a90..638ad499a9e0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
+						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
 		kfree(pn);
 		return 1;
 	}
 
-	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
+					       GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_cpu) {
 		free_percpu(pn->lruvec_stat_local);
 		kfree(pn);
@@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
 		goto fail;
 
-	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
@@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 
+	memalloc_use_memcg(parent);
 	memcg = mem_cgroup_alloc();
+	memalloc_unuse_memcg();
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/5] kselftests: cgroup: add perpcu memory accounting test
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-06-08 23:08 ` [PATCH v2 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
@ 2020-06-08 23:08 ` Roman Gushchin
  2020-06-16 21:19 ` [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  5 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:08 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Add a simple test to check the percpu memory accounting.
The test creates a cgroup tree with 1000 child cgroups
and checks values of memory.current and memory.stat::percpu.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/test_kmem.c | 70 +++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
index 5224dae216e5..0941aa16157e 100644
--- a/tools/testing/selftests/cgroup/test_kmem.c
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -18,6 +18,15 @@
 #include "cgroup_util.h"
 
 
+/*
+ * Memory cgroup charging and vmstat data aggregation is performed using
+ * percpu batches 32 pages big (look at MEMCG_CHARGE_BATCH). So the maximum
+ * discrepancy between charge and vmstat entries is number of cpus multiplied
+ * by 32 pages multiplied by 2.
+ */
+#define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs())
+
+
 static int alloc_dcache(const char *cgroup, void *arg)
 {
 	unsigned long i;
@@ -180,7 +189,7 @@ static int test_kmem_memcg_deletion(const char *root)
 		goto cleanup;
 
 	sum = slab + anon + file + kernel_stack;
-	if (abs(sum - current) < 4096 * 32 * 2 * get_nprocs()) {
+	if (abs(sum - current) < MAX_VMSTAT_ERROR) {
 		ret = KSFT_PASS;
 	} else {
 		printf("memory.current = %ld\n", current);
@@ -331,6 +340,64 @@ static int test_kmem_dead_cgroups(const char *root)
 	return ret;
 }
 
+/*
+ * This test creates a sub-tree with 1000 memory cgroups.
+ * Then it checks that the memory.current on the parent level
+ * is greater than 0 and approximates matches the percpu value
+ * from memory.stat.
+ */
+static int test_percpu_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent, *child;
+	long current, percpu;
+	int i;
+
+	parent = cg_name(root, "percpu_basic_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child))
+			goto cleanup_children;
+
+		free(child);
+	}
+
+	current = cg_read_long(parent, "memory.current");
+	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
+
+	if (current > 0 && percpu > 0 && abs(current - percpu) <
+	    MAX_VMSTAT_ERROR)
+		ret = KSFT_PASS;
+	else
+		printf("memory.current %ld\npercpu %ld\n",
+		       current, percpu);
+
+cleanup_children:
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		cg_destroy(child);
+		free(child);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct kmem_test {
 	int (*fn)(const char *root);
@@ -341,6 +408,7 @@ struct kmem_test {
 	T(test_kmem_proc_kpagecgroup),
 	T(test_kmem_kernel_stacks),
 	T(test_kmem_dead_cgroups),
+	T(test_percpu_basic),
 };
 #undef T
 
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory
  2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (4 preceding siblings ...)
  2020-06-08 23:08 ` [PATCH v2 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
@ 2020-06-16 21:19 ` Roman Gushchin
  2020-06-17 20:39   ` Andrew Morton
  5 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2020-06-16 21:19 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Mon, Jun 08, 2020 at 04:08:14PM -0700, Roman Gushchin wrote:
> This patchset adds percpu memory accounting to memory cgroups.
> It's based on the rework of the slab controller and reuses concepts
> and features introduced for the per-object slab accounting.
> 
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make a good part of the total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by a user. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and big number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting is creating a breach in the memory
> isolation. Similar to the slab memory, percpu memory should be
> accounted by default.
> 
> Percpu allocations by their nature are scattered over multiple pages,
> so they can't be tracked on the per-page basis. So the per-object
> tracking introduced by the new slab controller is reused.
> 
> The patchset implements charging of percpu allocations, adds
> memcg-level statistics, enables accounting for percpu allocations made
> by memory cgroup internals and provides some basic tests.
> 
> To implement the accounting of percpu memory without a significant
> memory and performance overhead the following approach is used:
> all accounted allocations are placed into a separate percpu chunk
> (or chunks). These chunks are similar to default chunks, except
> that they do have an attached vector of pointers to obj_cgroup objects,
> which is big enough to save a pointer for each allocated object.
> On the allocation, if the allocation has to be accounted
> (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root
> memory cgroup, etc), the memory cgroup is getting charged and if the maximum
> limit is not exceeded the allocation is performed using a memcg-aware
> chunk. Otherwise -ENOMEM is returned or the allocation is forced over
> the limit, depending on gfp (as any other kernel memory allocation).
> The memory cgroup information is saved in the obj_cgroup vector
> at the corresponding offset. On the release time the memcg
> information is restored from the vector and the cgroup is getting
> uncharged.
> Unaccounted allocations (at this point the absolute majority
> of all percpu allocations) are performed in the old way, so no
> additional overhead is expected.
> 
> To avoid pinning dying memory cgroups by outstanding allocations,
> obj_cgroup API is used instead of directly saving memory cgroup pointers.
> obj_cgroup is basically a pointer to a memory cgroup with a standalone
> reference counter. The trick is that it can be atomically swapped
> to point at the parent cgroup, so that the original memory cgroup
> can be released prior to all objects, which has been charged to it.
> Because all charges and statistics are fully recursive, it's perfectly
> correct to uncharge the parent cgroup instead. This scheme is used
> in the slab memory accounting, and percpu memory can just follow
> the scheme.
> 
> This version is based on top of v6 of the new slab controller
> patchset. The following patches are actually required by this series:
>   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
>   mm: memcg: prepare for byte-sized vmstat items
>   mm: memcg: convert vmstat slab counters to bytes
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcontrol: decouple reference counting from page accounting
>   mm: memcg/slab: obj_cgroup API

Hello, Andrew!

How this patchset should be routed: through the mm or percpu tree?

It has been acked by Dennis (the percpu maintainer), but it does depend
on first several patches from the slab controller rework patchset.

The slab controller rework is ready to be merged: as in v6 most patches
in the series were acked by Johannes and/or Vlastimil and no questions
or concerns were raised after v6.

Please, let me know if you want me to resend both patchsets.

Thank you!

Roman



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory
  2020-06-16 21:19 ` [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
@ 2020-06-17 20:39   ` Andrew Morton
  2020-06-17 20:47     ` Roman Gushchin
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2020-06-17 20:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Johannes Weiner, Michal Hocko, Shakeel Butt,
	linux-mm, kernel-team, linux-kernel

On Tue, 16 Jun 2020 14:19:01 -0700 Roman Gushchin <guro@fb.com> wrote:

> > 
> > This version is based on top of v6 of the new slab controller
> > patchset. The following patches are actually required by this series:
> >   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
> >   mm: memcg: prepare for byte-sized vmstat items
> >   mm: memcg: convert vmstat slab counters to bytes
> >   mm: slub: implement SLUB version of obj_to_index()
> >   mm: memcontrol: decouple reference counting from page accounting
> >   mm: memcg/slab: obj_cgroup API
> 
> Hello, Andrew!
> 
> How this patchset should be routed: through the mm or percpu tree?
> 
> It has been acked by Dennis (the percpu maintainer), but it does depend
> on first several patches from the slab controller rework patchset.

I can grab both.

> The slab controller rework is ready to be merged: as in v6 most patches
> in the series were acked by Johannes and/or Vlastimil and no questions
> or concerns were raised after v6.
> 
> Please, let me know if you want me to resend both patchsets.

There was quite a bit of valuable discussion in response to [0/n] which
really should have been in the changelog[s] from day one. 
slab-vs-slub, performance testing, etc.

So, umm, I'll take a look at both series now but I do think an enhanced
[0/n] description is warranted?



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory
  2020-06-17 20:39   ` Andrew Morton
@ 2020-06-17 20:47     ` Roman Gushchin
  0 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2020-06-17 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dennis Zhou, Johannes Weiner, Michal Hocko, Shakeel Butt,
	linux-mm, kernel-team, linux-kernel

On Wed, Jun 17, 2020 at 01:39:49PM -0700, Andrew Morton wrote:
> On Tue, 16 Jun 2020 14:19:01 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> > > 
> > > This version is based on top of v6 of the new slab controller
> > > patchset. The following patches are actually required by this series:
> > >   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
> > >   mm: memcg: prepare for byte-sized vmstat items
> > >   mm: memcg: convert vmstat slab counters to bytes
> > >   mm: slub: implement SLUB version of obj_to_index()
> > >   mm: memcontrol: decouple reference counting from page accounting
> > >   mm: memcg/slab: obj_cgroup API
> > 
> > Hello, Andrew!
> > 
> > How this patchset should be routed: through the mm or percpu tree?
> > 
> > It has been acked by Dennis (the percpu maintainer), but it does depend
> > on first several patches from the slab controller rework patchset.
> 
> I can grab both.

Perfect, thanks!

> 
> > The slab controller rework is ready to be merged: as in v6 most patches
> > in the series were acked by Johannes and/or Vlastimil and no questions
> > or concerns were raised after v6.
> > 
> > Please, let me know if you want me to resend both patchsets.
> 
> There was quite a bit of valuable discussion in response to [0/n] which
> really should have been in the changelog[s] from day one. 
> slab-vs-slub, performance testing, etc.
> 
> So, umm, I'll take a look at both series now but I do think an enhanced
> [0/n] description is warranted?
> 

Yes, I'm running suggested tests right now, and will update on the results.

Thanks!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-08 23:08 ` [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-06-22  1:48   ` Nathan Chancellor
  2020-06-22  3:26     ` Roman Gushchin
  0 siblings, 1 reply; 12+ messages in thread
From: Nathan Chancellor @ 2020-06-22  1:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, clang-built-linux

On Mon, Jun 08, 2020 at 04:08:17PM -0700, Roman Gushchin wrote:
> Percpu memory can represent a noticeable chunk of the total
> memory consumption, especially on big machines with many CPUs.
> Let's track percpu memory usage for each memcg and display
> it in memory.stat.
> 
> A percpu allocation is usually scattered over multiple pages
> (and nodes), and can be significantly smaller than a page.
> So let's add a byte-sized counter on the memcg level:
> MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
> can be perfectly reused for percpu case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  include/linux/memcontrol.h              |  8 ++++++++
>  mm/memcontrol.c                         |  4 +++-
>  mm/percpu.c                             | 10 ++++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index ce3e05e41724..7c1e784239bf 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
>  		Amount of memory used for storing in-kernel data
>  		structures.
>  
> +	  percpu
> +		Amount of memory used for storing per-cpu kernel
> +		data structures.
> +
>  	  sock
>  		Amount of memory used in network transmission buffers
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index eede46c43573..7ed3af71a6fb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -32,11 +32,19 @@ struct kmem_cache;
>  enum memcg_stat_item {
>  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
>  	MEMCG_SOCK,
> +	MEMCG_PERCPU_B,
>  	/* XXX: why are these zone and not node counters? */
>  	MEMCG_KERNEL_STACK_KB,
>  	MEMCG_NR_STAT,
>  };
>  
> +static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> +{
> +	if (item == MEMCG_PERCPU_B)
> +		return true;
> +	return vmstat_item_in_bytes(item);

This patch is now in -next and this line causes a warning from clang,
which shows up in every translation unit that includes this header,
which is a lot:

include/linux/memcontrol.h:45:30: warning: implicit conversion from
enumeration type 'enum memcg_stat_item' to different enumeration type
'enum node_stat_item' [-Wenum-conversion]
        return vmstat_item_in_bytes(item);
               ~~~~~~~~~~~~~~~~~~~~ ^~~~
1 warning generated.

I assume this conversion is intentional; if so, it seems like expecting
a specific enum is misleading. Perhaps this should be applied on top?

Cheers,
Nathan

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2499f78cf32d..bddeb4ce7a4f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,7 +38,7 @@ enum memcg_stat_item {
 	MEMCG_NR_STAT,
 };
 
-static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
+static __always_inline bool memcg_stat_item_in_bytes(int item)
 {
 	if (item == MEMCG_PERCPU_B)
 		return true;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 084ee1c17160..52d7961a24f0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -211,7 +211,7 @@ enum node_stat_item {
  * measured in pages). This defines the API part, the internal representation
  * might be different.
  */
-static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+static __always_inline bool vmstat_item_in_bytes(int item)
 {
 	/*
 	 * Global and per-node slab counters track slab pages.


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-22  1:48   ` Nathan Chancellor
@ 2020-06-22  3:26     ` Roman Gushchin
  2020-06-22  3:51       ` Nathan Chancellor
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2020-06-22  3:26 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, clang-built-linux

On Sun, Jun 21, 2020 at 06:48:03PM -0700, Nathan Chancellor wrote:
> On Mon, Jun 08, 2020 at 04:08:17PM -0700, Roman Gushchin wrote:
> > Percpu memory can represent a noticeable chunk of the total
> > memory consumption, especially on big machines with many CPUs.
> > Let's track percpu memory usage for each memcg and display
> > it in memory.stat.
> > 
> > A percpu allocation is usually scattered over multiple pages
> > (and nodes), and can be significantly smaller than a page.
> > So let's add a byte-sized counter on the memcg level:
> > MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
> > can be perfectly reused for percpu case.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
> >  include/linux/memcontrol.h              |  8 ++++++++
> >  mm/memcontrol.c                         |  4 +++-
> >  mm/percpu.c                             | 10 ++++++++++
> >  4 files changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index ce3e05e41724..7c1e784239bf 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
> >  		Amount of memory used for storing in-kernel data
> >  		structures.
> >  
> > +	  percpu
> > +		Amount of memory used for storing per-cpu kernel
> > +		data structures.
> > +
> >  	  sock
> >  		Amount of memory used in network transmission buffers
> >  
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index eede46c43573..7ed3af71a6fb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -32,11 +32,19 @@ struct kmem_cache;
> >  enum memcg_stat_item {
> >  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> >  	MEMCG_SOCK,
> > +	MEMCG_PERCPU_B,
> >  	/* XXX: why are these zone and not node counters? */
> >  	MEMCG_KERNEL_STACK_KB,
> >  	MEMCG_NR_STAT,
> >  };
> >  
> > +static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> > +{
> > +	if (item == MEMCG_PERCPU_B)
> > +		return true;
> > +	return vmstat_item_in_bytes(item);
> 
> This patch is now in -next and this line causes a warning from clang,
> which shows up in every translation unit that includes this header,
> which is a lot:
> 
> include/linux/memcontrol.h:45:30: warning: implicit conversion from
> enumeration type 'enum memcg_stat_item' to different enumeration type
> 'enum node_stat_item' [-Wenum-conversion]
>         return vmstat_item_in_bytes(item);
>                ~~~~~~~~~~~~~~~~~~~~ ^~~~
> 1 warning generated.
> 
> I assume this conversion is intentional; if so, it seems like expecting
> a specific enum is misleading. Perhaps this should be applied on top?

Hi Nathan!

Yeah, these enums are kind of stacked on each other, so memcg_stat values
extend node_stat values. And I think your patch is correct.

I'm going to refresh the series with some small fixups. If you're not against
it, I'll merge your patch into the corresponding patches.

And thank you for reporting the problem!

> 
> Cheers,
> Nathan
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2499f78cf32d..bddeb4ce7a4f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,7 +38,7 @@ enum memcg_stat_item {
>  	MEMCG_NR_STAT,
>  };
>  
> -static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> +static __always_inline bool memcg_stat_item_in_bytes(int item)
>  {
>  	if (item == MEMCG_PERCPU_B)
>  		return true;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 084ee1c17160..52d7961a24f0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -211,7 +211,7 @@ enum node_stat_item {
>   * measured in pages). This defines the API part, the internal representation
>   * might be different.
>   */
> -static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> +static __always_inline bool vmstat_item_in_bytes(int item)
>  {
>  	/*
>  	 * Global and per-node slab counters track slab pages.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-22  3:26     ` Roman Gushchin
@ 2020-06-22  3:51       ` Nathan Chancellor
  0 siblings, 0 replies; 12+ messages in thread
From: Nathan Chancellor @ 2020-06-22  3:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, clang-built-linux

On Sun, Jun 21, 2020 at 08:26:35PM -0700, Roman Gushchin wrote:
> On Sun, Jun 21, 2020 at 06:48:03PM -0700, Nathan Chancellor wrote:
> > On Mon, Jun 08, 2020 at 04:08:17PM -0700, Roman Gushchin wrote:
> > > Percpu memory can represent a noticeable chunk of the total
> > > memory consumption, especially on big machines with many CPUs.
> > > Let's track percpu memory usage for each memcg and display
> > > it in memory.stat.
> > > 
> > > A percpu allocation is usually scattered over multiple pages
> > > (and nodes), and can be significantly smaller than a page.
> > > So let's add a byte-sized counter on the memcg level:
> > > MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
> > > can be perfectly reused for percpu case.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Acked-by: Dennis Zhou <dennis@kernel.org>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
> > >  include/linux/memcontrol.h              |  8 ++++++++
> > >  mm/memcontrol.c                         |  4 +++-
> > >  mm/percpu.c                             | 10 ++++++++++
> > >  4 files changed, 25 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index ce3e05e41724..7c1e784239bf 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
> > >  		Amount of memory used for storing in-kernel data
> > >  		structures.
> > >  
> > > +	  percpu
> > > +		Amount of memory used for storing per-cpu kernel
> > > +		data structures.
> > > +
> > >  	  sock
> > >  		Amount of memory used in network transmission buffers
> > >  
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index eede46c43573..7ed3af71a6fb 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -32,11 +32,19 @@ struct kmem_cache;
> > >  enum memcg_stat_item {
> > >  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> > >  	MEMCG_SOCK,
> > > +	MEMCG_PERCPU_B,
> > >  	/* XXX: why are these zone and not node counters? */
> > >  	MEMCG_KERNEL_STACK_KB,
> > >  	MEMCG_NR_STAT,
> > >  };
> > >  
> > > +static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> > > +{
> > > +	if (item == MEMCG_PERCPU_B)
> > > +		return true;
> > > +	return vmstat_item_in_bytes(item);
> > 
> > This patch is now in -next and this line causes a warning from clang,
> > which shows up in every translation unit that includes this header,
> > which is a lot:
> > 
> > include/linux/memcontrol.h:45:30: warning: implicit conversion from
> > enumeration type 'enum memcg_stat_item' to different enumeration type
> > 'enum node_stat_item' [-Wenum-conversion]
> >         return vmstat_item_in_bytes(item);
> >                ~~~~~~~~~~~~~~~~~~~~ ^~~~
> > 1 warning generated.
> > 
> > I assume this conversion is intentional; if so, it seems like expecting
> > a specific enum is misleading. Perhaps this should be applied on top?
> 
> Hi Nathan!
> 
> Yeah, these enums are kind of stacked on each other, so memcg_stat values
> extend node_stat values. And I think your patch is correct.

Yeah, I figured. It happens in the kernel in a couple of different
places that we have seen so far. So far, my suggestion seems to be the
best one that we have uncovered when dealing with one enum that
supplments or extends another. These functions are fairly self
explanation so I don't think that blowing away the type safety here is a
big deal.

> I'm going to refresh the series with some small fixups. If you're not against
> it, I'll merge your patch into the corresponding patches.

Please do! Thanks for the quick response.

Cheers,
Nathan

> And thank you for reporting the problem!
> 
> > 
> > Cheers,
> > Nathan
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2499f78cf32d..bddeb4ce7a4f 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -38,7 +38,7 @@ enum memcg_stat_item {
> >  	MEMCG_NR_STAT,
> >  };
> >  
> > -static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> > +static __always_inline bool memcg_stat_item_in_bytes(int item)
> >  {
> >  	if (item == MEMCG_PERCPU_B)
> >  		return true;
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 084ee1c17160..52d7961a24f0 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -211,7 +211,7 @@ enum node_stat_item {
> >   * measured in pages). This defines the API part, the internal representation
> >   * might be different.
> >   */
> > -static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> > +static __always_inline bool vmstat_item_in_bytes(int item)
> >  {
> >  	/*
> >  	 * Global and per-node slab counters track slab pages.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-06-22  3:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-08 23:08 [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
2020-06-08 23:08 ` [PATCH v2 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
2020-06-08 23:08 ` [PATCH v2 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
2020-06-08 23:08 ` [PATCH v2 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
2020-06-22  1:48   ` Nathan Chancellor
2020-06-22  3:26     ` Roman Gushchin
2020-06-22  3:51       ` Nathan Chancellor
2020-06-08 23:08 ` [PATCH v2 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
2020-06-08 23:08 ` [PATCH v2 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
2020-06-16 21:19 ` [Potential Spoof] [PATCH v2 0/5] mm: memcg accounting of percpu memory Roman Gushchin
2020-06-17 20:39   ` Andrew Morton
2020-06-17 20:47     ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).