Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 0/5] mm: memcg accounting of percpu memory
@ 2020-06-23 18:45 Roman Gushchin
  2020-06-23 18:45 ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-06-23 18:45 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

This patchset adds percpu memory accounting to memory cgroups.
It's based on the rework of the slab controller and reuses concepts
and features introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by a user. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and big number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be
accounted by default.

Percpu allocations by their nature are scattered over multiple pages,
so they can't be tracked on the per-page basis. So the per-object
tracking introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds
memcg-level statistics, enables accounting for percpu allocations made
by memory cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant
memory and performance overhead the following approach is used:
all accounted allocations are placed into a separate percpu chunk
(or chunks). These chunks are similar to default chunks, except
that they do have an attached vector of pointers to obj_cgroup objects,
which is big enough to save a pointer for each allocated object.
On the allocation, if the allocation has to be accounted
(__GFP_ACCOUNT is passed, the allocating process belongs to a non-root
memory cgroup, etc), the memory cgroup is getting charged and if the maximum
limit is not exceeded the allocation is performed using a memcg-aware
chunk. Otherwise -ENOMEM is returned or the allocation is forced over
the limit, depending on gfp (as any other kernel memory allocation).
The memory cgroup information is saved in the obj_cgroup vector
at the corresponding offset. On the release time the memcg
information is restored from the vector and the cgroup is getting
uncharged.
Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no
additional overhead is expected.

To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped
to point at the parent cgroup, so that the original memory cgroup
can be released prior to all objects, which has been charged to it.
Because all charges and statistics are fully recursive, it's perfectly
correct to uncharge the parent cgroup instead. This scheme is used
in the slab memory accounting, and percpu memory can just follow
the scheme.

v3:
  1) fixed a build errors and warning with !CONFIG_MEMCG_KMEM (Andrew)
  2) fixed a build warning on Clang (Nathan)
  3) rebased on top of v7 of the slab controller patchset

v2:
  1) minor cosmetic fixes (Dennis)
  2) rebased on top of v6 of the slab controller patchset

v1:
  1) fixed a bug with gfp flags handling (Dennis)
  2) added some comments (Tejun and Dennis)
  3) rebased on top of v5 of the slab controller patchset

RFC:
  https://lore.kernel.org/linux-mm/20200519201806.2308480-1-guro@fb.com/T/#t


Roman Gushchin (5):
  percpu: return number of released bytes from pcpu_free_area()
  mm: memcg/percpu: account percpu memory to memory cgroups
  mm: memcg/percpu: per-memcg percpu memory statistics
  mm: memcg: charge memcg percpu memory to the parent cgroup
  kselftests: cgroup: add perpcu memory accounting test

 Documentation/admin-guide/cgroup-v2.rst    |   4 +
 include/linux/memcontrol.h                 |   8 +
 mm/memcontrol.c                            |  18 +-
 mm/percpu-internal.h                       |  55 +++++-
 mm/percpu-km.c                             |   5 +-
 mm/percpu-stats.c                          |  36 ++--
 mm/percpu-vm.c                             |   5 +-
 mm/percpu.c                                | 208 ++++++++++++++++++---
 tools/testing/selftests/cgroup/test_kmem.c |  70 ++++++-
 9 files changed, 360 insertions(+), 49 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-06-23 18:45 [PATCH v3 0/5] mm: memcg accounting of percpu memory Roman Gushchin
@ 2020-06-23 18:45 ` Roman Gushchin
  2020-06-24  0:58   ` Shakeel Butt
  2020-06-23 18:45 ` [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2020-06-23 18:45 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

To implement accounting of percpu memory we need the information about the
size of freed object.  Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 696367b18222..aa36b78d45a6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
  *
  * This function determines the size of an allocation to free using
  * the boundary bitmap and clears the allocation map.
+ *
+ * RETURNS:
+ * Number of freed bytes.
  */
-static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
+static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
 {
 	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
-	int bit_off, bits, end, oslot;
+	int bit_off, bits, end, oslot, freed;
 
 	lockdep_assert_held(&pcpu_lock);
 	pcpu_stats_area_dealloc(chunk);
@@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	bits = end - bit_off;
 	bitmap_clear(chunk->alloc_map, bit_off, bits);
 
+	freed = bits * PCPU_MIN_ALLOC_SIZE;
+
 	/* update metadata */
-	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
+	chunk->free_bytes += freed;
 
 	/* update first free bit */
 	chunk_md->first_free = min(chunk_md->first_free, bit_off);
@@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	pcpu_block_update_hint_free(chunk, bit_off, bits);
 
 	pcpu_chunk_relocate(chunk, oslot);
+
+	return freed;
 }
 
 static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
-- 
2.26.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-06-23 18:45 [PATCH v3 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-06-23 18:45 ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-06-23 18:45 ` Roman Gushchin
  2020-06-24  1:25   ` Shakeel Butt
  2020-06-23 18:45 ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
  2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  3 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2020-06-23 18:45 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow.  Let's introduce two types of percpu
chunks: root and memcg.  What makes memcg chunks different is an
additional space allocated to store memcg membership information.  If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu-internal.h |  55 ++++++++++++-
 mm/percpu-km.c       |   5 +-
 mm/percpu-stats.c    |  36 +++++----
 mm/percpu-vm.c       |   5 +-
 mm/percpu.c          | 185 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 246 insertions(+), 40 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 0468ba500bd4..7983455842ff 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,25 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 
+/*
+ * There are two chunk types: root and memcg-aware.
+ * Chunks of each type have separate slots list.
+ *
+ * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which is
+ * used to store memcg membership data of a percpu object.  Obj_cgroups are
+ * ref-counted pointers to a memory cgroup with an ability to switch dynamically
+ * to the parent memory cgroup.  This allows to reclaim a deleted memory cgroup
+ * without reclaiming of all outstanding objects, which hold a reference at it.
+ */
+enum pcpu_chunk_type {
+	PCPU_CHUNK_ROOT,
+#ifdef CONFIG_MEMCG_KMEM
+	PCPU_CHUNK_MEMCG,
+#endif
+	PCPU_NR_CHUNK_TYPES,
+	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
+};
+
 /*
  * pcpu_block_md is the metadata block struct.
  * Each chunk's bitmap is split into a number of full blocks.
@@ -54,6 +73,9 @@ struct pcpu_chunk {
 	int			end_offset;	/* additional area required to
 						   have the region end page
 						   aligned */
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
+#endif
 
 	int			nr_pages;	/* # of pages served by this chunk */
 	int			nr_populated;	/* # of populated pages */
@@ -63,7 +85,7 @@ struct pcpu_chunk {
 
 extern spinlock_t pcpu_lock;
 
-extern struct list_head *pcpu_slot;
+extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages;
 
@@ -106,6 +128,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
 	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	if (chunk->obj_cgroups)
+		return PCPU_CHUNK_MEMCG;
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return chunk_type == PCPU_CHUNK_MEMCG;
+}
+
+#else
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return false;
+}
+#endif
+
+static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
+{
+	return &pcpu_chunk_lists[pcpu_nr_slots *
+				 pcpu_is_memcg_chunk(chunk_type)];
+}
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 20d2b69a13b0..35c9941077ee 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	/* nada */
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
 	struct pcpu_chunk *chunk;
@@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	unsigned long flags;
 	int i;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 32558063c3f9..c8400a2adbc2 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
+	enum pcpu_chunk_type type;
 
 	max_nr_alloc = 0;
-	for (slot = 0; slot < pcpu_nr_slots; slot++)
-		list_for_each_entry(chunk, &pcpu_slot[slot], list)
-			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (slot = 0; slot < pcpu_nr_slots; slot++)
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list)
+				max_nr_alloc = max(max_nr_alloc,
+						   chunk->nr_alloc);
 
 	return max_nr_alloc;
 }
@@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 	P("cur_min_alloc", cur_min_alloc);
 	P("cur_med_alloc", cur_med_alloc);
 	P("cur_max_alloc", cur_max_alloc);
+#ifdef CONFIG_MEMCG_KMEM
+	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
+#endif
 	seq_putc(m, '\n');
 }
 
@@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
 	int *buffer;
+	enum pcpu_chunk_type type;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
 	}
 
-	for (slot = 0; slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (chunk == pcpu_first_chunk) {
-				seq_puts(m, "Chunk: <- First Chunk\n");
-				chunk_map_stats(m, chunk, buffer);
-
-
-			} else {
-				seq_puts(m, "Chunk:\n");
-				chunk_map_stats(m, chunk, buffer);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		for (slot = 0; slot < pcpu_nr_slots; slot++) {
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list) {
+				if (chunk == pcpu_first_chunk) {
+					seq_puts(m, "Chunk: <- First Chunk\n");
+					chunk_map_stats(m, chunk, buffer);
+				} else {
+					seq_puts(m, "Chunk:\n");
+					chunk_map_stats(m, chunk, buffer);
+				}
 			}
-
 		}
 	}
 
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index a2b395acef89..e46f7a6917f9 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	struct vm_struct **vms;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu.c b/mm/percpu.c
index aa36b78d45a6..c5b4f232bf37 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -37,9 +37,14 @@
  * takes care of normal allocations.
  *
  * The allocator organizes chunks into lists according to free size and
- * tries to allocate from the fullest chunk first.  Each chunk is managed
- * by a bitmap with metadata blocks.  The allocation map is updated on
- * every allocation and free to reflect the current state while the boundary
+ * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
+ * flag should be passed.  All memcg-aware allocations are sharing one set
+ * of chunks and all unaccounted allocations and allocations performed
+ * by processes belonging to the root memory cgroup are using the second set.
+ *
+ * The allocator tries to allocate from the fullest chunk first. Each chunk
+ * is managed by a bitmap with metadata blocks.  The allocation map is updated
+ * on every allocation and free to reflect the current state while the boundary
  * map is only updated on allocation.  Each metadata block contains
  * information to help mitigate the need to iterate over large portions
  * of the bitmap.  The reverse mapping from page to chunk is stored in
@@ -81,6 +86,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
+#include <linux/memcontrol.h>
 
 #include <asm/cacheflush.h>
 #include <asm/sections.h>
@@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
 
-struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
+struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
 
 /* chunks which need their map areas extended, protected by pcpu_lock */
 static LIST_HEAD(pcpu_map_extend_chunks);
@@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
 			      bool move_front)
 {
 	if (chunk != pcpu_reserved_chunk) {
+		struct list_head *pcpu_slot;
+
+		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
 		if (move_front)
 			list_move(&chunk->list, &pcpu_slot[slot]);
 		else
@@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
 		      alloc_size);
 
+#ifdef CONFIG_MEMCG_KMEM
+	/* first chunk isn't memcg-aware */
+	chunk->obj_cgroups = NULL;
+#endif
 	pcpu_init_md_blocks(chunk);
 
 	/* manage populated page bitmap */
@@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 	return chunk;
 }
 
-static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	int region_bits;
@@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 	if (!chunk->md_blocks)
 		goto md_blocks_fail;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (pcpu_is_memcg_chunk(type)) {
+		chunk->obj_cgroups =
+			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
+					sizeof(struct obj_cgroup *), gfp);
+		if (!chunk->obj_cgroups)
+			goto objcg_fail;
+	}
+#endif
+
 	pcpu_init_md_blocks(chunk);
 
 	/* init metadata */
@@ -1415,6 +1438,10 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
 	return chunk;
 
+#ifdef CONFIG_MEMCG_KMEM
+objcg_fail:
+	pcpu_mem_free(chunk->md_blocks);
+#endif
 md_blocks_fail:
 	pcpu_mem_free(chunk->bound_map);
 bound_map_fail:
@@ -1429,6 +1456,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+#ifdef CONFIG_MEMCG_KMEM
+	pcpu_mem_free(chunk->obj_cgroups);
+#endif
 	pcpu_mem_free(chunk->md_blocks);
 	pcpu_mem_free(chunk->bound_map);
 	pcpu_mem_free(chunk->alloc_map);
@@ -1505,7 +1535,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
 static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
@@ -1547,6 +1578,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct obj_cgroup **objcgp)
+{
+	struct obj_cgroup *objcg;
+
+	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
+	    memcg_kmem_bypass())
+		return PCPU_CHUNK_ROOT;
+
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return PCPU_CHUNK_ROOT;
+
+	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+		obj_cgroup_put(objcg);
+		return PCPU_FAIL_ALLOC;
+	}
+
+	*objcgp = objcg;
+	return PCPU_CHUNK_MEMCG;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+	if (!objcg)
+		return;
+
+	if (chunk) {
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+	} else {
+		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_put(objcg);
+	}
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+	struct obj_cgroup *objcg;
+
+	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
+		return;
+
+	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+
+	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+
+	obj_cgroup_put(objcg);
+}
+
+#else /* CONFIG_MEMCG_KMEM */
+static enum pcpu_chunk_type
+pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1568,6 +1670,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	gfp_t pcpu_gfp;
 	bool is_atomic;
 	bool do_warn;
+	enum pcpu_chunk_type type;
+	struct list_head *pcpu_slot;
+	struct obj_cgroup *objcg = NULL;
 	static int warn_limit = 10;
 	struct pcpu_chunk *chunk, *next;
 	const char *err;
@@ -1602,16 +1707,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
+	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
+	if (unlikely(type == PCPU_FAIL_ALLOC))
+		return NULL;
+	pcpu_slot = pcpu_chunk_list(type);
+
 	if (!is_atomic) {
 		/*
 		 * pcpu_balance_workfn() allocates memory under this mutex,
 		 * and it may wait for memory reclaim. Allow current task
 		 * to become OOM victim, in case of memory pressure.
 		 */
-		if (gfp & __GFP_NOFAIL)
+		if (gfp & __GFP_NOFAIL) {
 			mutex_lock(&pcpu_alloc_mutex);
-		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
+			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
 			return NULL;
+		}
 	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
@@ -1666,7 +1778,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
-		chunk = pcpu_create_chunk(pcpu_gfp);
+		chunk = pcpu_create_chunk(type, pcpu_gfp);
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1723,6 +1835,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
 			chunk->base_addr, off, ptr);
 
+	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+
 	return ptr;
 
 fail_unlock:
@@ -1744,6 +1858,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	} else {
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
+
+	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+
 	return NULL;
 }
 
@@ -1803,8 +1920,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 }
 
 /**
- * pcpu_balance_workfn - manage the amount of free chunks and populated pages
- * @work: unused
+ * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @type: chunk type
  *
  * Reclaim all fully free chunks except for the first one.  This is also
  * responsible for maintaining the pool of empty populated pages.  However,
@@ -1813,11 +1930,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_workfn(struct work_struct *work)
+static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 {
 	/* gfp flags passed to underlying allocators */
 	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	LIST_HEAD(to_free);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
 	struct pcpu_chunk *chunk, *next;
 	int slot, nr_to_pop, ret;
@@ -1915,7 +2033,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 
 	if (nr_to_pop) {
 		/* ran out of chunks to populate, create a new one and retry */
-		chunk = pcpu_create_chunk(gfp);
+		chunk = pcpu_create_chunk(type, gfp);
 		if (chunk) {
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_relocate(chunk, -1);
@@ -1927,6 +2045,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	mutex_unlock(&pcpu_alloc_mutex);
 }
 
+/**
+ * pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @work: unused
+ *
+ * Call __pcpu_balance_workfn() for each chunk type.
+ */
+static void pcpu_balance_workfn(struct work_struct *work)
+{
+	enum pcpu_chunk_type type;
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_balance_workfn(type);
+}
+
 /**
  * free_percpu - free percpu area
  * @ptr: pointer to area to free
@@ -1941,8 +2073,9 @@ void free_percpu(void __percpu *ptr)
 	void *addr;
 	struct pcpu_chunk *chunk;
 	unsigned long flags;
-	int off;
+	int size, off;
 	bool need_balance = false;
+	struct list_head *pcpu_slot;
 
 	if (!ptr)
 		return;
@@ -1956,7 +2089,11 @@ void free_percpu(void __percpu *ptr)
 	chunk = pcpu_chunk_addr_search(addr);
 	off = addr - chunk->base_addr;
 
-	pcpu_free_area(chunk, off);
+	size = pcpu_free_area(chunk, off);
+
+	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
+
+	pcpu_memcg_free_hook(chunk, off, size);
 
 	/* if there are more than one fully free chunks, wake up grim reaper */
 	if (chunk->free_bytes == pcpu_unit_size) {
@@ -2267,6 +2404,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	int map_size;
 	unsigned long tmp_addr;
 	size_t alloc_size;
+	enum pcpu_chunk_type type;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -2384,13 +2522,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * empty chunks.
 	 */
 	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
-	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
-				   SMP_CACHE_BYTES);
-	if (!pcpu_slot)
+	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
+					  sizeof(pcpu_chunk_lists[0]) *
+					  PCPU_NR_CHUNK_TYPES,
+					  SMP_CACHE_BYTES);
+	if (!pcpu_chunk_lists)
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
-		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
-	for (i = 0; i < pcpu_nr_slots; i++)
-		INIT_LIST_HEAD(&pcpu_slot[i]);
+		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
+		      PCPU_NR_CHUNK_TYPES);
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (i = 0; i < pcpu_nr_slots; i++)
+			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
 
 	/*
 	 * The end of the static region needs to be aligned with the
-- 
2.26.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-23 18:45 [PATCH v3 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-06-23 18:45 ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
  2020-06-23 18:45 ` [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-06-23 18:45 ` Roman Gushchin
  2020-06-24  1:35   ` Shakeel Butt
  2020-08-11 15:05   ` Johannes Weiner
  2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  3 siblings, 2 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-06-23 18:45 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs.  Let's track
percpu memory usage for each memcg and display it in memory.stat.

A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page.  So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 include/linux/memcontrol.h              |  8 ++++++++
 mm/memcontrol.c                         |  4 +++-
 mm/percpu.c                             | 10 ++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index d09471aa7443..0715ae78ca54 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  percpu
+		Amount of memory used for storing per-cpu kernel
+		data structures.
+
 	  sock
 		Amount of memory used in network transmission buffers
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a8b62d075e6..d71f8a45918c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,11 +32,19 @@ struct kmem_cache;
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
+	MEMCG_PERCPU_B,
 	/* XXX: why are these zone and not node counters? */
 	MEMCG_KERNEL_STACK_KB,
 	MEMCG_NR_STAT,
 };
 
+static __always_inline bool memcg_stat_item_in_bytes(int idx)
+{
+	if (idx == MEMCG_PERCPU_B)
+		return true;
+	return vmstat_item_in_bytes(idx);
+}
+
 enum memcg_memory_event {
 	MEMCG_LOW,
 	MEMCG_HIGH,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b858cd18b52..e8ec2498a6cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
-	if (vmstat_item_in_bytes(idx))
+	if (memcg_stat_item_in_bytes(idx))
 		threshold <<= PAGE_SHIFT;
 
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
@@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab %llu\n",
 		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
 			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
+	seq_buf_printf(&s, "percpu %llu\n",
+		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
diff --git a/mm/percpu.c b/mm/percpu.c
index c5b4f232bf37..d7717438ea8b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1610,6 +1610,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 	if (chunk) {
 		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+				size * num_possible_cpus());
+		rcu_read_unlock();
 	} else {
 		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 		obj_cgroup_put(objcg);
@@ -1628,6 +1633,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+	rcu_read_unlock();
+
 	obj_cgroup_put(objcg);
 }
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-23 18:45 [PATCH v3 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-06-23 18:45 ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-06-23 18:45 ` Roman Gushchin
  2020-06-24  1:40   ` Shakeel Butt
                     ` (2 more replies)
  3 siblings, 3 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-06-23 18:45 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Memory cgroups are using large chunks of percpu memory to store vmstat
data.  Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.

Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simple ignore it.  It actually breaks the isolation
between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ec2498a6cf..9f14b91700d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
+						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
 		kfree(pn);
 		return 1;
 	}
 
-	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
+					       GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_cpu) {
 		free_percpu(pn->lruvec_stat_local);
 		kfree(pn);
@@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
 		goto fail;
 
-	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
@@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 
+	memalloc_use_memcg(parent);
 	memcg = mem_cgroup_alloc();
+	memalloc_unuse_memcg();
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-06-23 18:45 ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-06-24  0:58   ` Shakeel Butt
  0 siblings, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2020-06-24  0:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> To implement accounting of percpu memory we need the information about the
> size of freed object.  Return it from pcpu_free_area().
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-06-23 18:45 ` [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-06-24  1:25   ` Shakeel Butt
  0 siblings, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Percpu memory is becoming more and more widely used by various subsystems,
> and the total amount of memory controlled by the percpu allocator can make
> a good part of the total memory.
>
> As an example, bpf maps can consume a lot of percpu memory, and they are
> created by a user.  Also, some cgroup internals (e.g.  memory controller
> statistics) can be quite large.  On a machine with many CPUs and big
> number of cgroups they can consume hundreds of megabytes.
>
> So the lack of memcg accounting is creating a breach in the memory
> isolation.  Similar to the slab memory, percpu memory should be accounted
> by default.
>
> To implement the perpcu accounting it's possible to take the slab memory
> accounting as a model to follow.  Let's introduce two types of percpu
> chunks: root and memcg.  What makes memcg chunks different is an
> additional space allocated to store memcg membership information.  If
> __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
> If it's possible to charge the corresponding size to the target memory
> cgroup, allocation is performed, and the memcg ownership data is recorded.
> System-wide allocations are performed using root chunks, so there is no
> additional memory overhead.
>
> To implement a fast reparenting of percpu memory on memcg removal, we
> don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
> introduced for slab accounting.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-23 18:45 ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-06-24  1:35   ` Shakeel Butt
  2020-08-11 15:05   ` Johannes Weiner
  1 sibling, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Percpu memory can represent a noticeable chunk of the total memory
> consumption, especially on big machines with many CPUs.  Let's track
> percpu memory usage for each memcg and display it in memory.stat.
>
> A percpu allocation is usually scattered over multiple pages (and nodes),
> and can be significantly smaller than a page.  So let's add a byte-sized
> counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
> created for slabs can be perfectly reused for percpu case.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
@ 2020-06-24  1:40   ` Shakeel Butt
  2020-06-24  1:49     ` Roman Gushchin
  2020-07-29 17:10   ` Michal Koutný
  2020-08-11 15:27   ` Johannes Weiner
  2 siblings, 1 reply; 24+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
>
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation

*simply

> between cgroups.
>
> Let's account the consumed percpu memory to the parent cgroup.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-24  1:40   ` Shakeel Butt
@ 2020-06-24  1:49     ` Roman Gushchin
  0 siblings, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-06-24  1:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 06:40:41PM -0700, Shakeel Butt wrote:
> On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> >
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> 
> *simply
> 
> > between cgroups.
> >
> > Let's account the consumed percpu memory to the parent cgroup.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Hello, Shakeel!

Thank you for the review of this and the previous patchsets!

Btw, I'll be completely offline till the end of the week,
so if any questions will arise around these patchsets,
I'll answer all them on Monday, Jun 29th. Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  2020-06-24  1:40   ` Shakeel Butt
@ 2020-07-29 17:10   ` Michal Koutný
  2020-08-07  4:16     ` Andrew Morton
  2020-08-11 15:27   ` Johannes Weiner
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2020-07-29 17:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 1349 bytes --]

Hello.

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's significant amount.

> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?

Should the charge go the cgroup _that is creating_ the new memcg?

One reason is that there are the throttling mechanisms for memory limits
and those are better exercised when the actor and its memory artefact
are the same cgroup, aren't they?

The second reason is based on the example Dlegation Containment
(Documentation/admin-guide/cgroup-v2.rst)

> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
> 
>   ~~~~~~~~~~~~~ - C0 - C00
>   ~ cgroup    ~      \ C01
>   ~ hierarchy ~
>   ~~~~~~~~~~~~~ - C1 - C10

Thanks to permissions a task running in C0 creating a cgroup in C1 would
deplete C1's supply victimizing tasks inside C1.

Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-07-29 17:10   ` Michal Koutný
@ 2020-08-07  4:16     ` Andrew Morton
  2020-08-07  4:37       ` Roman Gushchin
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2020-08-07  4:16 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:

> Hello.
> 
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> No doubt about accounting the memory if it's significant amount.
> 
> > Let's account the consumed percpu memory to the parent cgroup.
> Why did you choose charging to the parent of the created cgroup?
> 
> Should the charge go the cgroup _that is creating_ the new memcg?
> 
> One reason is that there are the throttling mechanisms for memory limits
> and those are better exercised when the actor and its memory artefact
> are the same cgroup, aren't they?
> 
> The second reason is based on the example Dlegation Containment
> (Documentation/admin-guide/cgroup-v2.rst)
> 
> > For an example, let's assume cgroups C0 and C1 have been delegated to
> > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > all processes under C0 and C1 belong to U0::
> > 
> >   ~~~~~~~~~~~~~ - C0 - C00
> >   ~ cgroup    ~      \ C01
> >   ~ hierarchy ~
> >   ~~~~~~~~~~~~~ - C1 - C10
> 
> Thanks to permissions a task running in C0 creating a cgroup in C1 would
> deplete C1's supply victimizing tasks inside C1.

These week-old issues appear to be significant.  Roman?  Or someone
else?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:16     ` Andrew Morton
@ 2020-08-07  4:37       ` Roman Gushchin
  2020-08-10 19:33         ` Roman Gushchin
  2020-08-11 14:47         ` Michal Koutný
  0 siblings, 2 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-08-07  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Koutný,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> 
> > Hello.
> > 
> > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > Because the size of memory cgroup internal structures can dramatically
> > > exceed the size of object or page which is pinning it in the memory, it's
> > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > between cgroups.
> > No doubt about accounting the memory if it's significant amount.
> > 
> > > Let's account the consumed percpu memory to the parent cgroup.
> > Why did you choose charging to the parent of the created cgroup?
> > 
> > Should the charge go the cgroup _that is creating_ the new memcg?
> > 
> > One reason is that there are the throttling mechanisms for memory limits
> > and those are better exercised when the actor and its memory artefact
> > are the same cgroup, aren't they?

Hi!

In general, yes. But in this case I think it wouldn't be a good idea:
most often cgroups are created by a centralized daemon (systemd),
which is usually located in the root cgroup. Even if it's located not in
the root cgroup, limiting it's memory will likely affect the whole system,
even if only one specific limit was reached.
If there is a containerized workload, which creates sub-cgroups,
charging it's parent cgroup is perfectly effective.

And the opposite, if we'll charge the cgroup of a process, who created
a cgroup, we'll not cover the most common case: systemd creating
cgroups for all services in the system.

> > 
> > The second reason is based on the example Dlegation Containment
> > (Documentation/admin-guide/cgroup-v2.rst)
> > 
> > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > all processes under C0 and C1 belong to U0::
> > > 
> > >   ~~~~~~~~~~~~~ - C0 - C00
> > >   ~ cgroup    ~      \ C01
> > >   ~ hierarchy ~
> > >   ~~~~~~~~~~~~~ - C1 - C10
> > 
> > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > deplete C1's supply victimizing tasks inside C1.

Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
in completely different cgroup. In this particular case there are tons of other
ways how a task from C00 can hurt C1.

> 
> These week-old issues appear to be significant.  Roman?  Or someone
> else?

Oh, I'm sorry, somehow I've missed this letter.
Thank you for pointing at it!

Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:37       ` Roman Gushchin
@ 2020-08-10 19:33         ` Roman Gushchin
  2020-08-11 14:47         ` Michal Koutný
  1 sibling, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2020-08-10 19:33 UTC (permalink / raw)
  To: Andrew Morton, Michal Koutný
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin wrote:
> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> > 
> > > Hello.
> > > 
> > > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > > Because the size of memory cgroup internal structures can dramatically
> > > > exceed the size of object or page which is pinning it in the memory, it's
> > > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > > between cgroups.
> > > No doubt about accounting the memory if it's significant amount.
> > > 
> > > > Let's account the consumed percpu memory to the parent cgroup.
> > > Why did you choose charging to the parent of the created cgroup?
> > > 
> > > Should the charge go the cgroup _that is creating_ the new memcg?
> > > 
> > > One reason is that there are the throttling mechanisms for memory limits
> > > and those are better exercised when the actor and its memory artefact
> > > are the same cgroup, aren't they?
> 
> Hi!
> 
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
> 
> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
> 
> > > 
> > > The second reason is based on the example Dlegation Containment
> > > (Documentation/admin-guide/cgroup-v2.rst)
> > > 
> > > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > > all processes under C0 and C1 belong to U0::
> > > > 
> > > >   ~~~~~~~~~~~~~ - C0 - C00
> > > >   ~ cgroup    ~      \ C01
> > > >   ~ hierarchy ~
> > > >   ~~~~~~~~~~~~~ - C1 - C10
> > > 
> > > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > > deplete C1's supply victimizing tasks inside C1.
> 
> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in completely different cgroup. In this particular case there are tons of other
> ways how a task from C00 can hurt C1.
> 
> > 
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
> 
> Oh, I'm sorry, somehow I've missed this letter.
> Thank you for pointing at it!

Hello, Michal!

Do you have concerns left here or it's good to go?

It seems that this blocking the whole percpu accounting patchset from being merged,
and I still hope it can be squeezed into 5.9.

Thank you!

Roman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:37       ` Roman Gushchin
  2020-08-10 19:33         ` Roman Gushchin
@ 2020-08-11 14:47         ` Michal Koutný
  2020-08-11 16:55           ` Roman Gushchin
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2020-08-11 14:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process
constraint, in the root too)

` root or delegated root
  ` manager-cgroup (systemd, docker, ...)
  ` [aggregation group(s)]
    ` job-group-1
    ` ...
    ` job-group-n

> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
No dispute about this in either approaches.

> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation.
Or more generally, any container/cgroup manager should be charged.
Consider a leak when it wouldn't remove spent cgroups, IMO the effect
(throttling, reclaim) should be exercised on such a culprit.

I don't think the existing workload (job-group-i above) should
unnecessarily suffer when only manager is acting up. Is that different
from your idea?

> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in completely different cgroup. In this particular case there are tons of other
> ways how a task from C00 can hurt C1.
I agree with that.


If I haven't overlooked anything, this should be first case when
cgroup-related structures are accounted (please correct me).
So this is setting a precendent, if others show useful to be accounted
in the future too. I'm thinking about cpu_cgroup_css_alloc() that can
also allocate a lot (with big CPU count). The current approach would lead
situations where matching cpu and memory csses needn't to exist and that
would need special handling.


> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
Despite my concerns, I don't think this is fundamental and can't be
changed later so it doesn't prevent the inclusion in 5.9 RC1.

Regards,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-06-23 18:45 ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
  2020-06-24  1:35   ` Shakeel Butt
@ 2020-08-11 15:05   ` Johannes Weiner
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2020-08-11 15:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Jun 23, 2020 at 11:45:13AM -0700, Roman Gushchin wrote:
> Percpu memory can represent a noticeable chunk of the total memory
> consumption, especially on big machines with many CPUs.  Let's track
> percpu memory usage for each memcg and display it in memory.stat.
> 
> A percpu allocation is usually scattered over multiple pages (and nodes),
> and can be significantly smaller than a page.  So let's add a byte-sized
> counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
> created for slabs can be perfectly reused for percpu case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  2020-06-24  1:40   ` Shakeel Butt
  2020-07-29 17:10   ` Michal Koutný
@ 2020-08-11 15:27   ` Johannes Weiner
  2020-08-11 17:06     ` Roman Gushchin
  2 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2020-08-11 15:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
> 
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
> 
> Let's account the consumed percpu memory to the parent cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This makes sense, and the accounting is in line with how we track and
distribute child creation quotas (cgroup.max.descendants and
cgroup.max.depth) up the cgroup tree.

I have one minor comment that isn't a dealbreaker for me:

> @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
>  
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
>  
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +					       GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>  
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
>  
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
>  
> @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
>  
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();

The disconnect between 1) requesting accounting and 2) which cgroup to
charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
handgrenade, because accounting to the current task is almost
guaranteed to be wrong if the use_memcg() annotation were to get lost
in a refactor or not make it to a new caller of the function.

The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
used elsewhere. And pretending it's an independent interface would be
overengineering. But how about the following in mem_cgroup_alloc() and
alloc_mem_cgroup_per_node_info() to document that caller relationship:

	/* We charge the parent cgroup, never the current task */
	WARN_ON_ONCE(!current->active_memcg);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 14:47         ` Michal Koutný
@ 2020-08-11 16:55           ` Roman Gushchin
  2020-08-11 18:32             ` Michal Koutný
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2020-08-11 16:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutny wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's located not in
> > the root cgroup, limiting it's memory will likely affect the whole system,
> > even if only one specific limit was reached.
> The generic scheme would be (assuming the no internal process
> constraint, in the root too)
> 
> ` root or delegated root
>   ` manager-cgroup (systemd, docker, ...)
>   ` [aggregation group(s)]
>     ` job-group-1
>     ` ...
>     ` job-group-n
> 
> > If there is a containerized workload, which creates sub-cgroups,
> > charging it's parent cgroup is perfectly effective.
> No dispute about this in either approaches.
> 
> > And the opposite, if we'll charge the cgroup of a process, who created
> > a cgroup, we'll not cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak when it wouldn't remove spent cgroups, IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.

As I said, there are 2 problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
   to "communicate" a memory shortage to the daemon.
   What we really want is to prevent creating a huge number of cgroups
   (including dying cgroups) in some specific sub-tree(s).
   OOMing the daemon or returning -ENOMEM to some random syscalls
   will not help us to reach the goal and likely will bring a bad
   experience to a user.

In a generic case I don't see how we can charge the cgroup which
creates cgroups without solving these problems first.

And if there is a very special case where we have to limit it,
we can just add an additional layer:

` root or delegated root
   ` manager-parent-cgroup-with-a-limit
     ` manager-cgroup (systemd, docker, ...)
   ` [aggregation group(s)]
     ` job-group-1
     ` ...
     ` job-group-n

> 
> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only manager is acting up. Is that different
> from your idea?
> 
> > Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> > in completely different cgroup. In this particular case there are tons of other
> > ways how a task from C00 can hurt C1.
> I agree with that.
> 
> 
> If I haven't overlooked anything, this should be first case when
> cgroup-related structures are accounted (please correct me).
> So this is setting a precendent, if others show useful to be accounted
> in the future too.

Right.

> I'm thinking about cpu_cgroup_css_alloc() that can
> also allocate a lot (with big CPU count). The current approach would lead
> situations where matching cpu and memory csses needn't to exist and that
> would need special handling.

I'd definitely charge the parent cgroup in all similar cases.

> 
> 
> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant.  Roman?  Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental and can't be
> changed later so it doesn't prevent the inclusion in 5.9 RC1.

Thank you!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 15:27   ` Johannes Weiner
@ 2020-08-11 17:06     ` Roman Gushchin
  2020-08-13  9:16       ` Naresh Kamboju
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2020-08-11 17:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 11:27:37AM -0400, Johannes Weiner wrote:
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> > 
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> > 
> > Let's account the consumed percpu memory to the parent cgroup.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you!

> 
> This makes sense, and the accounting is in line with how we track and
> distribute child creation quotas (cgroup.max.descendants and
> cgroup.max.depth) up the cgroup tree.
> 
> I have one minor comment that isn't a dealbreaker for me:
> 
> > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> >  	if (!pn)
> >  		return 1;
> >  
> > -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_local) {
> >  		kfree(pn);
> >  		return 1;
> >  	}
> >  
> > -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> > +					       GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_cpu) {
> >  		free_percpu(pn->lruvec_stat_local);
> >  		kfree(pn);
> > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >  		goto fail;
> >  	}
> >  
> > -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_local)
> >  		goto fail;
> >  
> > -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_percpu)
> >  		goto fail;
> >  
> > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >  	struct mem_cgroup *memcg;
> >  	long error = -ENOMEM;
> >  
> > +	memalloc_use_memcg(parent);
> >  	memcg = mem_cgroup_alloc();
> > +	memalloc_unuse_memcg();
> 
> The disconnect between 1) requesting accounting and 2) which cgroup to
> charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
> handgrenade, because accounting to the current task is almost
> guaranteed to be wrong if the use_memcg() annotation were to get lost
> in a refactor or not make it to a new caller of the function.
> 
> The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
> used elsewhere. And pretending it's an independent interface would be
> overengineering. But how about the following in mem_cgroup_alloc() and
> alloc_mem_cgroup_per_node_info() to document that caller relationship:
> 
> 	/* We charge the parent cgroup, never the current task */
> 	WARN_ON_ONCE(!current->active_memcg);

I have nothing against.

Andrew, can you, please, squash the following diff into the patch?

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 130093bdf74b..e25f2db7e61c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
        if (!pn)
                return 1;
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
                                                 GFP_KERNEL_ACCOUNT);
        if (!pn->lruvec_stat_local) {
@@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
                goto fail;
        }
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
                                                GFP_KERNEL_ACCOUNT);
        if (!memcg->vmstats_local)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 16:55           ` Roman Gushchin
@ 2020-08-11 18:32             ` Michal Koutný
  2020-08-11 19:32               ` Roman Gushchin
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2020-08-11 18:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 2118 bytes --]

On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most
common case).

> 2) OOMing or failing some random memory allocations is a bad way
>    to "communicate" a memory shortage to the daemon.
>    What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...

>    (including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?

>    OOMing the daemon or returning -ENOMEM to some random syscalls
>    will not help us to reach the goal and likely will bring a bad
>    experience to a user.
If we reach the situation when memory for cgroup operations is tight,
it'll disappoint the user either way.
My premise is that a running workload is more valuable than the
accompanying manager.

> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various
kernel threads doing deferred work. Here it's promoted to user processes
managing cgroups.

> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
> 
> ` root or delegated root
>    ` manager-parent-cgroup-with-a-limit
>      ` manager-cgroup (systemd, docker, ...)
>    ` [aggregation group(s)]
>      ` job-group-1
>      ` ...
>      ` job-group-n
If the charge goes to the parent of created cgroup (job-cgroup-i here),
then the layer adds nothing. Am I missing something?

> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate the controllers on the unified hierarchy, which is
fine IMO.) Then the order of enabling controllers on a subtree (e.g.
cpu,memory vs memory,cpu) by the manager would yield different charging.
This seems wrong^W confusing to me. 


Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 18:32             ` Michal Koutný
@ 2020-08-11 19:32               ` Roman Gushchin
  2020-08-12 16:28                 ` Michal Koutný
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2020-08-11 19:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, it's better.

> 
> > 2) OOMing or failing some random memory allocations is a bad way
> >    to "communicate" a memory shortage to the daemon.
> >    What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups, it can't limit
the number of dying cgroups.

> 
> >    (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine, which has no memory
pressure for some time (e.g. is idle or running a workload with small
working set). Periodically running services creating a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is getting consumed by dying cgroups and their percpu data.
Getting rid of it and reclaiming all memory is not always possible
(percpu is getting fragmented relatively easy) and is time consuming.

If we'll set memory.high on system.slice, it will create an artificial
memory pressure once we're getting close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If don't take the indirectly pinned memory
into account, likely we won't apply enough memory pressure.

If we'll limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create trashing in init.slice, followed by OOM.
I struggle to see how it makes the life of a user better?

> 
> >    OOMing the daemon or returning -ENOMEM to some random syscalls
> >    will not help us to reach the goal and likely will bring a bad
> >    experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources and help to get rid of accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well
as some other random operations from being performed). Most likely
the only way to "fix" this for a user will be to reboot the machine.

> 
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
> 
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> > 
> > ` root or delegated root
> >    ` manager-parent-cgroup-with-a-limit
> >      ` manager-cgroup (systemd, docker, ...)
> >    ` [aggregation group(s)]
> >      ` job-group-1
> >      ` ...
> >      ` job-group-n
> If the charge goes to the parent of created cgroup (job-cgroup-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.

> 
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 19:32               ` Roman Gushchin
@ 2020-08-12 16:28                 ` Michal Koutný
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Koutný @ 2020-08-12 16:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin <guro@fb.com> wrote:
> If we'll limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create trashing in init.slice, followed by OOM.
> I struggle to see how it makes the life of a user better?
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources and help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live
memcgs.

When the issue are the dying memcgs (c), they were effectively released
by their creator but are pinned by whatever remained after their life
(LRU pages, slab->obj_cgroups). Since these pins were created _from
within_ such a child (c), they're most readily removable by reclaiming
(hierarchically) close to c. (It'd be achievable by limiting the lowest
common ancestor of manager and its product (typically root) but that is
more clumsy and less effective.)

This is the reasoning that justifies the remote charge.

Thanks!
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 17:06     ` Roman Gushchin
@ 2020-08-13  9:16       ` Naresh Kamboju
  2020-08-13 23:27         ` Stephen Rothwell
  0 siblings, 1 reply; 24+ messages in thread
From: Naresh Kamboju @ 2020-08-13  9:16 UTC (permalink / raw)
  To: Roman Gushchin, Linux-Next Mailing List, open list, linux-mm, Cgroups
  Cc: Johannes Weiner, Andrew Morton, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Michal Hocko, Shakeel Butt, Kernel Team,
	lkft-triage

The kernel warnings  were noticed on linux next 20200813 while booting
on arm64, arm, x86_64 and i386.

metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e6d113aca646fb6a92b237340109237fd7a9c770
  git describe: next-20200813
  make_kernelversion: 5.8.0
  kernel-config:
https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 130093bdf74b..e25f2db7e61c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>         if (!pn)
>                 return 1;
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);
> +
>         pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
>                                                  GFP_KERNEL_ACCOUNT);
>         if (!pn->lruvec_stat_local) {
> @@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                 goto fail;
>         }
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);

[    0.217404] ------------[ cut here ]------------
[    0.218038] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5226
mem_cgroup_css_alloc+0x680/0x740
[    0.219188] Modules linked in:
[    0.219597] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-next-20200813 #1
[    0.220187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.221190] EIP: mem_cgroup_css_alloc+0x680/0x740
[    0.222190] Code: d6 17 5d ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74
26 00 b8 58 39 6a d1 e8 fe 94 55 ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d
74 26 00 <0f> 0b e9 01 fa ff ff 8d b4 26 00 00 00 00 66 90 bb f4 ff ff
ff ba
[    0.223188] EAX: 00000000 EBX: d13666c0 ECX: 00000cc0 EDX: 0000ffff
[    0.224187] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.225188] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.226190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.227195] Call Trace:
[    0.227882]  ? _cond_resched+0x17/0x30
[    0.228195]  cgroup_init_subsys+0x66/0x12a
[    0.229193]  cgroup_init+0x118/0x323
[    0.230194]  start_kernel+0x43c/0x47d
[    0.231193]  i386_start_kernel+0x48/0x4a
[    0.232194]  startup_32_smp+0x164/0x168
[    0.233195] ---[ end trace dfcf9be7b40caf05 ]---
[    0.2342#
08] ------------[ cut here ]------------
[    0.235192] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5141
mem_cgroup_css_alloc+0x718/0x740
[    0.236187] Modules linked in:
[    0.236590] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W
  5.8.0-next-20200813 #1
[    0.237190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.238194] EIP: mem_cgroup_css_alloc+0x718/0x740
[    0.239191] Code: 48 ff e9 7c fd ff ff 8d 76 00 a1 b0 14 40 d1 e9
53 fc ff ff 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00
00 00 00 <0f> 0b e9 df f9 ff ff 90 89 f8 e8 29 0c 5c ff 89 f2 b8 10 f4
40 d1
[    0.240190] EAX: 00000000 EBX: f4c0c800 ECX: 00000000 EDX: d0eab660
[    0.241189] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.242189] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.243190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.244188] Call Trace:
[    0.245191]  ? _cond_resched+0x17/0x30
[    0.245686]  cgroup_init_subsys+0x66/0x12a
[    0.246189]  cgroup_init+0x118/0x323
[    0.246654]  start_kernel+0x43c/0x47d
[    0.247189]  i386_start_kernel+0x48/0x4a
[    0.247697]  startup_32_smp+0x164/0x168
[    0.248188] ---[ end trace dfcf9be7b40caf06 ]---
[    0.248990] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[    0.249187] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0


Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20200813/testrun/3061112/suite/linux-log-parser/test/check-kernel-warning-1665815/log

-- 
Linaro LKFT
https://lkft.linaro.org


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-13  9:16       ` Naresh Kamboju
@ 2020-08-13 23:27         ` Stephen Rothwell
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Rothwell @ 2020-08-13 23:27 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Roman Gushchin, Linux-Next Mailing List, open list, linux-mm,
	Cgroups, Johannes Weiner, Andrew Morton, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Michal Hocko, Shakeel Butt, Kernel Team,
	lkft-triage


[-- Attachment #1: Type: text/plain, Size: 659 bytes --]

Hi Naresh,

On Thu, 13 Aug 2020 14:46:51 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> The kernel warnings  were noticed on linux next 20200813 while booting
> on arm64, arm, x86_64 and i386.
> 
> metadata:
>   git branch: master
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   git commit: e6d113aca646fb6a92b237340109237fd7a9c770
>   git describe: next-20200813
>   make_kernelversion: 5.8.0
>   kernel-config:
> https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

Actually in Linus' tree.

It has been fixed today.  Thanks for reporting.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, back to index

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-23 18:45 [PATCH v3 0/5] mm: memcg accounting of percpu memory Roman Gushchin
2020-06-23 18:45 ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
2020-06-24  0:58   ` Shakeel Butt
2020-06-23 18:45 ` [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
2020-06-24  1:25   ` Shakeel Butt
2020-06-23 18:45 ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
2020-06-24  1:35   ` Shakeel Butt
2020-08-11 15:05   ` Johannes Weiner
2020-06-23 18:45 ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
2020-06-24  1:40   ` Shakeel Butt
2020-06-24  1:49     ` Roman Gushchin
2020-07-29 17:10   ` Michal Koutný
2020-08-07  4:16     ` Andrew Morton
2020-08-07  4:37       ` Roman Gushchin
2020-08-10 19:33         ` Roman Gushchin
2020-08-11 14:47         ` Michal Koutný
2020-08-11 16:55           ` Roman Gushchin
2020-08-11 18:32             ` Michal Koutný
2020-08-11 19:32               ` Roman Gushchin
2020-08-12 16:28                 ` Michal Koutný
2020-08-11 15:27   ` Johannes Weiner
2020-08-11 17:06     ` Roman Gushchin
2020-08-13  9:16       ` Naresh Kamboju
2020-08-13 23:27         ` Stephen Rothwell

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git