* [PATCH v1 0/5] mm: memcg accounting of percpu memory
@ 2020-05-28 23:25 Roman Gushchin
  2020-05-28 23:25 ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

This patchset adds percpu memory accounting to memory cgroups.
It's based on the rework of the slab controller and reuses concepts
and features introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make up a significant part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by users. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and a large number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting creates a breach in memory
isolation. Similar to slab memory, percpu memory should be
accounted by default.

Percpu allocations are by their nature scattered over multiple pages,
so they can't be tracked on a per-page basis. Instead, the per-object
tracking introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds
memcg-level statistics, enables accounting for percpu allocations made
by memory cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant
memory and performance overhead, the following approach is used:
all accounted allocations are placed into a separate percpu chunk
(or chunks). These chunks are similar to the default chunks, except
that they have an attached vector of pointers to obj_cgroup objects,
big enough to hold a pointer for each allocated object.
If an allocation has to be accounted (__GFP_ACCOUNT is passed, the
allocating process belongs to a non-root memory cgroup, etc.), the
memory cgroup is charged first, and if the limit is not exceeded the
allocation is served from a memcg-aware chunk. Otherwise -ENOMEM is
returned or the allocation is forced over the limit, depending on the
gfp flags (as with any other kernel memory allocation). The memory
cgroup information is saved in the obj_cgroup vector at the
corresponding offset. At release time the memcg information is
restored from the vector and the cgroup is uncharged.
Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no
additional overhead is expected.
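
From the caller's point of view nothing changes except the gfp flags.
A minimal caller-side sketch (foo_stats and the helpers below are
made-up names used only for illustration; the part this series acts
on is GFP_KERNEL_ACCOUNT, which includes __GFP_ACCOUNT):

    #include <linux/percpu.h>
    #include <linux/gfp.h>

    /* hypothetical per-cpu data of some subsystem */
    struct foo_stats {
            u64 events;
    };

    static struct foo_stats __percpu *foo_stats_alloc(void)
    {
            /*
             * GFP_KERNEL_ACCOUNT makes the allocation come from a
             * memcg-aware chunk; size * num_possible_cpus() bytes are
             * charged to the current memory cgroup.
             */
            return alloc_percpu_gfp(struct foo_stats, GFP_KERNEL_ACCOUNT);
    }

    static void foo_stats_free(struct foo_stats __percpu *stats)
    {
            /* uncharges the owning memcg (or its parent, see below) */
            free_percpu(stats);
    }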

To avoid pinning dying memory cgroups with outstanding allocations,
the obj_cgroup API is used instead of saving memory cgroup pointers
directly. An obj_cgroup is basically a pointer to a memory cgroup with
a standalone reference counter. The trick is that it can be atomically
swapped to point at the parent cgroup, so that the original memory
cgroup can be released before all objects which have been charged to
it are freed. Because all charges and statistics are fully recursive,
it's perfectly correct to uncharge the parent cgroup instead. This
scheme is already used for slab memory accounting, and percpu memory
can simply follow it.
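
Very roughly (a simplified sketch of the idea; the real definitions
and locking live in the slab controller series and are not reproduced
here), an obj_cgroup and its reparenting look like this:

    /* simplified: field names and locking are approximate */
    struct obj_cgroup {
            struct percpu_ref refcnt;       /* pins the objcg itself */
            struct mem_cgroup *memcg;       /* can be switched to the parent */
    };

    /* called for each objcg of a memcg that is being offlined */
    static void objcg_reparent(struct obj_cgroup *objcg,
                               struct mem_cgroup *parent)
    {
            /* all future uncharges and stats updates go to the parent */
            xchg(&objcg->memcg, parent);
    }

So an outstanding percpu object pins only the small obj_cgroup, not
the memory cgroup structure itself.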

This version is based on top of v5 of the new slab controller
patchset. The following patches are actually required by this series:
  mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcontrol: decouple reference counting from page accounting
  mm: memcg/slab: obj_cgroup API

The whole series can be found here:
https://github.com/rgushchin/linux/pull/new/percpu_acc.1

v1:
  1) fixed a bug with gfp flags handling (Dennis)
  2) added some comments (Tejun and Dennis)
  3) rebased to v5 of the slab controller patchset

RFC:
  https://lore.kernel.org/linux-mm/20200519201806.2308480-1-guro@fb.com/T/#t


Roman Gushchin (5):
  percpu: return number of released bytes from pcpu_free_area()
  mm: memcg/percpu: account percpu memory to memory cgroups
  mm: memcg/percpu: per-memcg percpu memory statistics
  mm: memcg: charge memcg percpu memory to the parent cgroup
  kselftests: cgroup: add percpu memory accounting test

 Documentation/admin-guide/cgroup-v2.rst    |   4 +
 include/linux/memcontrol.h                 |   8 +
 mm/memcontrol.c                            |  18 +-
 mm/percpu-internal.h                       |  57 +++++-
 mm/percpu-km.c                             |   5 +-
 mm/percpu-stats.c                          |  36 ++--
 mm/percpu-vm.c                             |   5 +-
 mm/percpu.c                                | 209 ++++++++++++++++++---
 tools/testing/selftests/cgroup/test_kmem.c |  59 ++++++
 9 files changed, 352 insertions(+), 49 deletions(-)

-- 
2.25.4




* [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
@ 2020-05-28 23:25 ` Roman Gushchin
  2020-06-05 19:44   ` Dennis Zhou
  2020-05-28 23:25 ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

To implement accounting of percpu memory we need information
about the size of the freed object. Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 696367b18222..aa36b78d45a6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
  *
  * This function determines the size of an allocation to free using
  * the boundary bitmap and clears the allocation map.
+ *
+ * RETURNS:
+ * Number of freed bytes.
  */
-static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
+static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
 {
 	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
-	int bit_off, bits, end, oslot;
+	int bit_off, bits, end, oslot, freed;
 
 	lockdep_assert_held(&pcpu_lock);
 	pcpu_stats_area_dealloc(chunk);
@@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	bits = end - bit_off;
 	bitmap_clear(chunk->alloc_map, bit_off, bits);
 
+	freed = bits * PCPU_MIN_ALLOC_SIZE;
+
 	/* update metadata */
-	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
+	chunk->free_bytes += freed;
 
 	/* update first free bit */
 	chunk_md->first_free = min(chunk_md->first_free, bit_off);
@@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	pcpu_block_update_hint_free(chunk, bit_off, bits);
 
 	pcpu_chunk_relocate(chunk, oslot);
+
+	return freed;
 }
 
 static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
-- 
2.25.4




* [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-05-28 23:25 ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-05-28 23:25 ` Roman Gushchin
  2020-06-05 19:49   ` Dennis Zhou
  2020-05-28 23:25 ` [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make up a significant part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by users. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and a large number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting creates a breach in memory
isolation. Similar to slab memory, percpu memory should be
accounted by default.

To implement the percpu accounting it's possible to take the slab
memory accounting as a model to follow. Let's introduce two types of
percpu chunks: root and memcg. What makes memcg chunks different is
an additional space allocated to store memcg membership information.
If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be
used. If it's possible to charge the corresponding size to the target
memory cgroup, the allocation is performed, and the memcg ownership
data is recorded. System-wide allocations are performed using root
chunks, so there is no additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal,
we don't store mem_cgroup pointers directly: instead we use the
obj_cgroup API, introduced for slab accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |  57 ++++++++++++-
 mm/percpu-km.c       |   5 +-
 mm/percpu-stats.c    |  36 +++++----
 mm/percpu-vm.c       |   5 +-
 mm/percpu.c          | 186 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 248 insertions(+), 41 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 0468ba500bd4..0cf36337eb47 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,27 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 
+/*
+ * There are two chunk types: root and memcg-aware.
+ * Chunks of each type have separate slots list.
+ *
+ * Memcg-aware chunks have an attached vector of obj_cgroup
+ * pointers, which is used to store memcg membership data
+ * of a percpu object. Obj_cgroups are ref-counted pointers
+ * to a memory cgroup with an ability to switch dynamically
+ * to the parent memory cgroup. This allows to reclaim a deleted
+ * memory cgroup without reclaiming of all outstanding objects,
+ * which do hold a reference at it.
+ */
+enum pcpu_chunk_type {
+	PCPU_CHUNK_ROOT,
+#ifdef CONFIG_MEMCG_KMEM
+	PCPU_CHUNK_MEMCG,
+#endif
+	PCPU_NR_CHUNK_TYPES,
+	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
+};
+
 /*
  * pcpu_block_md is the metadata block struct.
  * Each chunk's bitmap is split into a number of full blocks.
@@ -54,6 +75,9 @@ struct pcpu_chunk {
 	int			end_offset;	/* additional area required to
 						   have the region end page
 						   aligned */
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
+#endif
 
 	int			nr_pages;	/* # of pages served by this chunk */
 	int			nr_populated;	/* # of populated pages */
@@ -63,7 +87,7 @@ struct pcpu_chunk {
 
 extern spinlock_t pcpu_lock;
 
-extern struct list_head *pcpu_slot;
+extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages;
 
@@ -106,6 +130,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
 	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	if (chunk->obj_cgroups)
+		return PCPU_CHUNK_MEMCG;
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return chunk_type == PCPU_CHUNK_MEMCG;
+}
+
+#else
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return false;
+}
+#endif
+
+static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
+{
+	return &pcpu_chunk_lists[pcpu_nr_slots *
+				 pcpu_is_memcg_chunk(chunk_type)];
+}
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 20d2b69a13b0..35c9941077ee 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	/* nada */
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
 	struct pcpu_chunk *chunk;
@@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	unsigned long flags;
 	int i;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 32558063c3f9..c8400a2adbc2 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
+	enum pcpu_chunk_type type;
 
 	max_nr_alloc = 0;
-	for (slot = 0; slot < pcpu_nr_slots; slot++)
-		list_for_each_entry(chunk, &pcpu_slot[slot], list)
-			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (slot = 0; slot < pcpu_nr_slots; slot++)
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list)
+				max_nr_alloc = max(max_nr_alloc,
+						   chunk->nr_alloc);
 
 	return max_nr_alloc;
 }
@@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 	P("cur_min_alloc", cur_min_alloc);
 	P("cur_med_alloc", cur_med_alloc);
 	P("cur_max_alloc", cur_max_alloc);
+#ifdef CONFIG_MEMCG_KMEM
+	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
+#endif
 	seq_putc(m, '\n');
 }
 
@@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
 	int *buffer;
+	enum pcpu_chunk_type type;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
 	}
 
-	for (slot = 0; slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (chunk == pcpu_first_chunk) {
-				seq_puts(m, "Chunk: <- First Chunk\n");
-				chunk_map_stats(m, chunk, buffer);
-
-
-			} else {
-				seq_puts(m, "Chunk:\n");
-				chunk_map_stats(m, chunk, buffer);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		for (slot = 0; slot < pcpu_nr_slots; slot++) {
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list) {
+				if (chunk == pcpu_first_chunk) {
+					seq_puts(m, "Chunk: <- First Chunk\n");
+					chunk_map_stats(m, chunk, buffer);
+				} else {
+					seq_puts(m, "Chunk:\n");
+					chunk_map_stats(m, chunk, buffer);
+				}
 			}
-
 		}
 	}
 
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index a2b395acef89..e46f7a6917f9 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	struct vm_struct **vms;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu.c b/mm/percpu.c
index aa36b78d45a6..85f5755c9114 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -37,9 +37,14 @@
  * takes care of normal allocations.
  *
  * The allocator organizes chunks into lists according to free size and
- * tries to allocate from the fullest chunk first.  Each chunk is managed
- * by a bitmap with metadata blocks.  The allocation map is updated on
- * every allocation and free to reflect the current state while the boundary
+ * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
+ * flag should be passed.  All memcg-aware allocations are sharing one set
+ * of chunks and all unaccounted allocations and allocations performed
+ * by processes belonging to the root memory cgroup are using the second set.
+ *
+ * The allocator tries to allocate from the fullest chunk first. Each chunk
+ * is managed by a bitmap with metadata blocks.  The allocation map is updated
+ * on every allocation and free to reflect the current state while the boundary
  * map is only updated on allocation.  Each metadata block contains
  * information to help mitigate the need to iterate over large portions
  * of the bitmap.  The reverse mapping from page to chunk is stored in
@@ -81,6 +86,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
+#include <linux/memcontrol.h>
 
 #include <asm/cacheflush.h>
 #include <asm/sections.h>
@@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
 
-struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
+struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
 
 /* chunks which need their map areas extended, protected by pcpu_lock */
 static LIST_HEAD(pcpu_map_extend_chunks);
@@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
 			      bool move_front)
 {
 	if (chunk != pcpu_reserved_chunk) {
+		struct list_head *pcpu_slot;
+
+		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
 		if (move_front)
 			list_move(&chunk->list, &pcpu_slot[slot]);
 		else
@@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
 		      alloc_size);
 
+#ifdef CONFIG_MEMCG_KMEM
+	/* first chunk isn't memcg-aware */
+	chunk->obj_cgroups = NULL;
+#endif
 	pcpu_init_md_blocks(chunk);
 
 	/* manage populated page bitmap */
@@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 	return chunk;
 }
 
-static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	int region_bits;
@@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 	if (!chunk->md_blocks)
 		goto md_blocks_fail;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (pcpu_is_memcg_chunk(type)) {
+		chunk->obj_cgroups =
+			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
+					sizeof(struct obj_cgroup *), gfp);
+		if (!chunk->obj_cgroups)
+			goto objcg_fail;
+	}
+#endif
+
 	pcpu_init_md_blocks(chunk);
 
 	/* init metadata */
@@ -1415,6 +1438,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
 	return chunk;
 
+objcg_fail:
+	pcpu_mem_free(chunk->md_blocks);
 md_blocks_fail:
 	pcpu_mem_free(chunk->bound_map);
 bound_map_fail:
@@ -1429,6 +1454,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+#ifdef CONFIG_MEMCG_KMEM
+	pcpu_mem_free(chunk->obj_cgroups);
+#endif
 	pcpu_mem_free(chunk->md_blocks);
 	pcpu_mem_free(chunk->bound_map);
 	pcpu_mem_free(chunk->alloc_map);
@@ -1505,7 +1533,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
 static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
@@ -1547,6 +1576,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct obj_cgroup **objcgp)
+{
+	struct obj_cgroup *objcg;
+
+	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
+	    memcg_kmem_bypass())
+		return PCPU_CHUNK_ROOT;
+
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return PCPU_CHUNK_ROOT;
+
+	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+		obj_cgroup_put(objcg);
+		return PCPU_FAIL_ALLOC;
+	}
+
+	*objcgp = objcg;
+	return PCPU_CHUNK_MEMCG;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+	if (!objcg)
+		return;
+
+	if (chunk) {
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+	} else {
+		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_put(objcg);
+	}
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+	struct obj_cgroup *objcg;
+
+	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
+		return;
+
+	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+
+	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+
+	obj_cgroup_put(objcg);
+}
+
+#else /* CONFIG_MEMCG_KMEM */
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct mem_cgroup **memcgp)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1568,6 +1668,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	gfp_t pcpu_gfp;
 	bool is_atomic;
 	bool do_warn;
+	enum pcpu_chunk_type type;
+	struct list_head *pcpu_slot;
+	struct obj_cgroup *objcg = NULL;
 	static int warn_limit = 10;
 	struct pcpu_chunk *chunk, *next;
 	const char *err;
@@ -1602,16 +1705,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
+	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
+	if (unlikely(type == PCPU_FAIL_ALLOC))
+		return NULL;
+	pcpu_slot = pcpu_chunk_list(type);
+
 	if (!is_atomic) {
 		/*
 		 * pcpu_balance_workfn() allocates memory under this mutex,
 		 * and it may wait for memory reclaim. Allow current task
 		 * to become OOM victim, in case of memory pressure.
 		 */
-		if (gfp & __GFP_NOFAIL)
+		if (gfp & __GFP_NOFAIL) {
 			mutex_lock(&pcpu_alloc_mutex);
-		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
+			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
 			return NULL;
+		}
 	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
@@ -1637,7 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 restart:
 	/* search through normal chunks */
 	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
+		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
+					 list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
 			if (off < 0) {
@@ -1666,7 +1777,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
-		chunk = pcpu_create_chunk(pcpu_gfp);
+		chunk = pcpu_create_chunk(type, pcpu_gfp);
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1723,6 +1834,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
 			chunk->base_addr, off, ptr);
 
+	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+
 	return ptr;
 
 fail_unlock:
@@ -1744,6 +1857,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	} else {
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
+
+	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+
 	return NULL;
 }
 
@@ -1803,8 +1919,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 }
 
 /**
- * pcpu_balance_workfn - manage the amount of free chunks and populated pages
- * @work: unused
+ * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @type: chunk type
  *
  * Reclaim all fully free chunks except for the first one.  This is also
  * responsible for maintaining the pool of empty populated pages.  However,
@@ -1813,11 +1929,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_workfn(struct work_struct *work)
+static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 {
 	/* gfp flags passed to underlying allocators */
 	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	LIST_HEAD(to_free);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
 	struct pcpu_chunk *chunk, *next;
 	int slot, nr_to_pop, ret;
@@ -1915,7 +2032,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 
 	if (nr_to_pop) {
 		/* ran out of chunks to populate, create a new one and retry */
-		chunk = pcpu_create_chunk(gfp);
+		chunk = pcpu_create_chunk(type, gfp);
 		if (chunk) {
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_relocate(chunk, -1);
@@ -1927,6 +2044,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	mutex_unlock(&pcpu_alloc_mutex);
 }
 
+/**
+ * pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @work: unused
+ *
+ * Call __pcpu_balance_workfn() for each chunk type.
+ */
+static void pcpu_balance_workfn(struct work_struct *work)
+{
+	enum pcpu_chunk_type type;
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_balance_workfn(type);
+}
+
 /**
  * free_percpu - free percpu area
  * @ptr: pointer to area to free
@@ -1941,8 +2072,9 @@ void free_percpu(void __percpu *ptr)
 	void *addr;
 	struct pcpu_chunk *chunk;
 	unsigned long flags;
-	int off;
+	int size, off;
 	bool need_balance = false;
+	struct list_head *pcpu_slot;
 
 	if (!ptr)
 		return;
@@ -1956,7 +2088,11 @@ void free_percpu(void __percpu *ptr)
 	chunk = pcpu_chunk_addr_search(addr);
 	off = addr - chunk->base_addr;
 
-	pcpu_free_area(chunk, off);
+	size = pcpu_free_area(chunk, off);
+
+	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
+
+	pcpu_memcg_free_hook(chunk, off, size);
 
 	/* if there are more than one fully free chunks, wake up grim reaper */
 	if (chunk->free_bytes == pcpu_unit_size) {
@@ -2267,6 +2403,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	int map_size;
 	unsigned long tmp_addr;
 	size_t alloc_size;
+	enum pcpu_chunk_type type;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -2384,13 +2521,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * empty chunks.
 	 */
 	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
-	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
-				   SMP_CACHE_BYTES);
-	if (!pcpu_slot)
+	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
+					  sizeof(pcpu_chunk_lists[0]) *
+					  PCPU_NR_CHUNK_TYPES,
+					  SMP_CACHE_BYTES);
+	if (!pcpu_chunk_lists)
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
-		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
-	for (i = 0; i < pcpu_nr_slots; i++)
-		INIT_LIST_HEAD(&pcpu_slot[i]);
+		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
+		      PCPU_NR_CHUNK_TYPES);
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (i = 0; i < pcpu_nr_slots; i++)
+			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
 
 	/*
 	 * The end of the static region needs to be aligned with the
-- 
2.25.4




* [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-05-28 23:25 ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
  2020-05-28 23:25 ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-05-28 23:25 ` Roman Gushchin
  2020-06-05 19:53   ` Dennis Zhou
  2020-05-28 23:25 ` [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  2020-05-28 23:25 ` [PATCH v1 5/5] kselftests: cgroup: add percpu memory accounting test Roman Gushchin
  4 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory can represent a noticeable chunk of the total
memory consumption, especially on big machines with many CPUs.
Let's track percpu memory usage for each memcg and display
it in memory.stat.

A percpu allocation is usually scattered over multiple pages
(and nodes), and can be significantly smaller than a page.
So let's add a byte-sized counter on the memcg level:
MEMCG_PERCPU_B. The byte-sized vmstat infrastructure created for
slabs can be reused for the percpu case.
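
Just to illustrate the convention (a hypothetical helper, not part of
this patch): byte-sized items are reported as-is, while page-based
counters have to be scaled by PAGE_SIZE, which is what the
memory_stat_format() hunk below does for "percpu" vs "sock":

    static u64 memcg_stat_bytes(struct mem_cgroup *memcg, int idx)
    {
            u64 x = memcg_page_state(memcg, idx);

            /* MEMCG_PERCPU_B and the slab *_B items are already in bytes */
            return memcg_stat_item_in_bytes(idx) ? x : x * PAGE_SIZE;
    }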

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 include/linux/memcontrol.h              |  8 ++++++++
 mm/memcontrol.c                         |  4 +++-
 mm/percpu.c                             | 10 ++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index fed4e1d2a343..aa8cb6dadadc 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1276,6 +1276,10 @@ PAGE_SIZE multiple when read back.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  percpu
+		Amount of memory used for storing per-cpu kernel
+		data structures.
+
 	  sock
 		Amount of memory used in network transmission buffers
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a84d9164449..f62a95d472f7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,11 +32,19 @@ struct kmem_cache;
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
+	MEMCG_PERCPU_B,
 	/* XXX: why are these zone and not node counters? */
 	MEMCG_KERNEL_STACK_KB,
 	MEMCG_NR_STAT,
 };
 
+static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
+{
+	if (item == MEMCG_PERCPU_B)
+		return true;
+	return vmstat_item_in_bytes(item);
+}
+
 enum memcg_memory_event {
 	MEMCG_LOW,
 	MEMCG_HIGH,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7bc3fd196210..5007d1585a4a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
-	if (vmstat_item_in_bytes(idx))
+	if (memcg_stat_item_in_bytes(idx))
 		threshold <<= PAGE_SHIFT;
 
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
@@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab %llu\n",
 		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
 			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
+	seq_buf_printf(&s, "percpu %llu\n",
+		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
diff --git a/mm/percpu.c b/mm/percpu.c
index 85f5755c9114..b4b3e9c8a6d1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1608,6 +1608,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 	if (chunk) {
 		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+				size * num_possible_cpus());
+		rcu_read_unlock();
 	} else {
 		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 		obj_cgroup_put(objcg);
@@ -1626,6 +1631,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+	rcu_read_unlock();
+
 	obj_cgroup_put(objcg);
 }
 
-- 
2.25.4




* [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-05-28 23:25 ` [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-05-28 23:25 ` Roman Gushchin
  2020-06-05 19:54   ` Dennis Zhou
  2020-05-28 23:25 ` [PATCH v1 5/5] kselftests: cgroup: add percpu memory accounting test Roman Gushchin
  4 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Memory cgroups use large chunks of percpu memory to store vmstat
data. Yet this memory is not accounted at all, so in the case when
there are many (dying) cgroups, it's not exactly clear where all the
memory is.

Because the size of memory cgroup internal structures can
dramatically exceed the size of the object or page which is pinning
it in memory, it's not a good idea to simply ignore it.
It actually breaks the isolation between cgroups.

Let's account the consumed percpu memory to the parent cgroup.
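
The pattern (sketched here; the actual hunks are below) is to pass
__GFP_ACCOUNT for the percpu allocations and to temporarily override
the charging target with memalloc_use_memcg(), because the task
creating the cgroup is not necessarily running in the cgroup this
memory should be charged to:

    memalloc_use_memcg(parent);
    /* GFP_KERNEL_ACCOUNT percpu allocations inside are charged to @parent */
    memcg = mem_cgroup_alloc();
    memalloc_unuse_memcg();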

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5007d1585a4a..0dd0d05a011c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5020,13 +5020,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
+						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
 		kfree(pn);
 		return 1;
 	}
 
-	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
+					       GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_cpu) {
 		free_percpu(pn->lruvec_stat_local);
 		kfree(pn);
@@ -5100,11 +5102,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
 		goto fail;
 
-	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
@@ -5153,7 +5157,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 
+	memalloc_use_memcg(parent);
 	memcg = mem_cgroup_alloc();
+	memalloc_unuse_memcg();
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-- 
2.25.4




* [PATCH v1 5/5] kselftests: cgroup: add percpu memory accounting test
  2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-05-28 23:25 ` [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
@ 2020-05-28 23:25 ` Roman Gushchin
  2020-06-05 20:07   ` Dennis Zhou
  4 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2020-05-28 23:25 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Add a simple test to check the percpu memory accounting.
The test creates a cgroup tree with 1000 child cgroups
and checks values of memory.current and memory.stat::percpu.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
index 5224dae216e5..a0d4f1a3137d 100644
--- a/tools/testing/selftests/cgroup/test_kmem.c
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
 	return ret;
 }
 
+/*
+ * This test creates a sub-tree with 1000 memory cgroups.
+ * Then it checks that the memory.current on the parent level
+ * is greater than 0 and approximately matches the percpu value
+ * from memory.stat.
+ */
+static int test_percpu_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent, *child;
+	long current, percpu;
+	int i;
+
+	parent = cg_name(root, "percpu_basic_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child))
+			goto cleanup_children;
+
+		free(child);
+	}
+
+	current = cg_read_long(parent, "memory.current");
+	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
+
+	if (current > 0 && percpu > 0 && abs(current - percpu) <
+	    4096 * 32 * get_nprocs())
+		ret = KSFT_PASS;
+	else
+		printf("memory.current %ld\npercpu %ld\n",
+		       current, percpu);
+
+cleanup_children:
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		cg_destroy(child);
+		free(child);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct kmem_test {
 	int (*fn)(const char *root);
@@ -341,6 +399,7 @@ struct kmem_test {
 	T(test_kmem_proc_kpagecgroup),
 	T(test_kmem_kernel_stacks),
 	T(test_kmem_dead_cgroups),
+	T(test_percpu_basic),
 };
 #undef T
 
-- 
2.25.4




* Re: [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-05-28 23:25 ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-06-05 19:44   ` Dennis Zhou
  0 siblings, 0 replies; 13+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:04PM -0700, Roman Gushchin wrote:
> To implement accounting of percpu memory we need information
> about the size of the freed object. Return it from pcpu_free_area().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 696367b18222..aa36b78d45a6 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
>   *
>   * This function determines the size of an allocation to free using
>   * the boundary bitmap and clears the allocation map.
> + *
> + * RETURNS:
> + * Number of freed bytes.
>   */
> -static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
> +static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  {
>  	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> -	int bit_off, bits, end, oslot;
> +	int bit_off, bits, end, oslot, freed;
>  
>  	lockdep_assert_held(&pcpu_lock);
>  	pcpu_stats_area_dealloc(chunk);
> @@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  	bits = end - bit_off;
>  	bitmap_clear(chunk->alloc_map, bit_off, bits);
>  
> +	freed = bits * PCPU_MIN_ALLOC_SIZE;
> +
>  	/* update metadata */
> -	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
> +	chunk->free_bytes += freed;
>  
>  	/* update first free bit */
>  	chunk_md->first_free = min(chunk_md->first_free, bit_off);
> @@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  	pcpu_block_update_hint_free(chunk, bit_off, bits);
>  
>  	pcpu_chunk_relocate(chunk, oslot);
> +
> +	return freed;
>  }
>  
>  static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
> -- 
> 2.25.4
> 

Sorry for the delay.

Acked-by: Dennis Zhou <dennis@kernel.org>

What's the status of the patches this series depends on? It might be
easiest to have Andrew pick these up once that patch series is settled.

Thanks,
Dennis


* Re: [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-05-28 23:25 ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-06-05 19:49   ` Dennis Zhou
  2020-06-05 22:44     ` Roman Gushchin
  0 siblings, 1 reply; 13+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:05PM -0700, Roman Gushchin wrote:
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make up a significant part of the total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by users. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and a large number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting creates a breach in memory
> isolation. Similar to slab memory, percpu memory should be
> accounted by default.
> 
> To implement the percpu accounting it's possible to take the slab
> memory accounting as a model to follow. Let's introduce two types of
> percpu chunks: root and memcg. What makes memcg chunks different is
> an additional space allocated to store memcg membership information.
> If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be
> used. If it's possible to charge the corresponding size to the target
> memory cgroup, the allocation is performed, and the memcg ownership
> data is recorded. System-wide allocations are performed using root
> chunks, so there is no additional memory overhead.
> 
> To implement a fast reparenting of percpu memory on memcg removal,
> we don't store mem_cgroup pointers directly: instead we use the
> obj_cgroup API, introduced for slab accounting.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu-internal.h |  57 ++++++++++++-
>  mm/percpu-km.c       |   5 +-
>  mm/percpu-stats.c    |  36 +++++----
>  mm/percpu-vm.c       |   5 +-
>  mm/percpu.c          | 186 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 248 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 0468ba500bd4..0cf36337eb47 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -5,6 +5,27 @@
>  #include <linux/types.h>
>  #include <linux/percpu.h>
>  
> +/*
> + * There are two chunk types: root and memcg-aware.
> + * Chunks of each type have separate slots list.
> + *
> + * Memcg-aware chunks have an attached vector of obj_cgroup
> + * pointers, which is used to store memcg membership data
> + * of a percpu object. Obj_cgroups are ref-counted pointers
> + * to a memory cgroup with an ability to switch dynamically
> + * to the parent memory cgroup. This allows to reclaim a deleted
> + * memory cgroup without reclaiming of all outstanding objects,
> + * which do hold a reference at it.
> + */

nit: do you mind reflowing this to 80 characters and doing 2 spaces
after each period to keep the formatting uniform.
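
For example, something like this (just showing the reflow, keep
whatever wording you prefer):

 * There are two chunk types: root and memcg-aware.  Chunks of each type
 * have separate slots list.
 *
 * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which
 * is used to store memcg membership data of a percpu object.  Obj_cgroups
 * are ref-counted pointers to a memory cgroup with an ability to switch
 * dynamically to the parent memory cgroup.  This allows to reclaim a
 * deleted memory cgroup without reclaiming of all outstanding objects,
 * which do hold a reference at it.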

> +enum pcpu_chunk_type {
> +	PCPU_CHUNK_ROOT,
> +#ifdef CONFIG_MEMCG_KMEM
> +	PCPU_CHUNK_MEMCG,
> +#endif
> +	PCPU_NR_CHUNK_TYPES,
> +	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
> +};
> +
>  /*
>   * pcpu_block_md is the metadata block struct.
>   * Each chunk's bitmap is split into a number of full blocks.
> @@ -54,6 +75,9 @@ struct pcpu_chunk {
>  	int			end_offset;	/* additional area required to
>  						   have the region end page
>  						   aligned */
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
> +#endif
>  
>  	int			nr_pages;	/* # of pages served by this chunk */
>  	int			nr_populated;	/* # of populated pages */
> @@ -63,7 +87,7 @@ struct pcpu_chunk {
>  
>  extern spinlock_t pcpu_lock;
>  
> -extern struct list_head *pcpu_slot;
> +extern struct list_head *pcpu_chunk_lists;
>  extern int pcpu_nr_slots;
>  extern int pcpu_nr_empty_pop_pages;
>  
> @@ -106,6 +130,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
>  	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	if (chunk->obj_cgroups)
> +		return PCPU_CHUNK_MEMCG;
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return chunk_type == PCPU_CHUNK_MEMCG;
> +}
> +
> +#else
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return false;
> +}
> +#endif
> +
> +static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
> +{
> +	return &pcpu_chunk_lists[pcpu_nr_slots *
> +				 pcpu_is_memcg_chunk(chunk_type)];
> +}
> +
>  #ifdef CONFIG_PERCPU_STATS
>  
>  #include <linux/spinlock.h>
> diff --git a/mm/percpu-km.c b/mm/percpu-km.c
> index 20d2b69a13b0..35c9941077ee 100644
> --- a/mm/percpu-km.c
> +++ b/mm/percpu-km.c
> @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	/* nada */
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
>  	struct pcpu_chunk *chunk;
> @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  	unsigned long flags;
>  	int i;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index 32558063c3f9..c8400a2adbc2 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
>  {
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
> +	enum pcpu_chunk_type type;
>  
>  	max_nr_alloc = 0;
> -	for (slot = 0; slot < pcpu_nr_slots; slot++)
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> -			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (slot = 0; slot < pcpu_nr_slots; slot++)
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list)
> +				max_nr_alloc = max(max_nr_alloc,
> +						   chunk->nr_alloc);
>  
>  	return max_nr_alloc;
>  }
> @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  	P("cur_min_alloc", cur_min_alloc);
>  	P("cur_med_alloc", cur_med_alloc);
>  	P("cur_max_alloc", cur_max_alloc);
> +#ifdef CONFIG_MEMCG_KMEM
> +	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
> +#endif
>  	seq_putc(m, '\n');
>  }
>  
> @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
>  	int *buffer;
> +	enum pcpu_chunk_type type;
>  
>  alloc_buffer:
>  	spin_lock_irq(&pcpu_lock);
> @@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
>  	}
>  
> -	for (slot = 0; slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> -			if (chunk == pcpu_first_chunk) {
> -				seq_puts(m, "Chunk: <- First Chunk\n");
> -				chunk_map_stats(m, chunk, buffer);
> -
> -
> -			} else {
> -				seq_puts(m, "Chunk:\n");
> -				chunk_map_stats(m, chunk, buffer);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
> +		for (slot = 0; slot < pcpu_nr_slots; slot++) {
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list) {
> +				if (chunk == pcpu_first_chunk) {
> +					seq_puts(m, "Chunk: <- First Chunk\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				} else {
> +					seq_puts(m, "Chunk:\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				}
>  			}
> -
>  		}
>  	}
>  
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index a2b395acef89..e46f7a6917f9 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	pcpu_free_pages(chunk, pages, page_start, page_end);
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	struct vm_struct **vms;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index aa36b78d45a6..85f5755c9114 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -37,9 +37,14 @@
>   * takes care of normal allocations.
>   *
>   * The allocator organizes chunks into lists according to free size and
> - * tries to allocate from the fullest chunk first.  Each chunk is managed
> - * by a bitmap with metadata blocks.  The allocation map is updated on
> - * every allocation and free to reflect the current state while the boundary
> + * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
> + * flag should be passed.  All memcg-aware allocations are sharing one set
> + * of chunks and all unaccounted allocations and allocations performed
> + * by processes belonging to the root memory cgroup are using the second set.
> + *
> + * The allocator tries to allocate from the fullest chunk first. Each chunk
> + * is managed by a bitmap with metadata blocks.  The allocation map is updated
> + * on every allocation and free to reflect the current state while the boundary
>   * map is only updated on allocation.  Each metadata block contains
>   * information to help mitigate the need to iterate over large portions
>   * of the bitmap.  The reverse mapping from page to chunk is stored in
> @@ -81,6 +86,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
> +#include <linux/memcontrol.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/sections.h>
> @@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
>  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
>  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
>  
> -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
>  
>  /* chunks which need their map areas extended, protected by pcpu_lock */
>  static LIST_HEAD(pcpu_map_extend_chunks);
> @@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
>  			      bool move_front)
>  {
>  	if (chunk != pcpu_reserved_chunk) {
> +		struct list_head *pcpu_slot;
> +
> +		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
>  		if (move_front)
>  			list_move(&chunk->list, &pcpu_slot[slot]);
>  		else
> @@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
>  		      alloc_size);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	/* first chunk isn't memcg-aware */
> +	chunk->obj_cgroups = NULL;
> +#endif
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* manage populated page bitmap */
> @@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  	return chunk;
>  }
>  
> -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	int region_bits;
> @@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  	if (!chunk->md_blocks)
>  		goto md_blocks_fail;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (pcpu_is_memcg_chunk(type)) {
> +		chunk->obj_cgroups =
> +			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
> +					sizeof(struct obj_cgroup *), gfp);
> +		if (!chunk->obj_cgroups)
> +			goto objcg_fail;
> +	}
> +#endif
> +
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* init metadata */
> @@ -1415,6 +1438,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  
>  	return chunk;
>  
> +objcg_fail:
> +	pcpu_mem_free(chunk->md_blocks);
>  md_blocks_fail:
>  	pcpu_mem_free(chunk->bound_map);
>  bound_map_fail:
> @@ -1429,6 +1454,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>  	if (!chunk)
>  		return;
> +#ifdef CONFIG_MEMCG_KMEM
> +	pcpu_mem_free(chunk->obj_cgroups);
> +#endif
>  	pcpu_mem_free(chunk->md_blocks);
>  	pcpu_mem_free(chunk->bound_map);
>  	pcpu_mem_free(chunk->alloc_map);
> @@ -1505,7 +1533,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
>  			       int page_start, int page_end, gfp_t gfp);
>  static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  				  int page_start, int page_end);
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp);
>  static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
>  static struct page *pcpu_addr_to_page(void *addr);
>  static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
> @@ -1547,6 +1576,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>  	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct obj_cgroup **objcgp)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
> +	    memcg_kmem_bypass())
> +		return PCPU_CHUNK_ROOT;
> +
> +	objcg = get_obj_cgroup_from_current();
> +	if (!objcg)
> +		return PCPU_CHUNK_ROOT;
> +
> +	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
> +		obj_cgroup_put(objcg);
> +		return PCPU_FAIL_ALLOC;
> +	}
> +
> +	*objcgp = objcg;
> +	return PCPU_CHUNK_MEMCG;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +	if (!objcg)
> +		return;
> +
> +	if (chunk) {
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> +	} else {
> +		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +		obj_cgroup_put(objcg);
> +	}
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
> +		return;
> +
> +	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> +	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> +
> +	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +
> +	obj_cgroup_put(objcg);
> +}
> +
> +#else /* CONFIG_MEMCG_KMEM */
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct mem_cgroup **memcgp)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
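
For context, a minimal sketch of a caller of the memcg-aware path (not part
of the patch; the ctr_* names are made up):

	#include <linux/percpu.h>
	#include <linux/gfp.h>

	static u64 __percpu *ctr;

	static int ctr_create(void)
	{
		/*
		 * __GFP_ACCOUNT routes the allocation to a PCPU_CHUNK_MEMCG
		 * chunk and charges size * num_possible_cpus() to the
		 * current memcg.
		 */
		ctr = alloc_percpu_gfp(u64, GFP_KERNEL | __GFP_ACCOUNT);
		if (!ctr)
			return -ENOMEM; /* covers a failed charge (PCPU_FAIL_ALLOC) too */
		return 0;
	}

	static void ctr_destroy(void)
	{
		/* pcpu_memcg_free_hook() uncharges the same amount */
		free_percpu(ctr);
	}

Allocations without __GFP_ACCOUNT keep going to the root chunks, so existing
callers see no change.
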
>  /**
>   * pcpu_alloc - the percpu allocator
>   * @size: size of area to allocate in bytes
> @@ -1568,6 +1668,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	gfp_t pcpu_gfp;
>  	bool is_atomic;
>  	bool do_warn;
> +	enum pcpu_chunk_type type;
> +	struct list_head *pcpu_slot;
> +	struct obj_cgroup *objcg = NULL;
>  	static int warn_limit = 10;
>  	struct pcpu_chunk *chunk, *next;
>  	const char *err;
> @@ -1602,16 +1705,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		return NULL;
>  	}
>  
> +	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
> +	if (unlikely(type == PCPU_FAIL_ALLOC))
> +		return NULL;
> +	pcpu_slot = pcpu_chunk_list(type);
> +
>  	if (!is_atomic) {
>  		/*
>  		 * pcpu_balance_workfn() allocates memory under this mutex,
>  		 * and it may wait for memory reclaim. Allow current task
>  		 * to become OOM victim, in case of memory pressure.
>  		 */
> -		if (gfp & __GFP_NOFAIL)
> +		if (gfp & __GFP_NOFAIL) {
>  			mutex_lock(&pcpu_alloc_mutex);
> -		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> +		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
> +			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
>  			return NULL;
> +		}
>  	}
>  
>  	spin_lock_irqsave(&pcpu_lock, flags);
> @@ -1637,7 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  restart:
>  	/* search through normal chunks */
>  	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
> +		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
> +					 list) {

nit: this line change doesn't do anything. Can you please remove it.

>  			off = pcpu_find_block_fit(chunk, bits, bit_align,
>  						  is_atomic);
>  			if (off < 0) {
> @@ -1666,7 +1777,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	}
>  
>  	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
> -		chunk = pcpu_create_chunk(pcpu_gfp);
> +		chunk = pcpu_create_chunk(type, pcpu_gfp);
>  		if (!chunk) {
>  			err = "failed to allocate new chunk";
>  			goto fail;
> @@ -1723,6 +1834,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
>  			chunk->base_addr, off, ptr);
>  
> +	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
> +
>  	return ptr;
>  
>  fail_unlock:
> @@ -1744,6 +1857,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	} else {
>  		mutex_unlock(&pcpu_alloc_mutex);
>  	}
> +
> +	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
> +
>  	return NULL;
>  }
>  
> @@ -1803,8 +1919,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  }
>  
>  /**
> - * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> - * @work: unused
> + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @type: chunk type
>   *
>   * Reclaim all fully free chunks except for the first one.  This is also
>   * responsible for maintaining the pool of empty populated pages.  However,
> @@ -1813,11 +1929,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>   * allocation causes the failure as it is possible that requests can be
>   * serviced from already backed regions.
>   */
> -static void pcpu_balance_workfn(struct work_struct *work)
> +static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
>  {
>  	/* gfp flags passed to underlying allocators */
>  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	LIST_HEAD(to_free);
> +	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
>  	struct pcpu_chunk *chunk, *next;
>  	int slot, nr_to_pop, ret;
> @@ -1915,7 +2032,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  
>  	if (nr_to_pop) {
>  		/* ran out of chunks to populate, create a new one and retry */
> -		chunk = pcpu_create_chunk(gfp);
> +		chunk = pcpu_create_chunk(type, gfp);
>  		if (chunk) {
>  			spin_lock_irq(&pcpu_lock);
>  			pcpu_chunk_relocate(chunk, -1);
> @@ -1927,6 +2044,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	mutex_unlock(&pcpu_alloc_mutex);
>  }
>  
> +/**
> + * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @work: unused
> + *
> + * Call __pcpu_balance_workfn() for each chunk type.
> + */
> +static void pcpu_balance_workfn(struct work_struct *work)
> +{
> +	enum pcpu_chunk_type type;
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		__pcpu_balance_workfn(type);
> +}
> +
>  /**
>   * free_percpu - free percpu area
>   * @ptr: pointer to area to free
> @@ -1941,8 +2072,9 @@ void free_percpu(void __percpu *ptr)
>  	void *addr;
>  	struct pcpu_chunk *chunk;
>  	unsigned long flags;
> -	int off;
> +	int size, off;
>  	bool need_balance = false;
> +	struct list_head *pcpu_slot;
>  
>  	if (!ptr)
>  		return;
> @@ -1956,7 +2088,11 @@ void free_percpu(void __percpu *ptr)
>  	chunk = pcpu_chunk_addr_search(addr);
>  	off = addr - chunk->base_addr;
>  
> -	pcpu_free_area(chunk, off);
> +	size = pcpu_free_area(chunk, off);
> +
> +	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
> +
> +	pcpu_memcg_free_hook(chunk, off, size);
>  
>  	/* if there are more than one fully free chunks, wake up grim reaper */
>  	if (chunk->free_bytes == pcpu_unit_size) {
> @@ -2267,6 +2403,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	int map_size;
>  	unsigned long tmp_addr;
>  	size_t alloc_size;
> +	enum pcpu_chunk_type type;
>  
>  #define PCPU_SETUP_BUG_ON(cond)	do {					\
>  	if (unlikely(cond)) {						\
> @@ -2384,13 +2521,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	 * empty chunks.
>  	 */
>  	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> -	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
> -				   SMP_CACHE_BYTES);
> -	if (!pcpu_slot)
> +	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
> +					  sizeof(pcpu_chunk_lists[0]) *
> +					  PCPU_NR_CHUNK_TYPES,
> +					  SMP_CACHE_BYTES);
> +	if (!pcpu_chunk_lists)
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
> -		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
> -	for (i = 0; i < pcpu_nr_slots; i++)
> -		INIT_LIST_HEAD(&pcpu_slot[i]);
> +		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
> +		      PCPU_NR_CHUNK_TYPES);
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (i = 0; i < pcpu_nr_slots; i++)
> +			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
>  
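
As a note on the resulting layout (purely descriptive, nothing to change):
pcpu_chunk_lists[] holds PCPU_NR_CHUNK_TYPES consecutive groups of
pcpu_nr_slots list heads, and pcpu_chunk_list(type) just returns the base of
the right group:

	/*
	 * With CONFIG_MEMCG_KMEM and pcpu_nr_slots == n:
	 *   pcpu_chunk_lists[0] .. [n - 1]    root chunk slots
	 *   pcpu_chunk_lists[n] .. [2n - 1]   memcg-aware chunk slots
	 */

so the per-slot logic stays untouched and only the slot array base differs
per chunk type.
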
>  	/*
>  	 * The end of the static region needs to be aligned with the
> -- 
> 2.25.4
> 

There were just 2 minor nits. Do you mind resending with them fixed, as
I'm not sure whether I'll be carrying these patches or not?

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis

* Re: [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-05-28 23:25 ` [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-06-05 19:53   ` Dennis Zhou
  0 siblings, 0 replies; 13+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:06PM -0700, Roman Gushchin wrote:
> Percpu memory can represent a noticeable chunk of the total
> memory consumption, especially on big machines with many CPUs.
> Let's track percpu memory usage for each memcg and display
> it in memory.stat.
> 
> A percpu allocation is usually scattered over multiple pages
> (and nodes), and can be significantly smaller than a page.
> So let's add a byte-sized counter on the memcg level:
> MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
> can be perfectly reused for the percpu case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  include/linux/memcontrol.h              |  8 ++++++++
>  mm/memcontrol.c                         |  4 +++-
>  mm/percpu.c                             | 10 ++++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index fed4e1d2a343..aa8cb6dadadc 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1276,6 +1276,10 @@ PAGE_SIZE multiple when read back.
>  		Amount of memory used for storing in-kernel data
>  		structures.
>  
> +	  percpu
> +		Amount of memory used for storing per-cpu kernel
> +		data structures.
> +
>  	  sock
>  		Amount of memory used in network transmission buffers
>  
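
To make the new entry concrete: memory.stat grows a byte-valued "percpu"
line, e.g. (the number below is made up):

	percpu 7864320

Each accounted allocation adds size * num_possible_cpus() bytes to the
counter in the post-alloc hook and the free hook subtracts the same amount,
so the value reflects the outstanding accounted percpu memory of the cgroup
and its descendants.
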
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7a84d9164449..f62a95d472f7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -32,11 +32,19 @@ struct kmem_cache;
>  enum memcg_stat_item {
>  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
>  	MEMCG_SOCK,
> +	MEMCG_PERCPU_B,
>  	/* XXX: why are these zone and not node counters? */
>  	MEMCG_KERNEL_STACK_KB,
>  	MEMCG_NR_STAT,
>  };
>  
> +static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> +{
> +	if (item == MEMCG_PERCPU_B)
> +		return true;
> +	return vmstat_item_in_bytes(item);
> +}
> +
>  enum memcg_memory_event {
>  	MEMCG_LOW,
>  	MEMCG_HIGH,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7bc3fd196210..5007d1585a4a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  	if (mem_cgroup_disabled())
>  		return;
>  
> -	if (vmstat_item_in_bytes(idx))
> +	if (memcg_stat_item_in_bytes(idx))
>  		threshold <<= PAGE_SHIFT;
>  
>  	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
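
Worth noting: threshold starts at MEMCG_CHARGE_BATCH (32), so for byte-sized
items such as MEMCG_PERCPU_B it becomes 32 pages' worth of bytes (128 KiB
with 4 KiB pages) before the per-cpu delta is flushed.  This is the same
batching that bounds the tolerated current-vs-percpu difference in the
selftest of patch 5.
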
> @@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "slab %llu\n",
>  		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
>  			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
> +	seq_buf_printf(&s, "percpu %llu\n",
> +		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
>  	seq_buf_printf(&s, "sock %llu\n",
>  		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
>  		       PAGE_SIZE);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 85f5755c9114..b4b3e9c8a6d1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1608,6 +1608,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
>  
>  	if (chunk) {
>  		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> +
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
> +				size * num_possible_cpus());
> +		rcu_read_unlock();
>  	} else {
>  		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
>  		obj_cgroup_put(objcg);
> @@ -1626,6 +1631,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
>  
>  	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
>  
> +	rcu_read_lock();
> +	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +	rcu_read_unlock();
> +
>  	obj_cgroup_put(objcg);
>  }
>  
> -- 
> 2.25.4
> 

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis

* Re: [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-05-28 23:25 ` [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
@ 2020-06-05 19:54   ` Dennis Zhou
  0 siblings, 0 replies; 13+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:07PM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store
> vmstat data. Yet this memory is not accounted at all, so in the
> case when there are many (dying) cgroups, it's not exactly clear
> where all the memory is.
> 
> Because the size of memory cgroup internal structures can
> dramatically exceed the size of the object or page which is pinning
> it in memory, it's not a good idea to simply ignore it.
> It actually breaks the isolation between cgroups.
> 
> Let's account the consumed percpu memory to the parent cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5007d1585a4a..0dd0d05a011c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5020,13 +5020,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
>  
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
>  
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +					       GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5100,11 +5102,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>  
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
>  
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
>  
> @@ -5153,7 +5157,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
>  
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();
>  	if (IS_ERR(memcg))
>  		return ERR_CAST(memcg);
>  
> -- 
> 2.25.4
> 

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis

* Re: [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test
  2020-05-28 23:25 ` [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
@ 2020-06-05 20:07   ` Dennis Zhou
  2020-06-05 22:47     ` Roman Gushchin
  0 siblings, 1 reply; 13+ messages in thread
From: Dennis Zhou @ 2020-06-05 20:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:08PM -0700, Roman Gushchin wrote:
> Add a simple test to check the percpu memory accounting.
> The test creates a cgroup tree with 1000 child cgroups
> and checks values of memory.current and memory.stat::percpu.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> index 5224dae216e5..a0d4f1a3137d 100644
> --- a/tools/testing/selftests/cgroup/test_kmem.c
> +++ b/tools/testing/selftests/cgroup/test_kmem.c
> @@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
>  	return ret;
>  }
>  
> +/*
> + * This test creates a sub-tree with 1000 memory cgroups.
> + * Then it checks that the memory.current on the parent level
> > + * is greater than 0 and approximately matches the percpu value
> + * from memory.stat.
> + */
> +static int test_percpu_basic(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *parent, *child;
> +	long current, percpu;
> +	int i;
> +
> +	parent = cg_name(root, "percpu_basic_test");
> +	if (!parent)
> +		goto cleanup;
> +
> +	if (cg_create(parent))
> +		goto cleanup;
> +
> +	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
> +		goto cleanup;
> +
> +	for (i = 0; i < 1000; i++) {
> +		child = cg_name_indexed(parent, "child", i);
> +		if (!child)
> +			return -1;
> +
> +		if (cg_create(child))
> +			goto cleanup_children;
> +
> +		free(child);
> +	}
> +
> +	current = cg_read_long(parent, "memory.current");
> +	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
> +
> +	if (current > 0 && percpu > 0 && abs(current - percpu) <
> +	    4096 * 32 * get_nprocs())

So this is checking that, across the 1000 child cgroups, we've allocated
less than 32 pages per CPU of memory that isn't percpu memory? Is there a
more definitive measurement, or at least a comment we can leave explaining
why this limit was chosen?

> +		ret = KSFT_PASS;
> +	else
> +		printf("memory.current %ld\npercpu %ld\n",
> +		       current, percpu);
> +
> +cleanup_children:
> +	for (i = 0; i < 1000; i++) {
> +		child = cg_name_indexed(parent, "child", i);
> +		cg_destroy(child);
> +		free(child);
> +	}
> +
> +cleanup:
> +	cg_destroy(parent);
> +	free(parent);
> +
> +	return ret;
> +}
> +
>  #define T(x) { x, #x }
>  struct kmem_test {
>  	int (*fn)(const char *root);
> @@ -341,6 +399,7 @@ struct kmem_test {
>  	T(test_kmem_proc_kpagecgroup),
>  	T(test_kmem_kernel_stacks),
>  	T(test_kmem_dead_cgroups),
> +	T(test_percpu_basic),
>  };
>  #undef T
>  
> -- 
> 2.25.4
> 
> 

* Re: [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-06-05 19:49   ` Dennis Zhou
@ 2020-06-05 22:44     ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2020-06-05 22:44 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Fri, Jun 05, 2020 at 07:49:53PM +0000, Dennis Zhou wrote:
> On Thu, May 28, 2020 at 04:25:05PM -0700, Roman Gushchin wrote:
> > Percpu memory is becoming more and more widely used by various
> > subsystems, and the total amount of memory controlled by the percpu
> > allocator can make a good part of the total memory.
> > 
> > As an example, bpf maps can consume a lot of percpu memory,
> > and they are created by a user. Also, some cgroup internals
> > (e.g. memory controller statistics) can be quite large.
> > On a machine with many CPUs and big number of cgroups they
> > can consume hundreds of megabytes.
> > 
> > So the lack of memcg accounting is creating a breach in the memory
> > isolation. Similar to the slab memory, percpu memory should be
> > accounted by default.
> > 
> > To implement the percpu accounting it's possible to take the slab
> > memory accounting as a model to follow. Let's introduce two types of
> > percpu chunks: root and memcg. What makes memcg chunks different is
> > an additional space allocated to store memcg membership information.
> > If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be
> > used. If it's possible to charge the corresponding size to the target
> > memory cgroup, allocation is performed, and the memcg ownership data
> > is recorded. System-wide allocations are performed using root chunks,
> > so there is no additional memory overhead.
> > 
> > To implement a fast reparenting of percpu memory on memcg removal,
> > we don't store mem_cgroup pointers directly: instead we use the obj_cgroup
> > API, introduced for slab accounting.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/percpu-internal.h |  57 ++++++++++++-
> >  mm/percpu-km.c       |   5 +-
> >  mm/percpu-stats.c    |  36 +++++----
> >  mm/percpu-vm.c       |   5 +-
> >  mm/percpu.c          | 186 ++++++++++++++++++++++++++++++++++++++-----
> >  5 files changed, 248 insertions(+), 41 deletions(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index 0468ba500bd4..0cf36337eb47 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -5,6 +5,27 @@
> >  #include <linux/types.h>
> >  #include <linux/percpu.h>
> >  
> > +/*
> > + * There are two chunk types: root and memcg-aware.
> > + * Chunks of each type have separate slots list.
> > + *
> > + * Memcg-aware chunks have an attached vector of obj_cgroup
> > + * pointers, which is used to store memcg membership data
> > + * of a percpu object. Obj_cgroups are ref-counted pointers
> > + * to a memory cgroup with an ability to switch dynamically
> > > + * to the parent memory cgroup. This allows reclaiming a deleted
> > > + * memory cgroup without reclaiming all outstanding objects,
> > > + * which hold a reference to it.
> > + */
> 
> nit: do you mind reflowing this to 80 characters and doing 2 spaces
> after each period to keep the formatting uniform.
> 
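> E.g. something like this (same text, just reflowed):
> 
> /*
>  * There are two chunk types: root and memcg-aware.  Chunks of each type
>  * have separate slot lists.
>  *
>  * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which
>  * is used to store memcg membership data of a percpu object.  Obj_cgroups
>  * are ref-counted pointers to a memory cgroup with an ability to switch
>  * dynamically to the parent memory cgroup.  This allows reclaiming a
>  * deleted memory cgroup without reclaiming all outstanding objects, which
>  * hold a reference to it.
>  */
> 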
> > +enum pcpu_chunk_type {
> > +	PCPU_CHUNK_ROOT,
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	PCPU_CHUNK_MEMCG,
> > +#endif
> > +	PCPU_NR_CHUNK_TYPES,
> > +	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
> > +};
> > +
> >  /*
> >   * pcpu_block_md is the metadata block struct.
> >   * Each chunk's bitmap is split into a number of full blocks.
> > @@ -54,6 +75,9 @@ struct pcpu_chunk {
> >  	int			end_offset;	/* additional area required to
> >  						   have the region end page
> >  						   aligned */
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
> > +#endif
> >  
> >  	int			nr_pages;	/* # of pages served by this chunk */
> >  	int			nr_populated;	/* # of populated pages */
> > @@ -63,7 +87,7 @@ struct pcpu_chunk {
> >  
> >  extern spinlock_t pcpu_lock;
> >  
> > -extern struct list_head *pcpu_slot;
> > +extern struct list_head *pcpu_chunk_lists;
> >  extern int pcpu_nr_slots;
> >  extern int pcpu_nr_empty_pop_pages;
> >  
> > @@ -106,6 +130,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
> >  	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
> >  }
> >  
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> > +{
> > +	if (chunk->obj_cgroups)
> > +		return PCPU_CHUNK_MEMCG;
> > +	return PCPU_CHUNK_ROOT;
> > +}
> > +
> > +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> > +{
> > +	return chunk_type == PCPU_CHUNK_MEMCG;
> > +}
> > +
> > +#else
> > +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> > +{
> > +	return PCPU_CHUNK_ROOT;
> > +}
> > +
> > +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> > +static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
> > +{
> > +	return &pcpu_chunk_lists[pcpu_nr_slots *
> > +				 pcpu_is_memcg_chunk(chunk_type)];
> > +}
> > +
> >  #ifdef CONFIG_PERCPU_STATS
> >  
> >  #include <linux/spinlock.h>
> > diff --git a/mm/percpu-km.c b/mm/percpu-km.c
> > index 20d2b69a13b0..35c9941077ee 100644
> > --- a/mm/percpu-km.c
> > +++ b/mm/percpu-km.c
> > @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
> >  	/* nada */
> >  }
> >  
> > -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> > +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> > +					    gfp_t gfp)
> >  {
> >  	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
> >  	struct pcpu_chunk *chunk;
> > @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> >  	unsigned long flags;
> >  	int i;
> >  
> > -	chunk = pcpu_alloc_chunk(gfp);
> > +	chunk = pcpu_alloc_chunk(type, gfp);
> >  	if (!chunk)
> >  		return NULL;
> >  
> > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> > index 32558063c3f9..c8400a2adbc2 100644
> > --- a/mm/percpu-stats.c
> > +++ b/mm/percpu-stats.c
> > @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
> >  {
> >  	struct pcpu_chunk *chunk;
> >  	int slot, max_nr_alloc;
> > +	enum pcpu_chunk_type type;
> >  
> >  	max_nr_alloc = 0;
> > -	for (slot = 0; slot < pcpu_nr_slots; slot++)
> > -		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> > -			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
> > +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> > +		for (slot = 0; slot < pcpu_nr_slots; slot++)
> > +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> > +					    list)
> > +				max_nr_alloc = max(max_nr_alloc,
> > +						   chunk->nr_alloc);
> >  
> >  	return max_nr_alloc;
> >  }
> > @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> >  	P("cur_min_alloc", cur_min_alloc);
> >  	P("cur_med_alloc", cur_med_alloc);
> >  	P("cur_max_alloc", cur_max_alloc);
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
> > +#endif
> >  	seq_putc(m, '\n');
> >  }
> >  
> > @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
> >  	struct pcpu_chunk *chunk;
> >  	int slot, max_nr_alloc;
> >  	int *buffer;
> > +	enum pcpu_chunk_type type;
> >  
> >  alloc_buffer:
> >  	spin_lock_irq(&pcpu_lock);
> > @@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
> >  		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
> >  	}
> >  
> > -	for (slot = 0; slot < pcpu_nr_slots; slot++) {
> > -		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> > -			if (chunk == pcpu_first_chunk) {
> > -				seq_puts(m, "Chunk: <- First Chunk\n");
> > -				chunk_map_stats(m, chunk, buffer);
> > -
> > -
> > -			} else {
> > -				seq_puts(m, "Chunk:\n");
> > -				chunk_map_stats(m, chunk, buffer);
> > +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
> > +		for (slot = 0; slot < pcpu_nr_slots; slot++) {
> > +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> > +					    list) {
> > +				if (chunk == pcpu_first_chunk) {
> > +					seq_puts(m, "Chunk: <- First Chunk\n");
> > +					chunk_map_stats(m, chunk, buffer);
> > +				} else {
> > +					seq_puts(m, "Chunk:\n");
> > +					chunk_map_stats(m, chunk, buffer);
> > +				}
> >  			}
> > -
> >  		}
> >  	}
> >  
> > diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> > index a2b395acef89..e46f7a6917f9 100644
> > --- a/mm/percpu-vm.c
> > +++ b/mm/percpu-vm.c
> > @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
> >  	pcpu_free_pages(chunk, pages, page_start, page_end);
> >  }
> >  
> > -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> > +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> > +					    gfp_t gfp)
> >  {
> >  	struct pcpu_chunk *chunk;
> >  	struct vm_struct **vms;
> >  
> > -	chunk = pcpu_alloc_chunk(gfp);
> > +	chunk = pcpu_alloc_chunk(type, gfp);
> >  	if (!chunk)
> >  		return NULL;
> >  
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index aa36b78d45a6..85f5755c9114 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -37,9 +37,14 @@
> >   * takes care of normal allocations.
> >   *
> >   * The allocator organizes chunks into lists according to free size and
> > - * tries to allocate from the fullest chunk first.  Each chunk is managed
> > - * by a bitmap with metadata blocks.  The allocation map is updated on
> > - * every allocation and free to reflect the current state while the boundary
> > + * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
> > + * flag should be passed.  All memcg-aware allocations are sharing one set
> > + * of chunks and all unaccounted allocations and allocations performed
> > + * by processes belonging to the root memory cgroup are using the second set.
> > + *
> > + * The allocator tries to allocate from the fullest chunk first. Each chunk
> > + * is managed by a bitmap with metadata blocks.  The allocation map is updated
> > + * on every allocation and free to reflect the current state while the boundary
> >   * map is only updated on allocation.  Each metadata block contains
> >   * information to help mitigate the need to iterate over large portions
> >   * of the bitmap.  The reverse mapping from page to chunk is stored in
> > @@ -81,6 +86,7 @@
> >  #include <linux/kmemleak.h>
> >  #include <linux/sched.h>
> >  #include <linux/sched/mm.h>
> > +#include <linux/memcontrol.h>
> >  
> >  #include <asm/cacheflush.h>
> >  #include <asm/sections.h>
> > @@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
> >  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
> >  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
> >  
> > -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> > +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
> >  
> >  /* chunks which need their map areas extended, protected by pcpu_lock */
> >  static LIST_HEAD(pcpu_map_extend_chunks);
> > @@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
> >  			      bool move_front)
> >  {
> >  	if (chunk != pcpu_reserved_chunk) {
> > +		struct list_head *pcpu_slot;
> > +
> > +		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
> >  		if (move_front)
> >  			list_move(&chunk->list, &pcpu_slot[slot]);
> >  		else
> > @@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
> >  		panic("%s: Failed to allocate %zu bytes\n", __func__,
> >  		      alloc_size);
> >  
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	/* first chunk isn't memcg-aware */
> > +	chunk->obj_cgroups = NULL;
> > +#endif
> >  	pcpu_init_md_blocks(chunk);
> >  
> >  	/* manage populated page bitmap */
> > @@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
> >  	return chunk;
> >  }
> >  
> > -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> > +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
> >  {
> >  	struct pcpu_chunk *chunk;
> >  	int region_bits;
> > @@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> >  	if (!chunk->md_blocks)
> >  		goto md_blocks_fail;
> >  
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (pcpu_is_memcg_chunk(type)) {
> > +		chunk->obj_cgroups =
> > +			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
> > +					sizeof(struct obj_cgroup *), gfp);
> > +		if (!chunk->obj_cgroups)
> > +			goto objcg_fail;
> > +	}
> > +#endif
> > +
> >  	pcpu_init_md_blocks(chunk);
> >  
> >  	/* init metadata */
> > @@ -1415,6 +1438,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> >  
> >  	return chunk;
> >  
> > +objcg_fail:
> > +	pcpu_mem_free(chunk->md_blocks);
> >  md_blocks_fail:
> >  	pcpu_mem_free(chunk->bound_map);
> >  bound_map_fail:
> > @@ -1429,6 +1454,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
> >  {
> >  	if (!chunk)
> >  		return;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	pcpu_mem_free(chunk->obj_cgroups);
> > +#endif
> >  	pcpu_mem_free(chunk->md_blocks);
> >  	pcpu_mem_free(chunk->bound_map);
> >  	pcpu_mem_free(chunk->alloc_map);
> > @@ -1505,7 +1533,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
> >  			       int page_start, int page_end, gfp_t gfp);
> >  static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
> >  				  int page_start, int page_end);
> > -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
> > +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> > +					    gfp_t gfp);
> >  static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
> >  static struct page *pcpu_addr_to_page(void *addr);
> >  static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
> > @@ -1547,6 +1576,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
> >  	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
> >  }
> >  
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> > +						     struct obj_cgroup **objcgp)
> > +{
> > +	struct obj_cgroup *objcg;
> > +
> > +	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
> > +	    memcg_kmem_bypass())
> > +		return PCPU_CHUNK_ROOT;
> > +
> > +	objcg = get_obj_cgroup_from_current();
> > +	if (!objcg)
> > +		return PCPU_CHUNK_ROOT;
> > +
> > +	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
> > +		obj_cgroup_put(objcg);
> > +		return PCPU_FAIL_ALLOC;
> > +	}
> > +
> > +	*objcgp = objcg;
> > +	return PCPU_CHUNK_MEMCG;
> > +}
> > +
> > +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
> > +				       struct pcpu_chunk *chunk, int off,
> > +				       size_t size)
> > +{
> > +	if (!objcg)
> > +		return;
> > +
> > +	if (chunk) {
> > +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> > +	} else {
> > +		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> > +		obj_cgroup_put(objcg);
> > +	}
> > +}
> > +
> > +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> > +{
> > +	struct obj_cgroup *objcg;
> > +
> > +	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
> > +		return;
> > +
> > +	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> > +	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> > +
> > +	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> > +
> > +	obj_cgroup_put(objcg);
> > +}
> > +
> > +#else /* CONFIG_MEMCG_KMEM */
> > +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> > +						     struct mem_cgroup **memcgp)
> > +{
> > +	return PCPU_CHUNK_ROOT;
> > +}
> > +
> > +static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
> > +				       struct pcpu_chunk *chunk, int off,
> > +				       size_t size)
> > +{
> > +}
> > +
> > +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> > +{
> > +}
> > +#endif /* CONFIG_MEMCG_KMEM */
> > +
> >  /**
> >   * pcpu_alloc - the percpu allocator
> >   * @size: size of area to allocate in bytes
> > @@ -1568,6 +1668,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  	gfp_t pcpu_gfp;
> >  	bool is_atomic;
> >  	bool do_warn;
> > +	enum pcpu_chunk_type type;
> > +	struct list_head *pcpu_slot;
> > +	struct obj_cgroup *objcg = NULL;
> >  	static int warn_limit = 10;
> >  	struct pcpu_chunk *chunk, *next;
> >  	const char *err;
> > @@ -1602,16 +1705,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  		return NULL;
> >  	}
> >  
> > +	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
> > +	if (unlikely(type == PCPU_FAIL_ALLOC))
> > +		return NULL;
> > +	pcpu_slot = pcpu_chunk_list(type);
> > +
> >  	if (!is_atomic) {
> >  		/*
> >  		 * pcpu_balance_workfn() allocates memory under this mutex,
> >  		 * and it may wait for memory reclaim. Allow current task
> >  		 * to become OOM victim, in case of memory pressure.
> >  		 */
> > -		if (gfp & __GFP_NOFAIL)
> > +		if (gfp & __GFP_NOFAIL) {
> >  			mutex_lock(&pcpu_alloc_mutex);
> > -		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> > +		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
> > +			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
> >  			return NULL;
> > +		}
> >  	}
> >  
> >  	spin_lock_irqsave(&pcpu_lock, flags);
> > @@ -1637,7 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  restart:
> >  	/* search through normal chunks */
> >  	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
> > -		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
> > +		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
> > +					 list) {
> 
> nit: this line change doesn't do anything. Can you please remove it.
> 
> >  			off = pcpu_find_block_fit(chunk, bits, bit_align,
> >  						  is_atomic);
> >  			if (off < 0) {
> > @@ -1666,7 +1777,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  	}
> >  
> >  	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
> > -		chunk = pcpu_create_chunk(pcpu_gfp);
> > +		chunk = pcpu_create_chunk(type, pcpu_gfp);
> >  		if (!chunk) {
> >  			err = "failed to allocate new chunk";
> >  			goto fail;
> > @@ -1723,6 +1834,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
> >  			chunk->base_addr, off, ptr);
> >  
> > +	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
> > +
> >  	return ptr;
> >  
> >  fail_unlock:
> > @@ -1744,6 +1857,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> >  	} else {
> >  		mutex_unlock(&pcpu_alloc_mutex);
> >  	}
> > +
> > +	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
> > +
> >  	return NULL;
> >  }
> >  
> > @@ -1803,8 +1919,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
> >  }
> >  
> >  /**
> > - * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > - * @work: unused
> > + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > + * @type: chunk type
> >   *
> >   * Reclaim all fully free chunks except for the first one.  This is also
> >   * responsible for maintaining the pool of empty populated pages.  However,
> > @@ -1813,11 +1929,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
> >   * allocation causes the failure as it is possible that requests can be
> >   * serviced from already backed regions.
> >   */
> > -static void pcpu_balance_workfn(struct work_struct *work)
> > +static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> >  {
> >  	/* gfp flags passed to underlying allocators */
> >  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> >  	LIST_HEAD(to_free);
> > +	struct list_head *pcpu_slot = pcpu_chunk_list(type);
> >  	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
> >  	struct pcpu_chunk *chunk, *next;
> >  	int slot, nr_to_pop, ret;
> > @@ -1915,7 +2032,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
> >  
> >  	if (nr_to_pop) {
> >  		/* ran out of chunks to populate, create a new one and retry */
> > -		chunk = pcpu_create_chunk(gfp);
> > +		chunk = pcpu_create_chunk(type, gfp);
> >  		if (chunk) {
> >  			spin_lock_irq(&pcpu_lock);
> >  			pcpu_chunk_relocate(chunk, -1);
> > @@ -1927,6 +2044,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
> >  	mutex_unlock(&pcpu_alloc_mutex);
> >  }
> >  
> > +/**
> > + * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > + * @work: unused
> > + *
> > + * Call __pcpu_balance_workfn() for each chunk type.
> > + */
> > +static void pcpu_balance_workfn(struct work_struct *work)
> > +{
> > +	enum pcpu_chunk_type type;
> > +
> > +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> > +		__pcpu_balance_workfn(type);
> > +}
> > +
> >  /**
> >   * free_percpu - free percpu area
> >   * @ptr: pointer to area to free
> > @@ -1941,8 +2072,9 @@ void free_percpu(void __percpu *ptr)
> >  	void *addr;
> >  	struct pcpu_chunk *chunk;
> >  	unsigned long flags;
> > -	int off;
> > +	int size, off;
> >  	bool need_balance = false;
> > +	struct list_head *pcpu_slot;
> >  
> >  	if (!ptr)
> >  		return;
> > @@ -1956,7 +2088,11 @@ void free_percpu(void __percpu *ptr)
> >  	chunk = pcpu_chunk_addr_search(addr);
> >  	off = addr - chunk->base_addr;
> >  
> > -	pcpu_free_area(chunk, off);
> > +	size = pcpu_free_area(chunk, off);
> > +
> > +	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
> > +
> > +	pcpu_memcg_free_hook(chunk, off, size);
> >  
> >  	/* if there are more than one fully free chunks, wake up grim reaper */
> >  	if (chunk->free_bytes == pcpu_unit_size) {
> > @@ -2267,6 +2403,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
> >  	int map_size;
> >  	unsigned long tmp_addr;
> >  	size_t alloc_size;
> > +	enum pcpu_chunk_type type;
> >  
> >  #define PCPU_SETUP_BUG_ON(cond)	do {					\
> >  	if (unlikely(cond)) {						\
> > @@ -2384,13 +2521,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
> >  	 * empty chunks.
> >  	 */
> >  	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> > -	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
> > -				   SMP_CACHE_BYTES);
> > -	if (!pcpu_slot)
> > +	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
> > +					  sizeof(pcpu_chunk_lists[0]) *
> > +					  PCPU_NR_CHUNK_TYPES,
> > +					  SMP_CACHE_BYTES);
> > +	if (!pcpu_chunk_lists)
> >  		panic("%s: Failed to allocate %zu bytes\n", __func__,
> > -		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
> > -	for (i = 0; i < pcpu_nr_slots; i++)
> > -		INIT_LIST_HEAD(&pcpu_slot[i]);
> > +		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
> > +		      PCPU_NR_CHUNK_TYPES);
> > +
> > +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> > +		for (i = 0; i < pcpu_nr_slots; i++)
> > +			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
> >  
> >  	/*
> >  	 * The end of the static region needs to be aligned with the
> > -- 
> > 2.25.4
> > 
> 
> There were just 2 minor nits. Do you mind resending with them fixed, as
> I'm not sure whether I'll be carrying these patches or not?

Sure, will send v2 based on the slab controller v6 early next week.

> 
> Acked-by: Dennis Zhou <dennis@kernel.org>

Thank you!

* Re: [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test
  2020-06-05 20:07   ` Dennis Zhou
@ 2020-06-05 22:47     ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2020-06-05 22:47 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Fri, Jun 05, 2020 at 08:07:51PM +0000, Dennis Zhou wrote:
> On Thu, May 28, 2020 at 04:25:08PM -0700, Roman Gushchin wrote:
> > Add a simple test to check the percpu memory accounting.
> > The test creates a cgroup tree with 1000 child cgroups
> > and checks values of memory.current and memory.stat::percpu.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
> >  1 file changed, 59 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> > index 5224dae216e5..a0d4f1a3137d 100644
> > --- a/tools/testing/selftests/cgroup/test_kmem.c
> > +++ b/tools/testing/selftests/cgroup/test_kmem.c
> > @@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * This test creates a sub-tree with 1000 memory cgroups.
> > + * Then it checks that the memory.current on the parent level
> > > + * is greater than 0 and approximately matches the percpu value
> > + * from memory.stat.
> > + */
> > +static int test_percpu_basic(const char *root)
> > +{
> > +	int ret = KSFT_FAIL;
> > +	char *parent, *child;
> > +	long current, percpu;
> > +	int i;
> > +
> > +	parent = cg_name(root, "percpu_basic_test");
> > +	if (!parent)
> > +		goto cleanup;
> > +
> > +	if (cg_create(parent))
> > +		goto cleanup;
> > +
> > +	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
> > +		goto cleanup;
> > +
> > +	for (i = 0; i < 1000; i++) {
> > +		child = cg_name_indexed(parent, "child", i);
> > +		if (!child)
> > +			return -1;
> > +
> > +		if (cg_create(child))
> > +			goto cleanup_children;
> > +
> > +		free(child);
> > +	}
> > +
> > +	current = cg_read_long(parent, "memory.current");
> > +	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
> > +
> > +	if (current > 0 && percpu > 0 && abs(current - percpu) <
> > +	    4096 * 32 * get_nprocs())
> 
> So this is checking that, across the 1000 child cgroups, we've allocated
> less than 32 pages per CPU of memory that isn't percpu memory? Is there a
> more definitive measurement, or at least a comment we can leave explaining
> why this limit was chosen?

It simply means that "current" should be approximately equal to the "percpu"
statistics. Both the charging and vmstat paths use percpu batching, and the
batch size is 32 pages.
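For example, with 4 KiB pages and 64 online CPUs that's a tolerated
difference of 4096 * 32 * 64 bytes = 8 MiB between memory.current and the
percpu counter.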

I'll add a comment to make it more obvious.

Thanks!

> 
> > +		ret = KSFT_PASS;
> > +	else
> > +		printf("memory.current %ld\npercpu %ld\n",
> > +		       current, percpu);
> > +
> > +cleanup_children:
> > +	for (i = 0; i < 1000; i++) {
> > +		child = cg_name_indexed(parent, "child", i);
> > +		cg_destroy(child);
> > +		free(child);
> > +	}
> > +
> > +cleanup:
> > +	cg_destroy(parent);
> > +	free(parent);
> > +
> > +	return ret;
> > +}
> > +
> >  #define T(x) { x, #x }
> >  struct kmem_test {
> >  	int (*fn)(const char *root);
> > @@ -341,6 +399,7 @@ struct kmem_test {
> >  	T(test_kmem_proc_kpagecgroup),
> >  	T(test_kmem_kernel_stacks),
> >  	T(test_kmem_dead_cgroups),
> > +	T(test_percpu_basic),
> >  };
> >  #undef T
> >  
> > -- 
> > 2.25.4
> > 
> > 

end of thread

Thread overview: 13+ messages
2020-05-28 23:25 [PATCH v1 0/5] mm: memcg accounting of percpu memory Roman Gushchin
2020-05-28 23:25 ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
2020-06-05 19:44   ` Dennis Zhou
2020-05-28 23:25 ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
2020-06-05 19:49   ` Dennis Zhou
2020-06-05 22:44     ` Roman Gushchin
2020-05-28 23:25 ` [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
2020-06-05 19:53   ` Dennis Zhou
2020-05-28 23:25 ` [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
2020-06-05 19:54   ` Dennis Zhou
2020-05-28 23:25 ` [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
2020-06-05 20:07   ` Dennis Zhou
2020-06-05 22:47     ` Roman Gushchin
