Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 0/5] mm: memcg accounting of percpu memory
@ 2020-05-19 20:18 Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

This patchset adds percpu memory accounting to memory cgroups.
It's based on the rework of the slab controller and reuses concepts
and features introduced for the per-object slab accounting.

Percpu allocations by their nature are scattered over multiple pages,
so they can't be tracked on the per-page basis. So the per-object
tracking introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds
memcg-level statistics, enables accounting for memcg internals
perpcu allocations and provides some basic tests.

It's based on top of the current version of the new slab controller
patchset. The following patches are actually required by this series:
  mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcontrol: decouple reference counting from page accounting
  mm: memcg/slab: obj_cgroup API

The whole series can be found here:
https://github.com/rgushchin/linux/pull/new/percpu_acc.RFC


Roman Gushchin (5):
  percpu: return number of released bytes from pcpu_free_area()
  mm: memcg/percpu: account percpu memory to memory cgroups
  mm: memcg/percpu: per-memcg percpu memory statistics
  mm: memcg: charge memcg percpu memory to the parent cgroup
  kselftests: cgroup: add perpcu memory accounting test

 Documentation/admin-guide/cgroup-v2.rst    |   4 +
 include/linux/memcontrol.h                 |   8 +
 mm/memcontrol.c                            |  18 +-
 mm/percpu-internal.h                       |  49 ++++-
 mm/percpu-km.c                             |   5 +-
 mm/percpu-stats.c                          |  36 ++--
 mm/percpu-vm.c                             |   5 +-
 mm/percpu.c                                | 206 ++++++++++++++++++---
 tools/testing/selftests/cgroup/test_kmem.c |  59 ++++++
 9 files changed, 341 insertions(+), 49 deletions(-)

-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC 1/5] percpu: return number of released bytes from pcpu_free_area()
  2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
@ 2020-05-19 20:18 ` Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

To implement accounting of percpu memory we need the information
about the size of freed object. Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 696367b18222..aa36b78d45a6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
  *
  * This function determines the size of an allocation to free using
  * the boundary bitmap and clears the allocation map.
+ *
+ * RETURNS:
+ * Number of freed bytes.
  */
-static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
+static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
 {
 	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
-	int bit_off, bits, end, oslot;
+	int bit_off, bits, end, oslot, freed;
 
 	lockdep_assert_held(&pcpu_lock);
 	pcpu_stats_area_dealloc(chunk);
@@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	bits = end - bit_off;
 	bitmap_clear(chunk->alloc_map, bit_off, bits);
 
+	freed = bits * PCPU_MIN_ALLOC_SIZE;
+
 	/* update metadata */
-	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
+	chunk->free_bytes += freed;
 
 	/* update first free bit */
 	chunk_md->first_free = min(chunk_md->first_free, bit_off);
@@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 	pcpu_block_update_hint_free(chunk, bit_off, bits);
 
 	pcpu_chunk_relocate(chunk, oslot);
+
+	return freed;
 }
 
 static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
@ 2020-05-19 20:18 ` Roman Gushchin
  2020-05-26 17:31   ` Tejun Heo
  2020-05-27 21:02   ` Dennis Zhou
  2020-05-19 20:18 ` [PATCH RFC 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by a user. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and big number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be
accounted by default.

To implement the perpcu accounting it's possible to take the slab
memory accounting as a model to follow. Let's introduce two types of
percpu chunks: root and memcg. What makes memcg chunks different is
an additional space allocated to store memcg membership information.
If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be
used. If it's possible to charge the corresponding size to the target
memory cgroup, allocation is performed, and the memcg ownership data
is recorded. System-wide allocations are performed using root chunks,
so there is no additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal,
we don't store mem_cgroup pointers directly: instead we use obj_cgroup
API, introduced for slab accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |  49 +++++++++++-
 mm/percpu-km.c       |   5 +-
 mm/percpu-stats.c    |  36 +++++----
 mm/percpu-vm.c       |   5 +-
 mm/percpu.c          | 183 +++++++++++++++++++++++++++++++++++++------
 5 files changed, 237 insertions(+), 41 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 0468ba500bd4..3ac5613103f7 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,19 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 
+/*
+ * There are two chunk types: root and memcg.
+ * Chunks of each type have separate slots list.
+ */
+enum pcpu_chunk_type {
+	PCPU_CHUNK_ROOT,
+#ifdef CONFIG_MEMCG_KMEM
+	PCPU_CHUNK_MEMCG,
+#endif
+	PCPU_NR_CHUNK_TYPES,
+	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
+};
+
 /*
  * pcpu_block_md is the metadata block struct.
  * Each chunk's bitmap is split into a number of full blocks.
@@ -54,6 +67,9 @@ struct pcpu_chunk {
 	int			end_offset;	/* additional area required to
 						   have the region end page
 						   aligned */
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
+#endif
 
 	int			nr_pages;	/* # of pages served by this chunk */
 	int			nr_populated;	/* # of populated pages */
@@ -63,7 +79,7 @@ struct pcpu_chunk {
 
 extern spinlock_t pcpu_lock;
 
-extern struct list_head *pcpu_slot;
+extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages;
 
@@ -106,6 +122,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
 	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	if (chunk->obj_cgroups)
+		return PCPU_CHUNK_MEMCG;
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return chunk_type == PCPU_CHUNK_MEMCG;
+}
+
+#else
+static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
+{
+	return false;
+}
+#endif
+
+static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
+{
+	return &pcpu_chunk_lists[pcpu_nr_slots *
+				 pcpu_is_memcg_chunk(chunk_type)];
+}
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 20d2b69a13b0..35c9941077ee 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	/* nada */
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
 	struct pcpu_chunk *chunk;
@@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	unsigned long flags;
 	int i;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 32558063c3f9..c8400a2adbc2 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
+	enum pcpu_chunk_type type;
 
 	max_nr_alloc = 0;
-	for (slot = 0; slot < pcpu_nr_slots; slot++)
-		list_for_each_entry(chunk, &pcpu_slot[slot], list)
-			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (slot = 0; slot < pcpu_nr_slots; slot++)
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list)
+				max_nr_alloc = max(max_nr_alloc,
+						   chunk->nr_alloc);
 
 	return max_nr_alloc;
 }
@@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 	P("cur_min_alloc", cur_min_alloc);
 	P("cur_med_alloc", cur_med_alloc);
 	P("cur_max_alloc", cur_max_alloc);
+#ifdef CONFIG_MEMCG_KMEM
+	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
+#endif
 	seq_putc(m, '\n');
 }
 
@@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	struct pcpu_chunk *chunk;
 	int slot, max_nr_alloc;
 	int *buffer;
+	enum pcpu_chunk_type type;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
 	}
 
-	for (slot = 0; slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (chunk == pcpu_first_chunk) {
-				seq_puts(m, "Chunk: <- First Chunk\n");
-				chunk_map_stats(m, chunk, buffer);
-
-
-			} else {
-				seq_puts(m, "Chunk:\n");
-				chunk_map_stats(m, chunk, buffer);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		for (slot = 0; slot < pcpu_nr_slots; slot++) {
+			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
+					    list) {
+				if (chunk == pcpu_first_chunk) {
+					seq_puts(m, "Chunk: <- First Chunk\n");
+					chunk_map_stats(m, chunk, buffer);
+				} else {
+					seq_puts(m, "Chunk:\n");
+					chunk_map_stats(m, chunk, buffer);
+				}
 			}
-
 		}
 	}
 
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index a2b395acef89..e46f7a6917f9 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	struct vm_struct **vms;
 
-	chunk = pcpu_alloc_chunk(gfp);
+	chunk = pcpu_alloc_chunk(type, gfp);
 	if (!chunk)
 		return NULL;
 
diff --git a/mm/percpu.c b/mm/percpu.c
index aa36b78d45a6..4615a0a809b8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -81,6 +81,7 @@
 #include <linux/kmemleak.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
+#include <linux/memcontrol.h>
 
 #include <asm/cacheflush.h>
 #include <asm/sections.h>
@@ -160,7 +161,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
 
-struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
+struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
 
 /* chunks which need their map areas extended, protected by pcpu_lock */
 static LIST_HEAD(pcpu_map_extend_chunks);
@@ -500,6 +501,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
 			      bool move_front)
 {
 	if (chunk != pcpu_reserved_chunk) {
+		struct list_head *pcpu_slot;
+
+		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
 		if (move_front)
 			list_move(&chunk->list, &pcpu_slot[slot]);
 		else
@@ -1341,6 +1345,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
 		      alloc_size);
 
+#ifdef CONFIG_MEMCG_KMEM
+	/* first chunk isn't memcg-aware */
+	chunk->obj_cgroups = NULL;
+#endif
 	pcpu_init_md_blocks(chunk);
 
 	/* manage populated page bitmap */
@@ -1380,7 +1388,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 	return chunk;
 }
 
-static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
+static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 {
 	struct pcpu_chunk *chunk;
 	int region_bits;
@@ -1408,6 +1416,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 	if (!chunk->md_blocks)
 		goto md_blocks_fail;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (pcpu_is_memcg_chunk(type)) {
+		chunk->obj_cgroups =
+			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
+					sizeof(struct obj_cgroup *), gfp);
+		if (!chunk->obj_cgroups)
+			goto objcg_fail;
+	}
+#endif
+
 	pcpu_init_md_blocks(chunk);
 
 	/* init metadata */
@@ -1415,6 +1433,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
 	return chunk;
 
+objcg_fail:
+	pcpu_mem_free(chunk->md_blocks);
 md_blocks_fail:
 	pcpu_mem_free(chunk->bound_map);
 bound_map_fail:
@@ -1429,6 +1449,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+#ifdef CONFIG_MEMCG_KMEM
+	pcpu_mem_free(chunk->obj_cgroups);
+#endif
 	pcpu_mem_free(chunk->md_blocks);
 	pcpu_mem_free(chunk->bound_map);
 	pcpu_mem_free(chunk->alloc_map);
@@ -1505,7 +1528,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
-static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
+static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
+					    gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
 static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
@@ -1547,6 +1571,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct obj_cgroup **objcgp)
+{
+	struct obj_cgroup *objcg;
+
+	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
+	    memcg_kmem_bypass())
+		return PCPU_CHUNK_ROOT;
+
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return PCPU_CHUNK_ROOT;
+
+	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+		obj_cgroup_put(objcg);
+		return PCPU_FAIL_ALLOC;
+	}
+
+	*objcgp = objcg;
+	return PCPU_CHUNK_MEMCG;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+	if (!objcg)
+		return;
+
+	if (chunk) {
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+	} else {
+		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_put(objcg);
+	}
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+	struct obj_cgroup *objcg;
+
+	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
+		return;
+
+	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+
+	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+
+	obj_cgroup_put(objcg);
+}
+
+#else /* CONFIG_MEMCG_KMEM */
+static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
+						     struct mem_cgroup **memcgp)
+{
+	return PCPU_CHUNK_ROOT;
+}
+
+static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
+				       struct pcpu_chunk *chunk, int off,
+				       size_t size)
+{
+}
+
+static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1565,9 +1660,14 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 				 gfp_t gfp)
 {
-	gfp_t pcpu_gfp;
-	bool is_atomic;
-	bool do_warn;
+	/* whitelisted flags that can be passed to the backing allocators */
+	gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN |
+				__GFP_ACCOUNT);
+	bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
+	bool do_warn = !(gfp & __GFP_NOWARN);
+	enum pcpu_chunk_type type;
+	struct list_head *pcpu_slot;
+	struct obj_cgroup *objcg = NULL;
 	static int warn_limit = 10;
 	struct pcpu_chunk *chunk, *next;
 	const char *err;
@@ -1602,16 +1702,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
+	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
+	if (unlikely(type == PCPU_FAIL_ALLOC))
+		return NULL;
+	pcpu_slot = pcpu_chunk_list(type);
+
 	if (!is_atomic) {
 		/*
 		 * pcpu_balance_workfn() allocates memory under this mutex,
 		 * and it may wait for memory reclaim. Allow current task
 		 * to become OOM victim, in case of memory pressure.
 		 */
-		if (gfp & __GFP_NOFAIL)
+		if (gfp & __GFP_NOFAIL) {
 			mutex_lock(&pcpu_alloc_mutex);
-		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
+			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
 			return NULL;
+		}
 	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
@@ -1637,7 +1744,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 restart:
 	/* search through normal chunks */
 	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
-		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
+		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
+					 list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
 			if (off < 0) {
@@ -1666,7 +1774,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
-		chunk = pcpu_create_chunk(pcpu_gfp);
+		chunk = pcpu_create_chunk(type, pcpu_gfp);
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1723,6 +1831,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
 			chunk->base_addr, off, ptr);
 
+	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+
 	return ptr;
 
 fail_unlock:
@@ -1744,6 +1854,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	} else {
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
+
+	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+
 	return NULL;
 }
 
@@ -1803,8 +1916,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 }
 
 /**
- * pcpu_balance_workfn - manage the amount of free chunks and populated pages
- * @work: unused
+ * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @type: chunk type
  *
  * Reclaim all fully free chunks except for the first one.  This is also
  * responsible for maintaining the pool of empty populated pages.  However,
@@ -1813,11 +1926,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_workfn(struct work_struct *work)
+static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 {
 	/* gfp flags passed to underlying allocators */
 	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	LIST_HEAD(to_free);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
 	struct pcpu_chunk *chunk, *next;
 	int slot, nr_to_pop, ret;
@@ -1915,7 +2029,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 
 	if (nr_to_pop) {
 		/* ran out of chunks to populate, create a new one and retry */
-		chunk = pcpu_create_chunk(gfp);
+		chunk = pcpu_create_chunk(type, gfp);
 		if (chunk) {
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_relocate(chunk, -1);
@@ -1927,6 +2041,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	mutex_unlock(&pcpu_alloc_mutex);
 }
 
+/**
+ * pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * @work: unused
+ *
+ * Call __pcpu_balance_workfn() for each chunk type.
+ */
+static void pcpu_balance_workfn(struct work_struct *work)
+{
+	enum pcpu_chunk_type type;
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_balance_workfn(type);
+}
+
 /**
  * free_percpu - free percpu area
  * @ptr: pointer to area to free
@@ -1941,8 +2069,9 @@ void free_percpu(void __percpu *ptr)
 	void *addr;
 	struct pcpu_chunk *chunk;
 	unsigned long flags;
-	int off;
+	int size, off;
 	bool need_balance = false;
+	struct list_head *pcpu_slot;
 
 	if (!ptr)
 		return;
@@ -1956,7 +2085,11 @@ void free_percpu(void __percpu *ptr)
 	chunk = pcpu_chunk_addr_search(addr);
 	off = addr - chunk->base_addr;
 
-	pcpu_free_area(chunk, off);
+	size = pcpu_free_area(chunk, off);
+
+	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
+
+	pcpu_memcg_free_hook(chunk, off, size);
 
 	/* if there are more than one fully free chunks, wake up grim reaper */
 	if (chunk->free_bytes == pcpu_unit_size) {
@@ -2267,6 +2400,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	int map_size;
 	unsigned long tmp_addr;
 	size_t alloc_size;
+	enum pcpu_chunk_type type;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -2384,13 +2518,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * empty chunks.
 	 */
 	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
-	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
-				   SMP_CACHE_BYTES);
-	if (!pcpu_slot)
+	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
+					  sizeof(pcpu_chunk_lists[0]) *
+					  PCPU_NR_CHUNK_TYPES,
+					  SMP_CACHE_BYTES);
+	if (!pcpu_chunk_lists)
 		panic("%s: Failed to allocate %zu bytes\n", __func__,
-		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
-	for (i = 0; i < pcpu_nr_slots; i++)
-		INIT_LIST_HEAD(&pcpu_slot[i]);
+		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
+		      PCPU_NR_CHUNK_TYPES);
+
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		for (i = 0; i < pcpu_nr_slots; i++)
+			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
 
 	/*
 	 * The end of the static region needs to be aligned with the
-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
  2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-05-19 20:18 ` Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
  4 siblings, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Percpu memory can represent a noticeable chunk of the total
memory consumption, especially on big machines with many CPUs.
Let's track percpu memory usage for each memcg and display
it in memory.stat.

A percpu allocation is usually scattered over multiple pages
(and nodes), and can be significantly smaller than a page.
So let's add a byte-sized counter on the memcg level:
MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
can be perfectly reused for percpu case.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 include/linux/memcontrol.h              |  8 ++++++++
 mm/memcontrol.c                         |  4 +++-
 mm/percpu.c                             | 10 ++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index fed4e1d2a343..aa8cb6dadadc 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1276,6 +1276,10 @@ PAGE_SIZE multiple when read back.
 		Amount of memory used for storing in-kernel data
 		structures.
 
+	  percpu
+		Amount of memory used for storing per-cpu kernel
+		data structures.
+
 	  sock
 		Amount of memory used in network transmission buffers
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 968b73cad428..a3086c832799 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,11 +32,19 @@ struct kmem_cache;
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
+	MEMCG_PERCPU_B,
 	/* XXX: why are these zone and not node counters? */
 	MEMCG_KERNEL_STACK_KB,
 	MEMCG_NR_STAT,
 };
 
+static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
+{
+	if (item == MEMCG_PERCPU_B)
+		return true;
+	return vmstat_item_in_bytes(item);
+}
+
 enum memcg_memory_event {
 	MEMCG_LOW,
 	MEMCG_HIGH,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 84a1fb848a85..b3cf8c5e7f71 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
-	if (vmstat_item_in_bytes(idx))
+	if (memcg_stat_item_in_bytes(idx))
 		threshold <<= PAGE_SHIFT;
 
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
@@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab %llu\n",
 		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
 			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
+	seq_buf_printf(&s, "percpu %llu\n",
+		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
diff --git a/mm/percpu.c b/mm/percpu.c
index 4615a0a809b8..71de0bae96f0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1603,6 +1603,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 	if (chunk) {
 		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+				size * num_possible_cpus());
+		rcu_read_unlock();
 	} else {
 		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 		obj_cgroup_put(objcg);
@@ -1621,6 +1626,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
 
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+	rcu_read_unlock();
+
 	obj_cgroup_put(objcg);
 }
 
-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-05-19 20:18 ` [PATCH RFC 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
@ 2020-05-19 20:18 ` Roman Gushchin
  2020-05-19 20:18 ` [PATCH RFC 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin
  4 siblings, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Memory cgroups are using large chunks of percpu memory to store
vmstat data. Yet this memory is not accounted at all, so in the
case when there are many (dying) cgroups, it's not exactly clear
where all the memory is.

Because the size of  memory cgroup internal structures can
dramatically exceed the size of object or page which is pinning
it in the memory, it's not a good idea to simple ignore it.
It actually breaks the isolation between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3cf8c5e7f71..6463e5bdff9d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5016,13 +5016,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
+						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
 		kfree(pn);
 		return 1;
 	}
 
-	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
+					       GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_cpu) {
 		free_percpu(pn->lruvec_stat_local);
 		kfree(pn);
@@ -5096,11 +5098,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
 		goto fail;
 
-	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
@@ -5149,7 +5153,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 
+	memalloc_use_memcg(parent);
 	memcg = mem_cgroup_alloc();
+	memalloc_unuse_memcg();
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC 5/5] kselftests: cgroup: add perpcu memory accounting test
  2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-05-19 20:18 ` [PATCH RFC 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
@ 2020-05-19 20:18 ` Roman Gushchin
  4 siblings, 0 replies; 8+ messages in thread
From: Roman Gushchin @ 2020-05-19 20:18 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel, Roman Gushchin

Add a simple test to check the percpu memory accounting.
The test creates a cgroup tree with 1000 child cgroups
and checks values of memory.current and memory.stat::percpu.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
index 5bc1132fec6b..0c37e14a2fa4 100644
--- a/tools/testing/selftests/cgroup/test_kmem.c
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
 	return ret;
 }
 
+/*
+ * This test creates a sub-tree with 1000 memory cgroups.
+ * Then it checks that the memory.current on the parent level
+ * is greater than 0 and approximates matches the percpu value
+ * from memory.stat.
+ */
+static int test_percpu_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent, *child;
+	long current, percpu;
+	int i;
+
+	parent = cg_name(root, "percpu_basic_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child))
+			goto cleanup_children;
+
+		free(child);
+	}
+
+	current = cg_read_long(parent, "memory.current");
+	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
+
+	if (current > 0 && percpu > 0 && abs(current - percpu) <
+	    4096 * 32 * get_nprocs())
+		ret = KSFT_PASS;
+	else
+		printf("memory.current %ld\npercpu %ld\n",
+		       current, percpu);
+
+cleanup_children:
+	for (i = 0; i < 1000; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		cg_destroy(child);
+		free(child);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
 #define T(x) { x, #x }
 struct kmem_test {
 	int (*fn)(const char *root);
@@ -341,6 +399,7 @@ struct kmem_test {
 	T(test_kmem_proc_kpagecgroup),
 	T(test_kmem_kernel_stacks),
 	T(test_kmem_dead_cgroups),
+	T(test_percpu_basic),
 };
 #undef T
 
-- 
2.25.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-05-19 20:18 ` [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
@ 2020-05-26 17:31   ` Tejun Heo
  2020-05-27 21:02   ` Dennis Zhou
  1 sibling, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2020-05-26 17:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, May 19, 2020 at 01:18:03PM -0700, Roman Gushchin wrote:
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make a good part of the total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by a user. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and big number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting is creating a breach in the memory
> isolation. Similar to the slab memory, percpu memory should be
> accounted by default.
> 
> To implement the perpcu accounting it's possible to take the slab
> memory accounting as a model to follow. Let's introduce two types of
> percpu chunks: root and memcg. What makes memcg chunks different is
> an additional space allocated to store memcg membership information.
> If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be
> used. If it's possible to charge the corresponding size to the target
> memory cgroup, allocation is performed, and the memcg ownership data
> is recorded. System-wide allocations are performed using root chunks,
> so there is no additional memory overhead.
> 
> To implement a fast reparenting of percpu memory on memcg removal,
> we don't store mem_cgroup pointers directly: instead we use obj_cgroup
> API, introduced for slab accounting.

The overall approach makes sense to me but it'd help to have a high level
comment explaining what's going on and why.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-05-19 20:18 ` [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
  2020-05-26 17:31   ` Tejun Heo
@ 2020-05-27 21:02   ` Dennis Zhou
  1 sibling, 0 replies; 8+ messages in thread
From: Dennis Zhou @ 2020-05-27 21:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Tue, May 19, 2020 at 01:18:03PM -0700, Roman Gushchin wrote:
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make a good part of the total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by a user. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and big number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting is creating a breach in the memory
> isolation. Similar to the slab memory, percpu memory should be
> accounted by default.
> 
> To implement the perpcu accounting it's possible to take the slab
> memory accounting as a model to follow. Let's introduce two types of
> percpu chunks: root and memcg. What makes memcg chunks different is
> an additional space allocated to store memcg membership information.
> If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be
> used. If it's possible to charge the corresponding size to the target
> memory cgroup, allocation is performed, and the memcg ownership data
> is recorded. System-wide allocations are performed using root chunks,
> so there is no additional memory overhead.
> 
> To implement a fast reparenting of percpu memory on memcg removal,
> we don't store mem_cgroup pointers directly: instead we use obj_cgroup
> API, introduced for slab accounting.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu-internal.h |  49 +++++++++++-
>  mm/percpu-km.c       |   5 +-
>  mm/percpu-stats.c    |  36 +++++----
>  mm/percpu-vm.c       |   5 +-
>  mm/percpu.c          | 183 +++++++++++++++++++++++++++++++++++++------
>  5 files changed, 237 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 0468ba500bd4..3ac5613103f7 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -5,6 +5,19 @@
>  #include <linux/types.h>
>  #include <linux/percpu.h>
>  
> +/*
> + * There are two chunk types: root and memcg.
> + * Chunks of each type have separate slots list.
> + */

This seems like a good place to add the comment tj asked for. In
addition, do you mind slipping something short into the big block
comment at the top mention this is a thing now too.

> +enum pcpu_chunk_type {
> +	PCPU_CHUNK_ROOT,
> +#ifdef CONFIG_MEMCG_KMEM
> +	PCPU_CHUNK_MEMCG,
> +#endif
> +	PCPU_NR_CHUNK_TYPES,
> +	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
> +};
> +
>  /*
>   * pcpu_block_md is the metadata block struct.
>   * Each chunk's bitmap is split into a number of full blocks.
> @@ -54,6 +67,9 @@ struct pcpu_chunk {
>  	int			end_offset;	/* additional area required to
>  						   have the region end page
>  						   aligned */
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
> +#endif
>  
>  	int			nr_pages;	/* # of pages served by this chunk */
>  	int			nr_populated;	/* # of populated pages */
> @@ -63,7 +79,7 @@ struct pcpu_chunk {
>  
>  extern spinlock_t pcpu_lock;
>  
> -extern struct list_head *pcpu_slot;
> +extern struct list_head *pcpu_chunk_lists;
>  extern int pcpu_nr_slots;
>  extern int pcpu_nr_empty_pop_pages;
>  
> @@ -106,6 +122,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
>  	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	if (chunk->obj_cgroups)
> +		return PCPU_CHUNK_MEMCG;
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return chunk_type == PCPU_CHUNK_MEMCG;
> +}
> +
> +#else
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return false;
> +}
> +#endif
> +
> +static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
> +{
> +	return &pcpu_chunk_lists[pcpu_nr_slots *
> +				 pcpu_is_memcg_chunk(chunk_type)];
> +}
> +
>  #ifdef CONFIG_PERCPU_STATS
>  
>  #include <linux/spinlock.h>
> diff --git a/mm/percpu-km.c b/mm/percpu-km.c
> index 20d2b69a13b0..35c9941077ee 100644
> --- a/mm/percpu-km.c
> +++ b/mm/percpu-km.c
> @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	/* nada */
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
>  	struct pcpu_chunk *chunk;
> @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  	unsigned long flags;
>  	int i;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index 32558063c3f9..c8400a2adbc2 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
>  {
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
> +	enum pcpu_chunk_type type;
>  
>  	max_nr_alloc = 0;
> -	for (slot = 0; slot < pcpu_nr_slots; slot++)
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> -			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (slot = 0; slot < pcpu_nr_slots; slot++)
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list)
> +				max_nr_alloc = max(max_nr_alloc,
> +						   chunk->nr_alloc);
>  
>  	return max_nr_alloc;
>  }
> @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  	P("cur_min_alloc", cur_min_alloc);
>  	P("cur_med_alloc", cur_med_alloc);
>  	P("cur_max_alloc", cur_max_alloc);
> +#ifdef CONFIG_MEMCG_KMEM
> +	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
> +#endif
>  	seq_putc(m, '\n');
>  }
>  
> @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
>  	int *buffer;
> +	enum pcpu_chunk_type type;
>  
>  alloc_buffer:
>  	spin_lock_irq(&pcpu_lock);
> @@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
>  	}
>  
> -	for (slot = 0; slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> -			if (chunk == pcpu_first_chunk) {
> -				seq_puts(m, "Chunk: <- First Chunk\n");
> -				chunk_map_stats(m, chunk, buffer);
> -
> -
> -			} else {
> -				seq_puts(m, "Chunk:\n");
> -				chunk_map_stats(m, chunk, buffer);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
> +		for (slot = 0; slot < pcpu_nr_slots; slot++) {
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list) {
> +				if (chunk == pcpu_first_chunk) {
> +					seq_puts(m, "Chunk: <- First Chunk\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				} else {
> +					seq_puts(m, "Chunk:\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				}
>  			}
> -
>  		}
>  	}
>  
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index a2b395acef89..e46f7a6917f9 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	pcpu_free_pages(chunk, pages, page_start, page_end);
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	struct vm_struct **vms;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index aa36b78d45a6..4615a0a809b8 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -81,6 +81,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
> +#include <linux/memcontrol.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/sections.h>
> @@ -160,7 +161,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
>  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
>  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
>  
> -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
>  
>  /* chunks which need their map areas extended, protected by pcpu_lock */
>  static LIST_HEAD(pcpu_map_extend_chunks);
> @@ -500,6 +501,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
>  			      bool move_front)
>  {
>  	if (chunk != pcpu_reserved_chunk) {
> +		struct list_head *pcpu_slot;
> +
> +		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
>  		if (move_front)
>  			list_move(&chunk->list, &pcpu_slot[slot]);
>  		else
> @@ -1341,6 +1345,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
>  		      alloc_size);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	/* first chunk isn't memcg-aware */
> +	chunk->obj_cgroups = NULL;
> +#endif
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* manage populated page bitmap */
> @@ -1380,7 +1388,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  	return chunk;
>  }
>  
> -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	int region_bits;
> @@ -1408,6 +1416,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  	if (!chunk->md_blocks)
>  		goto md_blocks_fail;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (pcpu_is_memcg_chunk(type)) {
> +		chunk->obj_cgroups =
> +			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
> +					sizeof(struct obj_cgroup *), gfp);
> +		if (!chunk->obj_cgroups)
> +			goto objcg_fail;
> +	}
> +#endif
> +
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* init metadata */
> @@ -1415,6 +1433,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  
>  	return chunk;
>  
> +objcg_fail:
> +	pcpu_mem_free(chunk->md_blocks);
>  md_blocks_fail:
>  	pcpu_mem_free(chunk->bound_map);
>  bound_map_fail:
> @@ -1429,6 +1449,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>  	if (!chunk)
>  		return;
> +#ifdef CONFIG_MEMCG_KMEM
> +	pcpu_mem_free(chunk->obj_cgroups);
> +#endif
>  	pcpu_mem_free(chunk->md_blocks);
>  	pcpu_mem_free(chunk->bound_map);
>  	pcpu_mem_free(chunk->alloc_map);
> @@ -1505,7 +1528,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
>  			       int page_start, int page_end, gfp_t gfp);
>  static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  				  int page_start, int page_end);
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp);
>  static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
>  static struct page *pcpu_addr_to_page(void *addr);
>  static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
> @@ -1547,6 +1571,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>  	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct obj_cgroup **objcgp)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
> +	    memcg_kmem_bypass())
> +		return PCPU_CHUNK_ROOT;
> +
> +	objcg = get_obj_cgroup_from_current();
> +	if (!objcg)
> +		return PCPU_CHUNK_ROOT;
> +
> +	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
> +		obj_cgroup_put(objcg);
> +		return PCPU_FAIL_ALLOC;
> +	}
> +
> +	*objcgp = objcg;
> +	return PCPU_CHUNK_MEMCG;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +	if (!objcg)
> +		return;
> +
> +	if (chunk) {
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> +	} else {
> +		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +		obj_cgroup_put(objcg);
> +	}
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
> +		return;
> +
> +	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> +	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> +
> +	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +
> +	obj_cgroup_put(objcg);
> +}
> +
> +#else /* CONFIG_MEMCG_KMEM */
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct mem_cgroup **memcgp)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  /**
>   * pcpu_alloc - the percpu allocator
>   * @size: size of area to allocate in bytes
> @@ -1565,9 +1660,14 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>  static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  				 gfp_t gfp)
>  {
> -	gfp_t pcpu_gfp;
> -	bool is_atomic;
> -	bool do_warn;
> +	/* whitelisted flags that can be passed to the backing allocators */
> +	gfp_t pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN |
> +				__GFP_ACCOUNT);
> +	bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
> +	bool do_warn = !(gfp & __GFP_NOWARN);
> +	enum pcpu_chunk_type type;
> +	struct list_head *pcpu_slot;
> +	struct obj_cgroup *objcg = NULL;

First, this collides w/ another patch that went in, 28307d938fb2.

Second, pcpu_gfp is to limit the flags that actually make it to the
backing allocator. I don't believe we want to pass __GFP_ACCOUNT to the
backing allocator right? This would charge a whole page in addition to
the hooks (should we need to allocate).  The pcpu_memcg_*_hook()s can
use the passed gfp directly.

>  	static int warn_limit = 10;
>  	struct pcpu_chunk *chunk, *next;
>  	const char *err;
> @@ -1602,16 +1702,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		return NULL;
>  	}
>  
> +	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);

^ as you're doing here using gfp not pcpu_gfp.

> +	if (unlikely(type == PCPU_FAIL_ALLOC))
> +		return NULL;
> +	pcpu_slot = pcpu_chunk_list(type);
> +
>  	if (!is_atomic) {
>  		/*
>  		 * pcpu_balance_workfn() allocates memory under this mutex,
>  		 * and it may wait for memory reclaim. Allow current task
>  		 * to become OOM victim, in case of memory pressure.
>  		 */
> -		if (gfp & __GFP_NOFAIL)
> +		if (gfp & __GFP_NOFAIL) {
>  			mutex_lock(&pcpu_alloc_mutex);
> -		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> +		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
> +			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
>  			return NULL;
> +		}
>  	}
>  
>  	spin_lock_irqsave(&pcpu_lock, flags);
> @@ -1637,7 +1744,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  restart:
>  	/* search through normal chunks */
>  	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
> +		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
> +					 list) {
>  			off = pcpu_find_block_fit(chunk, bits, bit_align,
>  						  is_atomic);
>  			if (off < 0) {
> @@ -1666,7 +1774,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	}
>  
>  	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
> -		chunk = pcpu_create_chunk(pcpu_gfp);
> +		chunk = pcpu_create_chunk(type, pcpu_gfp);
>  		if (!chunk) {
>  			err = "failed to allocate new chunk";
>  			goto fail;
> @@ -1723,6 +1831,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
>  			chunk->base_addr, off, ptr);
>  
> +	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
> +
>  	return ptr;
>  
>  fail_unlock:
> @@ -1744,6 +1854,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	} else {
>  		mutex_unlock(&pcpu_alloc_mutex);
>  	}
> +
> +	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
> +
>  	return NULL;
>  }
>  
> @@ -1803,8 +1916,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  }
>  
>  /**
> - * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> - * @work: unused
> + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @type: chunk type
>   *
>   * Reclaim all fully free chunks except for the first one.  This is also
>   * responsible for maintaining the pool of empty populated pages.  However,
> @@ -1813,11 +1926,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>   * allocation causes the failure as it is possible that requests can be
>   * serviced from already backed regions.
>   */
> -static void pcpu_balance_workfn(struct work_struct *work)
> +static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
>  {
>  	/* gfp flags passed to underlying allocators */
>  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	LIST_HEAD(to_free);
> +	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
>  	struct pcpu_chunk *chunk, *next;
>  	int slot, nr_to_pop, ret;
> @@ -1915,7 +2029,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  
>  	if (nr_to_pop) {
>  		/* ran out of chunks to populate, create a new one and retry */
> -		chunk = pcpu_create_chunk(gfp);
> +		chunk = pcpu_create_chunk(type, gfp);
>  		if (chunk) {
>  			spin_lock_irq(&pcpu_lock);
>  			pcpu_chunk_relocate(chunk, -1);
> @@ -1927,6 +2041,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	mutex_unlock(&pcpu_alloc_mutex);
>  }
>  
> +/**
> + * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @work: unused
> + *
> + * Call __pcpu_balance_workfn() for each chunk type.
> + */
> +static void pcpu_balance_workfn(struct work_struct *work)
> +{
> +	enum pcpu_chunk_type type;
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		__pcpu_balance_workfn(type);
> +}
> +
>  /**
>   * free_percpu - free percpu area
>   * @ptr: pointer to area to free
> @@ -1941,8 +2069,9 @@ void free_percpu(void __percpu *ptr)
>  	void *addr;
>  	struct pcpu_chunk *chunk;
>  	unsigned long flags;
> -	int off;
> +	int size, off;
>  	bool need_balance = false;
> +	struct list_head *pcpu_slot;
>  
>  	if (!ptr)
>  		return;
> @@ -1956,7 +2085,11 @@ void free_percpu(void __percpu *ptr)
>  	chunk = pcpu_chunk_addr_search(addr);
>  	off = addr - chunk->base_addr;
>  
> -	pcpu_free_area(chunk, off);
> +	size = pcpu_free_area(chunk, off);
> +
> +	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
> +
> +	pcpu_memcg_free_hook(chunk, off, size);
>  
>  	/* if there are more than one fully free chunks, wake up grim reaper */
>  	if (chunk->free_bytes == pcpu_unit_size) {
> @@ -2267,6 +2400,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	int map_size;
>  	unsigned long tmp_addr;
>  	size_t alloc_size;
> +	enum pcpu_chunk_type type;
>  
>  #define PCPU_SETUP_BUG_ON(cond)	do {					\
>  	if (unlikely(cond)) {						\
> @@ -2384,13 +2518,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	 * empty chunks.
>  	 */
>  	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> -	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
> -				   SMP_CACHE_BYTES);
> -	if (!pcpu_slot)
> +	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
> +					  sizeof(pcpu_chunk_lists[0]) *
> +					  PCPU_NR_CHUNK_TYPES,
> +					  SMP_CACHE_BYTES);
> +	if (!pcpu_chunk_lists)
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
> -		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
> -	for (i = 0; i < pcpu_nr_slots; i++)
> -		INIT_LIST_HEAD(&pcpu_slot[i]);
> +		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
> +		      PCPU_NR_CHUNK_TYPES);
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (i = 0; i < pcpu_nr_slots; i++)
> +			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
>  
>  	/*
>  	 * The end of the static region needs to be aligned with the
> -- 
> 2.25.4
> 

Thanks this looks pretty good to me.

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 20:18 [PATCH RFC 0/5] mm: memcg accounting of percpu memory Roman Gushchin
2020-05-19 20:18 ` [PATCH RFC 1/5] percpu: return number of released bytes from pcpu_free_area() Roman Gushchin
2020-05-19 20:18 ` [PATCH RFC 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Roman Gushchin
2020-05-26 17:31   ` Tejun Heo
2020-05-27 21:02   ` Dennis Zhou
2020-05-19 20:18 ` [PATCH RFC 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Roman Gushchin
2020-05-19 20:18 ` [PATCH RFC 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Roman Gushchin
2020-05-19 20:18 ` [PATCH RFC 5/5] kselftests: cgroup: add perpcu memory accounting test Roman Gushchin

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git