linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area()
       [not found] ` <20200528232508.1132382-2-guro@fb.com>
@ 2020-06-05 19:44   ` Dennis Zhou
  0 siblings, 0 replies; 7+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:04PM -0700, Roman Gushchin wrote:
> To implement accounting of percpu memory we need to know the size of
> the freed object. Return it from pcpu_free_area().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 696367b18222..aa36b78d45a6 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1211,11 +1211,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
>   *
>   * This function determines the size of an allocation to free using
>   * the boundary bitmap and clears the allocation map.
> + *
> + * RETURNS:
> + * Number of freed bytes.
>   */
> -static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
> +static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  {
>  	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> -	int bit_off, bits, end, oslot;
> +	int bit_off, bits, end, oslot, freed;
>  
>  	lockdep_assert_held(&pcpu_lock);
>  	pcpu_stats_area_dealloc(chunk);
> @@ -1230,8 +1233,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  	bits = end - bit_off;
>  	bitmap_clear(chunk->alloc_map, bit_off, bits);
>  
> +	freed = bits * PCPU_MIN_ALLOC_SIZE;
> +
>  	/* update metadata */
> -	chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
> +	chunk->free_bytes += freed;
>  
>  	/* update first free bit */
>  	chunk_md->first_free = min(chunk_md->first_free, bit_off);
> @@ -1239,6 +1244,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  	pcpu_block_update_hint_free(chunk, bit_off, bits);
>  
>  	pcpu_chunk_relocate(chunk, oslot);
> +
> +	return freed;
>  }
>  
>  static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
> -- 
> 2.25.4
> 

Sorry for the delay.

Acked-by: Dennis Zhou <dennis@kernel.org>

What's the status of the patches this series depends on? It might be
easiest to have Andrew pick these up once that series is settled.

Thanks,
Dennis
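
For context on how the new return value gets consumed later in the
series, patch 2 (quoted further down in this thread) feeds it straight
into the memcg free hook, roughly:

	size = pcpu_free_area(chunk, off);
	/* uncharges size * num_possible_cpus() bytes from the owning cgroup */
	pcpu_memcg_free_hook(chunk, off, size);

so the accounting side knows exactly how many bytes to release per CPU.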


* Re: [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
       [not found] ` <20200528232508.1132382-3-guro@fb.com>
@ 2020-06-05 19:49   ` Dennis Zhou
  2020-06-05 22:44     ` Roman Gushchin
  0 siblings, 1 reply; 7+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:05PM -0700, Roman Gushchin wrote:
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make up a good part of the total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by a user. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and a big number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting creates a breach in memory
> isolation. Similar to slab memory, percpu memory should be
> accounted by default.
> 
> To implement the percpu accounting it's possible to take the slab
> memory accounting as a model to follow. Let's introduce two types of
> percpu chunks: root and memcg. What makes memcg chunks different is
> an additional space allocated to store memcg membership information.
> If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be
> used. If it's possible to charge the corresponding size to the target
> memory cgroup, allocation is performed, and the memcg ownership data
> is recorded. System-wide allocations are performed using root chunks,
> so there is no additional memory overhead.
> 
> To implement a fast reparenting of percpu memory on memcg removal,
> we don't store mem_cgroup pointers directly: instead we use obj_cgroup
> API, introduced for slab accounting.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu-internal.h |  57 ++++++++++++-
>  mm/percpu-km.c       |   5 +-
>  mm/percpu-stats.c    |  36 +++++----
>  mm/percpu-vm.c       |   5 +-
>  mm/percpu.c          | 186 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 248 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 0468ba500bd4..0cf36337eb47 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -5,6 +5,27 @@
>  #include <linux/types.h>
>  #include <linux/percpu.h>
>  
> +/*
> + * There are two chunk types: root and memcg-aware.
> + * Chunks of each type have separate slots list.
> + *
> + * Memcg-aware chunks have an attached vector of obj_cgroup
> + * pointers, which is used to store memcg membership data
> + * of a percpu object. Obj_cgroups are ref-counted pointers
> + * to a memory cgroup with an ability to switch dynamically
> + * to the parent memory cgroup. This allows to reclaim a deleted
> + * memory cgroup without reclaiming of all outstanding objects,
> + * which do hold a reference at it.
> + */

nit: do you mind reflowing this to 80 characters and doing 2 spaces
after each period to keep the formatting uniform.

> +enum pcpu_chunk_type {
> +	PCPU_CHUNK_ROOT,
> +#ifdef CONFIG_MEMCG_KMEM
> +	PCPU_CHUNK_MEMCG,
> +#endif
> +	PCPU_NR_CHUNK_TYPES,
> +	PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
> +};
> +
>  /*
>   * pcpu_block_md is the metadata block struct.
>   * Each chunk's bitmap is split into a number of full blocks.
> @@ -54,6 +75,9 @@ struct pcpu_chunk {
>  	int			end_offset;	/* additional area required to
>  						   have the region end page
>  						   aligned */
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct obj_cgroup	**obj_cgroups;	/* vector of object cgroups */
> +#endif
>  
>  	int			nr_pages;	/* # of pages served by this chunk */
>  	int			nr_populated;	/* # of populated pages */
> @@ -63,7 +87,7 @@ struct pcpu_chunk {
>  
>  extern spinlock_t pcpu_lock;
>  
> -extern struct list_head *pcpu_slot;
> +extern struct list_head *pcpu_chunk_lists;
>  extern int pcpu_nr_slots;
>  extern int pcpu_nr_empty_pop_pages;
>  
> @@ -106,6 +130,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
>  	return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	if (chunk->obj_cgroups)
> +		return PCPU_CHUNK_MEMCG;
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return chunk_type == PCPU_CHUNK_MEMCG;
> +}
> +
> +#else
> +static enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type)
> +{
> +	return false;
> +}
> +#endif
> +
> +static struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type)
> +{
> +	return &pcpu_chunk_lists[pcpu_nr_slots *
> +				 pcpu_is_memcg_chunk(chunk_type)];
> +}
> +
>  #ifdef CONFIG_PERCPU_STATS
>  
>  #include <linux/spinlock.h>
> diff --git a/mm/percpu-km.c b/mm/percpu-km.c
> index 20d2b69a13b0..35c9941077ee 100644
> --- a/mm/percpu-km.c
> +++ b/mm/percpu-km.c
> @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	/* nada */
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
>  	struct pcpu_chunk *chunk;
> @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  	unsigned long flags;
>  	int i;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index 32558063c3f9..c8400a2adbc2 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void)
>  {
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
> +	enum pcpu_chunk_type type;
>  
>  	max_nr_alloc = 0;
> -	for (slot = 0; slot < pcpu_nr_slots; slot++)
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> -			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (slot = 0; slot < pcpu_nr_slots; slot++)
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list)
> +				max_nr_alloc = max(max_nr_alloc,
> +						   chunk->nr_alloc);
>  
>  	return max_nr_alloc;
>  }
> @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  	P("cur_min_alloc", cur_min_alloc);
>  	P("cur_med_alloc", cur_med_alloc);
>  	P("cur_max_alloc", cur_max_alloc);
> +#ifdef CONFIG_MEMCG_KMEM
> +	P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)));
> +#endif
>  	seq_putc(m, '\n');
>  }
>  
> @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  	struct pcpu_chunk *chunk;
>  	int slot, max_nr_alloc;
>  	int *buffer;
> +	enum pcpu_chunk_type type;
>  
>  alloc_buffer:
>  	spin_lock_irq(&pcpu_lock);
> @@ -202,18 +210,18 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  		chunk_map_stats(m, pcpu_reserved_chunk, buffer);
>  	}
>  
> -	for (slot = 0; slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> -			if (chunk == pcpu_first_chunk) {
> -				seq_puts(m, "Chunk: <- First Chunk\n");
> -				chunk_map_stats(m, chunk, buffer);
> -
> -
> -			} else {
> -				seq_puts(m, "Chunk:\n");
> -				chunk_map_stats(m, chunk, buffer);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
> +		for (slot = 0; slot < pcpu_nr_slots; slot++) {
> +			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
> +					    list) {
> +				if (chunk == pcpu_first_chunk) {
> +					seq_puts(m, "Chunk: <- First Chunk\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				} else {
> +					seq_puts(m, "Chunk:\n");
> +					chunk_map_stats(m, chunk, buffer);
> +				}
>  			}
> -
>  		}
>  	}
>  
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index a2b395acef89..e46f7a6917f9 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	pcpu_free_pages(chunk, pages, page_start, page_end);
>  }
>  
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	struct vm_struct **vms;
>  
> -	chunk = pcpu_alloc_chunk(gfp);
> +	chunk = pcpu_alloc_chunk(type, gfp);
>  	if (!chunk)
>  		return NULL;
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index aa36b78d45a6..85f5755c9114 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -37,9 +37,14 @@
>   * takes care of normal allocations.
>   *
>   * The allocator organizes chunks into lists according to free size and
> - * tries to allocate from the fullest chunk first.  Each chunk is managed
> - * by a bitmap with metadata blocks.  The allocation map is updated on
> - * every allocation and free to reflect the current state while the boundary
> + * memcg-awareness.  To make a percpu allocation memcg-aware the __GFP_ACCOUNT
> + * flag should be passed.  All memcg-aware allocations are sharing one set
> + * of chunks and all unaccounted allocations and allocations performed
> + * by processes belonging to the root memory cgroup are using the second set.
> + *
> + * The allocator tries to allocate from the fullest chunk first. Each chunk
> + * is managed by a bitmap with metadata blocks.  The allocation map is updated
> + * on every allocation and free to reflect the current state while the boundary
>   * map is only updated on allocation.  Each metadata block contains
>   * information to help mitigate the need to iterate over large portions
>   * of the bitmap.  The reverse mapping from page to chunk is stored in
> @@ -81,6 +86,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
> +#include <linux/memcontrol.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/sections.h>
> @@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
>  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
>  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
>  
> -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
>  
>  /* chunks which need their map areas extended, protected by pcpu_lock */
>  static LIST_HEAD(pcpu_map_extend_chunks);
> @@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
>  			      bool move_front)
>  {
>  	if (chunk != pcpu_reserved_chunk) {
> +		struct list_head *pcpu_slot;
> +
> +		pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
>  		if (move_front)
>  			list_move(&chunk->list, &pcpu_slot[slot]);
>  		else
> @@ -1341,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
>  		      alloc_size);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	/* first chunk isn't memcg-aware */
> +	chunk->obj_cgroups = NULL;
> +#endif
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* manage populated page bitmap */
> @@ -1380,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
>  	return chunk;
>  }
>  
> -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
> +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
>  {
>  	struct pcpu_chunk *chunk;
>  	int region_bits;
> @@ -1408,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  	if (!chunk->md_blocks)
>  		goto md_blocks_fail;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (pcpu_is_memcg_chunk(type)) {
> +		chunk->obj_cgroups =
> +			pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
> +					sizeof(struct obj_cgroup *), gfp);
> +		if (!chunk->obj_cgroups)
> +			goto objcg_fail;
> +	}
> +#endif
> +
>  	pcpu_init_md_blocks(chunk);
>  
>  	/* init metadata */
> @@ -1415,6 +1438,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
>  
>  	return chunk;
>  
> +objcg_fail:
> +	pcpu_mem_free(chunk->md_blocks);
>  md_blocks_fail:
>  	pcpu_mem_free(chunk->bound_map);
>  bound_map_fail:
> @@ -1429,6 +1454,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>  	if (!chunk)
>  		return;
> +#ifdef CONFIG_MEMCG_KMEM
> +	pcpu_mem_free(chunk->obj_cgroups);
> +#endif
>  	pcpu_mem_free(chunk->md_blocks);
>  	pcpu_mem_free(chunk->bound_map);
>  	pcpu_mem_free(chunk->alloc_map);
> @@ -1505,7 +1533,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
>  			       int page_start, int page_end, gfp_t gfp);
>  static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  				  int page_start, int page_end);
> -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
> +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
> +					    gfp_t gfp);
>  static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
>  static struct page *pcpu_addr_to_page(void *addr);
>  static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
> @@ -1547,6 +1576,77 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>  	return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct obj_cgroup **objcgp)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) ||
> +	    memcg_kmem_bypass())
> +		return PCPU_CHUNK_ROOT;
> +
> +	objcg = get_obj_cgroup_from_current();
> +	if (!objcg)
> +		return PCPU_CHUNK_ROOT;
> +
> +	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
> +		obj_cgroup_put(objcg);
> +		return PCPU_FAIL_ALLOC;
> +	}
> +
> +	*objcgp = objcg;
> +	return PCPU_CHUNK_MEMCG;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +	if (!objcg)
> +		return;
> +
> +	if (chunk) {
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> +	} else {
> +		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +		obj_cgroup_put(objcg);
> +	}
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk)))
> +		return;
> +
> +	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> +	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> +
> +	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
> +
> +	obj_cgroup_put(objcg);
> +}
> +
> +#else /* CONFIG_MEMCG_KMEM */
> +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> +						     struct mem_cgroup **memcgp)
> +{
> +	return PCPU_CHUNK_ROOT;
> +}
> +
> +static void pcpu_memcg_post_alloc_hook(struct mem_cgroup *memcg,
> +				       struct pcpu_chunk *chunk, int off,
> +				       size_t size)
> +{
> +}
> +
> +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
> +{
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  /**
>   * pcpu_alloc - the percpu allocator
>   * @size: size of area to allocate in bytes
> @@ -1568,6 +1668,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	gfp_t pcpu_gfp;
>  	bool is_atomic;
>  	bool do_warn;
> +	enum pcpu_chunk_type type;
> +	struct list_head *pcpu_slot;
> +	struct obj_cgroup *objcg = NULL;
>  	static int warn_limit = 10;
>  	struct pcpu_chunk *chunk, *next;
>  	const char *err;
> @@ -1602,16 +1705,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		return NULL;
>  	}
>  
> +	type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg);
> +	if (unlikely(type == PCPU_FAIL_ALLOC))
> +		return NULL;
> +	pcpu_slot = pcpu_chunk_list(type);
> +
>  	if (!is_atomic) {
>  		/*
>  		 * pcpu_balance_workfn() allocates memory under this mutex,
>  		 * and it may wait for memory reclaim. Allow current task
>  		 * to become OOM victim, in case of memory pressure.
>  		 */
> -		if (gfp & __GFP_NOFAIL)
> +		if (gfp & __GFP_NOFAIL) {
>  			mutex_lock(&pcpu_alloc_mutex);
> -		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> +		} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
> +			pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
>  			return NULL;
> +		}
>  	}
>  
>  	spin_lock_irqsave(&pcpu_lock, flags);
> @@ -1637,7 +1747,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  restart:
>  	/* search through normal chunks */
>  	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
> -		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
> +		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot],
> +					 list) {

nit: this line change doesn't do anything. Can you please remove it.

>  			off = pcpu_find_block_fit(chunk, bits, bit_align,
>  						  is_atomic);
>  			if (off < 0) {
> @@ -1666,7 +1777,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	}
>  
>  	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
> -		chunk = pcpu_create_chunk(pcpu_gfp);
> +		chunk = pcpu_create_chunk(type, pcpu_gfp);
>  		if (!chunk) {
>  			err = "failed to allocate new chunk";
>  			goto fail;
> @@ -1723,6 +1834,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
>  			chunk->base_addr, off, ptr);
>  
> +	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
> +
>  	return ptr;
>  
>  fail_unlock:
> @@ -1744,6 +1857,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	} else {
>  		mutex_unlock(&pcpu_alloc_mutex);
>  	}
> +
> +	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
> +
>  	return NULL;
>  }
>  
> @@ -1803,8 +1919,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  }
>  
>  /**
> - * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> - * @work: unused
> + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @type: chunk type
>   *
>   * Reclaim all fully free chunks except for the first one.  This is also
>   * responsible for maintaining the pool of empty populated pages.  However,
> @@ -1813,11 +1929,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>   * allocation causes the failure as it is possible that requests can be
>   * serviced from already backed regions.
>   */
> -static void pcpu_balance_workfn(struct work_struct *work)
> +static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
>  {
>  	/* gfp flags passed to underlying allocators */
>  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	LIST_HEAD(to_free);
> +	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
>  	struct pcpu_chunk *chunk, *next;
>  	int slot, nr_to_pop, ret;
> @@ -1915,7 +2032,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  
>  	if (nr_to_pop) {
>  		/* ran out of chunks to populate, create a new one and retry */
> -		chunk = pcpu_create_chunk(gfp);
> +		chunk = pcpu_create_chunk(type, gfp);
>  		if (chunk) {
>  			spin_lock_irq(&pcpu_lock);
>  			pcpu_chunk_relocate(chunk, -1);
> @@ -1927,6 +2044,20 @@ static void pcpu_balance_workfn(struct work_struct *work)
>  	mutex_unlock(&pcpu_alloc_mutex);
>  }
>  
> +/**
> + * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * @work: unused
> + *
> + * Call __pcpu_balance_workfn() for each chunk type.
> + */
> +static void pcpu_balance_workfn(struct work_struct *work)
> +{
> +	enum pcpu_chunk_type type;
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		__pcpu_balance_workfn(type);
> +}
> +
>  /**
>   * free_percpu - free percpu area
>   * @ptr: pointer to area to free
> @@ -1941,8 +2072,9 @@ void free_percpu(void __percpu *ptr)
>  	void *addr;
>  	struct pcpu_chunk *chunk;
>  	unsigned long flags;
> -	int off;
> +	int size, off;
>  	bool need_balance = false;
> +	struct list_head *pcpu_slot;
>  
>  	if (!ptr)
>  		return;
> @@ -1956,7 +2088,11 @@ void free_percpu(void __percpu *ptr)
>  	chunk = pcpu_chunk_addr_search(addr);
>  	off = addr - chunk->base_addr;
>  
> -	pcpu_free_area(chunk, off);
> +	size = pcpu_free_area(chunk, off);
> +
> +	pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk));
> +
> +	pcpu_memcg_free_hook(chunk, off, size);
>  
>  	/* if there are more than one fully free chunks, wake up grim reaper */
>  	if (chunk->free_bytes == pcpu_unit_size) {
> @@ -2267,6 +2403,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	int map_size;
>  	unsigned long tmp_addr;
>  	size_t alloc_size;
> +	enum pcpu_chunk_type type;
>  
>  #define PCPU_SETUP_BUG_ON(cond)	do {					\
>  	if (unlikely(cond)) {						\
> @@ -2384,13 +2521,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	 * empty chunks.
>  	 */
>  	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> -	pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]),
> -				   SMP_CACHE_BYTES);
> -	if (!pcpu_slot)
> +	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
> +					  sizeof(pcpu_chunk_lists[0]) *
> +					  PCPU_NR_CHUNK_TYPES,
> +					  SMP_CACHE_BYTES);
> +	if (!pcpu_chunk_lists)
>  		panic("%s: Failed to allocate %zu bytes\n", __func__,
> -		      pcpu_nr_slots * sizeof(pcpu_slot[0]));
> -	for (i = 0; i < pcpu_nr_slots; i++)
> -		INIT_LIST_HEAD(&pcpu_slot[i]);
> +		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
> +		      PCPU_NR_CHUNK_TYPES);
> +
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		for (i = 0; i < pcpu_nr_slots; i++)
> +			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
>  
>  	/*
>  	 * The end of the static region needs to be aligned with the
> -- 
> 2.25.4
> 

There were just 2 minor nits. Do you mind resending with them fixed, as
I'm not sure whether I'll be carrying these patches or not.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis
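
For readers following along: the opt-in is entirely on the caller side
via __GFP_ACCOUNT. A minimal sketch of a kernel user whose percpu data
would be charged to the current memory cgroup under this series (struct
foo_stats and the foo_* functions are made up for illustration; the
allocator interfaces are the existing ones):

	#include <linux/percpu.h>
	#include <linux/gfp.h>
	#include <linux/types.h>
	#include <linux/errno.h>

	struct foo_stats {
		u64 events;
	};

	static struct foo_stats __percpu *foo;

	static int foo_init(void)
	{
		/*
		 * GFP_KERNEL_ACCOUNT carries __GFP_ACCOUNT, so the allocation
		 * is served from a memcg-aware chunk and size *
		 * num_possible_cpus() bytes are charged to the current cgroup.
		 */
		foo = alloc_percpu_gfp(struct foo_stats, GFP_KERNEL_ACCOUNT);
		if (!foo)
			return -ENOMEM;
		return 0;
	}

	static void foo_exit(void)
	{
		/* Uncharges the same amount and drops the obj_cgroup reference. */
		free_percpu(foo);
	}

Plain alloc_percpu() callers keep using the root chunks and see no
change.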


* Re: [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
       [not found] ` <20200528232508.1132382-4-guro@fb.com>
@ 2020-06-05 19:53   ` Dennis Zhou
  0 siblings, 0 replies; 7+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:06PM -0700, Roman Gushchin wrote:
> Percpu memory can represent a noticeable chunk of the total
> memory consumption, especially on big machines with many CPUs.
> Let's track percpu memory usage for each memcg and display
> it in memory.stat.
> 
> A percpu allocation is usually scattered over multiple pages
> (and nodes), and can be significantly smaller than a page.
> So let's add a byte-sized counter on the memcg level:
> MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs
> can be perfectly reused for the percpu case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  include/linux/memcontrol.h              |  8 ++++++++
>  mm/memcontrol.c                         |  4 +++-
>  mm/percpu.c                             | 10 ++++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index fed4e1d2a343..aa8cb6dadadc 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1276,6 +1276,10 @@ PAGE_SIZE multiple when read back.
>  		Amount of memory used for storing in-kernel data
>  		structures.
>  
> +	  percpu
> +		Amount of memory used for storing per-cpu kernel
> +		data structures.
> +
>  	  sock
>  		Amount of memory used in network transmission buffers
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7a84d9164449..f62a95d472f7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -32,11 +32,19 @@ struct kmem_cache;
>  enum memcg_stat_item {
>  	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
>  	MEMCG_SOCK,
> +	MEMCG_PERCPU_B,
>  	/* XXX: why are these zone and not node counters? */
>  	MEMCG_KERNEL_STACK_KB,
>  	MEMCG_NR_STAT,
>  };
>  
> +static __always_inline bool memcg_stat_item_in_bytes(enum memcg_stat_item item)
> +{
> +	if (item == MEMCG_PERCPU_B)
> +		return true;
> +	return vmstat_item_in_bytes(item);
> +}
> +
>  enum memcg_memory_event {
>  	MEMCG_LOW,
>  	MEMCG_HIGH,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7bc3fd196210..5007d1585a4a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -783,7 +783,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  	if (mem_cgroup_disabled())
>  		return;
>  
> -	if (vmstat_item_in_bytes(idx))
> +	if (memcg_stat_item_in_bytes(idx))
>  		threshold <<= PAGE_SHIFT;
>  
>  	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> @@ -1490,6 +1490,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "slab %llu\n",
>  		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
>  			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
> +	seq_buf_printf(&s, "percpu %llu\n",
> +		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
>  	seq_buf_printf(&s, "sock %llu\n",
>  		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
>  		       PAGE_SIZE);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 85f5755c9114..b4b3e9c8a6d1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1608,6 +1608,11 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
>  
>  	if (chunk) {
>  		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
> +
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
> +				size * num_possible_cpus());
> +		rcu_read_unlock();
>  	} else {
>  		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
>  		obj_cgroup_put(objcg);
> @@ -1626,6 +1631,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
>  
>  	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
>  
> +	rcu_read_lock();
> +	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +	rcu_read_unlock();
> +
>  	obj_cgroup_put(objcg);
>  }
>  
> -- 
> 2.25.4
> 

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis
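
To put rough numbers on why the counter is byte-sized (illustrative
figures, not from the patch): a single per-CPU copy of, say, a 48-byte
object occupies only 48 bytes inside a shared chunk page, so rounding
the stat to whole pages would be wildly inaccurate. Tracked in bytes,
the same object accounts exactly size * num_possible_cpus(), e.g.
48 * 128 = 6144 bytes of MEMCG_PERCPU_B on a 128-CPU machine.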


* Re: [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
       [not found] ` <20200528232508.1132382-5-guro@fb.com>
@ 2020-06-05 19:54   ` Dennis Zhou
  0 siblings, 0 replies; 7+ messages in thread
From: Dennis Zhou @ 2020-06-05 19:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:07PM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store
> vmstat data. Yet this memory is not accounted at all, so in the
> case when there are many (dying) cgroups, it's not exactly clear
> where all the memory is.
> 
> Because the size of memory cgroup internal structures can
> dramatically exceed the size of the object or page which is pinning
> it in memory, it's not a good idea to simply ignore it.
> It actually breaks the isolation between cgroups.
> 
> Let's account the consumed percpu memory to the parent cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5007d1585a4a..0dd0d05a011c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5020,13 +5020,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
>  
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
>  
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +					       GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5100,11 +5102,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>  
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
>  
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
>  
> @@ -5153,7 +5157,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
>  
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();
>  	if (IS_ERR(memcg))
>  		return ERR_CAST(memcg);
>  
> -- 
> 2.25.4
> 

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis
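
The pattern the last hunk relies on, in condensed form (the helper name
below is made up for illustration; memalloc_use_memcg() and
memalloc_unuse_memcg() are the interfaces the patch actually uses):

	#include <linux/sched/mm.h>
	#include <linux/slab.h>
	#include <linux/gfp.h>

	/*
	 * Charge any __GFP_ACCOUNT allocation made in the window to @target
	 * instead of the current task's memory cgroup.
	 */
	static void *alloc_on_behalf_of(struct mem_cgroup *target, size_t size)
	{
		void *p;

		memalloc_use_memcg(target);
		p = kmalloc(size, GFP_KERNEL_ACCOUNT);
		memalloc_unuse_memcg();

		return p;
	}

In mem_cgroup_css_alloc() the same bracketing is applied around
mem_cgroup_alloc(), so the vmstat percpu data of a new child is charged
to its parent rather than to the not-yet-created child.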


* Re: [PATCH v1 5/5] kselftests: cgroup: add percpu memory accounting test
       [not found] ` <20200528232508.1132382-6-guro@fb.com>
@ 2020-06-05 20:07   ` Dennis Zhou
  2020-06-05 22:47     ` Roman Gushchin
  0 siblings, 1 reply; 7+ messages in thread
From: Dennis Zhou @ 2020-06-05 20:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, May 28, 2020 at 04:25:08PM -0700, Roman Gushchin wrote:
> Add a simple test to check the percpu memory accounting.
> The test creates a cgroup tree with 1000 child cgroups
> and checks values of memory.current and memory.stat::percpu.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> index 5224dae216e5..a0d4f1a3137d 100644
> --- a/tools/testing/selftests/cgroup/test_kmem.c
> +++ b/tools/testing/selftests/cgroup/test_kmem.c
> @@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
>  	return ret;
>  }
>  
> +/*
> + * This test creates a sub-tree with 1000 memory cgroups.
> + * Then it checks that the memory.current on the parent level
> > + * is greater than 0 and approximately matches the percpu value
> + * from memory.stat.
> + */
> +static int test_percpu_basic(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *parent, *child;
> +	long current, percpu;
> +	int i;
> +
> +	parent = cg_name(root, "percpu_basic_test");
> +	if (!parent)
> +		goto cleanup;
> +
> +	if (cg_create(parent))
> +		goto cleanup;
> +
> +	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
> +		goto cleanup;
> +
> +	for (i = 0; i < 1000; i++) {
> +		child = cg_name_indexed(parent, "child", i);
> +		if (!child)
> +			return -1;
> +
> +		if (cg_create(child))
> +			goto cleanup_children;
> +
> +		free(child);
> +	}
> +
> +	current = cg_read_long(parent, "memory.current");
> +	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
> +
> +	if (current > 0 && percpu > 0 && abs(current - percpu) <
> +	    4096 * 32 * get_nprocs())

So this is checking that we've allocated less than 32 pages per cpu of
non-percpu memory across the 1000 child cgroups? Is there a more
definitive measurement, or at least a comment we can leave saying why
this limit was chosen?

> +		ret = KSFT_PASS;
> +	else
> +		printf("memory.current %ld\npercpu %ld\n",
> +		       current, percpu);
> +
> +cleanup_children:
> +	for (i = 0; i < 1000; i++) {
> +		child = cg_name_indexed(parent, "child", i);
> +		cg_destroy(child);
> +		free(child);
> +	}
> +
> +cleanup:
> +	cg_destroy(parent);
> +	free(parent);
> +
> +	return ret;
> +}
> +
>  #define T(x) { x, #x }
>  struct kmem_test {
>  	int (*fn)(const char *root);
> @@ -341,6 +399,7 @@ struct kmem_test {
>  	T(test_kmem_proc_kpagecgroup),
>  	T(test_kmem_kernel_stacks),
>  	T(test_kmem_dead_cgroups),
> +	T(test_percpu_basic),
>  };
>  #undef T
>  
> -- 
> 2.25.4
> 
> 


* Re: [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
  2020-06-05 19:49   ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Dennis Zhou
@ 2020-06-05 22:44     ` Roman Gushchin
  0 siblings, 0 replies; 7+ messages in thread
From: Roman Gushchin @ 2020-06-05 22:44 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Fri, Jun 05, 2020 at 07:49:53PM +0000, Dennis Zhou wrote:
> On Thu, May 28, 2020 at 04:25:05PM -0700, Roman Gushchin wrote:
> 
> [...]
> 
> nit: do you mind reflowing this to 80 characters and doing 2 spaces
> after each period to keep the formatting uniform.
> 
> [...]
> 
> nit: this line change doesn't do anything. Can you please remove it.
> 
> [...]
> 
> There were just 2 minor nits. Do you mind resending with them fixed, as
> I'm not sure whether I'll be carrying these patches or not.

Sure, will send v2 based on the slab controller v6 early next week.

> 
> Acked-by: Dennis Zhou <dennis@kernel.org>

Thank you!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test
  2020-06-05 20:07   ` [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test Dennis Zhou
@ 2020-06-05 22:47     ` Roman Gushchin
  0 siblings, 0 replies; 7+ messages in thread
From: Roman Gushchin @ 2020-06-05 22:47 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Andrew Morton, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Fri, Jun 05, 2020 at 08:07:51PM +0000, Dennis Zhou wrote:
> On Thu, May 28, 2020 at 04:25:08PM -0700, Roman Gushchin wrote:
> > Add a simple test to check the percpu memory accounting.
> > The test creates a cgroup tree with 1000 child cgroups
> > and checks values of memory.current and memory.stat::percpu.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  tools/testing/selftests/cgroup/test_kmem.c | 59 ++++++++++++++++++++++
> >  1 file changed, 59 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> > index 5224dae216e5..a0d4f1a3137d 100644
> > --- a/tools/testing/selftests/cgroup/test_kmem.c
> > +++ b/tools/testing/selftests/cgroup/test_kmem.c
> > @@ -331,6 +331,64 @@ static int test_kmem_dead_cgroups(const char *root)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * This test creates a sub-tree with 1000 memory cgroups.
> > + * Then it checks that the memory.current on the parent level
> > + * is greater than 0 and approximately matches the percpu value
> > + * from memory.stat.
> > + */
> > +static int test_percpu_basic(const char *root)
> > +{
> > +	int ret = KSFT_FAIL;
> > +	char *parent, *child;
> > +	long current, percpu;
> > +	int i;
> > +
> > +	parent = cg_name(root, "percpu_basic_test");
> > +	if (!parent)
> > +		goto cleanup;
> > +
> > +	if (cg_create(parent))
> > +		goto cleanup;
> > +
> > +	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
> > +		goto cleanup;
> > +
> > +	for (i = 0; i < 1000; i++) {
> > +		child = cg_name_indexed(parent, "child", i);
> > +		if (!child)
> > +			return -1;
> > +
> > +		if (cg_create(child))
> > +			goto cleanup_children;
> > +
> > +		free(child);
> > +	}
> > +
> > +	current = cg_read_long(parent, "memory.current");
> > +	percpu = cg_read_key_long(parent, "memory.stat", "percpu ");
> > +
> > +	if (current > 0 && percpu > 0 && abs(current - percpu) <
> > +	    4096 * 32 * get_nprocs())
> 
> So this is checking that, over the 1000 child cgroups, we've allocated
> less than 32 pages per cpu that aren't percpu memory? Is there a more
> definitive measurement, or at least a comment we can leave explaining
> why this limit was chosen?

It simply means that "current" should be approximately equal to the "percpu"
statistic. Both the charging and the vmstat paths use percpu batching, and the
batch size is 32 pages.

I'll add a comment to make it more obvious.
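
Something like this, just a rough sketch of the same check with the comment
added (the exact wording may change in v2):

	/*
	 * Both the charging and the vmstat paths use percpu batching with a
	 * batch size of 32 pages, so memory.current and the "percpu" counter
	 * in memory.stat can each be off by up to 32 pages per cpu.  Allow
	 * for that much slack when comparing the two values.
	 */
	if (current > 0 && percpu > 0 &&
	    abs(current - percpu) < 4096 * 32 * get_nprocs())
		ret = KSFT_PASS;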

Thanks!

> 
> > +		ret = KSFT_PASS;
> > +	else
> > +		printf("memory.current %ld\npercpu %ld\n",
> > +		       current, percpu);
> > +
> > +cleanup_children:
> > +	for (i = 0; i < 1000; i++) {
> > +		child = cg_name_indexed(parent, "child", i);
> > +		cg_destroy(child);
> > +		free(child);
> > +	}
> > +
> > +cleanup:
> > +	cg_destroy(parent);
> > +	free(parent);
> > +
> > +	return ret;
> > +}
> > +
> >  #define T(x) { x, #x }
> >  struct kmem_test {
> >  	int (*fn)(const char *root);
> > @@ -341,6 +399,7 @@ struct kmem_test {
> >  	T(test_kmem_proc_kpagecgroup),
> >  	T(test_kmem_kernel_stacks),
> >  	T(test_kmem_dead_cgroups),
> > +	T(test_percpu_basic),
> >  };
> >  #undef T
> >  
> > -- 
> > 2.25.4
> > 
> > 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-06-05 22:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200528232508.1132382-1-guro@fb.com>
     [not found] ` <20200528232508.1132382-2-guro@fb.com>
2020-06-05 19:44   ` [PATCH v1 1/5] percpu: return number of released bytes from pcpu_free_area() Dennis Zhou
     [not found] ` <20200528232508.1132382-3-guro@fb.com>
2020-06-05 19:49   ` [PATCH v1 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Dennis Zhou
2020-06-05 22:44     ` Roman Gushchin
     [not found] ` <20200528232508.1132382-4-guro@fb.com>
2020-06-05 19:53   ` [PATCH v1 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Dennis Zhou
     [not found] ` <20200528232508.1132382-5-guro@fb.com>
2020-06-05 19:54   ` [PATCH v1 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Dennis Zhou
     [not found] ` <20200528232508.1132382-6-guro@fb.com>
2020-06-05 20:07   ` [PATCH v1 5/5] kselftests: cgroup: add perpcu memory accounting test Dennis Zhou
2020-06-05 22:47     ` Roman Gushchin
