linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator
@ 2021-06-01  6:51 Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 1/3] percpu: CPU hotplug support for alloc_percpu() Bharata B Rao
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Bharata B Rao @ 2021-06-01  6:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, aneesh.kumar, dennis, tj, cl, akpm, amakhalov, guro,
	vbabka, srikar, psampat, ego, Bharata B Rao

Hi,

This is an attempt to make the percpu allocator CPU hotplug aware.
Currently the percpu allocator allocates memory for all the possible
CPUs. This can lead to wastage of memory when the number of possible
CPUs is significantly higher than the number of online CPUs. This can
be avoided if the percpu allocator were to allocate only for the
online CPUs and extend the allocation for other CPUs as and when they
become online.

This early RFC work shows some good memory savings for a powerpc
KVM guest that is booted with 16 online and 1024 possible CPUs.
Here is the comparison of Percpu memory consumption from
/proc/meminfo before and after creating 1000 memcgs.

			W/o patch		W/ patch
Before			1441792 kB		22528 kB
After 1000 memcgs	4390912 kB		68608 kB

Note that the Percpu reporting in meminfo has been changed in
the patchset to reflect the allocation for online CPUs only.

More details about the approach are present in the patch
descriptions.

Bharata B Rao (3):
  percpu: CPU hotplug support for alloc_percpu()
  percpu: Limit percpu allocator to online cpus
  percpu: Avoid using percpu ptrs of non-existing cpus

 fs/namespace.c             |   4 +-
 include/linux/cpuhotplug.h |   2 +
 include/linux/percpu.h     |  15 +++
 kernel/cgroup/rstat.c      |  20 +++-
 kernel/sched/cpuacct.c     |  10 +-
 kernel/sched/psi.c         |  14 ++-
 lib/percpu-refcount.c      |   4 +-
 lib/percpu_counter.c       |   2 +-
 mm/percpu-internal.h       |   9 ++
 mm/percpu-vm.c             | 211 +++++++++++++++++++++++++++++++++-
 mm/percpu.c                | 229 +++++++++++++++++++++++++++++++++++--
 net/ipv4/fib_semantics.c   |   2 +-
 net/ipv6/route.c           |   6 +-
 13 files changed, 490 insertions(+), 38 deletions(-)

-- 
2.31.1



* [RFC PATCH v0 1/3] percpu: CPU hotplug support for alloc_percpu()
  2021-06-01  6:51 [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Bharata B Rao
@ 2021-06-01  6:51 ` Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 2/3] percpu: Limit percpu allocator to online cpus Bharata B Rao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Bharata B Rao @ 2021-06-01  6:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, aneesh.kumar, dennis, tj, cl, akpm, amakhalov, guro,
	vbabka, srikar, psampat, ego, Bharata B Rao

The percpu allocator allocates memory for all the possible CPUs.
This can lead to wastage of memory when the number of possible CPUs
is significantly higher than the number of online CPUs. This can
be avoided if the percpu allocator were to allocate only for the
online CPUs and extend the allocation for other CPUs as and when
they become online. Essentially, the population of the chunk, which
involves allocating the pages for the chunk unit corresponding to
the CPU and mapping them into the vmalloc range, can be deferred
until CPU hotplug time.

To achieve this, add CPU hotplug callback support to the percpu
allocator and let it set up the percpu allocation for the newly
onlined CPU at hotplug time. The vmalloc range is allocated for
all the possible CPUs upfront, but at hotplug time only the
populated pages of the chunk are set up (allocated and mapped)
for the unit that corresponds to the new CPU. The same is undone
(unit pages unmapped and freed) at unplug time.

This by itself isn't sufficient because some callers of alloc_percpu()
expect the percpu variables/pointers for all the possible CPUs to
have been initialized at allocation time itself. Hence allow them to
register a callback via new alloc_percpu() variants; the callback is
invoked at hotplug time to do any necessary initialization of the
percpu variables.
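
As an illustration only (the my_counter type and my_counter_init()
callback below are hypothetical and not part of this patch), a caller
that needs per-CPU initialization would use the new variant roughly
like this; CPUs that are already online are still initialized by the
caller at allocation time, as in the rstat/psi conversions later in
this series:

	struct my_counter {
		u64 events;
	};

	/* Invoked for each CPU that comes online after the allocation */
	static int my_counter_init(void __percpu *ptr, unsigned int cpu,
				   void *data)
	{
		struct my_counter *c = per_cpu_ptr(ptr, cpu);

		c->events = 0;
		return 0;
	}

	static int my_driver_init(void)
	{
		struct my_counter __percpu *ctr;

		ctr = alloc_percpu_cb(struct my_counter, my_counter_init, NULL);
		if (!ctr)
			return -ENOMEM;
		return 0;
	}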

This is very much an experimental patch with major unsolved
and unaddressed aspects listed below:

- Memcg charging has been changed to account for online CPUs,
  however growing and shrinking the charge for the hotplugged
  CPU hasn't been done yet.
- The CPU hotplug support has been added only to the vmalloc based
  percpu allocator.
- Not all the callers of alloc_percpu() that need initialization
  callbacks have been converted to the new variants. I have changed
  only those callers that I ran into when booting a minimal powerpc
  KVM guest in my environment.
- All the callers of alloc_percpu() are yet to be audited to verify
  whether the approach taken here fits their use of percpu memory
  and whether they are in a position to handle the required
  initialization at CPU hotplug time.
- The patches may break git blame; the intention right now is
  just to make it easy to illustrate the approach taken.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 include/linux/cpuhotplug.h |   2 +
 include/linux/percpu.h     |  15 +++
 mm/percpu-internal.h       |   9 ++
 mm/percpu-vm.c             | 199 +++++++++++++++++++++++++++++++++++
 mm/percpu.c                | 209 ++++++++++++++++++++++++++++++++++++-
 5 files changed, 430 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..ae20a14967eb 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -36,6 +36,8 @@ enum cpuhp_state {
 	CPUHP_X86_MCE_DEAD,
 	CPUHP_VIRT_NET_DEAD,
 	CPUHP_SLUB_DEAD,
+	CPUHP_PERCPU_SETUP,
+	CPUHP_PERCPU_ALLOC,
 	CPUHP_DEBUG_OBJ_DEAD,
 	CPUHP_MM_WRITEBACK_DEAD,
 	CPUHP_MM_VMSTAT_DEAD,
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 5e76af742c80..145fdb9318d1 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -100,6 +100,7 @@ typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size,
 typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
 typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
 typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
+typedef int (*pcpu_cpuhp_fn_t)(void __percpu *ptr, unsigned int cpu, void *data);
 
 extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 							     int nr_units);
@@ -133,6 +134,11 @@ extern void __init setup_per_cpu_areas(void);
 
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
 extern void __percpu *__alloc_percpu(size_t size, size_t align);
+extern void __percpu *__alloc_percpu_gfp_cb(size_t size, size_t align,
+					    gfp_t gfp, pcpu_cpuhp_fn_t fn,
+					    void *data);
+extern void __percpu *__alloc_percpu_cb(size_t size, size_t align,
+					pcpu_cpuhp_fn_t fn, void *data);
 extern void free_percpu(void __percpu *__pdata);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
@@ -143,6 +149,15 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),		\
 						__alignof__(type))
 
+#define alloc_percpu_gfp_cb(type, gfp, fn, data)			\
+	(typeof(type) __percpu *)__alloc_percpu_gfp_cb(sizeof(type),	\
+						__alignof__(type), gfp,	\
+						fn, data)
+#define alloc_percpu_cb(type, fn, data)					\
+	(typeof(type) __percpu *)__alloc_percpu_cb(sizeof(type),	\
+						__alignof__(type),	\
+						fn, data)
+
 extern unsigned long pcpu_nr_pages(void);
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index ae26b118e246..8064e7c43b9f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -57,6 +57,8 @@ struct pcpu_chunk {
 #endif
 
 	struct list_head	list;		/* linked to pcpu_slot lists */
+	struct list_head	cpuhp;		/* list of registered cpu hotplug
+						   notifiers */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
 	void			*base_addr;	/* base address of this chunk */
@@ -282,4 +284,11 @@ static inline void pcpu_stats_chunk_dealloc(void)
 
 #endif /* !CONFIG_PERCPU_STATS */
 
+struct percpu_cpuhp_notifier {
+	void __percpu *ptr;
+	void *data;
+	pcpu_cpuhp_fn_t cb;
+	struct list_head list;
+};
+
 #endif
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 8d3844bc0c7c..3250e1c9aeaf 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -41,6 +41,67 @@ static struct page **pcpu_get_pages(void)
 	return pages;
 }
 
+/**
+ * pcpu_alloc_pages_cpu - allocates pages for @chunk for a given cpu
+ * @cpu: target cpu
+ * @chunk: target chunk
+ * @pages: array to put the allocated pages into, indexed by pcpu_page_idx()
+ * @page_start: page index of the first page to be allocated
+ * @page_end: page index of the last page to be allocated + 1
+ * @gfp: allocation flags passed to the underlying allocator
+ *
+ * Allocate pages [@page_start,@page_end) into @pages for the given cpu.
+ * The allocation is for @chunk.  Percpu core doesn't care about the
+ * content of @pages and will pass it verbatim to pcpu_map_pages().
+ */
+static int pcpu_alloc_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				struct page **pages, int page_start, int page_end,
+				gfp_t gfp)
+{
+	int i;
+
+	gfp |= __GFP_HIGHMEM;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
+
+		*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+		if (!*pagep)
+			goto err;
+	}
+	return 0;
+
+err:
+	while (--i >= page_start)
+		__free_page(pages[pcpu_page_idx(cpu, i)]);
+
+	return -ENOMEM;
+}
+
+/**
+ * pcpu_free_pages_cpu - free pages which were allocated for @chunk for @cpu
+ * @cpu: cpu for which the pages were allocated
+ * @chunk: chunk pages were allocated for
+ * @pages: array of pages to be freed, indexed by pcpu_page_idx()
+ * @page_start: page index of the first page to be freed
+ * @page_end: page index of the last page to be freed + 1
+ *
+ * Free pages [@page_start, @page_end) in @pages for @cpu.
+ * The pages were allocated for @chunk.
+ */
+static void pcpu_free_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				struct page **pages, int page_start, int page_end)
+{
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page *page = pages[pcpu_page_idx(cpu, i)];
+
+		if (page)
+			__free_page(page);
+	}
+}
+
 /**
  * pcpu_free_pages - free pages which were allocated for @chunk
  * @chunk: chunk pages were allocated for
@@ -137,6 +198,37 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
 	vunmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT));
 }
 
+/**
+ * pcpu_unmap_pages_cpu - unmap pages out of a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @pages: pages array which can be used to pass information to free
+ * @page_start: page index of the first page to unmap
+ * @page_end: page index of the last page to unmap + 1
+ *
+ * For the given cpu, unmap pages [@page_start,@page_end) out of @chunk.
+ * Corresponding elements in @pages were cleared by the caller and can
+ * be used to carry information to pcpu_free_pages() which will be
+ * called after all unmaps are finished.  The caller should call
+ * proper pre/post flush functions.
+ */
+static void pcpu_unmap_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				 struct page **pages, int page_start,
+				 int page_end)
+{
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		struct page *page;
+
+		page = pcpu_chunk_page(chunk, cpu, i);
+		WARN_ON(!page);
+		pages[pcpu_page_idx(cpu, i)] = page;
+	}
+	__pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+				   page_end - page_start);
+}
+
 /**
  * pcpu_unmap_pages - unmap pages out of a pcpu_chunk
  * @chunk: chunk of interest
@@ -197,6 +289,41 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
 					PAGE_KERNEL, pages, PAGE_SHIFT);
 }
 
+/**
+ * pcpu_map_pages_cpu - map pages into a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @pages: pages array containing pages to be mapped
+ * @page_start: page index of the first page to map
+ * @page_end: page index of the last page to map + 1
+ *
+ * For the given cpu, map pages [@page_start,@page_end) into @chunk.  The
+ * caller is responsible for calling pcpu_post_map_flush() after all
+ * mappings are complete.
+ *
+ * This function is responsible for setting up whatever is necessary for
+ * reverse lookup (addr -> chunk).
+ */
+static int pcpu_map_pages_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+			      struct page **pages, int page_start, int page_end)
+{
+	int i, err;
+
+	err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+			       &pages[pcpu_page_idx(cpu, page_start)],
+			       page_end - page_start);
+	if (err < 0)
+		goto err;
+
+	for (i = page_start; i < page_end; i++)
+		pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
+				    chunk);
+	return 0;
+err:
+	pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
+	return err;
+}
+
 /**
  * pcpu_map_pages - map pages into a pcpu_chunk
  * @chunk: chunk of interest
@@ -260,6 +387,40 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
 		pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
 }
 
+/**
+ * pcpu_populate_chunk_cpu - populate and map an area of a pcpu_chunk for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk of interest
+ * @page_start: the start page
+ * @page_end: the end page
+ * @gfp: allocation flags passed to the underlying memory allocator
+ *
+ * For the given cpu, populate and map pages [@page_start,@page_end) into
+ * @chunk.
+ *
+ * CONTEXT:
+ * pcpu_alloc_mutex, does GFP_KERNEL allocation.
+ */
+static int pcpu_populate_chunk_cpu(unsigned int cpu, struct pcpu_chunk *chunk,
+				   int page_start, int page_end, gfp_t gfp)
+{
+	struct page **pages;
+
+	pages = pcpu_get_pages();
+	if (!pages)
+		return -ENOMEM;
+
+	if (pcpu_alloc_pages_cpu(cpu, chunk, pages, page_start, page_end, gfp))
+		return -ENOMEM;
+
+	if (pcpu_map_pages_cpu(cpu, chunk, pages, page_start, page_end)) {
+		pcpu_free_pages_cpu(cpu, chunk, pages, page_start, page_end);
+		return -ENOMEM;
+	}
+	pcpu_post_map_flush(chunk, page_start, page_end);
+
+	return 0;
+}
 /**
  * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
  * @chunk: chunk of interest
@@ -294,6 +455,44 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 	return 0;
 }
 
+/**
+ * pcpu_depopulate_chunk_cpu - depopulate and unmap an area of a pcpu_chunk
+ * for a cpu
+ * @cpu: cpu of interest
+ * @chunk: chunk to depopulate
+ * @page_start: the start page
+ * @page_end: the end page
+ *
+ * For the given cpu, depopulate and unmap pages [@page_start,@page_end)
+ * from @chunk.
+ *
+ * CONTEXT:
+ * pcpu_alloc_mutex.
+ */
+static void pcpu_depopulate_chunk_cpu(unsigned int cpu,
+				      struct pcpu_chunk *chunk,
+				      int page_start, int page_end)
+{
+	struct page **pages;
+
+	/*
+	 * If control reaches here, there must have been at least one
+	 * successful population attempt so the temp pages array must
+	 * be available now.
+	 */
+	pages = pcpu_get_pages();
+	BUG_ON(!pages);
+
+	/* unmap and free */
+	pcpu_pre_unmap_flush(chunk, page_start, page_end);
+
+	pcpu_unmap_pages_cpu(cpu, chunk, pages, page_start, page_end);
+
+	/* no need to flush tlb, vmalloc will handle it lazily */
+
+	pcpu_free_pages_cpu(cpu, chunk, pages, page_start, page_end);
+}
+
 /**
  * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
  * @chunk: chunk to depopulate
diff --git a/mm/percpu.c b/mm/percpu.c
index f99e9306b939..ca8ca541bede 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1324,6 +1324,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 		      alloc_size);
 
 	INIT_LIST_HEAD(&chunk->list);
+	INIT_LIST_HEAD(&chunk->cpuhp);
 
 	chunk->base_addr = (void *)aligned_addr;
 	chunk->start_offset = start_offset;
@@ -1404,6 +1405,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp)
 		return NULL;
 
 	INIT_LIST_HEAD(&chunk->list);
+	INIT_LIST_HEAD(&chunk->cpuhp);
 	chunk->nr_pages = pcpu_unit_pages;
 	region_bits = pcpu_chunk_map_bits(chunk);
 
@@ -1659,6 +1661,161 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void pcpu_cpuhp_register(struct pcpu_chunk *chunk,
+				struct percpu_cpuhp_notifier *n)
+{
+	list_add(&n->list, &chunk->cpuhp);
+}
+
+static void pcpu_cpuhp_deregister(struct pcpu_chunk *chunk,
+				  void __percpu *ptr)
+{
+	struct percpu_cpuhp_notifier *n, *next;
+
+	list_for_each_entry_safe(n, next, &chunk->cpuhp, list)
+		if (n->ptr == ptr) {
+			list_del(&n->list);
+			kfree(n);
+			return;
+		}
+}
+
+static void __pcpu_cpuhp_setup(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			unsigned int rs, re;
+
+			if (chunk == pcpu_first_chunk)
+				continue;
+
+			bitmap_for_each_set_region(chunk->populated, rs, re, 0,
+						   chunk->nr_pages)
+				pcpu_populate_chunk_cpu(cpu, chunk, rs, re,
+							GFP_KERNEL);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_setup - cpu hotplug callback for the percpu allocator
+ * @cpu: cpu that is being hotplugged
+ *
+ * Allocates and maps the pages that correspond to @cpu's unit
+ * in all chunks.
+ */
+static int percpu_cpuhp_setup(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_setup(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static void __pcpu_cpuhp_destroy(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			unsigned int rs, re;
+
+			if (chunk == pcpu_first_chunk)
+				continue;
+
+			bitmap_for_each_set_region(chunk->populated, rs, re, 0,
+						   chunk->nr_pages)
+				pcpu_depopulate_chunk_cpu(cpu, chunk, rs, re);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_destroy - cpu unplug callback for the percpu allocator
+ * @cpu: cpu that is being hotplugged
+ *
+ * Unmaps and frees the pages that correspond to @cpu's unit
+ * in all chunks.
+ */
+static int percpu_cpuhp_destroy(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_destroy(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static void __pcpu_cpuhp_alloc(enum pcpu_chunk_type type, unsigned int cpu)
+{
+	int slot;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	struct percpu_cpuhp_notifier *n;
+
+	for (slot = 0; slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			list_for_each_entry(n, &chunk->cpuhp, list)
+				n->cb(n->ptr, cpu, n->data);
+		}
+	}
+}
+
+/**
+ * percpu_cpuhp_alloc - cpu hotplug callback for alloc_percpu_cb() callers
+ * @cpu: cpu that is being hotplugged
+ *
+ * Runs any initialization routines registered by callers of alloc_percpu_cb().
+ */
+static int percpu_cpuhp_alloc(unsigned int cpu)
+{
+	enum pcpu_chunk_type type;
+
+	mutex_lock(&pcpu_alloc_mutex);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		__pcpu_cpuhp_alloc(type, cpu);
+	mutex_unlock(&pcpu_alloc_mutex);
+
+	return 0;
+}
+
+static int percpu_cpuhp_free(unsigned int cpu)
+{
+	return 0;
+}
+
+/**
+ * percpu_hotplug_setup - register cpu hotplug callbacks for the percpu
+ * allocator and its callers
+ */
+static int percpu_hotplug_setup(void)
+{
+	/* Callback for percpu allocator */
+	if (cpuhp_setup_state(CPUHP_PERCPU_SETUP, "percpu:setup",
+	    percpu_cpuhp_setup, percpu_cpuhp_destroy))
+		return -EINVAL;
+
+	/* Callback for the callers of alloc_percpu() */
+	if (cpuhp_setup_state(CPUHP_PERCPU_ALLOC, "percpu:alloc",
+	    percpu_cpuhp_alloc, percpu_cpuhp_free))
+		return -EINVAL;
+
+	return 0;
+}
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1675,7 +1832,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
  * Percpu pointer to the allocated area on success, NULL on failure.
  */
 static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
-				 gfp_t gfp)
+				 gfp_t gfp,  pcpu_cpuhp_fn_t cb, void *data)
 {
 	gfp_t pcpu_gfp;
 	bool is_atomic;
@@ -1690,6 +1847,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	unsigned long flags;
 	void __percpu *ptr;
 	size_t bits, bit_align;
+	struct percpu_cpuhp_notifier *n = NULL;	/* kfree()d in the error path */
 
 	gfp = current_gfp_context(gfp);
 	/* whitelisted flags that can be passed to the backing allocators */
@@ -1697,6 +1855,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
 	do_warn = !(gfp & __GFP_NOWARN);
 
+	if (cb) {
+		n = kmalloc(sizeof(*n), gfp);
+		if (!n)
+			return NULL;
+	}
+
 	/*
 	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE,
 	 * therefore alignment must be a minimum of that many bytes.
@@ -1847,6 +2011,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 	pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
 
+	if (cb) {
+		n->ptr = ptr;
+		n->cb = cb;
+		n->data = data;
+		pcpu_cpuhp_register(chunk, n);
+	}
+
 	return ptr;
 
 fail_unlock:
@@ -1870,6 +2041,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	}
 
 	pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
+	kfree(n);
 
 	return NULL;
 }
@@ -1891,7 +2063,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
  */
 void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
 {
-	return pcpu_alloc(size, align, false, gfp);
+	return pcpu_alloc(size, align, false, gfp, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
 
@@ -1904,7 +2076,7 @@ EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
  */
 void __percpu *__alloc_percpu(size_t size, size_t align)
 {
-	return pcpu_alloc(size, align, false, GFP_KERNEL);
+	return pcpu_alloc(size, align, false, GFP_KERNEL, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(__alloc_percpu);
 
@@ -1926,7 +2098,33 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
  */
 void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 {
-	return pcpu_alloc(size, align, true, GFP_KERNEL);
+	return pcpu_alloc(size, align, true, GFP_KERNEL, NULL, NULL);
+}
+
+/**
+ * alloc_percpu variants that take a callback to handle
+ * any required initialization to the percpu ptr corresponding
+ * to the cpu that is coming online.
+ * @cb: This callback will be called whenever a cpu is hotplugged.
+ */
+void __percpu *__alloc_percpu_gfp_cb(size_t size, size_t align, gfp_t gfp,
+				     pcpu_cpuhp_fn_t cb, void *data)
+{
+	return pcpu_alloc(size, align, false, gfp, cb, data);
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu_gfp_cb);
+
+void __percpu *__alloc_percpu_cb(size_t size, size_t align, pcpu_cpuhp_fn_t cb,
+				 void *data)
+{
+	return pcpu_alloc(size, align, false, GFP_KERNEL, cb, data);
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu_cb);
+
+void __percpu *__alloc_reserved_percpu_cb(size_t size, size_t align,
+					  pcpu_cpuhp_fn_t cb, void *data)
+{
+	return pcpu_alloc(size, align, true, GFP_KERNEL, cb, data);
 }
 
 /**
@@ -2116,6 +2314,7 @@ void free_percpu(void __percpu *ptr)
 			}
 	}
 
+	pcpu_cpuhp_deregister(chunk, ptr);
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
 
 	spin_unlock_irqrestore(&pcpu_lock, flags);
@@ -2426,6 +2625,8 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	}								\
 } while (0)
 
+	PCPU_SETUP_BUG_ON(percpu_hotplug_setup() < 0);
+
 	/* sanity checks */
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 #ifdef CONFIG_SMP
-- 
2.31.1



* [RFC PATCH v0 2/3] percpu: Limit percpu allocator to online cpus
  2021-06-01  6:51 [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 1/3] percpu: CPU hotplug support for alloc_percpu() Bharata B Rao
@ 2021-06-01  6:51 ` Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 3/3] percpu: Avoid using percpu ptrs of non-existing cpus Bharata B Rao
  2021-06-02 15:01 ` [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Dennis Zhou
  3 siblings, 0 replies; 6+ messages in thread
From: Bharata B Rao @ 2021-06-01  6:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, aneesh.kumar, dennis, tj, cl, akpm, amakhalov, guro,
	vbabka, srikar, psampat, ego, Bharata B Rao

Now that the percpu allocator supports growing memory for a newly
onlined CPU at hotplug time, limit the allocation, mapping and
memcg charging of memory to online CPUs.

Also change the Percpu memory reporting in /proc/meminfo to
reflect the populated pages of online CPUs only.

TODO: Address percpu memcg charging and uncharging from
CPU hotplug callback.
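
The following is only a rough, untested sketch of what the
charge-growth side could look like when a CPU comes online. Both
pcpu_memcg_grow_charge_cpu() and pcpu_obj_size() are hypothetical
names; in particular, pcpu_obj_size() would have to recover the size
of the allocation starting at a given offset from the chunk's
allocation map, and no such helper exists today. The unplug side
would do the symmetric obj_cgroup_uncharge()/mod_memcg_state()
adjustment.

	/* Sketch only: grow the memcg charge when one more CPU comes online */
	static void pcpu_memcg_grow_charge_cpu(struct pcpu_chunk *chunk, gfp_t gfp)
	{
		int bit;

		if (!chunk->obj_cgroups)
			return;

		for (bit = 0; bit < pcpu_chunk_map_bits(chunk); bit++) {
			/* only the first block of an allocation holds the objcg */
			struct obj_cgroup *objcg = chunk->obj_cgroups[bit];
			size_t size;

			if (!objcg)
				continue;

			/* hypothetical helper, does not exist yet */
			size = pcpu_obj_size(chunk, bit << PCPU_MIN_ALLOC_SHIFT);

			if (obj_cgroup_charge(objcg, gfp, size))
				continue;	/* TODO: decide how failure is handled */

			rcu_read_lock();
			mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, size);
			rcu_read_unlock();
		}
	}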

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 mm/percpu-vm.c | 12 ++++++------
 mm/percpu.c    | 20 +++++++++++++-------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 3250e1c9aeaf..79ce104c963a 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -118,7 +118,7 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu;
 	int i;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page *page = pages[pcpu_page_idx(cpu, i)];
 
@@ -149,7 +149,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 
 	gfp |= __GFP_HIGHMEM;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
@@ -164,7 +164,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 	while (--i >= page_start)
 		__free_page(pages[pcpu_page_idx(cpu, i)]);
 
-	for_each_possible_cpu(tcpu) {
+	for_each_online_cpu(tcpu) {
 		if (tcpu == cpu)
 			break;
 		for (i = page_start; i < page_end; i++)
@@ -248,7 +248,7 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu;
 	int i;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		for (i = page_start; i < page_end; i++) {
 			struct page *page;
 
@@ -344,7 +344,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	unsigned int cpu, tcpu;
 	int i, err;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				       &pages[pcpu_page_idx(cpu, page_start)],
 				       page_end - page_start);
@@ -357,7 +357,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	}
 	return 0;
 err:
-	for_each_possible_cpu(tcpu) {
+	for_each_online_cpu(tcpu) {
 		if (tcpu == cpu)
 			break;
 		__pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
diff --git a/mm/percpu.c b/mm/percpu.c
index ca8ca541bede..83b6bcfcfa80 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1594,7 +1594,7 @@ static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 	if (!objcg)
 		return PCPU_CHUNK_ROOT;
 
-	if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
+	if (obj_cgroup_charge(objcg, gfp, size * num_online_cpus())) {
 		obj_cgroup_put(objcg);
 		return PCPU_FAIL_ALLOC;
 	}
@@ -1615,10 +1615,10 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				size * num_possible_cpus());
+				size * num_online_cpus());
 		rcu_read_unlock();
 	} else {
-		obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+		obj_cgroup_uncharge(objcg, size * num_online_cpus());
 		obj_cgroup_put(objcg);
 	}
 }
@@ -1633,11 +1633,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
+	obj_cgroup_uncharge(objcg, size * num_online_cpus());
 
 	rcu_read_lock();
 	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
+			-(size * num_online_cpus()));
 	rcu_read_unlock();
 
 	obj_cgroup_put(objcg);
@@ -1680,6 +1680,9 @@ static void pcpu_cpuhp_deregister(struct pcpu_chunk *chunk,
 		}
 }
 
+/*
+ * TODO: Grow the memcg charge
+ */
 static void __pcpu_cpuhp_setup(enum pcpu_chunk_type type, unsigned int cpu)
 {
 	int slot;
@@ -1720,6 +1723,9 @@ static int percpu_cpuhp_setup(unsigned int cpu)
 	return 0;
 }
 
+/*
+ * TODO: Reduce the memcg charge
+ */
 static void __pcpu_cpuhp_destroy(enum pcpu_chunk_type type, unsigned int cpu)
 {
 	int slot;
@@ -2000,7 +2006,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		pcpu_schedule_balance_work();
 
 	/* clear the areas and return address relative to base address */
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
 
 	ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
@@ -3372,7 +3378,7 @@ void __init setup_per_cpu_areas(void)
  */
 unsigned long pcpu_nr_pages(void)
 {
-	return pcpu_nr_populated * pcpu_nr_units;
+	return pcpu_nr_populated * num_online_cpus();
 }
 
 /*
-- 
2.31.1



* [RFC PATCH v0 3/3] percpu: Avoid using percpu ptrs of non-existing cpus
  2021-06-01  6:51 [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 1/3] percpu: CPU hotplug support for alloc_percpu() Bharata B Rao
  2021-06-01  6:51 ` [RFC PATCH v0 2/3] percpu: Limit percpu allocator to online cpus Bharata B Rao
@ 2021-06-01  6:51 ` Bharata B Rao
  2021-06-02 15:01 ` [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Dennis Zhou
  3 siblings, 0 replies; 6+ messages in thread
From: Bharata B Rao @ 2021-06-01  6:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, aneesh.kumar, dennis, tj, cl, akpm, amakhalov, guro,
	vbabka, srikar, psampat, ego, Bharata B Rao

Prevent the callers of alloc_percpu() from using the percpu
pointers of non-existing CPUs. Also switch those callers that
require initialization of percpu data for a newly onlined CPU
to use the new variant alloc_percpu_cb().

Note: Not all callers have been modified here.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 fs/namespace.c           |  4 ++--
 kernel/cgroup/rstat.c    | 20 ++++++++++++++++----
 kernel/sched/cpuacct.c   | 10 +++++-----
 kernel/sched/psi.c       | 14 +++++++++++---
 lib/percpu-refcount.c    |  4 ++--
 lib/percpu_counter.c     |  2 +-
 net/ipv4/fib_semantics.c |  2 +-
 net/ipv6/route.c         |  6 +++---
 8 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3f1a78ba369..b6ea584b99e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -182,7 +182,7 @@ int mnt_get_count(struct mount *mnt)
 	int count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
 	}
 
@@ -294,7 +294,7 @@ static unsigned int mnt_get_writers(struct mount *mnt)
 	unsigned int count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
 	}
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cee265cb535c..b25c59138c0b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -152,7 +152,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
 						       cpu);
 		struct cgroup *pos = NULL;
@@ -245,19 +245,31 @@ void cgroup_rstat_flush_release(void)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+static int cgroup_rstat_cpuhp_handler(void __percpu *ptr, unsigned int cpu, void *data)
+{
+	struct cgroup *cgrp = (struct cgroup *)data;
+	struct cgroup_rstat_cpu *rstatc = per_cpu_ptr(ptr, cpu);
+
+	rstatc->updated_children = cgrp;
+	u64_stats_init(&rstatc->bsync);
+	return 0;
+}
+
 int cgroup_rstat_init(struct cgroup *cgrp)
 {
 	int cpu;
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_cb(struct cgroup_rstat_cpu,
+						  cgroup_rstat_cpuhp_handler,
+						  cgrp);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
 
 	/* ->updated_children list is self terminated */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
 
 		rstatc->updated_children = cgrp;
@@ -274,7 +286,7 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
 	cgroup_rstat_flush(cgrp);
 
 	/* sanity check */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
 
 		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 104a1bade14f..81dd53387ba5 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -160,7 +160,7 @@ static u64 __cpuusage_read(struct cgroup_subsys_state *css,
 	u64 totalcpuusage = 0;
 	int i;
 
-	for_each_possible_cpu(i)
+	for_each_online_cpu(i)
 		totalcpuusage += cpuacct_cpuusage_read(ca, i, index);
 
 	return totalcpuusage;
@@ -195,7 +195,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 	if (val)
 		return -EINVAL;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		cpuacct_cpuusage_write(ca, cpu, 0);
 
 	return 0;
@@ -208,7 +208,7 @@ static int __cpuacct_percpu_seq_show(struct seq_file *m,
 	u64 percpu;
 	int i;
 
-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {
 		percpu = cpuacct_cpuusage_read(ca, i, index);
 		seq_printf(m, "%llu ", (unsigned long long) percpu);
 	}
@@ -242,7 +242,7 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 		seq_printf(m, " %s", cpuacct_stat_desc[index]);
 	seq_puts(m, "\n");
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 
 		seq_printf(m, "%d", cpu);
@@ -275,7 +275,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	int stat;
 
 	memset(val, 0, sizeof(val));
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
 		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3cff41f..228977aa4780 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -186,7 +186,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
@@ -321,7 +321,7 @@ static void collect_percpu_times(struct psi_group *group,
 	 * the sampling period. This eliminates artifacts from uneven
 	 * loading, or even entirely idle CPUs.
 	 */
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		u32 times[NR_PSI_STATES];
 		u32 nonidle;
 		u32 cpu_changed_states;
@@ -935,12 +935,20 @@ void psi_memstall_leave(unsigned long *flags)
 }
 
 #ifdef CONFIG_CGROUPS
+static int psi_cpuhp_handler(void __percpu *ptr, unsigned int cpu, void *unused)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(ptr, cpu);
+
+	seqcount_init(&groupc->seq);
+	return 0;
+}
+
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_cb(struct psi_group_cpu, psi_cpuhp_handler, NULL);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index a1071cdefb5a..aeba43c33600 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -173,7 +173,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	unsigned long count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		count += *per_cpu_ptr(percpu_count, cpu);
 
 	pr_debug("global %lu percpu %lu\n",
@@ -253,7 +253,7 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 	 * zeroing is visible to all percpu accesses which can see the
 	 * following __PERCPU_REF_ATOMIC clearing.
 	 */
-	for_each_possible_cpu(cpu)
+	for_each_online_cpu(cpu)
 		*per_cpu_ptr(percpu_count, cpu) = 0;
 
 	smp_store_release(&ref->percpu_count_ptr,
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ed610b75dc32..db40abc6f0f5 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -63,7 +63,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&fbc->lock, flags);
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a632b66bc13a..dbfd14b0077f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -194,7 +194,7 @@ static void rt_fibinfo_free_cpus(struct rtable __rcu * __percpu *rtp)
 	if (!rtp)
 		return;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct rtable *rt;
 
 		rt = rcu_dereference_protected(*per_cpu_ptr(rtp, cpu), 1);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a22822bdbf39..e7db3a5fe5c5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -165,7 +165,7 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev)
 	if (dev == loopback_dev)
 		return;
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 		struct rt6_info *rt;
 
@@ -3542,7 +3542,7 @@ void fib6_nh_release(struct fib6_nh *fib6_nh)
 	if (fib6_nh->rt6i_pcpu) {
 		int cpu;
 
-		for_each_possible_cpu(cpu) {
+		for_each_online_cpu(cpu) {
 			struct rt6_info **ppcpu_rt;
 			struct rt6_info *pcpu_rt;
 
@@ -6569,7 +6569,7 @@ int __init ip6_route_init(void)
 #endif
 #endif
 
-	for_each_possible_cpu(cpu) {
+	for_each_online_cpu(cpu) {
 		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
 
 		INIT_LIST_HEAD(&ul->head);
-- 
2.31.1



* Re: [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator
  2021-06-01  6:51 [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Bharata B Rao
                   ` (2 preceding siblings ...)
  2021-06-01  6:51 ` [RFC PATCH v0 3/3] percpu: Avoid using percpu ptrs of non-existing cpus Bharata B Rao
@ 2021-06-02 15:01 ` Dennis Zhou
  2021-06-04  5:01   ` Bharata B Rao
  3 siblings, 1 reply; 6+ messages in thread
From: Dennis Zhou @ 2021-06-02 15:01 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, aneesh.kumar, tj, cl, akpm, amakhalov,
	guro, vbabka, srikar, psampat, ego

Hello,

On Tue, Jun 01, 2021 at 12:21:44PM +0530, Bharata B Rao wrote:
> Hi,
> 
> This is an attempt to make the percpu allocator CPU hotplug aware.
> Currently the percpu allocator allocates memory for all the possible
> CPUs. This can lead to wastage of memory when possible number of CPUs
> is significantly higher than the number of online CPUs. This can be
> avoided if the percpu allocator were to allocate only for the online
> CPUs and extend the allocation for other CPUs as and when they become
> online. 
> 
> This early RFC work shows some good memory savings for a powerpc
> KVM guest that is booted with 16 online and 1024 possible CPUs.
> Here is the comparision of Percpu memory consumption from
> /proc/meminfo before and after creating 1000 memcgs.
> 
> 			W/o patch		W/ patch
> Before			1441792 kB		22528 kB
> After 1000 memcgs	4390912 kB		68608 kB
> 
> Note that the Percpu reporting in meminfo has been changed in
> the patchset to reflect the allocation for online CPUs only.
> 
> More details about the approach are present in the patch
> descriptions.
> 
> Bharata B Rao (3):
>   percpu: CPU hotplug support for alloc_percpu()
>   percpu: Limit percpu allocator to online cpus
>   percpu: Avoid using percpu ptrs of non-existing cpus
> 
>  fs/namespace.c             |   4 +-
>  include/linux/cpuhotplug.h |   2 +
>  include/linux/percpu.h     |  15 +++
>  kernel/cgroup/rstat.c      |  20 +++-
>  kernel/sched/cpuacct.c     |  10 +-
>  kernel/sched/psi.c         |  14 ++-
>  lib/percpu-refcount.c      |   4 +-
>  lib/percpu_counter.c       |   2 +-
>  mm/percpu-internal.h       |   9 ++
>  mm/percpu-vm.c             | 211 +++++++++++++++++++++++++++++++++-
>  mm/percpu.c                | 229 +++++++++++++++++++++++++++++++++++--
>  net/ipv4/fib_semantics.c   |   2 +-
>  net/ipv6/route.c           |   6 +-
>  13 files changed, 490 insertions(+), 38 deletions(-)
> 
> -- 
> 2.31.1
> 

I have thought about this for a day now and to be honest my thoughts
haven't really changed since the last discussion in [1].

I struggle here for a few reasons:
1. We're intertwining cpu and memory for hotplug.
  - What does it mean if we don't have enough memory?
  - How hard do we try to reclaim memory?
  - Partially allocated cpus? Do we free it all and try again?
2. We're now blocking the whole system on the percpu mutex which can
   cause terrible side effects. If there is a large amount of percpu
   memory already in use, this means we've accumulated a substantial
   number of callbacks.
3. While I did mention a callback approach would work, I'm not thrilled
   by the additional complexity of it, as it can be error prone.

Beyond the above, I still don't believe it's the most well-motivated
problem. I struggle to see a world where it makes sense to let someone
scale from 16 cpus to 1024, as in my mind you would also need to scale
memory to some degree too (not necessarily linearly, but a 1024-core
machine with, say, 16 gigs of ram would be pretty funny).

Would it be that bad to use cold migration points and eat a little bit
of overhead for what I understand to be a relatively uncommon use case?

[1] https://lore.kernel.org/linux-mm/8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com/

Thanks,
Dennis


* Re: [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator
  2021-06-02 15:01 ` [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator Dennis Zhou
@ 2021-06-04  5:01   ` Bharata B Rao
  0 siblings, 0 replies; 6+ messages in thread
From: Bharata B Rao @ 2021-06-04  5:01 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: linux-kernel, linux-mm, aneesh.kumar, tj, cl, akpm, amakhalov,
	guro, vbabka, srikar, psampat, ego

On Wed, Jun 02, 2021 at 03:01:04PM +0000, Dennis Zhou wrote:
> Hello,
> 
> On Tue, Jun 01, 2021 at 12:21:44PM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > This is an attempt to make the percpu allocator CPU hotplug aware.
> > Currently the percpu allocator allocates memory for all the possible
> > CPUs. This can lead to wastage of memory when the number of possible
> > CPUs is significantly higher than the number of online CPUs. This can be
> > avoided if the percpu allocator were to allocate only for the online
> > CPUs and extend the allocation for other CPUs as and when they become
> > online. 
> > 
> > This early RFC work shows some good memory savings for a powerpc
> > KVM guest that is booted with 16 online and 1024 possible CPUs.
> > Here is the comparison of Percpu memory consumption from
> > /proc/meminfo before and after creating 1000 memcgs.
> > 
> > 			W/o patch		W/ patch
> > Before			1441792 kB		22528 kB
> > After 1000 memcgs	4390912 kB		68608 kB
> > 
> 
> I have thought about this for a day now and to be honest my thoughts
> haven't really changed since the last discussion in [1].
> 
> I struggle here for a few reasons:
> 1. We're intertwining cpu and memory for hotplug.
>   - What does it mean if we don't have enough memory?

That means CPU hotplug will fail, but...

>   - How hard do we try to reclaim memory?
>   - Partially allocated cpus? Do we free it all and try again?


... yes, these are some difficult questions. We should check if
rollback can be done cleanly and efficiently. You can see that
I am registering separate hotplug callbacks for the hotplug core
and for the init routines of alloc_percpu() callers. Rolling back
the former should be fairly straightforward, but we have to see how
desirable and feasible it is to undo the entire CPU hotplug when
one of the alloc_percpu callbacks fails, especially if there are
hundreds of registered alloc_percpu callbacks.

> 2. We're now blocking the whole system on the percpu mutex which can
>    cause terrible side effects. If there is a large amount of percpu
>    memory already in use, this means we've accumulated a substantial
>    number of callbacks.

I am yet to look at each caller in detail and see which of them
really need init/free callbacks and which can do without them. After
that, we will have to measure the overhead all of this puts on the
hotplug path. Given that hotplug is a slow path, I wonder if some
overhead is tolerable here.

CPU hotplug already happens with cpu_hotplug_lock held, so when you
mention that this callback holding the percpu mutex can have terrible
effects, are you specifically worried about blocking all the
percpu allocation requests during hotplug? Or is it something else?

> 3. While I did mention a callback approach would work. I'm not thrilled
>    by the additional complexity of it as it can be error prone.

Fair enough, the callback for the percpu allocator core seems fine
to me but since I haven't yet looked at all callers in detail, I
don't know if we would run into some issues/dependencies in any
specific callback handlers that increase the overall complexity.

Other than the callbacks, I am also a bit worried about the complexity
and the overhead involved in memcg charging and uncharging at CPU
hotplug time. In my environment (powerpc kvm guest), I see that each
chunk can have a maximum of 180224 obj_cgroups. Checking for the
valid/used ones out of those, determining the allocation sizes and
charging/uncharging to the right memcg could be an expensive task.

> 
> Beyond the above. I still don't believe it's the most well motivated
> problem. I struggle to see a world where it makes sense to let someone
> scale from 16 cpus to 1024. As in my mind you would also need to scale
> memory to some degree too (not necessarily linearly but a 1024 core
> machine with say like 16 gigs of ram would be pretty funny).

Well, the platform here provides the capability to scale, and the
motivation is to not consume memory for not-yet-present CPUs until
that scaling actually happens. But as you note, it definitely is a
question of whether any real application is making use of this
scaling now, and of the associated complexity.

Even if we consider the scaling from 16 to 1024 CPUs as unrealistic
for now, the use case and the numbers from the production scenario that
Alexey mentioned in [1] (2 to 128 CPUs) are certainly a good motivator?

Alexey - You did mention creating a huge number of memcgs and
observing VMs consuming 16-20 GB of percpu memory in production. So
how many memcgs are we talking about here?

> 
> Would it be that bad to use cold migration points and eat a little bit
> of overhead for what I understand to be a relatively uncommon use case?

[1] https://lore.kernel.org/linux-mm/8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com/

Regards,
Bharata.

