linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/6] percpu: partial chunk depopulation
@ 2021-04-08  3:57 Roman Gushchin
  2021-04-08  3:57 ` [PATCH v3 1/6] percpu: fix a comment about the chunks ordering Roman Gushchin
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

In our production experience the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of the percpu data
used for vmstats, vmevents, ref counters etc). Deleting and completely
releasing these cgroups doesn't always shrink the percpu memory, so
sometimes several GB of memory are wasted.

The underlying problem is fragmentation: to release an underlying chunk,
all percpu allocations in it must be released first. The percpu allocator
tends to top up chunks to improve utilization. This means new small-ish
allocations (e.g. percpu ref counters) are placed onto almost filled
old-ish chunks, effectively pinning them in memory.

This patchset solves this problem by implementing partial depopulation
of percpu chunks: chunks with many empty pages are asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
	mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
	rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481152 kB
    Percpu:           481152 kB

  with this patchset applied:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481408 kB
    Percpu:           135552 kB

So the total size of the percpu memory was reduced by a factor of more than 3.5.
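
The ratio can be checked directly from the Percpu values printed above.
A small user-space sketch (the function names are made up for this
illustration and are not part of the patchset):

```c
/* Ratio between the percpu size before and after deleting the cgroups. */
double shrink_ratio(long before_kb, long after_kb)
{
	return (double)before_kb / (double)after_kb;
}

/* Amount of percpu memory returned to the system, in kB. */
long released_kb(long before_kb, long after_kb)
{
	return before_kb - after_kb;
}
```

With the patched numbers, shrink_ratio(481408, 135552) is a bit above 3.5,
while the vanilla numbers (481152 before and after) give a ratio of
exactly 1, i.e. nothing was returned.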

v3:
  - introduced pcpu_check_chunk_hint()
  - fixed a bug related to the hint check
  - minor cosmetic changes
  - s/pretends/fixes (cc Vlastimil)

v2:
  - depopulated chunks are sidelined
  - depopulation happens in the reverse order
  - depopulate list made per-chunk type
  - better results due to better heuristics

v1:
  - depopulation heuristics changed and optimized
  - chunks are put into a separate list, depopulation scans this list
  - chunk->isolated is introduced, chunk->depopulate is dropped
  - rearranged patches a bit
  - fixed a panic discovered by krobot
  - made pcpu_nr_empty_pop_pages per chunk type
  - minor fixes

rfc:
  https://lwn.net/Articles/850508/


Roman Gushchin (6):
  percpu: fix a comment about the chunks ordering
  percpu: split __pcpu_balance_workfn()
  percpu: make pcpu_nr_empty_pop_pages per chunk type
  percpu: generalize pcpu_balance_populated()
  percpu: factor out pcpu_check_chunk_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   4 +-
 mm/percpu-stats.c    |   9 +-
 mm/percpu.c          | 306 +++++++++++++++++++++++++++++++++++--------
 3 files changed, 261 insertions(+), 58 deletions(-)

-- 
2.30.2



* [PATCH v3 1/6] percpu: fix a comment about the chunks ordering
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 21:06   ` Dennis Zhou
  2021-04-08  3:57 ` [PATCH v3 2/6] percpu: split __pcpu_balance_workfn() Roman Gushchin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.
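
To illustrate the difference between the two sort keys, here is a rough
user-space sketch. The toy_* names and the exact size-to-slot mapping are
assumptions made for this example (the mapping is chosen only so that
sizes 1-31 share one slot, as the comment says); it is not the code from
mm/percpu.c.

```c
#include <stddef.h>

#define TOY_SLOT_BASE_SHIFT 5	/* mirrors PCPU_SLOT_BASE_SHIFT */

struct toy_chunk {
	size_t free_bytes;	/* total free bytes in the chunk */
	size_t contig_bytes;	/* biggest contiguous free area */
};

/* 1-based position of the highest set bit; 0 for size == 0. */
int toy_highest_bit(size_t size)
{
	int bit = 0;

	while (size) {
		bit++;
		size >>= 1;
	}
	return bit;
}

/* Hypothetical mapping: sizes 1-31 share slot 1, then one slot per power of two. */
int toy_size_to_slot(size_t size)
{
	int slot = toy_highest_bit(size) - TOY_SLOT_BASE_SHIFT + 1;

	return slot > 1 ? slot : 1;
}

/* The slot depends on the biggest contiguous area, not on free_bytes. */
int toy_chunk_slot(const struct toy_chunk *chunk)
{
	if (chunk->contig_bytes == 0)
		return 0;
	return toy_size_to_slot(chunk->contig_bytes);
}
```

In this model a chunk with 100 free bytes scattered in 16-byte holes ends
up in a lower slot than a chunk with a single 64-byte hole, even though
the former has more free bytes in total.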

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..2f27123bb489 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
 
 #include "percpu-internal.h"
 
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
 #define PCPU_SLOT_BASE_SHIFT		5
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD	3
-- 
2.30.2



* [PATCH v3 2/6] percpu: split __pcpu_balance_workfn()
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
  2021-04-08  3:57 ` [PATCH v3 1/6] percpu: fix a comment about the chunks ordering Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 21:06   ` Dennis Zhou
  2021-04-08  3:57 ` [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type Roman Gushchin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excess free chunks and the population of the
necessary amount of free pages.

In order to simplify the code and prepare for adding new
functionality, split it into two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.
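
The locking pattern can be sketched in user space as follows. This is only
a model: the toy_* names and the flag-based "mutex" are stand-ins invented
for the example, and the helpers just record whether they were called with
the lock held.

```c
#include <stdbool.h>

enum toy_chunk_type { TOY_CHUNK_ROOT, TOY_CHUNK_MEMCG, TOY_NR_CHUNK_TYPES };

/* Stand-in for pcpu_alloc_mutex: only tracks state, no real locking. */
struct toy_mutex {
	bool held;
	int acquisitions;
};

struct toy_mutex toy_alloc_mutex;
int toy_free_calls;
int toy_populated_calls;
int toy_calls_without_lock;

void toy_lock(struct toy_mutex *m)
{
	m->held = true;
	m->acquisitions++;
}

void toy_unlock(struct toy_mutex *m)
{
	m->held = false;
}

/* Both helpers rely on the caller holding the mutex. */
void toy_balance_free(enum toy_chunk_type type)
{
	(void)type;
	if (!toy_alloc_mutex.held)
		toy_calls_without_lock++;
	toy_free_calls++;
}

void toy_balance_populated(enum toy_chunk_type type)
{
	(void)type;
	if (!toy_alloc_mutex.held)
		toy_calls_without_lock++;
	toy_populated_calls++;
}

/* The lock is taken once per chunk type and covers both helpers. */
void toy_balance_work(void)
{
	int type;

	for (type = 0; type < TOY_NR_CHUNK_TYPES; type++) {
		toy_lock(&toy_alloc_mutex);
		toy_balance_free(type);
		toy_balance_populated(type);
		toy_unlock(&toy_alloc_mutex);
	}
}
```

Holding the lock in the caller keeps both steps for a given chunk type in
one critical section, as it was before the split.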

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 2f27123bb489..7e31e1b8725f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1933,31 +1933,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 }
 
 /**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
  *
- * Reclaim all fully free chunks except for the first one.  This is also
- * responsible for maintaining the pool of empty populated pages.  However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered.  We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
  */
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
-	/* gfp flags passed to underlying allocators */
-	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	LIST_HEAD(to_free);
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
 	struct pcpu_chunk *chunk, *next;
-	int slot, nr_to_pop, ret;
 
 	/*
 	 * There's no reason to keep around multiple unused chunks and VM
 	 * areas can be scarce.  Destroy all free chunks except for one.
 	 */
-	mutex_lock(&pcpu_alloc_mutex);
 	spin_lock_irq(&pcpu_lock);
 
 	list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1985,6 +1976,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 		pcpu_destroy_chunk(chunk);
 		cond_resched();
 	}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered.  We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+	/* gfp flags passed to underlying allocators */
+	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	int slot, nr_to_pop, ret;
 
 	/*
 	 * Ensure there are certain number of free populated pages for
@@ -2054,22 +2064,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
 			goto retry_pop;
 		}
 	}
-
-	mutex_unlock(&pcpu_alloc_mutex);
 }
 
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
 	enum pcpu_chunk_type type;
 
-	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
-		__pcpu_balance_workfn(type);
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		mutex_lock(&pcpu_alloc_mutex);
+		pcpu_balance_free(type);
+		pcpu_balance_populated(type);
+		mutex_unlock(&pcpu_alloc_mutex);
+	}
 }
 
 /**
-- 
2.30.2



* [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
  2021-04-08  3:57 ` [PATCH v3 1/6] percpu: fix a comment about the chunks ordering Roman Gushchin
  2021-04-08  3:57 ` [PATCH v3 2/6] percpu: split __pcpu_balance_workfn() Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 21:08   ` Dennis Zhou
  2021-04-08  3:57 ` [PATCH v3 4/6] percpu: generalize pcpu_balance_populated() Roman Gushchin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations use separate sets of chunks,
so both need to have a surplus of empty pages.

This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.
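
A minimal user-space sketch of why a single counter is not enough. The
toy_* names are invented for this example and the threshold value is an
assumption; only the idea of an array indexed by chunk type is taken from
the patch.

```c
enum toy_chunk_type { TOY_CHUNK_ROOT, TOY_CHUNK_MEMCG, TOY_NR_CHUNK_TYPES };

/* Assumed value for the low watermark; used here only for illustration. */
#define TOY_EMPTY_POP_PAGES_LOW 2

/* One counter of empty populated pages per chunk type. */
int toy_nr_empty_pop_pages[TOY_NR_CHUNK_TYPES];

void toy_update_empty_pages(enum toy_chunk_type type, int nr)
{
	toy_nr_empty_pop_pages[type] += nr;
}

/* A shortage is detected for each chunk type separately. */
int toy_needs_balance(enum toy_chunk_type type)
{
	return toy_nr_empty_pop_pages[type] < TOY_EMPTY_POP_PAGES_LOW;
}

/* Statistics can still report the sum over all types. */
int toy_total_empty_pop_pages(void)
{
	int type, total = 0;

	for (type = 0; type < TOY_NR_CHUNK_TYPES; type++)
		total += toy_nr_empty_pop_pages[type];
	return total;
}
```

If the root chunks have 5 empty pages and the memcg chunks have 1, the sum
(6) looks healthy, but the per-type check still reports a shortage for the
memcg chunks.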

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |  2 +-
 mm/percpu-stats.c    |  9 +++++++--
 mm/percpu.c          | 14 +++++++-------
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..095d7eaa0db4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
-extern int pcpu_nr_empty_pop_pages;
+extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index c8400a2adbc2..f6026dbcdf6b 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	int slot, max_nr_alloc;
 	int *buffer;
 	enum pcpu_chunk_type type;
+	int nr_empty_pop_pages;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		goto alloc_buffer;
 	}
 
-#define PL(X) \
+	nr_empty_pop_pages = 0;
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+		nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+
+#define PL(X)								\
 	seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
 
 	seq_printf(m,
@@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	PU(nr_max_chunks);
 	PU(min_alloc_size);
 	PU(max_alloc_size);
-	P("empty_pop_pages", pcpu_nr_empty_pop_pages);
+	P("empty_pop_pages", nr_empty_pop_pages);
 	seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 7e31e1b8725f..61339b3d9337 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -176,10 +176,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
 static LIST_HEAD(pcpu_map_extend_chunks);
 
 /*
- * The number of empty populated pages, protected by pcpu_lock.  The
- * reserved chunk doesn't contribute to the count.
+ * The number of empty populated pages by chunk type, protected by pcpu_lock.
+ * The reserved chunk doesn't contribute to the count.
  */
-int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
 /*
  * The number of populated pages in use by the allocator, protected by
@@ -559,7 +559,7 @@ static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
 {
 	chunk->nr_empty_pop_pages += nr;
 	if (chunk != pcpu_reserved_chunk)
-		pcpu_nr_empty_pop_pages += nr;
+		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
 /*
@@ -1835,7 +1835,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
 
-	if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
+	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
 		pcpu_schedule_balance_work();
 
 	/* clear the areas and return address relative to base address */
@@ -2013,7 +2013,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 		pcpu_atomic_alloc_failed = false;
 	} else {
 		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
-				  pcpu_nr_empty_pop_pages,
+				  pcpu_nr_empty_pop_pages[type],
 				  0, PCPU_EMPTY_POP_PAGES_HIGH);
 	}
 
@@ -2595,7 +2595,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 
 	/* link the first chunk in */
 	pcpu_first_chunk = chunk;
-	pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
+	pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = pcpu_first_chunk->nr_empty_pop_pages;
 	pcpu_chunk_relocate(pcpu_first_chunk, -1);
 
 	/* include all regions of the first chunk */
-- 
2.30.2



* [PATCH v3 4/6] percpu: generalize pcpu_balance_populated()
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
                   ` (2 preceding siblings ...)
  2021-04-08  3:57 ` [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 21:09   ` Dennis Zhou
  2021-04-08  3:57 ` [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint() Roman Gushchin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

To prepare for the depopulation of percpu chunks, split out the
populating part of the pcpu_balance_populated() into the new
pcpu_grow_populated() (with an intention to add
pcpu_shrink_populated() in the next commit).

The goal of pcpu_balance_populated() is to determine whether
there is a shortage or an excessive amount of empty percpu pages
and call into the corresponding function.

pcpu_grow_populated() takes the desired number of pages as an argument
(nr_to_pop). If it creates a new chunk, nr_to_pop must be updated
to reflect that the new chunk may have been created already populated.
Otherwise an infinite loop might occur.
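
The termination argument can be shown with a toy model. It assumes that
every pass has to create a new chunk and that each new chunk arrives with
a fixed number of populated pages; both assumptions, the toy_* names and
the iteration cap exist only for this sketch.

```c
/* Assumption: pages already populated in a freshly created toy chunk. */
#define TOY_NEW_CHUNK_POPULATED 4

/*
 * Returns the number of chunks created until the request is satisfied,
 * or -1 if max_chunks is hit, which stands in for "loops forever".
 */
int toy_grow_populated(int nr_to_pop, int account_new_chunk, int max_chunks)
{
	int created = 0;

	while (nr_to_pop > 0) {
		if (created == max_chunks)
			return -1;

		/* No existing chunk has room to populate: create one. */
		created++;

		/* Count the pages the new chunk brought with it. */
		if (account_new_chunk) {
			nr_to_pop -= TOY_NEW_CHUNK_POPULATED;
			if (nr_to_pop < 0)
				nr_to_pop = 0;
		}
	}
	return created;
}
```

With the accounting enabled, a request for 6 pages finishes after two new
chunks. Without it, nr_to_pop never changes and the loop runs until the
cap is hit.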

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu.c | 63 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 61339b3d9337..e20119668c42 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1979,7 +1979,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 }
 
 /**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
  * @type: chunk type
  *
  * Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1988,35 +1988,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 {
 	/* gfp flags passed to underlying allocators */
 	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct pcpu_chunk *chunk;
-	int slot, nr_to_pop, ret;
+	int slot, ret;
 
-	/*
-	 * Ensure there are certain number of free populated pages for
-	 * atomic allocs.  Fill up from the most packed so that atomic
-	 * allocs don't increase fragmentation.  If atomic allocation
-	 * failed previously, always populate the maximum amount.  This
-	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
-	 * failing indefinitely; however, large atomic allocs are not
-	 * something we support properly and can be highly unreliable and
-	 * inefficient.
-	 */
 retry_pop:
-	if (pcpu_atomic_alloc_failed) {
-		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
-		/* best effort anyway, don't worry about synchronization */
-		pcpu_atomic_alloc_failed = false;
-	} else {
-		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
-				  pcpu_nr_empty_pop_pages[type],
-				  0, PCPU_EMPTY_POP_PAGES_HIGH);
-	}
-
 	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
 		unsigned int nr_unpop = 0, rs, re;
 
@@ -2060,12 +2040,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 		if (chunk) {
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_relocate(chunk, -1);
+			nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
 			spin_unlock_irq(&pcpu_lock);
-			goto retry_pop;
+			if (nr_to_pop)
+				goto retry_pop;
 		}
 	}
 }
 
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Populate or depopulate chunks to maintain a certain amount
+ * of free pages to satisfy atomic allocations, but not waste
+ * large amounts of memory.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+	int nr_to_pop;
+
+	/*
+	 * Ensure there are certain number of free populated pages for
+	 * atomic allocs.  Fill up from the most packed so that atomic
+	 * allocs don't increase fragmentation.  If atomic allocation
+	 * failed previously, always populate the maximum amount.  This
+	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
+	 * failing indefinitely; however, large atomic allocs are not
+	 * something we support properly and can be highly unreliable and
+	 * inefficient.
+	 */
+	if (pcpu_atomic_alloc_failed) {
+		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
+		/* best effort anyway, don't worry about synchronization */
+		pcpu_atomic_alloc_failed = false;
+		pcpu_grow_populated(type, nr_to_pop);
+	} else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
+		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
+		pcpu_grow_populated(type, nr_to_pop);
+	}
+}
+
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
-- 
2.30.2



* [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint()
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
                   ` (3 preceding siblings ...)
  2021-04-08  3:57 ` [PATCH v3 4/6] percpu: generalize pcpu_balance_populated() Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 21:15   ` Dennis Zhou
  2021-04-08  3:57 ` [PATCH v3 6/6] percpu: implement partial chunk depopulation Roman Gushchin
  2021-04-16 12:56 ` [PATCH v3 0/6] percpu: " Pratik Sampat
  6 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

Factor out the pcpu_check_chunk_hint() helper, which will be useful
in the future. The new function checks whether the allocation is likely
to fit in the given chunk.
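
The check can be tried out in user space with a stripped-down copy of the
logic. The toy_* names, the reduced struct and the local alignment macro
are stand-ins for this sketch; the arithmetic follows the helper in the
diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define TOY_ALIGN(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Only the two fields of the block metadata that the check needs. */
struct toy_block_md {
	int contig_hint;	/* size of the biggest free area, in bits */
	int contig_hint_start;	/* offset where that area starts */
};

/* True if an aligned request of the given size fits in the contig hint. */
bool toy_check_chunk_hint(const struct toy_block_md *md, int bits,
			  size_t align)
{
	int start = md->contig_hint_start;
	/* Bits lost at the start of the area due to alignment. */
	int bit_off = (int)TOY_ALIGN((size_t)start, align) - start;

	return bit_off + bits <= md->contig_hint;
}
```

For a free area of 8 bits starting at offset 5, a 4-bit request aligned to
4 loses 3 bits to alignment and still fits (3 + 4 <= 8), while a 6-bit
request with the same alignment does not (3 + 6 > 8).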

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index e20119668c42..357fd6994278 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -306,6 +306,26 @@ static unsigned long pcpu_block_off_to_off(int index, int off)
 	return index * PCPU_BITMAP_BLOCK_BITS + off;
 }
 
+/**
+ * pcpu_check_chunk_hint - check that allocation can fit a chunk
+ * @chunk_md: chunk's block
+ * @bits: size of request in allocation units
+ * @align: alignment of area (max PAGE_SIZE)
+ *
+ * Check to see if the allocation can fit in the chunk's contig hint.
+ * This is an optimization to prevent scanning by assuming if it
+ * cannot fit in the global hint, there is memory pressure and creating
+ * a new chunk would happen soon.
+ */
+static bool pcpu_check_chunk_hint(struct pcpu_block_md *chunk_md, int bits,
+				  size_t align)
+{
+	int bit_off = ALIGN(chunk_md->contig_hint_start, align) -
+		chunk_md->contig_hint_start;
+
+	return bit_off + bits <= chunk_md->contig_hint;
+}
+
 /*
  * pcpu_next_hint - determine which hint to use
  * @block: block of interest
@@ -1065,15 +1085,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
 	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
 	int bit_off, bits, next_off;
 
-	/*
-	 * Check to see if the allocation can fit in the chunk's contig hint.
-	 * This is an optimization to prevent scanning by assuming if it
-	 * cannot fit in the global hint, there is memory pressure and creating
-	 * a new chunk would happen soon.
-	 */
-	bit_off = ALIGN(chunk_md->contig_hint_start, align) -
-		  chunk_md->contig_hint_start;
-	if (bit_off + alloc_bits > chunk_md->contig_hint)
+	if (!pcpu_check_chunk_hint(chunk_md, alloc_bits, align))
 		return -1;
 
 	bit_off = pcpu_next_hint(chunk_md, alloc_bits);
-- 
2.30.2



* [PATCH v3 6/6] percpu: implement partial chunk depopulation
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
                   ` (4 preceding siblings ...)
  2021-04-08  3:57 ` [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint() Roman Gushchin
@ 2021-04-08  3:57 ` Roman Gushchin
  2021-04-16 12:56 ` [PATCH v3 0/6] percpu: " Pratik Sampat
  6 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2021-04-08  3:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel, Roman Gushchin

This patch implements partial depopulation of percpu chunks.

Currently, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a
partially filled chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is isolated beforehand. After the depopulation the chunk is
sidelined to a special list or freed. New allocations can't be served
using a sidelined chunk. The chunk can be moved back to a corresponding
slot if there are not enough chunks with empty populated pages.

The depopulation is scheduled on the free path. If all of the following
hold:
  1) the chunk has at least 1/4 of its total pages free and populated
  2) the system has enough free percpu pages aside from this chunk
  3) the chunk isn't the reserved chunk
  4) the chunk isn't the first chunk
  5) the chunk isn't entirely free
the chunk is a good target for depopulation. If it's already depopulated
but got free populated pages, it's a good target too.
The chunk is moved to a special pcpu_depopulate_list, the chunk->isolated
flag is set and the async balancing is scheduled.
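
The conditions above can be written as a single predicate. The sketch
below is a user-space model: the toy_* names, the boolean fields standing
in for pointer comparisons and the threshold value are assumptions; the
structure of the condition follows the free_percpu() hunk in this patch.

```c
#include <stdbool.h>

/* Assumed value of the high watermark; used here only for illustration. */
#define TOY_EMPTY_POP_PAGES_HIGH 4

struct toy_chunk {
	int nr_pages;		/* total pages in the chunk */
	int nr_empty_pop_pages;	/* pages that are populated but unused */
	bool is_first;		/* stands in for chunk == pcpu_first_chunk */
	bool is_reserved;	/* stands in for chunk == pcpu_reserved_chunk */
	bool fully_free;	/* no allocations left in the chunk */
	bool isolated;
	bool depopulated;
};

/* type_empty_pop_pages: empty populated pages over all chunks of this type. */
bool toy_should_depopulate(const struct toy_chunk *c, int type_empty_pop_pages)
{
	/* Entirely free chunks are handled by the chunk freeing path. */
	if (c->fully_free)
		return false;
	if (c->is_first || c->is_reserved || c->isolated)
		return false;
	/* Enough empty pages must remain even without this chunk. */
	if (type_empty_pop_pages <=
	    TOY_EMPTY_POP_PAGES_HIGH + c->nr_empty_pop_pages)
		return false;
	/* Either a previously depopulated chunk with new empty pages, or >= 1/4. */
	return (c->depopulated && c->nr_empty_pop_pages) ||
	       c->nr_empty_pop_pages >= c->nr_pages / 4;
}
```

A 16-page chunk with 4 empty populated pages qualifies when the chunk type
has 20 empty pages in total, but not when it has only 8, because then
depopulating the chunk would leave the system at the watermark.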

The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can be changed when pcpu_lock is
released), and then tries to depopulate each chunk.  The depopulation
is performed in the reverse direction to keep populated pages close to
the beginning of the chunk if the scan stops early because the global
threshold of empty pages is reached.
Depopulated chunks are sidelined to prevent further allocations.
Skipped and fully empty chunks are returned to the corresponding slot.
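
The reverse scan can be modelled in user space as a function that collects
the page ranges to depopulate. This is a simplified sketch: the toy_*
names and the plain bool arrays are assumptions, and the early stop on the
global threshold as well as the locking are left out.

```c
#include <stdbool.h>

/* Half-open range [start, end) of page indices to depopulate. */
struct toy_range {
	int start;
	int end;
};

/*
 * Walk the pages from the last one to the first and collect maximal runs
 * of pages that are both empty and populated.  Returns the number of
 * ranges stored in out (at most max_out).
 */
int toy_find_depop_ranges(const bool *empty, const bool *populated,
			  int nr_pages, struct toy_range *out, int max_out)
{
	int n = 0, end = -1, i;

	for (i = nr_pages - 1; i >= 0; i--) {
		bool candidate = empty[i] && populated[i];

		/* Open a new range at its highest page. */
		if (candidate && end == -1)
			end = i;

		/* Close the range on a non-candidate page or at page 0. */
		if (end != -1 && (!candidate || i == 0)) {
			if (n == max_out)
				break;
			out[n].start = candidate ? i : i + 1;
			out[n].end = end + 1;
			n++;
			end = -1;
		}
	}
	return n;
}
```

Because the ranges come out from the end of the chunk towards its
beginning, stopping after the first few of them leaves the populated pages
clustered at the start of the chunk.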

On the allocation path, if there are no suitable chunks found,
the list of sidelined chunks is scanned prior to creating a new chunk.
If there is a good sidelined chunk, it's placed back to the slot
and the scanning is restarted.

Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |   2 +
 mm/percpu.c          | 158 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 158 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..8e432663c41e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
 	void			*data;		/* chunk data */
 	bool			immutable;	/* no [de]population allowed */
+	bool			isolated;	/* isolated from chunk slot lists */
+	bool			depopulated;    /* sidelined after depopulation */
 	int			start_offset;	/* the overlap with the previous
 						   region to have a page aligned
 						   base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index 357fd6994278..5bb294e394b3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,19 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  */
 int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
+/*
+ * List of chunks with a lot of free pages.  Used to depopulate them
+ * asynchronously.
+ */
+static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+
+/*
+ * List of previously depopulated chunks.  They are not usually used for new
+ * allocations, but can be returned back to service if a need arises.
+ */
+static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+
+
 /*
  * The number of populated pages in use by the allocator, protected by
  * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
@@ -562,6 +575,12 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 {
 	int nslot = pcpu_chunk_slot(chunk);
 
+	/*
+	 * Keep isolated and depopulated chunks on a sideline.
+	 */
+	if (chunk->isolated || chunk->depopulated)
+		return;
+
 	if (oslot != nslot)
 		__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
@@ -1790,6 +1809,19 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		}
 	}
 
+	/* search through sidelined depopulated chunks */
+	list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+		/*
+		 * If the allocation can fit the chunk, place the chunk back
+		 * into corresponding slot and restart the scanning.
+		 */
+		if (pcpu_check_chunk_hint(&chunk->chunk_md, bits, bit_align)) {
+			chunk->depopulated = false;
+			pcpu_chunk_relocate(chunk, -1);
+			goto restart;
+		}
+	}
+
 	spin_unlock_irqrestore(&pcpu_lock, flags);
 
 	/*
@@ -2060,6 +2092,106 @@ static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 	}
 }
 
+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over chunks in the depopulate list, try to release unused populated
+ * pages to the system.  Depopulated chunks are sidelined to prevent further
+ * allocations without a need.  Skipped and fully free chunks are returned
+ * to corresponding slots.  Stop depopulating if the number of empty populated
+ * pages reaches the threshold.  Each chunk is scanned in the reverse order to
+ * keep populated pages close to the beginning of the chunk.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+	struct pcpu_block_md *block;
+	struct pcpu_chunk *chunk, *tmp;
+	LIST_HEAD(to_depopulate);
+	bool depopulated;
+	int i, end;
+
+	spin_lock_irq(&pcpu_lock);
+
+	list_splice_init(&pcpu_depopulate_list[type], &to_depopulate);
+
+	list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
+		WARN_ON(chunk->immutable);
+		depopulated = false;
+
+		/*
+		 * Scan chunk's pages in the reverse order to keep populated
+		 * pages close to the beginning of the chunk.
+		 */
+		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
+			/*
+			 * If the chunk has no empty pages or
+			 * we're short on empty pages in general,
+			 * just put the chunk back into the original slot.
+			 */
+			if (!chunk->nr_empty_pop_pages ||
+			    pcpu_nr_empty_pop_pages[type] <=
+			    PCPU_EMPTY_POP_PAGES_HIGH)
+				break;
+
+			/*
+			 * If the page is empty and populated, start or
+			 * extend the (i, end) range.  If i == 0, decrease
+			 * i and perform the depopulation to cover the last
+			 * (first) page in the chunk.
+			 */
+			block = chunk->md_blocks + i;
+			if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+			    test_bit(i, chunk->populated)) {
+				if (end == -1)
+					end = i;
+				if (i > 0)
+					continue;
+				i--;
+			}
+
+			/*
+			 * Otherwise check if there is an active range,
+			 * and if yes, depopulate it.
+			 */
+			if (end == -1)
+				continue;
+
+			depopulated = true;
+
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+
+			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+
+			/*
+			 * Reset the range and continue.
+			 */
+			end = -1;
+		}
+
+		chunk->isolated = false;
+		if (chunk->free_bytes == pcpu_unit_size || !depopulated) {
+			/*
+			 * If the chunk is empty or hasn't been depopulated,
+			 * return it to the original slot.
+			 */
+			pcpu_chunk_relocate(chunk, -1);
+		} else {
+			/*
+			 * Otherwise put the chunk to the list of depopulated
+			 * chunks.
+			 */
+			chunk->depopulated = true;
+			list_move(&chunk->list, &pcpu_sideline_list[type]);
+		}
+	}
+
+	spin_unlock_irq(&pcpu_lock);
+}
+
 /**
  * pcpu_balance_populated - manage the amount of populated pages
  * @type: chunk type
@@ -2090,6 +2222,8 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 	} else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
 		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
 		pcpu_grow_populated(type, nr_to_pop);
+	} else if (!list_empty(&pcpu_depopulate_list[type])) {
+		pcpu_shrink_populated(type);
 	}
 }
 
@@ -2147,7 +2281,13 @@ void free_percpu(void __percpu *ptr)
 
 	pcpu_memcg_free_hook(chunk, off, size);
 
-	/* if there are more than one fully free chunks, wake up grim reaper */
+	/*
+	 * If there is more than one fully free chunk, wake up the grim reaper.
+	 * Otherwise, if at least 1/4 of the chunk's pages are empty and there
+	 * is no system-wide shortage of empty pages aside from this chunk,
+	 * isolate the chunk and schedule an async depopulation.  If the chunk
+	 * was depopulated previously and has empty pages again, depopulate it.
+	 */
 	if (chunk->free_bytes == pcpu_unit_size) {
 		struct pcpu_chunk *pos;
 
@@ -2156,6 +2296,16 @@ void free_percpu(void __percpu *ptr)
 				need_balance = true;
 				break;
 			}
+	} else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
+		   !chunk->isolated &&
+		   (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+		    PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages) &&
+		   ((chunk->depopulated && chunk->nr_empty_pop_pages) ||
+		    (chunk->nr_empty_pop_pages >= chunk->nr_pages / 4))) {
+		list_move(&chunk->list, &pcpu_depopulate_list[pcpu_chunk_type(chunk)]);
+		chunk->isolated = true;
+		chunk->depopulated = false;
+		need_balance = true;
 	}
 
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
@@ -2583,10 +2733,14 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		      pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
 		      PCPU_NR_CHUNK_TYPES);
 
-	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
 		for (i = 0; i < pcpu_nr_slots; i++)
 			INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);
 
+		INIT_LIST_HEAD(&pcpu_depopulate_list[type]);
+		INIT_LIST_HEAD(&pcpu_sideline_list[type]);
+	}
+
 	/*
 	 * The end of the static region needs to be aligned with the
 	 * minimum allocation size as this offsets the reserved and
-- 
2.30.2



* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
                   ` (5 preceding siblings ...)
  2021-04-08  3:57 ` [PATCH v3 6/6] percpu: implement partial chunk depopulation Roman Gushchin
@ 2021-04-16 12:56 ` Pratik Sampat
  2021-04-16 14:18   ` Dennis Zhou
  6 siblings, 1 reply; 26+ messages in thread
From: Pratik Sampat @ 2021-04-16 12:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:
Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:             1952 kB
Percpu:           219648 kB
Percpu:           219648 kB

5.12-rc6 + patchset
# ./percpu_test.sh
Percpu:             2080 kB
Percpu:           219712 kB
Percpu:            72672 kB

I see an improvement comparable to what you're seeing.

However, on POWERPC I'm unable to reproduce these improvements with the patchset
in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:             5888 kB
Percpu:           118272 kB
Percpu:           118272 kB

5.12-rc6 + patchset
# ./percpu_test.sh
Percpu:             6144 kB
Percpu:           119040 kB
Percpu:           119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?

I will also look through the code to find the reason why POWER isn't
depopulating pages.

Thank you,
Pratik




* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 12:56 ` [PATCH v3 0/6] percpu: " Pratik Sampat
@ 2021-04-16 14:18   ` Dennis Zhou
  2021-04-16 15:28     ` Pratik Sampat
  2021-04-16 16:21     ` Roman Gushchin
  0 siblings, 2 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 14:18 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Roman Gushchin, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> Hello Roman,
> 
> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> 
> My results of the percpu_test are as follows:
> Intel KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu:             1952 kB
> Percpu:           219648 kB
> Percpu:           219648 kB
> 
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu:             2080 kB
> Percpu:           219712 kB
> Percpu:            72672 kB
> 
> I see an improvement comparable to what you're seeing.
> 
> However, on POWERPC I'm unable to reproduce these improvements with the patchset
> in the same configuration.
> 
> POWER9 KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu:             5888 kB
> Percpu:           118272 kB
> Percpu:           118272 kB
> 
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu:             6144 kB
> Percpu:           119040 kB
> Percpu:           119040 kB
> 
> I'm wondering if there's any architecture-specific code that needs plumbing
> here?
> 

There shouldn't be. Can you send me the percpu_stats debug output before
and after?

> I will also look through the code to find the reason why POWER isn't
> depopulating pages.
> 
> Thank you,
> Pratik
> 

Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Thanks,
Dennis


* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 14:18   ` Dennis Zhou
@ 2021-04-16 15:28     ` Pratik Sampat
  2021-04-16 17:13       ` Roman Gushchin
  2021-04-16 16:21     ` Roman Gushchin
  1 sibling, 1 reply; 26+ messages in thread
From: Pratik Sampat @ 2021-04-16 15:28 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Roman Gushchin, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

Hello Dennis,

I apologize for the clutter of logs earlier. I'm pasting the percpu_stats
output from before and after the percpu test, both for 5.12-rc6 with the
patchset applied and for vanilla 5.12-rc6.

On 16/04/21 7:48 pm, Dennis Zhou wrote:
> Hello,
>
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>> Hello Roman,
>>
>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>
>> My results of the percpu_test are as follows:
>> Intel KVM 4CPU:4G
>> Vanilla 5.12-rc6
>> # ./percpu_test.sh
>> Percpu:             1952 kB
>> Percpu:           219648 kB
>> Percpu:           219648 kB
>>
>> 5.12-rc6 + with patchset applied
>> # ./percpu_test.sh
>> Percpu:             2080 kB
>> Percpu:           219712 kB
>> Percpu:            72672 kB
>>
>> I see an improvement comparable to what you're seeing.
>>
>> However, on POWERPC I'm unable to reproduce these improvements with the patchset
>> in the same configuration.
>>
>> POWER9 KVM 4CPU:4G
>> Vanilla 5.12-rc6
>> # ./percpu_test.sh
>> Percpu:             5888 kB
>> Percpu:           118272 kB
>> Percpu:           118272 kB
>>
>> 5.12-rc6 + with patchset applied
>> # ./percpu_test.sh
>> Percpu:             6144 kB
>> Percpu:           119040 kB
>> Percpu:           119040 kB
>>
>> I'm wondering if there's any architecture-specific code that needs plumbing
>> here?
>>
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?

Here are the full debug stats from before and after the test.
5.12-rc6 + patchset
-----BEFORE-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :         9040
   nr_dealloc          :         6994
   nr_cur_alloc        :         2046
   nr_max_alloc        :         2208
   nr_chunks           :            3
   nr_max_chunks       :            3
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           12

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :          859
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16384
   free_bytes          :            0
   contig_bytes        :            0
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          827
   max_alloc_size      :          992
   empty_pop_pages     :            8
   first_bit           :          692
   free_bytes          :       645012
   contig_bytes        :       460096
   sum_frag            :       466420
   max_frag            :       460096
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :          152
   memcg_aware         :            0

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            4
   first_bit           :        29207
   free_bytes          :       506640
   contig_bytes        :       506556
   sum_frag            :           84
   max_frag            :           32
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

-----AFTER-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :        97048
   nr_dealloc          :        95002
   nr_cur_alloc        :         2046
   nr_max_alloc        :        90054
   nr_chunks           :           48
   nr_max_chunks       :           48
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           61

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :          859
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16384
   free_bytes          :            0
   contig_bytes        :            0
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          827
   max_alloc_size      :         1072
   empty_pop_pages     :            8
   first_bit           :          692
   free_bytes          :       645012
   contig_bytes        :       460096
   sum_frag            :       466420
   max_frag            :       460096
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :          152
   memcg_aware         :            0

Chunk:
   nr_alloc            :            0
   max_alloc_size      :            0
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            0

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            7
   first_bit           :        29207
   free_bytes          :       506640
   contig_bytes        :       506556
   sum_frag            :           84
   max_frag            :           32
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1


I'm also pasting the before and after logs from the vanilla kernel.
There is a considerably higher number of chunks in the vanilla kernel than with
the patches, though.

5.12-rc6 vanilla
-----BEFORE-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :         9038
   nr_dealloc          :         6992
   nr_cur_alloc        :         2046
   nr_max_alloc        :         2178
   nr_chunks           :            3
   nr_max_chunks       :            3
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :            5

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :         1088
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16384
   free_bytes          :            0
   contig_bytes        :            0
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          598
   max_alloc_size      :          992
   empty_pop_pages     :            5
   first_bit           :          642
   free_bytes          :       645012
   contig_bytes        :       504292
   sum_frag            :       140720
   max_frag            :       116456
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :          424
   memcg_aware         :            0

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        27909
   free_bytes          :       506640
   contig_bytes        :       506556
   sum_frag            :           84
   max_frag            :           36
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

-----AFTER-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :        97046
   nr_dealloc          :        94237
   nr_cur_alloc        :         2809
   nr_max_alloc        :        90054
   nr_chunks           :           11
   nr_max_chunks       :           47
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           29

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :         1088
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16384
   free_bytes          :            0
   contig_bytes        :            0
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          865
   max_alloc_size      :         1072
   empty_pop_pages     :            6
   first_bit           :          789
   free_bytes          :       640296
   contig_bytes        :       290672
   sum_frag            :       349624
   max_frag            :       169956
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :          536
   free_bytes          :       595752
   contig_bytes        :        26164
   sum_frag            :       575132
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :         1072
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       597428
   contig_bytes        :        26164
   sum_frag            :       596848
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       590360
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       583768
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595752
   contig_bytes        :        26164
   sum_frag            :       577748
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :         1072
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           30
   max_alloc_size      :         1072
   empty_pop_pages     :            6
   first_bit           :            0
   free_bytes          :       636608
   contig_bytes        :       397944
   sum_frag            :       636500
   max_frag            :       426720
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            7
   first_bit           :        27909
   free_bytes          :       506640
   contig_bytes        :       506556
   sum_frag            :           84
   max_frag            :           36
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           12
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       647524
   contig_bytes        :       563492
   sum_frag            :        57872
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :           10
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

>> I will also look through the code to find the reason why POWER isn't
>> depopulating pages.
>>
>> Thank you,
>> Pratik
>>
> Roman, sorry for the delay. I'm looking to apply this today to for-5.14.
>
> Thanks,
> Dennis
Thanks
Pratik

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 14:18   ` Dennis Zhou
  2021-04-16 15:28     ` Pratik Sampat
@ 2021-04-16 16:21     ` Roman Gushchin
  1 sibling, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2021-04-16 16:21 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Pratik Sampat, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

On Fri, Apr 16, 2021 at 02:18:10PM +0000, Dennis Zhou wrote:
> Hello,
> 
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > Hello Roman,
> > 
> > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > 
> > My results of the percpu_test are as follows:
> > Intel KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             1952 kB
> > Percpu:           219648 kB
> > Percpu:           219648 kB
> > 
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             2080 kB
> > Percpu:           219712 kB
> > Percpu:            72672 kB
> > 
> > I'm able to see improvement comparable to that of what you're see too.
> > 
> > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > 
> > POWER9 KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             5888 kB
> > Percpu:           118272 kB
> > Percpu:           118272 kB
> > 
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             6144 kB
> > Percpu:           119040 kB
> > Percpu:           119040 kB
> > 
> > I'm wondering if there's any architectural specific code that needs plumbing
> > here?
> > 
> 
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?

Btw, sidelined chunks are not listed in the debug output. It was actually on my
to-do list; it looks like I need to prioritize it a bit.

> 
> > I will also look through the code to find the reason why POWER isn't
> > depopulating pages.
> > 
> > Thank you,
> > Pratik
> > 
> > On 08/04/21 9:27 am, Roman Gushchin wrote:
> > > In our production experience the percpu memory allocator is sometimes struggling
> > > with returning the memory to the system. A typical example is a creation of
> > > several thousands memory cgroups (each has several chunks of the percpu data
> > > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> > > of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> > > so that sometimes there are several GB's of memory wasted.
> > > 
> > > The underlying problem is the fragmentation: to release an underlying chunk
> > > all percpu allocations should be released first. The percpu allocator tends
> > > to top up chunks to improve the utilization. It means new small-ish allocations
> > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> > > effectively pinning them in memory.
> > > 
> > > This patchset solves this problem by implementing a partial depopulation
> > > of percpu chunks: chunks with many empty pages are being asynchronously
> > > depopulated and the pages are returned to the system.
> > > 
> > > To illustrate the problem the following script can be used:
> > > 
> > > --
> > > #!/bin/bash
> > > 
> > > cd /sys/fs/cgroup
> > > 
> > > mkdir percpu_test
> > > echo "+memory" > percpu_test/cgroup.subtree_control
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >      mkdir percpu_test/cg_"${i}"
> > >      for j in `seq 1 10`; do
> > > 	mkdir percpu_test/cg_"${i}"_"${j}"
> > >      done
> > > done
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >      for j in `seq 1 10`; do
> > > 	rmdir percpu_test/cg_"${i}"_"${j}"
> > >      done
> > > done
> > > 
> > > sleep 10
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >      rmdir percpu_test/cg_"${i}"
> > > done
> > > 
> > > rmdir percpu_test
> > > --
> > > 
> > > It creates 11000 memory cgroups and removes every 10 out of 11.
> > > It prints the initial size of the percpu memory, the size after
> > > creating all cgroups and the size after deleting most of them.
> > > 
> > > Results:
> > >    vanilla:
> > >      ./percpu_test.sh
> > >      Percpu:             7488 kB
> > >      Percpu:           481152 kB
> > >      Percpu:           481152 kB
> > > 
> > >    with this patchset applied:
> > >      ./percpu_test.sh
> > >      Percpu:             7488 kB
> > >      Percpu:           481408 kB
> > >      Percpu:           135552 kB
> > > 
> > > So the total size of the percpu memory was reduced by more than 3.5 times.
> > > 
> > > v3:
> > >    - introduced pcpu_check_chunk_hint()
> > >    - fixed a bug related to the hint check
> > >    - minor cosmetic changes
> > >    - s/pretends/fixes (cc Vlastimil)
> > > 
> > > v2:
> > >    - depopulated chunks are sidelined
> > >    - depopulation happens in the reverse order
> > >    - depopulate list made per-chunk type
> > >    - better results due to better heuristics
> > > 
> > > v1:
> > >    - depopulation heuristics changed and optimized
> > >    - chunks are put into a separate list, depopulation scan this list
> > >    - chunk->isolated is introduced, chunk->depopulate is dropped
> > >    - rearranged patches a bit
> > >    - fixed a panic discovered by krobot
> > >    - made pcpu_nr_empty_pop_pages per chunk type
> > >    - minor fixes
> > > 
> > > rfc:
> > >    https://lwn.net/Articles/850508/ 
> > > 
> > > 
> > > Roman Gushchin (6):
> > >    percpu: fix a comment about the chunks ordering
> > >    percpu: split __pcpu_balance_workfn()
> > >    percpu: make pcpu_nr_empty_pop_pages per chunk type
> > >    percpu: generalize pcpu_balance_populated()
> > >    percpu: factor out pcpu_check_chunk_hint()
> > >    percpu: implement partial chunk depopulation
> > > 
> > >   mm/percpu-internal.h |   4 +-
> > >   mm/percpu-stats.c    |   9 +-
> > >   mm/percpu.c          | 306 +++++++++++++++++++++++++++++++++++--------
> > >   3 files changed, 261 insertions(+), 58 deletions(-)
> > > 
> > 
> 
> Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Great, thanks!

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 15:28     ` Pratik Sampat
@ 2021-04-16 17:13       ` Roman Gushchin
  2021-04-16 18:27         ` Pratik Sampat
  0 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-16 17:13 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> Hello Dennis,
> 
> I apologize for the clutter of logs before, I'm pasting the logs of before and
> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> the vanilla kernel 5.12-rc6.
> 
> On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > Hello,
> > 
> > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > Hello Roman,
> > > 
> > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > 
> > > My results of the percpu_test are as follows:
> > > Intel KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu:             1952 kB
> > > Percpu:           219648 kB
> > > Percpu:           219648 kB
> > > 
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu:             2080 kB
> > > Percpu:           219712 kB
> > > Percpu:            72672 kB
> > > 
> > > I'm able to see improvement comparable to that of what you're see too.
> > > 
> > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > 
> > > POWER9 KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu:             5888 kB
> > > Percpu:           118272 kB
> > > Percpu:           118272 kB
> > > 
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu:             6144 kB
> > > Percpu:           119040 kB
> > > Percpu:           119040 kB
> > > 
> > > I'm wondering if there's any architectural specific code that needs plumbing
> > > here?
> > > 
> > There shouldn't be. Can you send me the percpu_stats debug output before
> > and after?
> 
> I'll paste the whole debug stats before and after here.
> 5.12-rc6 + patchset
> -----BEFORE-----
> Percpu Memory Statistics
> Allocation Info:


Hm, this looks highly suspicious. Here are your stats in a more compact form
(before the test on the left, after the test on the right):

Vanilla

nr_alloc            :         9038         nr_alloc            :        97046
nr_dealloc          :         6992         nr_dealloc          :        94237
nr_cur_alloc        :         2046         nr_cur_alloc        :         2809
nr_max_alloc        :         2178         nr_max_alloc        :        90054
nr_chunks           :            3         nr_chunks           :           11
nr_max_chunks       :            3         nr_max_chunks       :           47
min_alloc_size      :            4         min_alloc_size      :            4
max_alloc_size      :         1072         max_alloc_size      :         1072
empty_pop_pages     :            5         empty_pop_pages     :           29


Patched

nr_alloc            :         9040         nr_alloc            :        97048
nr_dealloc          :         6994         nr_dealloc          :        95002
nr_cur_alloc        :         2046         nr_cur_alloc        :         2046
nr_max_alloc        :         2208         nr_max_alloc        :        90054
nr_chunks           :            3         nr_chunks           :           48
nr_max_chunks       :            3         nr_max_chunks       :           48
min_alloc_size      :            4         min_alloc_size      :            4
max_alloc_size      :         1072         max_alloc_size      :         1072
empty_pop_pages     :           12         empty_pop_pages     :           61


So it looks like the number of chunks got bigger, as well as the number of
empty_pop_pages? This contradicts what you wrote, so can you please make
sure that the data is correct and we're not mixing up the two cases?
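
One way to sanity-check a saved dump before comparing the two cases is
sketched below. It assumes the output of /sys/kernel/debug/percpu_stats was
saved to a file; the helper names stat_field and check_cur_alloc are made up
for illustration. It relies on the stats code bumping nr_alloc on every
allocation and nr_dealloc on every free, so nr_cur_alloc should equal their
difference (9038 - 6992 = 2046 in the first column above).

```shell
#!/bin/bash
# Hypothetical helpers for a saved copy of /sys/kernel/debug/percpu_stats.

# Print the first value of field $1 found in dump file $2.
# The global stats are printed before the per-chunk stats, so the
# first match is the global counter.
stat_field() {
    awk -v key="$1" '$1 == key && $2 == ":" { print $3; exit }' "$2"
}

# Succeed if nr_cur_alloc == nr_alloc - nr_dealloc in dump file $1.
check_cur_alloc() {
    local alloc dealloc cur
    alloc=$(stat_field nr_alloc "$1")
    dealloc=$(stat_field nr_dealloc "$1")
    cur=$(stat_field nr_cur_alloc "$1")
    [ "$((alloc - dealloc))" -eq "$cur" ]
}
```

For example, after "cat /sys/kernel/debug/percpu_stats > after.txt", running
"check_cur_alloc after.txt && echo consistent" would confirm that the counters
in that dump are at least self-consistent.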

So it looks like for some reason sidelined (depopulated) chunks are not getting
freed completely. But I struggle to explain why the initial empty_pop_pages is
bigger with the same number of chunks.

So can you please apply the following patch and provide updated statistics?

--

From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 16 Apr 2021 09:54:38 -0700
Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug
 output

Information about sidelined chunks and chunks in the depopulate queue
could be extremely valuable for debugging different problems.

Dump information about these chunks on par with regular chunks
in percpu slots via the percpu stats interface.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |  2 ++
 mm/percpu-stats.c    | 10 ++++++++++
 mm/percpu.c          |  4 ++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 8e432663c41e..c11f115ced5c 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock;
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages[];
+extern struct list_head pcpu_depopulate_list[];
+extern struct list_head pcpu_sideline_list[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index f6026dbcdf6b..af09ed1ea5f8 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 				}
 			}
 		}
+
+		list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+			seq_puts(m, "Chunk (sidelined):\n");
+			chunk_map_stats(m, chunk, buffer);
+		}
+
+		list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) {
+			seq_puts(m, "Chunk (to depopulate):\n");
+			chunk_map_stats(m, chunk, buffer);
+		}
 	}
 
 	spin_unlock_irq(&pcpu_lock);
diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb294e394b3..ded3a7541cb2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
  * List of chunks with a lot of free pages.  Used to depopulate them
  * asynchronously.
  */
-static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
 
 /*
  * List of previously depopulated chunks.  They are not usually used for new
  * allocations, but can be returned back to service if a need arises.
  */
-static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
 
 
 /*
-- 
2.30.2


* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 17:13       ` Roman Gushchin
@ 2021-04-16 18:27         ` Pratik Sampat
  2021-04-16 18:34           ` Roman Gushchin
  0 siblings, 1 reply; 26+ messages in thread
From: Pratik Sampat @ 2021-04-16 18:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat



On 16/04/21 10:43 pm, Roman Gushchin wrote:
> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>> Hello Dennis,
>>
>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>> the vanilla kernel 5.12-rc6.
>>
>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>> Hello,
>>>
>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>> Hello Roman,
>>>>
>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>
>>>> My results of the percpu_test are as follows:
>>>> Intel KVM 4CPU:4G
>>>> Vanilla 5.12-rc6
>>>> # ./percpu_test.sh
>>>> Percpu:             1952 kB
>>>> Percpu:           219648 kB
>>>> Percpu:           219648 kB
>>>>
>>>> 5.12-rc6 + with patchset applied
>>>> # ./percpu_test.sh
>>>> Percpu:             2080 kB
>>>> Percpu:           219712 kB
>>>> Percpu:            72672 kB
>>>>
>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>
>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>
>>>> POWER9 KVM 4CPU:4G
>>>> Vanilla 5.12-rc6
>>>> # ./percpu_test.sh
>>>> Percpu:             5888 kB
>>>> Percpu:           118272 kB
>>>> Percpu:           118272 kB
>>>>
>>>> 5.12-rc6 + with patchset applied
>>>> # ./percpu_test.sh
>>>> Percpu:             6144 kB
>>>> Percpu:           119040 kB
>>>> Percpu:           119040 kB
>>>>
>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>> here?
>>>>
>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>> and after?
>> I'll paste the whole debug stats before and after here.
>> 5.12-rc6 + patchset
>> -----BEFORE-----
>> Percpu Memory Statistics
>> Allocation Info:
>
> Hm, this looks highly suspicious. Here is your stats in a more compact form:
>
> Vanilla
>
> nr_alloc            :         9038         nr_alloc            :        97046
> nr_dealloc          :         6992	   nr_dealloc          :        94237
> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2809
> nr_max_alloc        :         2178	   nr_max_alloc        :        90054
> nr_chunks           :            3	   nr_chunks           :           11
> nr_max_chunks       :            3	   nr_max_chunks       :           47
> min_alloc_size      :            4	   min_alloc_size      :            4
> max_alloc_size      :         1072	   max_alloc_size      :         1072
> empty_pop_pages     :            5	   empty_pop_pages     :           29
>
>
> Patched
>
> nr_alloc            :         9040         nr_alloc            :        97048
> nr_dealloc          :         6994	   nr_dealloc          :        95002
> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2046
> nr_max_alloc        :         2208	   nr_max_alloc        :        90054
> nr_chunks           :            3	   nr_chunks           :           48
> nr_max_chunks       :            3	   nr_max_chunks       :           48
> min_alloc_size      :            4	   min_alloc_size      :            4
> max_alloc_size      :         1072	   max_alloc_size      :         1072
> empty_pop_pages     :           12	   empty_pop_pages     :           61
>
>
> So it looks like the number of chunks got bigger, as well as the number of
> empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> sure that the data is correct and we're not messing two cases?
>
> So it looks like for some reason sidelined (depopulated) chunks are not getting
> freed completely. But I struggle to explain why the initial empty_pop_pages is
> bigger with the same amount of chunks.
>
> So, can you, please, apply the following patch and provide an updated statistics?

Unfortunately, I'm not completely well versed in this area, but yes the empty
pop pages number doesn't make sense to me either.

I re-ran the numbers to make sure my experiment setup is sane, but the
results remain the same.

Vanilla
nr_alloc            :         9040         nr_alloc            :        97048
nr_dealloc          :         6994         nr_dealloc          :        94404
nr_cur_alloc        :         2046         nr_cur_alloc        :         2644
nr_max_alloc        :         2169         nr_max_alloc        :        90054
nr_chunks           :            3         nr_chunks           :           10
nr_max_chunks       :            3         nr_max_chunks       :           47
min_alloc_size      :            4         min_alloc_size      :            4
max_alloc_size      :         1072         max_alloc_size      :         1072
empty_pop_pages     :            4         empty_pop_pages     :           32

With the patchset + debug patch the results are as follows:
Patched

nr_alloc            :         9040         nr_alloc            :        97048
nr_dealloc          :         6994         nr_dealloc          :        94349
nr_cur_alloc        :         2046         nr_cur_alloc        :         2699
nr_max_alloc        :         2194         nr_max_alloc        :        90054
nr_chunks           :            3         nr_chunks           :           48
nr_max_chunks       :            3         nr_max_chunks       :           48
min_alloc_size      :            4         min_alloc_size      :            4
max_alloc_size      :         1072         max_alloc_size      :         1072
empty_pop_pages     :           12         empty_pop_pages     :           54

With the extra tracing I can see 39 entries of "Chunk (sidelined)"
after the test was run. I don't see any entries for "Chunk (to depopulate)".

I've snipped the results of the sidelined chunks because they went on for ~600
lines. If you need the full logs, let me know.
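
The counts above can be reproduced from a saved dump with something like the
following sketch. The header strings are the ones printed by the debug patch;
count_chunks and the dump file name are hypothetical.

```shell
#!/bin/bash
# Count lines in dump file $2 that exactly match the chunk header $1.
# Note: grep -c still prints 0 when nothing matches, but exits non-zero.
count_chunks() {
    grep -c -x -F -e "$1" "$2"
}
```

With the stats saved to after.txt, 'count_chunks "Chunk (sidelined):" after.txt'
and 'count_chunks "Chunk (to depopulate):" after.txt' give the two numbers.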

Thank you,
Pratik

> --
>
>  From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Fri, 16 Apr 2021 09:54:38 -0700
> Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug
>   output
>
> Information about sidelined chunks and chunks in the depopulate queue
> could be extremely valuable for debugging different problems.
>
> Dump information about these chunks on pair with regular chunks
> in percpu slots via percpu stats interface.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>   mm/percpu-internal.h |  2 ++
>   mm/percpu-stats.c    | 10 ++++++++++
>   mm/percpu.c          |  4 ++--
>   3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 8e432663c41e..c11f115ced5c 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock;
>   extern struct list_head *pcpu_chunk_lists;
>   extern int pcpu_nr_slots;
>   extern int pcpu_nr_empty_pop_pages[];
> +extern struct list_head pcpu_depopulate_list[];
> +extern struct list_head pcpu_sideline_list[];
>   
>   extern struct pcpu_chunk *pcpu_first_chunk;
>   extern struct pcpu_chunk *pcpu_reserved_chunk;
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index f6026dbcdf6b..af09ed1ea5f8 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>   				}
>   			}
>   		}
> +
> +		list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
> +			seq_puts(m, "Chunk (sidelined):\n");
> +			chunk_map_stats(m, chunk, buffer);
> +		}
> +
> +		list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) {
> +			seq_puts(m, "Chunk (to depopulate):\n");
> +			chunk_map_stats(m, chunk, buffer);
> +		}
>   	}
>   
>   	spin_unlock_irq(&pcpu_lock);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 5bb294e394b3..ded3a7541cb2 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
>    * List of chunks with a lot of free pages.  Used to depopulate them
>    * asynchronously.
>    */
> -static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
> +struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
>   
>   /*
>    * List of previously depopulated chunks.  They are not usually used for new
>    * allocations, but can be returned back to service if a need arises.
>    */
> -static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
> +struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
>   
>   
>   /*


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 18:27         ` Pratik Sampat
@ 2021-04-16 18:34           ` Roman Gushchin
  2021-04-16 18:41             ` Pratik Sampat
  0 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-16 18:34 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> 
> 
> On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > Hello Dennis,
> > > 
> > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > the vanilla kernel 5.12-rc6.
> > > 
> > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > Hello,
> > > > 
> > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > Hello Roman,
> > > > > 
> > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > 
> > > > > My results of the percpu_test are as follows:
> > > > > Intel KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu:             1952 kB
> > > > > Percpu:           219648 kB
> > > > > Percpu:           219648 kB
> > > > > 
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu:             2080 kB
> > > > > Percpu:           219712 kB
> > > > > Percpu:            72672 kB
> > > > > 
> > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > > 
> > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > 
> > > > > POWER9 KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu:             5888 kB
> > > > > Percpu:           118272 kB
> > > > > Percpu:           118272 kB
> > > > > 
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu:             6144 kB
> > > > > Percpu:           119040 kB
> > > > > Percpu:           119040 kB
> > > > > 
> > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > here?
> > > > > 
> > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > and after?
> > > I'll paste the whole debug stats before and after here.
> > > 5.12-rc6 + patchset
> > > -----BEFORE-----
> > > Percpu Memory Statistics
> > > Allocation Info:
> > 
> > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > 
> > Vanilla
> > 
> > nr_alloc            :         9038         nr_alloc            :        97046
> > nr_dealloc          :         6992	   nr_dealloc          :        94237
> > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2809
> > nr_max_alloc        :         2178	   nr_max_alloc        :        90054
> > nr_chunks           :            3	   nr_chunks           :           11
> > nr_max_chunks       :            3	   nr_max_chunks       :           47
> > min_alloc_size      :            4	   min_alloc_size      :            4
> > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > empty_pop_pages     :            5	   empty_pop_pages     :           29
> > 
> > 
> > Patched
> > 
> > nr_alloc            :         9040         nr_alloc            :        97048
> > nr_dealloc          :         6994	   nr_dealloc          :        95002
> > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2046
> > nr_max_alloc        :         2208	   nr_max_alloc        :        90054
> > nr_chunks           :            3	   nr_chunks           :           48
> > nr_max_chunks       :            3	   nr_max_chunks       :           48
> > min_alloc_size      :            4	   min_alloc_size      :            4
> > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > empty_pop_pages     :           12	   empty_pop_pages     :           61
> > 
> > 
> > So it looks like the number of chunks got bigger, as well as the number of
> > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > sure that the data is correct and we're not messing two cases?
> > 
> > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > bigger with the same amount of chunks.
> > 
> > So, can you, please, apply the following patch and provide an updated statistics?
> 
> Unfortunately, I'm not completely well versed in this area, but yes the empty
> pop pages number doesn't make sense to me either.
> 
> I re-ran the numbers trying to make sure my experiment setup is sane but
> results remain the same.
> 
> Vanilla
> nr_alloc            :         9040         nr_alloc            :        97048
> nr_dealloc          :         6994	   nr_dealloc          :        94404
> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2644
> nr_max_alloc        :         2169	   nr_max_alloc        :        90054
> nr_chunks           :            3	   nr_chunks           :           10
> nr_max_chunks       :            3	   nr_max_chunks       :           47
> min_alloc_size      :            4	   min_alloc_size      :            4
> max_alloc_size      :         1072	   max_alloc_size      :         1072
> empty_pop_pages     :            4	   empty_pop_pages     :           32
> 
> With the patchset + debug patch the results are as follows:
> Patched
> 
> nr_alloc            :         9040         nr_alloc            :        97048
> nr_dealloc          :         6994	   nr_dealloc          :        94349
> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2699
> nr_max_alloc        :         2194	   nr_max_alloc        :        90054
> nr_chunks           :            3	   nr_chunks           :           48
> nr_max_chunks       :            3	   nr_max_chunks       :           48
> min_alloc_size      :            4	   min_alloc_size      :            4
> max_alloc_size      :         1072	   max_alloc_size      :         1072
> empty_pop_pages     :           12	   empty_pop_pages     :           54
> 
> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> after the test was run. I don't see any entries for "Chunk (to depopulate)"
> 
> I've snipped the results of slidelined chunks because they went on for ~600
> lines, if you need the full logs let me know.

Yes, please! That's the most interesting part!

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 18:34           ` Roman Gushchin
@ 2021-04-16 18:41             ` Pratik Sampat
  2021-04-16 19:09               ` Roman Gushchin
  0 siblings, 1 reply; 26+ messages in thread
From: Pratik Sampat @ 2021-04-16 18:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat



On 17/04/21 12:04 am, Roman Gushchin wrote:
> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>
>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>> Hello Dennis,
>>>>
>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>>>> the vanilla kernel 5.12-rc6.
>>>>
>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>> Hello Roman,
>>>>>>
>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>
>>>>>> My results of the percpu_test are as follows:
>>>>>> Intel KVM 4CPU:4G
>>>>>> Vanilla 5.12-rc6
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             1952 kB
>>>>>> Percpu:           219648 kB
>>>>>> Percpu:           219648 kB
>>>>>>
>>>>>> 5.12-rc6 + with patchset applied
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             2080 kB
>>>>>> Percpu:           219712 kB
>>>>>> Percpu:            72672 kB
>>>>>>
>>>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>>>
>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>>>
>>>>>> POWER9 KVM 4CPU:4G
>>>>>> Vanilla 5.12-rc6
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             5888 kB
>>>>>> Percpu:           118272 kB
>>>>>> Percpu:           118272 kB
>>>>>>
>>>>>> 5.12-rc6 + with patchset applied
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             6144 kB
>>>>>> Percpu:           119040 kB
>>>>>> Percpu:           119040 kB
>>>>>>
>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>> here?
>>>>>>
>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>> and after?
>>>> I'll paste the whole debug stats before and after here.
>>>> 5.12-rc6 + patchset
>>>> -----BEFORE-----
>>>> Percpu Memory Statistics
>>>> Allocation Info:
>>> Hm, this looks highly suspicious. Here is your stats in a more compact form:
>>>
>>> Vanilla
>>>
>>> nr_alloc            :         9038         nr_alloc            :        97046
>>> nr_dealloc          :         6992	   nr_dealloc          :        94237
>>> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2809
>>> nr_max_alloc        :         2178	   nr_max_alloc        :        90054
>>> nr_chunks           :            3	   nr_chunks           :           11
>>> nr_max_chunks       :            3	   nr_max_chunks       :           47
>>> min_alloc_size      :            4	   min_alloc_size      :            4
>>> max_alloc_size      :         1072	   max_alloc_size      :         1072
>>> empty_pop_pages     :            5	   empty_pop_pages     :           29
>>>
>>>
>>> Patched
>>>
>>> nr_alloc            :         9040         nr_alloc            :        97048
>>> nr_dealloc          :         6994	   nr_dealloc          :        95002
>>> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2046
>>> nr_max_alloc        :         2208	   nr_max_alloc        :        90054
>>> nr_chunks           :            3	   nr_chunks           :           48
>>> nr_max_chunks       :            3	   nr_max_chunks       :           48
>>> min_alloc_size      :            4	   min_alloc_size      :            4
>>> max_alloc_size      :         1072	   max_alloc_size      :         1072
>>> empty_pop_pages     :           12	   empty_pop_pages     :           61
>>>
>>>
>>> So it looks like the number of chunks got bigger, as well as the number of
>>> empty_pop_pages? This contradicts what you wrote, so can you please make
>>> sure that the data is correct and we're not mixing up two cases?
>>>
>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>> bigger with the same amount of chunks.
>>>
>>> So, can you please apply the following patch and provide updated statistics?
>> Unfortunately, I'm not completely well versed in this area, but yes the empty
>> pop pages number doesn't make sense to me either.
>>
>> I re-ran the numbers trying to make sure my experiment setup is sane but
>> results remain the same.
>>
>> Vanilla
>> nr_alloc            :         9040         nr_alloc            :        97048
>> nr_dealloc          :         6994	   nr_dealloc          :        94404
>> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2644
>> nr_max_alloc        :         2169	   nr_max_alloc        :        90054
>> nr_chunks           :            3	   nr_chunks           :           10
>> nr_max_chunks       :            3	   nr_max_chunks       :           47
>> min_alloc_size      :            4	   min_alloc_size      :            4
>> max_alloc_size      :         1072	   max_alloc_size      :         1072
>> empty_pop_pages     :            4	   empty_pop_pages     :           32
>>
>> With the patchset + debug patch the results are as follows:
>> Patched
>>
>> nr_alloc            :         9040         nr_alloc            :        97048
>> nr_dealloc          :         6994	   nr_dealloc          :        94349
>> nr_cur_alloc        :         2046	   nr_cur_alloc        :         2699
>> nr_max_alloc        :         2194	   nr_max_alloc        :        90054
>> nr_chunks           :            3	   nr_chunks           :           48
>> nr_max_chunks       :            3	   nr_max_chunks       :           48
>> min_alloc_size      :            4	   min_alloc_size      :            4
>> max_alloc_size      :         1072	   max_alloc_size      :         1072
>> empty_pop_pages     :           12	   empty_pop_pages     :           54
>>
>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>
>> I've snipped the results of sidelined chunks because they went on for ~600
>> lines; if you need the full logs, let me know.
> Yes, please! That's the most interesting part!

Got it. Pasting the full logs from after the percpu experiment completed.

Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :        97048
   nr_dealloc          :        94349
   nr_cur_alloc        :         2699
   nr_max_alloc        :        90054
   nr_chunks           :           48
   nr_max_chunks       :           48
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           54

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :         1081
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16117
   free_bytes          :            4
   contig_bytes        :            4
   sum_frag            :            4
   max_frag            :            4
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          826
   max_alloc_size      :         1072
   empty_pop_pages     :            6
   first_bit           :          819
   free_bytes          :       640660
   contig_bytes        :       249896
   sum_frag            :       464700
   max_frag            :       306216
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :            0
   max_alloc_size      :            0
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            0

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :          536
   free_bytes          :       595752
   contig_bytes        :        26164
   sum_frag            :       575132
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :         1072
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       597428
   contig_bytes        :        26164
   sum_frag            :       596848
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       590360
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       583768
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            7
   first_bit           :        26595
   free_bytes          :       506640
   contig_bytes        :       506540
   sum_frag            :          100
   max_frag            :           36
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           12
   max_alloc_size      :         1072
   empty_pop_pages     :            3
   first_bit           :            0
   free_bytes          :       647524
   contig_bytes        :       563492
   sum_frag            :        57872
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :           52
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       621404
   contig_bytes        :       203104
   sum_frag            :       603400
   max_frag            :       260656
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            4
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       652748
   contig_bytes        :       570600
   sum_frag            :       570600
   max_frag            :       570600
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1
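
A dump of this length is easier to cross-check with a small script than by eye.
The following is only a sketch, assuming the text layout shown above ("Chunk:",
"Chunk (sidelined):", and "key : value" lines). The helper name tally_chunks and
the "active" label for unlabelled chunks are made up for illustration and are not
part of the kernel's percpu_stats output.

```python
import re
from collections import Counter

# Chunk headers as they appear in the dump above, e.g.
# "Chunk:", "Chunk: <- First Chunk", "Chunk (sidelined):".
CHUNK_RE = re.compile(r"^Chunk(?: \((?P<label>[^)]+)\))?:")
# Stat lines of the form "   empty_pop_pages     :            1".
FIELD_RE = re.compile(r"^\s*(?P<key>\w+)\s*:\s*(?P<value>\d+)\s*$")


def tally_chunks(text):
    """Return (chunk count per label, empty_pop_pages sum per label)."""
    counts = Counter()
    empty_pages = Counter()
    label = None  # stays None until the first chunk header, so global stats are skipped
    for line in text.splitlines():
        header = CHUNK_RE.match(line)
        if header:
            # "active" is a hypothetical name for chunks without a "(...)" label.
            label = header.group("label") or "active"
            counts[label] += 1
            continue
        field = FIELD_RE.match(line)
        if field and label is not None and field.group("key") == "empty_pop_pages":
            empty_pages[label] += int(field.group("value"))
    return counts, empty_pages


# Shortened excerpt in the same layout as the dump above.
SAMPLE = """\
Global Stats:
   empty_pop_pages     :           54
Per Chunk Stats:
Chunk: <- First Chunk
   nr_alloc            :         1081
   empty_pop_pages     :            0
Chunk:
   nr_alloc            :          826
   empty_pop_pages     :            6
Chunk (sidelined):
   nr_alloc            :           52
   empty_pop_pages     :            0
Chunk (sidelined):
   nr_alloc            :            0
   empty_pop_pages     :            1
"""

if __name__ == "__main__":
    counts, empty_pages = tally_chunks(SAMPLE)
    for label in counts:
        print(f"{label}: {counts[label]} chunks, "
              f"{empty_pages[label]} empty populated pages")
```

Counting the full dump above by hand gives 9 unlabelled chunks and 39 sidelined
ones, with 17 and 37 empty populated pages respectively. That agrees with the
global nr_chunks of 48 and empty_pop_pages of 54, so the script should report the
same figures when fed the complete output.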




* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 18:41             ` Pratik Sampat
@ 2021-04-16 19:09               ` Roman Gushchin
  2021-04-16 19:44                 ` Pratik Sampat
  0 siblings, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-16 19:09 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> 
> 
> On 17/04/21 12:04 am, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > 
> > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > Hello Dennis,
> > > > > 
> > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > > > the vanilla kernel 5.12-rc6.
> > > > > 
> > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Roman,
> > > > > > > 
> > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > > 
> > > > > > > My results of the percpu_test are as follows:
> > > > > > > Intel KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             1952 kB
> > > > > > > Percpu:           219648 kB
> > > > > > > Percpu:           219648 kB
> > > > > > > 
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             2080 kB
> > > > > > > Percpu:           219712 kB
> > > > > > > Percpu:            72672 kB
> > > > > > > 
> > > > > > > > I'm able to see an improvement comparable to what you're seeing too.
> > > > > > > 
> > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > > > 
> > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             5888 kB
> > > > > > > Percpu:           118272 kB
> > > > > > > Percpu:           118272 kB
> > > > > > > 
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             6144 kB
> > > > > > > Percpu:           119040 kB
> > > > > > > Percpu:           119040 kB
> > > > > > > 
> > > > > > > > I'm wondering if there's any architecture-specific code that needs plumbing
> > > > > > > here?
> > > > > > > 
> > > > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > > > and after?
> > > > > I'll paste the whole debug stats before and after here.
> > > > > 5.12-rc6 + patchset
> > > > > -----BEFORE-----
> > > > > Percpu Memory Statistics
> > > > > Allocation Info:
> > > > > Hm, this looks highly suspicious. Here are your stats in a more compact form:
> > > > 
> > > > Vanilla
> > > > 
> > > > nr_alloc            :         9038         nr_alloc            :        97046
> > > > nr_dealloc          :         6992	   nr_dealloc          :        94237
> > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2809
> > > > nr_max_alloc        :         2178	   nr_max_alloc        :        90054
> > > > nr_chunks           :            3	   nr_chunks           :           11
> > > > nr_max_chunks       :            3	   nr_max_chunks       :           47
> > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > empty_pop_pages     :            5	   empty_pop_pages     :           29
> > > > 
> > > > 
> > > > Patched
> > > > 
> > > > nr_alloc            :         9040         nr_alloc            :        97048
> > > > nr_dealloc          :         6994	   nr_dealloc          :        95002
> > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2046
> > > > nr_max_alloc        :         2208	   nr_max_alloc        :        90054
> > > > nr_chunks           :            3	   nr_chunks           :           48
> > > > nr_max_chunks       :            3	   nr_max_chunks       :           48
> > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > empty_pop_pages     :           12	   empty_pop_pages     :           61
> > > > 
> > > > 
> > > > So it looks like the number of chunks got bigger, as well as the number of
> > > > empty_pop_pages? This contradicts what you wrote, so can you please make
> > > > sure that the data is correct and we're not mixing up the two cases?
> > > > 
> > > > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > > > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > > > bigger with the same amount of chunks.
> > > > 
> > > > So, can you please apply the following patch and provide updated statistics?
> > > Unfortunately, I'm not completely well versed in this area, but yes the empty
> > > pop pages number doesn't make sense to me either.
> > > 
> > > I re-ran the numbers trying to make sure my experiment setup is sane but
> > > results remain the same.
> > > 
> > > Vanilla
> > > nr_alloc            :         9040         nr_alloc            :        97048
> > > nr_dealloc          :         6994         nr_dealloc          :        94404
> > > nr_cur_alloc        :         2046         nr_cur_alloc        :         2644
> > > nr_max_alloc        :         2169         nr_max_alloc        :        90054
> > > nr_chunks           :            3         nr_chunks           :           10
> > > nr_max_chunks       :            3         nr_max_chunks       :           47
> > > min_alloc_size      :            4         min_alloc_size      :            4
> > > max_alloc_size      :         1072         max_alloc_size      :         1072
> > > empty_pop_pages     :            4         empty_pop_pages     :           32
> > > 
> > > With the patchset + debug patch the results are as follows:
> > > Patched
> > > 
> > > nr_alloc            :         9040         nr_alloc            :        97048
> > > nr_dealloc          :         6994         nr_dealloc          :        94349
> > > nr_cur_alloc        :         2046         nr_cur_alloc        :         2699
> > > nr_max_alloc        :         2194         nr_max_alloc        :        90054
> > > nr_chunks           :            3         nr_chunks           :           48
> > > nr_max_chunks       :            3         nr_max_chunks       :           48
> > > min_alloc_size      :            4         min_alloc_size      :            4
> > > max_alloc_size      :         1072         max_alloc_size      :         1072
> > > empty_pop_pages     :           12         empty_pop_pages     :           54
> > > 
> > > With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> > > after the test was run. I don't see any entries for "Chunk (to depopulate)".
> > > 
> > > I've snipped the results for the sidelined chunks because they went on for ~600
> > > lines; if you need the full logs, let me know.
> > Yes, please! That's the most interesting part!
> 
> Got it. Pasting the full logs from after the percpu experiment completed.

Thanks!

Would you mind applying the following patch and testing again?

--

diff --git a/mm/percpu.c b/mm/percpu.c
index ded3a7541cb2..532c6a7ebdfd 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
                                need_balance = true;
                                break;
                        }
+
+               chunk->depopulated = false;
+               pcpu_chunk_relocate(chunk, -1);
        } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
                   !chunk->isolated &&
                   (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 19:09               ` Roman Gushchin
@ 2021-04-16 19:44                 ` Pratik Sampat
  2021-04-16 20:03                   ` Roman Gushchin
  2021-04-16 21:47                   ` Dennis Zhou
  0 siblings, 2 replies; 26+ messages in thread
From: Pratik Sampat @ 2021-04-16 19:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat



On 17/04/21 12:39 am, Roman Gushchin wrote:
> Thanks!
>
> Would you mind applying the following patch and testing again?
>
> --
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index ded3a7541cb2..532c6a7ebdfd 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
>                                  need_balance = true;
>                                  break;
>                          }
> +
> +               chunk->depopulated = false;
> +               pcpu_chunk_relocate(chunk, -1);
>          } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
>                     !chunk->isolated &&
>                     (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>
Sure thing.

I see far fewer sidelined chunks. In one such test run I saw zero occurrences
of sidelined chunks.
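
For anyone repeating this, a small helper along these lines can count the
sidelined chunks instead of eyeballing the dump. It is only a sketch:
count_sidelined is a name made up here, the default path assumes debugfs is
mounted at /sys/kernel/debug on a kernel built with CONFIG_PERCPU_STATS, and
the "Chunk (sidelined):" label is only printed with the debug patch from this
thread applied.

```shell
# Count "Chunk (sidelined):" entries in a percpu_stats dump.
# $1 (optional): path to a saved dump; defaults to the live debugfs file
# (assumed mount point, see above).
count_sidelined() {
    # grep -c prints the number of matching lines, but exits non-zero when
    # that number is 0, so mask the status to keep "0" a normal result.
    grep -cF 'Chunk (sidelined):' "${1:-/sys/kernel/debug/percpu_stats}" || true
}
```

Calling count_sidelined with no argument reads the live stats; passing a file
name works on a dump saved earlier.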

Pasting the full logs as an example:

BEFORE
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :         9038
   nr_dealloc          :         6992
   nr_cur_alloc        :         2046
   nr_max_alloc        :         2200
   nr_chunks           :            3
   nr_max_chunks       :            3
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           12

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :         1092
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16247
   free_bytes          :            4
   contig_bytes        :            4
   sum_frag            :            4
   max_frag            :            4
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          594
   max_alloc_size      :          992
   empty_pop_pages     :            8
   first_bit           :          456
   free_bytes          :       645008
   contig_bytes        :       319984
   sum_frag            :       325024
   max_frag            :       318680
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :          424
   memcg_aware         :            0

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            4
   first_bit           :        26595
   free_bytes          :       506640
   contig_bytes        :       506540
   sum_frag            :          100
   max_frag            :           32
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1


AFTER
Percpu Memory Statistics
Allocation Info:
----------------------------------------
   unit_size           :       655360
   static_size         :       608920
   reserved_size       :            0
   dyn_size            :        46440
   atom_size           :        65536
   alloc_size          :       655360

Global Stats:
----------------------------------------
   nr_alloc            :        97046
   nr_dealloc          :        94304
   nr_cur_alloc        :         2742
   nr_max_alloc        :        90054
   nr_chunks           :           11
   nr_max_chunks       :           47
   min_alloc_size      :            4
   max_alloc_size      :         1072
   empty_pop_pages     :           18

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
   nr_alloc            :         1092
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :        16247
   free_bytes          :            4
   contig_bytes        :            4
   sum_frag            :            4
   max_frag            :            4
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :          838
   max_alloc_size      :         1072
   empty_pop_pages     :            7
   first_bit           :          464
   free_bytes          :       640476
   contig_bytes        :       290672
   sum_frag            :       349804
   max_frag            :       304344
   cur_min_alloc       :            4
   cur_med_alloc       :            8
   cur_max_alloc       :         1072
   memcg_aware         :            0

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :          536
   free_bytes          :       595752
   contig_bytes        :        26164
   sum_frag            :       575132
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :         1072
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           90
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       597428
   contig_bytes        :        26164
   sum_frag            :       596848
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       590360
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           92
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       595284
   contig_bytes        :        26164
   sum_frag            :       583768
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :          360
   max_alloc_size      :         1072
   empty_pop_pages     :            7
   first_bit           :        26595
   free_bytes          :       506640
   contig_bytes        :       506540
   sum_frag            :          100
   max_frag            :           32
   cur_min_alloc       :            4
   cur_med_alloc       :          156
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :           12
   max_alloc_size      :         1072
   empty_pop_pages     :            3
   first_bit           :            0
   free_bytes          :       647524
   contig_bytes        :       563492
   sum_frag            :        57872
   max_frag            :        26164
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk:
   nr_alloc            :            0
   max_alloc_size      :         1072
   empty_pop_pages     :            1
   first_bit           :            0
   free_bytes          :       655360
   contig_bytes        :       655360
   sum_frag            :            0
   max_frag            :            0
   cur_min_alloc       :            0
   cur_med_alloc       :            0
   cur_max_alloc       :            0
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :           72
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       608344
   contig_bytes        :       145552
   sum_frag            :       590340
   max_frag            :       145552
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1

Chunk (sidelined):
   nr_alloc            :            4
   max_alloc_size      :         1072
   empty_pop_pages     :            0
   first_bit           :            0
   free_bytes          :       652748
   contig_bytes        :       426720
   sum_frag            :       426720
   max_frag            :       426720
   cur_min_alloc       :          156
   cur_med_alloc       :          312
   cur_max_alloc       :         1072
   memcg_aware         :            1
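
A quick consistency check on a dump like the AFTER one above is to compare the
global empty_pop_pages with the sum of the per-chunk values. The sketch below
assumes the layout shown in these logs (one "Global Stats" block followed by
"Chunk" entries, a single dump per file); check_empty_pop_pages is a made-up
name and the default path is the usual debugfs location, which is an
assumption about the setup.

```shell
# Compare the global empty_pop_pages counter with the per-chunk sum.
# $1 (optional): path to a single saved dump; defaults to the live file.
check_empty_pop_pages() {
    awk '
        # Every per-chunk block starts with a line beginning with "Chunk".
        /^Chunk/                              { in_chunk = 1 }
        # Before the first chunk header the value is the global counter.
        $1 == "empty_pop_pages" && !in_chunk  { global = $3 }
        # After it, each value belongs to one chunk.
        $1 == "empty_pop_pages" && in_chunk   { sum += $3 }
        END { printf "global=%d per_chunk_sum=%d\n", global, sum }
    ' "${1:-/sys/kernel/debug/percpu_stats}"
}
```

On the AFTER dump above the two agree: the global value is 18 and the eleven
listed chunks (including the two sidelined ones) add up to 18 as well.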






^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 19:44                 ` Pratik Sampat
@ 2021-04-16 20:03                   ` Roman Gushchin
  2021-04-17  7:08                     ` Pratik Sampat
  2021-04-16 21:47                   ` Dennis Zhou
  1 sibling, 1 reply; 26+ messages in thread
From: Roman Gushchin @ 2021-04-16 20:03 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
> 
> 
> Sure thing.
> 
> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
> of sidelined chunks.
> 
So, looking at the stats, it now works properly. Do you see any savings in
comparison to vanilla? The size of the savings can depend significantly on the
exact size of cgroup-related objects, how many of them fit into a single chunk,
etc. So you might want to play with the numbers in the test...
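
To put a number on the savings, something like the following could be used.
It is only a sketch: percpu_kb and percpu_released_pct are names made up
here, the "Percpu:" line is parsed the way it appears in /proc/meminfo in the
reports in this thread, and the percentage is rounded down by integer division.

```shell
# Print the "Percpu:" value, in kB, from a meminfo-style file.
# $1 (optional): file to read; defaults to /proc/meminfo.
percpu_kb() {
    awk '$1 == "Percpu:" { print $2 }' "${1:-/proc/meminfo}"
}

# Percentage of the peak percpu footprint that was given back, rounded down.
# $1: Percpu value at the peak (kB, must be non-zero)
# $2: Percpu value after the cgroups were removed (kB)
percpu_released_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}
```

Taking the second and third readings of a test run as "peak" and "after", the
x86 numbers reported earlier in the thread (219712 kB and 72672 kB) give 66,
and the original POWER9 run (119040 kB both times) gives 0.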

Anyway, thank you very much for the report and your work on testing follow-up
patches! It helped to reveal a serious bug in the implementation (completely
empty sidelined chunks were not released in some cases), which by pure
coincidence wasn't triggered on x86.

Thanks!


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/6] percpu: fix a comment about the chunks ordering
  2021-04-08  3:57 ` [PATCH v3 1/6] percpu: fix a comment about the chunks ordering Roman Gushchin
@ 2021-04-16 21:06   ` Dennis Zhou
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel

Hello,

On Wed, Apr 07, 2021 at 08:57:31PM -0700, Roman Gushchin wrote:
> Since the commit 3e54097beb22 ("percpu: manage chunks based on
> contig_bits instead of free_bytes") chunks are sorted based on the
> size of the biggest continuous free area instead of the total number
> of free bytes. Update the corresponding comment to reflect this.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 6596a0a4286e..2f27123bb489 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -99,7 +99,10 @@
>  
>  #include "percpu-internal.h"
>  
> -/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
> +/*
> + * The slots are sorted by the size of the biggest continuous free area.
> + * 1-31 bytes share the same slot.
> + */
>  #define PCPU_SLOT_BASE_SHIFT		5
>  /* chunks in slots below this are subject to being sidelined on failed alloc */
>  #define PCPU_SLOT_FAIL_THRESHOLD	3
> -- 
> 2.30.2
> 

I've applied this to for-5.14.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/6] percpu: split __pcpu_balance_workfn()
  2021-04-08  3:57 ` [PATCH v3 2/6] percpu: split __pcpu_balance_workfn() Roman Gushchin
@ 2021-04-16 21:06   ` Dennis Zhou
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel

Hello,

On Wed, Apr 07, 2021 at 08:57:32PM -0700, Roman Gushchin wrote:
> __pcpu_balance_workfn() became fairly big and hard to follow, but in
> fact it consists of two fully independent parts, responsible for
> the destruction of excessive free chunks and population of the necessary
> amount of free pages.
> 
> In order to simplify the code and prepare for adding new
> functionality, split it into two functions:
> 
>   1) pcpu_balance_free,
>   2) pcpu_balance_populated.
> 
> Move the taking/releasing of the pcpu_alloc_mutex to an upper level
> to keep the current synchronization in place.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Dennis Zhou <dennis@kernel.org>
> ---
>  mm/percpu.c | 46 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 29 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 2f27123bb489..7e31e1b8725f 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1933,31 +1933,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
>  }
>  
>  /**
> - * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
> + * pcpu_balance_free - manage the amount of free chunks
>   * @type: chunk type
>   *
> - * Reclaim all fully free chunks except for the first one.  This is also
> - * responsible for maintaining the pool of empty populated pages.  However,
> - * it is possible that this is called when physical memory is scarce causing
> - * OOM killer to be triggered.  We should avoid doing so until an actual
> - * allocation causes the failure as it is possible that requests can be
> - * serviced from already backed regions.
> + * Reclaim all fully free chunks except for the first one.
>   */
> -static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> +static void pcpu_balance_free(enum pcpu_chunk_type type)
>  {
> -	/* gfp flags passed to underlying allocators */
> -	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	LIST_HEAD(to_free);
>  	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
>  	struct pcpu_chunk *chunk, *next;
> -	int slot, nr_to_pop, ret;
>  
>  	/*
>  	 * There's no reason to keep around multiple unused chunks and VM
>  	 * areas can be scarce.  Destroy all free chunks except for one.
>  	 */
> -	mutex_lock(&pcpu_alloc_mutex);
>  	spin_lock_irq(&pcpu_lock);
>  
>  	list_for_each_entry_safe(chunk, next, free_head, list) {
> @@ -1985,6 +1976,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
>  		pcpu_destroy_chunk(chunk);
>  		cond_resched();
>  	}
> +}
> +
> +/**
> + * pcpu_balance_populated - manage the amount of populated pages
> + * @type: chunk type
> + *
> + * Maintain a certain amount of populated pages to satisfy atomic allocations.
> + * It is possible that this is called when physical memory is scarce causing
> + * OOM killer to be triggered.  We should avoid doing so until an actual
> + * allocation causes the failure as it is possible that requests can be
> + * serviced from already backed regions.
> + */
> +static void pcpu_balance_populated(enum pcpu_chunk_type type)
> +{
> +	/* gfp flags passed to underlying allocators */
> +	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> +	struct list_head *pcpu_slot = pcpu_chunk_list(type);
> +	struct pcpu_chunk *chunk;
> +	int slot, nr_to_pop, ret;
>  
>  	/*
>  	 * Ensure there are certain number of free populated pages for
> @@ -2054,22 +2064,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
>  			goto retry_pop;
>  		}
>  	}
> -
> -	mutex_unlock(&pcpu_alloc_mutex);
>  }
>  
>  /**
>   * pcpu_balance_workfn - manage the amount of free chunks and populated pages
>   * @work: unused
>   *
> - * Call __pcpu_balance_workfn() for each chunk type.
> + * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
>   */
>  static void pcpu_balance_workfn(struct work_struct *work)
>  {
>  	enum pcpu_chunk_type type;
>  
> -	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> -		__pcpu_balance_workfn(type);
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
> +		mutex_lock(&pcpu_alloc_mutex);
> +		pcpu_balance_free(type);
> +		pcpu_balance_populated(type);
> +		mutex_unlock(&pcpu_alloc_mutex);
> +	}
>  }
>  
>  /**
> -- 
> 2.30.2
> 
> 
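
For readers following the locking change, here is a minimal userspace
sketch of the resulting call structure. It is not the kernel code: the
boolean flag is a toy stand-in for pcpu_alloc_mutex, and every name
prefixed with sketch_ (as well as the assumed count of two chunk types)
is hypothetical. It only illustrates that the caller now holds the lock
across both helpers, and that the helpers rely on it being held.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed number of chunk types, for illustration only. */
enum { SKETCH_NR_CHUNK_TYPES = 2 };

/* Toy stand-in for pcpu_alloc_mutex: tracks only whether it is held. */
static bool sketch_mutex_held;

/* Counters so a caller can observe that both helpers ran. */
static int sketch_free_calls;
static int sketch_populated_calls;

static void sketch_lock(void)
{
	assert(!sketch_mutex_held);
	sketch_mutex_held = true;
}

static void sketch_unlock(void)
{
	assert(sketch_mutex_held);
	sketch_mutex_held = false;
}

/* Stand-in for pcpu_balance_free(): expects the caller to hold the lock. */
static void sketch_balance_free(int type)
{
	(void)type;
	assert(sketch_mutex_held);
	sketch_free_calls++;
}

/* Stand-in for pcpu_balance_populated(): same locking expectation. */
static void sketch_balance_populated(int type)
{
	(void)type;
	assert(sketch_mutex_held);
	sketch_populated_calls++;
}

/* Stand-in for pcpu_balance_workfn(): lock once per type, run both steps. */
static void sketch_balance_work(void)
{
	for (int type = 0; type < SKETCH_NR_CHUNK_TYPES; type++) {
		sketch_lock();
		sketch_balance_free(type);
		sketch_balance_populated(type);
		sketch_unlock();
	}
}
```

Calling sketch_balance_work() once runs each helper once per chunk type
and leaves the toy lock released, which mirrors the synchronization the
patch keeps in place.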

I've applied this to for-5.14.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type
  2021-04-08  3:57 ` [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type Roman Gushchin
@ 2021-04-16 21:08   ` Dennis Zhou
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel

Hello,

On Wed, Apr 07, 2021 at 08:57:33PM -0700, Roman Gushchin wrote:
> nr_empty_pop_pages is used to guarantee that there are some free
> populated pages to satisfy atomic allocations. Accounted and
> non-accounted allocations are using separate sets of chunks,
> so both need to have a surplus of empty pages.
> 
> This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
> per chunk type.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu-internal.h |  2 +-
>  mm/percpu-stats.c    |  9 +++++++--
>  mm/percpu.c          | 14 +++++++-------
>  3 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 18b768ac7dca..095d7eaa0db4 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
>  
>  extern struct list_head *pcpu_chunk_lists;
>  extern int pcpu_nr_slots;
> -extern int pcpu_nr_empty_pop_pages;
> +extern int pcpu_nr_empty_pop_pages[];
>  
>  extern struct pcpu_chunk *pcpu_first_chunk;
>  extern struct pcpu_chunk *pcpu_reserved_chunk;
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index c8400a2adbc2..f6026dbcdf6b 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  	int slot, max_nr_alloc;
>  	int *buffer;
>  	enum pcpu_chunk_type type;
> +	int nr_empty_pop_pages;
>  
>  alloc_buffer:
>  	spin_lock_irq(&pcpu_lock);
> @@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  		goto alloc_buffer;
>  	}
>  
> -#define PL(X) \
> +	nr_empty_pop_pages = 0;
> +	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
> +		nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
> +
> +#define PL(X)								\
>  	seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
>  
>  	seq_printf(m,
> @@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
>  	PU(nr_max_chunks);
>  	PU(min_alloc_size);
>  	PU(max_alloc_size);
> -	P("empty_pop_pages", pcpu_nr_empty_pop_pages);
> +	P("empty_pop_pages", nr_empty_pop_pages);
>  	seq_putc(m, '\n');
>  
>  #undef PU
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 7e31e1b8725f..61339b3d9337 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -176,10 +176,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
>  static LIST_HEAD(pcpu_map_extend_chunks);
>  
>  /*
> - * The number of empty populated pages, protected by pcpu_lock.  The
> - * reserved chunk doesn't contribute to the count.
> + * The number of empty populated pages by chunk type, protected by pcpu_lock.
> + * The reserved chunk doesn't contribute to the count.
>   */
> -int pcpu_nr_empty_pop_pages;
> +int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
>  
>  /*
>   * The number of populated pages in use by the allocator, protected by
> @@ -559,7 +559,7 @@ static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
>  {
>  	chunk->nr_empty_pop_pages += nr;
>  	if (chunk != pcpu_reserved_chunk)
> -		pcpu_nr_empty_pop_pages += nr;
> +		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
>  }
>  
>  /*
> @@ -1835,7 +1835,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		mutex_unlock(&pcpu_alloc_mutex);
>  	}
>  
> -	if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
> +	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
>  		pcpu_schedule_balance_work();
>  
>  	/* clear the areas and return address relative to base address */
> @@ -2013,7 +2013,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
>  		pcpu_atomic_alloc_failed = false;
>  	} else {
>  		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
> -				  pcpu_nr_empty_pop_pages,
> +				  pcpu_nr_empty_pop_pages[type],
>  				  0, PCPU_EMPTY_POP_PAGES_HIGH);
>  	}
>  
> @@ -2595,7 +2595,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  
>  	/* link the first chunk in */
>  	pcpu_first_chunk = chunk;
> -	pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
> +	pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = pcpu_first_chunk->nr_empty_pop_pages;
>  	pcpu_chunk_relocate(pcpu_first_chunk, -1);
>  
>  	/* include all regions of the first chunk */
> -- 
> 2.30.2
> 
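
To illustrate why a single global counter can hide a shortage, here is a
small userspace sketch of per-type accounting. The enum values, the
sketch_ names and the low-watermark parameter are assumptions made for
this example; the patch itself only shows PCPU_CHUNK_ROOT,
PCPU_NR_CHUNK_TYPES and PCPU_EMPTY_POP_PAGES_LOW.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical chunk types: one non-accounted set, one accounted set. */
enum sketch_chunk_type {
	SKETCH_CHUNK_ROOT,
	SKETCH_CHUNK_ACCOUNTED,
	SKETCH_NR_CHUNK_TYPES
};

/* One counter of empty populated pages per chunk type. */
static int sketch_nr_empty_pop_pages[SKETCH_NR_CHUNK_TYPES];

/* Adjust the counter of the set the chunk belongs to. */
static void sketch_update_empty_pages(enum sketch_chunk_type type, int nr)
{
	assert(type < SKETCH_NR_CHUNK_TYPES);
	sketch_nr_empty_pop_pages[type] += nr;
}

/* The balance check looks only at the type being allocated from. */
static bool sketch_needs_balance(enum sketch_chunk_type type, int low_mark)
{
	return sketch_nr_empty_pop_pages[type] < low_mark;
}

/* The stats output still reports one number: the sum over all types. */
static int sketch_total_empty_pop_pages(void)
{
	int total = 0;

	for (int type = 0; type < SKETCH_NR_CHUNK_TYPES; type++)
		total += sketch_nr_empty_pop_pages[type];
	return total;
}
```

With 5 empty pages in the root set and 1 in the accounted set, the total
is 6, yet a check of the accounted set against a low mark of 2 still
asks for a rebalance. A single shared counter would not have triggered
it.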

This turns out to have been a more pressing issue. Thanks for fixing
this. I ran this to Linus for v5.12-rc7 [1].

[1] https://lore.kernel.org/lkml/YHHs618ESvKhYeeM@google.com/

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] percpu: generalize pcpu_balance_populated()
  2021-04-08  3:57 ` [PATCH v3 4/6] percpu: generalize pcpu_balance_populated() Roman Gushchin
@ 2021-04-16 21:09   ` Dennis Zhou
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel

Hello,

On Wed, Apr 07, 2021 at 08:57:34PM -0700, Roman Gushchin wrote:
> To prepare for the depopulation of percpu chunks, split out the
> populating part of the pcpu_balance_populated() into the new
> pcpu_grow_populated() (with an intention to add
> pcpu_shrink_populated() in the next commit).
> 
> The goal of pcpu_balance_populated() is to determine whether
> there is a shortage or an excessive amount of empty percpu pages
> and call into the corresponding function.
> 
> pcpu_grow_populated() takes a desired number of pages as an argument
> (nr_to_pop). If it creates a new chunk, nr_to_pop should be updated
> to reflect that the new chunk could be created already populated.
> Otherwise an infinite loop might appear.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu.c | 63 +++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 39 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 61339b3d9337..e20119668c42 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1979,7 +1979,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
>  }
>  
>  /**
> - * pcpu_balance_populated - manage the amount of populated pages
> + * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
>   * @type: chunk type
>   *
>   * Maintain a certain amount of populated pages to satisfy atomic allocations.
> @@ -1988,35 +1988,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
>   * allocation causes the failure as it is possible that requests can be
>   * serviced from already backed regions.
>   */
> -static void pcpu_balance_populated(enum pcpu_chunk_type type)
> +static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
>  {
>  	/* gfp flags passed to underlying allocators */
>  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct pcpu_chunk *chunk;
> -	int slot, nr_to_pop, ret;
> +	int slot, ret;
>  
> -	/*
> -	 * Ensure there are certain number of free populated pages for
> -	 * atomic allocs.  Fill up from the most packed so that atomic
> -	 * allocs don't increase fragmentation.  If atomic allocation
> -	 * failed previously, always populate the maximum amount.  This
> -	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> -	 * failing indefinitely; however, large atomic allocs are not
> -	 * something we support properly and can be highly unreliable and
> -	 * inefficient.
> -	 */
>  retry_pop:
> -	if (pcpu_atomic_alloc_failed) {
> -		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> -		/* best effort anyway, don't worry about synchronization */
> -		pcpu_atomic_alloc_failed = false;
> -	} else {
> -		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
> -				  pcpu_nr_empty_pop_pages[type],
> -				  0, PCPU_EMPTY_POP_PAGES_HIGH);
> -	}
> -
>  	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
>  		unsigned int nr_unpop = 0, rs, re;
>  
> @@ -2060,12 +2040,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
>  		if (chunk) {
>  			spin_lock_irq(&pcpu_lock);
>  			pcpu_chunk_relocate(chunk, -1);
> +			nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
>  			spin_unlock_irq(&pcpu_lock);
> -			goto retry_pop;
> +			if (nr_to_pop)
> +				goto retry_pop;
>  		}
>  	}
>  }
>  
> +/**
> + * pcpu_balance_populated - manage the amount of populated pages
> + * @type: chunk type
> + *
> + * Populate or depopulate chunks to maintain a certain amount
> + * of free pages to satisfy atomic allocations, but not waste
> + * large amounts of memory.
> + */
> +static void pcpu_balance_populated(enum pcpu_chunk_type type)
> +{
> +	int nr_to_pop;
> +
> +	/*
> +	 * Ensure there are certain number of free populated pages for
> +	 * atomic allocs.  Fill up from the most packed so that atomic
> +	 * allocs don't increase fragmentation.  If atomic allocation
> +	 * failed previously, always populate the maximum amount.  This
> +	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> +	 * failing indefinitely; however, large atomic allocs are not
> +	 * something we support properly and can be highly unreliable and
> +	 * inefficient.
> +	 */
> +	if (pcpu_atomic_alloc_failed) {
> +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> +		/* best effort anyway, don't worry about synchronization */
> +		pcpu_atomic_alloc_failed = false;
> +		pcpu_grow_populated(type, nr_to_pop);
> +	} else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
> +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
> +		pcpu_grow_populated(type, nr_to_pop);
> +	}
> +}
> +
>  /**
>   * pcpu_balance_workfn - manage the amount of free chunks and populated pages
>   * @work: unused
> -- 
> 2.30.2
> 
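
The decision logic described in the commit message can be sketched in
userspace as follows. This is an illustration under stated assumptions,
not the kernel implementation: the function names are hypothetical and
the high watermark is passed as a parameter rather than taken from
PCPU_EMPTY_POP_PAGES_HIGH.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * How many pages the balance step would ask the grow step to populate.
 * After a failed atomic allocation the full high mark is requested;
 * otherwise only the shortfall below the high mark, or nothing.
 */
static int sketch_pages_to_populate(bool atomic_alloc_failed, int nr_empty,
				    int high_mark)
{
	assert(nr_empty >= 0 && high_mark >= 0);

	if (atomic_alloc_failed)
		return high_mark;
	if (nr_empty < high_mark)
		return high_mark - nr_empty;
	return 0;
}

/*
 * A newly created chunk may already contain populated pages. Subtracting
 * them, clamped at zero, is what lets the retry loop terminate.
 */
static int sketch_remaining_after_new_chunk(int nr_to_pop,
					    int chunk_nr_populated)
{
	int left = nr_to_pop - chunk_nr_populated;

	return left > 0 ? left : 0;
}
```

The second helper corresponds to the max_t() update in the patch: if the
new chunk already covers the request, the remaining count drops to zero
and no further retry is needed.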

I've applied this to for-5.14.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint()
  2021-04-08  3:57 ` [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint() Roman Gushchin
@ 2021-04-16 21:15   ` Dennis Zhou
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:15 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Christoph Lameter, Andrew Morton, Vlastimil Babka,
	linux-mm, linux-kernel

Hello,

On Wed, Apr 07, 2021 at 08:57:35PM -0700, Roman Gushchin wrote:
> Factor out the pcpu_check_chunk_hint() helper, which will be useful
> in the future. The new function checks if the allocation can likely
> fit the given chunk.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/percpu.c | 30 +++++++++++++++++++++---------
>  1 file changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index e20119668c42..357fd6994278 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -306,6 +306,26 @@ static unsigned long pcpu_block_off_to_off(int index, int off)
>  	return index * PCPU_BITMAP_BLOCK_BITS + off;
>  }
>  
> +/**
> + * pcpu_check_chunk_hint - check that allocation can fit a chunk
> + * @chunk_md: chunk's block

nit for consistency: @block: block of interest

> + * @bits: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + *
> + * Check to see if the allocation can fit in the chunk's contig hint.
> + * This is an optimization to prevent scanning by assuming if it
> + * cannot fit in the global hint, there is memory pressure and creating
> + * a new chunk would happen soon.
> + */

It occurred to me that I converged block_md and chunk_md to be the same
object, as one is just a degenerate case of the other. Can we rename this
to pcpu_check_block_hint() and have it take a pcpu_block_md?

> +static bool pcpu_check_chunk_hint(struct pcpu_block_md *chunk_md, int bits,
> +				  size_t align)
> +{
> +	int bit_off = ALIGN(chunk_md->contig_hint_start, align) -
> +		chunk_md->contig_hint_start;
> +
> +	return bit_off + bits <= chunk_md->contig_hint;
> +}
> +
>  /*
>   * pcpu_next_hint - determine which hint to use
>   * @block: block of interest
> @@ -1065,15 +1085,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
>  	struct pcpu_block_md *chunk_md = &chunk->chunk_md;
>  	int bit_off, bits, next_off;
>  
> -	/*
> -	 * Check to see if the allocation can fit in the chunk's contig hint.
> -	 * This is an optimization to prevent scanning by assuming if it
> -	 * cannot fit in the global hint, there is memory pressure and creating
> -	 * a new chunk would happen soon.
> -	 */
> -	bit_off = ALIGN(chunk_md->contig_hint_start, align) -
> -		  chunk_md->contig_hint_start;
> -	if (bit_off + alloc_bits > chunk_md->contig_hint)
> +	if (!pcpu_check_chunk_hint(chunk_md, alloc_bits, align))
>  		return -1;
>  
>  	bit_off = pcpu_next_hint(chunk_md, alloc_bits);
> -- 
> 2.30.2
> 
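
As a worked illustration of the hint check, here is a self-contained
userspace sketch. The struct is a hypothetical stand-in carrying only
the two pcpu_block_md fields the helper reads, and sketch_align_up()
assumes a power-of-two alignment in the same way the kernel's ALIGN()
does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two metadata fields used by the check. */
struct sketch_block_md {
	int contig_hint;	/* size of the largest free run, in units */
	int contig_hint_start;	/* offset at which that run begins */
};

/* Round x up to a multiple of a; a must be a power of two. */
static int sketch_align_up(int x, int a)
{
	assert(a > 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Return true if a request of @bits units with alignment @align can fit
 * inside the largest free run once the start of the run is aligned.
 */
static bool sketch_check_hint(const struct sketch_block_md *block, int bits,
			      int align)
{
	int bit_off = sketch_align_up(block->contig_hint_start, align) -
		      block->contig_hint_start;

	return bit_off + bits <= block->contig_hint;
}
```

For a free run of 10 units starting at offset 6, an 8-unit request
aligned to 8 loses 2 units to alignment and just fits, while a 9-unit
request with the same alignment does not.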

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 19:44                 ` Pratik Sampat
  2021-04-16 20:03                   ` Roman Gushchin
@ 2021-04-16 21:47                   ` Dennis Zhou
  2021-04-17  7:14                     ` Pratik Sampat
  1 sibling, 1 reply; 26+ messages in thread
From: Dennis Zhou @ 2021-04-16 21:47 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Roman Gushchin, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat

Hello,

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
> 
> 
> On 17/04/21 12:39 am, Roman Gushchin wrote:
> > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> > > 
> > > On 17/04/21 12:04 am, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Dennis,
> > > > > > > 
> > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > > > > > the vanilla kernel 5.12-rc6.
> > > > > > > 
> > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > > > Hello Roman,
> > > > > > > > > 
> > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > > > > 
> > > > > > > > > My results of the percpu_test are as follows:
> > > > > > > > > Intel KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             1952 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > > 
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             2080 kB
> > > > > > > > > Percpu:           219712 kB
> > > > > > > > > Percpu:            72672 kB
> > > > > > > > > 
> > > > > > > > > > I'm able to see an improvement comparable to what you're seeing too.
> > > > > > > > > 
> > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > > > > > 
> > > > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             5888 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > > 
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             6144 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > > 
> > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > > > > > here?
> > > > > > > > > 
> > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > > > > > and after?
> > > > > > > I'll paste the whole debug stats before and after here.
> > > > > > > 5.12-rc6 + patchset
> > > > > > > -----BEFORE-----
> > > > > > > Percpu Memory Statistics
> > > > > > > Allocation Info:
> > > > > > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > > > > > 
> > > > > > Vanilla
> > > > > > 
> > > > > > nr_alloc            :         9038         nr_alloc            :        97046
> > > > > > nr_dealloc          :         6992	   nr_dealloc          :        94237
> > > > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2809
> > > > > > nr_max_alloc        :         2178	   nr_max_alloc        :        90054
> > > > > > nr_chunks           :            3	   nr_chunks           :           11
> > > > > > nr_max_chunks       :            3	   nr_max_chunks       :           47
> > > > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > > > empty_pop_pages     :            5	   empty_pop_pages     :           29
> > > > > > 
> > > > > > 
> > > > > > Patched
> > > > > > 
> > > > > > nr_alloc            :         9040         nr_alloc            :        97048
> > > > > > nr_dealloc          :         6994	   nr_dealloc          :        95002
> > > > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2046
> > > > > > nr_max_alloc        :         2208	   nr_max_alloc        :        90054
> > > > > > nr_chunks           :            3	   nr_chunks           :           48
> > > > > > nr_max_chunks       :            3	   nr_max_chunks       :           48
> > > > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > > > empty_pop_pages     :           12	   empty_pop_pages     :           61
> > > > > > 
> > > > > > 
> > > > > > So it looks like the number of chunks got bigger, as well as the number of
> > > > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > > > > > sure that the data is correct and we're not messing two cases?
> > > > > > 
> > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > > > > > bigger with the same amount of chunks.
> > > > > > 
> > > > > > So, can you, please, apply the following patch and provide an updated statistics?
> > > > > Unfortunately, I'm not completely well versed in this area, but yes the empty
> > > > > pop pages number doesn't make sense to me either.
> > > > > 
> > > > > I re-ran the numbers trying to make sure my experiment setup is sane but
> > > > > results remain the same.
> > > > > 
> > > > > Vanilla
> > > > > nr_alloc            :         9040         nr_alloc            :        97048
> > > > > nr_dealloc          :         6994	   nr_dealloc          :        94404
> > > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2644
> > > > > nr_max_alloc        :         2169	   nr_max_alloc        :        90054
> > > > > nr_chunks           :            3	   nr_chunks           :           10
> > > > > nr_max_chunks       :            3	   nr_max_chunks       :           47
> > > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > > empty_pop_pages     :            4	   empty_pop_pages     :           32
> > > > > 
> > > > > With the patchset + debug patch the results are as follows:
> > > > > Patched
> > > > > 
> > > > > nr_alloc            :         9040         nr_alloc            :        97048
> > > > > nr_dealloc          :         6994	   nr_dealloc          :        94349
> > > > > nr_cur_alloc        :         2046	   nr_cur_alloc        :         2699
> > > > > nr_max_alloc        :         2194	   nr_max_alloc        :        90054
> > > > > nr_chunks           :            3	   nr_chunks           :           48
> > > > > nr_max_chunks       :            3	   nr_max_chunks       :           48
> > > > > min_alloc_size      :            4	   min_alloc_size      :            4
> > > > > max_alloc_size      :         1072	   max_alloc_size      :         1072
> > > > > empty_pop_pages     :           12	   empty_pop_pages     :           54
> > > > > 
> > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> > > > > after the test was run. I don't see any entries for "Chunk (to depopulate)"
> > > > > 
> > > > > > I've snipped the results of sidelined chunks because they went on for ~600
> > > > > lines, if you need the full logs let me know.
> > > > Yes, please! That's the most interesting part!
> > > Got it. Pasting the full logs of after the percpu experiment was completed
> > Thanks!
> > 
> > Would you mind to apply the following patch and test again?
> > 
> > --
> > 
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index ded3a7541cb2..532c6a7ebdfd 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
> >                                  need_balance = true;
> >                                  break;
> >                          }
> > +
> > +               chunk->depopulated = false;
> > +               pcpu_chunk_relocate(chunk, -1);
> >          } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
> >                     !chunk->isolated &&
> >                     (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
> > 
> Sure thing.
> 
> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
> of sidelined chunks.
> 
> Pasting the full logs as an example:
> 
> BEFORE
> Percpu Memory Statistics
> Allocation Info:
> ----------------------------------------
>   unit_size           :       655360
>   static_size         :       608920
>   reserved_size       :            0
>   dyn_size            :        46440
>   atom_size           :        65536
>   alloc_size          :       655360
> 
> Global Stats:
> ----------------------------------------
>   nr_alloc            :         9038
>   nr_dealloc          :         6992
>   nr_cur_alloc        :         2046
>   nr_max_alloc        :         2200
>   nr_chunks           :            3
>   nr_max_chunks       :            3
>   min_alloc_size      :            4
>   max_alloc_size      :         1072
>   empty_pop_pages     :           12
> 
> Per Chunk Stats:
> ----------------------------------------
> Chunk: <- First Chunk
>   nr_alloc            :         1092
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :        16247
>   free_bytes          :            4
>   contig_bytes        :            4
>   sum_frag            :            4
>   max_frag            :            4
>   cur_min_alloc       :            4
>   cur_med_alloc       :            8
>   cur_max_alloc       :         1072
>   memcg_aware         :            0
> 
> Chunk:
>   nr_alloc            :          594
>   max_alloc_size      :          992
>   empty_pop_pages     :            8
>   first_bit           :          456
>   free_bytes          :       645008
>   contig_bytes        :       319984
>   sum_frag            :       325024
>   max_frag            :       318680
>   cur_min_alloc       :            4
>   cur_med_alloc       :            8
>   cur_max_alloc       :          424
>   memcg_aware         :            0
> 
> Chunk:
>   nr_alloc            :          360
>   max_alloc_size      :         1072
>   empty_pop_pages     :            4
>   first_bit           :        26595
>   free_bytes          :       506640
>   contig_bytes        :       506540
>   sum_frag            :          100
>   max_frag            :           32
>   cur_min_alloc       :            4
>   cur_med_alloc       :          156
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> 
> AFTER
> Percpu Memory Statistics
> Allocation Info:
> ----------------------------------------
>   unit_size           :       655360
>   static_size         :       608920
>   reserved_size       :            0
>   dyn_size            :        46440
>   atom_size           :        65536
>   alloc_size          :       655360
> 
> Global Stats:
> ----------------------------------------
>   nr_alloc            :        97046
>   nr_dealloc          :        94304
>   nr_cur_alloc        :         2742
>   nr_max_alloc        :        90054
>   nr_chunks           :           11
>   nr_max_chunks       :           47
>   min_alloc_size      :            4
>   max_alloc_size      :         1072
>   empty_pop_pages     :           18
> 
> Per Chunk Stats:
> ----------------------------------------
> Chunk: <- First Chunk
>   nr_alloc            :         1092
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :        16247
>   free_bytes          :            4
>   contig_bytes        :            4
>   sum_frag            :            4
>   max_frag            :            4
>   cur_min_alloc       :            4
>   cur_med_alloc       :            8
>   cur_max_alloc       :         1072
>   memcg_aware         :            0
> 
> Chunk:
>   nr_alloc            :          838
>   max_alloc_size      :         1072
>   empty_pop_pages     :            7
>   first_bit           :          464
>   free_bytes          :       640476
>   contig_bytes        :       290672
>   sum_frag            :       349804
>   max_frag            :       304344
>   cur_min_alloc       :            4
>   cur_med_alloc       :            8
>   cur_max_alloc       :         1072
>   memcg_aware         :            0
> 
> Chunk:
>   nr_alloc            :           90
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :          536
>   free_bytes          :       595752
>   contig_bytes        :        26164
>   sum_frag            :       575132
>   max_frag            :        26164
>   cur_min_alloc       :          156
>   cur_med_alloc       :         1072
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :           90
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :            0
>   free_bytes          :       597428
>   contig_bytes        :        26164
>   sum_frag            :       596848
>   max_frag            :        26164
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :           92
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :            0
>   free_bytes          :       595284
>   contig_bytes        :        26164
>   sum_frag            :       590360
>   max_frag            :        26164
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :           92
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :            0
>   free_bytes          :       595284
>   contig_bytes        :        26164
>   sum_frag            :       583768
>   max_frag            :        26164
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :          360
>   max_alloc_size      :         1072
>   empty_pop_pages     :            7
>   first_bit           :        26595
>   free_bytes          :       506640
>   contig_bytes        :       506540
>   sum_frag            :          100
>   max_frag            :           32
>   cur_min_alloc       :            4
>   cur_med_alloc       :          156
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :           12
>   max_alloc_size      :         1072
>   empty_pop_pages     :            3
>   first_bit           :            0
>   free_bytes          :       647524
>   contig_bytes        :       563492
>   sum_frag            :        57872
>   max_frag            :        26164
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk:
>   nr_alloc            :            0
>   max_alloc_size      :         1072
>   empty_pop_pages     :            1
>   first_bit           :            0
>   free_bytes          :       655360
>   contig_bytes        :       655360
>   sum_frag            :            0
>   max_frag            :            0
>   cur_min_alloc       :            0
>   cur_med_alloc       :            0
>   cur_max_alloc       :            0
>   memcg_aware         :            1
> 
> Chunk (sidelined):
>   nr_alloc            :           72
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :            0
>   free_bytes          :       608344
>   contig_bytes        :       145552
>   sum_frag            :       590340
>   max_frag            :       145552
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> Chunk (sidelined):
>   nr_alloc            :            4
>   max_alloc_size      :         1072
>   empty_pop_pages     :            0
>   first_bit           :            0
>   free_bytes          :       652748
>   contig_bytes        :       426720
>   sum_frag            :       426720
>   max_frag            :       426720
>   cur_min_alloc       :          156
>   cur_med_alloc       :          312
>   cur_max_alloc       :         1072
>   memcg_aware         :            1
> 
> 
 
Thank you Pratik for testing this and working with us to resolve this. I
greatly appreciate it!

Thanks,
Dennis


* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 20:03                   ` Roman Gushchin
@ 2021-04-17  7:08                     ` Pratik Sampat
  0 siblings, 0 replies; 26+ messages in thread
From: Pratik Sampat @ 2021-04-17  7:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat



On 17/04/21 1:33 am, Roman Gushchin wrote:
> On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>>
>> On 17/04/21 12:39 am, Roman Gushchin wrote:
>>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>>>> On 17/04/21 12:04 am, Roman Gushchin wrote:
>>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>>>>>> Hello Dennis,
>>>>>>>>
>>>>>>>> I apologize for the clutter of logs earlier. I'm pasting the logs from before
>>>>>>>> and after the percpu test, both with the patchset applied on 5.12-rc6 and on
>>>>>>>> the vanilla 5.12-rc6 kernel.
>>>>>>>>
>>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>>>>>> Hello Roman,
>>>>>>>>>>
>>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>>>>>
>>>>>>>>>> My results of the percpu_test are as follows:
>>>>>>>>>> Intel KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             1952 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             2080 kB
>>>>>>>>>> Percpu:           219712 kB
>>>>>>>>>> Percpu:            72672 kB
>>>>>>>>>>
>>>>>>>>>> I'm able to see an improvement comparable to what you're seeing too.
>>>>>>>>>>
>>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the
>>>>>>>>>> patchset in the same configuration.
>>>>>>>>>>
>>>>>>>>>> POWER9 KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             5888 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             6144 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>>
>>>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>>>>>> here?
>>>>>>>>>>
>>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>>>>>> and after?
>>>>>>>> I'll paste the whole debug stats before and after here.
>>>>>>>> 5.12-rc6 + patchset
>>>>>>>> -----BEFORE-----
>>>>>>>> Percpu Memory Statistics
>>>>>>>> Allocation Info:
>>>>>>> Hm, this looks highly suspicious. Here are your stats in a more compact form:
>>>>>>>
>>>>>>> Vanilla
>>>>>>>
>>>>>>> nr_alloc            :         9038         nr_alloc            :        97046
>>>>>>> nr_dealloc          :         6992         nr_dealloc          :        94237
>>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2809
>>>>>>> nr_max_alloc        :         2178         nr_max_alloc        :        90054
>>>>>>> nr_chunks           :            3         nr_chunks           :           11
>>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           47
>>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>>> empty_pop_pages     :            5         empty_pop_pages     :           29
>>>>>>>
>>>>>>>
>>>>>>> Patched
>>>>>>>
>>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>>> nr_dealloc          :         6994         nr_dealloc          :        95002
>>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2046
>>>>>>> nr_max_alloc        :         2208         nr_max_alloc        :        90054
>>>>>>> nr_chunks           :            3         nr_chunks           :           48
>>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           48
>>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>>> empty_pop_pages     :           12         empty_pop_pages     :           61
>>>>>>>
>>>>>>>
>>>>>>> So it looks like the number of chunks got bigger, as well as the number of
>>>>>>> empty_pop_pages? This contradicts what you wrote, so can you please make
>>>>>>> sure that the data is correct and we're not mixing up the two cases?
>>>>>>>
>>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>>>>>> bigger with the same number of chunks.
>>>>>>>
>>>>>>> So, can you please apply the following patch and provide updated statistics?
>>>>>> Unfortunately, I'm not completely well versed in this area, but yes the empty
>>>>>> pop pages number doesn't make sense to me either.
>>>>>>
>>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but
>>>>>> results remain the same.
>>>>>>
>>>>>> Vanilla
>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>> nr_dealloc          :         6994         nr_dealloc          :        94404
>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2644
>>>>>> nr_max_alloc        :         2169         nr_max_alloc        :        90054
>>>>>> nr_chunks           :            3         nr_chunks           :           10
>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           47
>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>> empty_pop_pages     :            4         empty_pop_pages     :           32
>>>>>>
>>>>>> With the patchset + debug patch the results are as follows:
>>>>>> Patched
>>>>>>
>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>> nr_dealloc          :         6994         nr_dealloc          :        94349
>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2699
>>>>>> nr_max_alloc        :         2194         nr_max_alloc        :        90054
>>>>>> nr_chunks           :            3         nr_chunks           :           48
>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           48
>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>> empty_pop_pages     :           12         empty_pop_pages     :           54
>>>>>>
>>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>>>>>
>>>>>> I've snipped the results of sidelined chunks because they went on for ~600
>>>>>> lines. If you need the full logs, let me know.
>>>>> Yes, please! That's the most interesting part!
>>>> Got it. Pasting the full logs from after the percpu experiment completed.
>>> Thanks!
>>>
>>> Would you mind applying the following patch and testing again?
>>>
>>> --
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index ded3a7541cb2..532c6a7ebdfd 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
>>>                                   need_balance = true;
>>>                                   break;
>>>                           }
>>> +
>>> +               chunk->depopulated = false;
>>> +               pcpu_chunk_relocate(chunk, -1);
>>>           } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
>>>                      !chunk->isolated &&
>>>                      (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>>>
>> Sure thing.
>>
>> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
>> of sidelined chunks.
>>
> So looking at the stats it now works properly. Do you see any savings in
> comparison to vanilla? The size of savings can significantly depend on the exact
> size of cgroup-related objects, how many of them fit into a single chunk, etc.
> So you might want to play with numbers in the test...
>
> Anyway, thank you very much for the report and your work on testing follow-up
> patches! It helped to reveal a serious bug in the implementation (completely
> empty sidelined chunks were not released in some cases), which by pure
> coincidence wasn't triggered on x86.
>
> Thanks!
>
Unfortunately not, I don't see any savings from the test.

# ./percpu_test_roman.sh
Percpu:             6144 kB
Percpu:           122880 kB
Percpu:           122880 kB
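
To put a number on the difference, here is a small sketch that turns the peak
and final "Percpu:" readings into an amount released. The function name
percpu_released is hypothetical and is not part of the test script; the inputs
are the readings from this run and from the patched x86 run quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of percpu_test.sh): given the peak and the
# final "Percpu:" values in kB, print how much memory was released.
percpu_released() {
    local peak_kb=$1 final_kb=$2
    local released_kb=$(( peak_kb - final_kb ))
    # Integer percentage of the peak, rounded down.
    local pct=$(( released_kb * 100 / peak_kb ))
    echo "${released_kb} kB released (${pct}%)"
}

percpu_released 122880 122880   # POWER9 run above, patchset + fix
percpu_released 219712 72672    # x86 run with the patchset, quoted above
```

The first call reports nothing released, while the second reports roughly two
thirds of the peak returned to the system.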

I had assumed that because POWER has a larger page size, we would also see
higher fragmentation, which could lead to even larger savings.
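
To make the page-size comparison concrete, here is a rough sketch of the
arithmetic. It assumes that the atom_size of 65536 reported in the percpu_stats
output in this thread corresponds to a 64 kB page on the POWER9 guest, and it
uses 4096 as a typical x86 page size applied to the same unit_size purely for
comparison. The function name pages_per_unit is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: how many pages back a single percpu unit.
# Arguments: unit size and page size, both in bytes.
pages_per_unit() {
    local unit_size=$1 page_size=$2
    echo $(( unit_size / page_size ))
}

# unit_size is taken from the percpu_stats output in this thread.
pages_per_unit 655360 65536   # assumed 64 kB page (atom_size in the stats)
pages_per_unit 655360 4096    # assumed 4 kB page, for comparison only
```

Under these assumptions a unit spans 10 pages at 64 kB and 160 pages at 4 kB,
so one small live allocation keeps a much larger share of the unit populated
when pages are large.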

I'll dive deeper into the patches and tweak the setup to see if I can
understand this behavior.

Thanks for helping me understand this patchset a little better and I'm glad we
found a bug with sidelined chunks!

I'll get back to you if I do find something interesting and need help
understanding it.

Thank you again,
Pratik



* Re: [PATCH v3 0/6] percpu: partial chunk depopulation
  2021-04-16 21:47                   ` Dennis Zhou
@ 2021-04-17  7:14                     ` Pratik Sampat
  0 siblings, 0 replies; 26+ messages in thread
From: Pratik Sampat @ 2021-04-17  7:14 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Roman Gushchin, Tejun Heo, Christoph Lameter, Andrew Morton,
	Vlastimil Babka, linux-mm, linux-kernel, pratik.r.sampat



On 17/04/21 3:17 am, Dennis Zhou wrote:
> Hello,
>
> On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>>
>> On 17/04/21 12:39 am, Roman Gushchin wrote:
>>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>>>> On 17/04/21 12:04 am, Roman Gushchin wrote:
>>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>>>>>> Hello Dennis,
>>>>>>>>
>>>>>>>> I apologize for the clutter of logs earlier. I'm pasting the logs from before
>>>>>>>> and after the percpu test, both with the patchset applied on 5.12-rc6 and on
>>>>>>>> the vanilla 5.12-rc6 kernel.
>>>>>>>>
>>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>>>>>> Hello Roman,
>>>>>>>>>>
>>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>>>>>
>>>>>>>>>> My results of the percpu_test are as follows:
>>>>>>>>>> Intel KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             1952 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             2080 kB
>>>>>>>>>> Percpu:           219712 kB
>>>>>>>>>> Percpu:            72672 kB
>>>>>>>>>>
>>>>>>>>>> I'm able to see an improvement comparable to what you're seeing too.
>>>>>>>>>>
>>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the
>>>>>>>>>> patchset in the same configuration.
>>>>>>>>>>
>>>>>>>>>> POWER9 KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             5888 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             6144 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>>
>>>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>>>>>> here?
>>>>>>>>>>
>>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>>>>>> and after?
>>>>>>>> I'll paste the whole debug stats before and after here.
>>>>>>>> 5.12-rc6 + patchset
>>>>>>>> -----BEFORE-----
>>>>>>>> Percpu Memory Statistics
>>>>>>>> Allocation Info:
>>>>>>> Hm, this looks highly suspicious. Here are your stats in a more compact form:
>>>>>>>
>>>>>>> Vanilla
>>>>>>>
>>>>>>> nr_alloc            :         9038         nr_alloc            :        97046
>>>>>>> nr_dealloc          :         6992         nr_dealloc          :        94237
>>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2809
>>>>>>> nr_max_alloc        :         2178         nr_max_alloc        :        90054
>>>>>>> nr_chunks           :            3         nr_chunks           :           11
>>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           47
>>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>>> empty_pop_pages     :            5         empty_pop_pages     :           29
>>>>>>>
>>>>>>>
>>>>>>> Patched
>>>>>>>
>>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>>> nr_dealloc          :         6994         nr_dealloc          :        95002
>>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2046
>>>>>>> nr_max_alloc        :         2208         nr_max_alloc        :        90054
>>>>>>> nr_chunks           :            3         nr_chunks           :           48
>>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           48
>>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>>> empty_pop_pages     :           12         empty_pop_pages     :           61
>>>>>>>
>>>>>>>
>>>>>>> So it looks like the number of chunks got bigger, as well as the number of
>>>>>>> empty_pop_pages? This contradicts what you wrote, so can you please make
>>>>>>> sure that the data is correct and we're not mixing up the two cases?
>>>>>>>
>>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>>>>>> bigger with the same number of chunks.
>>>>>>>
>>>>>>> So, can you please apply the following patch and provide updated statistics?
>>>>>> Unfortunately, I'm not completely well versed in this area, but yes the empty
>>>>>> pop pages number doesn't make sense to me either.
>>>>>>
>>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but
>>>>>> results remain the same.
>>>>>>
>>>>>> Vanilla
>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>> nr_dealloc          :         6994         nr_dealloc          :        94404
>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2644
>>>>>> nr_max_alloc        :         2169         nr_max_alloc        :        90054
>>>>>> nr_chunks           :            3         nr_chunks           :           10
>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           47
>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>> empty_pop_pages     :            4         empty_pop_pages     :           32
>>>>>>
>>>>>> With the patchset + debug patch the results are as follows:
>>>>>> Patched
>>>>>>
>>>>>> nr_alloc            :         9040         nr_alloc            :        97048
>>>>>> nr_dealloc          :         6994         nr_dealloc          :        94349
>>>>>> nr_cur_alloc        :         2046         nr_cur_alloc        :         2699
>>>>>> nr_max_alloc        :         2194         nr_max_alloc        :        90054
>>>>>> nr_chunks           :            3         nr_chunks           :           48
>>>>>> nr_max_chunks       :            3         nr_max_chunks       :           48
>>>>>> min_alloc_size      :            4         min_alloc_size      :            4
>>>>>> max_alloc_size      :         1072         max_alloc_size      :         1072
>>>>>> empty_pop_pages     :           12         empty_pop_pages     :           54
>>>>>>
>>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>>>>>
>>>>>> I've snipped the results of sidelined chunks because they went on for ~600
>>>>>> lines. If you need the full logs, let me know.
>>>>> Yes, please! That's the most interesting part!
>>>> Got it. Pasting the full logs from after the percpu experiment completed.
>>> Thanks!
>>>
>>> Would you mind applying the following patch and testing again?
>>>
>>> --
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index ded3a7541cb2..532c6a7ebdfd 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
>>>                                   need_balance = true;
>>>                                   break;
>>>                           }
>>> +
>>> +               chunk->depopulated = false;
>>> +               pcpu_chunk_relocate(chunk, -1);
>>>           } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
>>>                      !chunk->isolated &&
>>>                      (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>>>
>> Sure thing.
>>
>> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
>> of sidelined chunks.
>>
>> Pasting the full logs as an example:
>>
>> BEFORE
>> Percpu Memory Statistics
>> Allocation Info:
>> ----------------------------------------
>>    unit_size           :       655360
>>    static_size         :       608920
>>    reserved_size       :            0
>>    dyn_size            :        46440
>>    atom_size           :        65536
>>    alloc_size          :       655360
>>
>> Global Stats:
>> ----------------------------------------
>>    nr_alloc            :         9038
>>    nr_dealloc          :         6992
>>    nr_cur_alloc        :         2046
>>    nr_max_alloc        :         2200
>>    nr_chunks           :            3
>>    nr_max_chunks       :            3
>>    min_alloc_size      :            4
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :           12
>>
>> Per Chunk Stats:
>> ----------------------------------------
>> Chunk: <- First Chunk
>>    nr_alloc            :         1092
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :        16247
>>    free_bytes          :            4
>>    contig_bytes        :            4
>>    sum_frag            :            4
>>    max_frag            :            4
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :            8
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            0
>>
>> Chunk:
>>    nr_alloc            :          594
>>    max_alloc_size      :          992
>>    empty_pop_pages     :            8
>>    first_bit           :          456
>>    free_bytes          :       645008
>>    contig_bytes        :       319984
>>    sum_frag            :       325024
>>    max_frag            :       318680
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :            8
>>    cur_max_alloc       :          424
>>    memcg_aware         :            0
>>
>> Chunk:
>>    nr_alloc            :          360
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            4
>>    first_bit           :        26595
>>    free_bytes          :       506640
>>    contig_bytes        :       506540
>>    sum_frag            :          100
>>    max_frag            :           32
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :          156
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>>
>> AFTER
>> Percpu Memory Statistics
>> Allocation Info:
>> ----------------------------------------
>>    unit_size           :       655360
>>    static_size         :       608920
>>    reserved_size       :            0
>>    dyn_size            :        46440
>>    atom_size           :        65536
>>    alloc_size          :       655360
>>
>> Global Stats:
>> ----------------------------------------
>>    nr_alloc            :        97046
>>    nr_dealloc          :        94304
>>    nr_cur_alloc        :         2742
>>    nr_max_alloc        :        90054
>>    nr_chunks           :           11
>>    nr_max_chunks       :           47
>>    min_alloc_size      :            4
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :           18
>>
>> Per Chunk Stats:
>> ----------------------------------------
>> Chunk: <- First Chunk
>>    nr_alloc            :         1092
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :        16247
>>    free_bytes          :            4
>>    contig_bytes        :            4
>>    sum_frag            :            4
>>    max_frag            :            4
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :            8
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            0
>>
>> Chunk:
>>    nr_alloc            :          838
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            7
>>    first_bit           :          464
>>    free_bytes          :       640476
>>    contig_bytes        :       290672
>>    sum_frag            :       349804
>>    max_frag            :       304344
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :            8
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            0
>>
>> Chunk:
>>    nr_alloc            :           90
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :          536
>>    free_bytes          :       595752
>>    contig_bytes        :        26164
>>    sum_frag            :       575132
>>    max_frag            :        26164
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :         1072
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :           90
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :            0
>>    free_bytes          :       597428
>>    contig_bytes        :        26164
>>    sum_frag            :       596848
>>    max_frag            :        26164
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :           92
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :            0
>>    free_bytes          :       595284
>>    contig_bytes        :        26164
>>    sum_frag            :       590360
>>    max_frag            :        26164
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :           92
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :            0
>>    free_bytes          :       595284
>>    contig_bytes        :        26164
>>    sum_frag            :       583768
>>    max_frag            :        26164
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :          360
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            7
>>    first_bit           :        26595
>>    free_bytes          :       506640
>>    contig_bytes        :       506540
>>    sum_frag            :          100
>>    max_frag            :           32
>>    cur_min_alloc       :            4
>>    cur_med_alloc       :          156
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :           12
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            3
>>    first_bit           :            0
>>    free_bytes          :       647524
>>    contig_bytes        :       563492
>>    sum_frag            :        57872
>>    max_frag            :        26164
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk:
>>    nr_alloc            :            0
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            1
>>    first_bit           :            0
>>    free_bytes          :       655360
>>    contig_bytes        :       655360
>>    sum_frag            :            0
>>    max_frag            :            0
>>    cur_min_alloc       :            0
>>    cur_med_alloc       :            0
>>    cur_max_alloc       :            0
>>    memcg_aware         :            1
>>
>> Chunk (sidelined):
>>    nr_alloc            :           72
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :            0
>>    free_bytes          :       608344
>>    contig_bytes        :       145552
>>    sum_frag            :       590340
>>    max_frag            :       145552
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>> Chunk (sidelined):
>>    nr_alloc            :            4
>>    max_alloc_size      :         1072
>>    empty_pop_pages     :            0
>>    first_bit           :            0
>>    free_bytes          :       652748
>>    contig_bytes        :       426720
>>    sum_frag            :       426720
>>    max_frag            :       426720
>>    cur_min_alloc       :          156
>>    cur_med_alloc       :          312
>>    cur_max_alloc       :         1072
>>    memcg_aware         :            1
>>
>>
>   
> Thank you Pratik for testing this and working with us to resolve this. I
> greatly appreciate it!
>
> Thanks,
> Dennis

No worries at all, glad I could be of some help!

Thank you,
Pratik



end of thread, other threads:[~2021-04-17  7:14 UTC | newest]

Thread overview: 26+ messages
2021-04-08  3:57 [PATCH v3 0/6] percpu: partial chunk depopulation Roman Gushchin
2021-04-08  3:57 ` [PATCH v3 1/6] percpu: fix a comment about the chunks ordering Roman Gushchin
2021-04-16 21:06   ` Dennis Zhou
2021-04-08  3:57 ` [PATCH v3 2/6] percpu: split __pcpu_balance_workfn() Roman Gushchin
2021-04-16 21:06   ` Dennis Zhou
2021-04-08  3:57 ` [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type Roman Gushchin
2021-04-16 21:08   ` Dennis Zhou
2021-04-08  3:57 ` [PATCH v3 4/6] percpu: generalize pcpu_balance_populated() Roman Gushchin
2021-04-16 21:09   ` Dennis Zhou
2021-04-08  3:57 ` [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint() Roman Gushchin
2021-04-16 21:15   ` Dennis Zhou
2021-04-08  3:57 ` [PATCH v3 6/6] percpu: implement partial chunk depopulation Roman Gushchin
2021-04-16 12:56 ` [PATCH v3 0/6] percpu: " Pratik Sampat
2021-04-16 14:18   ` Dennis Zhou
2021-04-16 15:28     ` Pratik Sampat
2021-04-16 17:13       ` Roman Gushchin
2021-04-16 18:27         ` Pratik Sampat
2021-04-16 18:34           ` Roman Gushchin
2021-04-16 18:41             ` Pratik Sampat
2021-04-16 19:09               ` Roman Gushchin
2021-04-16 19:44                 ` Pratik Sampat
2021-04-16 20:03                   ` Roman Gushchin
2021-04-17  7:08                     ` Pratik Sampat
2021-04-16 21:47                   ` Dennis Zhou
2021-04-17  7:14                     ` Pratik Sampat
2021-04-16 16:21     ` Roman Gushchin
