* [PATCH v4 0/4] percpu: partial chunk depopulation
@ 2021-04-19 22:50 Dennis Zhou
  2021-04-19 22:50 ` [PATCH 1/4] percpu: factor out pcpu_check_block_hint() Dennis Zhou
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin
  Cc: linux-mm, linux-kernel, Dennis Zhou

Hello,

This series is a continuation of Roman's series in [1]. It aims to solve
chunks holding onto free pages by adding a reclaim process to the percpu
balance work item.

The main difference is that nr_empty_pop_pages is now managed at the
time of isolation instead of being intermixed. This helps with deciding
which chunks to free, instead of having to interleave returning chunks
to active duty.

The allocation priority is as follows:
  1) appropriate chunk slot increasing until fit
  2) sidelined chunks
  3) full free chunks

The last slot for to_depopulate is never used for allocations.
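
As a rough illustration of the resulting scan order, here is a toy model
in ordinary userspace C (not the kernel code; the slot indices and the
slot_has_fit table are made up for the example):

#include <stdbool.h>
#include <stdio.h>

/* toy layout: slots 0..4 are active, 5 = sidelined, 6 = free, 7 = to_depopulate */
static bool slot_has_fit[8] = { false, false, false, false, false, true, false, false };

/*
 * Scan the active slots upward, then the sidelined slot, then the free
 * slot; slot 7 (to_depopulate) sits past free_slot and is never visited.
 */
static int scan_order(int start_slot, int free_slot)
{
	for (int slot = start_slot; slot <= free_slot; slot++)
		if (slot_has_fit[slot])
			return slot;
	return -1;	/* no fit anywhere: create a new chunk */
}

int main(void)
{
	printf("%d\n", scan_order(2, 6));	/* prints 5: a sidelined chunk is reused */
	return 0;
}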

A big thanks to Roman for initiating the work and being available for
iterating on these ideas.

This patchset contains the following 4 patches:
  0001-percpu-factor-out-pcpu_check_block_hint.patch
  0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
  0003-percpu-implement-partial-chunk-depopulation.patch
  0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch

0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
initially from Roman. 0004 adds a reclaim threshold so we do not need to
schedule the balance work for every page freed.

This series is on top of percpu$for-5.14 67c2669d69fb.

diffstats below:

Dennis Zhou (2):
  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  percpu: use reclaim threshold instead of running for every page

Roman Gushchin (2):
  percpu: factor out pcpu_check_block_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   5 +
 mm/percpu-km.c       |   5 +
 mm/percpu-stats.c    |  20 ++--
 mm/percpu-vm.c       |  30 ++++++
 mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 278 insertions(+), 34 deletions(-)

Thanks,
Dennis

* [PATCH 1/4] percpu: factor out pcpu_check_block_hint()
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
@ 2021-04-19 22:50 ` Dennis Zhou
  2021-04-19 22:50 ` [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 Dennis Zhou
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin
  Cc: linux-mm, linux-kernel, Dennis Zhou

From: Roman Gushchin <guro@fb.com>

Factor out the pcpu_check_block_hint() helper, which will be useful
in the future. The new function checks if the allocation can likely
fit within the contig hint.
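
As a worked example with made-up numbers: if block->contig_hint_start = 5
and align = 4, then ALIGN(5, 4) = 8 and bit_off = 8 - 5 = 3, so a request
of bits = 10 passes the check only if 3 + 10 <= block->contig_hint.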

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 61339b3d9337..5edc7bd88133 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -306,6 +306,25 @@ static unsigned long pcpu_block_off_to_off(int index, int off)
 	return index * PCPU_BITMAP_BLOCK_BITS + off;
 }
 
+/**
+ * pcpu_check_block_hint - check against the contig hint
+ * @block: block of interest
+ * @bits: size of allocation
+ * @align: alignment of area (max PAGE_SIZE)
+ *
+ * Check to see if the allocation can fit in the block's contig hint.
+ * Note, a chunk uses the same hints as a block so this can also check against
+ * the chunk's contig hint.
+ */
+static bool pcpu_check_block_hint(struct pcpu_block_md *block, int bits,
+				  size_t align)
+{
+	int bit_off = ALIGN(block->contig_hint_start, align) -
+		block->contig_hint_start;
+
+	return bit_off + bits <= block->contig_hint;
+}
+
 /*
  * pcpu_next_hint - determine which hint to use
  * @block: block of interest
@@ -1066,14 +1085,11 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
 	int bit_off, bits, next_off;
 
 	/*
-	 * Check to see if the allocation can fit in the chunk's contig hint.
-	 * This is an optimization to prevent scanning by assuming if it
-	 * cannot fit in the global hint, there is memory pressure and creating
-	 * a new chunk would happen soon.
+	 * This is an optimization to prevent scanning by assuming if the
+	 * allocation cannot fit in the global hint, there is memory pressure
+	 * and creating a new chunk would happen soon.
 	 */
-	bit_off = ALIGN(chunk_md->contig_hint_start, align) -
-		  chunk_md->contig_hint_start;
-	if (bit_off + alloc_bits > chunk_md->contig_hint)
+	if (!pcpu_check_block_hint(chunk_md, alloc_bits, align))
 		return -1;
 
 	bit_off = pcpu_next_hint(chunk_md, alloc_bits);
-- 
2.31.1.368.gbe11c130af-goog


* [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
  2021-04-19 22:50 ` [PATCH 1/4] percpu: factor out pcpu_check_block_hint() Dennis Zhou
@ 2021-04-19 22:50 ` Dennis Zhou
  2021-04-20 21:22   ` Roman Gushchin
  2021-04-19 22:50 ` [PATCH 3/4] percpu: implement partial chunk depopulation Dennis Zhou
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin
  Cc: linux-mm, linux-kernel, Dennis Zhou

This prepares for adding a to_depopulate list and sidelined list after
the free slot in the set of lists in pcpu_slot.
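
With the full series applied, the tail of the slot array ends up laid
out as follows, where N = __pcpu_size_to_slot(pcpu_unit_size) (this
patch introduces only pcpu_free_slot = N + 1; the other two slots are
added by patch 0003):

  slots 0..N                       size-ordered active slots
  pcpu_sidelined_slot     = N + 1  isolated, already depopulated chunks
  pcpu_free_slot          = N + 2  fully free chunks
  pcpu_to_depopulate_slot = N + 3  isolated chunks awaiting depopulation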

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 5edc7bd88133..d462222f4adc 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -135,6 +135,7 @@ static int pcpu_unit_size __ro_after_init;
 static int pcpu_nr_units __ro_after_init;
 static int pcpu_atom_size __ro_after_init;
 int pcpu_nr_slots __ro_after_init;
+int pcpu_free_slot __ro_after_init;
 static size_t pcpu_chunk_struct_size __ro_after_init;
 
 /* cpus with the lowest and highest unit addresses */
@@ -237,7 +238,7 @@ static int __pcpu_size_to_slot(int size)
 static int pcpu_size_to_slot(int size)
 {
 	if (size == pcpu_unit_size)
-		return pcpu_nr_slots - 1;
+		return pcpu_free_slot;
 	return __pcpu_size_to_slot(size);
 }
 
@@ -1806,7 +1807,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		goto fail;
 	}
 
-	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
+	if (list_empty(&pcpu_slot[pcpu_free_slot])) {
 		chunk = pcpu_create_chunk(type, pcpu_gfp);
 		if (!chunk) {
 			err = "failed to allocate new chunk";
@@ -1958,7 +1959,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
 	LIST_HEAD(to_free);
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
-	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
+	struct list_head *free_head = &pcpu_slot[pcpu_free_slot];
 	struct pcpu_chunk *chunk, *next;
 
 	/*
@@ -2033,7 +2034,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 				  0, PCPU_EMPTY_POP_PAGES_HIGH);
 	}
 
-	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot <= pcpu_free_slot; slot++) {
 		unsigned int nr_unpop = 0, rs, re;
 
 		if (!nr_to_pop)
@@ -2140,7 +2141,7 @@ void free_percpu(void __percpu *ptr)
 	if (chunk->free_bytes == pcpu_unit_size) {
 		struct pcpu_chunk *pos;
 
-		list_for_each_entry(pos, &pcpu_slot[pcpu_nr_slots - 1], list)
+		list_for_each_entry(pos, &pcpu_slot[pcpu_free_slot], list)
 			if (pos != chunk) {
 				need_balance = true;
 				break;
@@ -2562,7 +2563,8 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * Allocate chunk slots.  The additional last slot is for
 	 * empty chunks.
 	 */
-	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
+	pcpu_free_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_nr_slots = pcpu_free_slot + 1;
 	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
 					  sizeof(pcpu_chunk_lists[0]) *
 					  PCPU_NR_CHUNK_TYPES,
-- 
2.31.1.368.gbe11c130af-goog


* [PATCH 3/4] percpu: implement partial chunk depopulation
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
  2021-04-19 22:50 ` [PATCH 1/4] percpu: factor out pcpu_check_block_hint() Dennis Zhou
  2021-04-19 22:50 ` [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 Dennis Zhou
@ 2021-04-19 22:50 ` Dennis Zhou
  2021-07-02 19:11   ` Guenter Roeck
  2021-04-19 22:50 ` [PATCH 4/4] percpu: use reclaim threshold instead of running for every page Dennis Zhou
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin
  Cc: linux-mm, linux-kernel, Dennis Zhou

From: Roman Gushchin <guro@fb.com>

This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a partially
filled chunk, if a small number of outstanding allocations prevents the
chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk's pages, looks for a range of empty and populated pages,
and performs the depopulation. To avoid races with new allocations,
the chunk is isolated beforehand. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks over sidelined chunks. If a sidelined chunk is used, it
is reintegrated into the active lists.

The depopulation is scheduled on the free path if the chunk is all of
the following:
  1) has more than 1/4 of its total pages free and populated
  2) the system has enough free percpu pages aside from this chunk
  3) isn't the reserved chunk
  4) isn't the first chunk
If a chunk is already depopulated but has gained free populated pages,
it's a good target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from
pcpu_nr_empty_pop_pages. The chunk is moved back to the
to_depopulate_slot whenever it meets these qualifications.

pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction
to keep populated pages close to the beginning of the chunk.
Depopulated chunks are sidelined so that new allocations preferentially
avoid them. When no active chunk can satisfy a new allocation,
sidelined chunks are checked before creating a new chunk.
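
In list terms, the chunk lifecycle implemented below is roughly (a
sketch of the main transitions, not exhaustive):

  active slots --(free path, qualifies)--> pcpu_to_depopulate_slot
  pcpu_to_depopulate_slot --(reclaimed, allocations remain)--> pcpu_sidelined_slot
  pcpu_to_depopulate_slot --(fully free after reclaim)--> free slot (reintegrated)
  pcpu_sidelined_slot --(serves an allocation)--> active slots (reintegrated)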

Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu-internal.h |   4 +
 mm/percpu-km.c       |   5 ++
 mm/percpu-stats.c    |  12 +--
 mm/percpu-vm.c       |  30 ++++++++
 mm/percpu.c          | 180 +++++++++++++++++++++++++++++++++++++++----
 5 files changed, 211 insertions(+), 20 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..10604dce806f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
 	void			*data;		/* chunk data */
 	bool			immutable;	/* no [de]population allowed */
+	bool			isolated;	/* isolated from active chunk
+						   slots */
 	int			start_offset;	/* the overlap with the previous
 						   region to have a page aligned
 						   base_addr */
@@ -87,6 +89,8 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
+extern int pcpu_sidelined_slot;
+extern int pcpu_to_depopulate_slot;
 extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 35c9941077ee..c84a9f781a6c 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -118,3 +118,8 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 
 	return 0;
 }
+
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	return false;
+}
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index f6026dbcdf6b..2125981acfb9 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -219,13 +219,15 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		for (slot = 0; slot < pcpu_nr_slots; slot++) {
 			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
 					    list) {
-				if (chunk == pcpu_first_chunk) {
+				if (chunk == pcpu_first_chunk)
 					seq_puts(m, "Chunk: <- First Chunk\n");
-					chunk_map_stats(m, chunk, buffer);
-				} else {
+				else if (slot == pcpu_to_depopulate_slot)
+					seq_puts(m, "Chunk (to_depopulate)\n");
+				else if (slot == pcpu_sidelined_slot)
+					seq_puts(m, "Chunk (sidelined):\n");
+				else
 					seq_puts(m, "Chunk:\n");
-					chunk_map_stats(m, chunk, buffer);
-				}
+				chunk_map_stats(m, chunk, buffer);
 			}
 		}
 	}
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index e46f7a6917f9..c75f6f24f2d5 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -377,3 +377,33 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 	/* no extra restriction */
 	return 0;
 }
+
+/**
+ * pcpu_should_reclaim_chunk - determine if a chunk should go into reclaim
+ * @chunk: chunk of interest
+ *
+ * This is the entry point for percpu reclaim.  If a chunk qualifies, it is then
+ * isolated and managed in separate lists at the back of pcpu_slot: sidelined
+ * and to_depopulate respectively.  The to_depopulate list holds chunks slated
+ * for depopulation.  They no longer contribute to pcpu_nr_empty_pop_pages once
+ * they are on this list.  Once depopulated, they are moved onto the sidelined
+ * list which enables them to be pulled back in for allocation if no other chunk
+ * can satisfy the allocation.
+ */
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	/* do not reclaim either the first chunk or reserved chunk */
+	if (chunk == pcpu_first_chunk || chunk == pcpu_reserved_chunk)
+		return false;
+
+	/*
+	 * If it is isolated, it may be on the sidelined list so move it back to
+	 * the to_depopulate list.  If at least 1/4 of the pages are empty AND
+	 * there is no system-wide shortage of empty pages aside from this
+	 * chunk, move it to the to_depopulate list.
+	 */
+	return ((chunk->isolated && chunk->nr_empty_pop_pages) ||
+		(pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+		 PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages &&
+		chunk->nr_empty_pop_pages >= chunk->nr_pages / 4));
+}
diff --git a/mm/percpu.c b/mm/percpu.c
index d462222f4adc..79eebc80860d 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -136,6 +136,8 @@ static int pcpu_nr_units __ro_after_init;
 static int pcpu_atom_size __ro_after_init;
 int pcpu_nr_slots __ro_after_init;
 int pcpu_free_slot __ro_after_init;
+int pcpu_sidelined_slot __ro_after_init;
+int pcpu_to_depopulate_slot __ro_after_init;
 static size_t pcpu_chunk_struct_size __ro_after_init;
 
 /* cpus with the lowest and highest unit addresses */
@@ -562,10 +564,41 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 {
 	int nslot = pcpu_chunk_slot(chunk);
 
+	/* leave isolated chunks in-place */
+	if (chunk->isolated)
+		return;
+
 	if (oslot != nslot)
 		__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
 
+static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (!chunk->isolated) {
+		chunk->isolated = true;
+		pcpu_nr_empty_pop_pages[type] -= chunk->nr_empty_pop_pages;
+	}
+	list_move(&chunk->list, &pcpu_slot[pcpu_to_depopulate_slot]);
+}
+
+static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (chunk->isolated) {
+		chunk->isolated = false;
+		pcpu_nr_empty_pop_pages[type] += chunk->nr_empty_pop_pages;
+		pcpu_chunk_relocate(chunk, -1);
+	}
+}
+
 /*
  * pcpu_update_empty_pages - update empty page counters
  * @chunk: chunk of interest
@@ -578,7 +611,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
 {
 	chunk->nr_empty_pop_pages += nr;
-	if (chunk != pcpu_reserved_chunk)
+	if (chunk != pcpu_reserved_chunk && !chunk->isolated)
 		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
@@ -1778,7 +1811,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 restart:
 	/* search through normal chunks */
-	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
 		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
@@ -1789,9 +1822,10 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 			}
 
 			off = pcpu_alloc_area(chunk, bits, bit_align, off);
-			if (off >= 0)
+			if (off >= 0) {
+				pcpu_reintegrate_chunk(chunk);
 				goto area_found;
-
+			}
 		}
 	}
 
@@ -1952,10 +1986,13 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 /**
  * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
+ * @empty_only: free chunks only if there are no populated pages
  *
- * Reclaim all fully free chunks except for the first one.
+ * If empty_only is %false, reclaim all fully free chunks regardless of the
+ * number of populated pages.  Otherwise, only reclaim chunks that have no
+ * populated pages.
  */
-static void pcpu_balance_free(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type, bool empty_only)
 {
 	LIST_HEAD(to_free);
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
@@ -1975,7 +2012,8 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
 			continue;
 
-		list_move(&chunk->list, &to_free);
+		if (!empty_only || chunk->nr_empty_pop_pages == 0)
+			list_move(&chunk->list, &to_free);
 	}
 
 	spin_unlock_irq(&pcpu_lock);
@@ -2083,20 +2121,121 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 	}
 }
 
+/**
+ * pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages
+ * @type: chunk type
+ *
+ * Scan over chunks in the depopulate list and try to release unused populated
+ * pages back to the system.  Depopulated chunks are sidelined to prevent
+ * repopulating these pages unless required.  Fully free chunks are reintegrated
+ * and freed accordingly (1 is kept around).  If we drop below the empty
+ * populated pages threshold, reintegrate the chunk if it has empty free pages.
+ * Each chunk is scanned in the reverse order to keep populated pages close to
+ * the beginning of the chunk.
+ */
+static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
+{
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	struct pcpu_block_md *block;
+	int i, end;
+
+	spin_lock_irq(&pcpu_lock);
+
+restart:
+	/*
+	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
+	 * longer discoverable to allocations which may populate pages.  The only
+	 * other accessor is the free path, which only returns areas back to the
+	 * allocator without touching the populated bitmap.
+	 */
+	while (!list_empty(&pcpu_slot[pcpu_to_depopulate_slot])) {
+		chunk = list_first_entry(&pcpu_slot[pcpu_to_depopulate_slot],
+					 struct pcpu_chunk, list);
+		WARN_ON(chunk->immutable);
+
+		/*
+		 * Scan chunk's pages in the reverse order to keep populated
+		 * pages close to the beginning of the chunk.
+		 */
+		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
+			/* no more work to do */
+			if (chunk->nr_empty_pop_pages == 0)
+				break;
+
+			/* reintegrate chunk to prevent atomic alloc failures */
+			if (pcpu_nr_empty_pop_pages[type] <
+			    PCPU_EMPTY_POP_PAGES_HIGH) {
+				pcpu_reintegrate_chunk(chunk);
+				goto restart;
+			}
+
+			/*
+			 * If the page is empty and populated, start or
+			 * extend the (i, end) range.  If i == 0, decrease
+			 * i and perform the depopulation to cover the last
+			 * (first) page in the chunk.
+			 */
+			block = chunk->md_blocks + i;
+			if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+			    test_bit(i, chunk->populated)) {
+				if (end == -1)
+					end = i;
+				if (i > 0)
+					continue;
+				i--;
+			}
+
+			/* depopulate if there is an active range */
+			if (end == -1)
+				continue;
+
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+
+			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+
+			/* reset the range and continue */
+			end = -1;
+		}
+
+		if (chunk->free_bytes == pcpu_unit_size)
+			pcpu_reintegrate_chunk(chunk);
+		else
+			list_move(&chunk->list,
+				  &pcpu_slot[pcpu_sidelined_slot]);
+	}
+
+	spin_unlock_irq(&pcpu_lock);
+}
+
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
+ * For each chunk type, manage the number of fully free chunks and the number of
+ * populated pages.  An important thing to consider is when pages are freed and
+ * how they contribute to the global counts.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
 	enum pcpu_chunk_type type;
 
+	/*
+	 * pcpu_balance_free() is called twice because the first time we may
+	 * trim pages in the active pcpu_nr_empty_pop_pages which may cause us
+	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
+	 * to move fully free chunks to the active list to be freed if
+	 * appropriate.
+	 */
 	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
 		mutex_lock(&pcpu_alloc_mutex);
-		pcpu_balance_free(type);
+		pcpu_balance_free(type, false);
+		pcpu_reclaim_populated(type);
 		pcpu_balance_populated(type);
+		pcpu_balance_free(type, true);
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
 }
@@ -2137,8 +2276,12 @@ void free_percpu(void __percpu *ptr)
 
 	pcpu_memcg_free_hook(chunk, off, size);
 
-	/* if there are more than one fully free chunks, wake up grim reaper */
-	if (chunk->free_bytes == pcpu_unit_size) {
+	/*
+	 * If there are more than one fully free chunks, wake up grim reaper.
+	 * If the chunk is isolated, it may be in the process of being
+	 * reclaimed.  Let reclaim manage cleaning up of that chunk.
+	 */
+	if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) {
 		struct pcpu_chunk *pos;
 
 		list_for_each_entry(pos, &pcpu_slot[pcpu_free_slot], list)
@@ -2146,6 +2289,9 @@ void free_percpu(void __percpu *ptr)
 				need_balance = true;
 				break;
 			}
+	} else if (pcpu_should_reclaim_chunk(chunk)) {
+		pcpu_isolate_chunk(chunk);
+		need_balance = true;
 	}
 
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
@@ -2560,11 +2706,15 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	pcpu_stats_save_ai(ai);
 
 	/*
-	 * Allocate chunk slots.  The additional last slot is for
-	 * empty chunks.
+	 * Allocate chunk slots.  The slots after the active slots are:
+	 *   sidelined_slot - isolated, depopulated chunks
+	 *   free_slot - fully free chunks
+	 *   to_depopulate_slot - isolated, chunks to depopulate
 	 */
-	pcpu_free_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
-	pcpu_nr_slots = pcpu_free_slot + 1;
+	pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_free_slot = pcpu_sidelined_slot + 1;
+	pcpu_to_depopulate_slot = pcpu_free_slot + 1;
+	pcpu_nr_slots = pcpu_to_depopulate_slot + 1;
 	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
 					  sizeof(pcpu_chunk_lists[0]) *
 					  PCPU_NR_CHUNK_TYPES,
-- 
2.31.1.368.gbe11c130af-goog


* [PATCH 4/4] percpu: use reclaim threshold instead of running for every page
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
                   ` (2 preceding siblings ...)
  2021-04-19 22:50 ` [PATCH 3/4] percpu: implement partial chunk depopulation Dennis Zhou
@ 2021-04-19 22:50 ` Dennis Zhou
  2021-04-20 21:23   ` Roman Gushchin
  2021-04-19 22:54 ` [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin
  Cc: linux-mm, linux-kernel, Dennis Zhou

The previous patch implements reclaim by adding 2 additional lists,
where a chunk's lifecycle is:
  active_slot -> to_depopulate_slot -> sidelined_slot

This worked well because we're able to nicely converge paths into
isolation. However, it's a bit aggressive to run reclaim for every freed
page. Let's accumulate a few free pages before doing this. With that,
the new lifecycle is:
  active_slot -> sidelined_slot -> to_depopulate_slot -> sidelined_slot

The transition from sidelined_slot -> to_depopulate_slot now occurs
when a threshold is met, instead of immediately as before.
pcpu_nr_isolated_empty_pop_pages[] is introduced to aid with this.
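
Concretely, with PCPU_EMPTY_POP_RECLAIM_THRESHOLD = 4 as defined in the
diff below: a free that isolates a chunk now parks it on the
sidelined_slot, and the sidelined chunks are only spliced over to the
to_depopulate_slot (and the balance work scheduled) once the isolated
chunks collectively hold at least 4 empty populated pages, or once an
isolated chunk becomes fully free. Short bursts of frees below the
threshold simply accumulate on the sidelined_slot.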

Suggested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu-internal.h |  1 +
 mm/percpu-stats.c    |  8 ++++++--
 mm/percpu.c          | 44 +++++++++++++++++++++++++++++++++++++-------
 3 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 10604dce806f..b3e43b016276 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -92,6 +92,7 @@ extern int pcpu_nr_slots;
 extern int pcpu_sidelined_slot;
 extern int pcpu_to_depopulate_slot;
 extern int pcpu_nr_empty_pop_pages[];
+extern int pcpu_nr_isolated_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 2125981acfb9..facc804eb86c 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,7 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	int slot, max_nr_alloc;
 	int *buffer;
 	enum pcpu_chunk_type type;
-	int nr_empty_pop_pages;
+	int nr_empty_pop_pages, nr_isolated_empty_pop_pages;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
@@ -167,8 +167,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	}
 
 	nr_empty_pop_pages = 0;
-	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+	nr_isolated_empty_pop_pages = 0;
+	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
 		nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+		nr_isolated_empty_pop_pages += pcpu_nr_isolated_empty_pop_pages[type];
+	}
 
 #define PL(X)								\
 	seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
@@ -202,6 +205,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	PU(min_alloc_size);
 	PU(max_alloc_size);
 	P("empty_pop_pages", nr_empty_pop_pages);
+	P("iso_empty_pop_pages", nr_isolated_empty_pop_pages);
 	seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 79eebc80860d..ba13e683d022 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -110,6 +110,9 @@
 #define PCPU_EMPTY_POP_PAGES_LOW	2
 #define PCPU_EMPTY_POP_PAGES_HIGH	4
 
+/* only schedule reclaim if there are at least N empty pop pages sidelined */
+#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
+
 #ifdef CONFIG_SMP
 /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
 #ifndef __addr_to_pcpu_ptr
@@ -183,6 +186,7 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  * The reserved chunk doesn't contribute to the count.
  */
 int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
+int pcpu_nr_isolated_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
 /*
  * The number of populated pages in use by the allocator, protected by
@@ -582,8 +586,10 @@ static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
 	if (!chunk->isolated) {
 		chunk->isolated = true;
 		pcpu_nr_empty_pop_pages[type] -= chunk->nr_empty_pop_pages;
+		pcpu_nr_isolated_empty_pop_pages[type] +=
+			chunk->nr_empty_pop_pages;
+		list_move(&chunk->list, &pcpu_slot[pcpu_sidelined_slot]);
 	}
-	list_move(&chunk->list, &pcpu_slot[pcpu_to_depopulate_slot]);
 }
 
 static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
@@ -595,6 +601,8 @@ static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
 	if (chunk->isolated) {
 		chunk->isolated = false;
 		pcpu_nr_empty_pop_pages[type] += chunk->nr_empty_pop_pages;
+		pcpu_nr_isolated_empty_pop_pages[type] -=
+			chunk->nr_empty_pop_pages;
 		pcpu_chunk_relocate(chunk, -1);
 	}
 }
@@ -610,9 +618,15 @@ static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
  */
 static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
 {
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+
 	chunk->nr_empty_pop_pages += nr;
-	if (chunk != pcpu_reserved_chunk && !chunk->isolated)
-		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
+	if (chunk != pcpu_reserved_chunk) {
+		if (chunk->isolated)
+			pcpu_nr_isolated_empty_pop_pages[type] += nr;
+		else
+			pcpu_nr_empty_pop_pages[type] += nr;
+	}
 }
 
 /*
@@ -2138,10 +2152,13 @@ static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
 	struct pcpu_chunk *chunk;
 	struct pcpu_block_md *block;
+	LIST_HEAD(to_depopulate);
 	int i, end;
 
 	spin_lock_irq(&pcpu_lock);
 
+	list_splice_init(&pcpu_slot[pcpu_to_depopulate_slot], &to_depopulate);
+
 restart:
 	/*
 	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
@@ -2149,9 +2166,9 @@ static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
 	 * other accessor is the free path which only returns area back to the
 	 * allocator not touching the populated bitmap.
 	 */
-	while (!list_empty(&pcpu_slot[pcpu_to_depopulate_slot])) {
-		chunk = list_first_entry(&pcpu_slot[pcpu_to_depopulate_slot],
-					 struct pcpu_chunk, list);
+	while (!list_empty(&to_depopulate)) {
+		chunk = list_first_entry(&to_depopulate, struct pcpu_chunk,
+					 list);
 		WARN_ON(chunk->immutable);
 
 		/*
@@ -2208,6 +2225,13 @@ static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
 				  &pcpu_slot[pcpu_sidelined_slot]);
 	}
 
+	if (pcpu_nr_isolated_empty_pop_pages[type] >=
+	    PCPU_EMPTY_POP_RECLAIM_THRESHOLD) {
+		list_splice_tail_init(&pcpu_slot[pcpu_sidelined_slot],
+				      &pcpu_slot[pcpu_to_depopulate_slot]);
+		pcpu_schedule_balance_work();
+	}
+
 	spin_unlock_irq(&pcpu_lock);
 }
 
@@ -2291,7 +2315,13 @@ void free_percpu(void __percpu *ptr)
 			}
 	} else if (pcpu_should_reclaim_chunk(chunk)) {
 		pcpu_isolate_chunk(chunk);
-		need_balance = true;
+		if (chunk->free_bytes == pcpu_unit_size ||
+		    pcpu_nr_isolated_empty_pop_pages[pcpu_chunk_type(chunk)] >=
+		    PCPU_EMPTY_POP_RECLAIM_THRESHOLD) {
+			list_splice_tail_init(&pcpu_slot[pcpu_sidelined_slot],
+					      &pcpu_slot[pcpu_to_depopulate_slot]);
+			need_balance = true;
+		}
 	}
 
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
-- 
2.31.1.368.gbe11c130af-goog


* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
                   ` (3 preceding siblings ...)
  2021-04-19 22:50 ` [PATCH 4/4] percpu: use reclaim threshold instead of running for every page Dennis Zhou
@ 2021-04-19 22:54 ` Dennis Zhou
  2021-04-19 22:57 ` Dennis Zhou
  2021-04-21 18:24 ` Dennis Zhou
  6 siblings, 0 replies; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:54 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin; +Cc: linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 

And I forgot to link [1]...
[1] https://lore.kernel.org/lkml/20210408035736.883861-1-guro@fb.com/

> The main difference is that nr_empty_pop_pages is now managed at the
> time of isolation instead of being intermixed. This helps with deciding
> which chunks to free, instead of having to interleave returning chunks
> to active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule the balance work for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis

* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
                   ` (4 preceding siblings ...)
  2021-04-19 22:54 ` [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
@ 2021-04-19 22:57 ` Dennis Zhou
  2021-04-20 11:07   ` Pratik Sampat
  2021-04-21 18:24 ` Dennis Zhou
  6 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-04-19 22:57 UTC (permalink / raw)
  To: Pratik Sampat; +Cc: Roman Gushchin, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 
> The main difference is that nr_empty_pop_pages is now managed at the
> time of isolation instead of being intermixed. This helps with deciding
> which chunks to free, instead of having to interleave returning chunks
> to active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule the balance work for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis

Hello Pratik,

Do you mind testing this series again on POWER9? The base is available
here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14

Thanks,
Dennis

* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-19 22:57 ` Dennis Zhou
@ 2021-04-20 11:07   ` Pratik Sampat
  2021-04-20 14:39     ` Dennis Zhou
  0 siblings, 1 reply; 17+ messages in thread
From: Pratik Sampat @ 2021-04-20 11:07 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Roman Gushchin, linux-mm, linux-kernel


On 20/04/21 4:27 am, Dennis Zhou wrote:
> On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
>> Hello,
>>
>> This series is a continuation of Roman's series in [1]. It aims to solve
>> chunks holding onto free pages by adding a reclaim process to the percpu
>> balance work item.
>>
>> The main difference is that nr_empty_pop_pages is now managed at the
>> time of isolation instead of being intermixed. This helps with deciding
>> which chunks to free, instead of having to interleave returning chunks
>> to active duty.
>>
>> The allocation priority is as follows:
>>    1) appropriate chunk slot increasing until fit
>>    2) sidelined chunks
>>    3) full free chunks
>>
>> The last slot for to_depopulate is never used for allocations.
>>
>> A big thanks to Roman for initiating the work and being available for
>> iterating on these ideas.
>>
>> This patchset contains the following 4 patches:
>>    0001-percpu-factor-out-pcpu_check_block_hint.patch
>>    0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>>    0003-percpu-implement-partial-chunk-depopulation.patch
>>    0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
>>
>> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
>> initially from Roman. 0004 adds a reclaim threshold so we do not need to
>> schedule the balance work for every page freed.
>>
>> This series is on top of percpu$for-5.14 67c2669d69fb.
>>
>> diffstats below:
>>
>> Dennis Zhou (2):
>>    percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>>    percpu: use reclaim threshold instead of running for every page
>>
>> Roman Gushchin (2):
>>    percpu: factor out pcpu_check_block_hint()
>>    percpu: implement partial chunk depopulation
>>
>>   mm/percpu-internal.h |   5 +
>>   mm/percpu-km.c       |   5 +
>>   mm/percpu-stats.c    |  20 ++--
>>   mm/percpu-vm.c       |  30 ++++++
>>   mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>>   5 files changed, 278 insertions(+), 34 deletions(-)
>>
>> Thanks,
>> Dennis
> Hello Pratik,
>
> Do you mind testing this series again on POWER9? The base is available
> here:
> https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
>
> Thanks,
> Dennis

Hello Dennis, I have tested this patchset on POWER9.

I have tried variations of the percpu_test with top-level and nested
cgroup creation, as the test with 1000:10 didn't show any benefits.

The following example shows more consistent benefits with the de-allocation
strategy.
Outer: 1000
Inner: 50
# ./percpu_test.sh
Percpu:             6912 kB
Percpu:           532736 kB
Percpu:           278784 kB

I believe it could be a result of bulk freeing within "free_unref_page_commit",
where pages are only freed if pcp->count >= pcp->high. As POWER has a larger
page size, it would end up creating a smaller number of pages, but with the
effects of fragmentation.

Having said that, the patchset and its behavior do look good to me.

Thanks!
Pratik




* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-20 11:07   ` Pratik Sampat
@ 2021-04-20 14:39     ` Dennis Zhou
  2021-04-20 15:25       ` Pratik Sampat
  0 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-04-20 14:39 UTC (permalink / raw)
  To: Pratik Sampat; +Cc: Roman Gushchin, linux-mm, linux-kernel

On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
> 
> On 20/04/21 4:27 am, Dennis Zhou wrote:
> > On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> > > Hello,
> > > 
> > > This series is a continuation of Roman's series in [1]. It aims to solve
> > > chunks holding onto free pages by adding a reclaim process to the percpu
> > > balance work item.
> > > 
> > > The main difference is that nr_empty_pop_pages is now managed at the
> > > time of isolation instead of being intermixed. This helps with deciding
> > > which chunks to free, instead of having to interleave returning chunks
> > > to active duty.
> > > 
> > > The allocation priority is as follows:
> > >    1) appropriate chunk slot increasing until fit
> > >    2) sidelined chunks
> > >    3) full free chunks
> > > 
> > > The last slot for to_depopulate is never used for allocations.
> > > 
> > > A big thanks to Roman for initiating the work and being available for
> > > iterating on these ideas.
> > > 
> > > This patchset contains the following 4 patches:
> > >    0001-percpu-factor-out-pcpu_check_block_hint.patch
> > >    0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
> > >    0003-percpu-implement-partial-chunk-depopulation.patch
> > >    0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> > > 
> > > 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
> > > initially from Roman. 0004 adds a reclaim threshold so we do not need to
> > > schedule the balance work for every page freed.
> > > 
> > > This series is on top of percpu$for-5.14 67c2669d69fb.
> > > 
> > > diffstats below:
> > > 
> > > Dennis Zhou (2):
> > >    percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
> > >    percpu: use reclaim threshold instead of running for every page
> > > 
> > > Roman Gushchin (2):
> > >    percpu: factor out pcpu_check_block_hint()
> > >    percpu: implement partial chunk depopulation
> > > 
> > >   mm/percpu-internal.h |   5 +
> > >   mm/percpu-km.c       |   5 +
> > >   mm/percpu-stats.c    |  20 ++--
> > >   mm/percpu-vm.c       |  30 ++++++
> > >   mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
> > >   5 files changed, 278 insertions(+), 34 deletions(-)
> > > 
> > > Thanks,
> > > Dennis
> > Hello Pratik,
> > 
> > Do you mind testing this series again on POWER9? The base is available
> > here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
> > 
> > Thanks,
> > Dennis
> 
> Hello Dennis, I have tested this patchset on POWER9.
> 
> I have tried variations of the percpu_test with top-level and nested
> cgroup creation, as the test with 1000:10 didn't show any benefits.

This is most likely because the 1 in every 11 still pins every page
while 1 in 50 does not. Can you try the patch below on top? I think it
may show slightly better perf as well. If it doesn't I'll just drop it.
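
(A rough back-of-envelope, with an assumed object count: if a 64k percpu
page holds on the order of k = 64 allocations spread uniformly, a page
can only be depopulated when none of its allocations is live; with 1 in
11 live that is roughly (10/11)^64 ~= 0.2% of pages, while with 1 in 50
live it is roughly (49/50)^64 ~= 27%.)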

> 
> The following example shows more consistent benefits with the de-allocation
> strategy.
> Outer: 1000
> Inner: 50
> # ./percpu_test.sh
> Percpu:             6912 kB
> Percpu:           532736 kB
> Percpu:           278784 kB
> 
> I believe it could be a result of bulk freeing within "free_unref_page_commit",
> where pages are only freed if pcp->count >= pcp->high. As POWER has a larger
> page size, it would end up creating a smaller number of pages, but with the
> effects of fragmentation.

This is unrelated to per cpu pages in slab/slub. Percpu is a separate
memory allocator.

> 
> Having said that, the patchset and its behavior do look good to me.

Thanks, can I throw the following on the appropriate patches? In the
future it's good to be explicit about this because some prefer to credit
different emails.

Tested-by: Pratik Sampat <psampat@linux.ibm.com>

Thanks,
Dennis

The following may do a little better on power9:
---
From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
From: Dennis Zhou <dennis@kernel.org>
Date: Tue, 20 Apr 2021 14:25:20 +0000
Subject: [PATCH] percpu: convert free page float to bytes

The percpu memory allocator keeps around a minimum number of free pages
to ensure we can satisfy atomic allocations. However, we've always kept
this number in terms of pages. On certain architectures like arm and
powerpc, the default page size could be 64k instead of 4k. So, start
with a target number of free bytes and then convert to pages.
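
Working through the new macros: with 4k pages they reduce to the old
constants, LOW = max(SZ_8K / 4K, 1) = 2 and HIGH = max(SZ_16K / 4K,
LOW + 1) = 4. With 64k pages both divisions truncate to 0, so the
fallbacks give LOW = 1 and HIGH = 2, and the reclaim threshold follows
HIGH.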

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index ba13e683d022..287fe3091244 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -80,6 +80,7 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/pfn.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
@@ -107,11 +108,12 @@
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD	3
 
-#define PCPU_EMPTY_POP_PAGES_LOW	2
-#define PCPU_EMPTY_POP_PAGES_HIGH	4
+#define PCPU_EMPTY_POP_PAGES_LOW	(max_t(int, (SZ_8K) / PAGE_SIZE, 1))
+#define PCPU_EMPTY_POP_PAGES_HIGH	(max_t(int, (SZ_16K) / PAGE_SIZE, \
+					       PCPU_EMPTY_POP_PAGES_LOW + 1))
 
 /* only schedule reclaim if there are at least N empty pop pages sidelined */
-#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
+#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	(PCPU_EMPTY_POP_PAGES_HIGH)
 
 #ifdef CONFIG_SMP
 /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
-- 
2.31.1.368.gbe11c130af-goog


* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-20 14:39     ` Dennis Zhou
@ 2021-04-20 15:25       ` Pratik Sampat
  0 siblings, 0 replies; 17+ messages in thread
From: Pratik Sampat @ 2021-04-20 15:25 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Roman Gushchin, linux-mm, linux-kernel

Hello Dennis,


On 20/04/21 8:09 pm, Dennis Zhou wrote:
> On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
>> On 20/04/21 4:27 am, Dennis Zhou wrote:
>>> On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
>>>> Hello,
>>>>
>>>> This series is a continuation of Roman's series in [1]. It aims to solve
>>>> chunks holding onto free pages by adding a reclaim process to the percpu
>>>> balance work item.
>>>>
>>>> The main difference is that nr_empty_pop_pages is now managed at the
>>>> time of isolation instead of being intermixed. This helps with deciding
>>>> which chunks to free, instead of having to interleave returning chunks
>>>> to active duty.
>>>>
>>>> The allocation priority is as follows:
>>>>     1) appropriate chunk slot increasing until fit
>>>>     2) sidelined chunks
>>>>     3) full free chunks
>>>>
>>>> The last slot for to_depopulate is never used for allocations.
>>>>
>>>> A big thanks to Roman for initiating the work and being available for
>>>> iterating on these ideas.
>>>>
>>>> This patchset contains the following 4 patches:
>>>>     0001-percpu-factor-out-pcpu_check_block_hint.patch
>>>>     0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>>>>     0003-percpu-implement-partial-chunk-depopulation.patch
>>>>     0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
>>>>
>>>> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
>>>> initially from Roman. 0004 adds a reclaim threshold so we do not need to
>>>> schedule the balance work for every page freed.
>>>>
>>>> This series is on top of percpu$for-5.14 67c2669d69fb.
>>>>
>>>> diffstats below:
>>>>
>>>> Dennis Zhou (2):
>>>>     percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>>>>     percpu: use reclaim threshold instead of running for every page
>>>>
>>>> Roman Gushchin (2):
>>>>     percpu: factor out pcpu_check_block_hint()
>>>>     percpu: implement partial chunk depopulation
>>>>
>>>>    mm/percpu-internal.h |   5 +
>>>>    mm/percpu-km.c       |   5 +
>>>>    mm/percpu-stats.c    |  20 ++--
>>>>    mm/percpu-vm.c       |  30 ++++++
>>>>    mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>>>>    5 files changed, 278 insertions(+), 34 deletions(-)
>>>>
>>>> Thanks,
>>>> Dennis
>>> Hello Pratik,
>>>
>>> Do you mind testing this series again on POWER9? The base is available
>>> here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
>>>
>>> Thanks,
>>> Dennis
>> Hello Dennis, I have tested this patchset on POWER9.
>>
>> I have tried variations of the percpu_test with top-level and nested
>> cgroup creation, as the test with 1000:10 didn't show any benefits.
> This is most likely because the 1 in every 11 still pins every page
> while 1 in 50 does not. Can you try the patch below on top? I think it
> may show slightly better perf as well. If it doesn't I'll just drop it.

I did try it out, although my test only varied the number of inner
cgroup folders; it didn't seem to show any benefit over the previous
test without the patch, in terms of being able to spawn as few memory
cgroup folders and seeing partial memory depopulation.

>> The following example shows more consistent benefits with the de-allocation
>> strategy.
>> Outer: 1000
>> Inner: 50
>> # ./percpu_test.sh
>> Percpu:             6912 kB
>> Percpu:           532736 kB
>> Percpu:           278784 kB
>>
>> I believe it could be a result of bulk freeing within "free_unref_page_commit",
>> where pages are only freed if pcp->count >= pcp->high. As POWER has a larger
>> page size, it would end up creating a smaller number of pages, but with the
>> effects of fragmentation.
> This is unrelated to per cpu pages in slab/slub. Percpu is a separate
> memory allocator.
>
You're right, I was actually referencing incorrect code.

>> Having said that, the patchset and its behavior do look good to me.
> Thanks, can I throw the following on the appropriate patches? In the
> future it's good to be explicit about this because some prefer to credit
> different emails.
>
> Tested-by: Pratik Sampat <psampat@linux.ibm.com>

Sure thing, please feel free to add a Tested-by wherever you feel appropriate.
I'll be more explicit about them in the future. Thanks!

> Thanks,
> Dennis
>
> The following may do a little better on power9:
> ---
>  From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
> From: Dennis Zhou <dennis@kernel.org>
> Date: Tue, 20 Apr 2021 14:25:20 +0000
> Subject: [PATCH] percpu: convert free page float to bytes
>
> The percpu memory allocator keeps around a minimum number of free pages
> to ensure we can satisfy atomic allocations. However, we've always kept
> this number in terms of pages. On certain architectures like arm and
> powerpc, the default page size could be 64k instead of 4k. So, start
> with a target number of free bytes and then convert to pages.
>
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>   mm/percpu.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index ba13e683d022..287fe3091244 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -80,6 +80,7 @@
>   #include <linux/mutex.h>
>   #include <linux/percpu.h>
>   #include <linux/pfn.h>
> +#include <linux/sizes.h>
>   #include <linux/slab.h>
>   #include <linux/spinlock.h>
>   #include <linux/vmalloc.h>
> @@ -107,11 +108,12 @@
>   /* chunks in slots below this are subject to being sidelined on failed alloc */
>   #define PCPU_SLOT_FAIL_THRESHOLD	3
>   
> -#define PCPU_EMPTY_POP_PAGES_LOW	2
> -#define PCPU_EMPTY_POP_PAGES_HIGH	4
> +#define PCPU_EMPTY_POP_PAGES_LOW	(max_t(int, (SZ_8K) / PAGE_SIZE, 1))
> +#define PCPU_EMPTY_POP_PAGES_HIGH	(max_t(int, (SZ_16K) / PAGE_SIZE, \
> +					       PCPU_EMPTY_POP_PAGES_LOW + 1))
>   
>   /* only schedule reclaim if there are at least N empty pop pages sidelined */
> -#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
> +#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	(PCPU_EMPTY_POP_PAGES_HIGH)
>   
>   #ifdef CONFIG_SMP
>   /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */

Thanks,
Pratik


* Re: [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  2021-04-19 22:50 ` [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 Dennis Zhou
@ 2021-04-20 21:22   ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-04-20 21:22 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Tejun Heo, Christoph Lameter, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 10:50:45PM +0000, Dennis Zhou wrote:
> This prepares for adding a to_depopulate list and sidelined list after
> the free slot in the set of lists in pcpu_slot.
> 
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Roman Gushchin <guro@fb.com>

* Re: [PATCH 4/4] percpu: use reclaim threshold instead of running for every page
  2021-04-19 22:50 ` [PATCH 4/4] percpu: use reclaim threshold instead of running for every page Dennis Zhou
@ 2021-04-20 21:23   ` Roman Gushchin
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Gushchin @ 2021-04-20 21:23 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Tejun Heo, Christoph Lameter, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 10:50:47PM +0000, Dennis Zhou wrote:
> The previous patch implements reclaim by adding 2 additional lists,
> where a chunk's lifecycle is:
>   active_slot -> to_depopulate_slot -> sidelined_slot
> 
> This worked well because we're able to nicely converge paths into
> isolation. However, it's a bit aggressive to run reclaim for every freed
> page. Let's accumulate a few free pages before doing this. With that,
> the new lifecycle is:
>   active_slot -> sidelined_slot -> to_depopulate_slot -> sidelined_slot
> 
> The transition from sidelined_slot -> to_depopulate_slot now occurs
> when a threshold is met, instead of immediately as before.
> pcpu_nr_isolated_empty_pop_pages[] is introduced to aid with this.
> 
> Suggested-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks, Dennis!

* Re: [PATCH v4 0/4] percpu: partial chunk depopulation
  2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
                   ` (5 preceding siblings ...)
  2021-04-19 22:57 ` Dennis Zhou
@ 2021-04-21 18:24 ` Dennis Zhou
  6 siblings, 0 replies; 17+ messages in thread
From: Dennis Zhou @ 2021-04-21 18:24 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Roman Gushchin; +Cc: linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 
> The main difference is that nr_empty_pop_pages is now managed at the
> time of isolation instead of being intermixed. This helps with deciding
> which chunks to free, instead of having to interleave returning chunks
> to active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule the balance work for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis

I've applied this to for-5.14 with one modification to the batching
threshold: 4 pages is quite small, so I've made it 16.
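
Concretely, that amounts to a one-line bump along these lines (a
sketch; the constant name is a hypothetical stand-in for whatever the
applied commit actually uses):

-#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
+#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	16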

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] percpu: implement partial chunk depopulation
  2021-04-19 22:50 ` [PATCH 3/4] percpu: implement partial chunk depopulation Dennis Zhou
@ 2021-07-02 19:11   ` Guenter Roeck
  2021-07-02 19:45     ` Dennis Zhou
  0 siblings, 1 reply; 17+ messages in thread
From: Guenter Roeck @ 2021-07-02 19:11 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Roman Gushchin, linux-mm, linux-kernel

Hi,

On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> From: Roman Gushchin <guro@fb.com>
> 
> This patch implements partial depopulation of percpu chunks.
> 
> As of now, a chunk can be depopulated only as a part of the final
> destruction, if there are no more outstanding allocations. However
> to minimize a memory waste it might be useful to depopulate a
> partially filed chunk, if a small number of outstanding allocations
> prevents the chunk from being fully reclaimed.
> 
> This patch implements the following depopulation process: it scans
> over the chunk pages, looks for a range of empty and populated pages
> and performs the depopulation. To avoid races with new allocations,
> the chunk is isolated beforehand. After the depopulation the chunk is
> sidelined to a special list or freed. New allocations prefer using
> active chunks to sidelined chunks. If a sidelined chunk is used, it is
> reintegrated to the active lists.
> 
> The depopulation is scheduled on the free path if the chunk meets all
> of the following:
>   1) has more than 1/4 of total pages free and populated
>   2) the system has enough free percpu pages aside of this chunk
>   3) isn't the reserved chunk
>   4) isn't the first chunk
> If it's already depopulated but got free populated pages, it's a good
> target too. The chunk is moved to a special slot,
> pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
> item is scheduled. On isolation, these pages are removed from the
> pcpu_nr_empty_pop_pages count. It is moved back to the
> to_depopulate_slot whenever it meets these qualifications.
> 
> pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
> becomes empty. The depopulation is performed in the reverse direction to
> keep populated pages close to the beginning. Depopulated chunks are
> sidelined so that new allocations preferentially avoid them. When no
> active chunk can satisfy a new allocation, sidelined chunks are
> checked before a new chunk is created.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Co-developed-by: Dennis Zhou <dennis@kernel.org>
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
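
For reference, the eligibility test described above boils down to
roughly the sketch below (the helper name is hypothetical and the
field names approximate; see mm/percpu.c for the real code):

static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
{
	/* criteria 3 and 4: never the reserved or first chunk */
	if (chunk == pcpu_reserved_chunk || chunk == pcpu_first_chunk)
		return false;
	/* an already-depopulated chunk with free populated pages qualifies */
	if (chunk->depopulated && chunk->nr_empty_pop_pages)
		return true;
	/* criterion 1: over 1/4 of total pages free and populated;
	 * criterion 2: the rest of the system keeps enough populated pages
	 */
	return chunk->nr_empty_pop_pages > chunk->nr_pages / 4 &&
	       pcpu_nr_empty_pop_pages > PCPU_EMPTY_POP_PAGES_HIGH;
}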

This patch results in a number of crashes and other odd behavior
when trying to boot mips images from Megasas controllers in qemu.
Sometimes the boot stalls, but I also see various crashes.
Some examples and bisect logs are attached.

Note: Bisect on mainline ended with

# first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

I then checked out the merge branch and ran a bisect there, which
points to this commit. I also rebased the merge branch to v5.13
and bisected again. Bisect results were the same.

Guenter

---
...
sd 0:2:0:0: [sda] Add. Sense: Internal target failure
CPU 0 Unable to handle kernel paging request at virtual address 00000004, epc == 805cf8fc, ra == 802ff3b0
Oops[#1]:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
$ 0   : 00000000 00000001 00000000 8258fc90
$ 4   : 825dbd40 820e7624 00000000 fffffff0
$ 8   : 80c70000 805e1a64 fffffffc 00000000
$12   : 81006d00 0000001f ffffffe0 00001e83
$16   : 00000000 825dbd30 80c70000 820e75f8
$20   : 8275c584 80cc4418 80c9409c 00000008
$24   : 0000004c 00000000
$28   : 8204c000 8204fc70 80c26c54 802ff3b0
Hi    : 0000004c
Lo    : 00000000
epc   : 805cf8fc rb_insert_color+0x1c/0x1e0
ra    : 802ff3b0 kernfs_link_sibling+0x94/0x120
Status: 1000a403	KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 00000004
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 820e75f8 820e75f8 820e75f8 00000000 8275c584 fffffffe 825dbd30 8030084c
        820e75f8 803003f8 00000000 db668853 00000000 801655f4 00000000 00000000
        00000001 825dbd30 820e75f8 820e75f8 00000000 80300970 81006c80 82150fc0
        8204fd64 00000001 00000000 00000001 00000000 00000000 82150fc0 8275c580
        80c50000 80303dc8 82150fc0 8015ab94 81006c80 8015a960 00000000 8275c580
        ...
Call Trace:
[<805cf8fc>] rb_insert_color+0x1c/0x1e0
[<802ff3b0>] kernfs_link_sibling+0x94/0x120
[<8030084c>] kernfs_add_one+0xb8/0x184
[<80300970>] kernfs_create_dir_ns+0x58/0xb0
[<80303dc8>] sysfs_create_dir_ns+0x74/0x108
[<805ca51c>] kobject_add_internal+0xb4/0x364
[<805caaa0>] kobject_init_and_add+0x64/0xa8
[<8066f768>] bus_add_driver+0x98/0x230
[<806715a0>] driver_register+0x80/0x144
[<807c17b8>] usb_register_driver+0xa8/0x1c0
[<80cb89b8>] uas_init+0x44/0x78
[<8010065c>] do_one_initcall+0x50/0x1d4
[<80c95014>] kernel_init_freeable+0x20c/0x29c
[<80a66bd4>] kernel_init+0x14/0x118
[<80103098>] ret_from_kernel_thread+0x14/0x1c

Code: 30460001  14c00016  240afffc <8c460004> 34480001  10c30028  00404825  10c00012  00000000

---[ end trace bb7aba36814796cb ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

---
scsi host0: Avago SAS based MegaRAID driver
ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
random: fast init done
scsi 0:2:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
scsi host1: ata_piix
BUG: spinlock bad magic on CPU#0, kworker/u2:1/41
 lock: 0x82598a50, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
Workqueue: events_unbound async_run_entry_fn
Stack : 822839e4 80c4eee3 80c50000 80a54338 80c80000 801865e0 00000000 00000004
        822839e4 5e03e26b 80c50000 8014b4c8 80c50000 00000001 822839e0 8207e4c0
        00000000 00000000 80b7b9ac 82283828 00000001 8228383c 00000000 0000ffff
        00000008 00000007 00000280 822b7c00 80c50000 80c70000 00000000 80b80000
        00000003 00000000 80c50000 00000012 00000000 806591f8 00000000 80cf0000
        ...
Call Trace:
[<80109adc>] show_stack+0x84/0x11c
[<80a62b1c>] dump_stack+0xa8/0xe4
[<80181468>] do_raw_spin_lock+0xb0/0x128
[<80a70170>] _raw_spin_lock_irqsave+0x28/0x3c
[<80176640>] __wake_up_common_lock+0x68/0xe8
[<801766d4>] __wake_up+0x14/0x20
[<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
[<80526d2c>] blk_freeze_queue_start+0x58/0x94
[<8051af0c>] blk_set_queue_dying+0x2c/0x60
[<8051afb4>] blk_cleanup_queue+0x40/0x130
[<806975b4>] __scsi_remove_device+0xd4/0x168
[<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
[<806944c4>] __scsi_scan_target+0x158/0x754
[<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
[<806950c4>] do_scsi_scan_host+0xac/0xb4
[<806952f8>] do_scan_async+0x30/0x228
[<8015510c>] async_run_entry_fn+0x40/0x100
[<80148384>] process_one_work+0x170/0x428
[<80148be0>] worker_thread+0x188/0x578
[<80150d9c>] kthread+0x130/0x160
[<80103098>] ret_from_kernel_thread+0x14/0x1c

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 801764a0, ra == 80176664
Oops[#1]:
CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
Workqueue: events_unbound async_run_entry_fn
$ 0   : 00000000 00000001 00000000 00000000
$ 4   : 82283b14 00000003 00000000 00000000
$ 8   : 00000001 822837ac 00000000 0000ffff
$12   : 00000008 00000007 00000280 822b7c00
$16   : 82598a50 00000000 82283b08 00000003
$20   : 00000000 00000000 00000000 fffffff4
$24   : 00000000 806591f8
$28   : 82280000 82283ab8 82598a60 80176664
Hi    : 000000a7
Lo    : 3333335d
epc   : 801764a0 __wake_up_common+0x6c/0x1a4
ra    : 80176664 __wake_up_common_lock+0x8c/0xe8
Status: 1000a402	KERNEL EXL
Cause : 40808008 (ExcCode 02)
BadVA : 00000000
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process kworker/u2:1 (pid: 41, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 82716400 82283c98 8246a880 8052cfe8 82598a50 00000000 00000000 00000000
        00000003 00000000 80c50000 00000012 80c80000 80176664 00000000 82126c80
        801762ec 82283afc 00000000 82283b08 00000000 00000000 00000000 82283b14
        82283b14 5e03e26b 825985c8 80c70000 00000001 00000000 80d10000 00000024
        00000003 801766d4 00000001 825985c8 80c70000 00000001 00000000 80d10000
        ...
Call Trace:
[<801764a0>] __wake_up_common+0x6c/0x1a4
[<80176664>] __wake_up_common_lock+0x8c/0xe8
[<801766d4>] __wake_up+0x14/0x20
[<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
[<80526d2c>] blk_freeze_queue_start+0x58/0x94
[<8051af0c>] blk_set_queue_dying+0x2c/0x60
[<8051afb4>] blk_cleanup_queue+0x40/0x130
[<806975b4>] __scsi_remove_device+0xd4/0x168
[<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
[<806944c4>] __scsi_scan_target+0x158/0x754
[<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
[<806950c4>] do_scsi_scan_host+0xac/0xb4
[<806952f8>] do_scan_async+0x30/0x228
[<8015510c>] async_run_entry_fn+0x40/0x100
[<80148384>] process_one_work+0x170/0x428
[<80148be0>] worker_thread+0x188/0x578
[<80150d9c>] kthread+0x130/0x160
[<80103098>] ret_from_kernel_thread+0x14/0x1c

---

megaraid_sas 0000:00:14.0: Max firmware commands: 1007 shared with default hw_queues = 1 poll_queues 0
scsi host0: Avago SAS based MegaRAID driver
ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
scsi 0:2:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
scsi host1: ata_piix
scsi host2: ata_piix
CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 00000000, ra == 8019d0b4
Oops[#1]:
CPU: 0 PID: 40 Comm: kworker/u2:1 Not tainted 5.13.0-07637-g3dbdb38e2869 #1
Workqueue: events_unbound async_run_entry_fn
$ 0   : 00000000 00000001 82568620 00000000
$ 4   : 82568620 00000200 8019d0b4 8212d580
$ 8   : ffffffe0 000003fc 00000000 81006d70
$12   : 81006d40 0000020c 00000000 80ab4400
$16   : 81007480 00000008 8201ff00 0000000a
$20   : 00000000 810074bc 80ced800 80cd0000
$24   : 000b0f1b 00000739
$28   : 82298000 8201fee8 8019d2f8 8019d0b4
Hi    : 00003f05
Lo    : 0000000f
epc   : 00000000 0x0
ra    : 8019d0b4 rcu_core+0x260/0x754
Status: 1000a403	KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 00000000
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process kworker/u2:1 (pid: 40, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 00000000 8018b544 ffffffc8 ffffffc8 80bde598 00000000 00000000 8201ff00
        00000048 2942dcc1 80ccc2c8 80cb8080 80d68358 0000000a 00000024 00000009
        00000100 80cb80a4 00000000 80aaac38 80cd0000 80cba400 80cba400 80191214
        00014680 2942dcc1 80cf9980 80ab3ce0 80bd9020 80d682f4 80d6e880 80d6e880
        ffff8fcf 80cd0000 80ab0000 04208060 80ccc2c8 00000001 00000020 80da0000
        ...
Call Trace:

[<8018b544>] __handle_irq_event_percpu+0xbc/0x184
[<80aaac38>] __do_softirq+0x190/0x33c
[<80191214>] handle_level_irq+0x130/0x1e8
[<80132fb8>] irq_exit+0x130/0x138
[<806112d0>] plat_irq_dispatch+0x9c/0x118
[<80103404>] handle_int+0x144/0x150

Code: (Bad address in epc)


---[ end trace 5d4c5bf55a0bb13f ]---
Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

---
Bisect on mainline:

# bad: [3dbdb38e286903ec220aaf1fb29a8d94297da246] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
# good: [007b350a58754a93ca9fe50c498cc27780171153] Merge tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
git bisect start '3dbdb38e2869' '007b350a5875'
# good: [b6df00789e2831fff7a2c65aa7164b2a4dcbe599] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good b6df00789e2831fff7a2c65aa7164b2a4dcbe599
# good: [990ec3014deedfed49e610cdc31dc6930ca63d8d] drm/amdgpu: add psp runtime db structures
git bisect good 990ec3014deedfed49e610cdc31dc6930ca63d8d
# good: [c288d9cd710433e5991d58a0764c4d08a933b871] Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block
git bisect good c288d9cd710433e5991d58a0764c4d08a933b871
# good: [514798d36572fb8eba6ccff3de10c9615063a7f5] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 514798d36572fb8eba6ccff3de10c9615063a7f5
# good: [630e438f040c3838206b5e6717b9b5c29edf3548] RDMA/rtrs: Introduce head/tail wr
git bisect good 630e438f040c3838206b5e6717b9b5c29edf3548
# good: [a32b344e6f4375c5bdc3e89d0997b7eae187a3b1] Merge tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect good a32b344e6f4375c5bdc3e89d0997b7eae187a3b1
# good: [cad065ed8d8831df67b9754cc4437ed55d8b48c0] MIPS: MT extensions are not available on MIPS32r1
git bisect good cad065ed8d8831df67b9754cc4437ed55d8b48c0
# good: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
git bisect good e4d777003a43feab2e000749163e531f6c48c385
# bad: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
git bisect bad e267992f9ef0bf717d70a9ee18049782f77e4b3a
# good: [ab3040e1379bd6fcc260f1f7558ee9c2da62766b] MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.
git bisect good ab3040e1379bd6fcc260f1f7558ee9c2da62766b
# good: [34c522a07ccbfb0e6476713b41a09f9f51a06c9f] MIPS: CI20: Add second percpu timer for SMP.
git bisect good 34c522a07ccbfb0e6476713b41a09f9f51a06c9f
# good: [19b438592238b3b40c3f945bb5f9c4ca971c0c45] Merge tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
git bisect good 19b438592238b3b40c3f945bb5f9c4ca971c0c45
# first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

---
Bisect on merge branch:

# bad: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
# good: [d434405aaab7d0ebc516b68a8fc4100922d7f5ef] Linux 5.12-rc7
git bisect start 'HEAD' 'v5.12-rc7'
# bad: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation
git bisect bad f183324133ea535db4127f9fad3e19725ca88bf3
# good: [67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5] percpu: split __pcpu_balance_workfn()
git bisect good 67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5
# good: [1c29a3ceaf5f02919e0a89119a70382581453dbb] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
git bisect good 1c29a3ceaf5f02919e0a89119a70382581453dbb
# first bad commit: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation

---
Bisect on rebased merge branch:

# bad: [737dc4074d4969ee54d7f781591bcc608fc6990f] percpu: optimize locking in pcpu_balance_workfn()
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation
git bisect bad 0bd2212ebd7a02a6c0e870bb4b35abc321c203bc
# good: [a7aebdb482a3aa87a61f6414a87f31eb657c41f6] percpu: split __pcpu_balance_workfn()
git bisect good a7aebdb482a3aa87a61f6414a87f31eb657c41f6
# good: [123a0c4318bb8cfb984f41c0499064c383dd9eee] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
git bisect good 123a0c4318bb8cfb984f41c0499064c383dd9eee
# first bad commit: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] percpu: implement partial chunk depopulation
  2021-07-02 19:11   ` Guenter Roeck
@ 2021-07-02 19:45     ` Dennis Zhou
  2021-07-02 20:28       ` Guenter Roeck
  0 siblings, 1 reply; 17+ messages in thread
From: Dennis Zhou @ 2021-07-02 19:45 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Tejun Heo, Christoph Lameter, Roman Gushchin, linux-mm, linux-kernel

Hello,

On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
> Hi,
> 
> On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> > [ commit message snipped ]
> 
> This patch results in a number of crashes and other odd behavior
> when trying to boot mips images from Megasas controllers in qemu.
> Sometimes the boot stalls, but I also see various crashes.
> Some examples and bisect logs are attached.

Ah, this doesn't look good.. Do you have a reproducer I could use to
debug this?

Thanks,
Dennis

> [ crash logs and bisect logs snipped ]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] percpu: implement partial chunk depopulation
  2021-07-02 19:45     ` Dennis Zhou
@ 2021-07-02 20:28       ` Guenter Roeck
  2021-07-02 21:00         ` Dennis Zhou
  0 siblings, 1 reply; 17+ messages in thread
From: Guenter Roeck @ 2021-07-02 20:28 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, Roman Gushchin, linux-mm, linux-kernel

On 7/2/21 12:45 PM, Dennis Zhou wrote:
> Hello,
> 
> On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
>> Hi,
>>
>> On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
>>> [ commit message snipped ]
>>
>> This patch results in a number of crashes and other odd behavior
>> when trying to boot mips images from Megasas controllers in qemu.
>> Sometimes the boot stalls, but I also see various crashes.
>> Some examples and bisect logs are attached.
> 
> Ah, this doesn't look good.. Do you have a reproducer I could use to
> debug this?
> 

I copied the relevant information to http://server.roeck-us.net/qemu/mips/.

run.sh - qemu command (I tried with qemu 6.0 and 4.2.1)
rootfs.ext2 - root file system
config - complete configuration
defconfig - shortened configuration
vmlinux - a crashing kernel image (v5.13-7637-g3dbdb38e2869, with above configuration)

Interestingly, the crash doesn't always happen at the same location, even
with the same image. Some memory corruption, maybe?

Hope this helps. Please let me know if I can provide anything else.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] percpu: implement partial chunk depopulation
  2021-07-02 20:28       ` Guenter Roeck
@ 2021-07-02 21:00         ` Dennis Zhou
  0 siblings, 0 replies; 17+ messages in thread
From: Dennis Zhou @ 2021-07-02 21:00 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Tejun Heo, Christoph Lameter, Roman Gushchin, linux-mm, linux-kernel

On Fri, Jul 02, 2021 at 01:28:18PM -0700, Guenter Roeck wrote:
> On 7/2/21 12:45 PM, Dennis Zhou wrote:
> > Hello,
> > 
> > On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
> > > Hi,
> > > 
> > > On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> > > > [ commit message snipped ]
> > > 
> > > This patch results in a number of crashes and other odd behavior
> > > when trying to boot mips images from Megasas controllers in qemu.
> > > Sometimes the boot stalls, but I also see various crashes.
> > > Some examples and bisect logs are attached.
> > 
> > Ah, this doesn't look good.. Do you have a reproducer I could use to
> > debug this?
> > 
> 
> I copied the relevant information to http://server.roeck-us.net/qemu/mips/.
> 

This is perfect! I'm able to reproduce it.

> run.sh - qemu command (I tried with qemu 6.0 and 4.2.1)
> rootfs.ext2 - root file system
> config - complete configuration
> defconfig - shortened configuration
> vmlinux - a crashing kernel image (v5.13-7637-g3dbdb38e2869, with above configuration)
> 
> Interestingly, the crash doesn't always happen at the same location, even
> with the same image. Some memory corruption, maybe?
> 

Well, a few factors matter: percpu memory gets placed in random places,
and percpu allocations may happen in a different order, which causes
different freeing patterns. The problem patch may then free the wrong
backing page.

I'm working on it. x86 doesn't seem to have any immediate issues
(fingers crossed), so it must be some delta here.

> Hope this helps. Please let me know if I can provide anything else.
> 
> Thanks,
> Guenter

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-07-02 21:00 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-19 22:50 [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
2021-04-19 22:50 ` [PATCH 1/4] percpu: factor out pcpu_check_block_hint() Dennis Zhou
2021-04-19 22:50 ` [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 Dennis Zhou
2021-04-20 21:22   ` Roman Gushchin
2021-04-19 22:50 ` [PATCH 3/4] percpu: implement partial chunk depopulation Dennis Zhou
2021-07-02 19:11   ` Guenter Roeck
2021-07-02 19:45     ` Dennis Zhou
2021-07-02 20:28       ` Guenter Roeck
2021-07-02 21:00         ` Dennis Zhou
2021-04-19 22:50 ` [PATCH 4/4] percpu: use reclaim threshold instead of running for every page Dennis Zhou
2021-04-20 21:23   ` Roman Gushchin
2021-04-19 22:54 ` [PATCH v4 0/4] percpu: partial chunk depopulation Dennis Zhou
2021-04-19 22:57 ` Dennis Zhou
2021-04-20 11:07   ` Pratik Sampat
2021-04-20 14:39     ` Dennis Zhou
2021-04-20 15:25       ` Pratik Sampat
2021-04-21 18:24 ` Dennis Zhou
